Parsing Precinct Election Results PDFs Using LLMs

Derek Willis, OpenElections

View sample PDFs →

What We Tested

For Texas and Mississippi, we tested Claude Haiku 4.5, Claude Sonnet 4.5, Gemini 3 Flash, Gemini 2.5 Pro, and Gemini 3 Pro.

For Pennsylvania, we used Claude Sonnet 4.5 to write a custom Python parser.

State | Sample Size | Baseline (Reference Data)
Texas | 8 counties from 2024 general | Web UI LLM OCR + Python parsers
Mississippi | 9 counties | OCR + manual data manipulation
Pennsylvania | Multiple counties from 2024 and 2025 | Custom Python parsers (Electionware)

We're comparing LLM extraction to verified data extracted last year, not raw PDFs. These reference files were validated through multiple methods. We're measuring: can LLMs match results that took us weeks to produce?
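
A rough sketch of the kind of comparison we ran, assuming both the reference file and the LLM output are CSVs with precinct, office, candidate, and votes columns (the file paths and column names here are illustrative, not OpenElections' exact schema):

```python
import pandas as pd

# Illustrative paths; the real reference and LLM output files live in the repos linked further down.
reference = pd.read_csv("reference/scurry_2024_general.csv")
extracted = pd.read_csv("llm_output/scurry_2024_general.csv")

# Match each reference row to the LLM's row for the same precinct/office/candidate.
keys = ["precinct", "office", "candidate"]
merged = reference.merge(extracted, on=keys, how="left", suffixes=("_ref", "_llm"))

# Rows the LLM dropped count as mismatches, which is how they show up in practice.
vote_accuracy = (merged["votes_ref"] == merged["votes_llm"]).mean()
missing_rows = merged["votes_llm"].isna().sum()
print(f"Vote accuracy: {vote_accuracy:.1%}; rows missing from LLM output: {missing_rows}")
```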

Sample selection: Deliberately chose counties with different formats, complexity levels (4-47 precincts), and layout styles.

TX extraction code | MS extraction code | PA parser code

The Results - What Works

Best Performance Against Reference Data:

Model | Accuracy | Sample | Baseline Method
Gemini 2.5 Pro | 99.1% | 9 MS counties | OCR + manual cleanup
Claude Haiku | 100% | Scurry County, TX | Google Gemini (Human Verified)
Claude Haiku | 99.9% | Limestone County, TX (21 precincts) | Google Gemini (Human Verified)

Takeaways

  • In the right conditions, LLMs can do great work
  • But they aren't perfect: 77.8% match across 8 TX counties

Best predictors of success:

  • Clean, well-formatted PDFs (even if scanned)
  • Consistent table structures

Texas Results: Claude Haiku 4.5

County-by-County Performance vs. Reference Data

County | Precincts | Votes Checked | Vote Accuracy | Precinct Name Errors
Scurry | 11 | 321 | 100.0% | 0
Limestone | 21 | 870 | 99.9% | 0
San Saba | 6 | 72 | 91.7% | 0
Foard | 4 | 146 | 84.9% | 0
Lynn | 8 | 376 | 71.5% | 0
Jones | 4 | 240 | 68.3% | 0
Cottle | 4 | 106 | 64.2% | 0
Panola | 19 | 364 | 16.2% | 0

High Success

  • Standard table structures
  • Consistent formatting

Low Success

  • Panola: Many 0-vote candidates
  • Same problems across models

Full Texas comparison report →

What Doesn't Work

Common Failure Patterns When Comparing to Reference:

1. Missing zero-value rows (all models)

  • Reference data includes candidates with 0 votes; LLMs consistently omit these
  • Panola County, TX: 6 minor-party candidates with 0 votes were omitted entirely

2. Incomplete extraction (default max tokens too small; see the sketch after this list)

  • Reference has 3,936 rows; LLMs extracted only 3,131-3,697 rows
  • Missing Supreme Court races in Jones, Lynn, Cottle counties

3. Vote count errors (PDF-specific)

  • Panola County, TX: Claude Haiku 16.2% match to reference
  • Same counties fail across different models

4. Precinct name OCR errors (vertical vs horizontal)

  • Mississippi: 2,898 precinct name errors vs. reference (Claude Haiku, 80 counties)
  • Texas: 0 precinct errors across 2,495 verified votes
When models disagree with reference data, the same problem counties appear consistently.
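
For failure pattern 2, the cheapest fix is to raise the response token limit explicitly (and split very long reports into page ranges). A minimal sketch, assuming the Anthropic Python SDK and its base64 PDF document blocks; the model ID, token limit, file path, and prompt are placeholders, not the exact calls we used:

```python
import base64
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Placeholder path; split very long county reports into page ranges before sending.
with open("panola_2024_general.pdf", "rb") as f:
    pdf_b64 = base64.standard_b64encode(f.read()).decode("utf-8")

message = client.messages.create(
    model="claude-haiku-4-5",   # placeholder; check the current model ID
    max_tokens=16000,           # well above a small default; check the model's output ceiling
    messages=[{
        "role": "user",
        "content": [
            {"type": "document",
             "source": {"type": "base64",
                        "media_type": "application/pdf",
                        "data": pdf_b64}},
            {"type": "text",
             "text": "Extract every precinct, office, candidate, and vote total as CSV rows, "
                     "including candidates with 0 votes."},
        ],
    }],
)
print(message.content[0].text)
```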

Pennsylvania Custom PDF Parser Software

Baseline Method: Extract Text and Parse

Electionware system (used by many PA counties):

Issues discovered building the reference parsers:

  • Hard-coded values (offices, parties, etc.)
  • Format variations between election years (small changes break code)
  • Missing party codes (DAR = cross-filing)
  • Loop control bugs (candidates attributed to the wrong offices)
  • Office header variations (ALL-CAPS vs. Mixed-Case)
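
As one concrete example of the last item, the reference parsers have to normalize office headers before matching them, because Electionware reports switch between ALL CAPS and Mixed Case from year to year. A hedged sketch of what that normalization can look like; the office names and function here are illustrative, not the actual OpenElections parser code:

```python
import re

# Illustrative canonical office names; the real parsers hard-code a much longer list.
KNOWN_OFFICES = {
    "president of the united states": "President",
    "representative in congress": "U.S. House",
    "state treasurer": "State Treasurer",
}

def normalize_office(header: str) -> str | None:
    """Map an office header to a canonical name regardless of case or spacing."""
    key = re.sub(r"\s+", " ", header).strip().lower()
    return KNOWN_OFFICES.get(key)

# "PRESIDENT OF THE UNITED STATES" and "President of the United States"
# both resolve to the same office instead of breaking the parse.
```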
The old parser works. But:
  • Required Python expertise
  • Not portable to other states
  • Ongoing maintenance for format changes

PA parser development guide →

Don't Trust, Verify

You can't just trust LLM output. Here's how we validate against reference:

1. Direct comparison to reference data

  • Vote count matching (what we measured)
  • Precinct name validation

2. County-level total checks (see the code sketch after this list)

  • Extract precinct data with LLM
  • Compare sums to official county totals

3. Multi-model extraction on samples

  • Run 2-3 models on representative counties
  • Where they agree, confidence is high

4. Automated validation patterns

  • Row count checks (expected vs. actual)
  • Zero-vote pattern detection

5. Targeted manual review

  • Focus on counties with low match rates (<80%)
  • Spot-check high-confidence extractions (>95%)
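
A hedged sketch of what checks 2 and 4 can look like in code, assuming pandas DataFrames with precinct, office, candidate, and votes columns plus a dict of official county-level totals; the function, column names, and schema are illustrative:

```python
import pandas as pd

def validate_extraction(extracted: pd.DataFrame,
                        reference: pd.DataFrame,
                        official_totals: dict[str, int]) -> None:
    """Run the cheap automated checks before any manual review.

    official_totals maps office name -> official county-wide vote total.
    """
    # Check 2: precinct-level sums should reproduce the official county totals.
    sums = extracted.groupby("office")["votes"].sum()
    for office, official in official_totals.items():
        got = int(sums.get(office, 0))
        if got != official:
            print(f"TOTAL MISMATCH {office}: extracted {got}, official {official}")

    # Check 4a: a short extraction usually means the response was truncated.
    if len(extracted) < len(reference):
        print(f"ROW COUNT: extracted {len(extracted)}, expected {len(reference)}")

    # Check 4b: zero-vote candidates are the rows LLMs drop most often.
    keys = ["precinct", "office", "candidate"]
    zero_rows = reference.loc[reference["votes"] == 0, keys]
    merged = zero_rows.merge(extracted[keys], on=keys, how="left", indicator=True)
    dropped = merged[merged["_merge"] == "left_only"]
    if not dropped.empty:
        print(f"MISSING ZERO-VOTE ROWS: {len(dropped)}")
```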

Recommendations

Replace existing extraction methods (for clean formats)

  • PDFs like Limestone, Scurry, San Saba (TX)
  • Use LLMs instead of OCR + manual cleanup
  • Validate with county totals

LLM as first pass (for more complex formats)

  • Faster than manual, but needs verification
  • Multi-model extraction
  • Systematic spot-checking

LLMs will not work well for some PDFs.

Figuring out which ones is critical.

How You Should Use LLMs

A sampling approach:
  • Test LLMs on 3-5% of your counties
  • Compare to reference data (or build reference from LLMs + validation)
  • Identify which formats work well
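
A minimal sketch of that sampling step, assuming you have a list of county names; the 5% figure, minimum sample size, and seed are illustrative:

```python
import random

# Hypothetical list; use your state's full set of counties.
counties = ["Scurry", "Limestone", "San Saba", "Foard", "Lynn", "Jones", "Cottle", "Panola"]

random.seed(2024)  # fixed seed so the sample is reproducible
sample_size = max(3, round(len(counties) * 0.05))  # roughly 3-5% of counties, at least 3
sample = random.sample(counties, k=min(sample_size, len(counties)))
print(sample)
```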

Workflow:

Questions?
