Parsing Precinct Election Results PDFs Using LLMs

Derek Willis, OpenElections

View sample PDFs →

What We Tested

For Texas and Mississippi, we tested Claude Haiku 4.5, Claude Sonnet 4.5, Gemini 3 Flash, Gemini 2.5 Pro, and Gemini 3 Pro.

For Pennsylvania, we used Claude Sonnet 4.5 to write a custom Python parser.

State | Sample Size | Baseline (Reference Data)
Texas | 8 counties from 2024 general | Web UI LLM OCR + Python parsers
Mississippi | 9 counties | OCR + manual data manipulation
Pennsylvania | Multiple counties from 2024 and 2025 | Custom Python parsers (Electionware)

We're comparing LLM extraction to verified data extracted last year, not raw PDFs. These reference files were validated through multiple methods. We're measuring: can LLMs match results that took us weeks to produce?
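
A rough sketch of the kind of comparison we ran, assuming both the reference file and the LLM output are CSVs with precinct, office, candidate, and votes columns (the file paths and column names here are illustrative, not OpenElections' exact schema):

```python
import pandas as pd

# Illustrative paths; the real reference and LLM output files live in the repos linked further down.
reference = pd.read_csv("reference/scurry_2024_general.csv")
extracted = pd.read_csv("llm_output/scurry_2024_general.csv")

# Match each reference row to the LLM's row for the same precinct/office/candidate.
keys = ["precinct", "office", "candidate"]
merged = reference.merge(extracted, on=keys, how="left", suffixes=("_ref", "_llm"))

# Rows the LLM dropped count as mismatches, which is how they show up in practice.
vote_accuracy = (merged["votes_ref"] == merged["votes_llm"]).mean()
missing_rows = merged["votes_llm"].isna().sum()
print(f"Vote accuracy: {vote_accuracy:.1%}; rows missing from LLM output: {missing_rows}")
```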

Sample selection: Deliberately chose counties with different formats, complexity levels (4-47 precincts), and layout styles.

TX extraction code | MS extraction code | PA parser code

The Results - What Works

Best Performance Against Reference Data:

Model | Accuracy | Sample | Baseline Method
Gemini 2.5 Pro | 99.1% | 9 MS counties | OCR + manual cleanup
Claude Haiku | 100% | Scurry County, TX | Google Gemini (Human Verified)
Claude Haiku | 99.9% | Limestone County, TX (21 precincts) | Google Gemini (Human Verified)

Takeaways

  • In the right conditions, LLMs can do great work
  • But they aren't perfect: 77.8% match across 8 TX counties

Best predictors of success:

  • Clean, well-formatted PDFs (even if scanned)
  • Consistent table structures

Texas Results: Claude Haiku 4.5

County-by-County Performance vs. Reference Data

County | Precincts | Votes Checked | Vote Accuracy | Precinct Name Errors
Scurry | 11 | 321 | 100.0% | 0
Limestone | 21 | 870 | 99.9% | 0
San Saba | 6 | 72 | 91.7% | 0
Foard | 4 | 146 | 84.9% | 0
Lynn | 8 | 376 | 71.5% | 0
Jones | 4 | 240 | 68.3% | 0
Cottle | 4 | 106 | 64.2% | 0
Panola | 19 | 364 | 16.2% | 0

High Success

  • Standard table structures
  • Consistent formatting

Low Success

  • Panola: Many 0-vote candidates
  • Same problems across models

Full Texas comparison report →

What Doesn't Work

Common Failure Patterns When Comparing to Reference:

1. Missing zero-value rows (all models)

  • Reference data includes candidates with 0 votes; LLMs consistently omit these
  • Panola County, TX: 6 minor-party candidates with 0 votes were omitted entirely

2. Incomplete extraction (default max tokens too small; see the sketch after this list)

  • Reference has 3,936 rows; LLMs extracted only 3,131-3,697 rows
  • Missing Supreme Court races in Jones, Lynn, Cottle counties

3. Vote count errors (PDF-specific)

  • Panola County, TX: Claude Haiku 16.2% match to reference
  • Same counties fail across different models

4. Precinct name OCR errors (vertical vs horizontal)

  • Mississippi: 2,898 precinct name errors vs. reference (Claude Haiku, 80 counties)
  • Texas: 0 precinct errors across 2,495 verified votes
When models disagree with reference data, the same problem counties appear consistently.
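
For failure pattern 2, the cheapest fix is to raise the response token limit explicitly (and split very long reports into page ranges). A minimal sketch, assuming the Anthropic Python SDK and its base64 PDF document blocks; the model ID, token limit, file path, and prompt are placeholders, not the exact calls we used:

```python
import base64
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Placeholder path; split very long county reports into page ranges before sending.
with open("panola_2024_general.pdf", "rb") as f:
    pdf_b64 = base64.standard_b64encode(f.read()).decode("utf-8")

message = client.messages.create(
    model="claude-haiku-4-5",   # placeholder; check the current model ID
    max_tokens=16000,           # well above a small default; check the model's output ceiling
    messages=[{
        "role": "user",
        "content": [
            {"type": "document",
             "source": {"type": "base64",
                        "media_type": "application/pdf",
                        "data": pdf_b64}},
            {"type": "text",
             "text": "Extract every precinct, office, candidate, and vote total as CSV rows, "
                     "including candidates with 0 votes."},
        ],
    }],
)
print(message.content[0].text)
```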

Pennsylvania Custom PDF Parser Software

Baseline Method: Extract Text and Parse

Electionware system (used by many PA counties):

Issues discovered building the reference parsers:

  • Hard-coded values (offices, parties, etc.)
  • Format variations between election years (small changes break code)
  • Missing party codes (DAR = cross-filing)
  • Loop control bugs (candidates attributed to the wrong offices)
  • Office header variations (ALL-CAPS vs. Mixed-Case)
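
As one concrete example of the last item, the reference parsers have to normalize office headers before matching them, because Electionware reports switch between ALL CAPS and Mixed Case from year to year. A hedged sketch of what that normalization can look like; the office names and function here are illustrative, not the actual OpenElections parser code:

```python
import re

# Illustrative canonical office names; the real parsers hard-code a much longer list.
KNOWN_OFFICES = {
    "president of the united states": "President",
    "representative in congress": "U.S. House",
    "state treasurer": "State Treasurer",
}

def normalize_office(header: str) -> str | None:
    """Map an office header to a canonical name regardless of case or spacing."""
    key = re.sub(r"\s+", " ", header).strip().lower()
    return KNOWN_OFFICES.get(key)

# "PRESIDENT OF THE UNITED STATES" and "President of the United States"
# both resolve to the same office instead of breaking the parse.
```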
The old parser works. But:
  • Required Python expertise
  • Not portable to other states
  • Ongoing maintenance for format changes

PA parser development guide →

Don't Trust, Verify

You can't just trust LLM output. Here's how we validate against reference:

1. Direct comparison to reference data

  • Vote count matching (what we measured)
  • Precinct name validation

2. County-level total checks (see the code sketch after this list)

  • Extract precinct data with LLM
  • Compare sums to official county totals

3. Multi-model extraction on samples

  • Run 2-3 models on representative counties
  • Where they agree, confidence is high

4. Automated validation patterns

  • Row count checks (expected vs. actual)
  • Zero-vote pattern detection

5. Targeted manual review

  • Focus on counties with low match rates (<80%)
  • Spot-check high-confidence extractions (>95%)
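
A hedged sketch of what checks 2 and 4 can look like in code, assuming pandas DataFrames with precinct, office, candidate, and votes columns plus a dict of official county-level totals; the function, column names, and schema are illustrative:

```python
import pandas as pd

def validate_extraction(extracted: pd.DataFrame,
                        reference: pd.DataFrame,
                        official_totals: dict[str, int]) -> None:
    """Run the cheap automated checks before any manual review.

    official_totals maps office name -> official county-wide vote total.
    """
    # Check 2: precinct-level sums should reproduce the official county totals.
    sums = extracted.groupby("office")["votes"].sum()
    for office, official in official_totals.items():
        got = int(sums.get(office, 0))
        if got != official:
            print(f"TOTAL MISMATCH {office}: extracted {got}, official {official}")

    # Check 4a: a short extraction usually means the response was truncated.
    if len(extracted) < len(reference):
        print(f"ROW COUNT: extracted {len(extracted)}, expected {len(reference)}")

    # Check 4b: zero-vote candidates are the rows LLMs drop most often.
    keys = ["precinct", "office", "candidate"]
    zero_rows = reference.loc[reference["votes"] == 0, keys]
    merged = zero_rows.merge(extracted[keys], on=keys, how="left", indicator=True)
    dropped = merged[merged["_merge"] == "left_only"]
    if not dropped.empty:
        print(f"MISSING ZERO-VOTE ROWS: {len(dropped)}")
```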

Recommendations

Replace existing extraction methods (for clean formats)

  • PDFs like Limestone, Scurry, San Saba (TX)
  • Use LLMs instead of OCR + manual cleanup
  • Validate with county totals

LLM as first pass (for more complex formats)

  • Faster than manual, but needs verification
  • Multi-model extraction
  • Systematic spot-checking

LLMs will not work well for some PDFs.

Figuring out which ones is critical.

How You Should Use LLMs

A sampling approach:
  • Test LLMs on 3-5% of your counties
  • Compare to reference data (or build reference from LLMs + validation)
  • Identify which formats work well
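
A minimal sketch of that sampling step, assuming you have a list of county names; the 5% figure, minimum sample size, and seed are illustrative:

```python
import random

# Hypothetical list; use your state's full set of counties.
counties = ["Scurry", "Limestone", "San Saba", "Foard", "Lynn", "Jones", "Cottle", "Panola"]

random.seed(2024)  # fixed seed so the sample is reproducible
sample_size = max(3, round(len(counties) * 0.05))  # roughly 3-5% of counties, at least 3
sample = random.sample(counties, k=min(sample_size, len(counties)))
print(sample)
```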

Workflow:

Questions?
