April 23, 2026 · 8 min read

Five bank statement layouts that break naive PDF parsers

Notes from parsing thousands of statements across 24 US and UK banks. The failure modes that separate a weekend-project converter from one you'd trust with tax prep — with concrete examples for each.

"Just use pdftotext and parse the rows" is the first-pass intuition every engineer has when they look at a bank statement for the first time. It works for about 15 minutes, which is how long it takes to process one well-behaved Chase PDF. Then you hit a second bank, and a third, and the assumption that a statement is a table of date/description/amount rows collapses in five specific and entertaining ways.

These are the five layouts we hit most often across our production corpus of US and UK bank statements — the ones that force a parser to go from "regex over pdfplumber output" to "LLM with a reconciliation check." Each example is real; the specific numbers are anonymised.

1. Split-by-direction multi-section layouts (Bank of America)

A Bank of America personal checking statement does not print transactions in chronological order. It splits them across four labelled sections: "Deposits and other additions," "Withdrawals and other subtractions," "Checks," and "Service fees." Each section is sorted by date internally, but the sections themselves appear in a fixed order regardless of when the transactions occurred. A naive row-parser that grabs everything under the transaction header in the order it's printed produces a sequence that isn't chronological and can't be trivially re-sorted — the same date can appear in every section, and some entries (checks) identify themselves only by number.

The correct handling is to parse each section independently, tag each row with its section semantics, then merge by posting date. The reconciliation check then runs on the merged stream. Get the merge wrong — most commonly by forgetting the Checks block or summing withdrawals with the wrong sign — and the arithmetic doesn't close. BoA is the statement that taught us to never trust the visible order of rows in a PDF.
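A minimal sketch of that parse-tag-merge step, assuming each section has already been extracted into (posting date, description, unsigned amount) tuples. The section keys, the `Row` type, and `merge_sections` are hypothetical names for illustration, not the pipeline's actual code.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal

# Hypothetical section keys, mapped to the sign their amounts carry.
SECTION_SIGN = {
    "deposits": Decimal(1),
    "withdrawals": Decimal(-1),
    "checks": Decimal(-1),
    "service_fees": Decimal(-1),
}


@dataclass(frozen=True)
class Row:
    posted: date
    description: str
    amount: Decimal  # signed: positive adds to the balance
    section: str


def merge_sections(sections):
    """sections: dict of section key -> list of (posted, description, amount)."""
    # Fail loudly if a block (most often Checks) was never parsed.
    missing = SECTION_SIGN.keys() - sections.keys()
    if missing:
        raise ValueError(f"sections not parsed: {sorted(missing)}")
    rows = [
        Row(posted, description, SECTION_SIGN[name] * abs(amount), name)
        for name, entries in sections.items()
        for posted, description, amount in entries
    ]
    # sorted() is stable, so rows sharing a date keep their section order.
    return sorted(rows, key=lambda r: r.posted)
```

Requiring every section key to be present, even with an empty list, turns the "forgot the Checks block" mistake into an immediate error instead of a reconciliation delta to chase later.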

2. Per-day running balances (Chase)

Chase statements have a running balance column, as do the layouts of many other banks. But Chase's running balance is a per-day ending balance: if three transactions post on March 14, only the last row of March 14 shows a balance; the first two are blank. A parser that pulls the running balance column and uses it as a per-row check has nothing to compare against on two of those three rows, or worse, attributes the day-end balance to a single transaction.

We handle this by ignoring the printed running balance for reconciliation entirely and computing our own — beginning balance plus the cumulative sum of signed amounts, row by row. The printed column becomes a secondary integrity check: at the end of each day, our reconstructed balance should match the last printed balance of that day. When it doesn't, we know something is wrong at the row level, not just at end-of-statement.
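The day-level check can be sketched roughly as follows, assuming rows arrive in statement order as (posting date, signed amount, printed balance or None) tuples. `check_daily_balances` is a hypothetical helper name, and exact `Decimal` equality is used here only to keep the example short.

```python
from datetime import date
from decimal import Decimal


def check_daily_balances(beginning, rows):
    """Return the dates whose printed day-end balance disagrees with ours.

    rows: list of (posted, signed_amount, printed_balance_or_None),
    in statement order, with blank balance cells passed as None.
    """
    running = beginning
    reconstructed = {}  # date -> our balance after that day's last row
    printed = {}        # date -> last printed balance seen on that day
    for posted, amount, printed_balance in rows:
        running += amount
        reconstructed[posted] = running
        if printed_balance is not None:
            printed[posted] = printed_balance
    return [d for d, balance in printed.items() if reconstructed[d] != balance]
```

The first date in the returned list narrows a bad extraction down to a single day's rows, which is much easier to inspect than a mismatch on the statement's ending balance alone.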

3. Cycle-close dates that aren't month-aligned (credit cards)

Credit-card statements (Chase Sapphire, Amex, Barclaycard, Discover it) close on a fixed day of the month, not on month-end. A "March statement" might cover transactions from February 14 to March 13. If your bookkeeping system assumes every statement covers exactly one calendar month, you'll either miss the transactions that fall outside that month or double-count them against the neighbouring statement, depending on which direction you round.

The fix is to parse the cycle start and end directly from the statement header and treat the period as first-class metadata on the extracted rows — never derive a period from the filename or the statement month. Our QBO exporter uses the cycle-close date as the batch date, which matches how QuickBooks actually expects credit-card statements to arrive, and makes the reconciliation for multi-statement imports work without calendar-math surprises.
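One way to treat the cycle as first-class metadata, sketched under the assumption that the start and end dates have already been read from the statement header. `StatementPeriod`, `rows_outside_period`, and `period_continuity_issues` are illustrative names, and the rule that consecutive cycles should abut exactly is an assumption of this sketch rather than something every issuer guarantees.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class StatementPeriod:
    start: date
    end: date  # cycle-close date, inclusive

    def contains(self, d):
        return self.start <= d <= self.end


def rows_outside_period(period, posted_dates):
    """Posting dates that fall outside the cycle printed in the header."""
    return [d for d in posted_dates if not period.contains(d)]


def period_continuity_issues(periods):
    """Find gaps and overlaps between consecutive statement cycles."""
    ordered = sorted(periods, key=lambda p: p.start)
    issues = []
    for prev, nxt in zip(ordered, ordered[1:]):
        expected = prev.end + timedelta(days=1)
        if nxt.start > expected:
            issues.append(("gap", expected, nxt.start - timedelta(days=1)))
        elif nxt.start < expected:
            issues.append(("overlap", nxt.start, min(prev.end, nxt.end)))
    return issues
```

Running the continuity check before a multi-statement import surfaces a missing or duplicated statement as a dated gap or overlap, without any reference to calendar months.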

4. DD/MM vs MM/DD date parsing ambiguity

A date written as "03/04/25" is either March 4 or April 3, depending on whether the statement is US or UK. For the first twelve days of every month, both interpretations are structurally valid, and a row on its own gives you nothing to disambiguate: the amounts reconcile identically under either reading.

The bank you're parsing tells you the answer, but only if the parser knows which bank. A statement from Barclays is DD MMM YY ("03 Apr 25"); a statement from Chase is MM/DD; a statement from a co-brand card issued jointly between a US and UK entity can go either way. Getting this wrong doesn't break reconciliation, because the arithmetic still closes, but it can silently shift transactions into the wrong month and sometimes the wrong calendar quarter, which is the kind of error that hurts when it's time to file.

We detect bank identity early in the pipeline and pin the date format per bank. When the layout is ambiguous, we look at the statement period header: a document covering "01 Jan – 31 Jan" with a row dated "03/01" can only be 3 January, not 1 March, because the period bounds it. Small check, high leverage.
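The period-bounds check can be sketched as below. `resolve_numeric_date` is a hypothetical helper; it assumes the year is already known and that the statement period has been parsed into two `date` values. It tries both readings and accepts a date only when exactly one of them is a valid calendar date inside the period.

```python
from datetime import date


def resolve_numeric_date(first, second, year, period_start, period_end):
    """Resolve an ambiguous 'first/second' date using the statement period.

    Returns the single reading that falls inside the period, or None when
    neither reading fits or both do (the period cannot decide).
    """
    candidates = set()
    # Try DD/MM first, then MM/DD.
    for day, month in ((first, second), (second, first)):
        try:
            d = date(year, month, day)
        except ValueError:
            continue  # e.g. month 14 is not a real month
        if period_start <= d <= period_end:
            candidates.add(d)
    return candidates.pop() if len(candidates) == 1 else None
```

For a January statement, `resolve_numeric_date(3, 1, 2025, date(2025, 1, 1), date(2025, 1, 31))` returns 3 January 2025. A cycle that straddles two months can leave both readings in range, in which case the function returns None and the per-bank format has to decide.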

5. The off-by-penny display rounding

Some banks (Lloyds, older HSBC UK formats, a handful of US regional banks) display currency amounts rounded to the penny or cent but hold internal sub-penny precision for interest calculations. Summing the displayed amounts gives you an ending balance that can differ from the printed ending balance by a penny across a 200-row statement.

This one is not a parser error. The statement really does add up to what it prints; the rounding artefacts are real. But a parser that treats any penny-level discrepancy as a failure will refuse to verify statements that are, arithmetically speaking, fine. A parser that ignores discrepancies will miss real errors that happen to sit at the penny scale.

The reconciliation check we ship uses a $0.01 tolerance — anything within a penny is treated as "verified." Anything larger is flagged. This threshold has turned out to be the right one across the 24 banks we currently support; we've never seen a real error that was under a penny, and we've seen plenty of real displays that are off by exactly one. If we ever hit a bank where the display rounding legitimately exceeds that, we'll raise the tolerance per-bank rather than globally — preserving the invariant that "Verified" means something specific.
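A sketch of a tolerance-based check along these lines, using `Decimal` so the comparison itself introduces no floating-point noise. The names `DEFAULT_TOLERANCE`, `TOLERANCE_BY_BANK`, and `reconcile` are assumptions for illustration; the only figure taken from the text above is the 0.01 threshold.

```python
from decimal import Decimal

DEFAULT_TOLERANCE = Decimal("0.01")
# Hypothetical per-bank overrides; left empty until a bank needs one.
TOLERANCE_BY_BANK = {}


def reconcile(beginning, signed_amounts, printed_ending, bank=None):
    """Compare the summed rows against the printed ending balance.

    Returns (status, delta) where delta = computed - printed.
    """
    tolerance = TOLERANCE_BY_BANK.get(bank, DEFAULT_TOLERANCE)
    computed = beginning + sum(signed_amounts, Decimal(0))
    delta = computed - printed_ending
    status = "verified" if abs(delta) <= tolerance else "flagged"
    return status, delta
```

Keeping the override in a per-bank table rather than widening the default means a looser threshold for one bank never relaxes the check for the others.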

The punchline

Each of the five layouts above is solvable individually; several of them have been solved for years by the desktop accounting software market. What's new isn't that they're tricky — it's that a modern LLM, handed any of these statements, will happily return a plausible-looking spreadsheet that gets at least one of them wrong. Sometimes more than one. Without a reconciliation check at the end of the pipeline, there's no way to tell which.

This is why every export from our pipeline runs the check, and why the product is organised around the two outcomes — Verified, or a specific flagged row — rather than a continuous "confidence score." You don't need to know how often we're right in aggregate; you need to know whether this specific file adds up to what the bank printed. The parser either says yes, or it tells you where the delta is.

