[D] How to Automate parsing of Bank Statement PDFs to extract transaction level data

By skyforbes, Nov 28, 2025

I am working on a project where I need to extract transaction data from bank statement PDFs. About 80% of the PDFs I work with are digitally generated, so to handle those I use a regex approach: I first extract the text into a .txt file and then run regexes over it to pull out the data in a meaningful format [Date, Particulars, Credit/Debit amount, Balance]. The challenge is that the regex approach is brittle and very sensitive to format. Every bank requires a new regex, and any small change a bank makes to its format tomorrow will break the pipeline.
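For reference, a minimal sketch of this kind of regex step. The line layout (date, particulars, amount, Cr/Dr flag, balance) and the names `LINE_RE` and `parse_line` are made up for illustration and are not taken from any real bank format:

```python
import re
from decimal import Decimal

# Hypothetical layout for ONE bank:
#   DD/MM/YYYY  <particulars>  <amount> Cr|Dr  <balance>
LINE_RE = re.compile(
    r"^(?P<date>\d{2}/\d{2}/\d{4})\s+"
    r"(?P<particulars>.+?)\s+"
    r"(?P<amount>[\d,]+\.\d{2})\s+(?P<kind>Cr|Dr)\s+"
    r"(?P<balance>[\d,]+\.\d{2})$"
)


def to_decimal(text):
    """Convert '12,345.67' to Decimal('12345.67')."""
    return Decimal(text.replace(",", ""))


def parse_line(line):
    """Return a transaction dict, or None if the line does not fit the layout."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    amount = to_decimal(m["amount"])
    return {
        "date": m["date"],
        "particulars": m["particulars"],
        "credit": amount if m["kind"] == "Cr" else None,
        "debit": amount if m["kind"] == "Dr" else None,
        "balance": to_decimal(m["balance"]),
    }
```

A line from another bank that writes dates as `2025-11-01`, or drops the Cr/Dr flag, returns `None` here, which is exactly the brittleness described above.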

I want to build a pipeline that is agnostic to bank format and can extract this information from the PDFs. I cannot use any third-party APIs because the bank data is sensitive and we want to keep everything on internal servers.

Hence, I have been exploring open-source models to build this pipeline. After some research, I landed on the LayoutLMv3 model, which can essentially label tokens based on their location on the page. If we can train the model on our data, it should be able to tag every token on the page, and that should do it. The challenge is that this model is sensitive to reading order and fails on a few bank formats.
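One way to make the reading order deterministic before tokens reach a layout model is to rebuild it from word coordinates. The sketch below is an assumption about a possible pre-processing step, not something I have verified against LayoutLMv3. It expects each word as a dict with `text`, `x0`, `x1`, `top`, `bottom` keys (the shape that, for example, pdfplumber's `extract_words()` returns), and the tolerance `y_tol` is a guess that would need tuning per document:

```python
def reading_order(words, y_tol=3.0):
    """Group word boxes into visual lines, ordered top-to-bottom, left-to-right.

    words: list of dicts with 'text', 'x0', 'x1', 'top', 'bottom'.
    y_tol: max vertical distance (in page units) between a word's centre
           and the first word of a line for the word to join that line.
    """
    def mid_y(w):
        return (w["top"] + w["bottom"]) / 2

    lines = []
    for w in sorted(words, key=mid_y):
        if lines and abs(mid_y(w) - lines[-1]["mid"]) <= y_tol:
            lines[-1]["words"].append(w)
        else:
            lines.append({"mid": mid_y(w), "words": [w]})

    # Within each line, order words by their left edge.
    return [sorted(line["words"], key=lambda w: w["x0"]) for line in lines]
```

Single-column statements are the easy case for this. Pages with side-by-side blocks would need the blocks separated first.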

Since then I have explored MinerU, but that failed as well. It isolated the transaction table, but then failed to extract the data in an orderly fashion because it could not differentiate between the multiple lines of transactions.
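If part of the problem is that one transaction wraps over several text lines, a simple heuristic is to treat every line that starts with a date as the start of a new transaction and to append the following lines to it. This is only a sketch under that assumption; the date pattern and the name `group_transactions` are illustrative, and banks that do not print the date first would need a different anchor:

```python
import re

# Assumed anchor: a transaction's first line starts with a date such as
# 01/11/2025, 01-11-2025 or 01/11/25.
DATE_START = re.compile(r"^\d{2}[/-]\d{2}[/-]\d{2,4}\b")


def group_transactions(lines):
    """Merge wrapped lines so that each returned string is one transaction."""
    groups = []
    for line in lines:
        text = line.strip()
        if not text:
            continue  # skip blank lines
        if DATE_START.match(text):
            groups.append([text])     # a date starts a new transaction
        elif groups:
            groups[-1].append(text)   # continuation of the previous one
        # Lines before the first date (headers etc.) are ignored.
    return [" ".join(parts) for parts in groups]
```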

Now I am working with YOLOv8, which I am training to identify transaction rows and amount columns as bounding boxes. I will then pull the information from the intersections of these boxes. But the confidence here is not very high.
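The intersection step itself is independent of the detector. A minimal sketch, assuming boxes are `(x0, y0, x1, y1)` tuples and that the detected boxes have already been scaled into the same coordinate system as the word boxes (the word dict keys and the function names are assumptions for illustration):

```python
def intersect(a, b):
    """Intersection of two (x0, y0, x1, y1) boxes, or None if they do not overlap."""
    x0, y0 = max(a[0], b[0]), max(a[1], b[1])
    x1, y1 = min(a[2], b[2]), min(a[3], b[3])
    return (x0, y0, x1, y1) if x0 < x1 and y0 < y1 else None


def cell_text(row_box, col_box, words):
    """Text of the words whose centre lies in the row/column intersection."""
    cell = intersect(row_box, col_box)
    if cell is None:
        return ""
    picked = []
    for w in words:
        cx = (w["x0"] + w["x1"]) / 2
        cy = (w["top"] + w["bottom"]) / 2
        if cell[0] <= cx <= cell[2] and cell[1] <= cy <= cell[3]:
            picked.append(w)
    # Keep left-to-right order inside the cell.
    return " ".join(w["text"] for w in sorted(picked, key=lambda w: w["x0"]))
```

Testing word centres rather than full containment tolerates boxes that are slightly too tight, which may help when detection confidence is low.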

Has anyone here faced a similar challenge? Can anyone help me with a solution or an approach? It would be a great help!

Note that most of the PDFs don't have any defined table; it's just text hanging in the air with a lot of whitespace. I also need a solution for scanned PDFs [integrated with OCR].
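Because the layout is whitespace-separated rather than ruled, one format-agnostic idea is to infer columns from the whitespace itself: merge the horizontal extents of all words in the transaction region and treat any wide gap that no word crosses as a column separator. The sketch below assumes word dicts with `x0` and `x1` keys; `min_gap` and the function names are illustrative. The same word boxes could in principle come from an OCR engine for scanned pages, though I have not tested that:

```python
def find_columns(words, min_gap=8.0):
    """Infer column x-ranges from word boxes across all rows.

    Word extents closer than min_gap are merged, so ordinary spaces inside
    the particulars column do not split it, while wide empty bands do.
    """
    columns = []
    for x0, x1 in sorted((w["x0"], w["x1"]) for w in words):
        if columns and x0 - columns[-1][1] < min_gap:
            columns[-1][1] = max(columns[-1][1], x1)  # extend current column
        else:
            columns.append([x0, x1])                  # start a new column
    return [tuple(c) for c in columns]


def assign_column(word, columns):
    """Index of the column containing the word's horizontal centre, or None."""
    cx = (word["x0"] + word["x1"]) / 2
    for i, (x0, x1) in enumerate(columns):
        if x0 <= cx <= x1:
            return i
    return None
```

Header or footer text that spans the full page width would merge every column into one, so this only makes sense on words from the transaction area.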
