[D] How to Automate parsing of Bank Statement PDFs to extract transaction level data

I am working on a project where I need to extract transaction data from Bank Statement PFs. 80% of my working PFs are digitally generated so to handle those I put the Regex approach, where I first extract the text into a txt file and then run Regex on this data to extract data in a meaningful format [ate, Particulars, Credit/ebit amount, Balance]. The challenge is that the Regex approach is brittle, and very sensitive to formats. So every bank requires a new Regex plus any little change in the format tomorrow by the bank will break the pipeline.

I want to make a pipeline which is agnostic to bank-format and is capable of extracting the info from the PFs. I cannot use any 3rd party APIs as the bank data is sensitive and we want to keep everything on internal servers.

Hence, I have been exploring ways in Open Source models to built this pipeline. After doing some research, I landed on LayoutLMv3 Model which can essentially label the Tokens based on their location on the page so if we are able to train the model on our data it should be able to tag every token on the page and that should do it, but the challenge here is that this model is sensitive to reading order and fails on few bank formats.

Since then I have explored MinerU but that failed as well, it isolated the transaction content table but later failed to extract data in orderly fashion as it could not differentiate between multiple lines of transactions.

Now I am working with YOLOv8 which I am training to identify transaction rows and amount columns using BBox and then I will pull the info from these BBox intersection. But the confidence here is not very high.

Has anyone here faced similar challenge? Can anyone help me with some solution or approach. It would be a great help!

Know that the most of the PFs don't have any defined table, it's just text hanging in air with lot of whitespace. I need a solve for Scanned PFs as well [integrated with OCR]

Leave a Reply