I want to build a pipeline that is agnostic to bank format and can extract the information from PDFs. I cannot use any third-party APIs because the bank data is sensitive and we want to keep everything on internal servers.
Hence, I have been exploring open-source models to build this pipeline. After some research, I landed on the LayoutLMv3 model, which can essentially label tokens based on their location on the page. If we can train the model on our data, it should be able to tag every token on the page, and that should do it. The challenge is that this model is sensitive to reading order and fails on a few bank formats.
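To make the reading-order issue concrete, here is a minimal sketch of normalizing token order before the tokens reach a layout model. It is an illustration under assumptions, not part of LayoutLMv3 itself: the `Token` class, the `sort_reading_order` helper, and the `y_tolerance` value are all hypothetical, and it assumes a single-column page with word boxes in page coordinates (origin at top left).

```python
from dataclasses import dataclass


@dataclass
class Token:
    # Hypothetical container for one word and its bounding box.
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def sort_reading_order(tokens, y_tolerance=3.0):
    """Group tokens into lines by vertical centre, then sort each line left to right."""
    ordered = sorted(tokens, key=lambda t: ((t.y0 + t.y1) / 2, t.x0))
    lines = []
    for tok in ordered:
        cy = (tok.y0 + tok.y1) / 2
        # A token joins the current line if its centre is close to the
        # centre of the first token on that line (y_tolerance is a guess).
        if lines and abs(cy - lines[-1]["cy"]) <= y_tolerance:
            lines[-1]["tokens"].append(tok)
        else:
            lines.append({"cy": cy, "tokens": [tok]})
    result = []
    for line in lines:
        result.extend(sorted(line["tokens"], key=lambda t: t.x0))
    return result


if __name__ == "__main__":
    scrambled = [
        Token("500.00", 200, 29, 250, 39),
        Token("Date", 10, 10, 40, 20),
        Token("01/02", 10, 30, 40, 40),
        Token("Amount", 200, 11, 250, 21),
    ]
    print([t.text for t in sort_reading_order(scrambled)])
```

A fixed tolerance like this will likely break on skewed scans or multi-column layouts, so it is only a starting point.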
Since then I have tried MinerU, but that failed as well: it isolated the transaction table, but then failed to extract the data in an orderly fashion because it could not differentiate between the multiple lines of transactions.
Now I am working with YOLOv8, which I am training to detect transaction rows and amount columns as bounding boxes; I will then pull the information from the intersections of these boxes. But the confidence so far is not very high.
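The intersection step can be sketched roughly as follows. This is only an illustration: it assumes the detector has already produced row and column boxes as `(x0, y0, x1, y1)` tuples and that words with their boxes come from a PDF text layer or OCR. The helper names and the `min_cover` threshold are hypothetical, not part of YOLOv8.

```python
def overlap_area(a, b):
    """Intersection area of two (x0, y0, x1, y1) boxes; 0 if they are disjoint."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(w, 0) * max(h, 0)


def cell_box(row, col):
    """Rectangle where a row box and a column box overlap, or None."""
    x0, y0 = max(row[0], col[0]), max(row[1], col[1])
    x1, y1 = min(row[2], col[2]), min(row[3], col[3])
    return (x0, y0, x1, y1) if x0 < x1 and y0 < y1 else None


def words_in_cell(words, row, col, min_cover=0.5):
    """Return the words (left to right) mostly covered by the row/column cell.

    words is a list of (text, box) pairs; min_cover is an assumed threshold
    on the fraction of each word box that must fall inside the cell.
    """
    cell = cell_box(row, col)
    if cell is None:
        return []
    hits = []
    for text, box in words:
        area = (box[2] - box[0]) * (box[3] - box[1])
        if area > 0 and overlap_area(box, cell) / area >= min_cover:
            hits.append((box[0], text))
    return [text for _, text in sorted(hits)]


if __name__ == "__main__":
    row = (0, 100, 600, 120)        # one detected transaction row
    amount_col = (450, 0, 550, 800)  # the detected amount column
    words = [
        ("01/02", (10, 102, 50, 118)),
        ("1,250.00", (460, 102, 540, 118)),
        ("99.00", (460, 130, 520, 146)),
    ]
    print(words_in_cell(words, row, amount_col))
```

Using a coverage fraction rather than requiring full containment gives some slack when the predicted boxes are slightly too tight.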
Has anyone here faced a similar challenge? Can anyone suggest a solution or an approach? It would be a great help!
Note that most of the PDFs don't have a defined table; it's just text hanging in the air with a lot of whitespace. I also need a solution for scanned PDFs (integrated with OCR).
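Since there are no ruled tables, one possible heuristic is to infer columns from the whitespace itself. The sketch below is an assumption-laden illustration, not a tested solution: `find_column_spans` and the `min_gap` value are hypothetical, and the input is assumed to be word boxes as `(x0, y0, x1, y1)` tuples from a text layer or OCR.

```python
def find_column_spans(word_boxes, min_gap=15.0):
    """Merge horizontal word extents; gaps of at least min_gap separate columns.

    Returns a list of (x_start, x_end) spans, one per inferred column.
    """
    spans = sorted((box[0], box[2]) for box in word_boxes)
    columns = []
    for x0, x1 in spans:
        # Extend the current column if this word starts before the gap
        # threshold is reached; otherwise start a new column.
        if columns and x0 - columns[-1][1] < min_gap:
            columns[-1][1] = max(columns[-1][1], x1)
        else:
            columns.append([x0, x1])
    return [tuple(c) for c in columns]


if __name__ == "__main__":
    boxes = [
        (10, 0, 50, 10), (12, 20, 48, 30),      # date column
        (120, 0, 260, 10), (130, 20, 200, 30),  # description column
        (480, 0, 540, 10), (470, 20, 540, 30),  # amount column
    ]
    print(find_column_spans(boxes))
```

One known weakness: a single long line that runs across the gap (for example a wide header or a wrapped description) would merge two columns, so such lines would need to be filtered out first.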