ImageEn Forum - Reading PDF files with


JobelMurta

Brazil
1 Posts

Posted - Feb 13 2025 : 12:29:29

Hi,

I have a number of PDF documents with a similar layout. They are technical specifications that have been revised and improved over the past few years. I now have more than 14k documents and would like to automate tagging those PDF files based on their similarity, and also extract data from specific fields in each document so it can be stored in the right database field.

I'm attaching a few files that are similar and others that are a little different. I could call the similar ones model "A" and the other types model "B", "C", etc. In model "A" the document number is at approximately one position (top, left, width, height), while in model "B" the same information is at another position.

Please help me with ideas on how to solve this. OCR on each field would also help to load all this data correctly into a local MySQL server. If you think that training a model using the found objects is the best way, please give me some clues on how to start this project. I tried the PDF Pages Objects demo, but was not able to find reasonable positions with which to delimit the data to extract.

attach/JobelMurta/2025213122754_00606.pdf

attach/JobelMurta/2025213122840_000091170000002.pdf

attach/JobelMurta/202521312299_000365170000003.pdf

xequte

38939 Posts

Posted - Feb 13 2025 : 14:40:17

Hi

I'm not sure on what basis you would consider the documents similar. For example, if it is based on layout, you could treat the documents as images and compare their similarity:

http://www.imageen.com/help/TImageEnProc.CompareWith.html
http://www.imageen.com/help/TImageEnProc.ComputeImageEquality.html

Otherwise, if it is related to text, you could extract the text or object positions and create an algorithm to compare those.
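As a rough illustration of comparing object positions, here is a minimal sketch in Python rather than Delphi. It is not ImageEn code. It assumes the text objects have already been extracted from each PDF page as (text, left, top, width, height) tuples, with coordinates expressed as fractions of the page size. The names `layout_signature`, `jaccard`, and `classify`, the grid size, and the 0.6 threshold are all hypothetical choices for the example.

```python
from typing import Dict, Iterable, Optional, Set, Tuple

# Assumed input shape: (text, left, top, width, height), coordinates in 0..1
TextObject = Tuple[str, float, float, float, float]
Cell = Tuple[int, int]


def layout_signature(objects: Iterable[TextObject], grid: int = 20) -> Set[Cell]:
    """Quantize the top-left corner of every text object onto a grid x grid lattice.

    Small shifts between documents of the same model tend to land in the same cell.
    """
    cells: Set[Cell] = set()
    for _text, left, top, _width, _height in objects:
        col = min(int(left * grid), grid - 1)
        row = min(int(top * grid), grid - 1)
        cells.add((col, row))
    return cells


def jaccard(a: Set[Cell], b: Set[Cell]) -> float:
    """Share of occupied cells the two layouts have in common (0.0 to 1.0)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def classify(signature: Set[Cell], models: Dict[str, Set[Cell]],
             threshold: float = 0.6) -> Optional[str]:
    """Return the name of the best-matching model, or None if nothing is close enough."""
    best_name, best_score = None, 0.0
    for name, model_signature in models.items():
        score = jaccard(signature, model_signature)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None


# Hypothetical reference layouts for model "A" and model "B"
models = {
    "A": layout_signature([("Doc No", 0.12, 0.07, 0.2, 0.02),
                           ("Title", 0.12, 0.17, 0.5, 0.02),
                           ("Rev", 0.82, 0.07, 0.1, 0.02)]),
    "B": layout_signature([("Doc No", 0.72, 0.92, 0.2, 0.02),
                           ("Title", 0.32, 0.52, 0.5, 0.02)]),
}

# A new document whose objects are slightly shifted relative to model "A"
new_doc = [("Doc No", 0.13, 0.08, 0.2, 0.02),
           ("Title", 0.14, 0.18, 0.5, 0.02),
           ("Rev", 0.83, 0.08, 0.1, 0.02)]
tag = classify(layout_signature(new_doc), models)
```

Documents that match no model come back as `None`, so they could be set aside and reviewed as candidates for a new model ("C", "D", etc.). A coarser grid tolerates larger shifts but separates layouts less sharply.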

These PDF files are mostly text anyway, so OCR should only be needed if you want to parse their image content (e.g. the yellow areas of the first document).
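Because the files are mostly text, the per-model field positions described in the question could be applied directly to the extracted text objects, without OCR. The sketch below is in Python and is not ImageEn code. It assumes the same hypothetical (text, left, top, width, height) tuples with page-relative coordinates, and the region values and the `extract_fields` name are invented for illustration.

```python
from typing import Dict, Iterable, Tuple

# Assumed input shape: (text, left, top, width, height), coordinates in 0..1
TextObject = Tuple[str, float, float, float, float]
# A field region for one model: (left, top, width, height), coordinates in 0..1
Region = Tuple[float, float, float, float]


def extract_fields(objects: Iterable[TextObject],
                   regions: Dict[str, Region]) -> Dict[str, str]:
    """Collect, per field, the text of every object whose center lies in the field's region."""
    objects = list(objects)
    result: Dict[str, str] = {}
    for field, (r_left, r_top, r_width, r_height) in regions.items():
        hits = []
        for text, left, top, width, height in objects:
            center_x = left + width / 2
            center_y = top + height / 2
            inside_x = r_left <= center_x <= r_left + r_width
            inside_y = r_top <= center_y <= r_top + r_height
            if inside_x and inside_y:
                hits.append((top, left, text))
        hits.sort()  # reading order: top to bottom, then left to right
        result[field] = " ".join(text for _top, _left, text in hits)
    return result


# Hypothetical template for model "A": where the document number is expected
model_a_regions = {"document_number": (0.65, 0.03, 0.30, 0.06)}

# Hypothetical text objects extracted from one page
page_objects = [("DOC-0001", 0.70, 0.05, 0.20, 0.02),
                ("Specification", 0.10, 0.05, 0.30, 0.02)]

fields = extract_fields(page_objects, model_a_regions)
```

Each model would get its own region table, so once a document is tagged as model "A" or "B" the matching table decides where each database field is read from. Making the regions somewhat larger than the expected text gives some tolerance for the approximate positions mentioned in the question.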

Nigel
Xequte Software
www.imageen.com
