ImageEn for Delphi and C++ Builder

 

ImageEn Forum

 Reading PDF files with

JobelMurta

Brazil
1 Posts

Posted - Feb 13 2025 :  12:29:29
Hi,

I have some PDF documents with a similar layout. These documents are technical specifications that have been changed and improved over the past few years. I now have more than 14k documents and would like to automate the process of tagging those PDF files based on their similarity, and also extract data from specific fields in each document so that it can be stored in the right database field.

I'm attaching a few similar files and others that are a little different. The similar ones I could call model "A", and the other types model "B", "C", etc. In model "A", the document number is located approximately at a given position (top, left, width, height), while in document "B" the same information is at another position.

Please help me with ideas on how to solve this. Running OCR on each field would also help to load all this data correctly into a local MySQL server. If you think that training a model to find objects is the best approach, please give me some clues on how to start this project. I tried the PDF Pages Objects demo, but was not able to find reasonable positions with which to delimit the data to extract.

attach/JobelMurta/2025213122754_00606.pdf

attach/JobelMurta/2025213122840_000091170000002.pdf

attach/JobelMurta/202521312299_000365170000003.pdf

xequte

38796 Posts

Posted - Feb 13 2025 :  14:40:17
Hi

I'm not sure on what basis you would consider the documents similar. For example, if it is based on layout, you could treat the documents as images and compare their similarity:

http://www.imageen.com/help/TImageEnProc.CompareWith.html
http://www.imageen.com/help/TImageEnProc.ComputeImageEquality.html
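
To illustrate the image-based idea in general terms, here is a minimal Python sketch. It does not reproduce what CompareWith or ComputeImageEquality actually compute. It assumes each page has already been rendered to a small grayscale grid (a list of rows of 0..255 values) at a fixed size, and the names page_similarity, assign_model and the 0.9 threshold are hypothetical:

```python
def page_similarity(a, b):
    """Return a score in [0, 1]; 1.0 means the two grayscale grids are identical.

    Assumption: a and b are lists of rows of ints in 0..255, rendered at the
    same size (e.g. a downscaled thumbnail of the first page).
    """
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("pages must be rendered at the same size")
    count = sum(len(row) for row in a)
    if count == 0:
        raise ValueError("pages must not be empty")
    # Mean absolute pixel difference, normalised to 0..1 and inverted.
    total = sum(abs(pa - pb) for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return 1.0 - total / (255.0 * count)


def assign_model(page, references, threshold=0.9):
    """Return the name of the closest reference layout, or None if none is close.

    Assumption: references maps a model name ("A", "B", ...) to one
    hand-picked reference page per layout; the threshold is a guess to tune.
    """
    best_name, best_score = None, 0.0
    for name, ref in references.items():
        score = page_similarity(page, ref)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None
```

Documents that return None would need to be reviewed by hand and, if they form a new layout, added as another reference.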

Otherwise, if it is related to text, you could extract the text or object positions and create some algorithm to compare those.
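
As a rough sketch of what such an algorithm might look like (in Python, not ImageEn code): it assumes the text objects of a page have already been extracted as (text, left, top, width, height) tuples by whatever tool you use. The function names, the 20-unit grid and the tuple layout are all hypothetical. Layouts are compared by which grid cells contain text, and a field is read by collecting the text that falls inside a per-model rectangle:

```python
def layout_signature(objects, cell=20):
    """Quantise each object's top-left corner to a coarse grid cell.

    Assumption: objects is a list of (text, left, top, width, height) tuples.
    The cell size absorbs small position shifts between documents.
    """
    return {(int(left // cell), int(top // cell))
            for _text, left, top, _w, _h in objects}


def jaccard(a, b):
    """Overlap of two signatures: 1.0 = same cells occupied, 0.0 = none shared."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def text_in_region(objects, region):
    """Join the text of every object whose centre lies inside region.

    region is (left, top, width, height), e.g. where model "A" keeps its
    document number. Results are ordered top-to-bottom, then left-to-right.
    """
    r_left, r_top, r_w, r_h = region
    hits = []
    for text, left, top, w, h in objects:
        cx, cy = left + w / 2, top + h / 2
        if r_left <= cx <= r_left + r_w and r_top <= cy <= r_top + r_h:
            hits.append((top, left, text))
    return " ".join(text for _top, _left, text in sorted(hits))
```

With one region table per model (for example a dictionary mapping "A" to the rectangle of its document number), each document can first be matched to a model by signature and then read with that model's regions.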


These PDF files are mostly text anyway, so OCR would only be needed if you want to parse their image content (e.g. the yellow areas of the first document).




Nigel
Xequte Software
www.imageen.com