TOPIC REVIEW
JobelMurta |
Posted - Feb 13 2025 : 12:29:29
Hi,
I have some PDF documents with similar layouts. They are technical specifications that have been changed and improved over the past few years. I now have more than 14k documents and would like to automate tagging these PDF files based on their similarity, and also extract data from specific fields in each document so that it can be stored in the right database field.
I'm attaching a few similar files and others that are a little different. I could call the similar ones model "A" and the other types model "B", "C", etc. In model "A", the document number sits at approximately a known position (top, left, width, height), while in model "B" the same information is at a different position.
Please help me with ideas on how to solve this. OCR on each field would also help with loading all this data correctly into a local MySQL server. If you think that training the model using find objects is the best way, please give me some pointers on how to start this project. I tried the PDF Pages Objects demo, but was not able to find reasonable positions with which to delimit the data to extract.
attach/JobelMurta/2025213122754_00606.pdf
attach/JobelMurta/2025213122840_000091170000002.pdf
attach/JobelMurta/202521312299_000365170000003.pdf
1 LATEST REPLIES (Newest First)
xequte |
Posted - Feb 13 2025 : 14:40:17
Hi
I'm not sure on what basis you would consider the documents similar. For example, if it is based on layout, you could treat the documents as images and compare their similarity:
http://www.imageen.com/help/TImageEnProc.CompareWith.html
http://www.imageen.com/help/TImageEnProc.ComputeImageEquality.html
Otherwise, if it relates to text, you could extract the text or object positions and create an algorithm to compare those.
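As a rough illustration of what such a comparison algorithm might look like, here is a minimal Python sketch (not ImageEn code). It assumes the text object positions have already been extracted from each PDF by some other means and normalized to fractions of the page size. The names `TextObject`, `layout_signature`, `jaccard` and `assign_models`, the grid size and the 0.7 threshold are all hypothetical choices for this example, not part of any library.

```python
from typing import Dict, Iterable, List, NamedTuple, Tuple


class TextObject(NamedTuple):
    """One text object from a page; left/top are fractions of page size (0..1)."""
    text: str
    left: float
    top: float


def layout_signature(objects: Iterable[TextObject], grid: int = 20) -> frozenset:
    """Quantize each object's position to a grid cell; the set of cells is the signature."""
    cells = set()
    for obj in objects:
        col = min(int(obj.left * grid), grid - 1)
        row = min(int(obj.top * grid), grid - 1)
        cells.add((row, col))
    return frozenset(cells)


def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity: shared cells divided by all cells used by either layout."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def assign_models(signatures: Dict[str, frozenset], threshold: float = 0.7) -> Dict[str, str]:
    """Greedy grouping: a document joins the first model whose reference
    signature is similar enough, otherwise it starts a new model."""
    references: List[Tuple[str, frozenset]] = []
    labels: Dict[str, str] = {}
    for name, sig in signatures.items():
        for label, ref in references:
            if jaccard(sig, ref) >= threshold:
                labels[name] = label
                break
        else:
            label = f"model-{len(references) + 1}"
            references.append((label, sig))
            labels[name] = label
    return labels
```

Documents whose text lands in mostly the same grid cells end up with the same label. The grid size and threshold would need tuning against real files, and a coarse grid like this ignores the text content itself, so it only captures layout.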
These PDF files are mostly text anyway, so OCR would only be needed if you want to parse their image content (e.g. the yellow areas of the first document).
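Once a document has been matched to a model, the per-model field positions mentioned in the question could be used to pick out values from the extracted text. The Python sketch below is only an illustration under assumptions: the region coordinates in `FIELD_REGIONS` are made-up placeholders, `TextObject`, `Region` and `extract_fields` are hypothetical names, and positions are assumed to be normalized to fractions of the page size.

```python
from typing import Dict, List, NamedTuple, Optional


class Region(NamedTuple):
    """A rectangle in page fractions (0..1)."""
    left: float
    top: float
    width: float
    height: float


class TextObject(NamedTuple):
    """One extracted text object with its top-left position in page fractions."""
    text: str
    left: float
    top: float


# Placeholder coordinates: real values would be measured from sample documents.
FIELD_REGIONS: Dict[str, Dict[str, Region]] = {
    "A": {"document_number": Region(0.70, 0.02, 0.25, 0.05)},
    "B": {"document_number": Region(0.05, 0.90, 0.25, 0.05)},
}


def extract_fields(model: str, objects: List[TextObject]) -> Dict[str, Optional[str]]:
    """Collect the text whose position falls inside each field region of the model."""
    result: Dict[str, Optional[str]] = {}
    for field, region in FIELD_REGIONS[model].items():
        hits = [
            o for o in objects
            if region.left <= o.left <= region.left + region.width
            and region.top <= o.top <= region.top + region.height
        ]
        hits.sort(key=lambda o: (o.top, o.left))  # reading order: top to bottom, left to right
        result[field] = " ".join(o.text for o in hits) or None
    return result
```

The resulting dictionary maps field names to strings (or `None` when nothing was found in the region), which could then be written to the matching database columns.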
Nigel
Xequte Software
www.imageen.com