Detecting multiple individual documents in a PDF

Question

I need to solve a problem where a scan of multiple documents (contracts, invoices, bank extracts) is stored in a single PDF, and I need to identify how many individual documents the PDF contains and which pages belong to which document.

This scenario presents itself, for example, when a person feeds a bunch of documents into an automatic scanner that then creates a single PDF from these documents. Each document is just an image and might have one or more pages and may have different layouts.

What would be an intelligent AI approach to attacking this problem?

score 0 · Answer 1 · answered Oct 20 '22 at 23:01

This is an interesting question.

Let's make the problem more abstract:

Suppose the PDF has n pages and we treat each page as its own doc. How can we group these docs so that pages from the same original document end up together?

My approach:

This could be treated as an NLP task. First, extract the text from each doc (since the pages are scanned images, this means running OCR). Then compute the similarity between docs; if the similarity between two docs is greater than a threshold you choose, treat them as parts of the same document.
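As a rough illustration of the thresholding idea, here is a minimal sketch in Python. It rests on several assumptions that are not part of the answer: the OCR text of each page is already available as a string, only consecutive pages are compared (pages stay in scan order), and a plain bag-of-words cosine similarity stands in for a stronger model such as BERT embeddings. The function names and the threshold of 0.3 are hypothetical and would need tuning on real scans.

```python
import math
import re
from collections import Counter


def page_vector(text):
    """Bag-of-words term counts for one page of OCR text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity of two term-count vectors (0.0 if either is empty)."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def group_pages(page_texts, threshold=0.3):
    """Group 0-based page indices into documents.

    Assumption: a page belongs to the same document as the previous page
    when their similarity is at least `threshold`; otherwise it starts a
    new document.
    """
    if not page_texts:
        return []
    vectors = [page_vector(text) for text in page_texts]
    documents = [[0]]
    for i in range(1, len(vectors)):
        if cosine_similarity(vectors[i - 1], vectors[i]) >= threshold:
            documents[-1].append(i)
        else:
            documents.append([i])
    return documents


# Made-up OCR output for a four-page scan (illustrative only).
pages = [
    "Invoice 1042 total amount due payment terms invoice date",
    "Invoice 1042 continued line items amount payment total",
    "Rental contract between landlord and tenant for the apartment",
    "The tenant agrees the landlord may inspect the apartment under this contract",
]
print(group_pages(pages))  # [[0, 1], [2, 3]]
```

Comparing only neighbouring pages keeps the result as contiguous page ranges, which matches how a scanner feeds documents. Raw word counts are a weak signal, so in practice the vectors would likely be replaced by TF-IDF weights or sentence embeddings, possibly combined with layout features.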

Useful links:

Compute similarity: https://towardsdatascience.com/calculating-document-similarities-using-bert-and-other-models-b2c1a29c9630

OCR: https://nanonets.com/blog/attention-ocr-for-text-recogntion/
