Batch recognition of image-only PDF files and text-containing PDF files generated by Microsoft engines

Question

Anonymous

Hello, I was reading the last answer provided on this forum to this question. The author of the answer, ¡Firedog, stated that two different Microsoft engines produce two different types of PDF files: one that contains only images (and thus needs OCR before its text becomes searchable) and one that already contains searchable text:

"The Microsoft Print to PDF engine in Edge produces a sort of photographic reproduction of the material printed. Each page is an image, which means that it's not possible to select items (text, images) on a page, nor can the text be searched in a PDF reader. The Edge Save as PDF engine on the other hand produces a 'normal' sort of PDF file, where text can be selected and searched."

My question is whether there is a way to scan a large set of pdf files to distinguish which ones are of the first type, so that one can limit performing OCR processing to only that subset.

If this question is too broad and does not fall fully within the scope of a Microsoft forum, I would appreciate an answer to the narrower question of whether there is a way to scan a large set of PDF files to distinguish which ones have been generated by the Microsoft Print to PDF engine (and equivalent Microsoft engines that produce PDF files of the image-only kind) and which ones by the Edge Save as PDF engine (and equivalent Microsoft engines that produce PDF files with searchable text).

Thank you

Locked Question. This question was migrated from the Microsoft Support Community. You can vote on whether it's helpful, but you can't add comments or replies or follow the question.

2 answers

Answer 1

Hi Derrick,

Thank you for your reply. I am a bit confused by the second part of it.

In the first part you propose a way to distinguish image-only pdf files from pdf files containing text by using Python tools.

In the second part, however, you say that this method cannot distinguish between PDF files created by the Microsoft Print to PDF engine and PDF files created by the Edge Save as PDF engine.

From the answer to the old question that I quoted, it seems that the first engine always generates PDF files that belong to the "image only" set, while the second engine always generates PDF files that belong to the "containing text" set.

Hence if the Python tools that you mention are able to partition completely the two sets, they should also be able to partition these two specific subsets.

What am I missing here?

Best regards,

Marco

Answer 2

Hi

Welcome to Microsoft community.

You asked about how to distinguish between PDF files that contain only images and those that contain searchable text among a large set of PDF files. This is a very interesting problem and one way to solve it is to extract and analyze the content of the PDF files.

Typically, text extraction returns nothing for a PDF file that contains only images, whereas it returns the text for a PDF file that contains text. Therefore, you can distinguish between the two kinds of files by extracting the content of each PDF file and checking whether the result is empty.

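If you're using Python, you can use a library such as PyPDF2 to extract the content of PDF files. In PyPDF2 versions before 3.0, you get each page with the 'getPage' method of a 'PdfFileReader' object and extract its text with the page's 'extractText' method. These names were removed in PyPDF2 3.0; there, and in the successor library pypdf, the equivalents are the 'PdfReader' class, its 'pages' attribute, and the page's 'extract_text' method.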
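The extraction approach described above could be sketched as follows. This is only a minimal sketch, not a tested solution: it assumes the pypdf package (the successor to PyPDF2) is installed, and the function names and the `min_chars` threshold are hypothetical choices made for illustration. A small threshold is used instead of a strict "empty" check because some image-only files may still yield a few stray characters.

```python
from pathlib import Path


def has_text_layer(page_texts, min_chars=20):
    """Return True if the extracted texts hold at least min_chars non-whitespace characters.

    min_chars is an illustrative threshold, not a value from any specification.
    """
    total = sum(len("".join(text.split())) for text in page_texts if text)
    return total >= min_chars


def extract_page_texts(pdf_path):
    """Extract the text of every page with pypdf (assumed installed: pip install pypdf)."""
    from pypdf import PdfReader  # imported lazily so the pure helper above works without it

    reader = PdfReader(str(pdf_path))
    return [page.extract_text() or "" for page in reader.pages]


def find_pdfs_needing_ocr(folder, min_chars=20):
    """Return the PDF files under folder whose extracted text is below the threshold."""
    needs_ocr = []
    for pdf_path in sorted(Path(folder).rglob("*.pdf")):
        try:
            texts = extract_page_texts(pdf_path)
        except ImportError:
            raise  # pypdf is missing: fail loudly instead of skipping every file
        except Exception as exc:  # encrypted or damaged files: report and skip
            print(f"skipped {pdf_path}: {exc}")
            continue
        if not has_text_layer(texts, min_chars):
            needs_ocr.append(pdf_path)
    return needs_ocr
```

Calling `find_pdfs_needing_ocr("C:/scans")` (a hypothetical folder) would then give the subset of files to pass to an OCR tool. Note that this detects the absence of extractable text, not which engine produced the file.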

However, using this method, you cannot distinguish between PDF files generated by the Microsoft Print to PDF engine and the Edge Save as PDF engine. This is because there are no separate identifiers in the format of the PDF files created by these two engines.
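Whether the two engines leave any distinguishing trace can be checked on a few sample files by reading the document metadata. The sketch below is an assumption-laden illustration: it relies on pypdf being installed, the helper names are hypothetical, and it makes no claim about which Producer values (if any) either engine writes. It only groups files by whatever value is found.

```python
from collections import defaultdict


def group_by_producer(records):
    """Group (filename, producer) pairs by producer string; a missing value becomes '(none)'."""
    groups = defaultdict(list)
    for filename, producer in records:
        groups[producer or "(none)"].append(filename)
    return dict(groups)


def read_producer(pdf_path):
    """Return the Producer metadata string of a PDF, or None if the file has none."""
    from pypdf import PdfReader  # lazy import; assumes pip install pypdf

    metadata = PdfReader(str(pdf_path)).metadata
    return metadata.producer if metadata is not None else None
```

Printing `group_by_producer((p.name, read_producer(p)) for p in paths)` for files known to come from each engine would show whether the metadata differs in practice.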

Your question is better suited to Microsoft Learn. There you can click "Ask a question", and experts can provide more professional solutions.

Here is a link to the forum where you can raise specific scenarios and share your idea to help solve the problem.

I won't be able to help you, but I'll leave that question open in case one of our amazing volunteers has ideas for you.

Best regards

Derrick Qian | Microsoft Community Support Specialist
