Extracting Text from PDF documents can be relatively easy especially if it is created by converting a word processing document to it such as Exporting to PDF from Pages or Microsoft Word. These type of PDF documents would have a text layer.

However PDFs created from TIFF, PNG or JPEG images would not contain a text layer and would be made of raster images. Under the older version of Batch TIFF & PDF Converter, extracting Text from such document would yield no results.

A simple test to see if the PDF document contain text is to open it in a PDF Viewer such as Preview or Adobe Acrobat and use the Select All or select a portion of the text. If it can be selected then it contains a text layer.

Beginning from v4.62, we incorporated macOS Vision Framework which allows our app to perform Optical Character Recognition (OCR) to those PDF documents to recognize and extract the text from it. It is the same framework which the iPhone uses to recognize street names from photos. This is definitely a godsend. Because OCR works on all image formats, we extended the supported image formats.

To perform the extraction, for the Image Format, select Extract to Multiple Text Files which would extract the text from each PDF documents or images (TIFF, PNG or JPEG). For PDF, it will first check if there is a text layer and if so, would extract from it. If not, it will begin the OCR process. However for other image formats, text layers are not supported and thus the app would perform OCR on them.

Paragraphs would be supported but currently font attributes are not supported so only pure text are being extracted. As macOS Vision Framework is new, only romanized languages are supported. This includes English and Cyrillic languages.