Computers, bikes and things I’d like to remember.

From PDF to TIFF to ASCII

November 27th, 2009 Posted in General

This is one of those ‘so I remember next time’ blog posts.

Yesterday I was asked to help someone who wanted to turn some PDF files into text using OCR. These particular PDFs were made from TIFF files created by scanning lots of paper. The OCR software that I have to hand is Tesseract, Google’s free and open source OCR engine, and it likes images to be monochrome TIFFs with a three-letter TIF file name extension. So I needed to extract TIFF images from the PDFs at a high enough resolution for OCR to work, convert them from RGB colour to 1-bit TIFF, and feed them to tesseract to extract some text. There must be a nicer way, but here’s how I eventually did it:

To extract the RGB TIFF data from the PDF as monochrome images at a high resolution, I used the ‘convert’ command from the open source ImageMagick suite.

convert -monochrome -units PixelsPerInch -density 300x300 Navy_List-October-1905-1.pdf image%02d.tif

This results in 34 individual TIFF files, one for each page of a 34-page PDF. Then, to turn these into one big TIFF file with a three-letter extension, I used the convert command again:

convert -adjoin image* bigtiff.tif
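Before running OCR it may be worth confirming that the page images really came out as bilevel at the requested resolution. The following is only a sketch: check_pages is a hypothetical helper name, and it assumes ImageMagick’s identify command is installed alongside convert.

```shell
# Hypothetical helper: print pixel size, bit depth and resolution
# for each image file given as an argument.
# Assumes ImageMagick's identify is on PATH.
check_pages() {
    for f in "$@"; do
        # %f = file name, %w/%h = width/height in pixels,
        # %z = bit depth, %x/%y = horizontal/vertical resolution
        identify -format '%f: %wx%h px, depth %z, density %x x %y\n' "$f" || return 1
    done
}
```

It could be called as check_pages image*.tif. The exact formatting of the density values differs between ImageMagick versions, so treat the output as a quick sanity check rather than something to parse.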

Finally, I used tesseract to OCR the resulting image file and extract the text into a file I called bigout.txt (tesseract adds the txt extension automatically).

tesseract bigtiff.tif bigout

The result is awful if the purpose is to read the text, but as the basis for a full text search of the documents, given the quality of the scanning, it’s actually pretty good.
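For next time, the three commands above could be wrapped in one small shell function. This is a rough sketch rather than a tested tool: pdf_to_text is a made-up name, the glob is narrowed to image*.tif so that only the page images are joined, and it assumes convert and tesseract are both on the PATH.

```shell
#!/bin/sh
# Sketch only: the post's three commands wrapped in one function.
# pdf_to_text is a hypothetical name. Run it in an otherwise empty
# directory, because it writes image00.tif, image01.tif, ... and
# bigtiff.tif into the current directory.
pdf_to_text() {
    pdf=$1    # input PDF
    out=$2    # output base name; tesseract appends .txt

    # 1. One monochrome 300 dpi TIFF per page.
    convert -monochrome -units PixelsPerInch -density 300x300 "$pdf" image%02d.tif || return 1
    # 2. Join the page images into one multi-page TIFF.
    convert -adjoin image*.tif bigtiff.tif || return 1
    # 3. OCR the combined TIFF.
    tesseract bigtiff.tif "$out"
}
```

Calling pdf_to_text Navy_List-October-1905-1.pdf bigout should leave the text in bigout.txt, as in the manual steps. The || return 1 after each convert stops the function early if a step fails, so tesseract is not run on a missing or stale bigtiff.tif.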

One Response to “From PDF to TIFF to ASCII”

By Brad Hards on Nov 27, 2009

    You could also have got the files out using pdfimages from poppler.
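A sketch of what that suggestion might look like, assuming a poppler version whose pdfimages supports the -tiff option. extract_scans is a hypothetical helper name and the default prefix is an arbitrary choice.

```shell
# Hypothetical helper: pull the embedded scans straight out of the PDF
# with poppler's pdfimages instead of re-rendering each page.
# Assumes a poppler build whose pdfimages supports -tiff.
extract_scans() {
    pdf=$1
    prefix=${2:-image}    # output file name prefix, defaults to "image"
    # Writes one TIFF per embedded image, named from the prefix
    # plus a running number.
    pdfimages -tiff "$pdf" "$prefix"
}
```

The images come out as they are stored in the PDF, so RGB scans would presumably still need a convert -monochrome pass before being handed to tesseract.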
