How to OCR a PDF and Enable Text Selection and Search

Suppose you have a PDF that was created with a scanner, or that someone sent you, and its content is stored as images. The procedure such a PDF needs is called OCR (optical character recognition): a process that automatically identifies the characters of a given alphabet in an image and stores them as text data that you can work with in a text editor or similar program.

pdfocr is a simple tool that creates a new PDF with an embedded text layer, allowing the user to select text and search for words in it, without changing the final appearance of the PDF.

What pdfocr is NOT for:

This is only useful if the PDF contains the information in image form; if you exported the PDF from OpenOffice, it already has an embedded text layer, so this procedure is unnecessary.
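
If you are not sure whether a PDF already has a text layer, one way to check is to try extracting text from it. The following is a minimal sketch, assuming the pdftotext tool from the poppler-utils package is installed; has_text_layer is a hypothetical helper name, not something pdfocr provides.

```shell
# Hypothetical helper: succeeds if the PDF contains any extractable text.
# Assumes pdftotext (poppler-utils) is available on the PATH.
has_text_layer() {
    # Write the extracted text to stdout ("-") and look for at least one
    # character that is not whitespace (page breaks count as whitespace).
    pdftotext "$1" - 2>/dev/null | grep -q '[^[:space:]]'
}

# Usage:
#   if has_text_layer input.pdf; then
#       echo "already searchable, no OCR needed"
#   else
#       echo "image only, run pdfocr"
#   fi
```

A scanned PDF normally yields nothing but page breaks here, so the check fails and the file is a candidate for pdfocr.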

How to install pdfocr:

sudo add-apt-repository ppa:gezakovacs/pdfocr
sudo apt-get update
sudo apt-get install pdfocr
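
If the PPA does not publish a build for your Ubuntu release, the install step will not find the package. A rough way to check beforehand is sketched here, assuming the usual output format of apt-cache policy; has_candidate is a hypothetical helper name.

```shell
# Hypothetical helper: reads the output of "apt-cache policy <package>"
# on stdin and succeeds only if an installable candidate version is listed.
has_candidate() {
    awk '/Candidate:/ { v = $2 } END { exit !(v != "" && v != "(none)") }'
}

# Usage (LC_ALL=C keeps the output in English so the match works):
#   LC_ALL=C apt-cache policy pdfocr | has_candidate \
#       && echo "pdfocr can be installed" \
#       || echo "no pdfocr build found for this release"
```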

How to use pdfocr:

Open a terminal, go to the directory where the PDF you want to convert is located, and enter the following (replacing input.pdf with the PDF you want to convert and output.pdf with the name of the new file that will contain the embedded text layer):

pdfocr -i input.pdf -o output.pdf

Wait while OCR is run on each page of your PDF and the final modified file is created. This should take a few seconds per page, depending on the resolution of your PDF.
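
If you have a whole folder of scanned PDFs, the same command can be run in a loop. This is only a sketch: ocr_all is a hypothetical helper name and the "-ocr" suffix is an arbitrary naming choice; the only pdfocr options used are the -i and -o shown above.

```shell
# Hypothetical helper: run pdfocr on every PDF in a directory
# (default: the current one), writing NAME-ocr.pdf next to NAME.pdf.
ocr_all() {
    local dir="${1:-.}" pdf out
    for pdf in "$dir"/*.pdf; do
        [ -e "$pdf" ] || continue                  # no PDFs matched the pattern
        case "$pdf" in *-ocr.pdf) continue ;; esac # skip files produced earlier
        out="${pdf%.pdf}-ocr.pdf"
        pdfocr -i "$pdf" -o "$out" || echo "failed: $pdf" >&2
    done
}

# Usage:
#   ocr_all ~/scans
```

The originals are left untouched, so you can compare each result with its source before deleting anything.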

Rodolfo Lara said
11 years ago

rodolfo@rodolfo-desktop:~$ sudo apt-get install pdfocr
Reading package lists... Done
Building dependency tree
Reading state information... Done
E: Unable to locate package pdfocr
rodolfo@rodolfo-desktop:~$

Let's use Linux said
11 years ago

Did you make sure to add the corresponding PPA?
This PPA probably only has versions of pdfocr for older Ubuntu releases. Keep in mind that this post is already several months old. In any case, the idea is the same: go to Launchpad and look for a PPA that provides pdfocr for Maverick.
Cheers! Paul.

jvare said
11 years ago

Well, it's a matter of trying it out to see how it works.

Let's use Linux said
11 years ago

Go ahead! Let us know if it worked! If it doesn't, we can try to help you. Cheers! Paul.

a01653 said
11 years ago

Hello,
I tested the program on a PDF and the result is not very good. I'm used to Acrobat 8 Professional and was looking for something similar. Acrobat runs utilities on the files to clean up and straighten the scanned PDFs, which gives the OCR a better source to work from. Do you know if there is a solution for this?

All the best

Let's use Linux said
11 years ago

Hello! I've heard that Tesseract is the best open-source OCR, though I don't know whether it will be good enough for you. You also have to get your hands a bit dirty to make it work. Here are some instructions. If you are successful, please let me know: if it works, it will probably end up becoming a post.

First, install the packages "tesseract 2.03-4" and "imagemagick" using Synaptic, and "xsane2tess" from "http://download.tuxfamily.org/guadausers/guadaV4/".

Then create the tmp folder at /home/yourusername/tmp

Then open XSane to configure it: go to Preferences -> Configuration -> OCR tab and fill in the following:

OCR command -> xsane2tess -l spa
Input file option -> -i
Output file option -> -o
GUI output-fd option -> -x

In the XSane configuration, on the "Save" tab, where it says temporary directory, make sure it points to the "tmp" folder that you created in "/home/yourusername".

I also leave you a page with details on how to do OCR in Ubuntu: https://help.ubuntu.com/community/OCR

Let's use Linux said
11 years ago

Another method that I found out there is the following.

Assuming the scanner has already been connected and recognized by the system:

1. Open System > Administration > Synaptic Package Manager (in GNOME).

2. Search for tesseract-ocr-spa (to scan in Spanish) and gscan2pdf, and mark them for installation.

3. To scan, open Applications > Graphics > gscan2pdf.

And that's it.

Trovadordebarro said
10 years ago

Hey friend, thank you very much. The truth is that Tesseract is a good tool, but it is very limited when it comes to books with "problematic" scans. This software, on the other hand, adapts more easily... 😀

Juan Anez said
10 years ago

As part of an image digitization process, files are being converted to PDF/A, and these must be OCRed. How sensitive is the result to scanning in black and white versus grayscale? Which is recommended?
