How to OCR a PDF and enable text selection and search

Suppose you have a PDF that was created with a scanner, or one that someone sent you, in which the content is stored as an image. The procedure such a PDF needs is called OCR (optical character recognition): a process that automatically identifies the symbols or characters of a given alphabet in an image and stores them as data that a text editor or similar program can work with.


pdfocr is a simple tool that creates a new PDF with an embedded text layer, allowing the user to select text and search for words in it, without changing the final appearance of the PDF.

What pdfocr is NOT for:

This is only useful if the PDF contains the information in image form; if you exported the PDF from OpenOffice, it already has an embedded text layer, so this procedure is unnecessary.
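
One way to check whether a PDF already has a text layer before running OCR on it is to try extracting its text. The sketch below assumes the pdftotext utility (from the poppler-utils package) is installed; it is not part of pdfocr, and the function name has_text_layer is only illustrative.

```shell
#!/bin/sh
# Sketch: report whether a PDF already contains extractable text.
# Assumption: pdftotext (poppler-utils) is installed; pdfocr does not provide it.

# has_text_layer FILE
# Succeeds (exit status 0) if at least one non-whitespace character can be extracted.
has_text_layer() {
    # "-" makes pdftotext write the extracted text to standard output.
    # tr strips spaces, newlines and the form feeds emitted between pages.
    text=$(pdftotext "$1" - 2>/dev/null | tr -d '[:space:]')
    [ -n "$text" ]
}
```

For example, `if has_text_layer input.pdf; then echo "already searchable"; else echo "needs OCR"; fi` prints which case applies. A scanned PDF normally yields no text at all, although a file with a few stray characters would also pass this simple check.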

How to install pdfocr:

sudo add-apt-repository ppa:gezakovacs/pdfocr
sudo apt-get update
sudo apt-get install pdfocr

How to use pdfocr:

Open a terminal, go to the directory where the PDF you want to convert is located, and enter the following (replacing input.pdf with the PDF you want to convert and output.pdf with the name of the new file that will contain the embedded text layer):

pdfocr -i input.pdf -o output.pdf

Wait while OCR is run on each page of your PDF and the final modified file is created. This should take a few seconds per page, depending on the resolution of your PDF.
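
If you have a whole folder of scanned PDFs, the same command can be wrapped in a loop. This is a minimal sketch that relies only on the -i and -o options shown above; the "-ocr" suffix for output files and the function names ocr_name and ocr_all are illustrative choices, not pdfocr conventions.

```shell
#!/bin/sh
# Sketch: run pdfocr on every PDF in a directory.
# Assumption: output files get an "-ocr" suffix (an arbitrary naming choice).

# ocr_name FILE.pdf -> prints FILE-ocr.pdf
ocr_name() {
    printf '%s-ocr.pdf\n' "${1%.pdf}"
}

# ocr_all DIR: convert each PDF in DIR that has not been converted yet.
ocr_all() {
    for pdf in "$1"/*.pdf; do
        # The glob stays literal when nothing matches; skip that case.
        if [ ! -e "$pdf" ]; then continue; fi
        # Skip files that are themselves the output of an earlier run.
        case "$pdf" in
            *-ocr.pdf) continue ;;
        esac
        out=$(ocr_name "$pdf")
        # Do not redo a file whose output already exists.
        if [ -e "$out" ]; then continue; fi
        pdfocr -i "$pdf" -o "$out" || echo "pdfocr failed on $pdf" >&2
    done
}
```

Calling `ocr_all .` processes the current directory, leaving the originals untouched and writing, for example, scan-ocr.pdf next to scan.pdf.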



  1.   Rudolph Lara said

    rodolfo@rodolfo-desktop:~$ sudo apt-get install pdfocr
    Reading package lists... Done
    Building dependency tree
    Reading state information... Done
    E: Unable to locate package pdfocr
    rodolfo@rodolfo-desktop:~$

  2.   Let's use Linux said

    Did you make sure to add the corresponding PPA?
    This PPA probably only has builds of pdfocr for older Ubuntu releases. Keep in mind that this post is already several months old. In any case, the idea is the same: go to Launchpad and look for a PPA that provides pdfocr for Maverick.
    Cheers! Paul.

  3.   jvare said

    Well, I'll have to try it and see how it works.

  4.   Let's use Linux said

    Go ahead! Let us know if it works! If it doesn't, we can try to help you. Cheers! Paul.

  5.   a01653 said

    Hello,
    I have tested the program on a PDF and the result is not very good. I'm used to Acrobat 8 Professional and was looking for something similar. Acrobat runs utilities on the files to clean and straighten the scanned PDFs and thus obtain a better source for the OCR. Do you know if there is a solution for this?

    All the best

  6.   Let's use Linux said

    Hello! I've heard that Tesseract is the best open-source OCR engine. I don't know whether it will be good enough for you. Also, you have to get your hands a bit dirty to make it work. Here are some instructions. If you are successful, please let me know, because if it works it will probably end up becoming a post.

    First install the packages "tesseract 2.03-4" and "imagemagick" using Synaptic, and "xsane2tess" from "http://download.tuxfamily.org/guadausers/guadaV4/".

    Then create the tmp folder at: /home/yourusername/tmp

    Then open Xsane to configure it: go to Preferences -> Configuration -> OCR tab and fill in the following:

    OCR command -> xsane2tess -l spa
    Input file option -> -i
    Output file option -> -o
    GUI output-fd option -> -x

    In the Xsane configuration, on the "Save" tab, make sure the temporary directory field points to the "tmp" folder you created in "/home/yourusername".

    I also leave you a page with details on how to do OCR in Ubuntu: https://help.ubuntu.com/community/OCR

  7.   Let's use Linux said

    Another method that I came across is the following:

    Assuming the scanner is already connected and recognized by the system:

    1. Open System > Administration > Synaptic Package Manager (in GNOME).

    2. Search for tesseract-ocr-spa (to scan in Spanish) and gscan2pdf, and mark them for installation.

    3. To scan, open Applications > Graphics > gscan2pdf.

    And that's it.

  8.   Troubadour said

    Hey friend, thank you very much. The truth is that tesseract is a good tool, but it is very limited when it comes to books with "problematic" scans. This software, on the other hand, adapts more easily ... 😀

  9.   Juan Anez said

    In an image digitization project, files are being converted to PDF/A and must be OCRed. How much does scanning in black and white versus grayscale affect the result? Which is recommended?