Learn how to correctly recognize text from an image with tesseract and ocrfeeder.

Many of you probably already know optical character recognition (OCR) programs. If so, you have likely come across some that fail to recognize characters typical of the Spanish language, such as the eñe and vowels with an accent or diaeresis (ñ, ó, ü).

Thanks to tesseract and the tesseract-ocr-spa package, we can now recognize these characters. We will also see how to treat images whose color or pixel levels are not correct.

First we must install the following programs:

tesseract-ocr
tesseract-ocr-spa
ocrfeeder

On Debian I advise installing them without the recommended packages:

sudo apt-get --no-install-recommends install ocrfeeder tesseract-ocr-spa tesseract-ocr
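
Before going further, it can help to confirm that the Spanish language data was actually installed. A minimal sketch, assuming the tesseract binary is on the PATH; has_tesseract_lang is a hypothetical helper name, not something tesseract provides:

```shell
#!/bin/sh
# has_tesseract_lang (hypothetical helper): succeeds if the given
# language code appears in the list of installed tesseract languages.
has_tesseract_lang() {
    # --list-langs prints a header line and then one language code per line.
    # Both output streams are merged in case the list goes to stderr.
    tesseract --list-langs 2>&1 | grep -qxF -- "$1"
}

# Usage:
#   has_tesseract_lang spa || echo "tesseract-ocr-spa is missing" >&2
```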

If we have an image (scanned document) in which the lettering is legible, the text will be recognized in approximately 90% of the cases. Tables will not be recognized. If the image has 2 columns, tesseract will automatically recognize one column first and then the other, to maintain the order of the text.

There are 2 ways to recognize the text: from the command line in a terminal, or through ocrfeeder. The latter requires more processing time:

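Command line method. Tesseract appends .txt to the output name itself, so the name is given without an extension. Recent Tesseract releases spell the page segmentation option --psm instead of -psm:

tesseract "/entrada/fichero.jpg" "/salida/fichero" -l spa -psm 3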
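
Because tesseract adds the .txt extension itself, the output name can be derived from the image name. A small sketch of that idea; ocr_image is a hypothetical helper, and the -psm spelling simply follows the command above:

```shell
#!/bin/sh
# ocr_image (hypothetical helper): recognize one image and write
# <output folder>/<image name without extension>.txt
ocr_image() {
    image=$1
    outdir=$2
    name=$(basename "$image")
    # ${name%.*} drops the last extension; tesseract appends .txt on its own.
    tesseract "$image" "$outdir/${name%.*}" -l spa -psm 3
}

# Usage:
#   ocr_image /entrada/fichero.jpg /salida    # writes /salida/fichero.txt
```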

To convert multiple images, we use the following commands:

cd /carpeta/imagenes
find ./ -name "*.jpg" | sort | while read -r file; do tesseract "$file" "$(basename "$file" | sed 's/\.[[:alnum:]]*$//')" -l spa -psm 3; done
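
File names with spaces or unusual characters can trip up a while read loop. One possible alternative, shown only as a sketch, lets find pass the names straight to a small shell snippet. ocr_folder is a hypothetical helper name, and unlike the command above it writes each .txt file next to its image:

```shell
#!/bin/sh
# ocr_folder (hypothetical helper): run tesseract on every .jpg found
# under the given folder, one output text file per image.
ocr_folder() {
    # find hands the file names to the inner shell as arguments,
    # so spaces and other odd characters in names are preserved.
    find "$1" -type f -name '*.jpg' -exec sh -c '
        for image in "$@"; do
            # tesseract appends .txt to the output name itself
            tesseract "$image" "${image%.*}" -l spa -psm 3
        done
    ' sh {} +
}

# Usage:
#   ocr_folder /carpeta/imagenes
```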

To join the resulting text files in that folder, we use the following commands, which also join the lines of each paragraph correctly:

cd /carpeta/imagenes
find ./ -name "*.txt" ! -name "Texto-unido.txt" | sort | while read -r file; do sed 's|^$|##|' "$file" | tr '\n' ' ' | tr '#' '\n' >> Texto-unido.txt; done
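
The ## marker in the command above means that any # in the recognized text is also turned into a line break. As a sketch of an alternative, awk's paragraph mode (an empty RS) can join the lines of each paragraph without a marker. join_paragraphs is a hypothetical helper name:

```shell
#!/bin/sh
# join_paragraphs (hypothetical helper): read text on stdin and print
# each blank-line-separated paragraph on a single line.
join_paragraphs() {
    # RS="" makes awk read one paragraph per record and treat newlines
    # as field separators; $1=$1 rebuilds the record with single spaces.
    # ORS="\n\n" keeps a blank line between the joined paragraphs.
    awk 'BEGIN { RS = ""; ORS = "\n\n" } { $1 = $1; print }'
}

# Usage (the output path is only an example, kept outside the folder
# so it is not picked up as input):
#   cat ./*.txt | join_paragraphs > ../Texto-unido.txt
```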

Method with ocrfeeder:
1- We open the ocrfeeder program.
2- We edit the engine by clicking on Tools - OCR Engines, selecting the Tesseract engine and clicking on Edit. Where it says engine arguments, we replace the existing text with this:

$IMAGE $FILE -l spa -psm 3 > /dev/null 2> /dev/null; cat $FILE.txt; rm $FILE $FILE.txt

3- We import an image or a folder containing several images.
4- We click on identify document. Once the document is identified, you can manually select which parts of it are images and which are text.
5- Before exporting the document we click on Edit - Edit page and select the desired page size; the most common is Letter.
6- To export the document we click on File - Export and select the desired output format. If the document has images I advise using the odt or html format; if it is only text, it is best to use the Plain Text (txt) format.

This does not end here, because many photocopies are of inadequate quality. To repair these we will use GIMP and its Emboss filter (this process can be slow):
1- We open the image with GIMP.
2- We click on Filters - Distorts - Emboss and select the bump map option. We adjust the azimuth to approximately 162.25, the elevation to 88.73 and the depth to 6 or 3. If the image is a jpg, we save it at 100% quality through File - Export, as name.jpg.

Optionally, you can adjust the white levels by clicking on Colors - Levels - Auto.

DesdeLinux
