How to scan documents and apply OCR in Linux

Did you try Simple Scan, the default Ubuntu scanning program, only to be disappointed that it doesn't support OCR? Is XSane too complicated for such a simple task? Do you miss how easy it was to scan documents with OmniPage?

Well, no wonder ... let's see how to scan documents and run OCR on them in a very, very simple way. You will be amazed by the results.

How to scan in 2 simple steps

1.- Install gscan2pdf and tesseract-ocr, along with the Tesseract language pack you need. That is, if you are going to scan documents in English, install tesseract-ocr-eng; if they are in Spanish, install tesseract-ocr-spa, and so on.

sudo apt-get install gscan2pdf tesseract-ocr tesseract-ocr-eng
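
As a small sketch of the naming pattern described above, the helper below builds the package name from a three-letter language code. The function name lang_pkg is hypothetical, and the pattern tesseract-ocr-<code> is assumed from the two examples in this article. The last lines ask Tesseract which languages it can actually see, which is a quick way to check that a language pack was picked up after installing it.

```shell
# Hypothetical helper: build the Debian/Ubuntu package name for a
# Tesseract language pack from its three-letter code (eng, spa, ...).
lang_pkg() {
    printf 'tesseract-ocr-%s\n' "$1"
}

lang_pkg eng   # English pack
lang_pkg spa   # Spanish pack

# If Tesseract is installed, list the languages it has available.
if command -v tesseract >/dev/null 2>&1; then
    tesseract --list-langs || echo "could not list languages"
fi
```

If a language you installed is missing from that list, the OCR engine will not be able to offer it.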

2.- The rest is pretty straightforward for anyone who has ever scanned and OCRed a document in Windows. Open gscan2pdf, scan the document, go to Options > OCR and select Tesseract as the OCR engine. There are other engines, but Tesseract is by far the best-performing one. Finally, save the document as PDF, DjVu, etc. via File > Save.
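
For those who prefer a terminal, the same two stages (scan, then OCR into a PDF) can be sketched with SANE's scanimage and the tesseract command. This is only a rough command-line equivalent under stated assumptions, not a description of what gscan2pdf does internally: the function name scan_and_ocr, the file names and the 300 dpi resolution are all made up for illustration, and the --resolution option depends on your scanner's driver.

```shell
# Sketch: scan one page to TIFF, then OCR it into a searchable PDF.
# Usage: scan_and_ocr [language] [basename]
scan_and_ocr() {
    lang="${1:-eng}"    # Tesseract language code (assumed default: eng)
    base="${2:-scan}"   # output base name (assumed default: scan)

    # Scan from the default SANE device; stop if the scan fails.
    scanimage --format=tiff --resolution 300 > "$base.tiff" || return 1

    # Run OCR; the trailing "pdf" makes Tesseract write $base.pdf.
    tesseract "$base.tiff" "$base" -l "$lang" pdf
}

# Only report readiness here; call scan_and_ocr yourself when a scanner is attached.
if command -v scanimage >/dev/null 2>&1 && command -v tesseract >/dev/null 2>&1; then
    echo "Tools found. Try: scan_and_ocr eng scan"
else
    echo "scanimage and/or tesseract not installed"
fi
```

Calling scan_and_ocr spa invoice would scan a page and, if the Spanish pack is installed, write invoice.pdf next to invoice.tiff.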

Note: when saving scanned documents it is best to use the DjVu format (the quality is the same as a PDF, but the file is considerably smaller).

The following video is in English, but simply watching it is enough to understand how everything works.

bachitux said
11 years ago

Note that the packages are also available in Fedora. 🙂

chapela said
11 years ago

I have two scanners: one is the Canon Scan 5000f, for A4 documents, and the other is the Braun NovoScan, for scanning negatives and slides. After installing gscan2pdf and rebooting, neither scanner shows up. What happened? Why aren't the scanners detected?

Let's use Linux said
11 years ago

No offense friends, but there is no point in OCRing math functions.

If anything, you should OCR the surrounding text (which explains those functions or whatever) and leave the functions themselves as images.
Cheers! Paul.

NotFromBrooklyn said
11 years ago

Hey, if you've come up with a solution to your problem, I'd like to know.

Juan Vallejo said
11 years ago

I think I'm a little late but I have a question. I'm an engineering student and I'm looking for a way to digitize and clean my notes, but the problem is that most of those notes are full of mathematical symbols, graphs, and functions. Is there currently something that can help me?

Let's use Linux said
11 years ago

Great! Good tip! On Arch, Tesseract is in the official repositories, but gscan2pdf is not. You have to install it through yaourt.

elcaliman13142 said
11 years ago

Thank you very much, it helped me a lot. This makes Linux friendlier. Thanks again.

Let's use Linux said
11 years ago

You're welcome! It is a pleasure to have been able to help.
A hug! Paul.

Martin said
11 years ago

Very good, I was looking for this. I'll try it and let you know how it goes.

Mauro Nicolás Ybáñez Girard said
11 years ago

Thanks, I'll try!

Leonardo Hernandez said
10 years ago

When I run OCR with the Tesseract engine, it only gives me the option to process in English, even though I installed the tesseract-ocr-spa package. What can I do?

jaime and isabel said
5 years ago

I downloaded gscan2pdf, but it does not scan; it only searches for devices and is still searching after 15 minutes. What's wrong?
