Scanning And Recognizing Text On The Console

The usual problem is that plenty of tools do the job, but the available front ends don't meet my personal requirements. Consider the current scanning front ends:

XSANE does not use Tesseract or Cuneiform as OCR back ends by default, although these are the best ones available (I prefer Tesseract, since it is fully free and gives better results). Furthermore, batch scanning is uncomfortable because there are too many steps in between.
LIOS is the other option. Its main disadvantages for me are that it is poorly accessible (at least at first glance with Orca) and not very intuitive. Please note that these are my impressions.

Therefore I decided to reinvent the wheel and write another front end, just a small, functional one. The result is Papierzutext, a little program written in Python to scan, recognize and read text. Currently it scans a page, recognizes the text with Tesseract, displays it and saves it to a file. Features include:

scan pages with options such as density
optionally recognize text while the scanner moves its lamp back (otherwise the computer would sit idle during that time)
use ExactImage to clean up the scanned image
save recognized text and images so that text can be reviewed, pages deleted etc. after restarting the program
set OCR language
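
To make the idea of recognizing one page while the scanner is busy with the next a bit more concrete, here is a minimal sketch. It is not taken from the Papierzutext source; all function names are made up for illustration. It assumes the scanimage and tesseract command line tools are installed, and that your scanner back end understands --resolution (which is how I read "density" here; the option depends on the back end).

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor


def scan_command(resolution=300):
    """Build a scanimage call; the image is written to standard output.
    --resolution is an assumption, it depends on the scanner back end."""
    return ["scanimage", "--format=tiff", "--resolution", str(resolution)]


def ocr_command(image_path, output_base, lang="eng"):
    """Build a tesseract call; the text lands in output_base + '.txt'."""
    return ["tesseract", image_path, output_base, "-l", lang]


def scan_page(image_path, resolution=300):
    """Scan one page into image_path (needs a SANE-supported scanner)."""
    with open(image_path, "wb") as image_file:
        subprocess.run(scan_command(resolution), stdout=image_file, check=True)
    return image_path


def recognize_page(image_path, lang="eng"):
    """Run Tesseract on image_path and return the recognized text."""
    output_base = os.path.splitext(image_path)[0]
    subprocess.run(ocr_command(image_path, output_base, lang), check=True)
    with open(output_base + ".txt", encoding="utf-8") as text_file:
        return text_file.read()


def scan_and_recognize(image_paths, scan=scan_page, recognize=recognize_page):
    """Scan pages in order. Each finished scan is handed to a worker
    thread for OCR, so the main thread can already scan the next page."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = [pool.submit(recognize, scan(path)) for path in image_paths]
        return [job.result() for job in pending]
```

Calling scan_and_recognize(["page1.tif", "page2.tif"]) would then return the text of both pages in order. The scan and recognize parameters exist only so that the two steps can be swapped out, for example to try the overlap logic without a scanner attached.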

Please try it out and send me feedback. You can obtain the program by checking out the git repository:

git clone http://www.crustulus.de/git/papierzutext

For installation hints, see the file INSTALL.

And now some criticism. While this program is well suited to scanning books that contain only text, tables and more complicated layouts are difficult. Orientation detection (how the paper lies on the scanner) does not work satisfactorily either. These are limitations of Tesseract, and the program is only useful with Tesseract version 3 or newer.
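
If you are unsure which Tesseract you have, a small check like the following may help. It is only a sketch with made-up function names, and it assumes the tesseract binary is on your PATH and prints a banner of the form "tesseract 3.02.02". Some releases write that banner to standard error instead of standard output, so both are read.

```python
import re
import subprocess


def parse_tesseract_version(banner):
    """Extract (major, minor) from text such as 'tesseract 3.02.02'."""
    match = re.search(r"tesseract\s+v?(\d+)\.(\d+)", banner)
    if match is None:
        raise ValueError("no Tesseract version found in: %r" % banner)
    return int(match.group(1)), int(match.group(2))


def installed_tesseract_version():
    """Ask the tesseract binary for its version."""
    result = subprocess.run(["tesseract", "--version"],
                            stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                            universal_newlines=True, check=True)
    # The banner may appear on either stream, so search both.
    return parse_tesseract_version(result.stdout + result.stderr)


def is_supported(version):
    """Assumption: 'version 3 or newer' means a major version of at least 3."""
    return version[0] >= 3
```

With Tesseract installed, is_supported(installed_tesseract_version()) tells you whether the program has a chance of giving useful results.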

Have fun!
