Linux OCR Software Comparison
Over the last few weeks I spent some time researching available OCR (Optical Character Recognition) tools for Linux.
I wanted to see how recognition rates differ between the tools, so I created some very simple test images. I took the last stanza of Edgar Allan Poe's “The Raven” and rendered it into images using different fonts. To make it a tiny bit more complicated, I also created a grayscale version of each image with lower contrast.
This is the original text:
And the raven, never flitting, still is sitting, still is sitting
On the pallid bust of Pallas just above my chamber door;
And his eyes have all the seeming of a demon's that is dreaming,
And the lamp-light o'er him streaming throws his shadow on the floor;
And my soul from out that shadow that lies floating on the floor
Shall be lifted - nevermore!
And this is what the resulting images looked like:
They all have 300 dpi, the text isn't distorted or arranged in multiple columns, the language is English in pure 7-bit ASCII and there is no image noise at all. Okay, the “Justy” font isn't your everyday printed font, but it resembles really clean handwriting. Overall this is a really basic task for OCR. Or so I thought.
Let's have a look at the results first:
| | abbyyocr | cuneiform | gocr | ocrad | tesseract |
|---|---|---|---|---|---|
| License | Proprietary | BSD | GPL2 | GPL3 | Apache 2.0 |
| Version | 8.0 | 0.9.0 | 0.48 | 0.19 | SVN r402 |
| Input format | PNG1) | PNM | PNM | PNM | TIF2) |
| Recognition rates and time spent: | | | | | |
| courier/black | 100% (2.92s) | 61% (1.11s) | 67% (0.09s) | 21% (0.02s) | 81% (0.63s) |
| courier/gray | 100% (2.85s) | | 67% (0.09s) | 21% (0.03s) | 81% (0.63s) |
| justy/black | 11% (3.62s) | 3% (1.14s) | 31% (0.11s) | 1% (0.02s) | 15% (0.61s) |
| justy/gray | 14% (3.45s) | | 31% (0.10s) | 1% (0.02s) | 15% (0.60s) |
| times/black | 100% (2.80s) | 96% (1.07s) | 76% (0.16s) | 82% (0.03s) | 92% (0.74s) |
| times/gray | 100% (2.87s) | | 76% (0.16s) | 82% (0.03s) | 92% (0.74s) |
| verdana/black | 100% (2.90s) | 95% (1.07s) | 98% (0.10s) | 98% (0.03s) | 98% (0.45s) |
| verdana/gray | 100% (2.85s) | | 98% (0.10s) | 98% (0.02s) | 98% (0.46s) |
Recognition scores were calculated from dwdiff's statistics output, comparing the original text with the OCR output.
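To make the scoring idea concrete, here is a minimal Python sketch of a word-level recognition rate. It is not dwdiff's algorithm and will not necessarily reproduce the percentages in the table; it only assumes that the score is the share of the original's words that also turn up, in order, in the OCR output. The function name `recognition_rate` and the sample strings are made up for illustration.

```python
from difflib import SequenceMatcher


def recognition_rate(original: str, ocr_output: str) -> float:
    """Return the percentage of the original's words found, in order, in the OCR output."""
    expected = original.split()
    actual = ocr_output.split()
    if not expected:
        # Nothing to recognize: only an empty OCR output counts as a perfect match.
        return 100.0 if not actual else 0.0
    # Compare the two word sequences; autojunk is disabled so that frequent
    # words such as "the" are never treated as noise.
    matcher = SequenceMatcher(None, expected, actual, autojunk=False)
    common = sum(block.size for block in matcher.get_matching_blocks())
    return 100.0 * common / len(expected)


# Hypothetical OCR output with one misread word ("nevcr").
original = "And the raven, never flitting, still is sitting"
ocr_output = "And the raven, nevcr flitting, still is sitting"
print(f"{recognition_rate(original, ocr_output):.1f}%")
```

Because the comparison is word-based, a single misread character costs the whole word, which is one reason why the handwriting-like font scores so badly.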
As you can see, the commercial Abbyy software has absolutely no problems with the printed fonts, but fails at the handwriting. It is the slowest of all tested tools, but keep in mind that it also reads nearly any image format, while you probably need to convert your images for the other tools first.
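The post does not say how the run times were measured, so the following is only a sketch of one way to do it: a small Python wrapper that runs a command and reports the wall-clock time. The helper name `time_command` is hypothetical, and the command used here is a stand-in so the example runs without any OCR tool installed.

```python
import subprocess
import sys
import time


def time_command(command: list[str]) -> tuple[float, str]:
    """Run a command and return (elapsed wall-clock seconds, captured stdout)."""
    start = time.perf_counter()
    # check=True raises CalledProcessError if the tool exits with an error.
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    elapsed = time.perf_counter() - start
    return elapsed, result.stdout


# Stand-in command; replace it with the OCR invocation you want to time.
elapsed, output = time_command([sys.executable, "-c", "print('nevermore')"])
print(f"{elapsed:.2f}s: {output.strip()}")
```

Wall-clock time includes process startup and file loading, so for the fastest tools in the table a single run is mostly overhead; averaging several runs would give a steadier number.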
If you prefer free OCR software, then tesseract is indeed as good as its reputation. Note that I used the most recent version here, built from SVN. Tesseract started out as a proprietary engine developed at Hewlett-Packard between the mid-eighties and the mid-nineties; HP open sourced it in 2005, and Google has sponsored its development since 2006. It is pretty picky about the input image's format, but once you get that right the results are decent enough.
The handwriting recognition worked best in gocr, which delivered only mediocre results for the other images. Of course, even its best result is still far from the original poetry.
I was surprised how far from perfect the results for these really simple images were. I initially intended to try some much more complicated images, but the results would have been unrecognizable then.