Simple Tool to Perform Custom OCR

Posted 6 months, 4 weeks ago | Originally written on 19 Dec 2015

I'm going to use OpenCV 3.0.0 (https://github.com/Itseez/opencv/archive/3.0.0.zip) though I haven't decided whether I'll use C++ or Python. I might start with Python to build the prototype then move on to C++ for the actual production code. The library documentation can be found at http://docs.opencv.org/3.0-beta/modules/text/doc/ocr.html.

In order to get OCR to work, I will need to install the Tesseract OCR library (which OpenCV wraps as OCRTesseract), which in turn depends on Leptonica (available from http://leptonica.com/source/leptonica-1.72.tar.gz). So I've downloaded and built Leptonica 1.72 from source, which built surprisingly easily on my Mac. Next, I need to build Tesseract.

The Tesseract 3.04.00 release can be found at https://github.com/tesseract-ocr/tesseract/archive/3.04.00.tar.gz.

My first attempt to generate the configure script using ./autogen.sh failed because aclocal was not found:

paulkorir@Pauls-MacBook-Pro:tesseract-3.04.00$ ./autogen.sh
Running aclocal
./autogen.sh: line 60: aclocal: command not found
Something went wrong, bailing out!

To resolve that, I had to install autoconf and automake. Automake's own configure script refuses to run without Autoconf, so Autoconf has to go in first:

paulkorir@Pauls-MacBook-Pro:automake-1.15$ ./configure
checking whether make supports nested variables... yes
checking build system type... x86_64-apple-darwin15.0.0
checking host system type... x86_64-apple-darwin15.0.0
checking for a BSD-compatible install... /usr/bin/install -c
checking whether build environment is sane... yes
checking for a thread-safe mkdir -p... lib/install-sh -c -d
checking for gawk... no
checking for mawk... no
checking for nawk... no
checking for awk... awk
checking whether make sets $(MAKE)... yes
checking whether ln -s works... yes
checking for perl... /usr/bin/perl
checking for tex... tex
checking for yacc... yacc
checking for lex... lex
checking whether autoconf is installed... no
configure: error: Autoconf 2.65 or better is required.
Please make sure it is installed and in your PATH.

Libtool was needed as well, because autogen.sh next failed looking for libtoolize:

paulkorir@Pauls-MacBook-Pro:tesseract-3.04.00$ ./autogen.sh
Running aclocal
Running libtoolize
./autogen.sh: line 65: libtoolize: command not found
./autogen.sh: line 65: glibtoolize: command not found
Something went wrong, bailing out!
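
Since each missing tool only shows up one failed run at a time, it can save a few rounds to check for all of them up front. Here is a minimal Python sketch of that check. The tool list is taken from the errors above, and the function name missing_tools is my own invention for illustration, not part of any Tesseract tooling.

```python
import shutil

# Tools that the transcripts above reported as missing. autogen.sh tried
# both libtoolize and glibtoolize, so either name satisfies that entry.
REQUIRED_TOOLS = [
    ("aclocal",),
    ("autoconf",),
    ("automake",),
    ("libtoolize", "glibtoolize"),
]


def missing_tools(required=REQUIRED_TOOLS, which=shutil.which):
    """Return the entries for which no alternative was found on PATH."""
    missing = []
    for alternatives in required:
        if not any(which(name) for name in alternatives):
            missing.append("/".join(alternatives))
    return missing


if __name__ == "__main__":
    absent = missing_tools()
    if absent:
        print("Install before running ./autogen.sh:", ", ".join(absent))
    else:
        print("All build tools found.")
```

The which parameter exists only so the lookup can be swapped out when trying the function without touching the real PATH.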

So finally, I have the core build tools installed, but now ./configure fails:

...
checking for mbstate_t... yes
checking for leptonica... yes
checking for pixCreate in -llept... no
configure: error: leptonica library missing

which is strange given that it says it found leptonica! So what's going on here? The problem is that ./configure is blind to where the Leptonica include and lib directories are. In fact, running env shows that DYLD_LIBRARY_PATH is not even set. Therefore, the solution (as proposed on https://github.com/tesseract-ocr/tesseract) is to run ./configure with CPPFLAGS and LDFLAGS set to point to /usr/local/include and /usr/local/lib, respectively.
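
For the record, here is a rough Python sketch of that fix, assuming Leptonica was installed under /usr/local. The helper names configure_env and run_configure are hypothetical; the only real ingredients are the CPPFLAGS and LDFLAGS variables that configure reads.

```python
import os
import subprocess


def configure_env(prefix="/usr/local", base_env=None):
    """Return a copy of the environment with CPPFLAGS/LDFLAGS aimed at prefix."""
    env = dict(os.environ if base_env is None else base_env)
    include_flag = "-I" + prefix + "/include"
    lib_flag = "-L" + prefix + "/lib"
    # Append to any flags that are already set instead of overwriting them.
    env["CPPFLAGS"] = (env.get("CPPFLAGS", "") + " " + include_flag).strip()
    env["LDFLAGS"] = (env.get("LDFLAGS", "") + " " + lib_flag).strip()
    return env


def run_configure(source_dir, prefix="/usr/local"):
    """Run ./configure inside source_dir using the adjusted environment."""
    return subprocess.run(
        ["./configure"], cwd=source_dir, env=configure_env(prefix), check=True
    )
```

Calling run_configure("tesseract-3.04.00") should amount to typing CPPFLAGS="-I/usr/local/include" LDFLAGS="-L/usr/local/lib" ./configure in the source directory.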

Finally, configure runs to completion:

...
Configuration is done.
You can now build and install tesseract by running:

$ make
$ sudo make install

You can not build training tools because of missing dependency.
Check configure output for details.

Now to make and make install.

Done!

The first step is to do a practice-run using a test image. So I've taken a photo of some text using my iPhone.

I'll do a naive run using the library's command line interface:

paulkorir@Pauls-MacBook-Pro:Downloads$ tesseract --help
Usage:
tesseract imagename|stdin outputbase|stdout [options...] [configfile...]
paulkorir@Pauls-MacBook-Pro:Downloads$ tesseract FullSizeRender\ 4.jpg ocr_output
Tesseract Open Source OCR Engine v3.04.00 with Leptonica
Error opening data file /usr/local/share/tessdata/eng.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to the parent directory of your "tessdata" directory.
Failed loading language 'eng'
Tesseract couldn't load any languages!
Could not initialize tesseract.

So there's some work to be done. The instructions provided with the source say that there should be a language data archive available (with the language code, e.g. eng, in the archive name). I can't seem to find it.

The problem is that the GitHub repository does not link to the old data repository on Google Code (?!), which is at https://code.google.com/p/tesseract-ocr/downloads/list. With the English data downloaded and unpacked, I copied it into place:

paulkorir@Pauls-MacBook-Pro:tessdata$ sudo cp -v * /usr/local/share/tessdata/
Password:
eng.cube.bigrams -> /usr/local/share/tessdata/eng.cube.bigrams
eng.cube.fold -> /usr/local/share/tessdata/eng.cube.fold
eng.cube.lm -> /usr/local/share/tessdata/eng.cube.lm
eng.cube.nn -> /usr/local/share/tessdata/eng.cube.nn
eng.cube.params -> /usr/local/share/tessdata/eng.cube.params
eng.cube.size -> /usr/local/share/tessdata/eng.cube.size
eng.cube.word-freq -> /usr/local/share/tessdata/eng.cube.word-freq
eng.tesseract_cube.nn -> /usr/local/share/tessdata/eng.tesseract_cube.nn
eng.traineddata -> /usr/local/share/tessdata/eng.traineddata
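
A quick way to confirm the data file landed where Tesseract looked for it is to test for the path from the error message. This is only a sketch: the default directory is the one reported in the error above, and the function names are made up for illustration.

```python
import os


def traineddata_path(lang="eng", tessdata_dir="/usr/local/share/tessdata"):
    """Build the path Tesseract reported for a language, e.g. eng.traineddata."""
    return os.path.join(tessdata_dir, lang + ".traineddata")


def has_language(lang="eng", tessdata_dir="/usr/local/share/tessdata"):
    """Return True if the traineddata file for lang exists in tessdata_dir."""
    return os.path.isfile(traineddata_path(lang, tessdata_dir))


if __name__ == "__main__":
    print(traineddata_path(), "present:", has_language())
```

If the data lives somewhere else, the error message suggests setting TESSDATA_PREFIX to the parent directory of the tessdata directory rather than copying files around.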

Now I run it again:

paulkorir@Pauls-MacBook-Pro:Downloads$ tesseract FullSizeRender\ 4.jpg ocr_output
Tesseract Open Source OCR Engine v3.04.00 with Leptonica
Error in pixReadMemJpeg: function not present
Error in pixReadMem: jpeg: no pix returned
Error during processing. 

to get a new error.

At first it appeared that this was because I did not run sudo ldconfig after installing Tesseract. But that is a Linux command, and apparently there is no Mac equivalent because one is not needed (https://discussions.apple.com/thread/3844649?tstart=0). The real cause turned out to be that Macs don't have the libjpeg or libpng development libraries installed by default, so Leptonica had been built without JPEG support. So I downloaded these (just Google them), then rebuilt Leptonica and Tesseract in succession (again setting CPPFLAGS and LDFLAGS manually).
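
In hindsight, a check like the following before building Leptonica would have flagged the problem earlier. It is a sketch using ctypes.util.find_library from the Python standard library, which only looks for the shared libraries and says nothing about whether the headers are installed. The function name missing_image_libs is hypothetical.

```python
from ctypes.util import find_library


def missing_image_libs(names=("jpeg", "png"), finder=find_library):
    """Return the library names that the library search could not locate."""
    return [name for name in names if finder(name) is None]


if __name__ == "__main__":
    absent = missing_image_libs()
    if absent:
        print("Not found, install before building Leptonica:", ", ".join(absent))
    else:
        print("libjpeg and libpng were both found.")
```

The finder parameter is there so the lookup can be replaced with a stub when experimenting.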

Now we have the following:

paulkorir@Pauls-MacBook-Pro:Downloads$ tesseract FullSizeRender\ 4.jpg ocr_output
Tesseract Open Source OCR Engine v3.04.00 with Leptonica
Warning in pixReadMemJpeg: work-around: writing to a temp file
paulkorir@Pauls-MacBook-Pro:Downloads$ cat ocr_output.txt
Telephone: 01223 364433

Fax: 01223 315728

Prescriptions: 01223 321673

Management: 01223 321677

Website: www.arburyroadsurgery.nhs.uk

Out of Hours: 01223 446995 (6.00pm-6.30pm Mon-Fri)

1 1 1 (all other times)

PARTNERS (this is not a limited partnership)

Dr. Richard Gant (GMC: 1619562) MB, ChB (1973)

Dr. Andrew Watson (GMC: 2842501) MB, BS (1983), MRCGP, DGM, DRCOG

Dr. Friederike Fisher (GMC: 3143502) Dr. Med (1986), MRCGP, DLO, DCH, DFSRi
Dr. Oria McGuinness (GMC: 8038799) MB, BS (2001)

Dr. Jaana Karttunen (GMC: 8111932) MB, BChir (2004), PhD, BA
Dr. Joanna Shneerson (GMC:6145487) MB, ChB (2008), 880, DRCOG, MRCP.

It worked!
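
Since I may prototype in Python, here is a minimal sketch of driving the same command line from a script. It assumes the tesseract binary is on PATH and relies on the behaviour seen above, where the text ends up in the output base name plus .txt. The function names are my own.

```python
import subprocess


def tesseract_command(image_path, output_base):
    """Mirror the usage line: tesseract imagename outputbase."""
    # Passing a list means the space in the file name needs no escaping.
    return ["tesseract", image_path, output_base]


def run_ocr(image_path, output_base):
    """Run tesseract on image_path and return the recognised text."""
    subprocess.run(tesseract_command(image_path, output_base), check=True)
    with open(output_base + ".txt", encoding="utf-8") as handle:
        return handle.read()
```

For the photo above this would be called as run_ocr("FullSizeRender 4.jpg", "ocr_output").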

Now on to greater things!