I'm going to use OpenCV 3.0.0 (https://github.com/Itseez/opencv/archive/3.0.0.zip) though I haven't decided whether I'll use C++ or Python. I might start with Python to build the prototype then move on to C++ for the actual production code. The library documentation can be found at http://docs.opencv.org/3.0-beta/modules/text/doc/ocr.html.
In order to get OCR to work, I need to install the OCRTesseract library, which depends on Leptonica (available from http://leptonica.com/source/leptonica-1.72.tar.gz). So I've downloaded and built Leptonica 1.72 from source, which built surprisingly easily on my Mac. Next, I need to build OCRTesseract.
The OCRTesseract 3.04.00 release can be found at https://github.com/tesseract-ocr/tesseract/archive/3.04.00.tar.gz.
My first attempt to build the configure script using ./autogen.sh failed, saying that aclocal was not found.
paulkorir@Pauls-MacBook-Pro:tesseract-3.04.00$ ./autogen.sh
./autogen.sh: line 60: aclocal: command not found
Something went wrong, bailing out!
To resolve that I've had to install
paulkorir@Pauls-MacBook-Pro:automake-1.15$ ./configure
checking whether make supports nested variables... yes
checking build system type... x86_64-apple-darwin15.0.0
checking host system type... x86_64-apple-darwin15.0.0
checking for a BSD-compatible install... /usr/bin/install -c
checking whether build environment is sane... yes
checking for a thread-safe mkdir -p... lib/install-sh -c -d
checking for gawk... no
checking for mawk... no
checking for nawk... no
checking for awk... awk
checking whether make sets $(MAKE)... yes
checking whether ln -s works... yes
checking for perl... /usr/bin/perl
checking for tex... tex
checking for yacc... yacc
checking for lex... lex
checking whether autoconf is installed... no
configure: error: Autoconf 2.65 or better is required.
Please make sure it is installed and in your PATH.

paulkorir@Pauls-MacBook-Pro:tesseract-3.04.00$ ./autogen.sh
Running aclocal
Running libtoolize
./autogen.sh: line 65: libtoolize: command not found
./autogen.sh: line 65: glibtoolize: command not found
Something went wrong, bailing out!
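The failures above all come down to a tool missing from PATH, discovered one run at a time. A minimal Python sketch that checks for all of them up front might look like this. The tool list is taken from the error messages above; the function name missing_build_tools is my own and purely illustrative.

```python
import shutil

# Tools named in the error messages above. autogen.sh tries libtoolize and
# then glibtoolize, so either name satisfies that requirement.
REQUIRED_TOOLS = [
    ("aclocal",),
    ("autoconf",),
    ("automake",),
    ("libtoolize", "glibtoolize"),
]


def missing_build_tools(required=REQUIRED_TOOLS, which=shutil.which):
    """Return the primary name of every tool group with no executable on PATH."""
    missing = []
    for names in required:
        # A group is satisfied if any of its alternative names is found.
        if not any(which(name) for name in names):
            missing.append(names[0])
    return missing


if __name__ == "__main__":
    missing = missing_build_tools()
    if missing:
        print("Missing build tools: " + ", ".join(missing))
    else:
        print("All build tools found")
```

The which parameter is only there so the lookup can be swapped out when trying the function without touching the real PATH.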
So finally, I have the core build tools installed but
...
checking for mbstate_t... yes
checking for leptonica... yes
checking for pixCreate in -llept... no
configure: error: leptonica library missing
which is strange given that it says that it finds leptonica! So what's going on here? The problem is that ./configure is blind to where the lib flags. In fact, running env does not display DYLD_LIBRARY_PATH. Therefore, the solution (as proposed on https://github.com/tesseract-ocr/tesseract) is to run ./configure preceded by LDFLAGS set to point to
...
Configuration is done.
You can now build and install tesseract by running:
$ make
$ sudo make install
You can not build training tools because of missing dependency.
Check configure output for details.
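For the record, here is a rough Python sketch of what "running ./configure with LDFLAGS set" amounts to when scripted. The prefix argument is an assumption: it stands for wherever Leptonica was installed (something like /usr/local on many setups, but check your own). The helper names configure_env and run_configure are hypothetical.

```python
import os
import subprocess


def configure_env(prefix, base_env=None):
    """Return a copy of the environment with LDFLAGS pointing at prefix/lib.

    prefix is an assumed install location for Leptonica, not a fixed path.
    """
    env = dict(os.environ if base_env is None else base_env)
    flag = "-L" + prefix.rstrip("/") + "/lib"
    # Prepend the new flag so any LDFLAGS already set are preserved.
    env["LDFLAGS"] = (flag + " " + env.get("LDFLAGS", "")).strip()
    return env


def run_configure(source_dir, prefix):
    """Run ./configure inside source_dir with the adjusted environment."""
    return subprocess.run(
        ["./configure"],
        cwd=source_dir,
        env=configure_env(prefix),
        check=True,  # raise if configure exits with an error
    )
```

Something like run_configure("tesseract-3.04.00", "/usr/local") would then be the scripted equivalent, with both arguments adjusted to the local layout.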
The first step is to do a practice run using a test image. So I've taken a photo of some text using my iPhone.
I'll do a naive run using the library's command line interface:
paulkorir@Pauls-MacBook-Pro:Downloads$ tesseract --help
Usage: tesseract imagename|stdin outputbase|stdout [options...] [configfile...]

paulkorir@Pauls-MacBook-Pro:Downloads$ tesseract FullSizeRender\ 4.jpg ocr_output
Tesseract Open Source OCR Engine v3.04.00 with Leptonica
Error opening data file /usr/local/share/tessdata/eng.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to the parent directory of your "tessdata" directory.
Failed loading language 'eng'
Tesseract couldn't load any languages!
Could not initialize tesseract.
So there's some work to be done. The instructions provided with the source say that there should be a language file (with eng in the archive name) available. I can't seem to find it.
The problem is that the GitHub repository does not link to the old data repository on Google Code (?!), which is at https://code.google.com/p/tesseract-ocr/downloads/list.
paulkorir@Pauls-MacBook-Pro:tessdata$ sudo cp -v * /usr/local/share/tessdata/
Password:
eng.cube.bigrams -> /usr/local/share/tessdata/eng.cube.bigrams
eng.cube.fold -> /usr/local/share/tessdata/eng.cube.fold
eng.cube.lm -> /usr/local/share/tessdata/eng.cube.lm
eng.cube.nn -> /usr/local/share/tessdata/eng.cube.nn
eng.cube.params -> /usr/local/share/tessdata/eng.cube.params
eng.cube.size -> /usr/local/share/tessdata/eng.cube.size
eng.cube.word-freq -> /usr/local/share/tessdata/eng.cube.word-freq
eng.tesseract_cube.nn -> /usr/local/share/tessdata/eng.tesseract_cube.nn
eng.traineddata -> /usr/local/share/tessdata/eng.traineddata
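Before running tesseract again, it is easy to check whether the language file is where the error message said it should be. The sketch below follows that message: TESSDATA_PREFIX is treated as the parent of the tessdata directory, and /usr/local/share is used as the fallback because that is the path the error reported on my machine. The function names are made up for illustration.

```python
import os

# Parent of the "tessdata" directory, as reported in the error message above.
# This default is specific to this install and may differ elsewhere.
DEFAULT_PREFIX = "/usr/local/share"


def traineddata_path(lang="eng", env=None):
    """Build the expected path of <lang>.traineddata.

    TESSDATA_PREFIX, if set, is taken to be the parent of "tessdata".
    """
    env = os.environ if env is None else env
    prefix = env.get("TESSDATA_PREFIX", DEFAULT_PREFIX)
    return os.path.join(prefix, "tessdata", lang + ".traineddata")


def has_language(lang="eng", env=None):
    """Return True if the traineddata file for lang exists on disk."""
    return os.path.isfile(traineddata_path(lang, env))


if __name__ == "__main__":
    print(traineddata_path("eng"), "found" if has_language("eng") else "missing")
```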
Now I run it again:
paulkorir@Pauls-MacBook-Pro:Downloads$ tesseract FullSizeRender\ 4.jpg ocr_output
Tesseract Open Source OCR Engine v3.04.00 with Leptonica
Error in pixReadMemJpeg: function not present
Error in pixReadMem: jpeg: no pix returned
Error during processing.
to get a new error.
It appears this is because I did not run sudo ldconfig after installing OCRTesseract. This is a Linux command. Apparently, there is no Mac equivalent because one is not needed (https://discussions.apple.com/thread/3844649?tstart=0). It turns out that Macs don't have libpng development libraries installed by default. So I downloaded these (just Google them) then rebuilt (in succession) Leptonica and OCRTesseract (with
Now we have the following:
paulkorir@Pauls-MacBook-Pro:Downloads$ tesseract FullSizeRender\ 4.jpg ocr_output
Tesseract Open Source OCR Engine v3.04.00 with Leptonica
Warning in pixReadMemJpeg: work-around: writing to a temp file

paulkorir@Pauls-MacBook-Pro:Downloads$ cat ocr_output.txt
Telephone: 01223 364433
Fax: 01223 315728
Prescriptions: 01223 321673
Management: 01223 321677
Website: www.arburyroadsurgery.nhs.uk
Out of Hours: 01223 446995 (6.00pm-6.30pm Mon-Fri)
1 1 1 (all other times)
PARTNERS (this is not a limited partnership)
Dr. Richard Gant (GMC: 1619562) MB, ChB (1973)
Dr. Andrew Watson (GMC: 2842501) MB, BS (1983), MRCGP, DGM, DRCOG
Dr. Friederike Fisher (GMC: 3143502) Dr. Med (1986), MRCGP, DLO, DCH, DFSRi
Dr. Oria McGuinness (GMC: 8038799) MB, BS (2001)
Dr. Jaana Karttunen (GMC: 8111932) MB, BChir (2004), PhD, BA
Dr. Joanna Shneerson (GMC:6145487) MB, ChB (2008), 880, DRCOG, MRCP.
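Since I may prototype in Python, the same command-line run can be wrapped in a few lines. This is only a sketch that shells out to the tesseract binary exactly as above (image name, then output base, with the text landing in the output base plus .txt); it does not use the OpenCV text module. The helper names are my own.

```python
import subprocess


def tesseract_command(image_path, output_base):
    """Build the argument list for: tesseract imagename outputbase."""
    # Passing a list (no shell) means spaces in the file name need no escaping.
    return ["tesseract", image_path, output_base]


def run_ocr(image_path, output_base):
    """Run tesseract and return the recognised text from <output_base>.txt."""
    subprocess.run(tesseract_command(image_path, output_base), check=True)
    with open(output_base + ".txt", encoding="utf-8") as handle:
        return handle.read()


if __name__ == "__main__":
    # Same image and output base as in the session above.
    print(run_ocr("FullSizeRender 4.jpg", "ocr_output"))
```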
Now on to greater things!