I'm going to use OpenCV 3.0.0 (https://github.com/Itseez/opencv/archive/3.0.0.zip) though I haven't decided whether I'll use C++ or Python. I might start with Python to build the prototype then move on to C++ for the actual production code. The library documentation can be found at http://docs.opencv.org/3.0-beta/modules/text/doc/ocr.html.
In order to get OCR to work, I will need to install the Tesseract OCR library (wrapped in OpenCV as OCRTesseract). It depends on Leptonica (available from http://leptonica.com/source/leptonica-1.72.tar.gz). So I've downloaded and built Leptonica 1.72 from source, which built surprisingly easily on my Mac. Next, I need to build Tesseract.
The Tesseract 3.04.00 release can be found at https://github.com/tesseract-ocr/tesseract/archive/3.04.00.tar.gz.
My first attempt to generate the configure script using ./autogen.sh failed, saying that aclocal was not found.
paulkorir@Pauls-MacBook-Pro:tesseract-3.04.00$ ./autogen.sh
Running aclocal
./autogen.sh: line 60: aclocal: command not found
Something went wrong, bailing out!
To resolve that, I had to install autoconf, automake and libtool. Building automake first showed that it needs autoconf:
paulkorir@Pauls-MacBook-Pro:automake-1.15$ ./configure
checking whether make supports nested variables... yes
checking build system type... x86_64-apple-darwin15.0.0
checking host system type... x86_64-apple-darwin15.0.0
checking for a BSD-compatible install... /usr/bin/install -c
checking whether build environment is sane... yes
checking for a thread-safe mkdir -p... lib/install-sh -c -d
checking for gawk... no
checking for mawk... no
checking for nawk... no
checking for awk... awk
checking whether make sets $(MAKE)... yes
checking whether ln -s works... yes
checking for perl... /usr/bin/perl
checking for tex... tex
checking for yacc... yacc
checking for lex... lex
checking whether autoconf is installed... no
configure: error: Autoconf 2.65 or better is required.
Please make sure it is installed and in your PATH.
Without libtool, ./autogen.sh gets one step further before bailing out:
paulkorir@Pauls-MacBook-Pro:tesseract-3.04.00$ ./autogen.sh
Running aclocal
Running libtoolize
./autogen.sh: line 65: libtoolize: command not found
./autogen.sh: line 65: glibtoolize: command not found
Something went wrong, bailing out!
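Each of these failures only shows up one at a time, so it can be handy to check for all the tools up front. Here is a minimal Python sketch of such a check. The helper name missing_build_tools and the REQUIRED_TOOLS table are made up for illustration; the tool names are simply the ones that came up in the errors above (autogen.sh tries both libtoolize and glibtoolize).

```python
import shutil

# Tools that ./autogen.sh needed in this post. Each entry lists the command
# names that are acceptable; autogen.sh tries libtoolize and then glibtoolize.
REQUIRED_TOOLS = {
    "aclocal": ["aclocal"],
    "autoconf": ["autoconf"],
    "automake": ["automake"],
    "libtoolize": ["libtoolize", "glibtoolize"],
}


def missing_build_tools(which=shutil.which):
    """Return the names of required tools that cannot be found on the PATH."""
    missing = []
    for name, candidates in REQUIRED_TOOLS.items():
        # A tool counts as present if any of its command names resolves.
        if not any(which(cmd) for cmd in candidates):
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = missing_build_tools()
    if missing:
        print("Missing build tools: " + ", ".join(missing))
    else:
        print("All build tools found")
```

The which parameter is only there so the lookup can be swapped out when trying the function without touching the real PATH.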
So finally, I have the core build tools installed, but ./configure now fails:
...
checking for mbstate_t... yes
checking for leptonica... yes
checking for pixCreate in -llept... no
configure: error: leptonica library missing
This is strange, given that it says it finds leptonica! So what's going on here? The problem is that ./configure is blind to where the include and lib directories are. In fact, running env does not display DYLD_LIBRARY_PATH. Therefore, the solution (as proposed on https://github.com/tesseract-ocr/tesseract) is to run ./configure preceded by CPPFLAGS and LDFLAGS set to point to /usr/local/include and /usr/local/lib, respectively.
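As a rough illustration of what that means in practice, here is a small Python sketch that prepares such an environment. It assumes the usual compiler conventions of -I for include directories and -L for library directories; the helper name configure_env and the default prefix are just for this example.

```python
import os


def configure_env(prefix="/usr/local", base_env=None):
    """Return a copy of the environment with CPPFLAGS and LDFLAGS aimed at prefix."""
    env = dict(os.environ if base_env is None else base_env)
    root = prefix.rstrip("/")
    include_flag = "-I" + root + "/include"
    lib_flag = "-L" + root + "/lib"
    # Keep any flags that are already set and append the new ones.
    env["CPPFLAGS"] = (env.get("CPPFLAGS", "") + " " + include_flag).strip()
    env["LDFLAGS"] = (env.get("LDFLAGS", "") + " " + lib_flag).strip()
    return env


if __name__ == "__main__":
    env = configure_env()
    print("CPPFLAGS=" + env["CPPFLAGS"])
    print("LDFLAGS=" + env["LDFLAGS"])
    # To actually run the step from inside the source directory:
    # import subprocess
    # subprocess.run(["./configure"], env=env, check=True)
```

The subprocess call is left commented out so that the sketch can be run anywhere without a source tree to hand.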
Finally
...
Configuration is done.
You can now build and install tesseract by running:
$ make
$ sudo make install
You can not build training tools because of missing dependency.
Check configure output for details.
Now to make and make install.
Done!
The first step is to do a practice run using a test image. So I've taken a photo of some text using my iPhone.
I'll do a naive run using the library's command line interface:
paulkorir@Pauls-MacBook-Pro:Downloads$ tesseract --help
Usage: tesseract imagename|stdin outputbase|stdout [options...] [configfile...]
paulkorir@Pauls-MacBook-Pro:Downloads$ tesseract FullSizeRender\ 4.jpg ocr_output
Tesseract Open Source OCR Engine v3.04.00 with Leptonica
Error opening data file /usr/local/share/tessdata/eng.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to the parent directory of your "tessdata" directory.
Failed loading language 'eng'
Tesseract couldn't load any languages!
Could not initialize tesseract.
So there's some work to be done. The instructions provided with the source say that there should be a language file (with <lang>, e.g. eng, in the archive name) available. I can't seem to find it.
The problem is that the GitHub repository does not link to the old data repository on Google Code (?!), which is at https://code.google.com/p/tesseract-ocr/downloads/list. With the English data downloaded and unpacked from there, I copy it into place:
paulkorir@Pauls-MacBook-Pro:tessdata$ sudo cp -v * /usr/local/share/tessdata/
Password:
eng.cube.bigrams -> /usr/local/share/tessdata/eng.cube.bigrams
eng.cube.fold -> /usr/local/share/tessdata/eng.cube.fold
eng.cube.lm -> /usr/local/share/tessdata/eng.cube.lm
eng.cube.nn -> /usr/local/share/tessdata/eng.cube.nn
eng.cube.params -> /usr/local/share/tessdata/eng.cube.params
eng.cube.size -> /usr/local/share/tessdata/eng.cube.size
eng.cube.word-freq -> /usr/local/share/tessdata/eng.cube.word-freq
eng.tesseract_cube.nn -> /usr/local/share/tessdata/eng.tesseract_cube.nn
eng.traineddata -> /usr/local/share/tessdata/eng.traineddata
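The earlier error message spells out the rule: the data file lives in a tessdata directory, and TESSDATA_PREFIX (if set) names the parent of that directory. Here is a small Python sketch that applies the rule to check whether a language is in place before running anything. The function names are made up for illustration, and the /usr/local/share default is simply the location this particular build reported, not something guaranteed for other installs.

```python
import os


def traineddata_path(lang="eng", tessdata_prefix=None, default_prefix="/usr/local/share"):
    """Return the path where <lang>.traineddata is expected to be found."""
    # An explicit argument wins, then TESSDATA_PREFIX, then the assumed default.
    prefix = tessdata_prefix or os.environ.get("TESSDATA_PREFIX") or default_prefix
    return os.path.join(prefix, "tessdata", lang + ".traineddata")


def language_installed(lang="eng", tessdata_prefix=None):
    """Report whether the data file for lang exists at the expected path."""
    return os.path.isfile(traineddata_path(lang, tessdata_prefix))


if __name__ == "__main__":
    print(traineddata_path("eng"))
    print("installed" if language_installed("eng") else "not installed")
```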
Now I run it again, only to get a new error:
paulkorir@Pauls-MacBook-Pro:Downloads$ tesseract FullSizeRender\ 4.jpg ocr_output
Tesseract Open Source OCR Engine v3.04.00 with Leptonica
Error in pixReadMemJpeg: function not present
Error in pixReadMem: jpeg: no pix returned
Error during processing.
At first it appeared this was because I did not run sudo ldconfig after installing Tesseract. That is a Linux command, though, and apparently there is no Mac equivalent because one is not needed (https://discussions.apple.com/thread/3844649?tstart=0). The real cause turned out to be that Macs don't have the libjpeg or libpng development libraries installed by default, so Leptonica had been built without JPEG support. So I downloaded and installed these (just Google them), then rebuilt Leptonica and Tesseract in succession (with CPPFLAGS and LDFLAGS set manually).
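A check like the following would have caught this before building Leptonica. It is only a sketch: it looks for the headers that libjpeg and libpng normally install (jpeglib.h and png.h), and the list of include directories is an assumption that may need adjusting for other setups. The function name missing_image_libs is made up for illustration.

```python
import os

# Header file that each development library normally installs.
IMAGE_HEADERS = {"libjpeg": "jpeglib.h", "libpng": "png.h"}


def missing_image_libs(include_dirs=("/usr/local/include", "/usr/include")):
    """Return the libraries whose header is absent from every include directory."""
    missing = []
    for lib, header in IMAGE_HEADERS.items():
        found = any(os.path.isfile(os.path.join(d, header)) for d in include_dirs)
        if not found:
            missing.append(lib)
    return missing


if __name__ == "__main__":
    missing = missing_image_libs()
    if missing:
        print("Install before building Leptonica: " + ", ".join(missing))
    else:
        print("libjpeg and libpng headers found")
```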
Now we have the following:
paulkorir@Pauls-MacBook-Pro:Downloads$ tesseract FullSizeRender\ 4.jpg ocr_output
Tesseract Open Source OCR Engine v3.04.00 with Leptonica
Warning in pixReadMemJpeg: work-around: writing to a temp file
paulkorir@Pauls-MacBook-Pro:Downloads$ cat ocr_output.txt
Telephone: 01223 364433
Fax: 01223 315728
Prescriptions: 01223 321673
Management: 01223 321677
Website: www.arburyroadsurgery.nhs.uk
Out of Hours: 01223 446995 (6.00pm-6.30pm Mon-Fri)
1 1 1 (all other times)
PARTNERS (this is not a limited partnership)
Dr. Richard Gant (GMC: 1619562) MB, ChB (1973)
Dr. Andrew Watson (GMC: 2842501) MB, BS (1983), MRCGP, DGM, DRCOG
Dr. Friederike Fisher (GMC: 3143502) Dr. Med (1986), MRCGP, DLO, DCH, DFSRi
Dr. Oria McGuinness (GMC: 8038799) MB, BS (2001)
Dr. Jaana Karttunen (GMC: 8111932) MB, BChir (2004), PhD, BA
Dr. Joanna Shneerson (GMC:6145487) MB, ChB (2008), 880, DRCOG, MRCP.
It worked!
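Since the plan is to prototype in Python, the same command-line run could be wrapped roughly like this. It is a sketch rather than the OpenCV route: it just shells out to the tesseract binary using the imagename and outputbase arguments from the usage line, plus the -l language option, and reads back the .txt file that the tool writes. The function names are hypothetical.

```python
import subprocess


def tesseract_command(image_path, output_base, lang=None):
    """Build the argument list for: tesseract imagename outputbase [-l lang]."""
    cmd = ["tesseract", image_path, output_base]
    if lang:
        cmd += ["-l", lang]
    return cmd


def run_ocr(image_path, output_base, lang=None):
    """Run tesseract and return the recognised text from <output_base>.txt."""
    subprocess.run(tesseract_command(image_path, output_base, lang), check=True)
    with open(output_base + ".txt", encoding="utf-8") as handle:
        return handle.read()


if __name__ == "__main__":
    # Only prints the command; call run_ocr() once tesseract and an image are available.
    print(" ".join(tesseract_command("FullSizeRender 4.jpg", "ocr_output", lang="eng")))
```

Passing the arguments as a list means the space in the file name needs no escaping, unlike on the shell command line.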
Now on to greater things!