Installing tesseract 4.0 on Ubuntu 16.04

2 minute read

You can download Tesseract from Ubuntu Package Manager simply with:

$ apt-get install tesseract-ocr 

But unfortunately Ubuntu package manager doesn’t contain the Tesseract 4.0 version. It will download Tesseract 3.x (and Leptonica 1.7.0). So if you want the latest version of Tesseract, you have to download it from git repository and compile it manually.

Here I’ll go through the steps I followed to install Tesseract 4.0 on my Ubuntu 16.04 LTS system.

NOTE

Ubuntu 14.04 LTS or prior versions doesn’t support tesseract 4.0 LSTM. You can install tesseract 3.x on it.

Step 1: Install required dependecies

if not already installed you need the following libraries.

$ sudo apt-get install g++ 
$ sudo apt-get install autoconf automake libtool
$ sudo apt-get install autoconf-archive
$ sudo apt-get install pkg-config
$ sudo apt-get install libpng-dev
$ sudo apt-get install libjpeg8-dev
$ sudo apt-get install libtiff5-dev
$ sudo apt-get install zlib1g-dev

If you want to use training tools too :

$ sudo apt-get install libicu-dev
$ sudo apt-get install libpango1.0-dev
$ sudo apt-get install libcairo2-dev

Step 2: Install Lepronica 1.74.x

Install from source:

  • Ubuntu 16.04 repository contains Leptonica 1.71. So you need to Download the latest version from Leptonica.
  • Go to the Download directory and install it.
    $ sudo tar xf leptonica-1.74.tar.gz &&\
    cd leptonica-1.74 &&\
    sudo ./configure &&\
    sudo make &&\
    sudo make install
    

    Or

    Insltall from Unofficial Ubuntu PPA

    Easier way of installing Leptonica is to install it from Unofficial Ubuntu PPA by Alex.

    $ sudo add-apt-repository ppa:alex-p/tesseract-ocr &&\
    sudo apt-get update &&\
    sudo apt-get install -y libleptonica-dev &&\
    sudo ldconfig
    

Step 3: Install Tesseract

  • Download or clone Tesseract 4.0 from github repository
  • Compile and install
    $ sudo sh autogen.sh  &&\
    ./configure  &&\
    LDFLAGS="-L/usr/local/lib" CFLAGS="-I/usr/local/include" make  &&\
    sudo make install  &&\
    sudo make install -langs  &&\
    sudo ldconfig
    

Step 5: Verify Installation

  • Check if correct version of Tesseract and Leptonica is installed.
    $ tesseract --version
    

    It should show something like this

  • Check the installed languages
    $ tesseract --list-langs
    

Step 6: Download language files

Initially you’ll find no language data is available. You need to download data files of the laguages you need. You’ll find three types of files availabe. tessdata_fast, tessdata_best, and tessdata. You can download any of them according to your need. ( You’ll find more about them in tesseract github wiki )

  • Move your downloaded language files to the tessdata directory. Default is /usr/local/share/tessdata. Now you can see the available languages using $ tesseract --list-langs command from terminal.

Using tesseract

For example to extract bangla text data from an image file named input.png you can use the following command:

$ tesseract input.png output -l ben

You can find more options and uses with $tesseract --help or $man tesseract

Create manual config files

  • Create a file (supposemyconfig) in /usr/local/share/tessdata/configs directory
  • Add control parameters [To find all the available parameters use tesseract --print-parameters]. For example to get the intermediate preprocessed image tesseract generates add tessedit_write_images to true or use user specified dictionaty instead of default dictionay.
  • Use the configfile name as parameter while running tesseract.
    $ tesseract input.jpg output.txt myconfig
    

Recognize Multiple languages in an image

$ tesseract input.jpg out -l eng+ben #primary-eng 
  #or
$ tesseract input.jpg out -l ben+eng #primary-ben 

All the Best !

Updated:

Leave a Comment