This howto should get you up and running with scanning and OCR (optical character recognition), so that you can scan and read documents on your Linux box.
It assumes you have Linux command-line experience and is based on a Debian system, but the steps should be similar on other distros.
What do I need?
You'll want to grab the following:
- sane & scanimage
- Tesseract
- Ocropus
- Kies desktop front end (just to make life easier)
- lynx, for displaying the scanned page
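To see which of these tools are already installed, you can check your PATH. This is a minimal sketch rather than part of the original setup: missing_tools is a hypothetical helper name, and the command names in the example are the ones this howto runs later (scanimage, tesseract, lynx, kies_p2t).

```shell
# missing_tools NAME...: print each named command that is NOT on PATH.
# (Hypothetical helper, for illustration only.)
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Example:
#   missing_tools scanimage tesseract lynx kies_p2t
```

No output means every listed command was found.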
To get sane (and scanimage), run these commands, or the equivalent for your distribution of Linux:
sudo apt-get install sane lynx automake1.4 build-essential subversion
sudo apt-get build-dep tesseract-ocr
Note: you may need other libraries. If so, please email me so I can update the page (my email address is at the bottom of the page).
Try this first, and look carefully at the output of ./configure. The following commands are useful if make fails:
./configure > /path/to/log_file.log 2>&1
grep -i no /path/to/log_file.log
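The grep above also matches words such as "none" or "not", so the list can be noisy. A narrower filter is sketched below. It assumes the usual autoconf output style, where each check is printed as "checking for something... no"; failed_checks is a hypothetical helper name.

```shell
# failed_checks LOGFILE: print only the configure checks whose
# result was exactly "no". (Hypothetical helper.)
failed_checks() {
    grep -E '^checking .*\.\.\. no$' "$1"
}

# Example:
#   failed_checks /path/to/log_file.log
```

Not every "no" is a problem, since many checks are optional, but a missing header or library near the end of the list is a good place to start.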
Getting an image from the scanner
This is the important bit: you need to make sure you can get an image from the scanner. First, make sure the system can find your scanner:
Turn on your scanner and plug it in.
Reboot the system with the scanner turned on, so it can be detected.
Next, run:
sudo scanimage -L
This should list your scanner. If it does, great! Let's scan something:
sudo scanimage > /tmp/testfile
Make sure you have something in the scanner (placed the right way), and see what happens.
If you hear scanning, it probably worked. If not, debug the problem and make sure you can at least scan with scanimage before you continue, because the programs that follow depend on scanimage.
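Hearing the scanner is a good sign, but you can also look at the file it produced. The sketch below reads the first two bytes of a file and guesses the format. It assumes scanimage wrote PNM (its default output format) or TIFF; image_kind is a hypothetical helper name.

```shell
# image_kind FILE: guess the image format from the first two bytes.
# Prints "empty" (and fails) if FILE is missing or has no data.
image_kind() {
    [ -s "$1" ] || { echo empty; return 1; }
    magic=$(head -c 2 "$1")
    case $magic in
        P[1-6]) echo pnm ;;      # PNM family: P1 to P6
        II|MM)  echo tiff ;;     # TIFF byte-order marks
        *)      echo unknown ;;
    esac
}

# Example:
#   image_kind /tmp/testfile
```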
If scanimage doesn't find your scanner, you may need drivers, so unfortunately you'll have to do some debugging.
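While debugging, it can be handy to reduce the listing to device names only, so a script can tell whether anything was found. This sketch assumes each detected scanner appears on a line that starts with the word device, followed by the device name between a backtick and a single quote; scanner_devices is a hypothetical helper name.

```shell
# scanner_devices: read the output of "scanimage -L" on stdin and
# print one device name per line; prints nothing if none was listed.
scanner_devices() {
    sed -n "s/^device \`\([^']*\)' is a.*/\1/p"
}

# Example:
#   scanimage -L | scanner_devices
```

An empty result means scanimage reported no devices, which points at drivers, cabling or permissions rather than at the OCR tools.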
Finally, make scanimage usable by all users:
sudo chmod +s /usr/bin/scanimage
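To confirm that the permission change took effect, you can test for the setuid bit. This is only a small sketch, and has_setuid is a hypothetical helper name.

```shell
# has_setuid FILE: succeed if FILE exists and has its setuid bit set.
has_setuid() {
    [ -u "$1" ]
}

# Example:
#   has_setuid /usr/bin/scanimage && echo "setuid bit is set"
```

Running ls -l on the binary shows the same thing: an s in place of the owner's x in the permission string.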
Let's install tesseract and see what happens!
First, check out the tesseract code from svn:
cd /tmp  # or any other directory suitable for holding svn checkouts
Next, run:
svn checkout http://tesseract-ocr.googlecode.com/svn/trunk tesseract-ocr
Once it has checked out, you are ready to build and install tesseract:
cd tesseract-ocr
Download my patch to fix compile errors generated from the current tesseract sources:
wget http://members.iinet.net.au/~ddalton/projects/ocr/patches/compile-fix_tesseract.patch
patch -p0 < compile-fix_tesseract.patch
If all goes well, do the following:
./configure
make
sudo make install
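If one of these three steps fails, it helps to have its full output in a file. The sketch below wraps each step so that its output is appended to a log and its exit status is preserved. run_logged and the log path in the example are hypothetical names chosen for illustration.

```shell
# run_logged LOGFILE COMMAND [ARGS...]
# Append a header line and the command's combined stdout/stderr to
# LOGFILE, then return the command's own exit status.
run_logged() {
    log=$1
    shift
    printf '== %s\n' "$*" >> "$log"
    "$@" >> "$log" 2>&1
}

# Example (stops at the first failing step):
#   run_logged /tmp/tesseract-build.log ./configure &&
#   run_logged /tmp/tesseract-build.log make &&
#   run_logged /tmp/tesseract-build.log sudo make install
```

The "==" header lines make it easy to see in the log where one step ends and the next begins.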
Install kies
Grab kies, written by Willem, as follows:
cd to_dir_to_download_kies_to
wget ftp://ftp.csir.co.za/mi/national_accessibility_portal/wvdwalt/kies-12869.tar.bz2
tar -xjvf kies-12869.tar.bz2
cd kies
sudo ./install.sh
You should now have kies and tesseract installed. Next, we need to install ocropus.
Installing the ocropus package
The following commands check out a version of ocropus known to compile and run:
cd /tmp  # or wherever you keep svn checkouts
svn checkout -r 864 http://ocropus.googlecode.com/svn/trunk/ ocropus
cd ocropus
./configure
jam
sudo jam install
Testing your ocr!
kies_p2t
Make sure you have a page in your scanner, placed the right way, then press space and hope for the best. The recognized text is presented in lynx.
If it fails, examine the errors carefully. Most likely a binary is in the wrong place; see if you can track it down, and otherwise feel free to email me.
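To find out which stage is failing, you can run the scan and OCR steps by hand, outside kies. The following is only a sketch of such a pipeline and makes no claim about how kies_p2t works internally. It assumes scanimage can reach your scanner without sudo and that your tesseract build reads TIFF input; scan_to_text and the /tmp/page prefix are hypothetical names.

```shell
# scan_to_text BASE: scan one page to BASE.tif, OCR it to BASE.txt,
# and print the path of the text file on stdout.
scan_to_text() {
    base=$1
    scanimage --format=tiff > "$base.tif" || { echo "scan failed" >&2; return 1; }
    [ -s "$base.tif" ] || { echo "scan produced no data" >&2; return 1; }
    # tesseract adds the .txt suffix to the output base itself.
    # Its own messages go to stderr so stdout carries only the path.
    tesseract "$base.tif" "$base" >&2 || { echo "OCR failed" >&2; return 1; }
    printf '%s\n' "$base.txt"
}

# Example:
#   lynx "$(scan_to_text /tmp/page)"
```

If the scan step works here but kies_p2t still fails, the problem is more likely in how kies locates its binaries than in the scanner setup.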
Notes
This is very brief and I'm sure I missed something. Please email me with comments, suggestions or problems you are having and I'll try to help.
Feel free to grab the HTML file, make changes and send it to me; if they are correct, I'll upload your changes.
Any feedback is appreciated.
daniel.dalton47@gmail.com