This guide explains how to train Tesseract OCR for recognizing MICR (Magnetic Ink Character Recognition) lines on bank checks using real check images and the OCR CLI tool.
- Python 3.x along with the `pillow` library for image processing
- Tesseract 5.x (see Tesseract installation instructions)
- OCR CLI tool (follow installation instructions from the fin-ocr-cli repository)
You can prepare training data either automatically using X9 files or manually.
If you have X9 files, you can use the `x9-extract` tool to automate the creation of your training data.
- Use `x9-extract` to extract check images and metadata:

  ```bash
  FRB_COMPATIBILITY_MODE=true ./x9-extract $HOME/.fin-ocr/checks $HOME/x9-files/train/*
  ```

- Use the OCR CLI to process and validate the extracted data:

  ```bash
  ocr check scan <start-check-num> <end-check-num>
  ```

  For example, to process checks 1 through 1000:

  ```bash
  ocr check scan 1 1000
  ```

  This command will:
  - Process each check image in the `$HOME/.fin-ocr/checks` directory
  - Compare the OCR results with the metadata from X9 files
  - Generate ground truth files for training
Important Note: Some data in the X9 files may not be accurate. The OCR CLI helps identify potential discrepancies by comparing OCR results with X9 metadata. You may need to manually review and correct values in the resulting JSON files to ensure training quality.
In this flow, the OCR CLI is responsible for generating the ground truth data that Tesseract training depends on. Here's what happens during this process:
- For each check, the OCR CLI generates two important files:
  - `preprocessedImageFile` (TIFF): Contains the isolated MICR line image.
  - `groundTruthFile` (gt.txt): Contains the correct text corresponding to the MICR line.
Note: In some cases, ground truth files may not be created for a particular check.
This can happen if:
- The OCR results do not match the data in the JSON file
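Because of this, it can be useful to list which checks were skipped before you start training. Below is a minimal Python sketch for that; it assumes (this is not guaranteed by the OCR CLI) that each preprocessed MICR image is written as `<base>.tiff` with its ground truth as `<base>.gt.txt` in the same directory, and the default directory path is likewise an assumption you should adjust.

```python
#!/usr/bin/env python3
"""List preprocessed check images that have no matching ground truth file.

Assumes <base>.tiff / <base>.gt.txt pairs living side by side in one
directory; adjust the path and naming to match your actual output.
"""
import sys
from pathlib import Path

# Directory holding the generated training files; adjust to your setup.
gt_dir = Path(sys.argv[1]) if len(sys.argv) > 1 else Path.home() / ".fin-ocr" / "checks"

missing = []
for tiff in sorted(gt_dir.glob("*.tiff")):
    gt_file = tiff.parent / (tiff.stem + ".gt.txt")  # e.g. check-42.tiff -> check-42.gt.txt
    if not gt_file.exists():
        missing.append(tiff.name)

print(f"{len(missing)} image(s) have no ground truth file")
for name in missing:
    print(f"  {name}")
```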
If you don't have X9 files or prefer manual setup, you can create the training data files yourself.
- Create a directory for your training data: `$HOME/.fin-ocr/checks`, or set the environment variable `CHECKS_DIR` to your preferred location.

- For each check, create two files in this directory (see the Python sketch after this list):
  - `check-<num>.tiff`: The check image in TIFF format
  - `check-<num>.json`: A JSON file containing the MICR line data
  JSON file schema:

  ```json
  {
    "id": "check-<num>",
    "fileName": "original_x9_filename",
    "fileSeqNo": 1,
    "routingNumber": "123456789",
    "accountNumber": "1234567",
    "checkNumber": "1001",
    "auxiliaryOnUs": "1001",
    "payorBankRoutingNumber": "12345678",
    "payorBankCheckDigit": "9",
    "onUs": "1234567/1001"
  }
  ```
- You can use the OCR CLI to process your manually prepared data:

  ```bash
  ocr check scan <start-check-num> <end-check-num>
  ```
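If your source scans are not already TIFFs, the `pillow` prerequisite is enough to do the conversion and write the matching JSON file in one pass. The sketch below is illustrative only: the source image path, the MICR values, and the check number are placeholders you would replace with your own data.

```python
#!/usr/bin/env python3
"""Create check-<num>.tiff and check-<num>.json for one manually prepared check.

The image path and MICR values below are placeholders; substitute your own.
"""
import json
import os
from pathlib import Path

from PIL import Image  # provided by the `pillow` prerequisite

# Honors CHECKS_DIR if set, otherwise uses the default training directory.
checks_dir = Path(os.environ.get("CHECKS_DIR", Path.home() / ".fin-ocr" / "checks"))
checks_dir.mkdir(parents=True, exist_ok=True)

num = 1                                   # check number within the training set
src_image = Path("scans/check-0001.png")  # hypothetical source scan

# Convert the scan to TIFF with the expected file name.
Image.open(src_image).save(checks_dir / f"check-{num}.tiff")

# Metadata matching the JSON schema shown above; values here are examples only.
metadata = {
    "id": f"check-{num}",
    "fileName": "original_x9_filename",
    "fileSeqNo": 1,
    "routingNumber": "123456789",
    "accountNumber": "1234567",
    "checkNumber": "1001",
    "auxiliaryOnUs": "1001",
    "payorBankRoutingNumber": "12345678",
    "payorBankCheckDigit": "9",
    "onUs": "1234567/1001",
}
(checks_dir / f"check-{num}.json").write_text(json.dumps(metadata, indent=2))
```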
- Ensure you're in the `real` directory of this repository.

- Run the training command:

  ```bash
  ./mgr train <starting-num> <count>
  ```

  For example, to train on 20,000 checks:

  ```bash
  ./mgr train 1 20000
  ```
Note: This process is CPU-intensive and may take several hours. Consider using `nohup` or a similar tool to run it in the background, e.g.:

```bash
nohup ./mgr train 1 20000 > training_output.log 2>&1 &
```

You can monitor the progress by checking the log file:

```bash
tail -f training_output.log
```
Once the training is complete, you can find the training results in `$HOME/.fin-ocr/train/results/`.
- The `mgr` script sets up the training environment:
  - Clones the tesstrain repository if not present
  - Downloads necessary language data files
  - Uses the OCR CLI to generate and validate ground truth data from your check images and JSON files

- The script then initiates the Tesseract training process using the tesstrain makefile.
If you encounter inaccuracies in the X9 data:
- Review the OCR CLI output for any mismatches between OCR results and X9 metadata.
- Manually inspect the check images and JSON files for discrepancies.
- Update the JSON files with correct information if needed.
- Consider creating a `corrections` file to track and apply corrections automatically. (TODO: expand on this)
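One cheap way to spot bad X9 metadata before training is to validate each `routingNumber` against the standard ABA checksum (weights 3, 7, 1 repeated across the nine digits, with the weighted sum a multiple of 10). The sketch below only assumes the `check-<num>.json` layout shown earlier; it will not catch every error, but it flags obviously corrupted routing numbers.

```python
#!/usr/bin/env python3
"""Flag check-<num>.json files whose routingNumber fails the ABA checksum."""
import json
import os
from pathlib import Path

def valid_aba(routing: str) -> bool:
    """True if `routing` is nine digits and its ABA checksum is a multiple of 10."""
    if len(routing) != 9 or not routing.isdigit():
        return False
    weights = [3, 7, 1, 3, 7, 1, 3, 7, 1]
    return sum(w * int(d) for w, d in zip(weights, routing)) % 10 == 0

# Honors CHECKS_DIR if set, otherwise uses the default training directory.
checks_dir = Path(os.environ.get("CHECKS_DIR", Path.home() / ".fin-ocr" / "checks"))
for json_file in sorted(checks_dir.glob("check-*.json")):
    data = json.loads(json_file.read_text())
    routing = data.get("routingNumber", "")
    if not valid_aba(routing):
        print(f"{json_file.name}: suspicious routingNumber {routing!r}")
```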
After training completes:
- Results are stored in `$HOME/.fin-ocr/train/results/<date>/`:
  - `micr_e13b.traineddata`: The trained Tesseract data file
  - `$HOME/.fin-ocr/train/train.log`: Detailed logs

- You can use the resulting `micr_e13b.traineddata` file with the fin-ocr CLI or fin-ocr REST service for MICR line recognition (a quick smoke test is sketched below).
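Before wiring the new model into the fin-ocr CLI or REST service, you can sanity-check it with the stock `tesseract` binary. The sketch below assumes `tesseract` is on your PATH, that you point it at the results directory containing `micr_e13b.traineddata`, and that you have a preprocessed MICR line image to test with; `--psm 7` tells Tesseract to treat the image as a single text line.

```python
#!/usr/bin/env python3
"""Smoke-test micr_e13b.traineddata on a single MICR line image."""
import subprocess
import sys
from pathlib import Path

# Adjust these two paths: the directory containing micr_e13b.traineddata
# and a preprocessed MICR line image to test against.
tessdata_dir = Path(sys.argv[1])   # e.g. the $HOME/.fin-ocr/train/results/<date> directory
image = Path(sys.argv[2])          # e.g. a preprocessed check TIFF

result = subprocess.run(
    [
        "tesseract", str(image), "stdout",
        "--tessdata-dir", str(tessdata_dir),
        "-l", "micr_e13b",
        "--psm", "7",  # treat the image as a single text line
    ],
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout.strip())
```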
- If training fails, check the `train.log` file for error messages (`$HOME/.fin-ocr/train/train.log` or `$TRAIN_DIR/train.log`).
- If using X9 files, verify that the extracted data is correct and complete. Use the OCR CLI to validate and identify potential issues.
- You may need to experiment with different training parameters (e.g., number of iterations, learning rates) to achieve optimal results for your specific dataset.