The model uses a stacked DenseNet encoder with residual connections and positional encoding to preserve spatial information, and a Transformer decoder to predict the corresponding LaTeX sequence.
The model comprises two main components:
- Encoder: A stacked DenseNet architecture that processes input images to extract meaningful features.
- Decoder: A Transformer-based decoder that translates the extracted features into LaTeX code.
- DenseNet Encoder
  - DenseNetBone: The fundamental building block, consisting of two convolutional layers with bottleneck layers and optional dropout for regularization.
  - DenseNet: Stacks multiple DenseNetBone blocks and includes transition layers to reduce spatial dimensions and channel numbers.
  - StackedDenseNetEncoder: Combines multiple DenseNet modules with residual connections, culminating in a final convolution and 2D positional encoding to prepare features for the decoder.
- Transformer Decoder
  - Embedding Layer: Converts token IDs into dense vectors.
  - Positional Encoding: Adds sinusoidal positional information to embeddings.
  - TransformerDecoderLayer: Comprises multi-head attention and feed-forward neural networks.
  - Final Linear Layer: Maps decoder outputs to vocabulary size for token prediction.
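Below is a minimal, hedged sketch of how these components might be wired together in PyTorch. It is not the project's actual code: the DenseNet internals are collapsed into a placeholder convolutional stack and the 2D positional encoding on the encoder features is omitted; only the argument names (`hidden_dim`, `num_layers`, `num_heads`) mirror the CLI options, and the class names here are illustrative.

```python
import math
import torch
import torch.nn as nn

class SinusoidalPE(nn.Module):
    """Standard 1D sinusoidal positional encoding added to token embeddings."""
    def __init__(self, d_model, max_len=512):
        super().__init__()
        pos = torch.arange(max_len).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)

    def forward(self, x):  # x: (batch, seq, d_model)
        return x + self.pe[: x.size(1)]

class Image2LatexSketch(nn.Module):
    """Encoder-decoder wiring only; the real encoder stacks DenseNet blocks."""
    def __init__(self, vocab_size, hidden_dim=512, num_layers=8, num_heads=16):
        super().__init__()
        # Placeholder for the StackedDenseNetEncoder: one conv + pooling stage.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, hidden_dim, kernel_size=7, stride=4, padding=3),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((14, 14)),
        )
        self.embedding = nn.Embedding(vocab_size, hidden_dim)
        self.pos_enc = SinusoidalPE(hidden_dim)
        layer = nn.TransformerDecoderLayer(hidden_dim, num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, tgt_tokens):
        # images: (batch, 1, 224, 224); tgt_tokens: (batch, seq) of token IDs
        feats = self.encoder(images)                  # (batch, C, H, W)
        memory = feats.flatten(2).transpose(1, 2)     # (batch, H*W, C)
        tgt = self.pos_enc(self.embedding(tgt_tokens))
        mask = nn.Transformer.generate_square_subsequent_mask(
            tgt_tokens.size(1)).to(tgt.device)        # causal mask for decoding
        dec = self.decoder(tgt, memory, tgt_mask=mask)
        return self.out(dec)                          # (batch, seq, vocab_size)
```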
- Image Preprocessing: Converts input images to grayscale, applies binary thresholding to ensure white formulas on black backgrounds, resizes to 224x224, and normalizes (a rough sketch follows this list).
- Feature Extraction: The DenseNet encoder processes the preprocessed image to extract feature maps.
- Sequence Generation: The Transformer decoder generates LaTeX tokens based on the encoder's output, utilizing beam search to optimize predictions.
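A minimal preprocessing sketch, assuming OpenCV-style operations; the project's exact thresholding and normalization parameters may differ.

```python
import cv2
import numpy as np
import torch

def preprocess(image_bytes: bytes) -> torch.Tensor:
    # Decode the raw bytes directly to a grayscale image.
    img = cv2.imdecode(np.frombuffer(image_bytes, np.uint8), cv2.IMREAD_GRAYSCALE)
    # Otsu binary threshold; THRESH_BINARY_INV assumes dark ink on a light
    # background, so the formula ends up white on black.
    _, img = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    # Resize to the fixed 224x224 model input and scale pixel values to [0, 1].
    img = cv2.resize(img, (224, 224), interpolation=cv2.INTER_AREA)
    return torch.from_numpy(img).float().unsqueeze(0) / 255.0  # (1, 224, 224)
```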
Reasons why ViT doesn't work well:
- Domain gap in ViT pre-training: ViT (Vision Transformer) is usually pre-trained on large-scale natural-image datasets such as ImageNet. This dataset consists of handwritten mathematical formulas, whose feature distribution is very different from that of natural images. A pre-trained ViT may therefore fail to extract useful features, leaving the encoder output largely meaningless to the decoder.
- Lack of low-level feature extraction: handwritten formula images contain complex local features, such as stroke and symbol details. ViT slices the image directly into patches and applies a global self-attention mechanism, which may fail to capture these critical local features.
The dataset combines several existing datasets, totaling over 240K images.
The images are converted to grayscale, binary-thresholded to ensure white formulas on black backgrounds, resized to 224x224, and normalized.
The project expects the dataset to be in a Parquet file (.parquet) with the following columns:
- formula: The LaTeX expression of the handwritten formula.
- filename: The name of the image file.
- image: The binary content of the image.
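A hedged example of reading this layout with pandas; the file path is illustrative, and the image column is assumed to hold raw encoded image bytes (or a dict containing them, depending on how the Parquet file was written).

```python
import io
import pandas as pd
from PIL import Image

df = pd.read_parquet("dataset/train_parquets/training_data.parquet")  # illustrative path
print(df.columns.tolist())  # expected: ['formula', 'filename', 'image']

row = df.iloc[0]
# Some writers store the image as raw bytes, others as {'bytes': ..., 'path': ...}.
img_bytes = row["image"]["bytes"] if isinstance(row["image"], dict) else row["image"]
img = Image.open(io.BytesIO(img_bytes))
print(row["filename"], img.size, row["formula"])
```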
- Python: 3.10 or higher
- PyTorch: 2.1.2 or higher
- CUDA: cu118 (for GPU acceleration)
The package versions listed in requirements.txt are not guaranteed to work in every environment.
```bash
pip install -r requirements.txt
```
Execute the training script with `python densenet_model.py [OPTIONS]`.
The options are:
| Argument | Type | Default Value | Description |
|---|---|---|---|
| --data_pq_file | str | dataset/train_parquets/training_data.parquet | Path to the training data Parquet file |
| --dictionary_dir | str | dataset/dictionary.txt | Path to the dictionary file |
| --saved_tokenizer_dir | str | dataset/whole_tokenizer.json | Path to save/load the custom tokenizer |
| --hidden_dim | int | 512 | Hidden dimension size |
| --num_layers | int | 8 | Number of Transformer decoder layers |
| --num_heads | int | 16 | Number of attention heads |
| --batch_size | int | 32 | Batch size for training and validation |
| --num_epochs | int | 40 | Number of training epochs |
| --learning_rate | float | 1e-4 | Learning rate for the optimizer |
| --weight_decay | float | 1e-4 | Weight decay for the optimizer |
| --test_size | float | 0.1 | Proportion of the dataset used for the validation split |
| --random_state | int | 5525 | Random state for dataset splitting |
```bash
python densenet_model.py --data_pq_file path/to/training_data.parquet \
    --dictionary_dir path/to/dictionary.txt \
    --saved_tokenizer_dir path/to/custom_tokenizer.json \
    --hidden_dim 256 \
    --num_layers 6 \
    --num_heads 8 \
    --batch_size 64 \
    --num_epochs 50 \
    --learning_rate 5e-5 \
    --weight_decay 1e-4
```
Ensure you have the test datasets, then run `python densenet_inference.py [OPTIONS]`.
Options:
| Argument | Type | Default Value | Description |
|---|---|---|---|
| --checkpoint | str | Required | Path to the model checkpoint (.pth file) |
| --tokenizer | str | Required | Path to the custom tokenizer (.json file) |
| --test_parquet_path | str | Required | Path to the test data Parquet file |
| --test_dataset_name | str | Required | The name of the testing dataset |
| --output_file | str | Required | Path to save the prediction results |
| --hidden_dim | int | 512 | Hidden dimension size |
| --num_layers | int | 8 | Number of Transformer decoder layers |
| --num_heads | int | 16 | Number of attention heads |
```bash
python densenet_inference.py --checkpoint path/to/model_checkpoint.pth \
    --tokenizer path/to/custom_tokenizer.json \
    --test_parquet_path path/to/test_data.parquet \
    --test_dataset_name my_test_dataset \
    --output_file path/to/save_predictions.csv \
    --hidden_dim 256 \
    --num_layers 6 \
    --num_heads 8
```
Keep --hidden_dim, --num_layers, and --num_heads consistent with the dimensions used when the checkpoint was trained.
Training and validation loss over 40 epochs with the DenseNet + Transformer model:
- The average BLEU score on HME100K is 0.8539.
- The average BLEU score on IM2LATEX is 0.7900.
- The average BLEU score on M2E is 0.7805.
- The average ExpRate on HME100K is 47.4946% (exact match), 55.4558% (≤1 error), and 62.42% (≤2 errors).
- The average ExpRate on IM2LATEX is 9.4567% (exact match), 10.2349% (≤1 error), and 13.82% (≤2 errors).
- The average ExpRate on M2E is 19.1487% (exact match), 23.1748% (≤1 error), and 28.15% (≤2 errors).
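For reference, a rough sketch of how BLEU and ExpRate are commonly computed for this task; the whitespace tokenization and BLEU smoothing choices here are assumptions, not the project's exact evaluation code.

```python
# Hedged sketch: BLEU over whitespace-split LaTeX tokens, and ExpRate(<=k) as
# the fraction of predictions within k token edits of the ground truth.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def token_edit_distance(a, b):
    # Standard Levenshtein distance over token lists.
    prev = list(range(len(b) + 1))
    for i, ta in enumerate(a, 1):
        cur = [i]
        for j, tb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ta != tb)))
        prev = cur
    return prev[-1]

def bleu(reference: str, prediction: str) -> float:
    # Sentence-level BLEU with smoothing, on whitespace-tokenized LaTeX.
    return sentence_bleu([reference.split()], prediction.split(),
                         smoothing_function=SmoothingFunction().method4)

def exprate(references, predictions, max_errors=0):
    # max_errors=0 is an exact match; 1 and 2 give the <=1 / <=2 error rates.
    hits = sum(token_edit_distance(r.split(), p.split()) <= max_errors
               for r, p in zip(references, predictions))
    return hits / len(references)
```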