https://github.com/smarbal/ocrai-grader
Automatic grader web application using AI OCR built with Flask, Tailwind CSS + Flowbite, PaddleOCR and pyspellchecker.
Run `sudo docker-compose up --build` in the root directory.
The web page will be available at http://localhost:3000/.
- Users can upload files from storage or take a picture directly with their device (this also works on laptops).
- Images can be previewed and cropped before analysis.
- PDF files can be uploaded but can't be previewed or cropped.
- PDF files with multiple pages are also supported.
- History of processed files is available.
- Results for any processed file are viewable.
- PNG or JSON exports are available. The JSON contains arrays with the coordinates of each text zone and its transcription; the PNG shows the zones and their corresponding results drawn directly on the image.
- Results for every word/sentence are given with a confidence score.
- An optional spellchecker can correct the output of the OCR tool.
- French is optionally supported for both OCR and spellchecking.
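To illustrate the JSON export mentioned above, here is a hypothetical sketch of its shape: one entry per detected text zone with its corner coordinates, transcription and confidence. The field names are illustrative assumptions, not the app's exact schema.

```python
import json

# Hypothetical export structure: a list of detected text zones.
# Field names ("box", "text", "confidence") are assumptions for illustration.
export = [
    {
        "box": [[12, 8], [140, 8], [140, 30], [12, 30]],  # quadrilateral corners (x, y)
        "text": "Question 1",
        "confidence": 0.97,
    },
]
print(json.dumps(export, indent=2))
```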
In order to recognize text across any kind of document, I use PaddleOCR. I selected this toolkit for a few reasons:
- Performance: It uses recent state-of-the-art algorithms that deliver strong accuracy at good speed.
- Ease of use: Installation is straightforward and the API is simple; a few lines of code are enough to run detection and recognition.
- Open source: Everything is open source and written in Python. The models are open source too, so it's possible to start from any of them and build upon them, which is useful for specific use cases (e.g. handwritten text recognition).
- Multi-language OCR: This is an interesting feature, especially since French text would probably be used for testing.
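The "few lines of code" claim can be sketched as below. The file name is an example, and the import is guarded so the sketch degrades gracefully when PaddleOCR is not installed.

```python
# Minimal PaddleOCR usage sketch (classic Python API).
try:
    from paddleocr import PaddleOCR
except ImportError:
    PaddleOCR = None  # PaddleOCR not installed; the sketch stays illustrative

def recognize(image_path, lang="en"):
    """Run detection + recognition on one image and return (text, confidence) pairs."""
    ocr = PaddleOCR(use_angle_cls=True, lang=lang)  # angle classifier fixes rotated text
    result = ocr.ocr(image_path, cls=True)          # one result list per page
    return [(text, conf) for _box, (text, conf) in result[0]]

if PaddleOCR is not None:
    for text, conf in recognize("exam_page.png", lang="fr"):
        print(f"{text} ({conf:.2f})")
```

Switching `lang` between `"en"` and `"fr"` is all that is needed to change the recognition language.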
When the Docker container boots, the latest versions of their model (PP-OCRv3), in both English and French, are automatically downloaded by the Flask server.
The core framework of PP-OCR contains three modules: text detection, detection frame correction, and text recognition.
- Text detection module: This module detects the various text areas in the image. It is trained with the DB (Differentiable Binarization) algorithm.
- Detection frame correction module: To prepare for text recognition, each irregular text box is corrected into a rectangular frame, and the text direction is detected and corrected; for example, rotated text is straightened. This relies on a trained text direction classifier.
- Text recognition module: Finally, this module performs text recognition on the corrected boxes to obtain the text content. PP-OCR uses a CRNN algorithm for this.
Since recognition is the most important part here, I'll go into a bit more detail: a Convolutional Recurrent Neural Network combines convolutional neural networks (CNNs) and recurrent neural networks (RNNs) to process the images.
- The CNN is good at extracting features from the images. However, OCR input contains a large amount of contextual information, and since a CNN focuses on local patterns, it is hard to capture that context with a CNN alone. To solve this, a bidirectional LSTM (Long Short-Term Memory) is introduced to enhance the context modeling, an approach that has proven itself in many projects.
- The output of the CNN is fed into the RNN, which processes sequential data. This better captures context (and therefore better predicts ambiguous characters), handles text of any length, and propagates errors back to the CNN during training.
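The CNN-to-RNN hand-off above can be sketched as a toy shape walk-through (shapes only, no real model). All sizes are illustrative, not PP-OCR's actual dimensions.

```python
import numpy as np

# Toy CRNN data flow: the CNN collapses the image height to 1 and the width
# axis becomes the time axis of a sequence that the bidirectional LSTM consumes.
image = np.zeros((32, 100))              # a 32x100 grayscale text line
feature_map = np.zeros((512, 1, 25))     # CNN output: 512 channels, height 1, width 25
sequence = feature_map.squeeze(1).T      # 25 time steps, each a 512-dim feature vector
print(sequence.shape)                    # (25, 512)
# Per time step the BiLSTM outputs a distribution over the character set;
# CTC decoding then collapses repeats and blanks into the final transcription.
```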
It is a Chinese project; development and documentation seem geared towards the Chinese community: the chat platform is WeChat (a popular Chinese messaging app) and a few pre-trained models are Chinese-only (a handwritten text recognition model exists for Chinese but not for English).
I fine-tuned the latest, best-performing model (PP-OCRv3) to specialize in handwritten text recognition, following the official documentation to train on the IAM dataset. The project offers a simple API to train or re-train models. The main steps were:
- Splitting the data into a training and a validation set.
- Formatting the labels correctly and converting them to be compatible with PaddleOCR.
- Configuring the `yml` configuration file. This starts from an available template; I mostly had to set the pre-trained model name, the data file names and the learning rate.
- Using the PaddleOCR training tool with the configuration file to start the training.
The configuration files can be found in `./train`.
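For reference, a fine-tuning configuration in this style might look like the fragment below. Paths and values are illustrative assumptions, not the exact ones used in this project.

```yml
Global:
  pretrained_model: ./pretrain_models/en_PP-OCRv3_rec_train/best_accuracy
  epoch_num: 50
  save_model_dir: ./output/rec_iam/
Optimizer:
  lr:
    learning_rate: 0.0005
Train:
  dataset:
    data_dir: ./train_data/
    label_file_list: ["./train_data/rec_gt_train.txt"]
Eval:
  dataset:
    data_dir: ./train_data/
    label_file_list: ["./train_data/rec_gt_eval.txt"]
```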
That model is ultimately not included, because the initial results were poor (bad generalisation) and unforeseen GPU driver problems made it impossible to train it again.
Results are excellent for printed characters in any kind of context. Handwritten text, on the other hand, is harder to get right: the context has to be very clear and the writing must not be too messy, cursive or unusual. Performance is also generally good, though processing slows down for PDFs of three or more pages with lots of content.
Since OCR output often contains insertions, deletions or badly recognized characters within a word, I added a spellchecker to correct those small mistakes.
For this, I use pyspellchecker. I chose it because it is one of the fastest Python libraries for the task and it supports multiple languages (and even custom dictionaries of words, which can be useful in the context of automatic grading).
It uses the Levenshtein distance algorithm to find words within an edit distance of 2: it tries all insertions, deletions and substitutions (up to 2 operations) and compares the results against a dictionary. Words that exist are taken as correction candidates, and among the candidates at the smallest distance, the most frequent word in the selected language is chosen as the correction.
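The candidate-generation scheme just described can be sketched in pure Python (Norvig-style). The vocabulary and frequency counts below are made-up examples; pyspellchecker's real implementation differs in detail.

```python
import string

def edits1(word):
    """All words one insertion, deletion or substitution away."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {left + right[1:] for left, right in splits if right}
    inserts = {left + c + right for left, right in splits for c in letters}
    replaces = {left + c + right[1:] for left, right in splits if right for c in letters}
    return deletes | inserts | replaces

def correct(word, vocab, freq):
    """Most frequent dictionary word at the smallest edit distance (max 2)."""
    for candidates in ([word],                                     # distance 0
                       edits1(word),                               # distance 1
                       {e2 for e1 in edits1(word) for e2 in edits1(e1)}):  # distance 2
        hits = [w for w in candidates if w in vocab]
        if hits:
            return max(hits, key=lambda w: freq.get(w, 0))
    return word  # nothing within distance 2: keep the OCR output unchanged

vocab = {"grade", "grader", "paper"}
freq = {"grade": 50, "grader": 10, "paper": 30}
print(correct("gradee", vocab, freq))  # → grade ("grade" and "grader" are both
                                       #   distance 1, but "grade" is more frequent)
```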
Other libraries were available, such as TextBlob (which targets a wider range of applications) and other AI-based approaches, but I was getting mixed results with them. Solutions such as ChatGPT were really good at correcting texts, even with many errors, but I wanted to use open-source tools and keep everything necessary contained in this repository.
- Design could be more responsive (especially on mobile).
- Code needs refactoring (structure/cleanliness).
- It would be better to have specialized models for specific tasks such as handwritten text or digit recognition. This was planned, but I had to settle for the current model due to a few issues.
- Add the possibility to change the image after loading one.