This is the official (partial) code for the paper "Reference-Based Post-OCR Processing with LLM for Diacritic Languages", which was accepted at AAAI 2025.
The pipeline was built step by step by multiple authors, and the source code for some processing steps was not kept, so we publish only part of the full pipeline.
Compared with the paper, we made a small change afterward to improve the quality of the final dataset, such as using Sonnet 3.5 for Section 4.7.
To obtain the VieBookRead dataset for OCR, please go to Hugging Face at https://huggingface.co/datasets/thaodd11/VieBookRead, sign the access form, and wait for approval before downloading.
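Once access has been approved, the download can also be scripted. The following is only a minimal sketch, not part of the published pipeline: it assumes the `huggingface_hub` package is installed and that a personal access token is available in the `HF_TOKEN` environment variable. The helper name `download_viebookread` and the default `local_dir` are hypothetical; only the repository ID comes from the URL above.

```python
import os

# Repository ID taken from the dataset URL above.
REPO_ID = "thaodd11/VieBookRead"


def download_viebookread(local_dir="data/VieBookRead", token=None):
    """Download the gated dataset files into local_dir (hypothetical helper).

    Requires that access to the dataset has already been approved
    for the account that owns the token.
    """
    # Fall back to the HF_TOKEN environment variable (an assumption of this sketch).
    token = token or os.environ.get("HF_TOKEN")
    if not token:
        raise ValueError(
            "No Hugging Face token found; pass token=... or set HF_TOKEN."
        )

    # Imported lazily so the missing-token check works without the package.
    from huggingface_hub import snapshot_download

    # repo_type="dataset" is needed because the repository is a dataset, not a model.
    return snapshot_download(
        repo_id=REPO_ID,
        repo_type="dataset",
        local_dir=local_dir,
        token=token,
    )
```

Calling `download_viebookread()` returns the local folder path containing the downloaded files. If the request has not been approved yet, the download is expected to fail with an access error from the Hub.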
Thank you.