Skip to content

Leveraging LLMs for Post-OCR Correction of Historical Newspapers

License

Notifications You must be signed in to change notification settings

Shef-AIRE/llms_post-ocr_correction

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Leveraging LLMs for Post-OCR Correction of Historical Newspapers

This repository contains the code for the paper Leveraging LLMs for Post-OCR Correction of Historical Newspapers, where LLMs are adapted for a prompt-based approach to post-OCR correction. We focus on the post-OCR correction of historical English, using BLN600, a parallel corpus of 19th century newspaper machine/human transcription. The dataset and fine-tuned models can be accessed from the links below.

Data: https://doi.org/10.15131/shef.data.25439023
Models: https://huggingface.co/pykale

Usage

bln600.ipynb - Notebook for loading and preparing BLN600 for model development

bart.py and llama-2.py are minimal scripts for fine-tuning BART and Llama 2 for post-OCR correction, using a YAML configuration file and a CSV file for training data with 'OCR Text' and 'Ground Truth' columns.

results.ipynb - Notebook for generating and examining post-OCR corrections with fine-tuned models

pip install -r requirements.txt
python bart.py --model {bart-base, bart-large} --config CONFIG --data DATA
python llama-2.py --model {llama-2-7b, llama-2-13b, llama-2-70b} --config CONFIG --data DATA

Citation

@inproceedings{thomas-etal-2024-leveraging,
    title = "Leveraging {LLM}s for Post-{OCR} Correction of Historical Newspapers",
    author = "Thomas, Alan and Gaizauskas, Robert and Lu, Haiping",
    editor = "Sprugnoli, Rachele and Passarotti, Marco",
    booktitle = "Proceedings of the Third Workshop on Language Technologies for Historical and Ancient Languages (LT4HALA) @ LREC-COLING-2024",
    month = "may",
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.lt4hala-1.14",
    pages = "116--121",
}

About

Leveraging LLMs for Post-OCR Correction of Historical Newspapers

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published