Fraktur German/Peter Kolb transcription project

This repo contains code and output from my project at the Stanford Center for Spatial and Textual Analysis (CESTA). The project consisted of transforming German explorer Peter Kolb (or Kolbe)'s 1719 book on the Cape of Good Hope from scanned images into machine-readable, editable text files. Versions of these files are in the output-txt folder.

Key Resources

Summary

In this project I was tasked with taking Kolb's 1719 book, one of the most thorough and most frequently cited primary sources on the experiences and perspectives of early European explorers and settlers at the Cape of Good Hope, from scanned images of the original print to text files compatible with modern research and, specifically, digital humanities methods. This required getting the text into machine-readable form.

Text Extraction

The two main requirements for a transcription tool were:

  1. understanding early-modern Fraktur German writing, and
  2. recognizing the reading order between columns and headers

Most of the readily accessible tools I explored came up short in some way:

| Tool | Strength | Issues |
| --- | --- | --- |
| ABBYY FineReader, Google Docs, Adobe Acrobat | pretty good at recognizing Fraktur print | failed to recognize different column regions |
| Transkribus | pretty good on Fraktur, and customizable too | time-consuming if used on the whole book |
| Gemini, ChatGPT-4o, trOCR, Claude | pretty good overall on a small subset of pages | unreliable generation in large batches; training would be too time-consuming |
| Tesseract, PyMuPDF | good at text region processing | lacked language training for Fraktur German |

The focus of my task, and the purpose of the larger Early Cape Travelers research project, was not to develop high-accuracy text tools but rather to produce the best possible version of this book within my internship time. With this in mind, I ended up going with a combination of GCP Document AI and Transkribus.

My process for text extraction is visualized below; the deciding factor between a page being tagged as Group A vs. Group B was how systematically I could code up coordinates to crop its text regions.

[Diagram: text-extraction workflow showing how pages were sorted into Group A vs. Group B]
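
For pages whose layout was regular enough to crop by fixed coordinates, the cropping idea looks roughly like the sketch below. This is an illustrative snippet using Pillow, not one of the repo's scripts; the coordinates, file names, and region layout are made-up assumptions.

from PIL import Image

def crop_text_regions(page_path, boxes):
    # Crop each (left, upper, right, lower) box out of the page image.
    page = Image.open(page_path)
    return [page.crop(box) for box in boxes]

# Hypothetical two-column layout: a header strip plus left and right columns.
boxes = [
    (100, 80, 1900, 200),     # header
    (100, 220, 1000, 2800),   # left column
    (1000, 220, 1900, 2800),  # right column
]

for i, region in enumerate(crop_text_regions("page_0012.jpg", boxes)):
    region.save(f"page_0012_region{i}.jpg")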

Text Cleaning

Next, I moved into post-processing the extracted text with a handful of open-source NLP tools. After smaller corrections, I faced the need to spell-check the 550k words in the corpus. I found that the other spell-checking tools I tried[1] did not seem to handle the historical vocabulary, so I ended up creating my own dictionary of German words from 1500-1800 to feed into PySpellChecker.
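
As a rough illustration of that step, the sketch below builds a PySpellChecker instance from custom word-list files; this is not the repo's spellchecker.py, and the folder name and example tokens are hypothetical.

from pathlib import Path
from spellchecker import SpellChecker

spell = SpellChecker(language=None)  # start with no built-in dictionary

# Load the custom historical-German vocabulary as the spell-checking corpus.
for wordlist in Path("historical-german-wordlists").glob("*.txt"):
    spell.word_frequency.load_text_file(str(wordlist))

tokens = ["Hoffnung", "Hoffnnng"]    # hypothetical tokens from the OCR output
print(spell.unknown(tokens))         # tokens not found in the corpus
print(spell.correction("Hoffnnng"))  # closest known word, if any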

Finally, I was able to bring all text, images, and tables together into a Docx document that is readable, editable, and searchable for specific content depending on the research goals. This document is in the output-txt folder, along with the plain-text files for every page.
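
A minimal sketch of this compilation step using python-docx (installed in the setup section below) is shown here; it is not the repo's create_docx.py, and the folder and output file names are assumptions.

from pathlib import Path
from docx import Document

doc = Document()
# Hypothetical folder of per-page txt files, one file per non-blank page.
for txt_file in sorted(Path("merged-pages").glob("*.txt")):
    doc.add_heading(f"Page {txt_file.stem}", level=2)
    doc.add_paragraph(txt_file.read_text(encoding="utf-8"))
    doc.add_page_break()
doc.save("kolb_1719_compiled.docx")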

[1] e.g. DeepL Translator, LanguageTool, Sapling, Arvin, MentorDunden, Google Translate

Script pipeline

Below is the order in which I used the scripts in this repo during my text-processing pipeline.

  1. text-extraction/jpeg_conversion.py
    • convert the jp2 files (from Internet Archive) into JPEGs
  2. text-extraction/extraction_gcp.py
    • run the GCP processor on the ‘main’ page group JPEGs; save the txt files
  3. text-extraction/pdf_splitter.py
    • create a subset PDF for each page group, to be processed on Transkribus
  4. text-extraction/jpeg_duplicator.py
    • copy the JPEGs into a folder for each page group, to be processed on Transkribus
  5. text-processing/reindex_txt_names.py
    • rename the Transkribus-exported txt files to be 1-indexed (not 0-indexed)
  6. text-processing/pp_maingroup.py
    • post-process the ‘main’ page group in two levels (line-by-line cleaning, line-by-line correction)
  7. text-processing/pp_index.py
    • post-process the ‘index’ page group in three levels (basic corrections, non-special, special)
  8. text-processing/pp-othergroups.py
    • post-process all other page groups in two levels (line-by-line cleaning, line-by-line correction)
    • NOTE: process one page group at a time; customize the script paths accordingly
  9. text-correction/unhyphenate.py
    • unhyphenate the txt files as prep for running the spellchecker (see the sketch after this list)
  10. text-correction/spellchecker.py
    • run the spellchecker program
      • run one of the corpus-creation options first (to feed the spell checker)
      • once a corpus exists, run a spellchecker option with either hard-coded or user-provided input/output paths
  11. text-processing/page_order_mapmaker.py
    • create a mapping file for each page group
  12. text-processing/page_order_mapreader.py
    • read each mapping file and copy each page group into a single ‘merged’ folder
  13. text-compilation/create_docx.py
    • create simple/formatted versions of the compiled Docx for each non-blank page
  14. text-correction/manual_check.py
    • optionally, run a manual review of unknown words
  15. text-compilation/manualcheck_update.py
    • optionally, update the list of unknown words after a manual review
  16. text-correction/manualcheck_corpus.py
    • optionally, run a manual review of the list of corpus words
      • since this script auto-updates, there is no need for a separate update script
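
The unhyphenation idea from step 9 can be sketched as follows; this is an illustration of the concept, not the repo's unhyphenate.py.

import re

def unhyphenate(text):
    # Join words split across a line break with a trailing hyphen,
    # e.g. "Hoff-\nnung" -> "Hoffnung"; hyphens not at line ends are kept.
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)

print(unhyphenate("der Guten Hoff-\nnung"))
# -> "der Guten Hoffnung"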

Page Group Breakdown

If you are interested in which pages I grouped together, the page numbers for each group are listed below. These page numbers are the PDF page numbers from the source PDF.

For example, the first page of the book has page number 1 (not 0); it belongs to the "pg_img-new" list below, and its txt file would be 01.txt. However, since that page is a pure image page with no text, no txt file was created for it.

Note that the page group "main" consists of every page NOT in one of the lists below (a sketch of how to derive it follows the listing).

group_lists = {
    "pg_begin": [i for i in range(9, 23)],
    "pg_toc": [23, 24, 25, 26, 27],
    "pg_starts": [29, 31, 32, 36, 46, 55, 56, 68, 86, 105, 121, 136, 151, 165, 212, 231, 256, 270, 281, 304, 317, 330, 341, 347, 364, 391, 409, 420, 444, 450, 482, 488, 500, 509, 516, 528, 543, 554, 562, 579, 586, 606, 621, 629, 634, 644, 653, 666, 683, 727, 728, 729, 735, 744, 760, 798, 807, 817, 830, 840, 849, 855, 863, 873, 881, 887, 904],
    "pg_tables": [404, 405, 406, 407, 408],
    "pg_index": [i for i in range(905, 985)],
    "pg_other": [706, 874],
    "pg_dict": [985, 986],
    "pg_skip": [58],
    "pg_img": [8, 76, 140, 170, 177, 192, 201, 210, 218, 236, 240, 456, 473, 491, 519, 523, 529, 557, 569, 576, 590, 600, 626, 647, 718],
    "pg_img-new": [1, 2, 5, 8, 76, 140, 170, 177, 192, 201, 210, 218, 236, 240, 456, 473, 491, 519, 523, 529, 557, 569, 576, 590, 600, 626, 647, 718],
    "pg_blank1": [i for i in range(1, 8)],
    "pg_blank2": [10, 18, 28, 77, 141, 171, 176, 193, 200, 211, 219, 237, 241, 474, 492, 520, 524, 530, 558, 570, 575, 589, 599, 625, 648, 719, 987, 988, 989, 990, 991, 992]
}
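
For reference, the ‘main’ group can be derived from group_lists as in the following sketch; the total page count of 992 is an assumption based on the largest page number appearing in the lists above, not a value taken from the repo.

TOTAL_PAGES = 992  # assumed from the largest page number appearing above

grouped = set()
for pages in group_lists.values():
    grouped.update(pages)

# Every PDF page not assigned to any of the lists above belongs to "main".
pg_main = [p for p in range(1, TOTAL_PAGES + 1) if p not in grouped]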

Spellchecker Options

The spellchecker has various options that can be selected from the terminal when it is run. The user can select a combination of the following features for their spellchecker:

  • use the hard-coded list of files to be spellchecked, or provide a folder with your own list
  • create a brand-new word list/frequency dict or use a pre-existing one
    • save or discard the frequency dict created (if not using a pre-existing one)
  • run the spellchecking using either a word list or a frequency dict as the corpus (word_frequency.load_words(data) vs word_frequency.load_json(data) in the PySpellChecker API; a sketch of this distinction follows the code snapshot below)
  • include or exclude the CLARIN archival dataset as a corpus for spellchecking (that dataset comes from a medical context and may not be relevant to the book's contents)

I have included a snapshot of the relevant code to help visualize this part of the program.

if path_choice == "y":
    txt_folder = '/Users/samxp/Documents/CESTA-Summer/output-txt/from-transkribus/index/pp1-merged'
    txt_folder_save = '/Users/samxp/Documents/CESTA-Summer/output-txt/from-transkribus/index/pp2-merged-pyspck'
elif path_choice == "n":
    txt_folder = input("Enter path for folder of txt files: ")
    txt_folder_save = input("Enter path to folder for saving: ")

select = int(run_choice)
corpus_path = corpus_path_r # optional, use manually-updated corpus list rather than code-produced one  
if select == 1:
    # Load the corpus/freq, NO CLARIN DATA
    word_list = load_corpus(corpus_path)
    word_freq = load_freq(freq_path)
elif select == 2:
    # Load the corpus/freq, YES CLARIN DATA
    word_list = load_corpus(corpus_path)
    word_freq = load_freq(freq_path, load_clarin=True)
elif select == 3:
    # Load corpus + Update freq, CHECK CLARIN DATA
    word_list = load_corpus(corpus_path)
    word_freq = load_freq(freq_path, load_clarin=True, upd_corpus=word_list, save_update=False)
elif select == 4:
    # Create + Load corpus ONLY
    load_and_preprocess_files(directories, corpus_path, freq_path, save_freq=False)
    word_list = load_corpus(corpus_path)
elif select == 5:
    # Create + Load corpus/freq, CHECK CLARIN DATA
    load_and_preprocess_files(directories, corpus_path, freq_path, save_freq=True)
    word_list = load_corpus(corpus_path)
    word_freq = load_freq(freq_path, load_clarin=True)
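
To make the word list vs. frequency dict distinction concrete, below is a minimal PySpellChecker sketch of the two loading calls; the flag and the example data are made up for illustration and do not come from spellchecker.py.

from spellchecker import SpellChecker

spell = SpellChecker(language=None)  # empty checker, no built-in dictionary

use_frequency_dict = True  # hypothetical flag mirroring the terminal option
if use_frequency_dict:
    # frequency dict (word -> count), loaded via load_json
    spell.word_frequency.load_json({"hoffnung": 120, "vorgebirge": 45})
else:
    # plain word list, every word weighted equally, loaded via load_words
    spell.word_frequency.load_words(["hoffnung", "vorgebirge"])

print(spell.correction("hoffnnng"))  # -> "hoffnung"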

Setting Up the Virtual Environment

To ensure that all necessary dependencies are installed and isolated from your global Python environment, it's recommended to use a virtual environment. Follow the steps below to create and use a virtual environment for this project.

Run the following command to create a virtual environment in the project directory:

python -m venv venv

Run the following command to activate the virtual environment (for Mac):

source venv/bin/activate

Run the following commands to install the requirements in the virtual environment (for Mac):

pip install -r requirements.txt
pip install python-docx

Run the following command to deactivate the virtual environment (for Mac):

deactivate

Recommended Next Steps

  • Run the manual check program with a German speaker to vet unknown words, and then run the updating version of the script.
  • Continue expanding the corpus collection to add to the three file sets used.
  • Possibly create a Python package for the spellchecker to optimize its functionality.

Warnings

  • Some pages that hold illustrations also contain minor text serving as a description of the illustration. However, of these pages, only page 17 (0017.txt) was run through the text extraction process (an oversight on my part). Future versions of this project should process all pages in the "pg_img-new" group that also have text.
  • The source file from Internet Archive is NOT complete. Content is missing around page numbers 57-59, as can be seen from the printed page numbers in the top margins not lining up in sequence. In my work I decided to skip page 58.

Author

Feel free to contact me on GitHub (@samprietoserrano), by email, or on LinkedIn (linkedin/samprietoserrano).