dta-lexdb-applications

Formatting and integrating the Deutsches Textarchiv dictionary into various applications

Deutsches Textarchiv (DTA) is a large collection of curated and manually corrected reference corpora in New High German from the 17th to 20th century.

LexDB is a collection of lexical databases (i.e. dictionaries) distilled from DTA by the BBAW. Each entry includes the full form, lemma, normalized orthography and part-of-speech tag.

This repository provides scripts to extract and re-format dictionaries for re-use in other applications. The results will be available as GitHub release assets.
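
To illustrate the kind of extraction meant here, the following is a minimal sketch that pulls one column out of a tab-separated export and deduplicates it. The helper name `extract_column` and the column layout (full form, lemma, normalized form, part-of-speech tag) are assumptions made for this example; the repository's own `sql2wordlist.sh` may work differently.

```shell
# Hypothetical helper (not part of the repository): print the unique,
# non-empty values of one column of a tab-separated LexDB export.
# Assumed column layout: 1 = full form, 2 = lemma,
# 3 = normalized form, 4 = part-of-speech tag.
# Usage: extract_column EXPORT.tsv COLUMN_NUMBER
extract_column() {
    awk -F '\t' -v col="$2" 'NF >= col && $col != "" { print $col }' "$1" |
        LC_ALL=C sort -u
}

# Example call (file name is an assumption):
#   extract_column lexdb_export.tsv 1 > forms.words
```

Sorting in the C locale keeps the output order independent of the user's locale settings, which makes the resulting word lists reproducible across machines.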

Tesseract OCR models with added language model

Tesseract models can be amended with a simple language model by providing dictionaries or grammars for punctuation, numbers and words. This applies both to the originally provided models, which were trained on synthetic data, and to the community-generated ones, which were fine-tuned on annotated scan data or trained from scratch.

We will pick publicly available models for German Antiqua and Fraktur prints, as well as handwriting, and republish them with DTA as language model.
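
As a rough sketch of how such an amendment could be done with Tesseract's own training tools: unpack the model, compile each list into a DAWG against the model's LSTM unicharset, and write the DAWGs back. The helper name `amend_model` and its argument order are hypothetical, and this is not necessarily what the repository's `combine_tessdata.sh` does. Options of `wordlist2dawg` (such as the reversal policy) are left at their defaults here.

```shell
# Hypothetical helper (not the repository's combine_tessdata.sh).
# Requires the Tesseract training tools combine_tessdata and wordlist2dawg.
# Usage: amend_model IN.traineddata WORDS PUNC NUMBERS OUT.traineddata
amend_model() {
    model=$1 words=$2 punc=$3 numbers=$4 out=$5
    work=$(mktemp -d) || return 1
    prefix=$work/model.
    # Work on a copy so the original model stays untouched.
    cp "$model" "$out" || return 1
    # Unpack all components of the model into $work/model.*
    combine_tessdata -u "$out" "$prefix" || return 1
    # Compile each list into a DAWG against the model's LSTM unicharset.
    wordlist2dawg "$words" "${prefix}lstm-word-dawg" "${prefix}lstm-unicharset" || return 1
    wordlist2dawg "$punc" "${prefix}lstm-punc-dawg" "${prefix}lstm-unicharset" || return 1
    wordlist2dawg "$numbers" "${prefix}lstm-number-dawg" "${prefix}lstm-unicharset" || return 1
    # Write the three DAWG components back into the copied model.
    combine_tessdata -o "$out" \
        "${prefix}lstm-word-dawg" \
        "${prefix}lstm-punc-dawg" \
        "${prefix}lstm-number-dawg" || return 1
    rm -rf "$work"
}

# Example call (the word list name is taken from the Makefile, the output name is an assumption):
#   amend_model frak2021.traineddata dta_lexdb_10.words deu.punc deu.numbers frak2021_dta.traineddata
```

The amended file can then be used like any other model, for example by placing it in a tessdata directory and selecting it with the `-l` option of `tesseract`.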

For the currently selected models, see dta-lexdb-applications/Makefile (lines 13 to 34 in 83e5d5c):

    TESS_MODELS := frak2021 GT4HistOCR ONB Fraktur_5000000 german_print frk Fraktur

    GT4HistOCR.traineddata:
    	wget -O $@ https://ub-backup.bib.uni-mannheim.de/~stweil/tesstrain/GT4HistOCR/tessdata_best/GT4HistOCR.traineddata

    frak2021.traineddata:
    	wget -O $@ https://ub-backup.bib.uni-mannheim.de/~stweil/tesstrain/frak2021/tessdata_best/frak2021-0.905.traineddata

    Fraktur_5000000.traineddata:
    	wget -O $@ https://ub-backup.bib.uni-mannheim.de/~stweil/ocrd-train/data/Fraktur_5000000/tessdata_fast/Fraktur_50000000.334_450937.traineddata

    ONB.traineddata:
    	wget -O $@ https://ub-backup.bib.uni-mannheim.de/~stweil/ocrd-train/data/ONB/tessdata_best/ONB_1.195_300718_989100.traineddata

    german_print.traineddata:
    	wget -O $@ https://ub-backup.bib.uni-mannheim.de/~stweil/tesstrain/german_print/german_print.traineddata

    frk.traineddata:
    	wget -O $@ https://github.com/tesseract-ocr/tessdata_fast/raw/main/frk.traineddata

    Fraktur.traineddata:
    	wget -O $@ https://github.com/tesseract-ocr/tessdata_fast/raw/main/script/Fraktur.traineddata

Hunspell

Hunspell is a widely used dictionary-based, morphology-aware spell checker.

We will produce a DTA dictionary for it.

For the currently selected rules, see dta-lexdb-applications/Makefile (lines 60 to 63 in 83e5d5c):

    de-dta.dic: dta_lexdb_10.words
    	wc -l < $< > $@
    	# to do: combine DTA lemmatization and contemporary affixation to a historic affixation system (instead of fixed word list)
    	grep -v -e '^[[:punct:]]' -e '^[[:digit:][:punct:]]*$$' $< | sort -u >> $@
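
The rule above writes a line count as the first line of the `.dic` file, followed by the filtered and deduplicated word list. Below is a standalone shell sketch of the same idea that can be tried outside of `make`. The helper name `make_dic` is hypothetical. Unlike the Makefile rule, this variant counts the entries after filtering, so that the header matches the number of entries actually written.

```shell
# Hypothetical helper (not part of the repository): build a Hunspell .dic
# file from a plain word list with one entry per line.
# Usage: make_dic WORDLIST OUTPUT.dic
make_dic() {
    words=$1
    dic=$2
    tmp=$(mktemp) || return 1
    # Drop entries that start with punctuation and entries that consist
    # only of digits and punctuation (this also drops empty lines),
    # then deduplicate.
    grep -v -e '^[[:punct:]]' -e '^[[:digit:][:punct:]]*$' "$words" |
        LC_ALL=C sort -u > "$tmp"
    # First line of a .dic file: the number of entries that follow.
    wc -l < "$tmp" | tr -d ' ' > "$dic"
    cat "$tmp" >> "$dic"
    rm -f "$tmp"
}

# Example call:
#   make_dic dta_lexdb_10.words de-dta.dic
```

With `de-dta.dic` and `de-dta.aff` in the current directory, the dictionary could then be tried with `hunspell -d ./de-dta -l < text.txt`, which lists the words Hunspell does not recognize.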

...

Others to come. Please raise an issue if you have ideas!

bertsky/dta-lexdb-applications

Folders and files

Latest commit

History

Repository files navigation

dta-lexdb-applications

Tesseract OCR models with added language model

Hunspell

...

About

Topics

Resources

Stars

Watchers

Forks

Releases 4

Packages 0

Languages

Packages