Skip to content

Instruction about building a KenLM model based on Ukrainian Wikipedia data

Notifications You must be signed in to change notification settings

egorsmkv/ukwiki-kenlm

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

15 Commits
 
 
 
 
 
 
 
 

Repository files navigation

ukwiki-kenlm

Instruction about building a KenLM model based on Ukrainian Wikipedia data

Install

You need Python 3 and pip for text processing.

Libraries

pip install mwxml tqdm

Steps

Step 1: Find the latest backup for Ukrainian Wikipedia data

Go to https://dumps.wikimedia.org/backup-index.html and find the "ukwiki: Dump complete" line. Follow the "ukwiki" link.

Step 2: Find "Recombine all pages, current versions only." link

On the page, find the "Recombine all pages, current versions only." link and download BZ2 archive.

Step 3: Uncompress the BZ2 archive

Step 4: Extract text data from the "*-pages-meta-current.xml" file

python extract_text_from_dump.py ukwiki-20220701-pages-meta-current.xml > uncleaned_text.txt

Step 5: Clean uncleaned_text.txt file

python cleaner.py --corpus-path uncleaned_text.txt --corpus-clean cleaned_text.txt --n-workers 5 --min-words 2

Step 6: Install KenLM

sudo apt install build-essential cmake libboost-system-dev libboost-thread-dev libboost-program-options-dev libboost-test-dev libeigen3-dev zlib1g-dev libbz2-dev liblzma-dev

wget -O - https://kheafield.com/code/kenlm.tar.gz | tar xz

mkdir kenlm/build && cd kenlm/build && cmake .. && make -j2

Step 7: Build a KenLM in the ARPA format

kenlm/build/bin/lmplz -o 5 < "cleaned_text.txt" > "uk_wiki.arpa"

Step 8: Fix a KenLM model

python fix_kenlm.py --arpa-file-in uk_wiki.arpa --arpa-file-out uk_wiki_corrected.arpa

Step 9: Build a KenLM in the binary format

kenlm/build/bin/build_binary uk_wiki_corrected.arpa uk_wiki_corrected.bin

About

Instruction about building a KenLM model based on Ukrainian Wikipedia data

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages