Learns an n-gram language model given a corpus. The corpus should be a text file with a single word per line and no inter-word spaces.
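For illustration, the first few lines of a valid corpus file might look like this (hypothetical contents; any words over the accepted alphabet will do):

    washington
    wrote
    1776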
The learned quantities are (an estimation sketch follows this list):
- Probabilities of unigrams, p(g_i)
- Probabilities of bigrams, p(g_i | g_{i-1})
- Probabilities of trigrams, p(g_i | g_{i-1}, g_{i-2})
- Probabilities of quadgrams, p(g_i | g_{i-1}, g_{i-2}, g_{i-3})
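These are maximum-likelihood estimates computed from counts. Below is a minimal sketch of the bigram case; train_bigrams is an illustrative name, not the script's actual internals, and the real script may smooth or normalize differently:

    from collections import Counter

    def train_bigrams(corpus):
        """Maximum-likelihood estimate of p(g_i | g_{i-1}) from a list of words."""
        context_counts = Counter()
        bigram_counts = Counter()
        for word in corpus:
            for prev, cur in zip(word, word[1:]):
                context_counts[prev] += 1
                bigram_counts[(prev, cur)] += 1
        # p(cur | prev) = count(prev, cur) / count of prev appearing as a context
        return {(p, c): n / context_counts[p] for (p, c), n in bigram_counts.items()}

    # Example: probabilities from a toy two-word corpus
    probs = train_bigrams(["abra", "cadabra"])
    print(probs[("a", "b")])  # 2/3: 'ab' occurs twice, 'a' is a context three times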
Test the script by running with no argument:
python3 ngramModelTrainer
Use the -h flag for details on usage and expected input:
python3 ngramModelTrainer -h
There are a few example inputs in fixtures/.
The output is saved as four MATLAB matrices (see the loading sketch after this list).
- unigrams: u(i) stands for p(i).
- bigrams: b(i, j) stands for p(j | i).
- trigrams: t(i, j, k) stands for p(k | j, i).
- quadgrams (tetragrams): q(i, j, k, l) stands for p(l | k, j, i).
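To illustrate the indexing convention, the matrices can be loaded back in Python with scipy. The file name model.mat and the variable keys are assumptions based on the matrix names above; substitute the script's actual output path:

    from scipy.io import loadmat

    model = loadmat("model.mat")  # placeholder path
    u = model["unigrams"].squeeze()  # savemat stores vectors as 2-D; u[i] is p(i)
    b = model["bigrams"]             # b[i, j] is p(j | i)
    t = model["trigrams"]            # t[i, j, k] is p(k | j, i)
    q = model["quadgrams"]           # q[i, j, k, l] is p(l | k, j, i)

    # Note: the loaded numpy arrays are 0-indexed, so under the default
    # alphabet index 0 is 'a' and index 26 is '0'.
    print(b[0, 1])  # p('b' | 'a')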
An alphabet of acceptable unigrams must be defined. By default, the tool uses an alphabet of 36 possible letters/digits, held in a Python list called 'alphabet', in the following order (see the sketch after this list):
- Positions 0-25: Latin lowercase alphabet letters, in standard alphabetical order.
- Positions 26-35: Digits 0-9.
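As a sketch, the default list can be reproduced with the standard library (the script itself may define it literally):

    import string

    # Positions 0-25: 'a'..'z'; positions 26-35: '0'..'9'
    alphabet = list(string.ascii_lowercase + string.digits)
    assert len(alphabet) == 36
    assert alphabet[0] == "a" and alphabet[25] == "z" and alphabet[26] == "0"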
Non-'standard' versions of the above alphabet may also be used. These include:
- dutta_extended: a number of extra characters (notably, encodings of the characters and punctuation found in the George Washington handwritten document set).
- sophia: polytonic Greek characters.
- dummy: a limited testing set of 3 characters.