Includes:
- Index words
- Store ngrams in a Trie data structure (see the sketch after this list)
- Efficiently extract ngrams and their frequencies
- Compute out-of-vocabulary (OOV) rate
- Compute ngram probabilities with absolute discounting and interpolation smoothing
- Compute Perplexity
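
Below is a minimal sketch of how ngram counts can be stored in a Trie. The class and method names are illustrative, not the repository's actual API:

```python
class TrieNode:
    """One node per word; the path from the root spells out an ngram."""
    def __init__(self):
        self.children = {}   # word -> TrieNode
        self.count = 0       # frequency of the ngram ending at this node

class NgramTrie:
    """Store ngram counts so that shared histories (prefixes) are stored only once."""
    def __init__(self):
        self.root = TrieNode()

    def add(self, ngram):
        """Insert one ngram (a tuple of words), updating counts along the path."""
        node = self.root
        for word in ngram:
            node = node.children.setdefault(word, TrieNode())
            node.count += 1

    def count(self, ngram):
        """Return the frequency of an ngram, or 0 if it was never seen."""
        node = self.root
        for word in ngram:
            node = node.children.get(word)
            if node is None:
                return 0
        return node.count
```

Because every prefix count is updated along the insertion path, adding all trigrams of a corpus also yields the bigram and unigram history counts needed for the probability estimates described below.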
A statistical language model is a probabilistic model that assigns a probability to a sequence of words. It can predict the next word in a sequence given a history context represented by the preceding words.
The probability that we want to model can be factorized using the chain rule as follows:

$$P(w_1, \dots, w_N) = \prod_{i=1}^{N} P(w_i \mid w_0, w_1, \dots, w_{i-1})$$

where $w_0 = \langle s \rangle$ is a special token to denote the start of the sentence.
In practice, we usually use what are called N-Gram models, which make a Markov assumption to limit the history context. Examples of N-Grams are:

- Unigram: $P(w_i)$
- Bigram: $P(w_i \mid w_{i-1})$
- Trigram: $P(w_i \mid w_{i-2}, w_{i-1})$
Using the Maximum Likelihood criterion, these probabilities can be estimated from counts. For example, for the bigram model,

$$P(w_i \mid w_{i-1}) = \frac{N(w_{i-1}, w_i)}{N(w_{i-1})}$$

where $N(\cdot)$ denotes the count of an ngram in the training corpus.
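As a rough, self-contained illustration of this estimate (the helper below is hypothetical, not the repository's code), the bigram MLE probabilities can be computed directly from counts:

```python
from collections import Counter

def mle_bigram_probs(sentences):
    """Estimate P(w_i | w_{i-1}) by relative frequency over tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent
        unigrams.update(tokens[:-1])                   # history counts N(w_{i-1})
        bigrams.update(zip(tokens[:-1], tokens[1:]))   # pair counts N(w_{i-1}, w_i)
    return {(h, w): c / unigrams[h] for (h, w), c in bigrams.items()}

# Example: probs[("<s>", "the")] is the estimated probability that a sentence starts with "the".
probs = mle_bigram_probs([["the", "cat", "sat"], ["the", "dog", "ran"]])
```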
However, this can be problematic for unseen data: the counts will be 0, so the estimated probability is 0 (or undefined when the history itself is unseen). To solve this problem, we use smoothing techniques. There are different smoothing techniques; the one used here is absolute discounting with interpolation.
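For the bigram case, absolute discounting with interpolation has the standard form (written here from the general definition; the exact discount value $d$ used in this repository is a parameter choice):

$$P_{\text{abs}}(w_i \mid w_{i-1}) = \frac{\max\!\left(N(w_{i-1}, w_i) - d,\ 0\right)}{N(w_{i-1})} + \lambda(w_{i-1})\, P_{\text{abs}}(w_i)$$

where $0 \le d \le 1$ is the discount, $\lambda(w_{i-1}) = \frac{d}{N(w_{i-1})}\,\bigl|\{w : N(w_{i-1}, w) > 0\}\bigr|$ is the interpolation weight that redistributes the discounted probability mass, and $P_{\text{abs}}(w_i)$ is the (recursively smoothed) lower-order distribution.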
To measure the performance of a language model, we compute the perplexity of the test corpus using the trained m-Grams:

$$PP = \left[\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-m+1}, \dots, w_{i-1})}\right]^{1/N}$$

where $N$ is the number of running words in the test corpus and $m$ is the order of the model.
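A minimal sketch of this computation, assuming a `prob(history, word)` callable that returns the smoothed m-gram probability (an illustrative interface, not necessarily the repository's):

```python
import math

def perplexity(sentences, prob, m):
    """Compute corpus perplexity from a probability function prob(history, word)."""
    log_sum, n_tokens = 0.0, 0
    for sent in sentences:
        # Pad with start tokens; an explicit end-of-sentence token is omitted for brevity.
        tokens = ["<s>"] * (m - 1) + sent
        for i in range(m - 1, len(tokens)):
            history = tuple(tokens[i - m + 1:i])
            log_sum += math.log(prob(history, tokens[i]))  # accumulate log P
            n_tokens += 1
    return math.exp(-log_sum / n_tokens)                   # PP = exp(-average log prob)
```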
The model was tested on the Europarl dataset (in the `data` directory):
- Test PP with bigrams = 130.09
- Test PP with trigrams = 94.82