Includes:
- Index words
- Store ngrams in a Trie data structure (see the sketch after this list)
- Efficiently extract ngrams and their frequencies
- Compute out-of-vocabulary (OOV) rate
- Compute ngram probabilities with absolute discounting and interpolation smoothing
- Compute Perplexity
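
Below is a minimal sketch of how ngram counts can be stored in a Trie. The class and method names are illustrative, not the repository's actual API:

```python
class TrieNode:
    """One node per word; the path from the root spells out an ngram."""
    def __init__(self):
        self.children = {}   # word -> TrieNode
        self.count = 0       # frequency of the ngram ending at this node

class NgramTrie:
    """Store ngram counts so that shared histories (prefixes) are stored only once."""
    def __init__(self):
        self.root = TrieNode()

    def add(self, ngram):
        """Insert one ngram (a tuple of words), updating counts along the path."""
        node = self.root
        for word in ngram:
            node = node.children.setdefault(word, TrieNode())
            node.count += 1

    def count(self, ngram):
        """Return the frequency of an ngram, or 0 if it was never seen."""
        node = self.root
        for word in ngram:
            node = node.children.get(word)
            if node is None:
                return 0
        return node.count
```

Because every prefix count is updated along the insertion path, adding all trigrams of a corpus also yields the bigram and unigram history counts needed for the probability estimates described below.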
A statistical language model is a probabilistic model that assigns a probability to a sequence of words. It can predict the next word in a sequence given a history context represented by the preceding words.
The probability that we want to model can be factorized using the chain rule as follows:

$$P(w_1, \dots, w_N) = \prod_{i=1}^{N} P(w_i \mid w_0, w_1, \dots, w_{i-1})$$

where $w_0 = \langle s \rangle$ is a special token to denote the start of the sentence.
In practice, we usually use what are called N-Gram models, which make a Markov assumption to limit the history context. Examples of N-Grams are:

- Unigram: $P(w_i)$
- Bigram: $P(w_i \mid w_{i-1})$
- Trigram: $P(w_i \mid w_{i-2}, w_{i-1})$
Using the Maximum Likelihood criterion, these probabilities can be estimated from counts. For example, for the bigram model,

$$P(w_i \mid w_{i-1}) = \frac{N(w_{i-1}, w_i)}{N(w_{i-1})}$$

where $N(\cdot)$ denotes the count of an ngram in the training corpus.
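As a rough, self-contained illustration of this estimate (the helper below is hypothetical, not the repository's code), the bigram MLE probabilities can be computed directly from counts:

```python
from collections import Counter

def mle_bigram_probs(sentences):
    """Estimate P(w_i | w_{i-1}) by relative frequency over tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent
        unigrams.update(tokens[:-1])                   # history counts N(w_{i-1})
        bigrams.update(zip(tokens[:-1], tokens[1:]))   # pair counts N(w_{i-1}, w_i)
    return {(h, w): c / unigrams[h] for (h, w), c in bigrams.items()}

# Example: probs[("<s>", "the")] is the estimated probability that a sentence starts with "the".
probs = mle_bigram_probs([["the", "cat", "sat"], ["the", "dog", "ran"]])
```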
However, this can be problematic for unseen data: the counts will be 0, so the estimated probability is 0 (or undefined when the history itself is unseen). To solve this problem, we use smoothing techniques. There are different smoothing techniques; the one used here is absolute discounting with interpolation.
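For the bigram case, absolute discounting with interpolation has the standard form (written here from the general definition; the exact discount value $d$ used in this repository is a parameter choice):

$$P_{\text{abs}}(w_i \mid w_{i-1}) = \frac{\max\!\left(N(w_{i-1}, w_i) - d,\ 0\right)}{N(w_{i-1})} + \lambda(w_{i-1})\, P_{\text{abs}}(w_i)$$

where $0 \le d \le 1$ is the discount, $\lambda(w_{i-1}) = \frac{d}{N(w_{i-1})}\,\bigl|\{w : N(w_{i-1}, w) > 0\}\bigr|$ is the interpolation weight that redistributes the discounted probability mass, and $P_{\text{abs}}(w_i)$ is the (recursively smoothed) lower-order distribution.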
To measure the performance of a language model, we compute the perplexity of the test corpus using the trained m-Grams:

$$PP = \left[\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-m+1}, \dots, w_{i-1})}\right]^{1/N}$$

where $N$ is the number of running words in the test corpus and $m$ is the order of the model.
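A minimal sketch of this computation, assuming a `prob(history, word)` callable that returns the smoothed m-gram probability (an illustrative interface, not necessarily the repository's):

```python
import math

def perplexity(sentences, prob, m):
    """Compute corpus perplexity from a probability function prob(history, word)."""
    log_sum, n_tokens = 0.0, 0
    for sent in sentences:
        # Pad with start tokens; an explicit end-of-sentence token is omitted for brevity.
        tokens = ["<s>"] * (m - 1) + sent
        for i in range(m - 1, len(tokens)):
            history = tuple(tokens[i - m + 1:i])
            log_sum += math.log(prob(history, tokens[i]))  # accumulate log P
            n_tokens += 1
    return math.exp(-log_sum / n_tokens)                   # PP = exp(-average log prob)
```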
The model was tested on the Europarl dataset (in the `data` directory):
- Test PP with bigrams = 130.09
- Test PP with trigrams = 94.82