\chapter{Chapter 1\newline Introduction to $n$-gram language modelling}\label{chap:introlm}
\newthought{Statistical language modelling is a technique} that attempts to measure how likely any given piece of text is, by building a probabilistic model and assigning a probability value to the text. These models are generally very large and are estimated from very large collections of documents.
A statistical language model is a probability distribution $P(s)$ over possible sentences $s$.\footnote{In this thesis sentences represent linguistic units, such as spoken utterances, or in some cases even documents.} Its task can be illustrated with an example from automated speech recognition (ASR)\footnote{ASR is the process of converting a speech signal to a sequence of words, by means of an algorithm implemented as a computer program.}. Imagine that one day you walk along the beach, and you find a bottle. Although that in itself is not really surprising, you also find that the bottle contains a note. You can make out the script; the language is unknown to you, but from the address you recognise the sender's country. Eager as you are, you want to decipher the message. With sophisticated translation services such as Google Translate you could at least get an idea of what the note says in an instant. These translation services work bidirectionally, but you decide to learn the language, so as to write your reply.
One of the first things you will probably do is to get started with some basic words to fill your vocabulary. We call the vocabulary $V$, and the number of words you have learned already $W$.\footnote{In other words $W=|V|$, where $|\cdot|$ denotes the cardinality of a set.} In learning to count, you first have to learn the translation of \emph{one}. Then you can go on to learn \emph{two}. By the time you get to \emph{ten}, you have counted up to \emph{nine} a lot of times already. Counting is then not only a matter of coming up with the right translation, but it is also about memory. Coming up with \emph{ten} is probably easier if you have just counted to \emph{nine}.
For statistical language models this is the same. Producing the correct translation of a word $w$ without any prior knowledge succeeds with probability $p(w)=\frac{1}{W}$. If we tell the system that it has to produce the translation of a number between \emph{one} and \emph{ten}, then the chance that the system gets it right, provided it memorised all ten digits, is $p(w)=\frac{1}{10}$. If $W$ is much larger than $10$, this is a real improvement. If we now ask the system to finish the sequence \emph{one} \emph{two} \emph{three} $\ldots$, it can choose between two strategies. The first is to simply fill in one of the $W$ words. A smarter approach, however, is to use the information in the sequence. Let us call this sequence $s = (w_1, w_2, w_3)$, and say we want to predict $w_4$. We want the word $w_4$ that, out of all words in $V$, has the highest likelihood of following $s$: $w_4 = \argmax_{w\in V} p(w|s)$. In this way a statistical language model lets us investigate how likely a word is given its context, and if we do that for each word we know, we can also pick the best one.
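For illustration, a minimal sketch of this $\argmax$ selection over a toy conditional distribution (the words and probabilities are made up and do not come from a trained model):
\begin{verbatim}
# Toy conditional distribution p(w | s) for the context
# s = ("one", "two", "three"); the values are invented.
p_given_s = {"four": 0.55, "ten": 0.02, "one": 0.05, "banana": 0.01}

# w4 = argmax over the (toy) vocabulary of p(w | s)
w4 = max(p_given_s, key=p_given_s.get)
print(w4)  # -> "four"
\end{verbatim}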
But this method of choosing the most likely word also has some downsides. First, since languages have an unbounded number of words, it is impossible to know the complete vocabulary. An incomplete vocabulary has the consequence that not only do you not recognise a word, you also cannot evaluate its likelihood. The second problem is that since we model probabilities of words, we are never sure. This is especially true if we also want to model out-of-vocabulary (OOV) words\footnote{Out-of-vocabulary words are words that we do not know, or choose to ignore when building the model. We often do include them collectively as a special word, ``the unknown word''.}, for which we must then sacrifice a part of the probability space. For a human it is hard to say exactly how likely $p(w_4|s)$ is, but for a computer this is just the result of a computation. Another downside of having an unbounded number of words is that the vocabulary can get very large. As we saw, we do not only want to store the words, but also other information that allows us to make an educated guess. This ranges from knowledge of the previous words to syntax rules, or knowledge of the specific domain for which you are computing the word probabilities. All this context information takes a lot of memory, both for the translator and for the computer.
In the remainder of this thesis, we assume that we learn the language by reading a lot. This ignores other valuable resources such as grammar, speech, and cultural influences. But we have a dictionary, so we can at least get an idea of what's in the text. Now we focus on two things. First we want to recognise\footnote{By checking if it is in the dictionary.} the words, and second, we want to recognise patterns that precede a word. This will be our approximation of the language. This approximation of patterns and their words can be modelled in many ways. In the next section we introduce the models that are the foundation of this thesis.
%\begin{itemize}
% \item What is SLM?
% \item Why do we do it?
% \item What else is there?
%\end{itemize}
\section{Patterns in Language}
The goal of a statistical language model is to assign a probability to a sequence of words. Say we have a sentence $s$ of $m=5$ words: \emph{Bananas are his favourite fruit.} If we want to compute the query likelihood of $s$, $p(s)$, we have to start at the first word: \emph{Bananas}. For now assume that we know every word that we might encounter. What is the chance that the first word is \emph{Bananas}? $p(\mathit{Bananas})$. The same goes for the second word: the chance of the second word being \emph{are} is $p(\mathit{are})$. We can do this for every word $w_i$ in $s$. For a unigram language model, this results in the product of the individual word probabilities\footnote{$p(s)=\prod_{i=1}^m p(w_i)$}
\begin{equation}
p(s) = p(\mathit{Bananas})p(\mathit{are})p(\mathit{his})p(\mathit{favourite})p(\mathit{fruit})\label{eq:unigramexample}
\end{equation}
In \cref{eq:unigramexample} we use our knowledge of how likely each of the words is. But as is clear from the example, it takes into account neither the word order\footnote{The sentence \emph{his are favourite Bananas fruit} yields the same query likelihood, but lacks semantic and syntactic coherence.}, nor the context in which the words were said.
This simple and naive language model is called a unigram language model\footnote{Henceforth, if we refer to \emph{grams} we mean words, unless stated otherwise.}. In practice unigram models are not used, because they are a bad approximation of how texts are generated, and thus yield bad word likelihood scores.
%
The other end of the spectrum is to use all the information that one has to predict the next word. This is of course not possible to store in memory, let alone to do computations with.
%
This would not only require knowledge of all written and spoken texts ever produced, but also insight into the mind of the writer or speaker, alongside the context in which the text was uttered. Texts produced by humans are always sensitive to context, as they are used to convey a message. However hard it is for humans to understand the message (and to be able to predict the probability of the next word), for the computer it is even harder.
%\begin{equation}
%p(s) =
%\end{equation}
As we can see, words as patterns without any context are too limited as a language model.
In this section we introduce some language models that use more contextual information than the unigram model, yet keep the space and time complexity feasible.
\subsection{$n$-grams}
$n$-gram language models generalise unigram language models by explicitly taking the order of words into account. They model a sentence by taking contiguous sequences of $n$ words, where we call $w_1,\ldots,w_{n-1}$ the context $\mathbf{u}$ and $w_n$ the focus word. For the 5-word sentence $s$, \emph{Bananas are my favourite fruit}, we can derive the following $n$-grams:\footnote{The context $\mathbf{u}$ is emphasised in italics.}
\begin{enumerate}
\item Bananas, are, my, favourite, fruit
\item \emph{Bananas} are, \emph{are} my, \emph{my} favourite, \emph{favourite} fruit
\item \emph{Bananas are} my, \emph{are my} favourite, \emph{my favourite} fruit
\item \emph{Bananas are my} favourite, \emph{are my favourite} fruit
\item \emph{Bananas are my favourite} fruit
\end{enumerate}
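As an illustration (assuming simple whitespace tokenisation; this is not the toolchain used in this thesis), such contiguous $n$-grams can be enumerated as follows:
\begin{verbatim}
def ngrams(tokens, n):
    """Yield all contiguous n-grams of a token list."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])

sentence = "Bananas are my favourite fruit".split()
print(list(ngrams(sentence, 2)))
# [('Bananas', 'are'), ('are', 'my'), ('my', 'favourite'),
#  ('favourite', 'fruit')]
\end{verbatim}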
A problem with this scheme is that if we want a 5-gram language model to predict that \emph{Bananas} is the first word of a sentence, \emph{Bananas} must take the role of the focus word. The context is then filled with markers\footnote{These tokens are usually called begin of sentence (BOS) markers, and end of sentence (EOS) markers when they signal the end of a sentence.} that denote that they are not part of the sentence. Each sentence implicitly ends with an end of sentence marker. %This is useful to query the likelihood of the first $n-1$ words in an $n$-gram language model, but in practice the influence is negligible\footnote{is that so? I believe so.}. In the examples we will mention the markers if they add to the understanding.
Punctuation is also considered a word, except when it is part of an abbreviation.\footnote{When using data that has been processed by others, this may vary, as it is dependent on choices such as how to tokenise the text.}
The joint probability of $s$ with a bigram language model is thus:
\[ p(s) = p(\mathemph{Bananas}|\mathemph{BOS})p(\mathemph{are}|\mathemph{Bananas})p(\mathemph{my}|\mathemph{are})p(\mathemph{favourite}|\mathemph{my})p(\mathemph{fruit}|\mathemph{favourite})p(\mathemph{EOS}|\mathemph{fruit})\]
Or in a more general and abstract sense:
\[ p(s) = \prod_{i=1}^{|s|+1}p(w_i|w_1,\ldots,w_{i-1}) \]
Henceforth we will use the abbreviation $w_{i-n+1}^{i-1} \equiv w_{i-n+1},\ldots,w_{i-1}$; an $n$-gram model approximates $p(w_i|w_1,\ldots,w_{i-1})$ by $p(w_i|w_{i-n+1}^{i-1})$.
An $n$-gram probability $p(w_i|w_{i-n+1}^{i-1})$ is (naively) computed by means of its maximum likelihood estimate (MLE), a natural procedure: count how often the token $w_i$ follows the context $w_{i-n+1}^{i-1}$, and divide by the total number of times the history occurs:
\begin{equation} p_{\operatorname{MLE}}\left(w_i|w_{i-n+1}^{i-1}\right) = \frac{c\left(w_{i-n+1}^i\right)}{c\left(w_{i-n+1}^{i-1}\right)} = \frac{c\left(w_{i-n+1}^{i}\right)}{\sum_{w_i\in V}c\left(w_{i-n+1}^{i}\right)}\label{eq:pmle}
\end{equation}
where $c(\mathbf{u}w)$ is the count function that denotes how often an $n$-gram $\mathbf{u}w$ occurs in the train data.
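For concreteness, a minimal counting sketch of \cref{eq:pmle} (illustrative only; real language-modelling toolkits use far more compact data structures than Python dictionaries):
\begin{verbatim}
from collections import Counter

def mle(tokens, n):
    """Relative-frequency (MLE) estimates of p(w_i | w_{i-n+1}^{i-1})."""
    ngram_counts = Counter(tuple(tokens[i:i + n])
                           for i in range(len(tokens) - n + 1))
    hist_counts = Counter(tuple(tokens[i:i + n - 1])
                          for i in range(len(tokens) - n + 2))
    return {g: c / hist_counts[g[:-1]] for g, c in ngram_counts.items()}

tokens = "the yellow banana and the ripe banana".split()
print(mle(tokens, 2)[("the", "yellow")])
# 0.5: "the" occurs twice, once followed by "yellow"
\end{verbatim}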
With the unigram approach, we have to store for each word its probability of occurrence. With $W$ the number of words in the vocabulary $V$, we have to store $W$ probabilities. With $n$-grams, there are exponentially many possibilities: $W^n$. A typical vocabulary size is $W\approx 200,000$, hence storing all possibilities for a trigram model is already quite a feat. Fortunately, most of these trigrams, such as \emph{elephant vacuum cola}, never occur.\footnote{Although mentioning an $n$-gram that never occurs is a paradox.}
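To make this concrete, with the typical vocabulary size mentioned above the number of possible $n$-grams grows as
\[ W^3 = \left(2\times 10^5\right)^3 = 8\times 10^{15} \quad\textnormal{trigrams},\qquad W^5 = \left(2\times 10^5\right)^5 = 3.2\times 10^{26} \quad\textnormal{5-grams}. \]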
%Here: plots of a corpus with 3-, 4-, and 5-grams and the upper bound
\begin{figure}
\begin{scaletikzpicturetowidth}{\textwidth}
\tikzsetnextfilename{hello}
\begin{tikzpicture}[scale=\tikzscale]
\begin{semilogyaxis}[
%\begin{axis}[%domain=0:100,
title=Growth of the number of types throughout the corpus,
xlabel=relative position in document (tokens in \%),
ylabel=types,
width=11cm,
height=7cm]
%legend=none]
%legend cell align=left,
%legend pos=outer north east,
%legend style={draw=none}]
\addplot table [y=c1, x=c]{ngramtypegrowth.dat};\label{fig:gp1}
%\addlegendentry{1-grams}
\addplot table [y=c2, x=c]{ngramtypegrowth.dat};\label{fig:gp2}
%\addlegendentry{2-grams}
\addplot table [y=c3, x=c]{ngramtypegrowth.dat};\label{fig:gp3}
%\addlegendentry{3-grams}
\addplot table [y=c4, x=c]{ngramtypegrowth.dat};\label{fig:gp4}
%\addlegendentry{4-grams}
\addplot table [y=c5, x=c]{ngramtypegrowth.dat};\label{fig:gp5}
%\addlegendentry{5-grams}
%\addplot (5*x);
\end{semilogyaxis}
%\end{axis}
\end{tikzpicture}
\end{scaletikzpicturetowidth}
\caption{The number of unique $n$-grams (types) in the \obw corpus (which will be introduced in the next chapter), plotted against the relative position in the document (the number of $n$-gram tokens encountered). Unigrams (\ref{fig:gp1}), bigrams (\ref{fig:gp2}), trigrams (\ref{fig:gp3}), 4-grams (\ref{fig:gp4}), and 5-grams (\ref{fig:gp5}). The number of types is plotted on a log scale, since otherwise (\ref{fig:gp1}) would appear flat.}\label{fig:ngramgrowth}
\end{figure}
\begin{figure}
\begin{scaletikzpicturetowidth}{\textwidth}
\tikzsetnextfilename{byeb}
\begin{tikzpicture}[scale=\tikzscale]
\begin{semilogyaxis}[
%\begin{axis}[%domain=0:100,
title=Maximum growth of the number of types throughout the corpus,
xlabel=relative position in document (tokens in \%),
ylabel=types,
width=11cm,
height=7cm]
%legend=none]
%legend cell align=left,
%legend pos=outer north east,
%legend style={draw=none}]
\addplot table [y=c1, x=c]{ngrammaxgrowth.dat};\label{fig:gp1-1}
%\addlegendentry{1-grams}
\addplot table [y=c2, x=c]{ngrammaxgrowth.dat};\label{fig:gp1-2}
%\addlegendentry{2-grams}
\addplot table [y=c3, x=c]{ngrammaxgrowth.dat};\label{fig:gp1-3}
%\addlegendentry{3-grams}
\addplot table [y=c4, x=c]{ngrammaxgrowth.dat};\label{fig:gp1-4}
%\addlegendentry{4-grams}
\addplot table [y=c5, x=c]{ngrammaxgrowth.dat};\label{fig:gp1-5}
\addplot table [y=c1, x=c]{ngramtypegrowth.dat};\label{fig:gp1}
%\addlegendentry{1-grams}
\addplot table [y=c2, x=c]{ngramtypegrowth.dat};\label{fig:gp2}
%\addlegendentry{2-grams}
\addplot table [y=c3, x=c]{ngramtypegrowth.dat};\label{fig:gp3}
%\addlegendentry{3-grams}
\addplot table [y=c4, x=c]{ngramtypegrowth.dat};\label{fig:gp4}
%\addlegendentry{4-grams}
\addplot table [y=c5, x=c]{ngramtypegrowth.dat};\label{fig:gp5}
%\addlegendentry{5-grams}
%\addplot (5*x);
\end{semilogyaxis}
%\end{axis}
\end{tikzpicture}
\end{scaletikzpicturetowidth}
\caption{The maximum number of $n$-grams, for the same corpus as in \cref{fig:ngramgrowth}. The indistinguishable lines at the bottom are the lines from the previous plot. This shows how sparse the possible $n$-gram feature space is, with up to $10^{44}$ possible $5$-grams against an observed number of around $10^8$.}\label{fig:maxgrowth}
\end{figure}
In \cref{fig:maxgrowth} we added the maximum number of possible $n$-grams to the number of observed $n$-grams. The graph shows that this feature space can be extremely sparse: the observed number of $n$-grams, most notably of $5$-grams, even lies below the maximum number of unigrams. Hence the chance of running into a previously observed $n$-gram is very small, and later in this chapter we present countermeasures that improve on MLE to deal with this issue.
\subsection{Skipgrams}\marginnote{With the surge of neural network-based language models, the meaning of the word skipgram has shifted from being an $n$-gram with a skipped word, to the skipgram paradigm in word2vec. In our case skipgrams are not in the output of the model, but rather on the input side, and do not have a fixed amount of left and right context.}
A problem of $n$-gram models is that they can only model contiguous sequences of words; this rules out any form of long-distance relationship\footnote{With long-distance we mean spanning multiple words, not necessarily the dependency-tree interpretation of relations between words.} between words. A common example is the interjection of an adjective between a determiner and a noun: \emph{the delicious banana}, \emph{the yellow banana}. To give the model expressive power for such examples, we introduce the skipgram language model, in which we allow $n$-grams to contain an arbitrary number of skips, where each skip represents skipping one word.
Let us first formally introduce the skipgrams as used in this thesis. Let $\{m\}$ be a skip of length $m$, then \emph{the $\{1\}$ banana} can match \emph{the delicious banana}, or \emph{the yellow banana}, etc. We do not allow skips to be at the beginning or end of the skipgram, so for $n>2$ skipgrams are a generalisation of $n$-grams \autocite{goodman2001bit,shazeer2015sparse,pickhardt2014generalized}.
Now let $\sigma_{\rotatebox[origin=c]{180}{m}}$ be the operator that adds a skip to a pattern $\mathbf{u}$ on the $\rotatebox[origin=c]{180}{m}$th position if there is not already a skip. Then $\boldsymbol\sigma(\mathbf{u}) = \left[\sigma_{\rotatebox[origin=c]{180}{m}}(\mathbf{u})\right]_{\rotatebox[origin=c]{180}{m}=2}^{|\mathbf{u}|-1}$ is the set of patterns with one skip more than the number of skips currently in $\mathbf{u}$.
For a 4-gram $\mathbf{u}w$ = \emph{ate the ripe banana}, $\sigma_2(\mathbf{u}w)$ = \emph{ate \{1\} ripe banana}, $\sigma_3(\mathbf{u}w)$ = \emph{ate the \{1\} banana}, and $\boldsymbol\sigma(\mathbf{u}w) = \left[\sigma_2(\mathbf{u}w), \sigma_3(\mathbf{u}w)\right]$. To get to the 4-gram with two skips, we apply the operator to a skipgram that already contains a skip, e.g.~$\sigma_3(\sigma_2(\mathbf{u}w))$ = \emph{ate \{1\} \{1\} banana}.\footnote{We can abbreviate this skipgram to \emph{ate \{2\} banana}.}
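A minimal sketch of this operator (our own illustration, not the implementation used in this thesis), with a skip represented by the token \texttt{\{1\}}:
\begin{verbatim}
def sigma(pattern, j):
    """Put a skip at (1-based) position j, unless one is already there."""
    if pattern[j - 1] == "{1}":
        return pattern
    return pattern[:j - 1] + ("{1}",) + pattern[j:]

def all_single_skips(pattern):
    """All patterns with exactly one skip more; skips never occur at
    the first or the last position."""
    return [sigma(pattern, j) for j in range(2, len(pattern))
            if pattern[j - 1] != "{1}"]

p = ("ate", "the", "ripe", "banana")
print(all_single_skips(p))
# [('ate', '{1}', 'ripe', 'banana'), ('ate', 'the', '{1}', 'banana')]
print(sigma(sigma(p, 2), 3))
# ('ate', '{1}', '{1}', 'banana')
\end{verbatim}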
This definition allows only one additional skip per application of the operator.\footnote{The main reason to add only one skip per application is that in the backoff procedure skipgrams with more skips have probabilities in a different range than skipgrams with fewer skips. When at a later stage these probabilities are interpolated, one of the two completely overpowers the other, which undoes the expressive power. In the current implementation at least the number of content-bearing words is the same after the application of the skipgram operator.} The skipgram can be generalised further into a flexgram \autocite{gompel2016efficient}, but this is computationally even more expensive than skipgrams, and preliminary results show no further improvements over just using skipgrams.
Remember that we extend the language model to overcome the problem of poor generalisation, especially in scenarios where many OOVs are encountered and hurt performance. With skipgrams we can, on the one hand, partially solve the sparseness of $n$-grams without losing the information of word order, and on the other hand we now have a mechanism to jump over unknown words.
But this is not the end of the story, since we also have to adapt the models themselves so that they can help in their own way.
\section{Smoothing, Interpolation, and Backoff Methods}
Earlier we saw that even for a small vocabulary of $50000$ words and a trigram language model, we have to model $O(50000^3)$ parameters. With this number of parameters we can model the train data very precisely, which causes severe overfitting, especially in the context of maximum-likelihood estimation. On the other hand we observe that many of these trigrams have count zero\footnote{Combine the large feature space with relatively few observations, of which the elements mostly occur in standard patterns, and it is clear that we are dealing with a very sparse feature space.}, which leads to zero probabilities, and even to divisions by zero for unseen histories, in \cref{eq:pmle}. For many language modelling applications such as automatic speech recognition, a zero $n$-gram probability implies that such a word sequence can never be hypothesised. Generally, we want to assign non-zero probabilities to all $n$-grams.\autocite{jevtic2005estimation}
One of the first ways to avoid assigning a zero probability to an $n$-gram is to smooth the probability distribution, and adjust the empirical counts to expected counts.
The simplest method is \emph{additive smoothing}, in particular \emph{add-one smoothing}, which prevents zero probabilities by modifying the counts used in the MLE.\footnote{The MLE will generally underestimate the probability of unseen words $w$ after a known context $\mathbf{u}$.}\footnote{The more general \emph{add-$\alpha$ smoothing} with $\alpha < 1$ limits the amount of probability mass moved to unobserved events: \[ p_{\operatorname{MLE_{+\alpha}}}\left(w_i|w_{i-n+1}^{i-1}\right) = \frac{c\left(w_{i-n+1}^i\right)+\alpha}{c\left(w_{i-n+1}^{i-1}\right)+\alpha W}.\]}
\begin{equation} p_{\operatorname{MLE_{+1}}}\left(w_i|w_{i-n+1}^{i-1}\right) = \frac{c\left(w_{i-n+1}^i\right)+1}{c\left(w_{i-n+1}^{i-1}\right)+W}\label{eq:paddone}
\end{equation}
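A minimal sketch of \cref{eq:paddone} and its add-$\alpha$ generalisation (the counts and the vocabulary size below are made up for illustration):
\begin{verbatim}
from collections import Counter

def p_add_alpha(ngram, ngram_counts, hist_counts, W, alpha=1.0):
    """Additive smoothing: (c(ngram) + alpha) / (c(history) + alpha * W);
    alpha = 1 gives add-one smoothing."""
    return ((ngram_counts.get(ngram, 0) + alpha) /
            (hist_counts.get(ngram[:-1], 0) + alpha * W))

ngram_counts = Counter({("the", "banana"): 2})
hist_counts = Counter({("the",): 3})
print(p_add_alpha(("the", "banana"), ngram_counts, hist_counts, W=5))
# (2 + 1) / (3 + 5) = 0.375
print(p_add_alpha(("the", "kumquat"), ngram_counts, hist_counts, W=5))
# (0 + 1) / (3 + 5) = 0.125
\end{verbatim}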
Another tool is backoff. One of the most basic techniques involves subtracting a small constant discount $D < 1$ from the count of each observed $n$-gram, and distributing the withheld counts uniformly over unseen $n$-grams:
\begin{equation}
p_{\operatorname{uni}}(w|\mathbf{u})=\left.
\begin{cases}
\frac{c(\mathbf{u}w)-D}{\sum_{w'}c(\mathbf{u}w')}, & c(\mathbf{u}w) > 0\textnormal{ (observed)} \\
\alpha_{\operatorname{uni}}(\mathbf{u}), & c(\mathbf{u}w) = 0\textnormal{ (unseen)}
\end{cases}\right.
\end{equation}
which describes uniform backoff. To ensure that $\sum_wp(w|\mathbf{u})=1$, \autocite{chen1996empirical} define $N_{1+}(\mathbf{u}\cdot) = |\{w:c(\mathbf{u}w) > 0\}|$ and
\begin{equation}
\alpha_{\operatorname{uni}}(\mathbf{u}) = \frac{D}{\sum_{w'}c(\mathbf{u}w')}\frac{N_{1+}(\mathbf{u}\cdot)}{W - N_{1+}(\mathbf{u}\cdot)}.
\end{equation}
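A sketch of uniform backoff with a fixed discount $D$ (illustrative only; the count layout is our own choice, and the context $\mathbf{u}$ is assumed to have been observed at least once):
\begin{verbatim}
def p_uni(w, u, counts, W, D=0.5):
    """Uniform backoff: discount observed counts by D and spread the
    withheld mass evenly over the W - N1+(u.) unseen continuations.
    `counts` maps a context tuple u to a dict of follower counts."""
    followers = counts.get(u, {})
    total = sum(followers.values())   # sum_w' c(u w'); assumed > 0
    n1plus = len(followers)           # N1+(u .)
    if followers.get(w, 0) > 0:
        return (followers[w] - D) / total
    return (D / total) * n1plus / (W - n1plus)
\end{verbatim}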
However, distributing the withheld counts uniformly implies that all unseen $n$-grams with the same history are assigned the same probability.
In some cases we have bigrams with the same frequency count, but whose focus words have different counts. When estimating the probability we want to capture this behaviour by interpolating the bigram model with a unigram model, e.g.~with proportional backoff:
\begin{equation}\begin{split}
p_{\operatorname{prop}}(w|\mathbf{u})=&\left.
\begin{cases}
\frac{c(\mathbf{u}w)-D}{\sum_{w'}c(\mathbf{u}w')}, & c(\mathbf{u}w) > 0\textnormal{ (observed)} \\
\alpha_{\operatorname{prop}}(\mathbf{u})p_{\operatorname{prop}}(w|\pi(\mathbf{u})), & c(\mathbf{u}w) = 0\textnormal{ (unseen)}
\end{cases}\right. \\
\alpha_{\operatorname{prop}}(\mathbf{u}) =& \frac{1 - \sum_{w:c(\mathbf{u}w)>0}p_{\operatorname{prop}}(w|\mathbf{u})}{1 - \sum_{w:c(\mathbf{u}w)>0}p_{\operatorname{prop}}(w|\pi(\mathbf{u}))}
\end{split}\end{equation}
Now we have the issue that with proportional backoff smoothing, observed $n$-grams with the same count and the same history have the same probability. We want to create a bias towards a more frequently occurring focus word. One approach is to interpolate all $n$-gram probabilities with the lower-order backoff model, independent of whether the $n$-grams are observed. This method is known as absolute discounting\autocite{ney1994structuring}:
\begin{equation}\begin{split}
p_{\operatorname{abs}}(w|\mathbf{u})=&\frac{\max(c(\mathbf{u}w)-D,0)}{\sum_{w'}c(\mathbf{u}w')} + \alpha_{\operatorname{abs}}(\mathbf{u})p_{\operatorname{abs}}(w|\pi(\mathbf{u})) \\
\alpha_{\operatorname{abs}}(\mathbf{u})=&\frac{D\cdot N_{1+}(\mathbf{u}\cdot)}{\sum_{w'}c(\mathbf{u}w')}
\end{split}\end{equation}
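A recursive sketch of absolute discounting, reusing the count layout of the previous sketch (illustrative; as a simplification a single discount $D$ is shared across all orders and the recursion bottoms out in a uniform distribution over the vocabulary):
\begin{verbatim}
def p_abs(w, u, counts, W, D=0.5):
    """Absolute discounting, interpolated with the lower-order model.
    `counts` maps context tuples to dicts of follower counts."""
    if len(u) == 0:
        return 1.0 / W          # simplified base case of the recursion
    followers = counts.get(u, {})
    total = sum(followers.values())
    if total == 0:
        return p_abs(w, u[1:], counts, W, D)   # unseen context
    alpha = D * len(followers) / total         # D * N1+(u .) / sum c(u w')
    return (max(followers.get(w, 0) - D, 0) / total
            + alpha * p_abs(w, u[1:], counts, W, D))
\end{verbatim}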
Although the discount parameter $D$ can be tuned to optimise the performance on a held-out set, it is generally estimated using the $n$-gram count statistics as:
\begin{equation}
D = \frac{N_{1}(\cdot)}{N_{1}(\cdot)+2N_{2}(\cdot)},
\end{equation}
where $N_{1}(\cdot)$ and $N_{2}(\cdot)$ are the numbers of $n$-grams that appear exactly once and exactly twice in the training corpus.
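For instance, with hypothetical counts-of-counts $N_{1}(\cdot) = 4\times10^6$ and $N_{2}(\cdot) = 10^6$, the estimate becomes
\[ D = \frac{4\times10^6}{4\times10^6+2\times10^6} = \frac{2}{3}. \]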
\subsection{Kneser-Ney}
With $n$-gram models we take the context of a word into consideration, but not the diversity of histories of a word. For example, the word \emph{bulb} might be fairly frequent in a corpus, but very likely its preceding word will be \emph{light}. The idea of Kneser-Ney smoothing is then to give less probability to the unigram \emph{bulb} than its raw count would suggest, taking into account the diversity of its histories.
In its simplest form\autocite{kneser1995improved} Kneser-Ney smoothing replaces the raw word count $c(w)$ from the MLE with the number of distinct histories of a word:
\begin{equation}
p_{\operatorname{KN}}(w) = \frac{N_{1+} (\cdot w)}{\sum_{w'}N_{1+} (\cdot w')},
\end{equation}
where $N_{1+}(\cdot w^i_{i-n+1}) = |\{w_{i-n} : c(w^i_{i-n}) > 0\}|$, meaning the number of different words that precede the sequence $w^i_{i-n+1}$.
With interpolated Kneser-Ney (IKN) the probability of a word $w$ following a context $\mathbf{u}$ is estimated by discounting the true count $c_{\mathbf{u}w}$ by a fixed amount $d_{|\mathbf{u}|}$ that depends on the length of the $n$-gram ($|\mathbf{u}w|$). Furthermore, the probability of word $w$ is interpolated with lower-order $n$-gram probabilities, such that:
\begin{equation}
p_{\operatorname{IKN}}(w|\mathbf{u}) = \frac{\max(0, c_{\mathbf{u}w} - d_{|\mathbf{u}|})}{c_{\mathbf{u}\cdot}} + \frac{d_{|\mathbf{u}|}t_{\mathbf{u}\cdot}}{c_{\mathbf{u}\cdot}}p_{\operatorname{IKN}}(w|\pi(\mathbf{u}))
\end{equation}
where $t_{\mathbf{u}\cdot} = |\{ w' : c_{\mathbf{u}w'} > 0 \}|$ is the number of distinct words $w'$ following context $\mathbf{u}$ and $\pi(\mathbf{u})$ is the context consisting of all words in $\mathbf{u}$ except the first. $p_{\operatorname{IKN}}(w|\pi(\mathbf{u}))$ then holds the lower order ($n-1$) $n$-gram probabilities.
Since IKN forms the basis of the Bayesian language models that we will introduce later, and is a special case of modified Kneser-Ney, which is a common baseline, we will elaborate a bit on this family of language models.
First let $w'\mathbf{u'}$ be the context formed by concatenating $w'$ and $\mathbf{u'}$. Then for a context $\mathbf{u'}$ of length $m<n-1$, and words $w'$ and $w$,
\begin{equation}
\begin{split}
t_{w'\mathbf{u'}w} = \left.
\begin{cases}
1, & \textnormal{if } c_{w'\mathbf{u'}w} > 0; \\
0, & \textnormal{if } c_{w'\mathbf{u'}w} = 0;
\end{cases}
\right. \\
c_{\mathbf{u'}w} = t_{\cdot\mathbf{u'}w} = \sum_{w'}t_{w'\mathbf{u'}w}
\end{split}
\end{equation}\footnote{In the next chapter $c$ and $t$ are given the names \emph{customers} and \emph{tables} respectively, after the metaphorical Chinese restaurant process which is introduced there. We use the same notation here, and throughout this thesis, for sake of clarity.}
A separate value of $d_x$ is used for each length $x$; it can be estimated using closed-form formulas or by cross-validation.
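Putting the pieces together, a compact sketch of the IKN recursion (our own illustration: the lower-order counts are derived with the rule above, a single discount $d$ is shared across all orders, and the recursion bottoms out in a uniform distribution over the vocabulary):
\begin{verbatim}
from collections import Counter, defaultdict

def build_orders(highest, max_n):
    """Derive lower-order counts from the highest order with the rule
    c_{u'w} = |{w' : c_{w'u'w} > 0}|.  `highest` maps (max_n-1)-word
    context tuples to Counters of follower counts."""
    orders = {max_n: highest}
    for n in range(max_n - 1, 0, -1):
        lower = defaultdict(Counter)
        for ctx, followers in orders[n + 1].items():
            for w, c in followers.items():
                if c > 0:
                    lower[ctx[1:]][w] += 1   # one distinct extension w'
        orders[n] = dict(lower)
    return orders

def p_ikn(w, u, orders, W, d=0.75):
    """Interpolated Kneser-Ney with a single discount d for all orders."""
    followers = orders[len(u) + 1].get(tuple(u), Counter())
    total = sum(followers.values())          # c_{u.}
    if total == 0:
        return p_ikn(w, u[1:], orders, W, d) if u else 1.0 / W
    t = len(followers)                       # t_{u.}: distinct followers
    lower = p_ikn(w, u[1:], orders, W, d) if u else 1.0 / W
    return max(followers[w] - d, 0) / total + d * t / total * lower

# Toy trigram counts (max_n = 3); the numbers are made up.
highest = {("the", "light"): Counter({"bulb": 3}),
           ("a", "light"): Counter({"bulb": 1, "snack": 1})}
orders = build_orders(highest, max_n=3)
print(p_ikn("bulb", ("the", "light"), orders, W=10))  # roughly 0.88
\end{verbatim}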
Modified Kneser-Ney (MKN) is an improvement on and a generalisation of IKN, in which the amount of discount is allowed more variability. In \autocite{chen1996empirical} it was found that the optimal amount of discount changes slowly as a function of the counts $c_{\mathbf{u}w}$. In the same study they also noticed that the optimal discounts for low values of $c_{\mathbf{u}w}$ differ substantially from those for higher values. This formed the basis for MKN, which uses different discount values for different counts: one for each $c_{\mathbf{u}w} = 1, 2, \ldots, c^{(\textnormal{max})}-1$ and another for $c_{\mathbf{u}w} \geq c^{(\textnormal{max})}$. MKN reduces to IKN when $c^{(\textnormal{max})} = 1$, while \autocite{chen1996empirical} use $c^{(\textnormal{max})} = 3$ as a compromise between diminishing improvements and increasing implementational complexity.
In summary, MKN is defined as follows:
\begin{equation}\begin{split}
p_{\operatorname{MKN}}(w|\mathbf{u}) =& \frac{\max(0, c_{\mathbf{u}w} - d(c_{\mathbf{u}w}))}{c_{\mathbf{u}\cdot}} + \frac{d_1N_1(\mathbf{u}\cdot) + d_2N_2(\mathbf{u}\cdot) + d_{3+}N_{3+}(\mathbf{u}\cdot)}{c_{\mathbf{u}\cdot}}p_{\operatorname{MKN}}(w|\pi(\mathbf{u})) \\
d(c) =& \left.
\begin{cases}
0, & \textnormal{if }c = 0; \\
d_1, & \textnormal{if } c = 1; \\
d_2, & \textnormal{if } c = 2; \\
d_{3+}, & \textnormal{if } c \geq 3,
\end{cases}\right. \\
d_1 =& 1- \frac{2n_1n_2}{n_1(n_1+2n_2)} \\
d_2 =& 2 - \frac{3n_1n_3}{n_2(n_1+2n_2)} \\
d_{3+} =& 3 - \frac{4n_1n_4}{n_3(n_1+2n_2)}
\end{split}\end{equation}
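As a small sketch, the discounts can be computed directly from the counts-of-counts $n_1,\ldots,n_4$ of the relevant $n$-gram order (the numbers in the example are hypothetical):
\begin{verbatim}
def mkn_discounts(n1, n2, n3, n4):
    """MKN discounts d1, d2, d3+ from the counts-of-counts n1..n4
    (the numbers of n-grams occurring exactly 1, 2, 3 and 4 times)."""
    y = n1 / (n1 + 2 * n2)          # common factor n1 / (n1 + 2 n2)
    return (1 - 2 * y * n2 / n1,
            2 - 3 * y * n3 / n2,
            3 - 4 * y * n4 / n3)

print(mkn_discounts(4e6, 1e6, 5e5, 3e5))  # approximately (0.67, 1.0, 1.4)
\end{verbatim}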
\section{Other $n$-gram-based language models}
An informed reader might notice that we have not mentioned other language models based on the $n$-gram paradigm. Although some of them are mentioned in the course of this thesis, we give here an inherently limited overview of some of the other language models that are out there.
$n$-grams are a natural generalisation of words, and an intermediate step is bag-of-words, which are $n$-grams without the constraint on word order.
Because of the computational costs inherent to skipgram language modelling, for example with IKN, until recently researchers only experimented with models up to trigrams. To circumvent this burden, other long-distance language models have been introduced, such as trigger-based\cite{lau1993trigger}, cache-based\cite{clarkson1997language}, and association pattern-based\cite{chien2006association} language models. To overcome the sparseness, other models have been proposed, such as class-based $n$-grams\cite{brown1992class}. The implementations varied in the type of classes used, ranging from co-occurrence counts and part-of-speech (POS) tags to semantic relatedness based on WordNet.
Collocations\cite{tomokiyo2003language} and multi-word expressions (MWE)\cite{blei2009visualizing} in particular were often combined with topic modelling\footnote{A topic model is a type of statistical model for discovering the abstract \emph{topics} that occur in a collection of documents. The technique has roots in (P)LSA, and gained widespread renewed interest in 2003 with LDA.} techniques such as latent Dirichlet allocation (LDA).
More recently, neural language models have seen a huge revival, and in the field of language modelling huge advances have been made. Currently, the state of the art on the One billion word benchmark (see \cref{chap:data}) is an LSTM\cite{kuchaiev2017factorization}. Simpler and computationally less intensive methods and tools such as word2vec\cite{mikolov2013distributed} are now replacing MKN as the reigning language model.
A notable mention goes to sparse non-negative language models\cite{shazeer2015sparse}. The researchers show that these language models are not only competitive, but also capture behaviour different from count-based $n$-gram methods such as MKN, and from neural language models such as recurrent neural networks (RNNs)\footnote{Even though word2vec can be seen as implicit matrix factorisation \cite{levy2015improving}.} and LSTMs. This means that with an optimal interpolation strategy, improving either of these methods can lead to even better results for the combination.