\chapter{Chapter 5\newline Improving Bayesian skipgram language models \newline with interpolation factors and backoff strategies}\label{chap:interpol}
Statistical language models have been a staple technology and workhorse of natural language processing (NLP) for decades. Many ideas have been proposed, and most have been improvements over existing models. Some of them were revolutionary in either their performance or their simplicity. In the nineties Kneser and Ney\autocite{kneser1995improved} published their work on frequentist language models that use count-of-count information to better estimate smoothed backoff probabilities. Two decades later, Mikolov consolidated existing work on language models into word2vec,\autocite{mikolov2013distributed} currently one of the most widely used language models.
In this chapter, and more generally, in this thesis, we set out to improve over the more traditional count-based models in the form of their Bayesian generalisation by adding skipgrams to the set of input features, in addition to $n$-grams.
To overcome the traditional problems of overestimating the probabilities of rare occurrences and underestimating the probabilities of unseen events, a range of smoothing algorithms have been proposed in the literature\autocite{goodman2001bit}. Most methods take a heuristic-frequentist approach combining $n$-gram probabilities for various values of $n$, using back-off schemes or interpolation.
In this chapter we expand the hierarchical Pitman-Yor process language model (HPYPLM) with skipgrams,\autocite{onrust2016Improving} introduced in the previous chapter. We add interpolation factors, weighing the relative influence of skipgrams over $n$-grams, and the relative influence of interpolated backoff probabilities.
\section{Interpolation factors}
The use of interpolation factors in a language model is not new. In the literature we find lattice-based language models in
\autocite{dupont1997lattice} and a generalisation called factored language model with generalised parallel backoff \autocite{bilmes2003factored}.
%However, the context sizes are small (2 and 3), % I do not understand the relevance of this last sentence, but perhaps that is because it is unfinished? (it ends in a comma)
Maximum entropy language models \autocite{ROSENFELD1996187} and distant bigram language models \autocite{bassiou2011long} are other related cases in point. In \autocite{gao2004long} each backoff level has its own weight, fixed for all features. These works all implicitly use skipgram features, with variable skip sizes, spanning patterns that are larger than $n$.
% In \cite{gao2004long} each backoff level has its own weight, fixed for all features.
% [AB] this sentence moved and copied above
A more recent paper on using skipgram language models only uses uniform linear interpolation with a generalisation of modified Kneser-Ney \autocite{pickhardt2014generalized}. Even more recently, in \autocite{pelemans2016sparse} a sparse non-negative feature weight matrix is computed on the basis of an adjusted version of relative frequency.
Inspired by the previous studies, we use nine interpolation strategies:\marginnote{A figure here that shows the values for one run, and the average proportions or similar.}
\begin{itemize}
\item \textsf{ngram}, where we ignore the skipgram probabilities (and prohibit the backoff step to skipgrams): \\
$I(\mathbf{u}) =
\begin{cases}
1 & \text{if } \mathbf{u} \text{ is }n\text{-gram} \\
0 & \text{if } \mathbf{u} \text{ is skipgram}
\end{cases}$
\item \textsf{Uninformed uniform prior (uni)}, where all the weights are 1:\\
$ I(\mathbf{u}) = 1 $
\item \textsf{Uninformed $n$-gram preference (npref)}, where we give the $n$-grams twice the importance of skipgrams:\footnote{Later in this chapter we do a more in-depth investigation to find the optimal preference ratio.} \\
$I(\mathbf{u}) =
\begin{cases}
2 & \text{if } \mathbf{u} \text{ is }n\text{-gram} \\
1 & \text{if } \mathbf{u} \text{ is skipgram}
\end{cases}$
\item \textsf{Maximum likelihood-based Linear Interpolation (mle)}, based on the maximum likelihood estimate of the context: \\[0.5ex]
$ I(\mathbf{u}) = \displaystyle \frac{c(\mathbf{u})}{c(\mathbf{u}\cdot)} $ \\
\item \textsf{Unnormalised count (count)}, based on the occurrence count of the context: \\[0.5ex]
$ I(\mathbf{u}) = \displaystyle c(\mathbf{u}) $ \\
\item \textsf{Entropy-based Linear Interpolation (ent)}, based on the entropy of the context: \\
$E(\mathbf{u}) = -\displaystyle \sum_{w,c(\mathbf{u}w)>0}^W\frac{c(\mathbf{u}w)}{c(\mathbf{u}\cdot)}\log\frac{c(\mathbf{u}w)}{c(\mathbf{u}\cdot)} $ \\
$ I(\mathbf{u}) = \displaystyle \frac{1}{1+E(\mathbf{u})}$ \\
where $c(\mathbf{u}w)$ are the counts as estimated by the model. We use the reciprocal because a higher entropy should yield a lower weight.
% although we did test also tested an increasing function, but this performed worse.
\item \textsf{Perplexity-based Linear Interpolation (ppl)}, raising 2 to the power of the negative entropy of the context: \\ % shifted into the domain of the counts, by using the entropy: \\
$\textstyle I(\mathbf{u}) = \displaystyle 2^{-E(\mathbf{u})} $
\item \textsf{random}, where weights are uniformly distributed between 0 and 1 and assigned to the terms: \\
$ I(\mathbf{u}) = \text{rand}(0,1) $
\item \textsf{Skipgram-type based Linear Interpolation (value)}, which, in contrast to many of the interpolation strategies above and more in line with \textsf{npref}, assigns a predefined value based not on the content of the context, but on the shape of the context. For example, in \textsf{npref} we only consider two cases, $n$-gram or skipgram. For \textsf{value} we can assign weights to individual skipgram types such as \emph{a \{1\} c} and \emph{b \{1\} \{1\}}. If we use the same notation, with \emph{a}, \emph{b}, and \emph{c} as placeholders indicating that there is a word in the context at that position\footnote{Positions 1, 2, and 3, respectively.}, and \emph{\{1\}} indicating a single skip, then we can define the function providing the interpolation values as:\footnote{We only outline the parameters for a 4-gram model. Higher-order models are extended analogously.} \\
$\textstyle I(\mathbf{u}) = \begin{cases}
w_{d} & \text{if the context is empty}\\
w_{cd} & \text{if } \mathbf{u} = \text{\emph{c}} \\
w_{bcd} & \text{if } \mathbf{u} = \text{\emph{bc}} \\
w_{b\{1\}d} & \text{if } \mathbf{u} = \text{\emph{b\{1\}}} \\
w_{abcd} & \text{if } \mathbf{u} = \text{\emph{abc}} \\
w_{a\{1\}cd} & \text{if } \mathbf{u} = \text{\emph{a\{1\}c}} \\
w_{ab\{1\}d} & \text{if } \mathbf{u} = \text{\emph{ab\{1\}}} \\
w_{a\{1\}\{1\}d} & \text{if } \mathbf{u} = \text{\emph{a\{1\}\{1\}}} \\
\end{cases}%
$ \\ \noindent
Setting all weights to 1 results in \textsf{uni}; setting $w_{d}$, $w_{cd}$, $w_{bcd}$, and $w_{abcd}$ to 2, and the others to 1,\footnote{With the special case that if you set the other weights to 0, you end up with \textsf{ngram}.} yields the default \textsf{npref}.
\end{itemize}
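As an illustration of \textsf{ent} and \textsf{ppl}, consider a hypothetical context $\mathbf{u}$ that is followed by $k$ distinct words, each equally often. This sketch assumes that the entropy is computed over the relative frequencies of the words following $\mathbf{u}$, and that the logarithm is taken to base 2, in line with the base used for \textsf{ppl}; both are assumptions made for this example only.
\[
E(\mathbf{u}) = -\sum_{i=1}^{k}\frac{1}{k}\log_2\frac{1}{k} = \log_2 k, \qquad
I_{\text{ent}}(\mathbf{u}) = \frac{1}{1+\log_2 k}, \qquad
I_{\text{ppl}}(\mathbf{u}) = 2^{-\log_2 k} = \frac{1}{k}
\]
For $k=1$ both strategies assign a weight of 1. For $k=4$ the entropy is 2, so \textsf{ent} assigns $1/3$ and \textsf{ppl} assigns $1/4$. Both weights decrease as the context becomes less predictive, with \textsf{ppl} decreasing faster.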
The weights for the interpolation strategies \textsf{mle} and \textsf{ppl} are determined at test time, since precomputing all these weights is expensive. For the same reason we have not ventured into learning the weights at training time, integrated into the Bayesian paradigm of the hierarchical Pitman-Yor process.
As a compromise we have the context-based methods \textsf{ent}, \textsf{ppl}, \textsf{count} and \textsf{mle}, as opposed to the heuristic \textsf{npref}, and the learned \textsf{value}.
We extend \cref{eq:interpolform} for the word probability by adding normalised interpolation weights $I(\cdot)$. The probability of a word $w$ with context $\mathbf{u}$ is then:
\begin{equation}\begin{split}
p(w|\mathbf{u}) &=
\sum_{\mathbf{u}_m\in\boldsymbol\varsigma}
\left[
\frac{I(\mathbf{u}_m)}
{\sum_{\mathbf{x}\in\boldsymbol\varsigma}
I(\mathbf{x})}
\left(\frac{c_{\mathbf{u}_mw\cdot} - d_{|\mathbf{u}_m|}t_{\mathbf{u}_mw\cdot}}
{\theta_{|\mathbf{u}_m|} + c_{\mathbf{u}_m\cdot\cdot}} + \frac{\theta_{|\mathbf{u}_m|} + d_{|\mathbf{u}_m|}t_{\mathbf{u}_m\cdot\cdot}}
{\theta_{|\mathbf{u}_m|} + c_{\mathbf{u}_m\cdot\cdot}}
Z_{\mathbf{u}_mw}
\right)\right]
\end{split}\label{eq:newinterpolform}\end{equation}
with $c_{\mathbf{u}w\cdot}$ being the number of $\mathbf{u}w$ tokens, and $c_{\mathbf{u}\cdot\cdot}$ the number of patterns starting with context $\mathbf{u}$. Similarly, $t_{\mathbf{u}wk}$ is 1 if the $k$th draw from $G_{\mathbf{u}}$ was $w$, and 0 otherwise. $t_{\mathbf{u}w\cdot}$ then denotes whether there is a pattern $\mathbf{u}w$, and $t_{\mathbf{u}\cdot\cdot}$ is the number of types following context $\mathbf{u}$.
For \textsl{ngram} and \textsl{full}
\begin{equation}
Z_{\mathbf{u}w} = p(w|\pi(\mathbf{u})),
\end{equation}
for \textsl{limited}\sidenote{Computing the normalisation factor is expensive, because for each word $w$ in the vocabulary that occurs after the context $\mathbf{u}$ you have to compute its probability. Combined with the enormous search space for all contexts of length up to three, computing the normalisation factor is best done at runtime, whilst maintaining a cache.}
\begin{equation}
Z_{\mathbf{u}w} = \left.
\begin{cases}
\frac{1 - \sum_{w\in \mathcal{B}} p_{\mathrm{L}}(w|\pi(\mathbf{u}))}{|\mathcal{N}|}, & \text{if } \mathrm{count}(\mathbf{u}w) > 0 \\
p(w|\pi(\mathbf{u})), & \text{otherwise }
\end{cases}
\right.
\end{equation}
where the words $w\in\mathcal{N}$ in the patterns $\mathbf{u}w$ have not been seen in the training data, and the patterns $\mathbf{u}w$ with $w\in\mathcal{B}$ are in the training data.
The main difference between \cref{eq:interpolform} and \cref{eq:newinterpolform} is that in the latter we do not use an explicit discount term over the type counts, but a normalisation term. This yields a simpler strategy that remains theoretically sound, as it results in proper distributions.\footnote{See \cref{apx:proofinterpolform}.}
Note also that rather than two terms, \cref{eq:newinterpolform} only has one, because the \textsl{ngram} backoff strategy is now interpreted as an interpolation strategy.
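To make the effect of the normalised weights in \cref{eq:newinterpolform} concrete, consider a hypothetical set $\boldsymbol\varsigma$ consisting of one $n$-gram context $\mathbf{u}_1$ and two skipgram contexts $\mathbf{u}_2$ and $\mathbf{u}_3$. The composition of this set is an assumption for illustration only. The normalised weights $I(\mathbf{u}_m)/\sum_{\mathbf{x}\in\boldsymbol\varsigma}I(\mathbf{x})$ for \textsf{ngram}, \textsf{uni}, and \textsf{npref} are then:
\[
\begin{array}{lccc}
 & \mathbf{u}_1 & \mathbf{u}_2 & \mathbf{u}_3 \\
\text{\textsf{ngram}} & 1 & 0 & 0 \\
\text{\textsf{uni}} & 1/3 & 1/3 & 1/3 \\
\text{\textsf{npref}} & 2/4 & 1/4 & 1/4
\end{array}
\]
In each row the weights sum to 1, so the weighted sum over $\boldsymbol\varsigma$ is a convex combination of the terms for the individual contexts.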
\section{Experiments}
In this section we investigate three hypotheses, two of which are new. First we confirm\footnote{In line with \cref{chap:shpyplm}.} that skipgrams help reduce the perplexity in an intrinsic language model evaluation.\footnote{For the extrinsic counterpart in an automatic speech recognition experiment, we refer to the next chapter.} Second, we investigate whether we can see an additional effect of different interpolation factors and backoff strategies in a cross-domain setting, where the test set is sampled from a different text genre than the training data. And finally, we look in a more qualitative way at the effect of skipgrams.
\begin{figure}
\begin{tikzpicture}%[remember picture,overlay]
\node[right] (start) at (0,-0.5) {Worst performance};
\node[left] (end) at (\linewidth-\pgflinewidth,-0.5) {Best performance};
\node[] (mid) at ($(start)!0.5!(end)$) {mid};
\path[left color=worstclr!25, right color=bestclr!25,middle color=avgclr!25]
(0,0) rectangle ++(\linewidth-\pgflinewidth,1);
\end{tikzpicture}
\caption{Throughout this section we use these colours to highlight numbers, making them easier to compare. The range is based on both the \textcolor{worstclr!50}{worst performance} and the \textcolor{bestclr!50}{best performance}. We use a linear scale (even though for perplexity a log scale might be more appropriate).}
\label{fig:colourrange}
\end{figure}
\subsection{Skipgrams and perplexity reductions}
The first comparison is between \textsf{ngram} and \textsf{uni}, since these backoff strategies embody the difference between only $n$-gram features (\textsf{ngram}), and both $n$-gram and skipgram features (\textsf{uni}). We report the perplexities in \cref{tab:ngramsvsskipgrams}, and the relative difference in perplexity when choosing skipgrams\footnote{Read, skipgrams and $n$-grams. In our experiments we never use only skipgrams. We use this convention in the remainder of this thesis, except in cases where there might be some ambiguity otherwise.} over $n$-grams.
\npdecimalsign{.}
\nprounddigits{0}
\begin{table}[]
\centering
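\caption{Perplexities for \textsf{ngram} and \textsf{fulluni} per training and test set, with the relative difference in perplexity ($\Delta$\%) when choosing \textsf{fulluni} over \textsf{ngram}.}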
\label{tab:ngramsvsskipgrams}
\begin{tabular}{lllllllllllllll}
training & \multicolumn{4}{c}{\obw} & & \multicolumn{4}{c}{\emea} & & \multicolumn{4}{c}{\jrc} \\
test & \obw & \emea & \jrc & \wp
& & \obw & \emea & \jrc & \wp
& & \obw & \emea & \jrc & \wp \\ \cline{2-5}\cline{7-10}\cline{12-15}
\textsf{ngram} & \copr{obw}{obw}{129.47} & \copr{obw}{emea}{1123.89}
& \copr{obw}{jrc}{941.4} & \copr{obw}{wp}{456.27} &
& \copr{emea}{obw}{1761.34} & \copr{emea}{emea}{5.63033}
& \copr{emea}{jrc}{898} & \copr{emea}{wp}{1123.58} &
& \copr{jrc}{obw}{1520.1} & \copr{jrc}{emea}{1278.94}
& \copr{jrc}{jrc}{12.85} & \copr{jrc}{wp}{1249.28} \\
\textsf{fulluni} & \copr{obw}{obw}{124.69} & \copr{obw}{emea}{728.27}
& \copr{obw}{jrc}{728.98} & \copr{obw}{wp}{392.04}
& & \copr{emea}{obw}{1393.81} & \copr{emea}{emea}{5.6754}
& \copr{emea}{jrc}{773.116} & \copr{emea}{wp}{907.558} &
& \copr{jrc}{obw}{1303.66} & \copr{jrc}{emea}{1069.64}
& \copr{jrc}{jrc}{13.32} & \copr{jrc}{wp}{1067.99} \\
$\Delta$\% & \numprint{3.1} & \numprint{35.23} & \numprint{22.53} & \numprint{14.04}
& & \numprint{20.840} & \numprint{-0.800}
& \numprint{13.91982} & \numprint{19.217} &
& \numprint{14.21} & \numprint{16.34} & \numprint{-3.65} & \numprint{14.49} \\
\end{tabular}
\end{table}
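The $\Delta$\% row can be read as the relative perplexity reduction of \textsf{fulluni} with respect to \textsf{ngram}. The formula below is our reading of that row rather than a definition stated with the table. As a check, it reproduces the reported value for the \emea model tested on \emea:
\[
\Delta\% = 100 \cdot \frac{\mathrm{PPL}_{\text{ngram}} - \mathrm{PPL}_{\text{fulluni}}}{\mathrm{PPL}_{\text{ngram}}}, \qquad
100 \cdot \frac{5.63033 - 5.6754}{5.63033} \approx -0.80
\]
Under this reading a positive value denotes a perplexity reduction obtained by adding skipgrams, and a negative value an increase.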
\subsection{Interpolation between $n$-grams and skipgrams}
The previous results show that if we add skipgrams, we can reduce the perplexity. Since \textsf{uni} is a very naive prior weight, in this section we investigate the effect of adding weights as interpolation factors.
If we had enough training material, skipgrams might not be necessary, as all information would then be captured by the $n$-grams. This hypothesis suggests that $n$-grams carry more information, and that skipgrams are an additional help in cases where $n$-grams do not cover the encountered patterns.
An initial guess for $n$-gram preference was a ratio of 2:1, in favour of $n$-grams. The results for \textsf{fullnpref},\footnote{Unless otherwise noted for \textsf{fullnpref}, the preference ratio is 2.0.} shown in \cref{tab:fullunivsfullnpref2}, show reductions in perplexity of around 5\% compared to \textsf{fulluni}. Although in themselves these reductions are not that impressive, they come on top of the reductions in \cref{tab:ngramsvsskipgrams}.
\begin{table}[]
\centering
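\caption{Perplexities for \textsf{fulluni} and \textsf{fullnpref} (preference ratio 2.0) per training and test set, with the relative difference in perplexity ($\Delta$\%) when choosing \textsf{fullnpref} over \textsf{fulluni}.}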
\label{tab:fullunivsfullnpref2}
\begin{tabular}{lllllllllllllll}
training & \multicolumn{4}{c}{\obw} & & \multicolumn{4}{c}{\emea} & & \multicolumn{4}{c}{\jrc} \\
test & \obw & \emea & \jrc & \wp
& & \obw & \emea & \jrc & \wp
& & \obw & \emea & \jrc & \wp \\ \cline{2-5}\cline{7-10}\cline{12-15}
\textsf{fulluni} & \copr{obw}{obw}{124.69} & \copr{obw}{emea}{728.27}
& \copr{obw}{jrc}{728.98} & \copr{obw}{wp}{392.04} &
& \copr{emea}{obw}{1393.81} & \copr{emea}{emea}{5.6754}
& \copr{emea}{jrc}{773.116} & \copr{emea}{wp}{907.558} &
& \copr{jrc}{obw}{1303.66} & \copr{jrc}{emea}{1069.64}
& \copr{jrc}{jrc}{13.32} & \copr{jrc}{wp}{1067.99} \\
\textsf{fullnpref} & \copr{obw}{obw}{118.28} & \copr{obw}{emea}{699.91}
& \copr{obw}{jrc}{694.32} & \copr{obw}{wp}{372.06}
& & \copr{emea}{obw}{1305.9} & \copr{emea}{emea}{5.59}
& \copr{emea}{jrc}{704.94} & \copr{emea}{wp}{852.52} &
& \copr{jrc}{obw}{1215.52} & \copr{jrc}{emea}{1000.72}
& \copr{jrc}{jrc}{12.84} & \copr{jrc}{wp}{1000} \\
$\Delta$\% & \numprint{5.6} & \numprint{3.85} & \numprint{4.80} & \numprint{5.10}
& &\numprint{6.312769}&\numprint{1.5047}
&\numprint{8.796895}&\numprint{6.0572687}&
& \numprint{6.75} & \numprint{6.45} & \numprint{3.6} & \numprint{6.37} \\
\end{tabular}
\end{table}
Even with these positive results, it remains to be verified whether 2.0 is indeed the optimal value for \textsf{fullnpref}. In \cref{tab:nprefgrid} we show the results of a search for the lowest perplexity, in 25 steps in logarithmic space from 0.05 through 20.
\begin{table*}\resizebox{\columnwidth}{!}{%
\begin{tabular}{llllllllllllllllllllllllll}
\textsf{fullnpref} & 0.05 & 0.06 & 0.08 & 0.11 & 0.14 & 0.17 & 0.22 & 0.29 & 0.37 & 0.47 & 0.61 & 0.78 & 1 & 1.28 & 1.65 & 2.11 & 2.71 & 3.48 & 4.47 & 5.73 & 7.36 & 9.44 & 12.12 & 15.55 & 19.95 \\
\obw & \wtc{19.4295700804}\numprint{170.844} & \wtc{17.642782244}\numprint{168.288} & \wtc{14.6752883607}\numprint{164.043} & \wtc{11.2023767913}\numprint{159.075} & \wtc{8.46696959105}\numprint{155.162} & \wtc{6.21740650122}\numprint{151.944} & \wtc{3.19328905977}\numprint{147.618} & \btc{0.045438657812}\numprint{142.985} & \btc{2.85424676686}\numprint{138.967} & \btc{5.52114645229}\numprint{135.152} & \btc{8.27123383432}\numprint{131.218} & \btc{10.6641034603}\numprint{127.795} & \btc{12.8381684726}\numprint{124.685} & \btc{14.7165326809}\numprint{121.998} & \btc{16.3257602237}\numprint{119.696} & \btc{17.5567983223}\numprint{117.935} & \btc{18.4781544914}\numprint{116.617} & \btc{19.0772457183}\numprint{115.76} & \btc{19.38273331}\numprint{115.323} & \btc{19.4295700804}\numprint{115.256} & \btc{19.2590003495}\numprint{115.5} & \btc{18.9171618315}\numprint{115.989} & \btc{18.4445997903}\numprint{116.665} & \btc{17.8818594897}\numprint{117.47} & \btc{17.2645927997}\numprint{118.353} \\
\emea & \wtc{19.7839277582}\numprint{1041.55} & \wtc{17.6333258196}\numprint{1022.85} & \wtc{14.0817274739}\numprint{991.968} & \wtc{9.95820701901}\numprint{956.113} & \wtc{6.74000947645}\numprint{928.13} & \wtc{4.1170801496}\numprint{905.323} & \wtc{0.633565030982}\numprint{875.033} & \btc{3.03234873333}\numprint{843.157} & \btc{6.13979602633}\numprint{816.137} & \btc{9.01194216606}\numprint{791.163} & \btc{11.8682175535}\numprint{766.327} & \btc{14.2343397077}\numprint{745.753} & \btc{16.2455550393}\numprint{728.265} & \btc{17.8203246834}\numprint{714.572} & \btc{18.9673890542}\numprint{704.598} & \btc{19.6067043578}\numprint{699.039} & \btc{19.7839277582}\numprint{697.498} & \btc{19.5042345007}\numprint{699.93} & \btc{18.802701248}\numprint{706.03} & \btc{17.737405753}\numprint{715.293} & \btc{16.3469898473}\numprint{727.383} & \btc{14.7095422323}\numprint{741.621} & \btc{12.8258679461}\numprint{758.0} & \btc{10.8975715449}\numprint{774.767} & \btc{8.84000901643}\numprint{792.658} \\
\jrc & \wtc{20.9393339638}\numprint{1053.15} & \wtc{18.8263550482}\numprint{1034.75} & \wtc{15.30204402}\numprint{1004.06} & \wtc{11.1624427185}\numprint{968.012} & \wtc{7.90156503985}\numprint{939.616} & \wtc{5.22577581638}\numprint{916.315} & \wtc{1.647606793}\numprint{885.156} & \btc{2.14955412126}\numprint{852.09} & \btc{5.39688117422}\numprint{823.812} & \btc{8.42774272415}\numprint{797.419} & \btc{11.4793895558}\numprint{770.845} & \btc{14.0498743409}\numprint{748.461} & \btc{16.2861869171}\numprint{728.987} & \btc{18.1019707548}\numprint{713.175} & \btc{19.5143363897}\numprint{700.876} & \btc{20.4275107558}\numprint{692.924} & \btc{20.9017826537}\numprint{688.794} & \btc{20.9393339638}\numprint{688.467} & \btc{20.5781753394}\numprint{691.612} & \btc{19.8758395215}\numprint{697.728} & \btc{18.8791795211}\numprint{706.407} & \btc{17.6627237791}\numprint{717.0} & \btc{16.2760813658}\numprint{729.075} & \btc{14.7856273796}\numprint{742.054} & \btc{13.2386741746}\numprint{755.525} \\
\wp & \wtc{20.7199743005}\numprint{558.048} & \wtc{18.7042539314}\numprint{548.73} & \wtc{15.3451526338}\numprint{533.202} & \wtc{11.4047849022}\numprint{514.987} & \wtc{8.2998659864}\numprint{500.634} & \wtc{5.74982180193}\numprint{488.846} & \wtc{2.33296161413}\numprint{473.051} & \btc{1.30671376792}\numprint{456.226} & \btc{4.43607745748}\numprint{441.76} & \btc{7.37767067265}\numprint{428.162} & \btc{10.3692350625}\numprint{414.333} & \btc{12.9246873827}\numprint{402.52} & \btc{15.1911289267}\numprint{392.043} & \btc{17.0848417525}\numprint{383.289} & \btc{18.6278910542}\numprint{376.156} & \btc{19.7157916483}\numprint{371.127} & \btc{20.4145227915}\numprint{367.897} & \btc{20.7199743005}\numprint{366.485} & \btc{20.666541919}\numprint{366.732} & \btc{20.3035478452}\numprint{368.41} & \btc{19.677502047}\numprint{371.304} & \btc{18.8515715502}\numprint{375.122} & \btc{17.8729152989}\numprint{379.646} & \btc{16.7984268815}\numprint{384.613} & \btc{15.6705060825}\numprint{389.827} \\
\end{tabular}
}
\caption{The perplexity values for different \textsf{fullnpref} preference ratios with the \obw model. The 25 steps were sampled in a log space from $[10^{-1.3},10^{1.3}]$. The results show that \textsf{fullnpref-2.0} was indeed a good first guess, with optimal values between 2.71 and 5.73, depending on the test set.}
\label{tab:nprefgrid}
\end{table*}
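The preference ratios in \cref{tab:nprefgrid} are consistent with 25 equally spaced exponents on the interval $[-1.3, 1.3]$. Under that assumption, the $k$th ratio is
\[
r_k = 10^{-1.3 + \frac{2.6}{24}k}, \qquad k = 0, 1, \ldots, 24,
\]
so that $r_0 = 10^{-1.3} \approx 0.05$, $r_{12} = 10^{0} = 1$, and $r_{24} = 10^{1.3} \approx 19.95$. Ratios below 1 favour skipgrams over $n$-grams, and the ratio $r_{12} = 1$ coincides with \textsf{fulluni}.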
These results to some extent weaken the position of skipgrams, as the optimal preference for $n$-grams lies between a factor of roughly 3 and 6. Nonetheless, the skipgrams contribute to a lower perplexity,\footnote{See \cref{tab:ngramsvsskipgrams,tab:fullunivsfullnpref2}.} which could not be achieved with $n$-grams alone.
\subsection{Individual interpolation values per backoff step}
In the previous chapter we graphically introduced the backoff steps in \cref{fig:bof}. If we consider the directed edges in the tree going from one node to a smaller node\footnote{Where we measure the size of a node by the length of a pattern minus the number of skips.}, then for \textsf{uni} the edges are weighted 1, and for \textsf{npref} the edges $w_{d}$, $w_{cd}$, $w_{bcd}$, and $w_{abcd}$ have weight 2, with the others being 1.
In the following example in \cref{fig:value} we convert the graph into a tree for a 4-gram model, and we add the names of the backoff weights, corresponding to the terms introduced for \textsf{value} earlier this chapter.
In an attempt to find the optimal values, we only have to consider the weights that can be combined. For example, $w_d$ is never interpolated with another term, and since the weighted terms are normalised, its value does not matter. This leaves us with 6 unique weights: $w_{axcd}$, $w_{abxd}$, and $w_{bcd}$, for the first level; $w_{axxd}$, $w_{cd}$, and $w_{bxd}$ for the second level.
We limit the weights to integers from 0 through 10. We optimise one backoff weight at a time, setting it to the value that yields the lowest perplexity, after which we continue with the next weight. After a period of stagnation in finding a lower perplexity, we randomise all values to escape possible local minima.
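The search procedure can be summarised as a coordinate-wise minimisation. The notation below is ours and is only a sketch: $\mathbf{w} = (w_1,\ldots,w_6)$ denotes the six free weights, and $\mathrm{PPL}(\mathbf{w})$ the perplexity on the development set obtained with these weights.
\[
w_i \leftarrow \mathop{\mathrm{arg\,min}}_{v\in\{0,1,\ldots,10\}} \mathrm{PPL}(w_1,\ldots,w_{i-1},v,w_{i+1},\ldots,w_6), \qquad i = 1,\ldots,6
\]
Each update evaluates at most 11 candidate values for one weight while the other five are kept fixed. One pass over all weights therefore needs at most 66 perplexity computations, rather than the $11^6$ required for an exhaustive grid, at the cost of possibly ending in a local minimum.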
\input{figvalue}
%\multicolumn{2}{c}{\drawtwoboxes{3em}{4em}{1em}{red}{blue}}
%\multicolumn{2}{c}{\drawtwoboxes{3em}{4em}{1em}{red}{blue}}
\begin{table}[]
\centering
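\caption{Lowest perplexities found per test set and the corresponding \textsf{value} weights, with \obw as training set. For each test set the second row lists the perplexity reduction of \textsf{value} over \textsf{uni} and the relative weights per backoff step, both in percent.}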
\label{tab:obwvalues}
\begin{tabular}{llllllllllllll}
& \multicolumn{2}{c}{ppl} & \multicolumn{10}{c}{weights} & \\
& \multicolumn{2}{c}{ } & \multicolumn{3}{c}{\emph{abcd}} & \multicolumn{2}{c}{\emph{a\{1\}cd}} & \multicolumn{2}{c}{\emph{ab\{1\}d}} & \multicolumn{2}{c}{\emph{bcd}} & & \\ \cline{2-3}\cline{4-6}\cline{7-8}\cline{9-10}\cline{11-12}
test & \textsf{uni} & \textsf{value} & $w_{a\{1\}cd}$ & $w_{ab\{1\}d}$ & $w_{bcd}$ & $w_{a\{2\}d}$ & $w_{cd}$ & $w_{a\{2\}d}$ & $w_{b\{1\}d}$ & $w_{cd}$ & $w_{b\{1\}d}$ & $w_{d}$ & \\
\emea & \numprint{728.265} & \numprint{717.17} & 4 & 2 & 9 & 10 & 9 & 10 & 4 & 9 & 4 & 9 & \\
& \multicolumn{2}{c}{\numprint{1.523483897}} & \wtc{9}27 & \wtc{15}13 & \btc{4}60 & \btc{1}53 & \wtc{1}47 & \btc{8}71 & \wtc{8}29 & \btc{8}69 & \wtc{8}31 & \btc{20}100 & \% \\
\jrc & \numprint{728.987} & \numprint{687.015} & 4 & 2 & 10 & 7 & 9 & 7 & 3 & 9 & 3 & 9 & \\
& \multicolumn{2}{c}{\numprint{5.757578667}} & \wtc{10}25 & \wtc{15}13 & \btc{5}62 & \wtc{2}44 & \btc{2}56 & \btc{8}70 & \wtc{8}30 & \btc{10}75 & \wtc{10}25 & \btc{20}100 & \% \\
\obw & \numprint{124.685} & \numprint{113.711} & 4 & 2 & 9 & 2 & 9 & 2 & 1 & 9 & 1 & 9 & \\
& \multicolumn{2}{c}{\numprint{8.801379476}} & \wtc{11}27 & \wtc{15}13 & \btc{4}60 & \wtc{13}18 & \btc{13}82 & \btc{7}66 & \wtc{6}34 & \btc{16}90 & \wtc{16}10 & \btc{20}100 & \% \\
\wp & \numprint{392.043} & \numprint{363.846} & 2 & 1 & 5 & 3 & 9 & 3 & 2 & 6 & 2 & 4 & \\
& \multicolumn{2}{c}{\numprint{7.192323291}} & \wtc{10}25 & \wtc{15}13 & \btc{5}62 & \wtc{7}33 & \btc{6}66 & \btc{4}60 & \wtc{4}40 & \btc{10}75 & \wtc{10}25 & \btc{20}100 & \% \\
\end{tabular}
\end{table}
\begin{table}[]
\centering
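\caption{Lowest perplexities found per test set and the corresponding \textsf{value} weights, with \jrc as training set. For each test set the second row lists the perplexity reduction of \textsf{value} over \textsf{uni} and the relative weights per backoff step, both in percent.}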
\label{tab:jrcvalues}
\begin{tabular}{llllllllllllll}
& \multicolumn{2}{c}{ppl} & \multicolumn{10}{c}{weights} & \\
& \multicolumn{2}{c}{ } & \multicolumn{3}{c}{\emph{abcd}} & \multicolumn{2}{c}{\emph{a\{1\}cd}} & \multicolumn{2}{c}{\emph{ab\{1\}d}} & \multicolumn{2}{c}{\emph{bcd}} & & \\ \cline{2-3}\cline{4-6}\cline{7-8}\cline{9-10}\cline{11-12}
test & \textsf{uni} & \textsf{value} & $w_{a\{1\}cd}$ & $w_{ab\{1\}d}$ & $w_{bcd}$ & $w_{a\{2\}d}$ & $w_{cd}$ & $w_{a\{2\}d}$ & $w_{b\{1\}d}$ & $w_{cd}$ & $w_{b\{1\}d}$ & $w_{d}$ & \\
\emea & \numprint{1100.26} & \numprint{971.437} & 1 & 1 & 9 & 8 & 6 & 8 & 4 & 6 & 4 & 2 & \\
& \multicolumn{2}{c}{\numprint{11.70841437}} & \wtc{16}9 & \wtc{16}9 & \btc{13}82 & \btc{3}57 & \wtc{3}43 & \btc{7}67 & \wtc{7}33 & \btc{4}60 & \wtc{4}40 & \btc{20}100 & \% \\
\jrc & \numprint{12.588} & \numprint{11.6829} & 2 & 1 & 10 & 2 & 8 & 2 & 1 & 8 & 1 & 8 & \\
& \multicolumn{2}{c}{\numprint{7.190181125}} & \wtc{14}15 & \wtc{16}8 & \btc{11}77 & \wtc{12}20 & \btc{12}80 & \btc{7}67 & \wtc{7}33 & \btc{16}89 & \wtc{16}11 & \btc{20}100 & \% \\
\obw & \numprint{1329.83} & \numprint{1166.39} & 2 & 0.1 & 10 & 2 & 1 & 2 & 1 & 1 & 1 & 1 & \\
& \multicolumn{2}{c}{\numprint{12.29029274}} & \wtc{13}17 & \wtc{20}1 & \btc{13}82 & \wtc{7}67 & \btc{7}33 & \btc{7}67 & \wtc{7}33 & \btc{0}50 & \wtc{0}50 & \btc{20}100 & \% \\
\wp & \numprint{1079.26} & \numprint{954.277} & 2 & 0.1 & 10 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & \\
& \multicolumn{2}{c}{\numprint{11.58043474}} & \wtc{13}17 & \wtc{20}1 & \btc{13}82 & \wtc{0}50 & \btc{0}50 & \wtc{0}50 & \btc{0}50 & \wtc{0}50 & \btc{0}50 & \btc{20}100 & \% \\
\end{tabular}
\end{table}
\begin{table}[]
\centering
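\caption{Lowest perplexities found per test set and the corresponding \textsf{value} weights, with \emea as training set. For each test set the second row lists the perplexity reduction of \textsf{value} over \textsf{uni} and the relative weights per backoff step, both in percent.}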
\label{tab:emeavalues}
\begin{tabular}{llllllllllllll}
& \multicolumn{2}{c}{ppl} & \multicolumn{10}{c}{weights} & \\
& \multicolumn{2}{c}{ } & \multicolumn{3}{c}{\emph{abcd}} & \multicolumn{2}{c}{\emph{a\{1\}cd}} & \multicolumn{2}{c}{\emph{ab\{1\}d}} & \multicolumn{2}{c}{\emph{bcd}} & & \\ \cline{2-3}\cline{4-6}\cline{7-8}\cline{9-10}\cline{11-12}
test & \textsf{uni} & \textsf{value} & $w_{a\{1\}cd}$ & $w_{ab\{1\}d}$ & $w_{bcd}$ & $w_{a\{2\}d}$ & $w_{cd}$ & $w_{a\{2\}d}$ & $w_{b\{1\}d}$ & $w_{cd}$ & $w_{b\{1\}d}$ & $w_{d}$ & \\
\emea & \numprint{5.66484} & \numprint{5.50167} & 2 & 1 & 10 & 2 & 8 & 2 & 1 & 8 & 1 & 8 & \\
& \multicolumn{2}{c}{\numprint{2.880399093}} & \wtc{14}15 & \wtc{16}8 & \btc{11}77 & \wtc{12}20 & \btc{12}80 & \btc{7}67 & \wtc{7}33 & \btc{16}89 & \wtc{16}11 & \btc{20}100 & \% \\
\jrc & \numprint{762.331} & \numprint{630.976} & 1 & 1 & 10 & 8 & 6 & 8 & 5 & 6 & 5 & 2 & \\
& \multicolumn{2}{c}{\numprint{17.23070425}} & \wtc{17}8 & \wtc{17}8 & \btc{15}88 & \btc{3}57 & \wtc{3}43 & \btc{5}62 & \wtc{5}38 & \btc{2}55 & \wtc{2}45 & \btc{20}100 & \% \\
\obw & \numprint{1389.33} & \numprint{1217.06} & 2 & 0.1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & \\
& \multicolumn{2}{c}{\numprint{12.39950192}} & \btc{6}65 & \wtc{19}3 & \wtc{7}32 & \wtc{0}50 & \btc{0}50 & \wtc{0}50 & \btc{0}50 & \wtc{0}50 & \btc{0}50 & \btc{20}100 & \% \\
\wp & \numprint{899.598} & \numprint{798.043} & 2 & 0.1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & \\
& \multicolumn{2}{c}{\numprint{11.28893128}} & \btc{6}65 & \wtc{19}3 & \wtc{7}32 & \wtc{0}50 & \btc{0}50 & \wtc{0}50 & \btc{0}50 & \wtc{0}50 & \btc{0}50 & \btc{20}100 & \% \\
\end{tabular}
\end{table}
We report the findings in \cref{tab:obwvalues}. We use \obw as training set, and report the lowest perplexities found on development sets for \emea, \jrc, \wp, and within-domain \obw. On the right side of the table, we list the weight values for the \textsf{value} strategy with the lowest perplexity. For each weight we also report its relative weight within a particular backoff step. For example, in the case of \jrc, the pattern \emph{abcd} can back off to three patterns: \emph{a\{1\}cd}, \emph{ab\{1\}d}, and \emph{bcd}. These steps have been assigned the weights 4, 2, and 10, respectively. During the search for the best weights, we only kept track of the first time the lowest perplexity was found. Other sets of weights with the same relative distribution per backoff step yielded the same perplexity: since the weight values are normalised, only their relative weight is important. This is why it does not make a difference whether the value for \emph{d} is 4 or 9.
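For the \jrc example, the relative weights follow from dividing each weight by the sum of the weights for that backoff step:
\[
\frac{4}{4+2+10} = 0.25, \qquad \frac{2}{4+2+10} = 0.125, \qquad \frac{10}{4+2+10} = 0.625,
\]
which, up to rounding, are the 25\%, 13\%, and 62\% listed in \cref{tab:obwvalues}. Multiplying all three weights by the same constant leaves these proportions unchanged.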
From \cref{tab:obwvalues} the first thing that jumps out is that for \emph{abcd} the relative weights estimated for all 4 development sets are almost the same. But this is not the case for the three other steps \emph{a\{1\}cd}, \emph{ab\{1\}d}, and \emph{bcd}.
If we do not read the table from left to right, but from top to bottom, we notice that there seems to be a distinction between within-domain behaviour, and cross-domain behaviour.
\begin{table}[]
\centering
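\caption{Perplexities per training and test set for \textsf{uni}, and for \textsf{fullvalue} with the weights learned on each of the four development sets.}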
\label{tab:allvalues}
\begin{tabular}{lllllllllllllll}
training & \multicolumn{4}{c}{\obw} & & \multicolumn{4}{c}{\emea} & & \multicolumn{4}{c}{\jrc} \\
test & \obw & \emea & \jrc & \wp
& & \obw & \emea & \jrc & \wp
& & \obw & \emea & \jrc & \wp \\ \cline{2-5}\cline{7-10}\cline{12-15}
\textsf{uni} & \copr{obw}{obw}{124.685} & \copr{obw}{emea}{728.265}
& \copr{obw}{jrc}{728.987} & \copr{obw}{wp}{392.043} &
& \copr{emea}{obw}{1393.81} & \copr{emea}{emea}{5.6754}
& \copr{emea}{jrc}{773.116} & \copr{emea}{wp}{908} &
& \copr{jrc}{obw}{1303.66} & \copr{jrc}{emea}{1069.64}
& \copr{jrc}{jrc}{13.32} & \copr{jrc}{wp}{1067.99} \\
\obw-\textsf{fullvalue} & \copr{obw}{obw}{114.537} & \copr{obw}{emea}{712.609}
& \copr{obw}{jrc}{694.436} & \copr{obw}{wp}{365.706}
& & \copr{emea}{obw}{1212.13} & \copr{emea}{emea}{5.56569}
& \copr{emea}{jrc}{655.143} & \copr{emea}{wp}{655.143} &
& \copr{jrc}{obw}{1155.22} & \copr{jrc}{emea}{950.893}
& \copr{jrc}{jrc}{12.6641} & \copr{jrc}{wp}{949.983} \\
\emea-\textsf{fullvalue} & \copr{obw}{obw}{115.966} & \copr{obw}{emea}{692.109}
& \copr{obw}{jrc}{685.726} & \copr{obw}{wp}{366.04}
& & \copr{emea}{obw}{1221.16} & \copr{emea}{emea}{5.55541}
& \copr{emea}{jrc}{650.849} & \copr{emea}{wp}{804.805} &
& \copr{jrc}{obw}{1234.75} & \copr{jrc}{emea}{1021.2}
& \copr{jrc}{jrc}{12.4544} & \copr{jrc}{wp}{1019.34} \\
\jrc-\textsf{fullvalue} & \copr{obw}{obw}{115.186} & \copr{obw}{emea}{694}
& \copr{obw}{jrc}{684.972} & \copr{obw}{wp}{364.5}
& & \copr{emea}{obw}{1372.8} & \copr{emea}{emea}{5.52968}
& \copr{emea}{jrc}{708.803} & \copr{emea}{wp}{890.016} &
& \copr{jrc}{obw}{1155.73} & \copr{jrc}{emea}{948.762}
& \copr{jrc}{jrc}{12.6653} & \copr{jrc}{wp}{951.25} \\
\wp-\textsf{fullvalue} & \copr{obw}{obw}{115.009} & \copr{obw}{emea}{696.297}
& \copr{obw}{jrc}{685.437} & \copr{obw}{wp}{316.727}
& & \copr{emea}{obw}{1211.78} & \copr{emea}{emea}{5.56345}
& \copr{emea}{jrc}{653.655} & \copr{emea}{wp}{653.655} &
& \copr{jrc}{obw}{1153.54} & \copr{jrc}{emea}{950.737}
& \copr{jrc}{jrc}{12.6445} & \copr{jrc}{wp}{949.004} \\
\end{tabular}
\end{table}
If we are concerned with cross-domain generalisability, we do not want to optimise the parameters for every possible set. According to \cref{tab:allvalues}, the parameters learned for \wp-\textsf{fullvalue} seem to be effective on all three training sets, for almost all test sets (8 out of 12). On every set where they are not the best-performing parameters, they are a close second, with a difference of at most 4 points in perplexity (0.6\%).
\subsection{Interpolation weights with contextual knowledge}
In contrast to parameters based on heuristics, or parameters learned on a development set, we can also use the training corpus to estimate certain contextual knowledge. Here we investigate four such examples: \textsf{mle}, \textsf{count}, \textsf{ent}, and \textsf{ppl}. The results are reported in \cref{tab:contextbasedinterpol}.
\begin{table}[]
\centering
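\caption{Perplexities for the \textsf{full} backoff strategy with the context-based interpolation weights \textsf{mle}, \textsf{count}, \textsf{ent}, and \textsf{ppl}, compared with \textsf{fulluni}, per training set and test set.}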
\label{tab:contextbasedinterpol}
\begin{tabular}{lllllllllllllll}
training & \multicolumn{4}{c}{\obw} & & \multicolumn{4}{c}{\emea} & & \multicolumn{4}{c}{\jrc} \\
test & \obw & \emea & \jrc & \wp
& & \obw & \emea & \jrc & \wp
& & \obw & \emea & \jrc & \wp \\ \cline{2-5}\cline{7-10}\cline{12-15}
\textsf{fulluni} & \copr{obw}{obw}{124.685} & \copr{obw}{emea}{728.265}
& \copr{obw}{jrc}{728.987} & \copr{obw}{wp}{392.043} &
& \copr{emea}{obw}{1393.81} & \copr{emea}{emea}{5.6754}
& \copr{emea}{jrc}{773.116} & \copr{emea}{wp}{908} &
& \numprint{1303.66} & \numprint{1069.64}
& \numprint{13.32} & \numprint{1067.99} \\
\textsf{fullmle} & \copr{obw}{obw}{125.17} & \numprint{000}
& \numprint{000} & \numprint{000}
& & \copr{emea}{obw}{1931.25} & \copr{emea}{emea}{5.63}
& \copr{emea}{jrc}{1015.46} & \copr{emea}{wp}{1225.27} &
& \copr{jrc}{obw}{1535.75} & \copr{jrc}{emea}{1244.74}
& \numprint{000} & \numprint{000} \\
\textsf{fullcount} & \copr{obw}{obw}{122.086} & \copr{obw}{emea}{893.166}
& \copr{obw}{jrc}{885.283} & \copr{obw}{wp}{421.195}
& & \copr{emea}{obw}{1681.37} & \copr{emea}{emea}{5.61967}
& \copr{emea}{jrc}{888.956} & \copr{emea}{wp}{1075.4} &
& \copr{jrc}{obw}{1436.12} & \copr{jrc}{emea}{1168.68}
& \copr{jrc}{jrc}{12.8619} & \copr{jrc}{wp}{1192.74} \\
\textsf{fullent} & \copr{obw}{obw}{132.26} & \copr{obw}{emea}{794.05}
& \copr{obw}{jrc}{791.69} & \copr{obw}{wp}{434.24}
& & \copr{emea}{obw}{1552.49} & \copr{emea}{emea}{5.69}
& \copr{emea}{jrc}{880.78} & \copr{emea}{wp}{1032.07} &
& \copr{jrc}{obw}{1453.86} & \copr{jrc}{emea}{1179.18}
& \copr{jrc}{jrc}{13.4475} & \copr{jrc}{wp}{1197.05} \\
\textsf{fullppl} & \copr{obw}{obw}{157.065} & \copr{obw}{emea}{1002.24}
& \copr{obw}{jrc}{1027.3} & \copr{obw}{wp}{555.01}
& & \copr{emea}{obw}{2007.03} & \copr{emea}{emea}{5.82737}
& \copr{emea}{jrc}{1217.94} & \copr{emea}{wp}{1329.48} &
& \copr{jrc}{obw}{1868.78} & \copr{jrc}{emea}{1475.07}
& \copr{jrc}{jrc}{14.2414} & \copr{jrc}{wp}{1544.06} \\
$\Delta$\% & \numprint{000} & \numprint{000} & \numprint{000} & \numprint{000}
& & \numprint{000} & \numprint{000}
& \numprint{000} & \numprint{000} &
& \numprint{000} & \numprint{000} & \numprint{000} & \numprint{000} \\
\end{tabular}
\end{table}
\subsection{Random interpolation weights}
As a sanity check we have also implemented a random interpolation weight, \textsf{random}. The weights are normally distributed between 0 and 1. The results are reported in \cref{tab:randominterpol}.
\begin{table}[]
\centering
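\caption{Perplexities for the \textsf{ngram} backoff strategy and for the \textsf{full} backoff strategy with random interpolation weights, per training set and test set.}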
\label{tab:randominterpol}
\begin{tabular}{lllllllllllllll}
training & \multicolumn{4}{c}{\obw} & & \multicolumn{4}{c}{\emea} & & \multicolumn{4}{c}{\jrc} \\
test & \obw & \emea & \jrc & \wp
& & \obw & \emea & \jrc & \wp
& & \obw & \emea & \jrc & \wp \\ \cline{2-5}\cline{7-10}\cline{12-15}
\textsf{ngram} & \copr{obw}{obw}{129.47} & \copr{obw}{emea}{1123.89}
& \copr{obw}{jrc}{941.4} & \copr{obw}{wp}{456.27} &
& \copr{emea}{obw}{1761.34} & \copr{emea}{emea}{5.63033}
& \copr{emea}{jrc}{898} & \copr{emea}{wp}{1123.58} &
& \copr{jrc}{obw}{1520.1} & \copr{jrc}{emea}{1278.94}
& \copr{jrc}{jrc}{12.85} & \copr{jrc}{wp}{1249.28} \\
\textsf{fullrandom} & \copr{obw}{obw}{129.713} & \copr{obw}{emea}{769.142}
& \copr{obw}{jrc}{769.019} & \copr{obw}{wp}{411.774}
& & \copr{emea}{obw}{1483.92} & \copr{emea}{emea}{5.72414}
& \copr{emea}{jrc}{826.277} & \copr{emea}{wp}{961.939} &
& \copr{jrc}{obw}{1372.32} & \copr{jrc}{emea}{1119.66}
& \copr{jrc}{jrc}{13.5574} & \copr{jrc}{wp}{1122.53} \\
$\Delta$\% & \numprint{000} & \numprint{000} & \numprint{000} & \numprint{000}
& & \numprint{000} & \numprint{000}
& \numprint{000} & \numprint{000} &
& \numprint{000} & \numprint{000} & \numprint{000} & \numprint{000} \\
\end{tabular}
\end{table}
\subsection{\textsf{full} backoff versus \textsf{lim}ited backoff strategies}
In the previous chapter we saw that with the discount-based \textsf{lim}ited backoff strategy there was a clear difference between testing on within-domain and on cross-domain data, in favour of the within-domain setting. We argued that this was the case because \textsf{lim} stops the backoff procedure once it encounters a pattern that has also been seen in whole in the training data, and that for already seen patterns the estimated probability is better than a combination of estimated patterns all the way down to the uniform probabilities.
With the \textsf{full} and \textsf{lim}ited backoff strategies in this chapter, however, we do not see this effect. An overview of the perplexities is given in \cref{tab:limperplexities}. The colours, ranging from white through blue, indicate whether a perplexity is among the best, average, or among the worst for that training-test combination.
\begin{table}[]
\centering
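\caption{Perplexities for the \textsf{ngram} backoff strategy and for the \textsf{lim}ited backoff strategy with each of the interpolation weights, per training set and test set.}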
\label{tab:limperplexities}
\begin{tabular}{lllllllllllllll}
training & \multicolumn{4}{c}{\obw} & & \multicolumn{4}{c}{\emea} & & \multicolumn{4}{c}{\jrc} \\
test & \obw & \emea & \jrc & \wp
& & \obw & \emea & \jrc & \wp
& & \obw & \emea & \jrc & \wp \\ \cline{2-5}\cline{7-10}\cline{12-15}
\textsf{ngram} & \copr{obw}{obw}{129.47} & \copr{obw}{emea}{1123.89}
& \copr{obw}{jrc}{941.4} & \copr{obw}{wp}{456.27} &
& \copr{emea}{obw}{1761.34} & \copr{emea}{emea}{5.63033}
& \copr{emea}{jrc}{898} & \copr{emea}{wp}{1123.58} &
& \copr{jrc}{obw}{1520.1} & \copr{jrc}{emea}{1278.94}
& \copr{jrc}{jrc}{12.85} & \copr{jrc}{wp}{1249.28} \\
\textsf{limuni} & \copr{obw}{obw}{134.17} & \copr{obw}{emea}{758.54} %limuni done
& \copr{obw}{jrc}{755.7} & \copr{obw}{wp}{406.31} &
& \copr{emea}{obw}{1421.99} & \copr{emea}{emea}{5.9}
& \copr{emea}{jrc}{793.02} & \copr{emea}{wp}{925.72} &
& \copr{jrc}{obw}{1353.05} & \copr{jrc}{emea}{1112.07}
& \copr{jrc}{jrc}{14.34} & \copr{jrc}{wp}{1103.96} \\
\textsf{limnpref} & \copr{obw}{obw}{128.32} & \copr{obw}{emea}{732.86} %limnpref done
& \copr{obw}{jrc}{723.26} & \copr{obw}{wp}{387.39} &
& \copr{emea}{obw}{1339.55} & \copr{emea}{emea}{5.83}
& \copr{emea}{jrc}{727.58} & \copr{emea}{wp}{874.17} &
& \copr{jrc}{obw}{1271.47} & \copr{jrc}{emea}{1048.3}
& \copr{jrc}{jrc}{13.89} & \copr{jrc}{wp}{1041.44} \\
% \textsf{limmle} & \copr{obw}{obw}{138.388} & \copr{obw}{emea}{1027.84}
% & \copr{obw}{jrc}{993.144} & \copr{obw}{wp}{465.52}
% & & \copr{emea}{obw}{bbb} & \copr{emea}{emea}{bbb}
% & \copr{emea}{jrc}{bbb} & \copr{emea}{wp}{bbb} &
% & \copr{jrc}{obw}{ccc} & \copr{jrc}{emea}{ccc}
% & \copr{jrc}{jrc}{ccc} & \copr{jrc}{wp}{ccc} \\
\textsf{limcount} & \copr{obw}{obw}{133.354} & \copr{obw}{emea}{941.565}
& \copr{obw}{jrc}{927.673} & \copr{obw}{wp}{441.112} &
& \copr{emea}{obw}{1745.28} & \copr{emea}{emea}{5.85979}
& \copr{emea}{jrc}{928.113} & \copr{emea}{wp}{1114.12} &
& \copr{jrc}{obw}{1528.67} & \copr{jrc}{emea}{1243.3}
& \copr{jrc}{jrc}{13.949} & \copr{jrc}{wp}{1260.12} \\
\textsf{liment} & \copr{obw}{obw}{143.67} & \copr{obw}{emea}{832.28}
& \copr{obw}{jrc}{824.78} & \copr{obw}{wp}{452.52} &
& \copr{emea}{obw}{1583.12} & \copr{emea}{emea}{5.96}
& \copr{emea}{jrc}{903.881} & \copr{emea}{wp}{1052.99} &
& \copr{jrc}{obw}{1508.13} & \copr{jrc}{emea}{1228.23}
& \copr{jrc}{jrc}{14.6535} & \copr{jrc}{wp}{1238.09} \\
\textsf{limppl} & \copr{obw}{obw}{172.141} & \copr{obw}{emea}{1055.32}
& \copr{obw}{jrc}{1074.87} & \copr{obw}{wp}{850.723} &
& \copr{emea}{obw}{2049.38} & \copr{emea}{emea}{6.13118}
& \copr{emea}{jrc}{1251.99} & \copr{emea}{wp}{1358.52} &
& \copr{jrc}{obw}{1945.12} & \copr{jrc}{emea}{1543.46}
& \copr{jrc}{jrc}{15.6463} & \copr{jrc}{wp}{1602.42} \\
\textsf{limrandom} & \copr{obw}{obw}{139.896} & \copr{obw}{emea}{804.404}
& \copr{obw}{jrc}{799.865} & \copr{obw}{wp}{427.539}
& & \copr{emea}{obw}{1522.77} & \copr{emea}{emea}{5.95858}
& \copr{emea}{jrc}{854.708} & \copr{emea}{wp}{985.087} &
& \copr{jrc}{obw}{1433.02} & \copr{jrc}{emea}{1177.73}
& \copr{jrc}{jrc}{14.611} & \copr{jrc}{wp}{1163.32} \\
\end{tabular}
\end{table}
\subsection{A qualitative analysis of the contribution of skipgrams}
%\section{Experiments}
%We train 4-gram language model on the two training corpora, the Google 1 billion word benchmark and the Mediargus corpus.\footnote{See~\cref{sec:data} for a description of the corpora.} We do not perform any preprocessing on the data except tokenisation.
% %The models are trained with a HPYLM. We do not use sentence beginning and end markers. The results for the {\sf ngram} backoff strategy are obtained by training without skipgrams; for {\sf limited} and {\sf full} we added skipgram features during training.
%
%When setting up the experimental framework, we had to decide on the basis. Earlier work on hierarchical Pitman-Yor language models by Huang and Renals had accompanying software releases. An SRILM extension with HPYPLM was proposed in \autocite{huang2007hierarchical}, and a frequentist approximation extension of the HPYPLM was described in \autocite{huang2010power}. However, at the time I started this thesis, they were no longer accessible. With further inquiries we learned that also none of the source code has survived during the period.
%
%We found an alternative in cpyp,\footnote{\url{https://github.com/redpony/cpyp}} which is an existing library for non-parametric Bayesian modelling with PY priors with histogram-based sampling \cite{blunsom2009note}. This library has an example application to showcase its performance with $n$-gram based language modelling. Limitations of the library, such as not natively supporting skipgrams, and the lack of other functionality such as thresholding and discarding of certain patterns, led us to extend the library with Colibri Core,\footnote{\url{http://proycon.github.io/colibri-core/}} a pattern modelling library. Colibri Core resolves the limitations, and together the libraries are a complete language model that handles skipgrams: cococpyp.\footnote{\url{https://github.com/naiaden/cococpyp}} This software in turn has been rewritten to allow also for reranking nbest lists, and being more in control of the underlying language model. We gave it the name SLM, for skipgram language model.\footnote{\url{https://github.com/naiaden/SLM}} Throughout the rest of the thesis the reported results were obtained with SLM.
%
% Each model is run for 50 iterations (without an explicit burn-in phase), with the initial values for hyperparameters $\theta=1.0$ and $\gamma=0.8$. The hyperparameters are resampled every 30 iterations with slice sampling \cite{walker2007sampling}.
%
% \textbf{Plot van dalende ppl over iteraties, effect resampling?}
%
% We test each model on different test sets, and we collect their intrinsic performance by means of perplexity. We compute the perplexity on all 4-grams, rather than computing the perplexity for sentences.
% Words in the test set that were unseen in the training data are ignored in computing the perplexity on test data.\footnote{This is common for perplexity. }
\subsection{PPL}
\subsection{Learning curves}
\section{Results}
\section{Discussion}