\section{Experiments}
\label{sec:experiments}
To demonstrate \meta/'s effectiveness, we evaluate its performance on tasks in
three areas: natural language processing (NLP), information retrieval (IR), and
machine learning (ML). All experiments were performed on a workstation with an
Intel(R) Core(TM) i7-5820K CPU, 16 GB of RAM, and a 4 TB 5900 RPM disk.
\input{nlp-parsing}
\input{nlp-pos}
\input{ir-datasets}
\input{ir-indexing}
\input{ir-index-size}
\input{ir-query-speed}
\input{ir-map}
\input{ml-datasets}
\input{ml-exp}
\subsection{Natural Language Processing}
\meta/ provides two part-of-speech taggers for English: a linear-chain CRF
tagger (CRF) and a greedy tagger based on an averaged perceptron (AP); both
achieve competitive accuracy. Table~\ref{table:nlp-pos} reports token-level
accuracy on sections 22--24 of the Penn Treebank, along with results of a few
prior models trained on sections 0--18. ``Human
annotators'' is an estimate based on a 3\% error rate reported in the Penn
Treebank README and is likely overly
optimistic~\citep{Manning:2011:CICLing}. CoreNLP's model is the result of
\citet{Manning:2011:CICLing}, LTag-Spinal is from
\citet{Shen:2007:ACL}, and SCCN is from \citet{Sogaard:2011:ACL-HLT}.
Both of \meta/'s taggers are within $0.6\%$ of the results reported in the
existing literature.
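
To make the tagging setup concrete, the following is a minimal sketch of how a
greedy perceptron-style tagger assigns tags left to right; the feature
templates and weights are illustrative placeholders and do not correspond to
\meta/'s actual implementation.
\begin{verbatim}
# Minimal sketch of greedy perceptron-style POS tagging.
# Feature templates and weights are illustrative, not MeTA's actual ones.
from collections import defaultdict

def features(words, i, prev_tag):
    """Simple contextual features for the token at position i."""
    return ["word=" + words[i],
            "prev_tag=" + prev_tag,
            "suffix3=" + words[i][-3:]]

def greedy_tag(words, weights, tagset):
    """Assign tags left to right, committing to the best-scoring tag."""
    tags, prev_tag = [], "<s>"
    for i in range(len(words)):
        feats = features(words, i, prev_tag)
        best = max(tagset, key=lambda t: sum(weights[(f, t)] for f in feats))
        tags.append(best)
        prev_tag = best
    return tags

# Toy usage with hand-set weights (illustration only).
weights = defaultdict(float)
weights[("word=the", "DT")] = 1.0
weights[("word=dog", "NN")] = 1.0
weights[("prev_tag=DT", "NN")] = 0.5
print(greedy_tag(["the", "dog"], weights, ["DT", "NN", "VB"]))  # ['DT', 'NN']
\end{verbatim}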
Both \meta/ and CoreNLP provide implementations of shift-reduce
constituency parsers, following the framework of \citet{const-parsing}.
These can be trained greedily or via beam search. We compared the parser
implementations in \meta/ and CoreNLP along two dimensions---speed,
measured in wall time, and memory consumption, measured as maximum resident
set size---for both training and testing a greedy and a beam search parser
(with a beam size of 4). Training was performed on the standard training
split (sections 2--21) of the Penn Treebank, with section 22 used as
a development set (used only by CoreNLP). Section 23 was held out for
evaluation. The results are summarized in Table~\ref{table:nlp-parsing}.
\meta/ consistently uses less RAM than CoreNLP, both at training
time and testing time. Its training time is slower than CoreNLP's
for the greedy parser, but less than half of CoreNLP's training time for
the beam parser. \meta/'s beam parser has a worse labeled $F_1$ score, likely
the result of its simpler model averaging strategy\footnote{At training
time, both CoreNLP and \meta/ perform model averaging, but \meta/
computes the average over all updates, whereas CoreNLP performs
cross-validation over the best 8 models (by default) on the development
set.}. Overall, however, \meta/'s shift-reduce parser is competitive and
particularly lightweight.
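
As an illustration of the averaging scheme mentioned in the footnote, the
sketch below shows the generic ``average over all updates'' strategy for a
binary perceptron; it is not \meta/'s parser code (which averages
structured-perceptron weights), and CoreNLP's development-set model selection
is not shown.
\begin{verbatim}
# Generic sketch of "average over all updates" for a perceptron-style model.
# Illustrates the averaging idea only; not MeTA's implementation.
import numpy as np

def averaged_perceptron_train(X, y, n_epochs=5):
    """Binary perceptron returning weights averaged over every update step."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    w_sum = np.zeros(n_features)    # running sum of the weight vector
    steps = 0
    for _ in range(n_epochs):
        for xi, yi in zip(X, y):    # yi in {-1, +1}
            if yi * (w @ xi) <= 0:  # mistake-driven update
                w += yi * xi
            w_sum += w
            steps += 1
    return w_sum / steps            # final model: average over all updates

# Toy usage on a linearly separable problem.
X = np.array([[1.0, 1.0], [2.0, 1.5], [-1.0, -1.0], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
w_avg = averaged_perceptron_train(X, y)
print(np.sign(X @ w_avg))           # [ 1.  1. -1. -1.]
\end{verbatim}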
\subsection{Information Retrieval}
We compare \meta/'s IR performance with two well-known search engine toolkits:
\textsc{Lucene}, version 5.5.0 (the latest
release)\footnote{\url{http://lucene.apache.org/}}, and \textsc{Indri},
version 5.9~\citep{lemur}\footnote{Indri 5.10 does not provide source code
packages and thus could not be used. Indri is also known as \textsc{Lemur}.}.
We use the TREC blog06~\citep{blog06} permalink documents and the TREC gov2
corpus~\citep{gov2}. To ensure a more uniform indexing environment, all
HTML is cleaned before indexing. In addition, each corpus is converted into
a single file with one document per line to reduce the overhead of many small
file operations.
During indexing, terms are lower-cased, stop words are removed using a common
list of 431 stop words, Porter2 (\meta/) or Porter (Indri, Lucene) stemming is
applied, a maximum word length of 32 characters is enforced, original documents
are not stored in the index, and term position information is not
stored\footnote{For Indri, we were unable to disable the storage of position
information.}.
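
As a rough sketch of this analysis chain (the stop word list and stemmer below
are placeholders, not the exact resources or APIs of \meta/, Lucene, or
Indri), the per-token processing looks roughly like the following:
\begin{verbatim}
# Rough sketch of the indexing-time token filter chain.
# The stop word list and stemmer are placeholders, not the real resources.
STOP_WORDS = {"the", "a", "an", "of", "and"}   # stand-in for the 431-word list
MAX_WORD_LENGTH = 32

def stem_placeholder(token):
    """Placeholder for Porter/Porter2 stemming (naive suffix stripping)."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[:-len(suffix)]
    return token

def analyze(text):
    """Lowercase, drop stop words and overlong tokens, then stem."""
    terms = []
    for token in text.split():
        token = token.lower()
        if token in STOP_WORDS or len(token) > MAX_WORD_LENGTH:
            continue
        terms.append(stem_placeholder(token))
    return terms

print(analyze("The indexing of documents and queries"))
# -> ['index', 'document', 'querie']  (placeholder stemmer output)
\end{verbatim}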
We compare indexing speed (Table~\ref{table:ir-indexing}), index size
(Table~\ref{table:ir-index-size}), query speed
(Table~\ref{table:ir-query-speed}), and retrieval accuracy with BM25
($k_1 = 0.9$, $b = 0.4$) (Table~\ref{table:ir-map}). We use the standard TREC
queries associated with each dataset and score each system's results with the
standard \texttt{trec\_eval}
program\footnote{\url{http://trec.nist.gov/trec_eval/}}.
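
For reference, BM25 in its standard form (modulo each system's IDF variant)
scores a document $D$ for a query $Q$ as
\[
\mathrm{BM25}(D, Q) = \sum_{t \in Q} \mathrm{IDF}(t) \cdot
\frac{f(t, D)\,(k_1 + 1)}{f(t, D) + k_1\left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)},
\]
where $f(t, D)$ is the frequency of term $t$ in $D$, $|D|$ is the length of
$D$, and $\mathrm{avgdl}$ is the average document length in the collection; we
set $k_1 = 0.9$ and $b = 0.4$ for all three systems.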
\meta/ leads in indexing speed, though we note that \meta/'s default
indexer is multithreaded while \textsc{Lucene} does not provide a parallel
one\footnote{Additionally, we do not think it is reasonable to expect a user
to write a correct, thread-safe indexer themselves.}. \meta/ creates the
smallest index for gov2 while \textsc{Lucene} creates the smallest index for
blog06; \textsc{Indri} greatly lags behind both. \meta/ follows \textsc{Lucene}
closely in retrieval speed, with \textsc{Indri} again lagging. As expected,
retrieval accuracy is relatively even across the three systems, and we
attribute any small differences in MAP or precision to idiosyncrasies in
tokenization.
\subsection{Machine Learning}
\meta/'s ML performance is compared with \textsc{liblinear}~\citep{liblinear},
\textsc{scikit-learn}~\citep{scikit}, and
\textsc{SVMMulticlass}\footnote{\url{http://www.cs.cornell.edu/people/tj/svm_light/svm_multiclass.html}}. We focus on linear SVM classification
across these tools (MALLET~\citep{mallet} does not provide an SVM, so it
is excluded from the comparison).
Statistics for the four datasets used can be found in
Table~\ref{table:ml-datasets}.
The 20news
dataset~\citep{Lang:1995:ICML}\footnote{\url{http://qwone.com/~jason/20Newsgroups/}}
is split into its standard 60\% training and 40\% testing sets by post
date. The Blog dataset~\citep{blog} is split randomly into 80\% training and
20\% testing. Both textual datasets were preprocessed with \meta/ using the
same settings as the IR experiments.
The rcv1 dataset~\citep{rcv1} was processed into a training and testing set
using the \texttt{prep\_rcv1} tool provided with Leon Bottou's SGD
tool\footnote{\url{http://leon.bottou.org/projects/sgd}}. The resulting
training set has 781,265 documents, and the testing set 23,149. The Webspam
corpus~\citep{Webb:2006:CEAS} consists of the subset of the Webb Spam Corpus
used in the Pascal Large Scale Learning
Challenge\footnote{\url{ftp://largescale.ml.tu-berlin.de/largescale/}}.
The corpus was processed into byte trigrams using the provided
\texttt{convert.py} script. The first 80\% of the resulting file is used for
training and the
last 20\% for testing.
Table~\ref{table:ml-exp} shows that \meta/ performs well in terms of both
speed and accuracy. Both \textsc{liblinear} and
\textsc{SVMMulticlass} were unable to produce models on the Webspam dataset
due to memory limitations and their lack of a minibatch framework. For
\textsc{scikit-learn} and \meta/, we broke the training data into four
equal-sized batches and ran one iteration of SGD per batch. The timing results
include the time to load each chunk into memory; for \meta/ the chunks are read
from its forward-index format\footnote{It took 12m 24s to generate the index.}
and for \textsc{scikit-learn} from \textsc{libsvm}-formatted text files.
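
As a sketch of the minibatch protocol used for the \textsc{scikit-learn} runs
(file names, the feature dimensionality, and the label set below are
illustrative placeholders), each chunk is loaded from \textsc{libsvm}-formatted
text and passed once through an out-of-core linear SGD learner:
\begin{verbatim}
# Sketch of the 4-chunk minibatch protocol for the scikit-learn runs.
# File names, feature count, and label set are illustrative placeholders.
from sklearn.datasets import load_svmlight_file
from sklearn.linear_model import SGDClassifier

CHUNKS = ["train.chunk0", "train.chunk1", "train.chunk2", "train.chunk3"]
N_FEATURES = 47236       # fix the dimensionality so all chunks align
ALL_CLASSES = [-1, 1]    # partial_fit needs the full label set up front

clf = SGDClassifier(loss="hinge")  # linear SVM objective
for path in CHUNKS:
    # Loading the libsvm-formatted chunk is included in the timing.
    X, y = load_svmlight_file(path, n_features=N_FEATURES)
    clf.partial_fit(X, y, classes=ALL_CLASSES)  # one SGD pass over the chunk

X_test, y_test = load_svmlight_file("test.libsvm", n_features=N_FEATURES)
print("accuracy:", clf.score(X_test, y_test))
\end{verbatim}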