final corrections part 2
EmanueleC committed Jul 3, 2019
1 parent 2a9a0f8 commit e0f7296
Showing 4 changed files with 43 additions and 21 deletions.
24 changes: 20 additions & 4 deletions res/sections/05-Abstract.tex
@@ -10,11 +10,22 @@ \section*{Abstract}

In recent years, they have been applied to Information Retrieval, leading to the birth of a ``hybrid'' discipline called \textit{Neural IR}.

Neural IR models have shown some improvements over traditional IR baselines on the task of document ranking: by the end of 2016, the Deep Relevance Matching Model (DRMM) \cite{drmm}, developed by Jiafeng Guo, Yixing Fan, Qingyao Ai and W. Bruce Croft, was one of the first Neural IR models to outperform BM25 and the Query Likelihood model.

Since then, Neural IR has been an emerging trend, opening the possibility of advancing the state-of-the-art, which makes it even more important to verify published results, so that future directions can be built on a solid foundation.

The aim of this work is to reproduce, repeat and evaluate DRMM on the text collection Robust04 \cite{rob04}, a dataset of ``difficult topics'' on which the state-of-the-art has reached a maximum of approximately $30.2\%$ Mean Average Precision.

Following the same methodology and using the same input data, I first reproduce the original experiment with a publicly available implementation and then repeat it with a personal implementation of the model.

The results obtained with my implementation of DRMM come close to the original ones in only two cases; this is explained by the fact that the paper by Guo et al. does not precisely describe all the steps and parameters needed to guarantee full reproducibility. For this reason, the repetition task was complex and required a lot of time to complete.

\bigskip

@@ -26,8 +26,13 @@ \section*{Sommario}

Only recently have they been applied to information retrieval, leading to the birth of a hybrid discipline called \textit{Neural IR}.

Neural IR models have shown improvements over the baselines given by traditional IR models - one of them being the ``Deep Relevance Matching Model'' (DRMM) \cite{drmm}. At the end of 2016, the DRMM model developed by Jiafeng Guo, Yixing Fan, Qingyao Ai and W. Bruce Croft was one of the first to beat the baselines (e.g., the BM25 and Query Likelihood models).

Since then, Neural IR has been a growing trend and has contributed to advancing the state-of-the-art, which makes it even more important to verify published results, so that research can proceed on a solid foundation.

The aim of this work is to repeat, reproduce and evaluate DRMM on the Robust04 collection \cite{rob04}, a dataset of topics on which it is difficult to obtain good performance. The state-of-the-art has reached a maximum of $30.2\%$ MAP.

Following the same methodology and the same data, I first reproduced the original experiment with a publicly available implementation and then repeated the experiment using a personal implementation of the model.

The results obtained with my implementation of DRMM come close to the original ones in only two cases; this is explained by the fact that the paper by Guo et al. does not precisely report all the steps taken and the parameters needed to guarantee full reproducibility. For this reason, the repetition task was complex and required a lot of time to complete.
14 changes: 10 additions & 4 deletions res/sections/07-introduction.tex
@@ -34,13 +34,19 @@ \chapter{Introduction}

Although some Neural IR models have indeed produced some improvements over the baselines (w.r.t. ad-hoc document retrieval), the consequences of their application to IR have not yet been completely understood.

There is a strong ongoing discussion on whether IR can benefit from neural networks or not. In fact, there are two main problems with these models: one regards efficiency (especially when a large collection of documents is considered), the other regards their ability to address the complexity of IR tasks, i.e. to learn patterns in query and document text that indicate \textit{relevance}, even when queries and documents use different vocabulary, and even when the patterns are task-specific or context-specific.

The first problem arises from the long time required by a Neural IR model to compute a similarity score between an (appropriately) learnt representation of a document and a query. In the case of a large corpus, this time becomes prohibitive.

The second problem is especially linked to the difficulty of learning from queries and documents when no large-scale supervised data is available (unsupervised learning is typically used to learn text representations, while supervised or semi-supervised learning is used to learn ``to match'').

Unfortunately, this is often the case: it is very expensive for a human to label a document as ``relevant'' or ``not relevant'' with respect to a certain information need (mostly because relevance is a multifaceted concept).

A couple of strategies have been applied to deal with these problems: to address the first, a re-ranking approach is often used, while for the second, pseudo relevance feedback is used.
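To make the re-ranking strategy concrete, here is a minimal sketch in Python: a cheap first-stage ranker supplies the top-$k$ candidates and the neural model re-scores only those. The \texttt{neural\_score} function is a hypothetical stand-in for any learnt matching model, not a reference to a specific implementation.

\begin{verbatim}
def rerank(query, first_stage_run, neural_score, k=1000):
    """Re-rank the top-k documents of a first-stage run with a
    (more expensive) neural matching model."""
    top_k = first_stage_run[:k]          # (doc_id, score) pairs
    rescored = [(doc_id, neural_score(query, doc_id))
                for doc_id, _ in top_k]
    # Order by the neural score, highest first
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)
\end{verbatim}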

2 changes: 1 addition & 1 deletion res/sections/08-adHocRetrieval.tex
@@ -65,7 +65,7 @@ \subsubsection{Parsing}

\subsubsection{Stopwords removal}

In language, some words occur more frequently than others. For instance, in English, ``and'' and ``the'' are the two most frequent words, accounting for about 10\% of all word occurrences \cite{croftIR}.

This was observed by Luhn in 1958: he proposed that the significance of a word depended on its frequency in the document.
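As an illustration of Luhn's observation, the following Python sketch counts term occurrences in a tokenised text, with and without a stopword list; the stopword set shown is a tiny illustrative subset, not the list used in the experiments.

\begin{verbatim}
from collections import Counter

STOPWORDS = {"and", "the", "of", "a", "to", "in"}  # illustrative subset

def term_frequencies(text):
    """Return term counts before and after stopword removal."""
    tokens = text.lower().split()
    all_counts = Counter(tokens)
    content_counts = Counter(t for t in tokens if t not in STOPWORDS)
    return all_counts, content_counts
\end{verbatim}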

24 changes: 12 additions & 12 deletions res/sections/17-drmm_impl.tex
@@ -22,7 +22,7 @@ \section{Preliminary Analysis}
\begin{itemize}
\item the corpus and queries can be stemmed or not stemmed, processed with or without stopwords, and different fields (or combinations of them) from the topics can be considered (e.g. title and description);
\item the preranked results can be obtained with a wide range of retrieval models. For instance, the authors of the original paper use BM25 and QL models as first-stage rankers.
\end{itemize}

\textbf{DRMM system}:
@@ -34,7 +34,7 @@ \section{Preliminary Analysis}
\end{itemize}

After considering the above options, I chose the following configuration for my tests:
both corpus text and query titles stemmed and without stopwords; the top 2000 preranked results with both the QL and BM25 retrieval algorithms; embedding size 300; and training/testing on all histogram modes and the term gating network.

Although a few hyperparameters of the model were given in \cite{drmm}, I had to tune all the others empirically (e.g. the number of epochs, the number of documents to sample from the training set, the initial learning rate and the early stopping values).
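For reference, the configuration just described can be summarised as a small Python dictionary. This is only a sketch of my setup: the key names are my own, and the values marked as tuned were chosen empirically rather than taken from the paper.

\begin{verbatim}
CONFIG = {
    "stemming": True,               # corpus text and query titles
    "stopword_removal": True,
    "first_stage": ["QL", "BM25"],  # Terrier pre-rankers
    "rerank_depth": 2000,           # top preranked results kept
    "embedding_size": 300,          # Word2Vec dimensionality
    "histogram_modes": ["ch", "nh", "lch"],
    "term_gating": "idf",
    # Tuned empirically (not given in the paper):
    "epochs": ...,
    "training_samples_per_query": ...,
    "initial_learning_rate": ...,
    "early_stopping": ...,
}
\end{verbatim}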

@@ -196,7 +196,7 @@ \section{Dataset analysis}

\begin{itemize}
\item Parsing of collection and topics;
\item Indexing of parsed collection and queries with Terrier Dirichlet QL and BM25 algorithms (to obtain preranked data);
\item Stemming and stopwords removal;
\item Word-embedding preparation for both the collection and the queries with Word2Vec (although query-based embeddings were ignored) - this is where the out-of-vocabulary problem originates (see the sketch below);
\item Pre-computed IDF values for each query term (input to query term gating);
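The word-embedding step in the list above can be sketched as follows (assuming gensim 4 and a pre-tokenised corpus; apart from the embedding size of 300 used in my tests, the hyperparameters shown are illustrative).

\begin{verbatim}
from gensim.models import Word2Vec

def train_embeddings(tokenised_docs):
    """Train 300-dimensional word embeddings on the corpus.
    Query terms missing from the resulting vocabulary are the
    out-of-vocabulary cases mentioned above."""
    model = Word2Vec(
        sentences=tokenised_docs,  # iterable of token lists
        vector_size=300,           # embedding size used in my tests
        min_count=5,               # illustrative choice
        workers=4,
    )
    return model.wv
\end{verbatim}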
@@ -476,9 +476,9 @@ \section{Evaluation and metrics}
Galago DirichletLM($\mu=1500$) & 0.206 & 0.390 & 0.326 \\
Galago BM25($k_1 = 1.2, k_3 = 1.0, b=0.75$) & 0.193 & 0.377 & 0.319 \\
Terrier DirichletLM($\mu=2500$) & 0.241 & 0.404 & 0.343 \\
Terrier BM25($k_1 = 1.2, k_3 = 8d, b = 0.75d$) & 0.247 & 0.417 & 0.359 \\
Original QL(\textit{unknown parameters}) & 0.253 & 0.415 & 0.369 \\
Original Bm25(\textit{unknown parameters}) & 0.255 & 0.418 & 0.370
Original BM25(\textit{unknown parameters}) & 0.255 & 0.418 & 0.370
\end{tabular}
\caption{Evaluation of preranked results}
\end{table}
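For reference, the Mean Average Precision reported here and throughout this work is the mean, over the query set $Q$, of each query's average precision:

\[
\mathrm{MAP} = \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{|R_q|} \sum_{k=1}^{n} P_q(k) \cdot \mathrm{rel}_q(k)
\]

where $R_q$ is the set of documents relevant to query $q$, $n$ is the length of the ranked list returned for $q$, $P_q(k)$ is the precision at rank $k$ and $\mathrm{rel}_q(k)$ equals 1 if the document at rank $k$ is relevant and 0 otherwise.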
@@ -487,14 +487,14 @@

As Guo pointed out in an issue in MatchZoo\footnote{\url{https://github.com/NTMC-Community/MatchZoo/issues/604}}, the Robust04 dataset suffers from a data imbalance problem: the number of labelled documents differs significantly from one topic to another. As a result, the model could be dominated by a specific topic.
Figure \ref{fig:queries_frequencies_bm} shows the (logarithmically scaled) distribution of documents retrieved by BM25 per topic. It can be noticed that the distribution of positive and negative examples also differs across topics: for instance, the QL algorithm retrieves, on average, only 3.68\% positive samples per topic (with some topics, e.g. topic 672, having 0 positive samples), while BM25 retrieves on average 3.70\% positive samples per topic. Since the distributions of the runs obtained with these two algorithms are very similar, only the one relative to BM25 is shown.
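A minimal sketch of how such a per-topic distribution can be computed from a run and the corresponding relevance judgements (the data structures and names are my own, not a fixed file format):

\begin{verbatim}
def positive_ratio_per_topic(run, qrels):
    """run: dict mapping topic -> list of retrieved doc ids;
    qrels: dict mapping (topic, doc_id) -> relevance label.
    Returns, per topic, the fraction of retrieved documents
    that are labelled relevant (positive)."""
    ratios = {}
    for topic, docs in run.items():
        positives = sum(1 for d in docs if qrels.get((topic, d), 0) > 0)
        ratios[topic] = positives / len(docs) if docs else 0.0
    return ratios
\end{verbatim}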

\begin{figure}[H]
\centering
\includegraphics[width=0.7\textwidth]{queries_frequenciesBm25.png}
\caption{Queries imbalance problem - BM25 algorithm}
\label{fig:queries_frequencies_bm}
\end{figure}

@@ -556,7 +556,7 @@ \subsection{Term gating with IDF}

The following tables report the results of my implementation of DRMM. Each set of tables was obtained w.r.t. one of the three histogram modes (ch, nh, lch), sketched below. Then, each table in a set shows results for a different number of (balanced) positive (relevant)/negative (non-relevant) samples per query used for training.
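As a reminder of how the three modes differ, here is a minimal NumPy sketch of the matching-histogram computation described in \cite{drmm}: the cosine similarities between one query term and all document terms are bucketed into fixed bins over $[-1, 1]$, and the bin counts are kept raw (ch), normalised (nh) or log-scaled (lch). The code is my own rendering of that description (the $+1$ inside the logarithm, which avoids $\log 0$, is an implementation choice).

\begin{verbatim}
import numpy as np

def matching_histogram(similarities, bins=30, mode="lch"):
    """Bucket query-term/document-term cosine similarities
    (values in [-1, 1]) into a fixed number of bins.
    ch = raw counts, nh = counts normalised to sum to 1,
    lch = logarithm of the counts."""
    counts, _ = np.histogram(similarities, bins=bins, range=(-1.0, 1.0))
    counts = counts.astype(float)
    if mode == "ch":
        return counts
    if mode == "nh":
        total = counts.sum()
        return counts / total if total > 0 else counts
    if mode == "lch":
        return np.log(counts + 1.0)
    raise ValueError(f"unknown mode: {mode}")
\end{verbatim}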

The run to re-rank is given by the pre-ranked results from Terrier BM25.

\begin{adjustbox}{center, tabular=ccc, caption={DRMM runs (count-based histograms), IDF weighting, stemmed with stopwords removal}, nofloat=table}
\centering
@@ -869,7 +869,7 @@ \subsection{Different pre-ranking results}
\end{tabular}
\end{adjustbox}

Since the Terrier Dirichlet LM run had a lower MAP and P@20 than the BM25 one, these results are lower than the previous ones. This confirms that a Neural IR system operating in a re-ranking strategy is very dependent on the previous retrieval system(s).

\section{Reproduction of the experiment}

@@ -958,4 +958,4 @@ \section{Software used}
\item Trec eval 9.0.4.
\end{itemize}
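A typical way to score a run with trec\_eval, as done for the tables above (the file names are illustrative; the measure names are standard trec\_eval ones):

\begin{verbatim}
import subprocess

# File names are illustrative; trec_eval must be on the PATH.
result = subprocess.run(
    ["trec_eval", "-m", "map", "-m", "P.20", "-m", "ndcg_cut.20",
     "qrels.robust04.txt", "drmm_run.txt"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)  # one line per requested measure
\end{verbatim}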

My implementation of DRMM is available on GitHub: \url{https://github.com/EmanueleC/DRMM_repro}.
