Lucene is the most widely used information retrieval toolkit in the world and has emerged as the de facto platform in industry, especially via other components in its ecosystem such as Solr and Elasticsearch. However, unlike open-source academic information retrieval systems (e.g., Indri and Terrier), Lucene has been less focused on evaluation, particularly evaluation with standard IR test collections. As a result, Lucene is sometimes viewed as less suitable for research. We wish to change this.
This workshop aims to develop Lucene as a platform for information access and retrieval research. We believe that the adoption of Lucene by IR researchers would bring numerous benefits, including greater reproducibility and easier dissemination of research results to the large community of Lucene users. The purpose of this SIGIR 2017 Workshop is to bring together the community of researchers, practitioners, and developers to realize this vision.
Lucene for Information Access and Retrieval Research (LIARR) is not a traditional "mini conference"-style workshop with a call for papers, submissions reviewed by a program committee, and presentations at the event. Instead, it is designed as a hackathon in which attendees work with Lucene hands-on. Presentations are meant only as a tool for structuring and guiding the efforts of attendees. Hence the workshop motto: less yakking, more hacking.
The goals of this workshop are to:
- create a development plan and common codebase for IR research with Lucene,
- implement various information retrieval methods in Lucene/Solr/Elasticsearch,
- evaluate the quality of such methods and models.
The aim is to take the state of the art in the IR field and provide prototype implementations. We will focus on:
- exposing the standard functions that researchers need access to when coding up a retrieval model;
- implementing some of the core retrieval functions in Lucene;
- explaining how some of these functions are implemented in Lucene and how they deviate from the way they are known in IR;
- providing researchers and developers with a roadmap and a set of guidelines on which models, algorithms, and techniques the community should add to Lucene next, and how this should be done.
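As one illustration of how a Lucene implementation can deviate from the textbook definition, consider the IDF component of BM25. The sketch below is only an example under stated assumptions: it compares the classic Robertson-Sparck Jones form, which can go negative for very common terms, with the variant that, to the best of our understanding, Lucene's `BM25Similarity` uses, which adds 1 inside the logarithm so the value stays positive. The function names `idf_textbook` and `idf_lucene_style` are hypothetical and are not part of any Lucene API.

```python
import math


def idf_textbook(num_docs: int, doc_freq: int) -> float:
    """Classic Robertson-Sparck Jones IDF, as usually given in IR textbooks.

    Becomes negative when a term occurs in more than half of the documents.
    """
    return math.log((num_docs - doc_freq + 0.5) / (doc_freq + 0.5))


def idf_lucene_style(num_docs: int, doc_freq: int) -> float:
    """IDF with 1 added inside the logarithm (assumed Lucene BM25 variant).

    Always positive, so a matching term never lowers a document's score.
    """
    return math.log(1.0 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))


if __name__ == "__main__":
    num_docs = 1000  # illustrative collection size, not taken from any test collection
    for doc_freq in (1, 100, 500, 900):
        print(
            f"df={doc_freq:4d}  "
            f"textbook={idf_textbook(num_docs, doc_freq):+.4f}  "
            f"lucene-style={idf_lucene_style(num_docs, doc_freq):+.4f}"
        )
```

Differences of this kind matter when comparing results across systems: two implementations that are both labelled BM25 may rank documents differently for queries containing very frequent terms.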
The workshop is a full-day event held on the SIGIR workshop day (August 11) and is organised as follows:
- Session One (morning): Introduction talks, followed by pitches of ideas for teams to work on (ideas will also be collected through a pre-workshop online discussion). Scheduled introduction talks:
  - Lucene4IR: Leif Azzopardi will explain the code base of the previous hackathon, showing how to run and evaluate a standard IR batch experiment, and then explain how to hack and mod the toolkit.
  - Anserini: Jimmy Lin will give an overview of Anserini, which provides a range of applications for indexing and retrieval using Lucene.
- Session Two (morning): Attendees break up into teams to work on the different ideas. In parallel, breakout groups will provide training, explaining how the core works and how to mod/hack Lucene for the purpose of running TREC-style research experiments:
  - Lucene and Solr Innards: Grant Ingersoll, from Lucidworks, will explain the inner loop of Lucene, describing the key innards of Lucene (similarities, codecs, queries) and Solr (components, LTR, parsers) and how they can be extended for research purposes.
  - Elastic4IR: Guido Zuccon will explain how Elasticsearch can be used for IR experimentation, outlining where current retrieval methods in Elasticsearch depart from how they are defined and understood in IR.
- Session Three (afternoon): A quick report on progress, after which teams continue hacking.
- Session Four (afternoon): Finalising and evaluating the implemented methods, followed by a summary of progress from each team and a plenary to discuss future directions of work and activities.
A series of sponsored prizes will be awarded.