MEIC-PRI-2022-23

This project explores information search systems and their use over large amounts of data. To put this knowledge into practice, we were tasked with developing a search system, covering data collection and preparation as well as information processing and retrieval. The project is divided into three milestones:

  1. Data Preparation
  2. Information Retrieval
  3. Building the final Search System

Milestone #1: Data Preparation

The first milestone is achieved with the preparation and characterisation of the datasets selected for the project. The datasets are the foundation for the project and the goal of the first task is to prepare and explore them. This task is heavily dependent on the datasets, which may require some extraction actions such as crawling or scraping.

Work on these tasks depends on the nature, volume, organization and accessibility of the selected datasets. The expected outcome of this milestone is a well-documented and reproducible data processing pipeline. Groups must include the Makefile in the milestone submission.

The following list gives a sample of the required actions:

  • Search repositories for datasets;
  • Select convenient data subsets;
  • Assess the authority of the data source and data quality;
  • Perform exploratory data analysis;
  • Prepare and document a data processing pipeline;
  • Characterize the datasets, identifying and describing some of their properties;
  • Identify the conceptual model for the data domain;
  • Identify follow-up information needs in the data domain.
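As an illustration of the characterisation step, the sketch below profiles a small CSV dataset by counting rows, missing values per column, and distinct values per column. It is only a minimal example using the Python standard library: the `characterize` function, the `SAMPLE_CSV` data and its column names are hypothetical and are not part of the project's actual pipeline.

```python
import csv
import io
from collections import Counter


def characterize(rows):
    """Return row count, missing-value counts and distinct-value counts per column."""
    missing = Counter()
    distinct = {}
    total = 0
    for row in rows:
        total += 1
        for column, value in row.items():
            value = (value or "").strip()
            if not value:
                # Empty or absent cells are counted as missing.
                missing[column] += 1
            else:
                distinct.setdefault(column, set()).add(value)
    return {
        "rows": total,
        "missing": dict(missing),
        "distinct": {column: len(values) for column, values in distinct.items()},
    }


# Hypothetical sample data, used only to make the example self-contained.
SAMPLE_CSV = """title,year,genre
Alpha,2001,drama
Beta,,comedy
Gamma,2001,
"""

profile = characterize(csv.DictReader(io.StringIO(SAMPLE_CSV)))
print(profile)
```

In a real pipeline the same function could be fed a `csv.DictReader` opened on a dataset file, and its output recorded alongside the other pipeline documentation.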

Milestone #2: Information Retrieval

The second milestone is achieved with the implementation and use of an information retrieval tool on the project datasets and its exploration with free-text queries.

This task makes use of state-of-the-art retrieval tools and involves viewing the datasets as collections of documents, identifying a document model for indexing, and designing queries to run against the indexed information.
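To make the idea of a document model and an index concrete, here is a toy inverted index in plain Python. It is a sketch under simplifying assumptions, not how Solr, Elasticsearch or the other tools work internally: real tools add configurable text analysis, ranking and much more. The documents, field names and the `tokenize`, `build_index` and `search` helpers are all hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, fields in documents.items():
        # Every field of the document model is indexed here.
        for text in fields.values():
            for term in tokenize(text):
                index[term].add(doc_id)
    return index


def search(index, query):
    """Boolean AND query: ids of documents containing every query term."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical documents with two indexable components each.
DOCS = {
    "d1": {"title": "Search engines", "body": "Indexing and retrieval of documents"},
    "d2": {"title": "Data preparation", "body": "Cleaning documents before indexing"},
    "d3": {"title": "Evaluation", "body": "Measuring retrieval quality"},
}

index = build_index(DOCS)
print(sorted(search(index, "indexing documents")))
print(sorted(search(index, "retrieval")))
```

Choosing which fields to pass to `build_index` corresponds to deciding which components of a document are indexable.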

Also included in this milestone is a brief description of the ideas to explore in the development of the final search system, i.e. Milestone #3.

The following list gives a sample of the required actions:

  • Choose the information retrieval tool (Solr, Elasticsearch, Lucene, Terrier, …);
  • Analyze the documents and identify their indexable components;
  • Use the selected tool to build the indexes;
  • Use the selected tool to configure and execute the queries;
  • Demonstrate the indexing and retrieval processes;
  • Manually evaluate the returned results;
  • Evaluate the results obtained for the defined information needs.
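The evaluation steps above can be supported by standard retrieval metrics. The sketch below computes precision at k and average precision for one information need, given a ranked result list and a set of documents judged relevant by hand. The document ids and relevance judgments are made up for illustration, and average precision is normalised here by the total number of relevant documents, which is one common convention.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top = ranked[:k]
    return sum(1 for doc_id in top if doc_id in relevant) / k


def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant result appears."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return total / len(relevant)


# Hypothetical ranked results and manual relevance judgments.
ranked = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2", "d5"}

print(precision_at_k(ranked, relevant, 5))
print(average_precision(ranked, relevant))
```

Averaging `average_precision` over all defined information needs gives a single score that can be used to compare different system configurations.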

Milestone #3: Search System

The third milestone is achieved with the development of the final version of the search system. This version is an improvement over the previous milestone, making use of features and techniques with the goal of improving the quality of the search results.

For this milestone, each group is expected to explore innovative approaches and ideas; the choice of approach will depend heavily on the group's context and data. An extended evaluation of the results and a comparison with the previous version of the search system are also expected.

As examples of topics to explore, groups may: incorporate new information retrieval algorithms; expand the information available for each document by adding and linking new datasets; work on user interfaces by developing a frontend for the search system; etc.


Group members:

  • Fabio Huang - up201806829

  • Lisa Sonck - up202202272