Advanced Information Retrieval System

This project is an information retrieval system that searches and ranks documents in response to user queries. It combines text preprocessing, inverted indexing, traditional ranking models (TF-IDF, BM25), and neural language models (ELMo, BERT) to improve search accuracy and relevance.

Project Overview

This project utilizes the CISI dataset, which consists of:

Documents: A file containing the documents to be searched.
Queries: A file containing the queries to be processed.
Qrels: A file containing relevance judgments for evaluating search accuracy.

The main objectives of this project are to:

Preprocess the data by tokenizing, stemming, and removing stop words.
Build an inverted index and posting list for efficient searching.
Implement query processing, retrieval, and ranking using TF-IDF and BM25 models.
Incorporate query expansion using the RM3 model to improve search accuracy.
Provide a simple user interface to interact with the system.

Dataset

The CISI dataset is used for testing and evaluating the system. It includes files for documents, queries, and relevance judgments.
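
As a rough sketch of how the documents file might be read: commonly distributed copies of CISI use SMART-style markers, with `.I <id>` opening a record and markers such as `.T` (title), `.A` (author), `.W` (text), and `.X` (cross-references) on their own lines. That layout, the function name `parse_cisi`, and the toy records below are assumptions for illustration; this README does not specify them.

```python
def parse_cisi(text):
    """Parse SMART-style records into {doc_id: {field: text}}.

    Assumes each record starts with '.I <id>' and that field markers
    ('.T', '.A', '.W', '.X') stand alone on their own line.
    """
    records = {}
    current_id = None
    current_field = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith(".I "):
            current_id = int(stripped.split()[1])
            records[current_id] = {}
            current_field = None
        elif len(stripped) == 2 and stripped[0] == "." and stripped[1].isupper():
            current_field = stripped[1]
        elif current_id is not None and current_field is not None and stripped:
            records[current_id].setdefault(current_field, []).append(stripped)
    # Join the collected lines of each field into one string.
    return {
        doc_id: {field: " ".join(lines) for field, lines in fields.items()}
        for doc_id, fields in records.items()
    }


# Toy records in the assumed layout (not real CISI entries).
sample = """.I 1
.T
Toy title about library classification
.A
Doe, J.
.W
First line of the toy abstract.
Second line of the toy abstract.
.I 2
.T
Another toy title
.W
A one-line toy abstract.
"""

docs = parse_cisi(sample)
print(docs[1]["T"])
print(docs[2]["W"])
```

The same function could be pointed at the queries file if it follows the same markers; the relevance judgments would need a separate, simpler reader.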

Preprocessing

Implemented using PyTerrier and NLTK, the preprocessing steps include:

Tokenization and Stop Words Removal: Splitting text into tokens and removing common stop words.
Stemming: Reducing words to their root form.
Cleaning: Removing unnecessary characters.

All of these functions are wrapped in a single process function that can be applied directly to queries and documents.
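
The steps above can be sketched with the standard library alone. This is not the project's code: the stop word list is a tiny illustrative subset, and `stem` is a crude suffix stripper standing in for a real stemmer such as NLTK's Porter stemmer. All function names here are assumptions, apart from `process`, which mirrors the wrapper described above.

```python
import re

# Tiny illustrative stop word list; NLTK's English list is much longer.
STOP_WORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "on", "the", "to"}


def clean(text):
    """Lowercase the text and replace non-alphanumeric characters with spaces."""
    return re.sub(r"[^a-z0-9\s]", " ", text.lower())


def tokenize(text):
    """Split cleaned text on whitespace."""
    return text.split()


def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]


def stem(token):
    """Crude suffix stripping; a stand-in for a real stemmer, not Porter itself."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def process(text):
    """Clean, tokenize, drop stop words, then stem. Works for queries and documents."""
    tokens = remove_stop_words(tokenize(clean(text)))
    return [stem(t) for t in tokens]


print(process("Indexing the documents, and ranking queries!"))
# ['index', 'document', 'rank', 'queri']
```

Applying one `process` function to both documents and queries keeps the two vocabularies aligned, so a stemmed query term can match a stemmed index term.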

Indexing

The system builds an inverted index and posting list for efficient document retrieval.
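
A minimal sketch of the data structure, assuming documents have already been preprocessed into token lists. PyTerrier builds and stores its index differently on disk; the dictionary below only illustrates the idea of a term pointing to a posting list, and `build_inverted_index` is a hypothetical name.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to a posting list of (doc_id, term_frequency) pairs.

    `docs` maps a document id to its list of preprocessed tokens.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for doc_id, tokens in docs.items():
        for token in tokens:
            counts[token][doc_id] += 1
    # Keep each posting list sorted by document id.
    return {term: sorted(postings.items()) for term, postings in counts.items()}


# Toy collection of already-stemmed documents.
docs = {
    1: ["inform", "retriev", "system"],
    2: ["retriev", "model", "retriev"],
    3: ["librari", "system"],
}
index = build_inverted_index(docs)
print(index["retriev"])  # [(1, 1), (2, 2)]
```

Looking up a query term then touches only the documents in its posting list instead of scanning the whole collection.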

Query Processing

Takes a query from the dataset.
Applies the process function for preprocessing.
Passes the processed query to a TF-IDF model to retrieve and rank relevant documents by score.
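
To illustrate the ranking step, here is a sketch that scores documents with raw term frequency times `log(N / df)`. This is one common textbook variant and an assumption on our part; the TF-IDF weighting model used through PyTerrier differs in its details. The function name and toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_rank(query_tokens, docs):
    """Rank documents by the sum of tf * idf over the query terms.

    Uses raw term frequency and idf = log(N / df).
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))

    scores = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        score = sum(
            tf[term] * math.log(n_docs / doc_freq[term])
            for term in query_tokens
            if term in tf
        )
        if score > 0:
            scores[doc_id] = score
    # Highest score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


docs = {
    1: ["inform", "retriev", "system"],
    2: ["retriev", "model", "retriev"],
    3: ["librari", "system"],
}
for doc_id, score in tfidf_rank(["retriev", "model"], docs):
    print(doc_id, round(score, 3))
```

Document 2 ranks first here because it contains both query terms and repeats one of them; document 3 shares no term with the query and is left out.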

Query Expansion

The system supports query expansion through:

BM25 Model: Initial search to retrieve relevant documents.
RM3 Expander: Expands the original query based on initial results.

The expanded query often improves retrieval accuracy; in this project that shows up as higher TF-IDF scores for the expanded query than for the original one.
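
The two stages can be sketched as follows. This is a simplified illustration under stated assumptions, not the PyTerrier implementation: `bm25_rank` uses one common BM25 formulation, `rm3_expand` approximates the probability of a feedback document by its normalised retrieval score, and the parameter names (`fb_docs`, `fb_terms`, `lam`) and default values are hypothetical.

```python
import math
from collections import Counter


def bm25_rank(query_tokens, docs, k1=1.2, b=0.75):
    """Score documents with one common BM25 formulation."""
    n_docs = len(docs)
    avg_len = sum(len(tokens) for tokens in docs.values()) / n_docs
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))
    scores = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        score = 0.0
        for term in set(query_tokens):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        if score > 0:
            scores[doc_id] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def rm3_expand(query_tokens, docs, ranking, fb_docs=2, fb_terms=3, lam=0.5):
    """Return {term: weight}, mixing the original query with feedback terms."""
    top = ranking[:fb_docs]
    total_score = sum(score for _, score in top)
    # Relevance model: sum over feedback docs of P(term | doc) * P(doc | query),
    # with P(doc | query) approximated by the normalised retrieval score.
    relevance = Counter()
    for doc_id, score in top:
        tokens = docs[doc_id]
        for term, count in Counter(tokens).items():
            relevance[term] += (count / len(tokens)) * (score / total_score)
    best = dict(relevance.most_common(fb_terms))
    norm = sum(best.values())
    feedback = {term: weight / norm for term, weight in best.items()}
    # Original query model: relative frequency of each query term.
    original = {t: c / len(query_tokens) for t, c in Counter(query_tokens).items()}
    # RM3 step: interpolate the original query with the feedback model.
    terms = set(original) | set(feedback)
    return {t: lam * original.get(t, 0.0) + (1 - lam) * feedback.get(t, 0.0) for t in terms}


docs = {
    1: ["inform", "retriev", "system"],
    2: ["retriev", "model", "rank"],
    3: ["librari", "catalog"],
}
query = ["retriev"]
ranking = bm25_rank(query, docs)
expanded = rm3_expand(query, docs, ranking)
for term, weight in sorted(expanded.items(), key=lambda item: -item[1]):
    print(term, round(weight, 3))
```

The interpolation weight `lam` keeps the original query terms dominant, so the added terms refine the query rather than replace it. The weighted terms would then be issued as a second query against the index.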

User Interface

A simple UI is developed using:

Flask, with Python for the backend and HTML and CSS for the pages.
Ngrok tunneling, so the app running on Colab can be reached through a public link.

Users can enter a query, optionally apply the RM3 expander, and use the neural models (ELMo, BERT) for improved results. The UI displays the relevant documents, the retrieval time, and the number of documents returned.

Evaluation

Evaluation uses PyTerrier's pt.Experiment to compare the performance of the different models against the relevance judgments (qrels).
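
To make concrete what such an evaluation measures, the sketch below computes precision at k and mean average precision by hand. It is only an illustration of the measures, not a replacement for pt.Experiment; the function names, query ids, and document ids are made up.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k results that are relevant."""
    top = ranked_ids[:k]
    return sum(1 for doc_id in top if doc_id in relevant_ids) / k


def average_precision(ranked_ids, relevant_ids):
    """Mean of the precision values at each rank where a relevant doc appears."""
    if not relevant_ids:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            total += hits / rank
    return total / len(relevant_ids)


def mean_average_precision(runs, qrels):
    """runs: {qid: ranked doc ids}; qrels: {qid: set of relevant doc ids}."""
    values = [average_precision(runs.get(qid, []), rel) for qid, rel in qrels.items()]
    return sum(values) / len(values)


# Toy judgments and toy system output.
qrels = {1: {10, 12}, 2: {7}}
runs = {1: [10, 11, 12], 2: [3, 7]}

print(precision_at_k(runs[1], qrels[1], 2))
print(mean_average_precision(runs, qrels))
```

Running the same measures over the output of each model (TF-IDF, BM25, BM25 with RM3) gives directly comparable numbers per model.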

Technologies Used

PyTerrier: For building and evaluating retrieval models.
NLTK: For text preprocessing.
TF-IDF and BM25: For traditional information retrieval.
RM3: For query expansion.
ELMo and BERT: For deep semantic understanding and contextual relevance.
Flask and Ngrok: For a user-friendly interface on Colab.

Installation

To run this project locally, follow these steps:

Clone the repository:

```
git clone https://github.com/yourusername/repository-name.git
cd repository-name
```

Install required packages:
```
pip install -r requirements.txt
```
Run the Flask app:
```
python app.py
```

Usage

Start the Flask app by running app.py.
Open the link provided by Ngrok to access the UI.
Enter a query and select options for query expansion or model use (ELMo, BERT).
View results, including retrieved documents, scores, retrieval times, and document counts.

Contributing

Contributions are welcome! Please fork the repository, make changes, and submit a pull request.

License

This project is licensed under the MIT License.
