This project is an information retrieval system that searches and ranks documents in response to user queries. It combines preprocessing, indexing, and both traditional and neural NLP models to improve search accuracy and relevance.
- Project Overview
- Dataset
- Preprocessing
- Indexing
- Query Processing
- Query Expansion
- User Interface
- Evaluation
- Technologies Used
- Installation
- Usage
- Contributing
- License
This project utilizes the CISI dataset, which consists of:
- Documents: A file containing the documents to be searched.
- Queries: A file containing the queries to be processed.
- Qrels: A file containing relevance judgments for evaluating search accuracy.
The main objectives of this project are to:
- Preprocess the data by tokenizing, stemming, and removing stop words.
- Build an inverted index and posting list for efficient searching.
- Implement query processing, retrieval, and ranking using TF-IDF and BM25 models.
- Incorporate query expansion using the RM3 model to improve search accuracy.
- Provide a simple user interface to interact with the system.
The CISI dataset is used for testing and evaluating the system. It includes files for documents, queries, and relevance judgments.
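As a rough illustration of loading the documents file, the sketch below parses CISI-style records into a dictionary. It assumes the commonly distributed layout, in which each record starts with `.I <id>` and fields are introduced by `.T` (title), `.A` (author), `.W` (text), and `.X` (cross-references). The function name `parse_cisi` is hypothetical and not taken from this project.

```python
def parse_cisi(text):
    """Parse CISI-style records into {doc_id: {"title": str, "text": str}}."""
    raw = {}
    doc_id, field = None, None
    for line in text.splitlines():
        if line.startswith(".I "):
            # A new record begins; reset the current field.
            doc_id = int(line.split()[1])
            raw[doc_id] = {"T": [], "W": []}
            field = None
        elif doc_id is not None and line.strip() in {".T", ".A", ".B", ".W", ".X"}:
            # Field marker: later lines belong to this field.
            field = line.strip()[1]
            raw[doc_id].setdefault(field, [])
        elif doc_id is not None and field is not None:
            raw[doc_id][field].append(line.strip())
    # Keep only the title and body text; other fields are ignored here.
    return {
        d: {"title": " ".join(f["T"]).strip(), "text": " ".join(f["W"]).strip()}
        for d, f in raw.items()
    }


sample = (
    ".I 1\n.T\nIndexing Basics\n.A\nDoe, J.\n.W\n"
    "Inverted files speed up\nretrieval.\n.X\n1\t5\t1\n"
    ".I 2\n.W\nShort query text\n"
)
docs = parse_cisi(sample)
print(docs[1]["title"])
```

The same function can be pointed at the queries file, since query records use the same `.I` and `.W` markers under this assumption.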
Implemented using PyTerrier and NLTK, the preprocessing steps include:
- Tokenization and Stop Words Removal: Splitting text into tokens and removing common stop words.
- Stemming: Reducing words to their root form.
- Cleaning: Removing unnecessary characters.
All of these functions are wrapped in a single `process` function that can be applied directly to queries and documents.
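A minimal sketch of what such a `process` function might look like is shown below. It is not the project's actual code: to stay dependency-free it uses a tiny hard-coded stop-word set and a crude suffix stripper (`naive_stem`, a hypothetical helper) in place of NLTK's stop-word list and stemmer.

```python
import re

# Tiny illustrative stop-word list (an assumption); NLTK's English list is much larger.
STOP_WORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "the", "to", "with"}


def naive_stem(token):
    """Crude suffix stripper, used here only as a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def process(text, stem=naive_stem):
    """Clean, tokenize, remove stop words, and stem a query or document."""
    # Cleaning: lowercase and replace anything that is not a letter, digit, or space.
    cleaned = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    # Tokenization: split on whitespace.
    tokens = cleaned.split()
    # Stop-word removal followed by stemming.
    return [stem(t) for t in tokens if t not in STOP_WORDS]


print(process("Indexing the documents, and ranking queries!"))
```

With NLTK installed, passing `stem=PorterStemmer().stem` (from `nltk.stem`) would swap in a real Porter stemmer without changing the rest of the function.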
The system builds an inverted index and posting list for efficient document retrieval.
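To make the idea concrete, here is a small pure-Python sketch of an inverted index whose posting lists store `(doc_id, term_frequency)` pairs. It assumes documents are already tokenized; the function name `build_inverted_index` is illustrative, and the project itself relies on PyTerrier's indexer rather than code like this.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to a posting list of (doc_id, term_frequency), sorted by doc_id."""
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for tok in tokens:
            index[tok][doc_id] = index[tok].get(doc_id, 0) + 1
    return {term: sorted(postings.items()) for term, postings in index.items()}


docs = {
    1: ["index", "document", "index"],
    2: ["rank", "document"],
}
index = build_inverted_index(docs)
print(index["document"])
```

Looking up a query term then touches only the documents in its posting list instead of scanning the whole collection.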
- Takes a query from the dataset.
- Applies the `process` function for preprocessing.
- Passes the processed query to a TF-IDF model to retrieve and rank relevant documents by score.
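The ranking step can be sketched in plain Python as follows. This uses a textbook TF-IDF weighting, `(1 + ln tf) * ln(N / df)`, which is an assumption chosen for illustration; the weighting model used by PyTerrier may differ in detail. Inputs are assumed to be already preprocessed token lists, and `tfidf_rank` is a hypothetical name.

```python
import math
from collections import Counter


def tfidf_rank(query_tokens, docs):
    """Return [(doc_id, score), ...] sorted by descending TF-IDF score."""
    n_docs = len(docs)
    tfs = {d: Counter(tokens) for d, tokens in docs.items()}
    # Document frequency: number of documents containing each term.
    df = Counter(t for tf in tfs.values() for t in tf)
    scores = {}
    for d, tf in tfs.items():
        score = sum(
            (1 + math.log(tf[t])) * math.log(n_docs / df[t])
            for t in set(query_tokens)
            if tf[t] > 0
        )
        if score > 0:
            scores[d] = score
    # Ties are broken by doc_id so the output is deterministic.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


docs = {
    1: ["index", "document", "index"],
    2: ["rank", "document"],
    3: ["index", "rank", "rank"],
}
for doc_id, score in tfidf_rank(["index"], docs):
    print(doc_id, round(score, 3))
```

Documents that contain none of the query terms, or only terms that occur in every document, receive no score and are left out of the ranking.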
The system supports query expansion through:
- BM25 Model: Initial search to retrieve relevant documents.
- RM3 Expander: Expands the original query based on initial results.
The expanded query often improves retrieval accuracy, as shown by higher TF-IDF scores compared to the original query.
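The two-stage flow (BM25 retrieval, then RM3 expansion) can be sketched without any libraries. The version below is a simplification under stated assumptions: it weights feedback documents by their normalized BM25 scores rather than by query likelihood, and the parameter names `fb_docs`, `fb_terms`, and `lam` as well as their default values are illustrative. In the project itself this step is handled by PyTerrier.

```python
import math
from collections import Counter


def bm25_rank(query, docs, k1=1.2, b=0.75):
    """Rank tokenized docs against a tokenized query with BM25."""
    n = len(docs)
    avgdl = sum(len(tokens) for tokens in docs.values()) / n
    tfs = {d: Counter(tokens) for d, tokens in docs.items()}
    df = Counter(t for tf in tfs.values() for t in tf)
    scores = {}
    for d, tf in tfs.items():
        norm = k1 * (1 - b + b * len(docs[d]) / avgdl)  # length normalization
        s = 0.0
        for t in set(query):
            if tf[t]:
                idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
                s += idf * tf[t] * (k1 + 1) / (tf[t] + norm)
        if s > 0:
            scores[d] = s
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


def rm3_expand(query, docs, fb_docs=2, fb_terms=3, lam=0.5):
    """Return {term: weight} mixing the original query with feedback terms."""
    q_model = {t: c / len(query) for t, c in Counter(query).items()}
    top = bm25_rank(query, docs)[:fb_docs]
    if not top:
        return q_model  # nothing retrieved, so nothing to expand with
    total = sum(s for _, s in top)
    # Relevance model: P(w|R) = sum over feedback docs of P(w|d) * weight(d).
    rel = Counter()
    for d, s in top:
        for t, c in Counter(docs[d]).items():
            rel[t] += (c / len(docs[d])) * (s / total)
    best = sorted(rel.items(), key=lambda kv: (-kv[1], kv[0]))[:fb_terms]
    mass = sum(w for _, w in best)
    # Interpolate the original query model with the truncated relevance model.
    expanded = {t: lam * w for t, w in q_model.items()}
    for t, w in best:
        expanded[t] = expanded.get(t, 0.0) + (1 - lam) * w / mass
    return expanded


docs = {
    1: ["index", "retrieval", "search"],
    2: ["index", "retrieval", "rank"],
    3: ["cooking", "recipe", "pasta"],
}
print(rm3_expand(["index"], docs))
```

The expanded query keeps the original term with the largest weight and adds terms that co-occur with it in the top-ranked documents; terms from unrelated documents never enter the expansion because those documents are not retrieved in the first pass.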
A simple UI is developed using:
- Python, HTML, and CSS via Flask.
- Ngrok integration, so the app can run on Colab and be reached through a public URL.
Users can input queries, apply the RM3 expander, and use advanced models (ELMo, BERT) for improved results. The UI displays relevant documents, retrieval times, and document counts.
Evaluation is conducted using `pt.Experiment` to measure performance across different models.
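For readers unfamiliar with what such an evaluation computes, the sketch below implements mean average precision (MAP), one common retrieval metric, directly from ranked results and relevance judgments. Which metrics this project actually reports is not stated here, so MAP is an assumption, and the function names are illustrative.

```python
def average_precision(ranked_doc_ids, relevant):
    """Average of precision values at each rank where a relevant doc appears."""
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs, qrels):
    """runs: {query_id: [doc_id, ...]}, qrels: {query_id: {relevant doc_ids}}."""
    aps = [average_precision(runs.get(q, []), rel) for q, rel in qrels.items()]
    return sum(aps) / len(aps) if aps else 0.0


runs = {1: [3, 1, 2], 2: [5]}
qrels = {1: {1, 2}, 2: {5}}
print(round(mean_average_precision(runs, qrels), 4))
```

Comparing this number across systems (for example TF-IDF, BM25, and BM25 with RM3) on the same queries and qrels is the kind of side-by-side comparison an experiment table provides.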
- PyTerrier: For building and evaluating retrieval models.
- NLTK: For text preprocessing.
- TF-IDF and BM25: For traditional information retrieval.
- RM3: For query expansion.
- ELMo and BERT: For deep semantic understanding and contextual relevance.
- Flask and Ngrok: For a user-friendly interface on Colab.
To run this project locally, follow these steps:
- Clone the repository:
    git clone https://github.com/yourusername/repository-name.git
    cd repository-name
- Install required packages:
    pip install -r requirements.txt
- Run the Flask app:
    python app.py
- Start the Flask app by running `app.py`.
- Open the link provided by Ngrok to access the UI.
- Enter a query and select options for query expansion or model use (ELMo, BERT).
- View results, including retrieved documents, scores, retrieval times, and document counts.
Contributions are welcome! Please fork the repository, make changes, and submit a pull request.
This project is licensed under the MIT License.