- Project Description
- Features
- Requirements
- Installation
- Configuration
- Running the Application
- Security Considerations
- Scenarios Handled
- Contributions
This project is a movie search engine that allows users to search for movies based on various attributes such as title, actors, genre, and more. It leverages Elasticsearch for efficient search capabilities and Sentence Transformers for embedding movie descriptions. The application consists of three main components:
- Data Cleaning: Cleans and preprocesses the movie dataset.
- Embedding and Storing: Generates embeddings for the cleaned data and stores them in Elasticsearch.
- Search Application: Provides a user interface for searching movies.
- Data Cleaning: Handles missing values, converts data types, and cleans text fields.
- Embedding Generation: Uses Sentence Transformers to generate embeddings for movie descriptions.
- Elasticsearch Integration: Stores movie data and embeddings in Elasticsearch for fast and efficient search.
- Search Interface: A Streamlit-based web application for searching movies.
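As an illustration of the data cleaning feature listed above, here is a minimal sketch using Pandas. It is not the project's actual clean_data.py. The column names (Series_Title, Overview, Genre, Runtime, Gross) and the value formats ("142 min", "28,341,469") are assumptions about the layout of imdb_top_1000.csv, and clean_movies is a hypothetical helper name.

```python
import pandas as pd


def clean_movies(df: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy of the raw movie table (column names are assumed)."""
    df = df.copy()

    # Clean text fields: replace missing values with "" and trim whitespace.
    for col in ("Series_Title", "Overview", "Genre"):
        df[col] = df[col].fillna("").astype(str).str.strip()

    # Convert data types: "142 min" -> 142 (unparseable values become NaN).
    minutes = df["Runtime"].astype(str).str.extract(r"(\d+)")[0]
    df["Runtime"] = pd.to_numeric(minutes, errors="coerce")

    # Handle missing values: "28,341,469" -> 28341469, missing gross -> 0.
    gross = df["Gross"].astype(str).str.replace(",", "", regex=False)
    df["Gross"] = pd.to_numeric(gross, errors="coerce").fillna(0)

    return df
```

With a frame loaded via pd.read_csv("imdb_top_1000.csv"), the result could then be written out with to_csv("cleaned_dataset.csv", index=False).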
- Python 3.7 or higher
- Pandas
- Sentence Transformers
- Elasticsearch Python client
- Streamlit
- Clone the repository and change into the directory that git creates (it is named after the repository):
    git clone https://github.com/meggitt/ElasticSearch-Movie-Search-Engine.git
    cd ElasticSearch-Movie-Search-Engine
- Install the required packages:
    pip install -r requirements.txt
- Download the movie dataset: ensure the imdb_top_1000.csv file is in the root directory.
- Configure Elasticsearch:
  - Create an example.ini file with your Elasticsearch cloud ID and API keys (you can get a 14-day free trial at Elastic Cloud):
        [DEFAULT]
        cloud_id = "DeploymentCloudID"
        apikey_id = "API Key ID"
        apikey_key = "API Key"
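For reference, a configuration file of this shape can be read with Python's standard configparser module. This is a sketch only; how the project's own scripts load example.ini is not shown here, and load_elastic_settings is a hypothetical helper. Because the sample values are wrapped in double quotes, the sketch strips them, and interpolation is disabled so that a "%" in an API key is not misread.

```python
import configparser


def load_elastic_settings(ini_text: str) -> dict:
    """Parse the [DEFAULT] section and return the three connection settings."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    section = parser["DEFAULT"]
    # configparser keeps surrounding quotes as part of the value, so strip them.
    return {
        key: section[key].strip('"')
        for key in ("cloud_id", "apikey_id", "apikey_key")
    }


# Sample text mirroring the placeholder values from the configuration step.
SAMPLE_INI = """
[DEFAULT]
cloud_id = "DeploymentCloudID"
apikey_id = "API Key ID"
apikey_key = "API Key"
"""

settings = load_elastic_settings(SAMPLE_INI)
```

In a real script the text would come from open("example.ini").read(), and the values could then be passed to the Elasticsearch Python client, for example as Elasticsearch(cloud_id=settings["cloud_id"], api_key=(settings["apikey_id"], settings["apikey_key"])).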
- Run the data cleaning script:
    python clean_data.py
  This script cleans the dataset and saves it as cleaned_dataset.csv.
- Run the embedding and storing script:
    python embed_and_store.py
  This script generates embeddings for the cleaned data and stores them in Elasticsearch.
- Run the search application:
    streamlit run search_app.py
  Open the provided URL in your browser to access the search interface.
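To give a rough idea of what the embedding and storing step involves, the sketch below turns cleaned rows into Elasticsearch bulk actions, each carrying an embedding of the movie description. It is not the project's embed_and_store.py. The index name "movies", the "overview" and "embedding" field names, the model name "all-MiniLM-L6-v2", and both function names are assumptions chosen for illustration.

```python
from typing import Callable, Iterable, Iterator, List


def build_actions(
    movies: Iterable[dict],
    embed: Callable[[str], List[float]],
    index: str = "movies",  # assumed index name
) -> Iterator[dict]:
    """Yield one bulk-index action per movie, with its description embedded."""
    for movie in movies:
        source = dict(movie)
        # "overview" is the assumed name of the description field.
        source["embedding"] = embed(movie["overview"])
        yield {"_index": index, "_source": source}


def index_movies(movies: Iterable[dict], client) -> None:
    """Embed and store movies; imports are local so the helper above stays light."""
    from elasticsearch import helpers
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice
    # encode() returns a NumPy array; convert it to a plain list for JSON.
    helpers.bulk(client, build_actions(movies, lambda text: model.encode(text).tolist()))
```

Keeping the embedding function as a parameter means build_actions can be exercised with a stand-in function, without downloading a model or connecting to a cluster.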
- Ensure Elasticsearch is securely configured to prevent unauthorized access.
- Keep your API keys and sensitive information secure.
- Validate and sanitize user inputs to prevent injection attacks.
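One possible way to approach the input validation point above is to normalize the search text before it reaches Elasticsearch. The sketch below is an assumption about what such a step might look like, not the project's code. sanitize_query and the 200-character limit are hypothetical.

```python
import re

MAX_QUERY_LENGTH = 200  # assumed upper bound for a search box entry


def sanitize_query(raw: str) -> str:
    """Drop control characters, collapse whitespace, and cap the length."""
    # Keep printable characters and whitespace; discard everything else.
    text = "".join(ch for ch in raw if ch.isprintable() or ch.isspace())
    # Collapse runs of whitespace (including newlines) into single spaces.
    text = re.sub(r"\s+", " ", text).strip()
    return text[:MAX_QUERY_LENGTH]
```

Passing the cleaned text as a value inside a structured query body, rather than concatenating it into a query string, further limits what a user can inject.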
- Search by Title: Users can search for movies by entering the title.
- Search by Actor: Users can search for movies by entering an actor's name.
- Search by Genre: Users can search for movies by entering a genre.
- Search by Keywords: Users can search using any relevant keywords.
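The scenarios above could all be served by a single request that matches the user's text against several fields at once. The following sketch builds such a request body as a plain dictionary. The field names ("title", "actors", "genre", "overview", "embedding") and the function name are assumptions, and the actual search_app.py may construct its query differently.

```python
from typing import Optional, Sequence

# Assumed field names covering title, actor, genre, and keyword searches.
DEFAULT_FIELDS = ("title", "actors", "genre", "overview")


def build_search_body(
    text: str,
    fields: Sequence[str] = DEFAULT_FIELDS,
    query_vector: Optional[Sequence[float]] = None,
    size: int = 10,
) -> dict:
    """Build a search body: multi_match on text, plus optional kNN on embeddings."""
    body = {
        "size": size,
        "query": {"multi_match": {"query": text, "fields": list(fields)}},
    }
    if query_vector is not None:
        # Approximate nearest-neighbour search on the stored embedding field.
        body["knn"] = {
            "field": "embedding",
            "query_vector": list(query_vector),
            "k": size,
            "num_candidates": max(50, size * 5),
        }
    return body
```

For example, build_search_body("Christopher Nolan") yields a keyword-only request, while supplying query_vector (the embedding of the user's text) adds a semantic component.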
Contributions are welcome! Please create an issue or submit a pull request with your changes.