This repository is a proof of concept for a semantic vector search engine for Digikala products. It combines Elasticsearch, an open-source search and analytics engine, with sentence-transformers, a Python framework for sentence, text, and image embeddings.
The search engine converts each product title into a 1024-dimensional vector using the intfloat/multilingual-e5-large transformer model.
These vectors are then indexed in Elasticsearch. When a search query is made, it is converted into a vector with the same model, and the k nearest vectors in the index, ranked by cosine similarity (k-nearest neighbors, KNN), are returned as the search results.
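The retrieval step described above can be illustrated with a minimal brute-force sketch. It is not the repository's implementation: Elasticsearch performs the nearest-neighbor lookup in the real engine, and the vectors come from the embedding model. The 3-dimensional toy vectors, the product ids, and the function names cosine_similarity and knn_search are assumptions made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def knn_search(query_vec, index, k=2):
    """Return the k (doc_id, score) pairs most similar to query_vec."""
    scored = [
        (doc_id, cosine_similarity(query_vec, vec))
        for doc_id, vec in index.items()
    ]
    # Highest cosine similarity first
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional "embeddings" (hypothetical); real ones have 1024 dimensions.
toy_index = {
    "phone_case": [0.9, 0.1, 0.0],
    "phone_charger": [0.8, 0.3, 0.1],
    "running_shoes": [0.0, 0.2, 0.9],
}

results = knn_search([1.0, 0.2, 0.0], toy_index, k=2)
for doc_id, score in results:
    print(f"{doc_id}: {score:.3f}")
```

A brute-force scan like this compares the query against every stored vector, which is why a dedicated vector index is used once the number of products grows.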
- Clone the repository
git clone https://github.com/ArmanJR/Digikala-Vector-Search
cd Digikala-Vector-Search
- Create virtual environment and install dependencies
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
- Open indexData.ipynb and set environment variables:
  - ELASTIC_ENDPOINT: Local or remote Elasticsearch endpoint (free trial cluster available on Elastic Cloud)
  - ELASTIC_USERNAME: Username for the Elasticsearch cluster
  - ELASTIC_PASSWORD: Password for the Elasticsearch cluster
  - ELASTIC_INDEX: Index name for the products
  - DIGIKALA_DATASET_PATH: Path to the products dataset, which is not included in the git repository. Download it from Kaggle: https://www.kaggle.com/datasets/radeai/digikala-comments-and-products
  - CUSTOM_DATASET_PATH: Path to a custom dataset containing examples and edge cases to merge with the original dataset. It should match the format of the Digikala dataset (id,title_fa,Rate,Rate_cnt,Category1,Category2,Brand,Price,Seller,Is_Fake,min_price_last_month,sub_category)
  - SAMPLE_COUNT: Number of samples to index from the full dataset (Warning: Since the vector size is 1024, setting this to a high number will consume a lot of memory and time. Keep it low for testing purposes)
  - RANDOM_STATE: Random seed for sampling the dataset
- Run the notebook indexData.ipynb step by step to index the data
- Open searchApp.py and set environment variables
- Run the search app
streamlit run searchApp.py
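As a rough sketch of how the configuration above could be consumed, the snippet below reads the Elasticsearch variables from the environment and builds an index mapping and a kNN query body as plain dictionaries. The names Settings, load_settings, build_index_mappings, build_knn_query, and the field name title_vector are hypothetical and not taken from the repository. The dictionaries assume the Elasticsearch 8.x dense_vector field type and knn search option; whether the notebook and the app use exactly these is an assumption.

```python
import os
from dataclasses import dataclass

EMBEDDING_DIMS = 1024  # vector size stated in this README


@dataclass(frozen=True)
class Settings:
    endpoint: str
    username: str
    password: str
    index: str


def load_settings(env=os.environ):
    """Read the Elasticsearch settings, failing early if any is missing."""
    required = [
        "ELASTIC_ENDPOINT",
        "ELASTIC_USERNAME",
        "ELASTIC_PASSWORD",
        "ELASTIC_INDEX",
    ]
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return Settings(
        endpoint=env["ELASTIC_ENDPOINT"],
        username=env["ELASTIC_USERNAME"],
        password=env["ELASTIC_PASSWORD"],
        index=env["ELASTIC_INDEX"],
    )


def build_index_mappings(vector_field="title_vector"):
    """Mapping with a text title and a cosine-similarity dense_vector field.

    The field name title_vector is a hypothetical choice.
    """
    return {
        "properties": {
            "title_fa": {"type": "text"},
            vector_field: {
                "type": "dense_vector",
                "dims": EMBEDDING_DIMS,
                "index": True,
                "similarity": "cosine",
            },
        }
    }


def build_knn_query(query_vector, k=10, num_candidates=100,
                    vector_field="title_vector"):
    """Body for the knn search option: top k hits out of num_candidates."""
    if len(query_vector) != EMBEDDING_DIMS:
        raise ValueError(f"expected {EMBEDDING_DIMS} dimensions")
    return {
        "field": vector_field,
        "query_vector": list(query_vector),
        "k": k,
        "num_candidates": num_candidates,
    }
```

With the official Python client, these dictionaries would typically be passed as the mappings argument when creating the index and as the knn argument of a search call. Raising num_candidates tends to improve recall at the cost of latency.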
- Use a more powerful and lightweight transformer model (preferably fine-tuned on Persian content) for better embeddings
- Include images in embeddings for multimodal search
- Implement a feedback loop to improve search results over time
MIT