SemanTweet Search allows you to search over all your tweets from the Twitter archive using semantic similarity. A demo is available here.
It preprocesses your tweets, generates embeddings using OpenAI's small or large embedding model, stores the data and embeddings in a LanceDB vector database, and provides a web interface to search and view the results.
You can also pre-filter the semantic search by time, likes, retweets, and media-only or link-only tweets.
Pre-filtering with SQL operations not only narrows the results but also reduces the vector search space, which speeds up the search.
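For illustration, here is a minimal sketch of a pre-filtered vector search in LanceDB. The table name, column names, and filter values are assumptions for the example, not necessarily what this repo uses.

```python
# Hedged sketch: pre-filtered vector search with LanceDB.
# Table/column names ("tweets", "favorite_count", "created_at") are illustrative only.
import lancedb
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment
query_vec = client.embeddings.create(
    model="text-embedding-3-large",
    input="tweets about vector databases",
).data[0].embedding

db = lancedb.connect("data/openai_db")
tbl = db.open_table("tweets")

results = (
    tbl.search(query_vec)
    # SQL-style filter applied before the ANN search, shrinking the search space
    .where("favorite_count > 10 AND created_at >= '2023-01-01'", prefilter=True)
    .limit(20)
    .to_pandas()
)
print(results.head())
```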
You can additionally use/edit projector.py with the TensorFlow Embedding Projector to visualize your tweets using the t-SNE algorithm, as shown here.
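The projector only needs two TSV files (vectors and metadata). The sketch below is not the repo's projector.py; the CSV filename, the "full_text" column, and the assumption that embeddings are stored as JSON-like lists are illustrative.

```python
# Hedged sketch: export embeddings + labels as TSV files that
# https://projector.tensorflow.org can load for t-SNE exploration.
import json
import numpy as np
import pandas as pd

df = pd.read_csv("openai_embeddings.csv")                      # illustrative filename
vectors = np.array([json.loads(e) for e in df["embedding"]])   # assumes JSON-encoded lists

np.savetxt("vectors.tsv", vectors, delimiter="\t")
df["full_text"].to_csv("metadata.tsv", sep="\t", index=False, header=False)
```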
UPDATE

- (25/3/2024) bge-small-en-v1.5 embedding support added. No API key required. It ranks 29th on the MTEB leaderboard, weighs 130 MB, has 384 dimensions, and a sequence length of 512. Note that it is not multi-lingual. Technically, many embeddings from sentence-transformers are possible; you can refer to the LanceDB docs here.
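To show what local embedding generation looks like, here is a minimal sentence-transformers sketch; it is not necessarily how the repo wires the model in.

```python
# Hedged sketch: local embeddings with bge-small-en-v1.5 (384-dim, no API key).
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-small-en-v1.5")
vecs = model.encode(
    ["just shipped a new side project", "debugging a race condition all night"],
    normalize_embeddings=True,  # cosine similarity then reduces to a dot product
)
print(vecs.shape)  # (2, 384)
```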
- (24/3/2024) Added CLIP-based image search over the tweets_media folder. Check the code in app_image_search.py after installing the requirements (i.e., through step 5). It does not require running the setup bash scripts; it is a standalone program. The first run takes roughly 5-10 minutes as it creates the embeddings; subsequent runs are instant. This is also available as a standalone lightweight repository, Embeddit, where you can use any images folder.
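As a rough idea of how CLIP scores images against text queries with open_clip (the checkpoint name and image path below are illustrative; app_image_search.py may differ):

```python
# Hedged sketch: score one image from tweets_media/ against text queries with open_clip.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")

image = preprocess(Image.open("tweets_media/example.jpg")).unsqueeze(0)  # illustrative path
text = tokenizer(["a screenshot of code", "a photo of a cat"])

with torch.no_grad():
    img_feat = model.encode_image(image)
    txt_feat = model.encode_text(text)
    img_feat /= img_feat.norm(dim=-1, keepdim=True)
    txt_feat /= txt_feat.norm(dim=-1, keepdim=True)
    print(img_feat @ txt_feat.T)  # higher score = better match
```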
Technologies Used:
- Twitter archive for data
- OpenAI text-embedding-3-large embeddings by default for semantic search
- LanceDB for vector search and SQL operations
- Hybrid search provided by LanceDB, combining BM25 and embedding search (see the sketch after this list)
- Flask for the server
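A minimal sketch of what LanceDB hybrid search looks like. The table name and text column are assumptions, and building the full-text index needs the tantivy package installed; the repo's actual wiring may differ.

```python
# Hedged sketch: LanceDB hybrid search (BM25 full-text + vector) with the default reranker.
import lancedb
from openai import OpenAI

db = lancedb.connect("data/openai_db")
tbl = db.open_table("tweets")               # illustrative table name
tbl.create_fts_index("text", replace=True)  # BM25 index over the tweet text column (name assumed)

query = "rust compiler error"
query_vec = OpenAI().embeddings.create(
    model="text-embedding-3-large", input=query
).data[0].embedding

hits = (
    tbl.search(query_type="hybrid")  # combines keyword and vector scores
    .vector(query_vec)
    .text(query)
    .limit(10)
    .to_pandas()
)
print(hits.head())
```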
Prerequisites:
- Python 3.x
- OpenAI API key
- Twitter archive data
1. Clone the repository:
   git clone https://github.com/sankalp1999/semantweet-search.git
   cd semantweet-search/
2. Download your Twitter archive (it takes about 2 days to become available).
   Go to: More (3-dot button) > Settings and Privacy > Your Account > Download an archive of your data.
   Extract it, put the extracted folder at the root of this project, and rename it to twitter-archive.

3. Create a virtual environment (make sure you do this at the root of the project):
   python3 -m venv venv
4. Activate the virtual environment:
   - For Unix/Linux:
     source venv/bin/activate
   - For Windows:
     venv\Scripts\activate
5. Install the required dependencies:
   pip install -r requirements.txt

   If you want to try out image vector search, also run the command below. I have not included it in requirements.txt because it downloads a ~620 MB model, and not everyone would want that by default.
   pip install open_clip_torch
6. (Not required if using sentence-transformers) Set up your OpenAI API key as an environment variable:
   export OPENAI_API_KEY=your_api_key
7. By default, this repo uses OpenAI's text-embedding-3-large. You can change it to text-embedding-3-small. The change is required in two places:
   - openai/async_openai_embedding_two.py, around line 11.
   - the MODEL_NAME variable in openai/create_lance_db_table_openai_three.py, around line 8.
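   For context, MODEL_NAME is simply the model string passed to the OpenAI embeddings endpoint; the repo's exact code may differ from this sketch.

```python
# Hedged sketch of how MODEL_NAME is used; not the repo's exact code.
from openai import OpenAI

MODEL_NAME = "text-embedding-3-small"  # or "text-embedding-3-large"
client = OpenAI()

resp = client.embeddings.create(model=MODEL_NAME, input=["tweet one", "tweet two"])
vectors = [d.embedding for d in resp.data]
print(len(vectors), len(vectors[0]))   # 1536 dims for -small, 3072 for -large
```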
8. Run the setup script.
   If you are using OpenAI embeddings:
   chmod +x run_scripts.sh
   ./run_scripts.sh
   If you are using the sentence-transformers (bge-small-en-v1.5) embeddings:
   chmod +x run_sentence_tf_scripts.sh
   ./run_sentence_tf_scripts.sh
   and uncomment lines 15 and 16 in app.py:
   # db = lancedb.connect("data/bge_embeddings")
   # table = db.open_table("bge_table")
9. Start the application:
   python app.py
   or
   flask run

Enjoy!
OpenAI Embedding Flow:

graph TD
A[Twitter Archive Data] --> B[preprocess_tweets_one.py]
B --> C[Preprocessed Tweets CSV]
C --> D[async_openai_embedding_two.py]
D --> E[Embeddings CSV]
E --> F[create_lance_db_table_openai_three.py]
F --> G[LanceDB Database]
G --> H[Web Interface]
H --> I[Search Tweets]
I --> J[View Results]
The OpenAI embedding flow consists of the following steps:

- preprocess_tweets_one.py: preprocesses the tweets from the Twitter archive, extracting the relevant information and saving it to a CSV file.
- async_openai_embedding_two.py: reads the preprocessed tweets from the CSV file, generates embeddings asynchronously using OpenAI's embedding model, and saves the embeddings to a new CSV file.
- create_lance_db_table_openai_three.py: reads the generated embeddings from the CSV file, creates a LanceDB table using the specified schema, and stores the data in the database.

The run_scripts.sh script automates the execution of these steps in the correct order.
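As a rough sketch of the last step (not the repo's exact create_lance_db_table_openai_three.py; the CSV filename, column names, and table name are assumptions):

```python
# Hedged sketch: load tweets + embeddings from CSV and store them in a LanceDB table.
import json
import lancedb
import pandas as pd

df = pd.read_csv("openai_embeddings.csv")        # illustrative filename
df["vector"] = df["embedding"].map(json.loads)   # assumes embeddings were saved as JSON-like lists
df = df.drop(columns=["embedding"])

db = lancedb.connect("data/openai_db")
tbl = db.create_table("tweets", data=df, mode="overwrite")  # illustrative table name
print(tbl.count_rows())
```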
Notes:
- The project uses the text-embedding-3-large model by default. You can change the model by modifying the MODEL_NAME variable.
- The batch size for generating embeddings is set to 32 to stay within the token limit. Adjust the batch size if needed (a rough sketch of this batching pattern follows these notes).
- The LanceDB database is stored in the data/openai_db directory.
- The project also includes a synchronous version of the OpenAI embedding generation script (create_openai_embedding_sync_two.py), which can be used as an alternative to the asynchronous version.
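A rough sketch of the batching pattern (not the repo's exact async_openai_embedding_two.py): split the tweets into batches of 32 and embed them concurrently.

```python
# Hedged sketch: batched async embedding generation with the OpenAI Python client.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()   # expects OPENAI_API_KEY in the environment
BATCH_SIZE = 32          # keeps each request comfortably under the token limit

async def embed_batch(texts: list[str]) -> list[list[float]]:
    resp = await client.embeddings.create(model="text-embedding-3-large", input=texts)
    return [d.embedding for d in resp.data]

async def embed_all(texts: list[str]) -> list[list[float]]:
    batches = [texts[i:i + BATCH_SIZE] for i in range(0, len(texts), BATCH_SIZE)]
    results = await asyncio.gather(*(embed_batch(b) for b in batches))
    return [vec for batch in results for vec in batch]

if __name__ == "__main__":
    vecs = asyncio.run(embed_all(["tweet one", "tweet two", "tweet three"]))
    print(len(vecs), len(vecs[0]))
```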
This project is licensed under the MIT License. See the LICENSE file for details.