Similar images search

Overview

This project showcases the search capabilities of Weaviate, powered by the CLIP model. It shows how to build AI applications that combine multilingual text understanding with image search in just a few lines of code.

In particular, we will index a collection of random pictures featuring various foods from around the world.
Subsequently, we'll be able to search through them using three different inputs:

  1. User-provided text
  2. Selected image from the indexed collection
  3. Any uploaded image

In Weaviate terms, these scenarios correspond to the following operators:

  1. nearText
  2. nearObject
  3. nearImage
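
As a rough illustration of how the three operators differ, the sketch below builds the corresponding GraphQL queries by hand using only the standard library. The class name `Food` and the property `filename` are hypothetical placeholders, not names taken from this repository, and the demo app itself may issue its queries through the Python client instead.

```python
import base64
import json


def build_query(operator: str, argument: dict, limit: int = 5) -> str:
    """Build a Weaviate GraphQL Get query for one of the near* operators."""
    # GraphQL input objects use unquoted keys, so keys are rendered by hand
    # while json.dumps handles quoting and escaping of the values.
    args = ", ".join(
        f"{key}: {json.dumps(value, ensure_ascii=False)}"
        for key, value in argument.items()
    )
    # "Food" and "filename" are hypothetical class and property names.
    return f"{{ Get {{ Food({operator}: {{{args}}}, limit: {limit}) {{ filename }} }} }}"


def near_text(text: str) -> str:
    # 1. User-provided text
    return build_query("nearText", {"concepts": [text]})


def near_object(object_id: str) -> str:
    # 2. An image that is already indexed, referenced by its object ID
    return build_query("nearObject", {"id": object_id})


def near_image(image_bytes: bytes) -> str:
    # 3. Any uploaded image, sent as a base64-encoded string
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return build_query("nearImage", {"image": encoded})


if __name__ == "__main__":
    print(near_text("pizza"))
```

Only the search argument changes between the three scenarios; the class, the returned properties, and the limit stay the same.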

CLIP model

CLIP, or Contrastive Language-Image Pre-training, is a multimodal deep learning model by OpenAI that is designed to understand and generate meaningful representations of images and text, allowing it to perform tasks that involve both modalities.

CLIP is trained to learn a joint embedding space where image and text representations are aligned. This means that similar concepts in images and text are close to each other in the embedding space. In this demo we will use a multilingual CLIP model.
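
To make "close to each other in the embedding space" concrete, here is a minimal sketch that ranks images against a text query by cosine similarity. The three-dimensional vectors are made up for illustration; real CLIP embeddings have hundreds of dimensions and are produced by the model, not written by hand.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


# Toy vectors, invented for this example only.
text_vector = [0.9, 0.1, 0.0]  # stands in for the embedding of a text query
image_vectors = {
    "sushi.jpg": [0.8, 0.2, 0.1],
    "burger.jpg": [0.1, 0.9, 0.2],
}

# Rank images from most to least similar to the text query.
ranked = sorted(
    image_vectors,
    key=lambda name: cosine_similarity(text_vector, image_vectors[name]),
    reverse=True,
)

if __name__ == "__main__":
    print(ranked)
```

Because text and images share one space, the same comparison works for text-to-image and image-to-image search.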

Technology stack

  • Python
  • Weaviate
  • Streamlit
  • Docker

Used Weaviate modules/models

multi2vec-clip vectorizer
The multi2vec-clip module enables Weaviate to obtain vectors locally from text or images using a Sentence-BERT CLIP model.

To use it, you need to enable it in the Docker Compose file.
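
Once the module is enabled, a class typically opts in through its schema definition. The sketch below shows what such a class definition might look like as a Python dictionary. The class name `Food` and the property names `image` and `filename` are assumptions for illustration; the schema actually used by this project may differ.

```python
import json

# Hypothetical class definition; names are not taken from this repository.
food_class = {
    "class": "Food",
    # Ask Weaviate to vectorize objects of this class with the CLIP module.
    "vectorizer": "multi2vec-clip",
    "moduleConfig": {
        "multi2vec-clip": {
            # Properties whose content is embedded as images.
            "imageFields": ["image"],
        }
    },
    "properties": [
        # Images are stored as base64-encoded blobs.
        {"name": "image", "dataType": ["blob"]},
        {"name": "filename", "dataType": ["text"]},
    ],
}

if __name__ == "__main__":
    print(json.dumps(food_class, indent=2))
```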

sentence-transformers/clip-ViT-B-32-multilingual-v1
The particular model that we'll use is sentence-transformers/clip-ViT-B-32-multilingual-v1. It supports encoding text in 50+ languages. The model is based on Multilingual Knowledge Distillation, which uses the original clip-ViT-B-32 model as the teacher and trains a multilingual DistilBERT model as the student. As mentioned above, the model maps text and images to a common vector space such that the distance between them reflects their semantic similarity.

Prerequisites

  1. Python 3 interpreter installed
  2. Ability to execute docker compose (the most straightforward way to do this on Windows/Mac is to install Docker Desktop)

Setup instructions

Start up

  1. Clone this repository

  2. Download the dataset from Kaggle (https://www.kaggle.com/datasets/abhijeetbhilare/world-cuisines/; you need to be logged in to Kaggle) and unzip it to the project root

  3. Create a virtual environment and activate it

    Note
    This was tested using Python 3.10

    python3 -m venv venv
    source venv/bin/activate
  4. Install all required dependencies

    pip install -r requirements.txt
  5. Run a containerized instance of Weaviate. It also includes the vectorizer module that computes the embeddings.

    Note
    Make sure nothing is occupying port 8080.
    If something is, either stop that process or change the port that Weaviate uses.

    docker compose up
  6. Index the dataset in Weaviate. By default, 1000 pictures will be ingested.

    python add_data.py

    For a larger dataset, use the --image-number parameter to set the number of pictures to ingest:

    python add_data.py --image-number 3000
  7. Run the Streamlit demo

    streamlit run app.py

    Now you can open the app at http://localhost:8501/ and edit app.py on the fly.
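
The steps above rely on port 8080 (Weaviate) and port 8501 (Streamlit) being available. A quick way to check before starting is sketched below; the helper `port_in_use` is a hypothetical name and is not part of this repository.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something is already accepting connections on the port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    for port in (8080, 8501):
        state = "in use" if port_in_use(port) else "free"
        print(f"Port {port} is {state}")
```

Run it before `docker compose up` to see whether another process already holds a port; run it afterwards and port 8080 should be reported as in use once Weaviate is up.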

Shut down

  1. Both the Streamlit app and Docker Compose can be stopped with Ctrl+C in the corresponding terminal window
  2. To remove the created Docker containers and volumes, use

    docker compose down -v

Usage instructions

Dataset license

The dataset used for this example is available on Kaggle: https://www.kaggle.com/datasets/abhijeetbhilare/world-cuisines/