Add LanceDB data loader with FastAPI endpoints #27

Merged 17 commits on Sep 24, 2023
3 changes: 2 additions & 1 deletion .gitignore
@@ -135,4 +135,5 @@ dmypy.json
 data/*.json
 data/*.jsonl
 dbs/meilisearch/meili_data
-*/*/onnx_model/onnx
+*/*/onnx_model/onnx
+*/*/lancedb/*.lance
9 changes: 9 additions & 0 deletions dbs/lancedb/.env.example
@@ -0,0 +1,9 @@
LANCEDB_DIR = "lancedb"
API_PORT = 8006
EMBEDDING_MODEL_CHECKPOINT = "sentence-transformers/multi-qa-MiniLM-L6-cos-v1"

# Container image tag
TAG = "0.1.0"

# Docker project namespace (defaults to the current folder name if not set)
COMPOSE_PROJECT_NAME = lancedb_wine
13 changes: 13 additions & 0 deletions dbs/lancedb/Dockerfile
@@ -0,0 +1,13 @@
FROM python:3.10-slim-bullseye

WORKDIR /wine

COPY ./requirements.txt /wine/requirements.txt

RUN pip install --no-cache-dir -U pip wheel setuptools
RUN pip install --no-cache-dir -r /wine/requirements.txt

COPY ./api /wine/api
COPY ./schemas /wine/schemas

EXPOSE 8000
184 changes: 184 additions & 0 deletions dbs/lancedb/README.md
@@ -0,0 +1,184 @@
# LanceDB

[LanceDB](https://github.com/lancedb/lancedb) is an embedded vector database written in Rust. The primary advantage of LanceDB's serverless architecture is that the database sits right next to the application, with no separate server process to manage. Queries retrieve the results most semantically similar to an input natural language query, where semantic similarity is obtained by comparing the sentence embeddings (which are n-dimensional vectors) of the input query against those stored in the database.
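
Because the database is embedded, querying it is just a Python library call. A minimal sketch (assuming the `wines` table and embedding model used later in this repo):

```py
import lancedb
from sentence_transformers import SentenceTransformer

# LanceDB is embedded: "connecting" simply opens a local directory
db = lancedb.connect("./lancedb")
table = db.open_table("wines")

# Embed the query with the same model used at ingestion time
model = SentenceTransformer("sentence-transformers/multi-qa-MiniLM-L6-cos-v1")
query_vector = model.encode("earthy Tuscan red with firm tannins")

# Retrieve the three most semantically similar wines
results = table.search(query_vector).limit(3).to_df()
```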

Code is provided for ingesting the wine reviews dataset into LanceDB. In addition, a query API written in FastAPI allows a user to query the available endpoints. As always with FastAPI, documentation is available via OpenAPI (http://localhost:8000/docs).

* Unlike "normal" databases, in a vector DB, the vectorization process is the biggest bottleneck
* [Pydantic](https://docs.pydantic.dev) is used for schema validation, both prior to data ingestion and during API request handling
* For ease of reproducibility during development, the whole setup is orchestrated and deployed via Docker

## Setup

Note that this code base has been tested on Python 3.10, and requires a minimum of Python 3.10 to work. Install dependencies via `requirements.txt`.

```sh
# Setup the environment for the first time
python -m venv .venv # python -> python 3.10

# Activate the environment (for subsequent runs)
source .venv/bin/activate

python -m pip install -r requirements.txt
```

---

## Step 1: Set up containers

A `docker-compose.yml` file is provided, which starts a FastAPI container with the information supplied in `.env`. Because LanceDB is serverless, the database doesn't run in a separate process -- it is simply part of the Python code that is imported into the FastAPI backend. The API is then served via `uvicorn`, which is a production-ready ASGI server that is used by FastAPI.
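
For quick local debugging outside Docker, the same app can also be served directly with `uvicorn`. A minimal sketch (this entry point is hypothetical, not part of the repo):

```py
# run_local.py -- hypothetical local entry point
import uvicorn

if __name__ == "__main__":
    # Serve the FastAPI app defined in api/main.py on port 8000
    uvicorn.run("api.main:app", host="0.0.0.0", port=8000, reload=True)
```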

The FastAPI service can be restarted at any time for maintenance and updates by simply running the `docker restart <container_name>` command.

**💡 Note:** The setup shown here would not be ideal in production, as there are other details related to security and scalability that a simple Docker setup does not address, but it's a good starting point to begin experimenting!

### Use `sbert` model

If using the `sbert` model [from the sentence-transformers repo](https://www.sbert.net/) directly, use the provided `docker-compose.yml` to start the FastAPI container, which serves the API on top of the embedded LanceDB database.

**⚠️ Note**: This approach will attempt to run `sbert` on a GPU if available, and if not, on CPU (while utilizing all CPU cores).

```sh
docker compose -f docker-compose.yml up -d
```
Tear down the services using the following command.

```sh
docker compose -f docker-compose.yml down
```

## Step 2: Ingest the data

We ingest both the JSON data (for filtering) and the sentence embedding vectors (for similarity search) into LanceDB. For this dataset, it's reasonable to expect that a simple concatenation of fields like `title`, `variety` and `description` results in a useful sentence embedding that can be compared against a search query, which is also converted to a vector at query time.

As an example, consider the following data snippet from the `data/` directory in this repo:

```json
"title": "Castello San Donato in Perano 2009 Riserva (Chianti Classico)",
"description": "Made from a blend of 85% Sangiovese and 15% Merlot, this ripe wine delivers soft plum, black currants, clove and cracked pepper sensations accented with coffee and espresso notes. A backbone of firm tannins give structure. Drink now through 2019.",
"variety": "Red Blend"
```

The three fields are concatenated for vectorization as follows:

```py
to_vectorize = data["variety"] + data["title"] + data["description"]
```
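
With the model specified in `.env.example`, this is a one-liner; note that `multi-qa-MiniLM-L6-cos-v1` produces 384-dimensional embeddings (a sketch using the snippet above):

```py
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/multi-qa-MiniLM-L6-cos-v1")

data = {
    "title": "Castello San Donato in Perano 2009 Riserva (Chianti Classico)",
    "description": "Made from a blend of 85% Sangiovese and 15% Merlot...",
    "variety": "Red Blend",
}
to_vectorize = data["variety"] + data["title"] + data["description"]
vector = model.encode(to_vectorize)  # numpy array of shape (384,)
```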

### Choice of embedding model

[SentenceTransformers](https://www.sbert.net/) is a Python framework for a range of sentence and text embeddings. It is the result of extensive work on fine-tuning BERT for semantic similarity tasks using Siamese BERT networks, where the model is trained to predict the similarity between sentence pairs. The original work is [described here](https://arxiv.org/abs/1908.10084).

#### Why use sentence transformers?

Although larger and more powerful text embedding models exist (such as [OpenAI embeddings](https://platform.openai.com/docs/guides/embeddings)), they are paid services that charge per token, which can become expensive at scale. SentenceTransformers models are free and open-source, and have been optimized over several years both to utilize all CPU cores and to reduce model size while maintaining accuracy. A full list of sentence transformer models [is on the project page](https://www.sbert.net/docs/pretrained_models.html).

For this work, it makes sense to use one of the fastest models on this list, the `multi-qa-MiniLM-L6-cos-v1` **uncased** model. As per the docs, it was tuned for semantic search and question answering, and generates sentence embeddings for single sentences or paragraphs up to a maximum sequence length of 512. It was trained on 215M question-answer pairs from various sources. Compared to the more general-purpose `all-MiniLM-L6-v2` model, it shows slightly better results on semantic search tasks at a comparable speed. [See the sbert docs](https://www.sbert.net/docs/pretrained_models.html) for more details on performance comparisons between the various pretrained models.

### Run data loader

Data is ingested into the LanceDB database through the scripts in the `scripts` directory. The scripts validate the input JSON data via [Pydantic](https://docs.pydantic.dev), and then index both the JSON data and the vectors into LanceDB using the [LanceDB Python client](https://lancedb.github.io/lancedb/).

Prior to indexing, we simply concatenate the key fields that contain useful information about each wine and vectorize the combined string, as shown above.

If running on a MacBook or another development machine, it's possible to generate sentence embeddings using the original `sbert` model as per the `EMBEDDING_MODEL_CHECKPOINT` variable in the `.env` file.

```sh
cd scripts
python bulk_index_sbert.py
```
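
Conceptually, the loader does something like the following. This is a simplified sketch, not the actual script: `bulk_index_sbert.py` also validates each record with Pydantic and batches its writes, and the file path shown here is illustrative.

```py
import json

import lancedb
from sentence_transformers import SentenceTransformer

db = lancedb.connect("../lancedb")
model = SentenceTransformer("sentence-transformers/multi-qa-MiniLM-L6-cos-v1")

# Illustrative path: use the actual JSONL file from the data/ directory
with open("../data/wine_reviews.jsonl") as f:
    wines = [json.loads(line) for line in f]

# Attach a sentence embedding of the concatenated key fields to each record
for wine in wines:
    text = wine["variety"] + wine["title"] + wine["description"]
    wine["vector"] = model.encode(text).tolist()

# Create the "wines" table holding both the JSON fields and the vectors
table = db.create_table("wines", data=wines, mode="overwrite")
```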

Depending on your machine's CPU, this may take a while. On a 2022 M2 MacBook Pro, vectorizing and bulk-indexing ~130k records took about 25 minutes. When tested on an AWS EC2 t2.medium instance, the same process took just over an hour.

## Step 3: Test API

Once the data has been successfully loaded into LanceDB and the containers are up and running, we can test out a search query via an HTTP request as follows.

```sh
curl -X 'GET' \
'http://0.0.0.0:8000/wine/search?terms=tuscany%20red&max_price=100&country=Italy'
```

This cURL request passes the search terms "**tuscany red**", along with the country "Italy" and a maximum price of "100", to the `/wine/search` endpoint, which the FastAPI backend parses into a working filter query for LanceDB. The query retrieves results that are semantically similar to the input query for red Tuscan wines, and, if the setup was done correctly, we should see the following response:

```json
[
{
"id": 8456,
"country": "Italy",
"province": "Tuscany",
"title": "Petra 2008 Petra Red (Toscana)",
"description": "From one of Italy's most important showcase designer wineries, this blend of Cabernet Sauvignon and Merlot lives up to its super Tuscan celebrity. It is gently redolent of dark chocolate, ripe fruit, leather, tobacco and crushed black pepper—the bouquet's elegant moderation is one of its strongest points. The mouthfeel is rich, creamy and long. Drink after 2018.",
"points": 92,
"price": 80.0,
"variety": "Red Blend",
"winery": "Petra"
},
{
"id": 896,
"country": "Italy",
"province": "Tuscany",
"title": "Le Buche 2006 Giuseppe Olivi Memento Red (Toscana)",
"description": "Le Buche is an interesting winery to watch, and its various Tuscan blends show great promise. Memento is equal parts Sangiovese and Syrah with a soft, velvety texture and a bright berry finish.",
"points": 90,
"price": 45.0,
"variety": "Red Blend",
"winery": "Le Buche"
},
{
"id": 9343,
"country": "Italy",
"province": "Tuscany",
"title": "Poggio Mandorlo 2008 Red (Toscana)",
"description": "Made from Merlot and Cabernet Franc, this structured red offers aromas of black currant, toast, graphite and a whiff of cedar. The firm palate offers coconut, coffee, grilled sage and red berry alongside bracing tannins. Drink sooner rather than later to capture the fruit richness.",
"points": 89,
"price": 60.0,
"variety": "Red Blend",
"winery": "Poggio Mandorlo"
}
]
```

Not bad! This example correctly returns some highly rated Tuscan red wines from Italy along with their prices. More specific search queries, such as those for low/high acidity or particular flavour profiles, can also be entered to get more relevant results by country.
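
Under the hood, the backend combines a vector search with a SQL-style `where` filter. A hedged sketch of the equivalent LanceDB call (column names taken from the response above; the actual query construction lives in `api/routers/rest.py`):

```py
import lancedb
from sentence_transformers import SentenceTransformer

db = lancedb.connect("./lancedb")
table = db.open_table("wines")
model = SentenceTransformer("sentence-transformers/multi-qa-MiniLM-L6-cos-v1")

results = (
    table.search(model.encode("tuscany red"))
    .where("country = 'Italy' AND price <= 100.0")  # SQL-style filter
    .limit(5)
    .to_df()
)
```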

## Step 4: Extend the API

The API can be easily extended with the provided structure.

- The `schemas` directory houses the Pydantic schemas, both for the data input as well as for the endpoint outputs
  - As the data model gets more complex, we can add more files and separate the ingestion logic from the API logic here
- The `api/routers` directory contains the endpoint routes, so that we can provide additional endpoints that answer more business questions (see the sketch after this list)
  - e.g., "What are the top rated wines from Argentina?"
  - In general, it makes sense to organize specific business use cases into their own router files
- The `api/main.py` file collects all the routes and schemas to run the API
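
For instance, a hypothetical router for the Argentina question might look like the sketch below (the file name, route, and parameter names are all illustrative); it would then be registered in `api/main.py` via `app.include_router`.

```py
# api/routers/country.py -- hypothetical example router
from fastapi import APIRouter, Request

router = APIRouter()


@router.get("/top_by_country")
def top_rated_by_country(request: Request, country: str, limit: int = 5):
    """Return the highest-rated wines for a given country."""
    table = request.app.table  # LanceDB table attached in the lifespan handler
    # Simplified: pull the table into pandas and filter/sort in memory
    df = table.to_pandas()
    top = df[df["country"] == country].nlargest(limit, "points")
    return top[["title", "country", "points", "price"]].to_dict(orient="records")
```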


### Existing endpoints

The following search and count endpoints are implemented and can be accessed via the API.

```
GET
/wine/search
Search By Similarity


GET
/wine/search_by_country
Search By Similarity And Country


GET
/wine/search_by_filters
Search By Similarity And Filters


GET
/wine/count_by_country
Count By Country


GET
/wine/count_by_filters
Count By Filters
```
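
These can be exercised from Python as well; for example (the query parameter names are assumptions based on the endpoint names):

```py
import requests

# Count wines from a given country; parameter name is an assumption
response = requests.get(
    "http://localhost:8000/wine/count_by_country",
    params={"country": "Italy"},
)
print(response.json())
```
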
Empty file added dbs/lancedb/api/__init__.py
13 changes: 13 additions & 0 deletions dbs/lancedb/api/config.py
@@ -0,0 +1,13 @@
from pydantic_settings import BaseSettings, SettingsConfigDict


class Settings(BaseSettings):
    model_config = SettingsConfigDict(
        env_file=".env",
        extra="allow",
    )

    lancedb_dir: str
    api_port: str
    embedding_model_checkpoint: str
    tag: str
55 changes: 55 additions & 0 deletions dbs/lancedb/api/main.py
@@ -0,0 +1,55 @@
from collections.abc import AsyncGenerator
from contextlib import asynccontextmanager
from functools import lru_cache

import lancedb
from fastapi import FastAPI
from sentence_transformers import SentenceTransformer

from api.config import Settings
from api.routers.rest import router

model_type = "sbert"


@lru_cache()
def get_settings():
    # Use lru_cache to avoid loading .env file for every request
    return Settings()


@asynccontextmanager
async def lifespan(app: FastAPI) -> AsyncGenerator[None, None]:
    """Async context manager for lancedb connection."""
    settings = get_settings()
    model_checkpoint = settings.embedding_model_checkpoint
    app.model = SentenceTransformer(model_checkpoint)
    app.model_type = "sbert"
    # Define LanceDB client
    db = lancedb.connect("./lancedb")
    app.table = db.open_table("wines")
    print("Successfully connected to LanceDB")
    yield
    print("Successfully closed LanceDB connection and released resources")


app = FastAPI(
    title="REST API for wine reviews on LanceDB",
    description=(
        "Query from a LanceDB database of 130k wine reviews from the Wine Enthusiast magazine"
    ),
    version=get_settings().tag,
    lifespan=lifespan,
)


@app.get("/", include_in_schema=False)
async def root():
    return {
        "message": "REST API for querying LanceDB database of 130k wine reviews from the Wine Enthusiast magazine"
    }


# Attach routes
app.include_router(router, prefix="/wine", tags=["wine"])