Add LanceDB data loader with FastAPI endpoints #27

Merged 17 commits on Sep 24, 2023
3 changes: 2 additions & 1 deletion .gitignore
@@ -135,4 +135,5 @@ dmypy.json
 data/*.json
 data/*.jsonl
 dbs/meilisearch/meili_data
-*/*/onnx_model/onnx
+*/*/onnx_model/onnx
+*/*/lancedb/*.lance
9 changes: 9 additions & 0 deletions dbs/lancedb/.env.example
@@ -0,0 +1,9 @@
LANCEDB_DIR = "lancedb"
API_PORT = 8006
EMBEDDING_MODEL_CHECKPOINT = "sentence-transformers/multi-qa-MiniLM-L6-cos-v1"

# Container image tag
TAG = "0.1.0"

# Docker project namespace (defaults to the current folder name if not set)
COMPOSE_PROJECT_NAME = lancedb_wine
13 changes: 13 additions & 0 deletions dbs/lancedb/Dockerfile
@@ -0,0 +1,13 @@
FROM python:3.10-slim-bullseye

WORKDIR /wine

COPY ./requirements.txt /wine/requirements.txt

RUN pip install --no-cache-dir -U pip wheel setuptools
RUN pip install --no-cache-dir -r /wine/requirements.txt

COPY ./api /wine/api
COPY ./schemas /wine/schemas

EXPOSE 8000
184 changes: 184 additions & 0 deletions dbs/lancedb/README.md
@@ -0,0 +1,184 @@
# LanceDB

[LanceDB](https://github.com/lancedb/lancedb) is an embedded vector database written in Rust. The primary advantage of LanceDB's serverless architecture is that the database sits right next to the application, with no separate server process to manage. Queries retrieve the results most semantically similar to an input natural language query, where semantic similarity is obtained by comparing the sentence embeddings (which are n-dimensional vectors) of the input query against those stored in the database.
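
Because the database is embedded, querying it is just a Python library call. A minimal sketch (assuming the `wines` table and embedding model used later in this repo):

```py
import lancedb
from sentence_transformers import SentenceTransformer

# LanceDB is embedded: "connecting" simply opens a local directory
db = lancedb.connect("./lancedb")
table = db.open_table("wines")

# Embed the query with the same model used at ingestion time
model = SentenceTransformer("sentence-transformers/multi-qa-MiniLM-L6-cos-v1")
query_vector = model.encode("earthy Tuscan red with firm tannins")

# Retrieve the three most semantically similar wines
results = table.search(query_vector).limit(3).to_df()
```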

Code is provided for ingesting the wine reviews dataset into LanceDB. In addition, a query API written in FastAPI allows a user to query the available endpoints. As always with FastAPI, documentation is available via OpenAPI (http://localhost:8000/docs).

* Unlike "normal" databases, in a vector DB, the vectorization process is the biggest bottleneck
* [Pydantic](https://docs.pydantic.dev) is used for schema validation, both prior to data ingestion and during API request handling
* For ease of reproducibility during development, the whole setup is orchestrated and deployed via Docker

## Setup

Note that this code base has been tested on Python 3.10, and requires a minimum of Python 3.10 to work. Install dependencies via `requirements.txt`.

```sh
# Setup the environment for the first time
python -m venv .venv # python -> python 3.10

# Activate the environment (for subsequent runs)
source .venv/bin/activate

python -m pip install -r requirements.txt
```

---

## Step 1: Set up containers

A `docker-compose.yml` file is provided, which starts a FastAPI container with the information supplied in `.env`. Because LanceDB is serverless, the database doesn't run in a separate process -- it is simply part of the Python code that is imported into the FastAPI backend. The API is then served via `uvicorn`, which is a production-ready ASGI server that is used by FastAPI.
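
For quick local debugging outside Docker, the same app can also be served directly with `uvicorn`. A minimal sketch (this entry point is hypothetical, not part of the repo):

```py
# run_local.py -- hypothetical local entry point
import uvicorn

if __name__ == "__main__":
    # Serve the FastAPI app defined in api/main.py on port 8000
    uvicorn.run("api.main:app", host="0.0.0.0", port=8000, reload=True)
```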

The FastAPI service can be restarted at any time for maintenance and updates by simply running the `docker restart <container_name>` command.

**💡 Note:** The setup shown here would not be ideal in production, as there are other details related to security and scalability that a simple Docker setup does not address, but it's a good starting point to begin experimenting!

### Use `sbert` model

If using the `sbert` model [from the sentence-transformers repo](https://www.sbert.net/) directly, use the provided `docker-compose.yml` to start the FastAPI container, which serves the API on top of the embedded LanceDB database.

**⚠️ Note**: This approach will attempt to run `sbert` on a GPU if available, and if not, on CPU (while utilizing all CPU cores).

```sh
docker compose -f docker-compose.yml up -d
```
Tear down the services using the following command.

```sh
docker compose -f docker-compose.yml down
```

## Step 2: Ingest the data

We ingest both the JSON data (for filtering) and the sentence embedding vectors (for similarity search) into LanceDB. For this dataset, it's reasonable to expect that a simple concatenation of fields like `title`, `variety` and `description` results in a useful sentence embedding that can be compared against a search query, which is also converted to a vector at query time.

As an example, consider the following data snippet from the `data/` directory in this repo:

```json
"title": "Castello San Donato in Perano 2009 Riserva (Chianti Classico)",
"description": "Made from a blend of 85% Sangiovese and 15% Merlot, this ripe wine delivers soft plum, black currants, clove and cracked pepper sensations accented with coffee and espresso notes. A backbone of firm tannins give structure. Drink now through 2019.",
"variety": "Red Blend"
```

The three fields are concatenated for vectorization as follows:

```py
to_vectorize = data["variety"] + data["title"] + data["description"]
```
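
With the model specified in `.env.example`, this is a one-liner; note that `multi-qa-MiniLM-L6-cos-v1` produces 384-dimensional embeddings (a sketch using the snippet above):

```py
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/multi-qa-MiniLM-L6-cos-v1")

data = {
    "title": "Castello San Donato in Perano 2009 Riserva (Chianti Classico)",
    "description": "Made from a blend of 85% Sangiovese and 15% Merlot...",
    "variety": "Red Blend",
}
to_vectorize = data["variety"] + data["title"] + data["description"]
vector = model.encode(to_vectorize)  # numpy array of shape (384,)
```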

### Choice of embedding model

[SentenceTransformers](https://www.sbert.net/) is a Python framework for a range of sentence and text embeddings. It is the result of extensive work on fine-tuning BERT for semantic similarity tasks using Siamese BERT networks, where the model is trained to predict the similarity between sentence pairs. The original work is [described here](https://arxiv.org/abs/1908.10084).

#### Why use sentence transformers?

Although larger and more powerful text embedding models exist (such as [OpenAI embeddings](https://platform.openai.com/docs/guides/embeddings)), they are paid services that charge per token, which can become expensive at scale. SentenceTransformers models are free and open-source, and have been optimized over several years both to utilize all CPU cores and to reduce model size while maintaining accuracy. A full list of sentence transformer models [is on the project page](https://www.sbert.net/docs/pretrained_models.html).

For this work, it makes sense to use one of the fastest models on this list, the `multi-qa-MiniLM-L6-cos-v1` **uncased** model. As per the docs, it was tuned for semantic search and question answering, and generates sentence embeddings for single sentences or paragraphs up to a maximum sequence length of 512. It was trained on 215M question-answer pairs from various sources. Compared to the more general-purpose `all-MiniLM-L6-v2` model, it shows slightly better results on semantic search tasks at a comparable speed. [See the sbert docs](https://www.sbert.net/docs/pretrained_models.html) for more details on performance comparisons between the various pretrained models.

### Run data loader

Data is ingested into the LanceDB database through the scripts in the `scripts` directory. The scripts validate the input JSON data via [Pydantic](https://docs.pydantic.dev), and then index both the JSON data and the vectors into LanceDB using the [LanceDB Python client](https://lancedb.github.io/lancedb/).

Prior to indexing, we simply concatenate the key fields that contain useful information about each wine and vectorize the combined string, as shown above.

If running on a MacBook or another development machine, it's possible to generate sentence embeddings using the original `sbert` model as per the `EMBEDDING_MODEL_CHECKPOINT` variable in the `.env` file.

```sh
cd scripts
python bulk_index_sbert.py
```
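
Conceptually, the loader does something like the following. This is a simplified sketch, not the actual script: `bulk_index_sbert.py` also validates each record with Pydantic and batches its writes, and the file path shown here is illustrative.

```py
import json

import lancedb
from sentence_transformers import SentenceTransformer

db = lancedb.connect("../lancedb")
model = SentenceTransformer("sentence-transformers/multi-qa-MiniLM-L6-cos-v1")

# Illustrative path: use the actual JSONL file from the data/ directory
with open("../data/wine_reviews.jsonl") as f:
    wines = [json.loads(line) for line in f]

# Attach a sentence embedding of the concatenated key fields to each record
for wine in wines:
    text = wine["variety"] + wine["title"] + wine["description"]
    wine["vector"] = model.encode(text).tolist()

# Create the "wines" table holding both the JSON fields and the vectors
table = db.create_table("wines", data=wines, mode="overwrite")
```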

Depending on your machine's CPU, this may take a while. On a 2022 M2 MacBook Pro, vectorizing and bulk-indexing ~130k records took about 25 minutes. When tested on an AWS EC2 t2.medium instance, the same process took just over an hour.

## Step 3: Test API

Once the data has been successfully loaded into LanceDB and the containers are up and running, we can test out a search query via an HTTP request as follows.

```sh
curl -X 'GET' \
'http://0.0.0.0:8000/wine/search?terms=tuscany%20red&max_price=100&country=Italy'
```

This cURL request passes the search terms "**tuscany red**", along with the country "Italy" and a maximum price of "100", to the `/wine/search` endpoint, which the FastAPI backend parses into a working filter query for LanceDB. The query retrieves results that are semantically similar to the input query for red Tuscan wines, and, if the setup was done correctly, we should see the following response:

```json
[
{
"id": 8456,
"country": "Italy",
"province": "Tuscany",
"title": "Petra 2008 Petra Red (Toscana)",
"description": "From one of Italy's most important showcase designer wineries, this blend of Cabernet Sauvignon and Merlot lives up to its super Tuscan celebrity. It is gently redolent of dark chocolate, ripe fruit, leather, tobacco and crushed black pepper—the bouquet's elegant moderation is one of its strongest points. The mouthfeel is rich, creamy and long. Drink after 2018.",
"points": 92,
"price": 80.0,
"variety": "Red Blend",
"winery": "Petra"
},
{
"id": 896,
"country": "Italy",
"province": "Tuscany",
"title": "Le Buche 2006 Giuseppe Olivi Memento Red (Toscana)",
"description": "Le Buche is an interesting winery to watch, and its various Tuscan blends show great promise. Memento is equal parts Sangiovese and Syrah with a soft, velvety texture and a bright berry finish.",
"points": 90,
"price": 45.0,
"variety": "Red Blend",
"winery": "Le Buche"
},
{
"id": 9343,
"country": "Italy",
"province": "Tuscany",
"title": "Poggio Mandorlo 2008 Red (Toscana)",
"description": "Made from Merlot and Cabernet Franc, this structured red offers aromas of black currant, toast, graphite and a whiff of cedar. The firm palate offers coconut, coffee, grilled sage and red berry alongside bracing tannins. Drink sooner rather than later to capture the fruit richness.",
"points": 89,
"price": 60.0,
"variety": "Red Blend",
"winery": "Poggio Mandorlo"
}
]
```

Not bad! This example correctly returns some highly rated Tuscan red wines from Italy along with their prices. More specific search queries, such as those for low/high acidity or particular flavour profiles, can also be entered to get more relevant results by country.
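
Under the hood, the backend combines a vector search with a SQL-style `where` filter. A hedged sketch of the equivalent LanceDB call (column names taken from the response above; the actual query construction lives in `api/routers/rest.py`):

```py
import lancedb
from sentence_transformers import SentenceTransformer

db = lancedb.connect("./lancedb")
table = db.open_table("wines")
model = SentenceTransformer("sentence-transformers/multi-qa-MiniLM-L6-cos-v1")

results = (
    table.search(model.encode("tuscany red"))
    .where("country = 'Italy' AND price <= 100.0")  # SQL-style filter
    .limit(5)
    .to_df()
)
```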

## Step 4: Extend the API

The API can be easily extended with the provided structure.

- The `schemas` directory houses the Pydantic schemas, both for the data input as well as for the endpoint outputs
  - As the data model gets more complex, we can add more files and separate the ingestion logic from the API logic here
- The `api/routers` directory contains the endpoint routes, so that we can provide additional endpoints that answer more business questions (see the sketch after this list)
  - e.g., "What are the top rated wines from Argentina?"
  - In general, it makes sense to organize specific business use cases into their own router files
- The `api/main.py` file collects all the routes and schemas to run the API
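
For instance, a hypothetical router for the Argentina question might look like the sketch below (the file name, route, and parameter names are all illustrative); it would then be registered in `api/main.py` via `app.include_router`.

```py
# api/routers/country.py -- hypothetical example router
from fastapi import APIRouter, Request

router = APIRouter()


@router.get("/top_by_country")
def top_rated_by_country(request: Request, country: str, limit: int = 5):
    """Return the highest-rated wines for a given country."""
    table = request.app.table  # LanceDB table attached in the lifespan handler
    # Simplified: pull the table into pandas and filter/sort in memory
    df = table.to_pandas()
    top = df[df["country"] == country].nlargest(limit, "points")
    return top[["title", "country", "points", "price"]].to_dict(orient="records")
```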


### Existing endpoints

The following search and count endpoints are implemented and can be accessed via the API.

```
GET
/wine/search
Search By Similarity


GET
/wine/search_by_country
Search By Similarity And Country


GET
/wine/search_by_filters
Search By Similarity And Filters


GET
/wine/count_by_country
Count By Country


GET
/wine/count_by_filters
Count By Filters
```
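
These can be exercised from Python as well; for example (the query parameter names are assumptions based on the endpoint names):

```py
import requests

# Count wines from a given country; parameter name is an assumption
response = requests.get(
    "http://localhost:8000/wine/count_by_country",
    params={"country": "Italy"},
)
print(response.json())
```
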
Empty file added dbs/lancedb/api/__init__.py
13 changes: 13 additions & 0 deletions dbs/lancedb/api/config.py
@@ -0,0 +1,13 @@
from pydantic_settings import BaseSettings, SettingsConfigDict


class Settings(BaseSettings):
    model_config = SettingsConfigDict(
        env_file=".env",
        extra="allow",
    )

    lancedb_dir: str
    api_port: str
    embedding_model_checkpoint: str
    tag: str
55 changes: 55 additions & 0 deletions dbs/lancedb/api/main.py
@@ -0,0 +1,55 @@
from collections.abc import AsyncGenerator
from contextlib import asynccontextmanager
from functools import lru_cache

import lancedb
from fastapi import FastAPI
from sentence_transformers import SentenceTransformer

from api.config import Settings
from api.routers.rest import router

model_type = "sbert"


@lru_cache()
def get_settings():
    # Use lru_cache to avoid loading .env file for every request
    return Settings()


@asynccontextmanager
async def lifespan(app: FastAPI) -> AsyncGenerator[None, None]:
    """Async context manager for lancedb connection."""
    settings = get_settings()
    model_checkpoint = settings.embedding_model_checkpoint
    app.model = SentenceTransformer(model_checkpoint)
    app.model_type = "sbert"
    # Define LanceDB client
    db = lancedb.connect("./lancedb")
    app.table = db.open_table("wines")
    print("Successfully connected to LanceDB")
    yield
    print("Successfully closed LanceDB connection and released resources")


app = FastAPI(
    title="REST API for wine reviews on LanceDB",
    description=(
        "Query from a LanceDB database of 130k wine reviews from the Wine Enthusiast magazine"
    ),
    version=get_settings().tag,
    lifespan=lifespan,
)


@app.get("/", include_in_schema=False)
async def root():
    return {
        "message": "REST API for querying LanceDB database of 130k wine reviews from the Wine Enthusiast magazine"
    }


# Attach routes
app.include_router(router, prefix="/wine", tags=["wine"])