
update main github readme #508

Merged · 3 commits · Feb 2, 2023

126 changes: 91 additions & 35 deletions README.md
@@ -10,11 +10,9 @@
<a href="https://discord.gg/zMM32dvNtd">Discord</a> •
<a href="https://twitter.com/etodotai">Twitter</a>

![CI](https://github.com/eto-ai/lance/actions/workflows/cpp.yml/badge.svg)
![CI](https://github.com/eto-ai/lance/actions/workflows/rust.yml/badge.svg)
[![Docs](https://img.shields.io/badge/docs-passing-brightgreen)](https://eto-ai.github.io/lance/)
![Python versions](https://img.shields.io/pypi/pyversions/pylance)

<img width="600" alt="Lance Basic Query Visualizing Misclassifications" src="https://user-images.githubusercontent.com/917119/199368681-7985c183-5f5e-4327-9561-77f679767bfa.png">

</p>
</div>
@@ -25,73 +23,133 @@ Lance makes machine learning workflows with ML data easy (images, videos, point

* Use SQL to greatly simplify common operations on ML data, such as similarity search for data discovery, model inference and computing evaluation metrics.

* Search for nearest neighbors in under 1 millisecond.
> **Review comment (Contributor):** The example does not make clear that per-query latency is under 1 millisecond.


* Version, compare and diff ML datasets easily.

* (Coming soon) visualize, slice and drill-into datasets to inspect embeddings, labels/annotations and metrics.

Lance is powered by Lance Format, an Apache Arrow-compatible columnar data format that is an alternative to Parquet, Iceberg and Delta, with 50-100x faster query performance for ML data.


## Quick Start

**Installation**

```shell
pip install pylance
```

Thanks to its Apache Arrow-first APIs, `lance` can be used as a native `Arrow` extension.
For example, it enables users to directly use `DuckDB` to analyze Lance datasets
via [DuckDB's Arrow integration](https://duckdb.org/docs/guides/python/sql_on_arrow).

```python
# pip install pylance duckdb
import lance
import duckdb

# Understand Label distribution of Oxford Pet Dataset
ds = lance.dataset("s3://eto-public/datasets/oxford_pet/oxford_pet.lance")
duckdb.query('select class, count(1) from ds group by 1').to_arrow_table()
```

> **Review comment (Contributor):** We have the import before installation, require `pip install pylance` above this.

**Converting to Lance**

```python
import lance

import pandas as pd
import pyarrow as pa
import pyarrow.dataset

df = pd.DataFrame({"a": [5], "b": [10]})
uri = "/tmp/test.parquet"
tbl = pa.Table.from_pandas(df)
pa.dataset.write_dataset(tbl, uri, format='parquet')

parquet = pa.dataset.dataset(uri, format='parquet')
lance.write_dataset(parquet, "/tmp/test.lance")
```

You can easily import a DataFrame or a Parquet file to Lance using Apache Arrow-first APIs:

```python
import lance
import pyarrow as pa
import pyarrow.dataset as ds

# Import a pandas DataFrame to Lance
tbl = pa.Table.from_pandas(my_dataframe)
lance.write_dataset(tbl, '/tmp/my_dataframe.lance')

# Import a Parquet file to Lance
parquet_dataset = ds.dataset('/tmp/hello.parquet')
lance.write_dataset(parquet_dataset, '/tmp/hello.lance')
```

**Reading Lance data**

```python
dataset = lance.dataset("/tmp/test.lance")
assert isinstance(dataset, pa.dataset.Dataset)
```

**Pandas**

```python
df = dataset.to_table().to_pandas()
df
```

**DuckDB**

```python
import duckdb

tbl = dataset.to_table()  # next release of duckdb will have pushdowns enabled
duckdb.query("SELECT * FROM tbl LIMIT 10").to_df()
```

For more details, read our [documentation](https://eto-ai.github.io/lance/notebooks/02_creating_lance_datasets.html).

**Vector search**

Download an indexed [sift dataset](https://eto-public.s3.us-west-2.amazonaws.com/datasets/sift/sift_ivf256_pq16.tar.gz), and unzip it into `vec_data.lance`:

> **Review comment (Contributor):** Could this be moved to wget/curl + unzip via shell? Or do you prefer this for general OS support, Windows included?

```shell
wget https://eto-public.s3.us-west-2.amazonaws.com/datasets/sift/sift_ivf256_pq16.tar.gz
tar -xzf sift_ivf256_pq16.tar.gz
```

```python
# Get top 10 similar vectors
import lance
import duckdb
import numpy as np

uri = "vec_data.lance"
dataset = lance.dataset(uri)

# Sample 100 query vectors
tbl = dataset.to_table()
sample = duckdb.query("SELECT vector FROM tbl USING SAMPLE 100").to_df()
query_vectors = np.array([np.array(x) for x in sample.vector])

# Get nearest neighbors for all of them
rs = [dataset.to_table(nearest={"column": "vector",
                                "k": 10,
                                "q": query_vectors[i, :]})
      for i in range(query_vectors.shape[0])]
```

## Important directories

| Directory                                 | Description            |
|-------------------------------------------|------------------------|
| [cpp](./cpp)                              | Core Lance Format      |
| [python](./python)                        | Python SDK (Pylance)   |
| [notebooks](./python/notebooks)           | Jupyter Notebooks      |
| [duckdb extension](./integration/duckdb)  | Lance DuckDB extension |

## Directory structure

| Directory | Description |
|--------------------|--------------------------|
| [rust](./rust) | Core Rust implementation |
| [python](./python) | Python bindings (pyo3) |
| [docs](./docs) | Documentation source |

## What makes Lance different

Here we will highlight a few aspects of Lance’s design. For more details, see the full [Lance design document](https://eto-ai.github.io/lance/format.html).

**Vector index**: similarity search over the embedding space.
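
A minimal sketch of what building such an index can look like from Python; the `create_index` call and its IVF_PQ parameters reflect the pylance API of later releases and are an assumption here, not something this PR adds:

```python
# Sketch: build an IVF_PQ vector index on an embedding column.
# Assumes a pylance release that exposes LanceDataset.create_index;
# the dataset path is a placeholder.
import lance

dataset = lance.dataset("/tmp/embeddings.lance")
dataset.create_index(
    "vector",              # column of FixedSizeList<Float32> embeddings
    index_type="IVF_PQ",   # inverted-file index with product quantization
    num_partitions=256,    # number of IVF partitions
    num_sub_vectors=16,    # number of PQ sub-vectors per embedding
)
```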

**Encodings**: To achieve both fast columnar scans and sub-linear point queries, Lance uses custom encodings and layouts.

**Nested fields**: Lance stores each subfield as a separate column to support efficient filters like “find images where detected objects include cats”.
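
To make that concrete, here is a hedged sketch of such a filter run through DuckDB over the Arrow-compatible dataset; the `image_id`/`objects` schema is hypothetical and not part of this repo:

```python
# Hypothetical schema: each row has an image_id plus a list<struct> column
# "objects" whose structs carry fields such as name and confidence.
import duckdb
import lance

ds = lance.dataset("/tmp/annotations.lance")  # hypothetical dataset
tbl = ds.to_table()

# Find images where detected objects include cats.
duckdb.query("""
    SELECT DISTINCT image_id
    FROM (SELECT image_id, UNNEST(objects) AS obj FROM tbl)
    WHERE obj.name = 'cat'
""").to_df()
```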

**Versioning**: a Manifest can be used to record snapshots. Currently we support creating new versions automatically via appends, overwrites, and index creation.
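
A minimal sketch of working with those snapshots, assuming the `versions()` helper and the `version` argument exposed by pylance:

```python
import lance

dataset = lance.dataset("/tmp/test.lance")
print(dataset.version)     # version currently loaded
print(dataset.versions())  # snapshots recorded in the manifest

# Time travel: load the dataset as of an earlier snapshot.
v1 = lance.dataset("/tmp/test.lance", version=1)
```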

**Fast updates** (ROADMAP): Updates will be supported via write-ahead logs.

**Rich secondary indices** (ROADMAP):
- Inverted index for fuzzy search over many label / annotation fields

## Benchmarks

### Vector search

We used the sift dataset (1M vectors of 128 dimensions) to benchmark our results.

1. For 100 randomly sampled query vectors, we get <1ms average response time (on a 2023 M2 MacBook Air).

![avg_latency.png](docs/avg_latency.png)

2. ANN is always a trade-off between recall and performance (see the sketch after the plot below).

![recall_vs_latency.png](docs/recall_vs_latency.png)
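
As a sketch of how that trade-off can be exercised from Python: the nearest-neighbor query accepts tuning knobs. The `nprobes` and `refine_factor` options are assumptions based on the pylance search API, not the parameters used to produce the plot above:

```python
import lance
import numpy as np

dataset = lance.dataset("vec_data.lance")
q = np.random.randn(128)  # random query vector for illustration

# Probing more IVF partitions and re-ranking more candidates raises recall
# at the cost of latency.
fast = dataset.to_table(nearest={"column": "vector", "q": q, "k": 10,
                                 "nprobes": 1})
accurate = dataset.to_table(nearest={"column": "vector", "q": q, "k": 10,
                                     "nprobes": 20, "refine_factor": 5})
```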

### Vs. Parquet

We create a Lance dataset using the Oxford Pet dataset to do some preliminary performance testing of Lance as compared to Parquet and raw images/XMLs. For analytics queries, Lance is 50-100x better than reading the raw metadata. For batched random access, Lance is 100x better than both Parquet and raw files.

![](docs/lance_perf.png)
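
The plot reflects the project's own benchmark; a rough sketch of that style of comparison (the paths and the column name are placeholders, not the original benchmark code) could look like this:

```python
import time

import lance
import pyarrow.dataset as ds

def scan_seconds(dataset, columns):
    # Time a simple projection scan over the given columns.
    start = time.perf_counter()
    dataset.to_table(columns=columns)
    return time.perf_counter() - start

lance_ds = lance.dataset("/tmp/oxford_pet.lance")                    # placeholder path
parquet_ds = ds.dataset("/tmp/oxford_pet_parquet", format="parquet") # placeholder path

# Analytics-style projection over a single metadata column.
print("lance:  ", scan_seconds(lance_ds, ["class"]))
print("parquet:", scan_seconds(parquet_ds, ["class"]))
```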
@@ -138,6 +196,4 @@ A comparison of different data formats in each stage of ML development cycle.

## Presentations and Talks

* [Lance: A New Columnar Data Format](https://docs.google.com/presentation/d/1a4nAiQAkPDBtOfXFpPg7lbeDAxcNDVKgoUkw3cUs2rE/edit#slide=id.p), [Scipy 2022, Austin, TX](https://www.scipy2022.scipy.org/posters). July, 2022.
Binary file added docs/avg_latency.png
Binary file added docs/recall_vs_latency.png
35 changes: 26 additions & 9 deletions python/README.md
@@ -29,10 +29,15 @@ pip install pylance
```python
import lance

import pandas as pd
import pyarrow as pa
import pyarrow.dataset

df = pd.DataFrame({"a": [5], "b": [10]})
uri = "/tmp/test.parquet"
tbl = pa.Table.from_pandas(df)
pa.dataset.write_dataset(tbl, uri, format='parquet')

parquet = pa.dataset.dataset(uri, format='parquet')
lance.write_dataset(parquet, "/tmp/test.lance")
```
@@ -58,19 +63,31 @@ duckdb.query("SELECT * FROM tbl LIMIT 10").to_df()

**Vector search**

Download an indexed [sift dataset](https://eto-public.s3.us-west-2.amazonaws.com/datasets/sift/sift_ivf256_pq16.tar.gz),
and unzip it into `vec_data.lance`
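
If you prefer to stay in Python for the download and extraction, here is a stdlib-only sketch; it assumes the archive unpacks into `vec_data.lance`:

```python
# Download and unpack the indexed sift dataset using only the standard library.
import tarfile
import urllib.request

url = "https://eto-public.s3.us-west-2.amazonaws.com/datasets/sift/sift_ivf256_pq16.tar.gz"
archive, _ = urllib.request.urlretrieve(url, "sift_ivf256_pq16.tar.gz")

with tarfile.open(archive) as tar:
    tar.extractall()  # expected to leave vec_data.lance in the working directory
```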

```python
# Get top 10 similar vectors
import lance
import duckdb
import numpy as np

uri = "vec_data.lance"
dataset = lance.dataset(uri)

# Query with a single vector
q = np.random.randn(128)  # query vector
query = {
    "column": "vector",  # the `vector` column is a FixedSizeList of Float32
    "q": q,
    "k": 10
}
dataset.to_table(nearest=query).to_pandas()

# Sample 100 query vectors
tbl = dataset.to_table()
sample = duckdb.query("SELECT vector FROM tbl USING SAMPLE 100").to_df()
query_vectors = np.array([np.array(x) for x in sample.vector])

# Get nearest neighbors for all of them
rs = [dataset.to_table(nearest={"column": "vector",
                                "k": 10,
                                "q": query_vectors[i, :]})
      for i in range(query_vectors.shape[0])]
```

For the fast indexing capability, you can download an indexed [sift dataset](https://eto-public.s3.us-west-2.amazonaws.com/datasets/sift/sift_ivf256_pq16.tar.gz) and run the same code as above. We're working on a more convenient indexing tool via Python.

*More distance metrics, supported types, and compute integration coming.*