
update main github readme #508

Merged · 3 commits · Feb 2, 2023

126 changes: 91 additions & 35 deletions README.md
@@ -10,11 +10,9 @@
<a href="https://discord.gg/zMM32dvNtd">Discord</a> •
<a href="https://twitter.com/etodotai">Twitter</a>

![CI](https://github.com/eto-ai/lance/actions/workflows/cpp.yml/badge.svg)
![CI](https://github.com/eto-ai/lance/actions/workflows/rust.yml/badge.svg)
[![Docs](https://img.shields.io/badge/docs-passing-brightgreen)](https://eto-ai.github.io/lance/)
![Python versions](https://img.shields.io/pypi/pyversions/pylance)

<img width="600" alt="Lance Basic Query Visualizing Misclassifications" src="https://user-images.githubusercontent.com/917119/199368681-7985c183-5f5e-4327-9561-77f679767bfa.png">

</p>
</div>
@@ -25,73 +23,133 @@ Lance makes machine learning workflows with ML data easy (images, videos, point

* Use SQL to greatly simplify common operations on ML data, such as similarity search for data discovery, model inference and computing evaluation metrics.

* Search for nearest neighbors in under 1 millisecond.
> **Review comment (Contributor):** The example does not make clear that per-query latency is under 1 millisecond.


* Version, compare and diff ML datasets easily.

* (Coming soon) visualize, slice and drill-into datasets to inspect embeddings, labels/annotations and metrics.

Lance is powered by Lance Format, an Apache Arrow-compatible columnar data format that is an alternative to Parquet, Iceberg and Delta, with 50-100x faster query performance for ML data.


## Quick Start

**Installation**

```shell
pip install pylance
```

Thanks to its Apache Arrow-first APIs, `lance` can be used as a native `Arrow` extension.
For example, it enables users to directly use `DuckDB` to analyze Lance datasets
via [DuckDB's Arrow integration](https://duckdb.org/docs/guides/python/sql_on_arrow).

```python
# pip install pylance duckdb
import lance
import duckdb

# Understand Label distribution of Oxford Pet Dataset
ds = lance.dataset("s3://eto-public/datasets/oxford_pet/oxford_pet.lance")
duckdb.query('select class, count(1) from ds group by 1').to_arrow_table()
```

> **Review comment (Contributor):** We have the import before installation, require `pip install pylance` above this.

**Converting to Lance**

```python
import lance

import pandas as pd
import pyarrow as pa
import pyarrow.dataset

df = pd.DataFrame({"a": [5], "b": [10]})
uri = "/tmp/test.parquet"
tbl = pa.Table.from_pandas(df)
pa.dataset.write_dataset(tbl, uri, format='parquet')

parquet = pa.dataset.dataset(uri, format='parquet')
lance.write_dataset(parquet, "/tmp/test.lance")
```

You can easily import a DataFrame or a Parquet file to Lance using Apache Arrow-first APIs:

```python
import lance
import pyarrow as pa
import pyarrow.dataset as ds

# Import a pandas DataFrame to Lance
tbl = pa.Table.from_pandas(my_dataframe)
lance.write_dataset(tbl, '/tmp/my_dataframe.lance')

# Import a Parquet file to Lance
parquet_dataset = ds.dataset('/tmp/hello.parquet')
lance.write_dataset(parquet_dataset, '/tmp/hello.lance')
```

**Reading Lance data**

```python
dataset = lance.dataset("/tmp/test.lance")
assert isinstance(dataset, pa.dataset.Dataset)
```

**Pandas**

```python
df = dataset.to_table().to_pandas()
df
```

**DuckDB**

```python
import duckdb

tbl = dataset.to_table()  # next release of duckdb will have pushdowns enabled
duckdb.query("SELECT * FROM tbl LIMIT 10").to_df()
```

For more details, read our [documentation](https://eto-ai.github.io/lance/notebooks/02_creating_lance_datasets.html).

**Vector search**

Download an indexed [sift dataset](https://eto-public.s3.us-west-2.amazonaws.com/datasets/sift/sift_ivf256_pq16.tar.gz), and unzip it into `vec_data.lance`:

> **Review comment (Contributor):** Could this be moved to wget/curl + unzip via shell? Or do you prefer this for general OS support, Windows included?

```shell
wget https://eto-public.s3.us-west-2.amazonaws.com/datasets/sift/sift_ivf256_pq16.tar.gz
tar -xzf sift_ivf256_pq16.tar.gz
```

```python
# Get top 10 similar vectors
import lance
import duckdb
import numpy as np

uri = "vec_data.lance"
dataset = lance.dataset(uri)

# Sample 100 query vectors
tbl = dataset.to_table()
sample = duckdb.query("SELECT vector FROM tbl USING SAMPLE 100").to_df()
query_vectors = np.array([np.array(x) for x in sample.vector])

# Get nearest neighbors for all of them
rs = [dataset.to_table(nearest={"column": "vector",
                                "k": 10,
                                "q": query_vectors[i, :]})
      for i in range(query_vectors.shape[0])]
```

## Important directories

| Directory                                 | Description            |
|-------------------------------------------|------------------------|
| [cpp](./cpp)                              | Core Lance Format      |
| [python](./python)                        | Python SDK (Pylance)   |
| [notebooks](./python/notebooks)           | Jupyter Notebooks      |
| [duckdb extension](./integration/duckdb)  | Lance DuckDB extension |

## Directory structure

| Directory | Description |
|--------------------|--------------------------|
| [rust](./rust) | Core Rust implementation |
| [python](./python) | Python bindings (pyo3) |
| [docs](./docs) | Documentation source |

## What makes Lance different

Here we will highlight a few aspects of Lance’s design. For more details, see the full [Lance design document](https://eto-ai.github.io/lance/format.html).

**Vector index**: similarity search over the embedding space.
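
A minimal sketch of what building such an index can look like from Python; the `create_index` call and its IVF_PQ parameters reflect the pylance API of later releases and are an assumption here, not something this PR adds:

```python
# Sketch: build an IVF_PQ vector index on an embedding column.
# Assumes a pylance release that exposes LanceDataset.create_index;
# the dataset path is a placeholder.
import lance

dataset = lance.dataset("/tmp/embeddings.lance")
dataset.create_index(
    "vector",              # column of FixedSizeList<Float32> embeddings
    index_type="IVF_PQ",   # inverted-file index with product quantization
    num_partitions=256,    # number of IVF partitions
    num_sub_vectors=16,    # number of PQ sub-vectors per embedding
)
```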

**Encodings**: To achieve both fast columnar scans and sub-linear point queries, Lance uses custom encodings and layouts.

**Nested fields**: Lance stores each subfield as a separate column to support efficient filters like “find images where detected objects include cats”.
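
To make that concrete, here is a hedged sketch of such a filter run through DuckDB over the Arrow-compatible dataset; the `image_id`/`objects` schema is hypothetical and not part of this repo:

```python
# Hypothetical schema: each row has an image_id plus a list<struct> column
# "objects" whose structs carry fields such as name and confidence.
import duckdb
import lance

ds = lance.dataset("/tmp/annotations.lance")  # hypothetical dataset
tbl = ds.to_table()

# Find images where detected objects include cats.
duckdb.query("""
    SELECT DISTINCT image_id
    FROM (SELECT image_id, UNNEST(objects) AS obj FROM tbl)
    WHERE obj.name = 'cat'
""").to_df()
```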

**Versioning**: a Manifest can be used to record snapshots. Currently we support creating new versions automatically via appends, overwrites, and index creation.
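
A minimal sketch of working with those snapshots, assuming the `versions()` helper and the `version` argument exposed by pylance:

```python
import lance

dataset = lance.dataset("/tmp/test.lance")
print(dataset.version)     # version currently loaded
print(dataset.versions())  # snapshots recorded in the manifest

# Time travel: load the dataset as of an earlier snapshot.
v1 = lance.dataset("/tmp/test.lance", version=1)
```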

**Fast updates** (ROADMAP): Updates will be supported via write-ahead logs.

**Rich secondary indices** (ROADMAP):
- Inverted index for fuzzy search over many label / annotation fields

## Benchmarks

### Vector search

We used the sift dataset (1M vectors of 128 dimensions) to benchmark our results.

1. For 100 randomly sampled query vectors, we get <1ms average response time (on a 2023 M2 MacBook Air).

![avg_latency.png](docs/avg_latency.png)

2. ANN is always a trade-off between recall and performance (see the sketch after the plot below).

![recall_vs_latency.png](docs/recall_vs_latency.png)
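
As a sketch of how that trade-off can be exercised from Python: the nearest-neighbor query accepts tuning knobs. The `nprobes` and `refine_factor` options are assumptions based on the pylance search API, not the parameters used to produce the plot above:

```python
import lance
import numpy as np

dataset = lance.dataset("vec_data.lance")
q = np.random.randn(128)  # random query vector for illustration

# Probing more IVF partitions and re-ranking more candidates raises recall
# at the cost of latency.
fast = dataset.to_table(nearest={"column": "vector", "q": q, "k": 10,
                                 "nprobes": 1})
accurate = dataset.to_table(nearest={"column": "vector", "q": q, "k": 10,
                                     "nprobes": 20, "refine_factor": 5})
```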

### Vs. Parquet

We create a Lance dataset using the Oxford Pet dataset to do some preliminary performance testing of Lance as compared to Parquet and raw images/XMLs. For analytics queries, Lance is 50-100x better than reading the raw metadata. For batched random access, Lance is 100x better than both Parquet and raw files.

![](docs/lance_perf.png)
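
The plot reflects the project's own benchmark; a rough sketch of that style of comparison (the paths and the column name are placeholders, not the original benchmark code) could look like this:

```python
import time

import lance
import pyarrow.dataset as ds

def scan_seconds(dataset, columns):
    # Time a simple projection scan over the given columns.
    start = time.perf_counter()
    dataset.to_table(columns=columns)
    return time.perf_counter() - start

lance_ds = lance.dataset("/tmp/oxford_pet.lance")                    # placeholder path
parquet_ds = ds.dataset("/tmp/oxford_pet_parquet", format="parquet") # placeholder path

# Analytics-style projection over a single metadata column.
print("lance:  ", scan_seconds(lance_ds, ["class"]))
print("parquet:", scan_seconds(parquet_ds, ["class"]))
```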
@@ -138,6 +196,4 @@ A comparison of different data formats in each stage of ML development cycle.

## Presentations and Talks

* [Lance: A New Columnar Data Format](https://docs.google.com/presentation/d/1a4nAiQAkPDBtOfXFpPg7lbeDAxcNDVKgoUkw3cUs2rE/edit#slide=id.p), [Scipy 2022, Austin, TX](https://www.scipy2022.scipy.org/posters). July, 2022.
Binary file added docs/avg_latency.png
Binary file added docs/recall_vs_latency.png
35 changes: 26 additions & 9 deletions python/README.md
@@ -29,10 +29,15 @@ pip install pylance
```python
import lance

import pandas as pd
import pyarrow as pa
import pyarrow.dataset

df = pd.DataFrame({"a": [5], "b": [10]})
uri = "/tmp/test.parquet"
tbl = pa.Table.from_pandas(df)
pa.dataset.write_dataset(tbl, uri, format='parquet')

parquet = pa.dataset.dataset(uri, format='parquet')
lance.write_dataset(parquet, "/tmp/test.lance")
```
@@ -58,19 +63,31 @@ duckdb.query("SELECT * FROM tbl LIMIT 10").to_df()

**Vector search**

Download an indexed [sift dataset](https://eto-public.s3.us-west-2.amazonaws.com/datasets/sift/sift_ivf256_pq16.tar.gz),
and unzip it into `vec_data.lance`
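
If you prefer to stay in Python for the download and extraction, here is a stdlib-only sketch; it assumes the archive unpacks into `vec_data.lance`:

```python
# Download and unpack the indexed sift dataset using only the standard library.
import tarfile
import urllib.request

url = "https://eto-public.s3.us-west-2.amazonaws.com/datasets/sift/sift_ivf256_pq16.tar.gz"
archive, _ = urllib.request.urlretrieve(url, "sift_ivf256_pq16.tar.gz")

with tarfile.open(archive) as tar:
    tar.extractall()  # expected to leave vec_data.lance in the working directory
```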

```python
# Get top 10 similar vectors
import lance
import duckdb
import numpy as np

uri = "vec_data.lance"
dataset = lance.dataset(uri)

# Query with a single vector
q = np.random.randn(128)  # query vector
query = {
    "column": "vector",  # the `vector` column is a FixedSizeList of Float32
    "q": q,
    "k": 10
}
dataset.to_table(nearest=query).to_pandas()

# Sample 100 query vectors
tbl = dataset.to_table()
sample = duckdb.query("SELECT vector FROM tbl USING SAMPLE 100").to_df()
query_vectors = np.array([np.array(x) for x in sample.vector])

# Get nearest neighbors for all of them
rs = [dataset.to_table(nearest={"column": "vector",
                                "k": 10,
                                "q": query_vectors[i, :]})
      for i in range(query_vectors.shape[0])]
```

For the fast indexing capability, you can download an indexed [sift dataset](https://eto-public.s3.us-west-2.amazonaws.com/datasets/sift/sift_ivf256_pq16.tar.gz) and run the same code as above. We're working on a more convenient indexing tool via Python.

*More distance metrics, supported types, and compute integration coming.*