update main github readme #508
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -10,11 +10,9 @@ | |
<a href="https://discord.gg/zMM32dvNtd">Discord</a> • | ||
<a href="https://twitter.com/etodotai">Twitter</a> | ||
|
||
![CI](https://github.com/eto-ai/lance/actions/workflows/cpp.yml/badge.svg) | ||
![CI](https://github.com/eto-ai/lance/actions/workflows/rust.yml/badge.svg) | ||
[![Docs](https://img.shields.io/badge/docs-passing-brightgreen)](https://eto-ai.github.io/lance/) | ||
![Python versions](https://img.shields.io/pypi/pyversions/pylance) | ||
|
||
<img width="600" alt="Lance Basic Query Visualizing Misclassifications" src="https://user-images.githubusercontent.com/917119/199368681-7985c183-5f5e-4327-9561-77f679767bfa.png"> | ||
|
||
</p> | ||
</div> | ||
|
@@ -25,73 +23,133 @@ Lance makes machine learning workflows with ML data easy (images, videos, point | |
|
||
* Use SQL to greatly simplify common operations on ML data, such as similarity search for data discovery, model inference and computing evaluation metrics. | ||
|
||
* Search for nearest neighbors in under 1 millisecond. | ||
|
||
* Version, compare and diff ML datasets easily. | ||
|
||
* (Coming soon) visualize, slice and drill-into datasets to inspect embeddings, labels/annotations and metrics. | ||
|
||
Lance is powered by Lance Format, an Apache-Arrow compatible columnar data format which is an alternative to Parquet, Iceberg and Delta. Lance has 50-100x faster query performance for ML data. | ||
|
||
|
||
## Quick Start | ||
|
||
``` | ||
**Installation** | ||
|
||
```shell | ||
pip install pylance | ||
``` | ||
|
||
Thanks to its Apache Arrow-first APIs, `lance` can be used as a native `Arrow` extension. | ||
For example, it enables users to directly use `DuckDB` to analyze lance dataset | ||
via [DuckDB's Arrow integration](https://duckdb.org/docs/guides/python/sql_on_arrow). | ||
**Converting to Lance** | ||
|
||
```python | ||
# pip install pylance duckdb | ||
import lance | ||
Review comment: We have the import before installation, require |
||
import duckdb | ||
|
||
# Understand Label distribution of Oxford Pet Dataset | ||
ds = lance.dataset("s3://eto-public/datasets/oxford_pet/oxford_pet.lance") | ||
duckdb.query('select class, count(1) from ds group by 1').to_arrow_table() | ||
import pandas as pd | ||
import pyarrow as pa | ||
import pyarrow.dataset | ||
|
||
df = pd.DataFrame({"a": [5], "b": [10]}) | ||
uri = "/tmp/test.parquet" | ||
tbl = pa.Table.from_pandas(df) | ||
pa.dataset.write_dataset(tbl, uri, format='parquet') | ||
|
||
parquet = pa.dataset.dataset(uri, format='parquet') | ||
lance.write_dataset(parquet, "/tmp/test.lance") | ||
``` | ||
|
||
You can easily import a DataFrame or a Parquet file to Lance using Apache Arrow-first APIs: | ||
**Reading Lance data** | ||
```python | ||
dataset = lance.dataset("/tmp/test.lance") | ||
assert isinstance(dataset, pa.dataset.Dataset) | ||
``` | ||
|
||
**Pandas** | ||
```python | ||
import pyarrow as pa | ||
df = dataset.to_table().to_pandas() | ||
df | ||
``` | ||
|
||
# Import a pandas DataFrame to Lance | ||
tbl = pa.Table.from_pandas(my_dataframe) | ||
lance.write_dataset(tbl, '/tmp/my_dataframe.lance') | ||
**DuckDB** | ||
```python | ||
import duckdb | ||
|
||
# Import a Parquet file to Lance | ||
parquet_dataset = ds.dataset('/tmp/hello.parquet') | ||
lance.write_dataset(parquet_dataset, '/tmp/hello.lance') | ||
tbl = dataset.to_table() # next release of duckdb will have pushdowns enabled | ||
duckdb.query("SELECT * FROM tbl LIMIT 10").to_df() | ||
``` | ||
|
||
For more details, read our [documentation](https://eto-ai.github.io/lance/notebooks/02_creating_lance_datasets.html). | ||
**Vector search** | ||
|
||
## Important directories | ||
Download an indexed [sift dataset](https://eto-public.s3.us-west-2.amazonaws.com/datasets/sift/sift_ivf256_pq16.tar.gz), | ||
Review comment: Could this be moved to wget/curl + unzip via shell? Or do you prefer this for general OS support, Windows included? |
||
and unzip it into `vec_data.lance` | ||
|
||
| Directory | Description | | ||
|--------------------------------------------|----------------------------------------| | ||
| [cpp](./cpp) | Core Lance Format | | ||
| [python](./python) | Python SDK (Pylance) | | ||
| [notebooks](./python/notebooks) | Jupyter Notebooks | | ||
| [duckdb extension](./integration/duckdb) | Lance Duckdb extension | | ||
```shell | ||
wget https://eto-public.s3.us-west-2.amazonaws.com/datasets/sift/sift_ivf256_pq16.tar.gz | ||
tar -xzf sift_ivf256_pq16.tar.gz | ||
``` | ||
|
||
```python | ||
# Get top 10 similar vectors | ||
import lance | ||
import duckdb | ||
import numpy as np | ||
|
||
uri = "vec_data.lance" | ||
dataset = lance.dataset(uri) | ||
|
||
# Sample 100 query vectors | ||
tbl = dataset.to_table() | ||
sample = duckdb.query("SELECT vector FROM tbl USING SAMPLE 100").to_df() | ||
query_vectors = np.array([np.array(x) for x in sample.vector]) | ||
|
||
# Get nearest neighbors for all of them | ||
rs = [dataset.to_table(nearest={"column": "vector", | ||
"k": 10, | ||
"q": query_vectors[i, :]}) | ||
for i in range(query_vectors.shape[0])] | ||
``` | ||
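The loop above issues one nearest-neighbor query per sampled vector, but it does not measure anything. A minimal sketch of how per-query latency could be recorded is shown here. It is written in pure Python so it runs without the dataset: `brute_force_knn` and `time_queries` are hypothetical helper names, and the brute-force search is only a stand-in for the Lance call, not Lance's index.

```python
import random
import statistics
import time


def brute_force_knn(data, q, k):
    """Return indices of the k rows of `data` closest to `q` (squared L2 distance)."""
    dists = [(sum((a - b) ** 2 for a, b in zip(row, q)), i)
             for i, row in enumerate(data)]
    dists.sort()
    return [i for _, i in dists[:k]]


def time_queries(run_query, queries):
    """Call `run_query` once per query; return each call's latency in milliseconds."""
    latencies_ms = []
    for q in queries:
        start = time.perf_counter()
        run_query(q)
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return latencies_ms


# Synthetic stand-in data (assumption: 500 vectors of 8 dimensions, not SIFT).
random.seed(0)
data = [[random.random() for _ in range(8)] for _ in range(500)]
queries = random.sample(data, 20)

latencies = time_queries(lambda q: brute_force_knn(data, q, k=10), queries)
print(f"mean: {statistics.mean(latencies):.3f} ms, max: {max(latencies):.3f} ms")
```

To time the README example instead, `run_query` could wrap the call already shown above, for instance `lambda q: dataset.to_table(nearest={"column": "vector", "k": 10, "q": q})`.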
|
||
## Directory structure | ||
|
||
| Directory | Description | | ||
|--------------------|--------------------------| | ||
| [rust](./rust) | Core Rust implementation | | ||
| [python](./python) | Python bindings (pyo3) | | ||
| [docs](./docs) | Documentation source | | ||
|
||
## What makes Lance different | ||
|
||
Here we will highlight a few aspects of Lance’s design. For more details, see the full [Lance design document](https://docs.google.com/document/d/1kknVcqRK65YqGkKASuQ40apr2A2DyK0Qtx5nhCPCdqQ/edit). | ||
Here we will highlight a few aspects of Lance’s design. For more details, see the full [Lance design document](https://eto-ai.github.io/lance/format.html). | ||
|
||
**Vector index**: an index for similarity search over embedding space. | ||
|
||
**Encodings**: to achieve both fast columnar scan and sub-linear point queries, Lance uses custom encodings and layouts. | ||
|
||
**Nested fields**: Lance stores each subfield as a separate column to support efficient filters like “find images where detected objects include cats”. | ||
|
||
**Versioning / updates** (ROADMAP): a Manifest can be used to record snapshots. Updates are supported via write-ahead logs. | ||
**Versioning**: a Manifest can be used to record snapshots. Currently, new versions are created automatically by appends, overwrites, and index creation. | ||
|
||
**Secondary Indices** (ROADMAP): | ||
- Vector index for similarity search over embedding space | ||
**Fast updates** (ROADMAP): Updates will be supported via write-ahead logs. | ||
|
||
**Rich secondary indices** (ROADMAP): | ||
- Inverted index for fuzzy search over many label / annotation fields | ||
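As an illustration of the "Nested fields" point above, the following sketch flattens nested records into dotted column paths so that a filter predicate reads only one subfield. This shows the general idea only and is not Lance's on-disk layout or API; `flatten` and the sample rows are hypothetical.

```python
def flatten(record, prefix=""):
    """Flatten nested dicts into one dict keyed by dotted column paths."""
    columns = {}
    for key, value in record.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            columns.update(flatten(value, path))
        else:
            columns[path] = value
    return columns


# Hypothetical rows: each image has a struct of detected objects.
rows = [
    {"image": "a.jpg", "objects": {"label": ["cat", "dog"], "score": [0.9, 0.7]}},
    {"image": "b.jpg", "objects": {"label": ["car"], "score": [0.8]}},
]
flat = [flatten(r) for r in rows]

# The predicate reads only the "objects.label" column.
with_cats = [r["image"] for r in flat if "cat" in r["objects.label"]]
print(with_cats)
```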
|
||
## Benchmarks | ||
|
||
### Vector search | ||
|
||
We benchmarked vector search on the SIFT dataset, which has 1M vectors of 128 dimensions. | ||
|
||
1. For 100 randomly sampled query vectors, we get <1ms average response time (on a 2023 M2 MacBook Air). | ||
|
||
![avg_latency.png](docs/avg_latency.png) | ||
|
||
2. ANN search always trades recall against latency. | ||
|
||
![recall_vs_latency.png](docs/recall_vs_latency.png) | ||
|
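For reference, recall in the trade-off above is usually measured against exact (brute-force) neighbors. The sketch below shows one common definition, recall@k. `recall_at_k` is a hypothetical helper and the ID lists are made-up values, not benchmark results.

```python
def recall_at_k(approx_ids, exact_ids):
    """Fraction of the true nearest neighbors that the approximate result recovered."""
    exact = set(exact_ids)
    if not exact:
        raise ValueError("exact_ids must not be empty")
    return len(exact & set(approx_ids)) / len(exact)


# Made-up neighbor IDs for a single query with k=5.
exact = [4, 8, 15, 16, 23]
approx = [4, 8, 15, 42, 99]
print(recall_at_k(approx, exact))
```

Averaging this value over all query vectors gives one point on a recall-versus-latency curve; repeating with different index settings traces out the rest.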
||
### Vs Parquet | ||
|
||
We created a Lance dataset from the Oxford Pet dataset for preliminary performance testing against Parquet and raw image/XML files. For analytics queries, Lance is 50-100x faster than reading the raw metadata. For batched random access, Lance is 100x faster than both Parquet and raw files. | ||
|
||
![](docs/lance_perf.png) | ||
|
@@ -138,6 +196,4 @@ A comparison of different data formats in each stage of ML development cycle. | |
|
||
## Presentations and Talks | ||
|
||
* [Lance: A New Columnar Data Format](https://docs.google.com/presentation/d/1a4nAiQAkPDBtOfXFpPg7lbeDAxcNDVKgoUkw3cUs2rE/edit#slide=id.p) | ||
. | ||
[Scipy 2022, Austin, TX](https://www.scipy2022.scipy.org/posters). July, 2022. | ||
* [Lance: A New Columnar Data Format](https://docs.google.com/presentation/d/1a4nAiQAkPDBtOfXFpPg7lbeDAxcNDVKgoUkw3cUs2rE/edit#slide=id.p), [Scipy 2022, Austin, TX](https://www.scipy2022.scipy.org/posters). July, 2022. |
Review comment: The example does not make clear that per-query latency is under 1 millisecond.