Releases: DeployQL/LintDB
v0.5.1
This release adds a server implementation. The server can search, add, update, and remove documents. Indices must be created and trained from Python before use with the server.
What's Changed
Full Changelog: v0.5.0...v0.5.1
v0.5.0
v0.5.0 has major breaking changes
breaking changes
- The Python API has changed drastically.
- Collections have been removed.
We introduce a schema to the database. We can index, store, and filter by different data types, and we can compose different queries and scoring strategies.
What problem does this solve?
ColBERT and other heavyweight retrieval mechanisms can be slow because there are more embeddings to compare per document. This makes it necessary to filter documents or iteratively reduce the number of documents scored.
How did we solve it?
Schemas enable more flexible queries. Filtering becomes an option, and we can choose to score documents based on each matched element.
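As a rough illustration (plain Python with NumPy, not LintDB's actual API), filtering first means the expensive multi-vector comparison only runs on documents that survive the filter:

```python
import numpy as np

def maxsim(query_tokens: np.ndarray, doc_tokens: np.ndarray) -> float:
    """ColBERT-style late interaction: for each query token, take the max
    similarity over document tokens, then sum."""
    sims = query_tokens @ doc_tokens.T  # (q_len, d_len)
    return float(sims.max(axis=1).sum())

rng = np.random.default_rng(0)
docs = [
    {"id": 0, "lang": "en", "tokens": rng.random((5, 8))},
    {"id": 1, "lang": "de", "tokens": rng.random((7, 8))},
    {"id": 2, "lang": "en", "tokens": rng.random((4, 8))},
]
query = rng.random((3, 8))

# Filter on a metadata field first; only the survivors are scored.
candidates = [d for d in docs if d["lang"] == "en"]
ranked = sorted(candidates, key=lambda d: maxsim(query, d["tokens"]), reverse=True)
print([d["id"] for d in ranked])  # only ids 0 and 2 appear
```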
DocumentProcessor
Our main new abstraction is document processing, which has been broken out from index writing. The DocumentProcessor branches on each supported data type, and we optionally quantize tensors as part of this step.
ColBERT fields are a special case. ColBERT is both indexed and contextual, in that we search the index but don't retrieve data from that field. During scoring, we scan the context field to get all token embeddings at once.
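A minimal sketch of that branching idea, with a hypothetical processor and field types (none of these names are LintDB's API):

```python
import numpy as np

def quantize_int8(tensor: np.ndarray) -> np.ndarray:
    """Crude symmetric int8 quantization, purely for illustration."""
    scale = np.abs(tensor).max() or 1.0
    return np.round(tensor / scale * 127).astype(np.int8)

def process_document(doc: dict, schema: dict) -> dict:
    """Branch on each field's declared type; quantize tensor fields
    before they would reach the index writer."""
    processed = {}
    for field, value in doc.items():
        kind = schema[field]
        if kind == "tensor":
            processed[field] = quantize_int8(np.asarray(value))
        elif kind in ("text", "integer"):
            processed[field] = value  # stored as-is
        else:
            raise ValueError(f"unsupported field type: {kind}")
    return processed

schema = {"title": "text", "year": "integer", "embedding": "tensor"}
doc = {"title": "hello", "year": 2024, "embedding": [0.5, -1.0, 0.25]}
out = process_document(doc, schema)
print(out["embedding"].dtype)  # int8
```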
Scoring
Retrievers have been generalized into scoring. It's still a WIP, but we have the concept of retrieval and ranking. Combined with different types of fields, we can think of ColBERT as indexed with contextual data and XTR as indexed only.
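To make the distinction concrete, here's a simplified NumPy sketch (not LintDB code): ColBERT-style scoring scans every document token embedding, while an XTR-style score uses only the tokens the index happened to retrieve:

```python
import numpy as np

rng = np.random.default_rng(0)
query = rng.random((4, 16))       # 4 query token embeddings
doc_tokens = rng.random((10, 16)) # 10 document token embeddings

# ColBERT: exact MaxSim over every document token (the contextual data).
colbert_score = (query @ doc_tokens.T).max(axis=1).sum()

# XTR-style: suppose the index only returned tokens 0, 3, and 7.
retrieved = doc_tokens[[0, 3, 7]]
xtr_score = (query @ retrieved.T).max(axis=1).sum()

# Max over a subset of tokens can never exceed max over all tokens.
assert xtr_score <= colbert_score
print(colbert_score, xtr_score)
```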
Collections
Collections have been removed. Collections offered an easier way to index data by passing text and having LintDB embed it automatically. However, this muddied the main focus of LintDB -- storing and retrieving. We see collections coming back as extensions within the Python library.
Python bindings
Python bindings were previously generated with SWIG. SWIG files use a custom syntax to define how C++ is bound to Python. This became difficult to maintain because some of our data objects made sense to translate into Python dictionaries, which wasn't simple to accomplish.
We've migrated from SWIG to nanobind. nanobind bindings are declared directly in C++. There are still some growing pains with this, but it's much clearer how to define, override, or rename our bindings.
Documentation
Documentation is moving to MkDocs instead of Sphinx. The main problem was versioning our documentation: Sphinx did not have a clear enough way to handle this automatically. MkDocs, however, has mike for versioning docs.
We haven't figured out all of the bugs with translating our docstrings, but fixing this seems doable.
What's Changed
- Add a schema to the database by @mtbarta in #38
- Bump version: 0.4.1 → 0.5.0 by @mtbarta in #39
- Fix conda releases by @mtbarta in #40
Full Changelog: v0.4.1...v0.5.0
v0.4.0
What's Changed
- Set configuration for index properly. run collection benchmark by @mtbarta in #24
- Add interpret method and batching in collections by @mtbarta in #26
- Add XTR Support by @mtbarta in #27
- Enable passing Python dictionaries in index.add by @mtbarta in #28
- Bump version: 0.3.1 → 0.4.0 by @mtbarta in #29
Full Changelog: v0.3.0...v0.4.0
v0.3.0
This release adds collections, which enable users to insert, search, and retrieve text.
Here's an example from testing:
```python
# Create the index and wrap it in a collection that embeds text
# with the bundled ONNX model and tokenizer.
index_one = lintdb.IndexIVF(dir_path, 32, 128, 2, 4, 16, lintdb.IndexEncoding_BINARIZER)
collection_options = lintdb.CollectionOptions()
collection_options.model_file = "assets/model.onnx"
collection_options.tokenizer_file = "assets/colbert_tokenizer.json"
collection = lintdb.Collection(index_one, collection_options)

# Train on sample text, then add a document with metadata.
collection.train(['hello world!'] * 1500)
collection.add(0, 1, "hello world!", {"key": "metadata"})

# Search the collection.
opts = lintdb.SearchOptions()
opts.n_probe = 250
results = collection.search(0, "hello world!", 10, opts)
```
Using Collections
Databases created before v0.3 will not fetch metadata on documents. To upgrade, create an empty database and merge the old database into the new one.
What's Changed
Full Changelog: v0.2.1...v0.3.0
Release 0.2.1
What's Changed
Full Changelog: v0.2.0...v0.2.1
Release 0.2.0
This release includes some refactoring.
Breaking Changes
When creating an index, a boolean previously determined whether compression should be used. This has been replaced with an enum:
- BINARIZER is the standard encoding as defined in PLAID. This is the default.
- PRODUCT_QUANTIZER uses faiss' PQ encoder and methods as defined in EMVB.
- NONE doesn't use any compression.
Here's an example of the new constructor from our indexing scripts in benchmarks.
```python
index_type_enum = ldb.IndexEncoding_BINARIZER
if index_type == "binarizer":
    index_type_enum = ldb.IndexEncoding_BINARIZER
elif index_type == 'pq':
    index_type_enum = ldb.IndexEncoding_PRODUCT_QUANTIZER
elif index_type == 'none':
    index_type_enum = ldb.IndexEncoding_NONE

index = ldb.IndexIVF(index_path, num_centroids, num_dims, nbits, k_iter_training, num_subquantizers, index_type_enum)
```
What's Changed
- add logo by @mtbarta in #6
- Create a Retriever abstraction by @mtbarta in #14
- Add CI build and testing for PRs by @mtbarta in #15
- Add Retrievers by @mtbarta in #16
- centralize version number. by @mtbarta in #19
Additionally, Linux now uses MKL. This helps avoid slowdowns caused by OpenBLAS and OpenMP. Please reach out if you notice any problems.
Full Changelog: v0.1.0...v0.2.0
Initial Release
LintDB v0.1.0
Major Features
- Multi vector support: LintDB stores multiple vectors per document id and calculates the max similarity across vectors to determine relevance.
- Bit-level Compression: LintDB fully implements PLAID's bit compression, storing 128-dimensional embeddings in as few as 32 bytes.
- Embedded: LintDB can be embedded directly into your Python application. No need to set up a separate database.
- Full Support for PLAID and ColBERT: LintDB is built around PLAID and ColBERT for efficient storage and lookup of token-level embeddings.
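The arithmetic behind the 32-byte figure: 128 dimensions at 2 bits each is 256 bits, or 32 bytes. The sketch below uses a naive uniform 2-bit code purely for illustration; PLAID's actual residual codec is more involved:

```python
import numpy as np

rng = np.random.default_rng(0)
embedding = rng.standard_normal(128).astype(np.float32)  # 512 bytes raw

# Map each value to one of 4 buckets (2 bits) by quartile.
edges = np.quantile(embedding, [0.25, 0.5, 0.75])
codes = np.digitize(embedding, edges).astype(np.uint8)  # values in 0..3

# Pack the two-bit codes into bytes: 4 codes per byte.
bits = ((codes[:, None] >> np.array([1, 0])) & 1).astype(np.uint8).ravel()
packed = np.packbits(bits)
print(embedding.nbytes, packed.nbytes)  # 512 32
```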
Other Features
- Multi-tenancy: support multiple tenants within a single database.
- Index Merge Operator: Build indices in parallel and merge them together.
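The merge operator can be pictured with a generic sketch (illustrative only, not LintDB's implementation): inverted lists built in parallel shards are concatenated per key:

```python
from collections import defaultdict

def merge_indices(*indices):
    """Concatenate posting lists that share the same key across shards."""
    merged = defaultdict(list)
    for index in indices:
        for key, postings in index.items():
            merged[key].extend(postings)
    return dict(merged)

shard_a = {"centroid_0": [1, 2], "centroid_1": [3]}
shard_b = {"centroid_0": [4], "centroid_2": [5]}
print(merge_indices(shard_a, shard_b))
```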