Releases: DeployQL/LintDB

v0.5.1

15 Aug 23:13

This release adds a server implementation. The server can search, add, update, and remove documents. Indices must be created and trained from Python before use with the server.

What's Changed

Full Changelog: v0.5.0...v0.5.1

v0.5.0

08 Aug 05:49
ea7d726

v0.5.0 has major breaking changes

Breaking changes

  • Python's API has drastically changed.
  • Collections have been removed.

We introduce schemas to the database. A schema lets us index, store, and filter different data types, and compose different queries and ways of scoring.

What problem does this solve?

ColBERT and other heavyweight retrieval mechanisms can be slow because there are more embeddings to compare per document: with one embedding per token, a 200-token document costs hundreds of vector comparisons where a single-vector model needs one. This makes it necessary to filter documents or iteratively reduce the number of documents scored.

How did we solve it?

Schemas enable more flexible queries. Filtering becomes an option, and we can choose to score documents based on each matched element.
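
To make this concrete, here is a rough sketch of what a schema and a filtered query could look like. This is a hypothetical illustration: the field types, query shape, and names below are assumptions, not LintDB's published API.

# Hypothetical sketch: these names are illustrative, not LintDB's API.
schema = {
    "fields": [
        {"name": "colbert", "type": "tensor", "dims": 128},  # indexed + contextual
        {"name": "year", "type": "integer"},                 # stored, filterable
        {"name": "title", "type": "text"},                   # stored only
    ]
}

# Filter on cheap metadata first, then run the expensive multi-vector
# scoring only over documents that survive the filter.
query = {
    "filter": {"year": {"gte": 2020}},
    "rank": {"field": "colbert"},
    "k": 10,
}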

DocumentProcessor

Our main new abstraction is document processing, which has been broken out from index writing. The DocProcessor branches for each supported data type, and we optionally quantize tensors as part of this step.
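
As a conceptual sketch (illustrative Python pseudocode; the real DocProcessor is C++ and the names here are hypothetical), the branching looks something like:

def process_document(doc, schema, quantizer=None):
    processed = {}
    for field in schema["fields"]:
        value = doc[field["name"]]
        if field["type"] == "tensor":
            # tensor fields may be quantized before they reach the index writer
            value = quantizer.encode(value) if quantizer else value
        processed[field["name"]] = value
    return processed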

ColBERT fields are a special case. ColBERT is both indexed and contextual, in that we search the index but don't retrieve data from the indexed field itself. During scoring, we scan the context field to fetch all of a document's token embeddings at once.

Scoring

Retrievers have been generalized into scoring. This is still a work in progress, but we now have the distinct concepts of retrieval and ranking. Combined with the different field types, we can think of ColBERT as indexed with contextual data and XTR as indexed only.
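
To make the retrieval-and-ranking split concrete, here is a minimal NumPy sketch of ColBERT-style late interaction. It shows the general technique (MaxSim over token embeddings), not LintDB's internal code; the function names are ours.

import numpy as np

def maxsim(query_embs, doc_embs):
    # for each query token, take the best-matching doc token (dot product),
    # then sum across query tokens
    sims = query_embs @ doc_embs.T  # (n_query_tokens, n_doc_tokens)
    return float(sims.max(axis=1).sum())

def retrieve_then_rank(query_embs, candidates, k=10):
    # candidates: (doc_id, doc_embs) pairs proposed by a cheaper first pass
    scored = [(doc_id, maxsim(query_embs, embs)) for doc_id, embs in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]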

Collections

Collections have been removed. Collections offered an easier way to index data: you could pass in text and LintDB would embed it automatically. However, this muddied the main idea behind LintDB, which is storing and retrieving. We see collections coming back as extensions within the Python library.

Python bindings

The Python bindings previously used SWIG, whose interface files rely on a custom syntax to define which C++ gets bound to Python. This became difficult to maintain: some of our data objects were best translated into Python dictionaries, and that wasn't simple to accomplish.

We've migrated from SWIG to nanobind, where bindings are declared directly in C++. There are still some growing pains, but it's much clearer how to define, override, or rename our bindings.

Documentation

Documentation is moving from Sphinx to MkDocs. The main problem was versioning our documentation: Sphinx did not have a clear way to handle this automatically, while MkDocs has mike for versioning docs.

We haven't figured out all of the bugs with translating our docstrings, but fixing this seems doable.

What's Changed

Full Changelog: v0.4.1...v0.5.0

v0.4.0

17 Jun 01:51
1bad637

What's Changed

  • Set configuration for index properly. run collection benchmark by @mtbarta in #24
  • Add interpret method and batching in collections by @mtbarta in #26
  • Add XTR Support by @mtbarta in #27
  • Enable passing Python dictionaries in index.add by @mtbarta in #28
  • Bump version: 0.3.1 → 0.4.0 by @mtbarta in #29

Full Changelog: v0.3.0...v0.4.0

v0.3.0

13 May 17:16

This release adds collections, which enable users to insert, search, and retrieve text.

Here's an example from testing:

import lintdb

dir_path = "/tmp/lintdb_index"  # directory where the index lives

# IndexIVF(path, num_centroids, dims, nbits, k_iter_training,
#          num_subquantizers, encoding)
index_one = lintdb.IndexIVF(dir_path, 32, 128, 2, 4, 16, lintdb.IndexEncoding_BINARIZER)

collection_options = lintdb.CollectionOptions()
collection_options.model_file = "assets/model.onnx"
collection_options.tokenizer_file = "assets/colbert_tokenizer.json"
collection = lintdb.Collection(index_one, collection_options)

# train the index's centroids on sample text
collection.train(['hello world!'] * 1500)

# add(tenant, doc id, text, metadata)
collection.add(0, 1, "hello world!", {"key": "metadata"})

opts = lintdb.SearchOptions()
opts.n_probe = 250  # number of inverted lists to probe
results = collection.search(0, "hello world!", 10, opts)  # (tenant, query, k, opts)

Using Collections

Databases created before v0.3 will not fetch metadata on documents. To upgrade, create an empty database and merge the old database into the new one.
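
A minimal sketch of that upgrade path, assuming the index exposes a merge method (the call below is hypothetical; the actual merge operator may differ):

import lintdb

# hypothetical upgrade sketch; the merge call may differ from the real API
new_index = lintdb.IndexIVF("new_db", 32, 128, 2, 4, 16, lintdb.IndexEncoding_BINARIZER)
new_index.merge("old_db")  # fold the pre-v0.3 database into the new index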

What's Changed

Full Changelog: v0.2.1...v0.3.0

Release 0.2.1

06 May 14:27

What's Changed

  • enable gh actions to build multiple python versions by @mtbarta in #20

Full Changelog: v0.2.0...v0.2.1

Release 0.2.0

05 May 06:47
9eaecc9

This release includes some refactoring.

Breaking Changes

When creating an index, a boolean previously determined whether compression was used. This has been replaced with an enum.

  • BINARIZER is the standard encoding as defined in PLAID. This is the default.
  • PRODUCT_QUANTIZER uses faiss' PQ encoder and methods as defined in EMVB.
  • NONE doesn't use any compression.

Here's an example of the new constructor from our indexing scripts in benchmarks.

index_type_enum = ldb.IndexEncoding_BINARIZER  # default encoding
if index_type == "binarizer":
    index_type_enum = ldb.IndexEncoding_BINARIZER
elif index_type == "pq":
    index_type_enum = ldb.IndexEncoding_PRODUCT_QUANTIZER
elif index_type == "none":
    index_type_enum = ldb.IndexEncoding_NONE

# IndexIVF(path, num_centroids, dims, nbits, k_iter_training,
#          num_subquantizers, encoding)
index = ldb.IndexIVF(index_path, num_centroids, num_dims, nbits, k_iter_training, num_subquantizers, index_type_enum)

What's Changed

Additionally, Linux now uses MKL. This helps avoid slowdowns caused by OpenBLAS and OpenMP. Please reach out if you notice any problems.

Full Changelog: v0.1.0...v0.2.0

Initial Release

04 Apr 23:32

LintDB v0.1.0

Major Features

  • Multi-vector support: LintDB stores multiple vectors per document id and calculates the max similarity across vectors to determine relevance.
  • Bit-level Compression: LintDB fully implements PLAID's bit compression, storing 128-dimension embeddings in as little as 32 bytes (at 2 bits per dimension: 128 × 2 bits = 32 bytes).
  • Embedded: LintDB can be embedded directly into your Python application. No need to set up a separate database.
  • Full Support for PLAID and ColBERT: LintDB is built around PLAID and ColBERT for efficient storage and lookup of token-level embeddings.

Other Features

  • Multi-tenancy: support multiple tenants within a single database.
  • Index Merge Operator: Build indices in parallel and merge them together.