Releases: DeployQL/LintDB

v0.5.1

15 Aug 23:13

This release adds a server implementation. The server can search, add, update, and remove documents. Indices must be created and trained from Python before use with the server.

What's Changed

Full Changelog: v0.5.0...v0.5.1

v0.5.0

08 Aug 05:49
ea7d726

v0.5.0 has major breaking changes

Breaking changes

  • Python's API has drastically changed.
  • Collections have been removed.

We introduce schemas to the database. A schema lets us index, store, and filter different data types, and compose different queries and ways of scoring.

What problem does this solve?

ColBERT and other heavyweight retrieval mechanisms can be slow because there are more embeddings to compare per document: with one embedding per token, a 200-token document costs hundreds of vector comparisons where a single-vector model needs one. This makes it necessary to filter documents or iteratively reduce the number of documents scored.

How did we solve it?

Schemas enable more flexible queries. Filtering becomes an option, and we can choose to score documents based on each matched element.
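
To make this concrete, here is a rough sketch of what a schema and a filtered query could look like. This is a hypothetical illustration: the field types, query shape, and names below are assumptions, not LintDB's published API.

# Hypothetical sketch: these names are illustrative, not LintDB's API.
schema = {
    "fields": [
        {"name": "colbert", "type": "tensor", "dims": 128},  # indexed + contextual
        {"name": "year", "type": "integer"},                 # stored, filterable
        {"name": "title", "type": "text"},                   # stored only
    ]
}

# Filter on cheap metadata first, then run the expensive multi-vector
# scoring only over documents that survive the filter.
query = {
    "filter": {"year": {"gte": 2020}},
    "rank": {"field": "colbert"},
    "k": 10,
}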

DocumentProcessor

Our main new abstraction is document processing, which has been broken out from index writing. The DocProcessor branches for each supported data type, and we optionally quantize tensors as part of this step.
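
As a conceptual sketch (illustrative Python pseudocode; the real DocProcessor is C++ and the names here are hypothetical), the branching looks something like:

def process_document(doc, schema, quantizer=None):
    processed = {}
    for field in schema["fields"]:
        value = doc[field["name"]]
        if field["type"] == "tensor":
            # tensor fields may be quantized before they reach the index writer
            value = quantizer.encode(value) if quantizer else value
        processed[field["name"]] = value
    return processed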

ColBERT fields are a special case. ColBERT is both indexed and contextual, in that we search the index but don't retrieve data from the indexed field itself. During scoring, we scan the context field to fetch all of a document's token embeddings at once.

Scoring

Retrievers have been generalized into scoring. This is still a work in progress, but we now have the distinct concepts of retrieval and ranking. Combined with the different field types, we can think of ColBERT as indexed with contextual data and XTR as indexed only.
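
To make the retrieval-and-ranking split concrete, here is a minimal NumPy sketch of ColBERT-style late interaction. It shows the general technique (MaxSim over token embeddings), not LintDB's internal code; the function names are ours.

import numpy as np

def maxsim(query_embs, doc_embs):
    # for each query token, take the best-matching doc token (dot product),
    # then sum across query tokens
    sims = query_embs @ doc_embs.T  # (n_query_tokens, n_doc_tokens)
    return float(sims.max(axis=1).sum())

def retrieve_then_rank(query_embs, candidates, k=10):
    # candidates: (doc_id, doc_embs) pairs proposed by a cheaper first pass
    scored = [(doc_id, maxsim(query_embs, embs)) for doc_id, embs in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]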

Collections

Collections have been removed. Collections offered an easier way to index data: you could pass in text and LintDB would embed it automatically. However, this muddied the main idea behind LintDB, which is storing and retrieving. We see collections coming back as extensions within the Python library.

Python bindings

The Python bindings previously used SWIG, whose interface files rely on a custom syntax to define which C++ gets bound to Python. This became difficult to maintain: some of our data objects were best translated into Python dictionaries, and that wasn't simple to accomplish.

We've migrated from SWIG to nanobind, where bindings are declared directly in C++. There are still some growing pains, but it's much clearer how to define, override, or rename our bindings.

Documentation

Documentation is moving from Sphinx to MkDocs. The main problem was versioning our documentation: Sphinx did not have a clear way to handle this automatically, while MkDocs has mike for versioning docs.

We haven't figured out all of the bugs with translating our docstrings, but fixing this seems doable.

What's Changed

Full Changelog: v0.4.1...v0.5.0

v0.4.0

17 Jun 01:51
1bad637

What's Changed

  • Set configuration for index properly. run collection benchmark by @mtbarta in #24
  • Add interpret method and batching in collections by @mtbarta in #26
  • Add XTR Support by @mtbarta in #27
  • Enable passing Python dictionaries in index.add by @mtbarta in #28
  • Bump version: 0.3.1 → 0.4.0 by @mtbarta in #29

Full Changelog: v0.3.0...v0.4.0

v0.3.0

13 May 17:16

This release adds collections, which enable users to insert, search, and retrieve text.

Here's an example from testing:

import lintdb

dir_path = "/tmp/lintdb_index"  # directory where the index lives

# IndexIVF(path, num_centroids, dims, nbits, k_iter_training,
#          num_subquantizers, encoding)
index_one = lintdb.IndexIVF(dir_path, 32, 128, 2, 4, 16, lintdb.IndexEncoding_BINARIZER)

collection_options = lintdb.CollectionOptions()
collection_options.model_file = "assets/model.onnx"
collection_options.tokenizer_file = "assets/colbert_tokenizer.json"
collection = lintdb.Collection(index_one, collection_options)

# train the index's centroids on sample text
collection.train(['hello world!'] * 1500)

# add(tenant, doc id, text, metadata)
collection.add(0, 1, "hello world!", {"key": "metadata"})

opts = lintdb.SearchOptions()
opts.n_probe = 250  # number of inverted lists to probe
results = collection.search(0, "hello world!", 10, opts)  # (tenant, query, k, opts)

Using Collections

Databases created before v0.3 will not fetch metadata on documents. To upgrade, create an empty database and merge the old database into the new one.
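
A minimal sketch of that upgrade path, assuming the index exposes a merge method (the call below is hypothetical; the actual merge operator may differ):

import lintdb

# hypothetical upgrade sketch; the merge call may differ from the real API
new_index = lintdb.IndexIVF("new_db", 32, 128, 2, 4, 16, lintdb.IndexEncoding_BINARIZER)
new_index.merge("old_db")  # fold the pre-v0.3 database into the new index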

What's Changed

Full Changelog: v0.2.1...v0.3.0

Release 0.2.1

06 May 14:27

What's Changed

  • enable gh actions to build multiple python versions by @mtbarta in #20

Full Changelog: v0.2.0...v0.2.1

Release 0.2.0

05 May 06:47
9eaecc9

This release includes some refactoring.

Breaking Changes

When creating an index, a boolean previously determined whether compression was used. This has been replaced with an enum.

  • BINARIZER is the standard encoding as defined in PLAID. This is the default.
  • PRODUCT_QUANTIZER uses faiss' PQ encoder and methods as defined in EMVB.
  • NONE doesn't use any compression.

Here's an example of the new constructor from our indexing scripts in benchmarks.

index_type_enum = ldb.IndexEncoding_BINARIZER  # default encoding
if index_type == "binarizer":
    index_type_enum = ldb.IndexEncoding_BINARIZER
elif index_type == "pq":
    index_type_enum = ldb.IndexEncoding_PRODUCT_QUANTIZER
elif index_type == "none":
    index_type_enum = ldb.IndexEncoding_NONE

# IndexIVF(path, num_centroids, dims, nbits, k_iter_training,
#          num_subquantizers, encoding)
index = ldb.IndexIVF(index_path, num_centroids, num_dims, nbits, k_iter_training, num_subquantizers, index_type_enum)

What's Changed

Additionally, Linux now uses MKL. This helps avoid slowdowns caused by OpenBLAS and OpenMP. Please reach out if you notice any problems.

Full Changelog: v0.1.0...v0.2.0

Initial Release

04 Apr 23:32

LintDB v0.1.0

Major Features

  • Multi-vector support: LintDB stores multiple vectors per document id and calculates the max similarity across vectors to determine relevance.
  • Bit-level Compression: LintDB fully implements PLAID's bit compression, storing 128-dimension embeddings in as little as 32 bytes (at 2 bits per dimension: 128 × 2 bits = 32 bytes).
  • Embedded: LintDB can be embedded directly into your Python application. No need to set up a separate database.
  • Full Support for PLAID and ColBERT: LintDB is built around PLAID and ColBERT for efficient storage and lookup of token-level embeddings.

Other Features

  • Multi-tenancy: support multiple tenants within a single database.
  • Index Merge Operator: Build indices in parallel and merge them together.