Adds new bit element_type for dense_vectors (#110059)
This commit adds `bit` vector support by adding `element_type: bit` for
vectors. This new element type works for indexed and non-indexed
vectors. Additionally, it works with `hnsw` and `flat` index types. No
quantization-based codec works with this element type, which is
consistent with `byte` vectors.

`bit` vectors accept up to `32768` dimensions in size and expect vectors
that are being indexed to be encoded either as a hexadecimal string or a
`byte[]` array where each element of the `byte` array represents `8`
bits of the vector.

`bit` vectors support script usage and regular query usage. When
indexed, all comparisons are `xor` and `popcount` summations (that is,
Hamming distance), and the scores are transformed and normalized by
the vector dimensions. Note that indexed bit vectors require `l2_norm`
as the similarity.

For scripts, `l1norm` is the same as `hamming` distance and `l2norm` is
`sqrt(l1norm)`. `dotProduct` and `cosineSimilarity` are not supported.

Note that the dimensions expected by this element_type must always be
divisible by `8`, and the `byte[]` vectors provided for indexing must
have size `dim/8`, where each byte element represents `8` bits of
the vector.

closes: #48322
benwtrent authored Jun 26, 2024
1 parent 97651df commit 5add44d
Showing 38 changed files with 2,711 additions and 185 deletions.
32 changes: 32 additions & 0 deletions docs/changelog/110059.yaml
@@ -0,0 +1,32 @@
pr: 110059
summary: Adds new `bit` `element_type` for `dense_vectors`
area: Vector Search
type: feature
issues: []
highlight:
  title: Adds new `bit` `element_type` for `dense_vectors`
  body: |-
    This adds `bit` vector support by adding `element_type: bit` for
    vectors. This new element type works for indexed and non-indexed
    vectors. Additionally, it works with `hnsw` and `flat` index types. No
    quantization-based codec works with this element type, which is
    consistent with `byte` vectors.
    `bit` vectors accept up to `32768` dimensions in size and expect vectors
    that are being indexed to be encoded either as a hexadecimal string or a
    `byte[]` array where each element of the `byte` array represents `8`
    bits of the vector.
    `bit` vectors support script usage and regular query usage. When
    indexed, all comparisons are `xor` and `popcount` summations (that is,
    Hamming distance), and the scores are transformed and normalized by
    the vector dimensions.
    For scripts, `l1norm` is the same as `hamming` distance and `l2norm` is
    `sqrt(l1norm)`. `dotProduct` and `cosineSimilarity` are not supported.
    Note that the dimensions expected by this element_type must always be
    divisible by `8`, and the `byte[]` vectors provided for indexing must
    have size `dim/8`, where each byte element represents `8` bits of
    the vector.
  notable: true
140 changes: 134 additions & 6 deletions docs/reference/mapping/types/dense-vector.asciidoc
@@ -183,11 +183,23 @@ The following mapping parameters are accepted:
`element_type`::
(Optional, string)
The data type used to encode vectors. The supported data types are
`float` (default) and `byte`. `float` indexes a 4-byte floating-point
value per dimension. `byte` indexes a 1-byte integer value per dimension.
Using `byte` can result in a substantially smaller index size with the
trade off of lower precision. Vectors using `byte` require dimensions with
integer values between -128 to 127, inclusive for both indexing and searching.
`float` (default), `byte`, and `bit`.

.Valid values for `element_type`
[%collapsible%open]
====
`float`:::
indexes a 4-byte floating-point
value per dimension. This is the default value.
`byte`:::
indexes a 1-byte integer value per dimension.
`bit`:::
indexes a single bit per dimension. Useful for very high-dimensional vectors or models that specifically support bit vectors.
NOTE: when using `bit`, the number of dimensions must be a multiple of 8 and must represent the number of bits.
====

`dims`::
(Optional, integer)
@@ -205,7 +217,11 @@ API>>. Defaults to `true`.
The vector similarity metric to use in kNN search. Documents are ranked by
their vector field's similarity to the query vector. The `_score` of each
document will be derived from the similarity, in a way that ensures scores are
positive and that a larger score corresponds to a higher ranking. Defaults to `cosine`.
positive and that a larger score corresponds to a higher ranking.
Defaults to `l2_norm` when `element_type` is `bit`; otherwise defaults to `cosine`.

NOTE: `bit` vectors only support `l2_norm` as their similarity metric.

+
^*^ This parameter can only be specified when `index` is `true`.
+
@@ -217,6 +233,9 @@ Computes similarity based on the L^2^ distance (also known as Euclidean
distance) between the vectors. The document `_score` is computed as
`1 / (1 + l2_norm(query, vector)^2)`.
For `bit` vectors, the `hamming` distance between the vectors is used instead of `l2_norm`. The `_score`
transformation is `(numBits - hamming(a, b)) / numBits`.
`dot_product`:::
Computes the dot product of two unit vectors. This option provides an optimized way
to perform cosine similarity. The constraints and computed score are defined
@@ -320,3 +339,112 @@ any issues, but features in technical preview are not subject to the support SLA
of official GA features.

`dense_vector` fields support <<synthetic-source,synthetic `_source`>> .

[[dense-vector-index-bit]]
==== Indexing & Searching bit vectors

Setting `element_type: bit` treats all vectors as bit vectors. Bit vectors use only a single
bit per dimension and are internally encoded as bytes. This can be useful for very high-dimensional vectors or for
models that specifically support bit vectors.

When using `bit`, the number of dimensions must be a multiple of 8 and must represent the number of bits. Additionally,
with `bit` vectors, the typical vector similarity values are all effectively scored the same way, using `hamming` distance.

Let's compare two `byte[]` arrays, each representing 40 individual bits.

`[-127, 0, 1, 42, 127]` in bits `1000000100000000000000010010101001111111`
`[127, -127, 0, 1, 42]` in bits `0111111110000001000000000000000100101010`
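
The relationship between a bit string and its `byte[]` form can be sketched in Python. This is only an illustration: `pack_bits` is a hypothetical helper, not an Elasticsearch API, and it assumes the most significant bit of each byte comes first, which matches the bit strings shown above.

```python
def pack_bits(bits):
    """Pack a list of 0/1 values into signed bytes, most significant bit first."""
    if len(bits) % 8 != 0:
        raise ValueError("number of bits must be a multiple of 8")
    packed = []
    for i in range(0, len(bits), 8):
        value = 0
        for bit in bits[i:i + 8]:
            if bit not in (0, 1):
                raise ValueError("each element must be 0 or 1")
            value = (value << 1) | bit
        # Convert the unsigned 0..255 value to the signed -128..127 range.
        packed.append(value - 256 if value > 127 else value)
    return packed


bits = [int(c) for c in "1000000100000000000000010010101001111111"]
print(pack_bits(bits))  # [-127, 0, 1, 42, 127]
```

A 40-bit vector therefore becomes 5 bytes, which is why `dims` must be a multiple of 8.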

When comparing these two bit vectors, we first take the {wikipedia}/Hamming_distance[`hamming` distance].

`xor` result:
```
1000000100000000000000010010101001111111
^
0111111110000001000000000000000100101010
=
1111111010000001000000010010101101010101
```

Then, we gather the count of `1` bits in the `xor` result: `18`. To scale for scoring, we subtract from the total number
of bits and divide by the total number of bits: `(40 - 18) / 40 = 0.55`. This would be the `_score` between these two
vectors.
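
The arithmetic above can be reproduced with a short Python sketch. The helper names `hamming_distance` and `bit_score` are illustrative and not part of Elasticsearch; the sketch assumes signed byte values in the range -128 to 127, as in the example.

```python
def hamming_distance(a, b):
    """Count differing bits between two equal-length signed byte arrays."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of bytes")
    # Mask with 0xFF so negative (signed) bytes are treated as 8-bit patterns.
    return sum(bin((x ^ y) & 0xFF).count("1") for x, y in zip(a, b))


def bit_score(a, b):
    """Scale the Hamming distance to a score: (numBits - hamming) / numBits."""
    num_bits = 8 * len(a)
    return (num_bits - hamming_distance(a, b)) / num_bits


first = [-127, 0, 1, 42, 127]
second = [127, -127, 0, 1, 42]
print(hamming_distance(first, second))  # 18
print(bit_score(first, second))         # 0.55
```

Identical vectors have a Hamming distance of `0` and therefore score `1.0` under this transformation.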

Here is an example of indexing and searching bit vectors:

[source,console]
--------------------------------------------------
PUT my-bit-vectors
{
  "mappings": {
    "properties": {
      "my_vector": {
        "type": "dense_vector",
        "dims": 40, <1>
        "element_type": "bit"
      }
    }
  }
}
--------------------------------------------------
<1> The number of dimensions that represents the number of bits

[source,console]
--------------------------------------------------
POST /my-bit-vectors/_bulk?refresh
{"index": {"_id" : "1"}}
{"my_vector": [127, -127, 0, 1, 42]} <1>
{"index": {"_id" : "2"}}
{"my_vector": "8100012a7f"} <2>
--------------------------------------------------
// TEST[continued]
<1> 5 bytes representing the 40-bit vector
<2> A hexadecimal string representing the 40-bit vector
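
The two encodings are interchangeable. As a rough sketch in Python (the helper names are hypothetical and only use the standard `bytes.hex` and `bytes.fromhex` methods), a signed byte array can be converted to the hexadecimal form and back. Note that `8100012a7f` is the hexadecimal form of `[-127, 0, 1, 42, 127]`, so documents `1` and `2` hold different vectors.

```python
def signed_bytes_to_hex(values):
    """Encode signed byte values (-128..127) as a lowercase hex string."""
    return bytes(v & 0xFF for v in values).hex()


def hex_to_signed_bytes(text):
    """Decode a hex string back into signed byte values."""
    return [b - 256 if b > 127 else b for b in bytes.fromhex(text)]


print(signed_bytes_to_hex([-127, 0, 1, 42, 127]))  # 8100012a7f
print(hex_to_signed_bytes("8100012a7f"))           # [-127, 0, 1, 42, 127]
```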

Then, when searching, you can use the `knn` query to search for similar bit vectors:

[source,console]
--------------------------------------------------
POST /my-bit-vectors/_search?filter_path=hits.hits
{
  "query": {
    "knn": {
      "query_vector": [127, -127, 0, 1, 42],
      "field": "my_vector"
    }
  }
}
--------------------------------------------------
// TEST[continued]

[source,console-result]
----
{
  "hits": {
    "hits": [
      {
        "_index": "my-bit-vectors",
        "_id": "1",
        "_score": 1.0,
        "_source": {
          "my_vector": [
            127,
            -127,
            0,
            1,
            42
          ]
        }
      },
      {
        "_index": "my-bit-vectors",
        "_id": "2",
        "_score": 0.55,
        "_source": {
          "my_vector": "8100012a7f"
        }
      }
    ]
  }
}
----

20 changes: 18 additions & 2 deletions docs/reference/vectors/vector-functions.asciidoc
@@ -1,4 +1,3 @@
[role="xpack"]
[[vector-functions]]
===== Functions for vector fields

@@ -17,6 +16,8 @@ This is the list of available vector functions and vector access methods:
6. <<vector-functions-accessing-vectors,`doc[<field>].vectorValue`>> – returns a vector's value as an array of floats
7. <<vector-functions-accessing-vectors,`doc[<field>].magnitude`>> – returns a vector's magnitude

NOTE: The `cosineSimilarity` and `dotProduct` functions are not supported for `bit` vectors.

NOTE: The recommended way to access dense vectors is through the
`cosineSimilarity`, `dotProduct`, `l1norm` or `l2norm` functions. Please note
however, that you should call these functions only once per script. For example,
@@ -193,7 +194,7 @@ we added `1` in the denominator.
====== Hamming distance

The `hamming` function calculates {wikipedia}/Hamming_distance[Hamming distance] between a given query vector and
document vectors. It is only available for byte vectors.
document vectors. It is only available for byte and bit vectors.

[source,console]
--------------------------------------------------
@@ -278,10 +279,14 @@ You can access vector values directly through the following functions:

- `doc[<field>].vectorValue` – returns a vector's value as an array of floats

NOTE: For `bit` vectors, this still returns a `float[]`, where each element represents 8 bits.

- `doc[<field>].magnitude` – returns a vector's magnitude as a float
(for vectors created prior to version 7.5 the magnitude is not stored.
So this function calculates it anew every time it is called).

NOTE: For `bit` vectors, this is just the square root of the sum of `1` bits.

For example, the script below implements a cosine similarity using these
two functions:

@@ -319,3 +324,14 @@ GET my-index-000001/_search
}
}
--------------------------------------------------
[[vector-functions-bit-vectors]]
====== Bit vectors and vector functions

When using `bit` vectors, not all the vector functions are available. The supported functions are:

* <<vector-functions-hamming,`hamming`>> – calculates Hamming distance, the number of `1` bits in the bitwise XOR of the two vectors
* <<vector-functions-l1,`l1norm`>> – calculates L^1^ distance, which for `bit` vectors is simply the `hamming` distance
* <<vector-functions-l2,`l2norm`>> – calculates L^2^ distance, which for `bit` vectors is the square root of the `hamming` distance

Currently, the `cosineSimilarity` and `dotProduct` functions are not supported for `bit` vectors.
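
The relationships between these functions can be illustrated outside of Elasticsearch. The Python sketch below is not the script API itself; the function names only mirror the script functions for readability. Because every dimension is `0` or `1`, each differing bit contributes exactly `1` to both the absolute difference and the squared difference, so L^1^ equals the Hamming distance and L^2^ is its square root.

```python
import math


def hamming(a, b):
    """Number of differing bits between two signed byte arrays."""
    return sum(bin((x ^ y) & 0xFF).count("1") for x, y in zip(a, b))


def l1norm(a, b):
    """For bit vectors, L1 distance equals the Hamming distance."""
    return hamming(a, b)


def l2norm(a, b):
    """For bit vectors, L2 distance is the square root of the Hamming distance."""
    return math.sqrt(hamming(a, b))


query = [127, -127, 0, 1, 42]
doc = [-127, 0, 1, 42, 127]
print(l1norm(query, doc))  # 18
print(l2norm(query, doc))  # sqrt(18), roughly 4.243
```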
