Adds new bit element_type for dense_vectors (#110059)

This commit adds `bit` vector support by adding `element_type: bit` for vectors. This new element type works for indexed and non-indexed vectors. Additionally, it works with the `hnsw` and `flat` index types. No quantization-based codec works with this element type, which is consistent with `byte` vectors.

`bit` vectors accept up to `32768` dimensions and expect vectors being indexed to be encoded either as a hexadecimal string or as a `byte[]` array where each element of the array represents `8` bits of the vector.

`bit` vectors support script usage and regular query usage. When indexed, all comparisons are `xor` and `popcount` summations (that is, Hamming distance), and the scores are transformed and normalized by the vector dimensions. Note that indexed bit vectors require `l2_norm` as the similarity.

For scripts, `l1norm` is the same as `hamming` distance and `l2norm` is `sqrt(l1norm)`. `dotProduct` and `cosineSimilarity` are not supported.

Note that the dimensions for this element_type must always be divisible by `8`, and the `byte[]` vectors provided for indexing must have size `dim/8`, where each byte element represents `8` bits of the vector.

closes: #48322
elastic · Jun 26, 2024 · 5add44d
1 parent 97651df
commit 5add44d
Showing 38 changed files with 2,711 additions and 185 deletions.
diff --git a/docs/changelog/110059.yaml b/docs/changelog/110059.yaml
@@ -0,0 +1,32 @@
+pr: 110059
+summary: Adds new `bit` `element_type` for `dense_vectors`
+area: Vector Search
+type: feature
+issues: []
+highlight:
+  title: Adds new `bit` `element_type` for `dense_vectors`
+  body: |-
+    This adds `bit` vector support by adding `element_type: bit` for
+    vectors. This new element type works for indexed and non-indexed
+    vectors. Additionally, it works with `hnsw` and `flat` index types. No
+    quantization based codec works with this element type, this is
+    consistent with `byte` vectors.
+
+    `bit` vectors accept up to `32768` dimensions in size and expect vectors
+    that are being indexed to be encoded either as a hexadecimal string or a
+    `byte[]` array where each element of the `byte` array represents `8`
+    bits of the vector.
+
+    `bit` vectors support script usage and regular query usage. When
+    indexed, all comparisons done are `xor` and `popcount` summations (aka,
+    hamming distance), and the scores are transformed and normalized given
+    the vector dimensions.
+
+    For scripts, `l1norm` is the same as `hamming` distance and `l2norm` is
+    `sqrt(l1norm)`. `dotProduct` and `cosineSimilarity` are not supported. 
+
+    Note, the dimensions expected by this element_type must always be
+    divisible by `8`, and the `byte[]` vectors provided for indexing must
+    have size `dim/8`, where each byte element represents `8` bits of
+    the vector.
+  notable: true
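As a rough illustration of the encoding described in this changelog entry, the following Python sketch packs a list of bits into the signed `byte[]` form and the equivalent hexadecimal string. The helper names `pack_bits` and `to_hex` are hypothetical and not part of Elasticsearch. The most-significant-bit-first ordering within each byte is an assumption, chosen because it reproduces the 40-bit example in the documentation changes of this commit.

```python
def pack_bits(bits):
    """Pack a list of 0/1 values (MSB first, an assumption) into signed bytes."""
    if len(bits) % 8 != 0:
        raise ValueError("number of bits must be a multiple of 8")
    packed = []
    for start in range(0, len(bits), 8):
        value = 0
        for bit in bits[start:start + 8]:
            value = (value << 1) | bit
        # Reinterpret the unsigned value 0..255 as a signed byte -128..127.
        packed.append(value - 256 if value > 127 else value)
    return packed


def to_hex(signed_bytes):
    """Render signed bytes as a lowercase hexadecimal string."""
    return bytes(b & 0xFF for b in signed_bytes).hex()


bits = [int(c) for c in "1000000100000000000000010010101001111111"]
packed = pack_bits(bits)
print(packed)          # [-127, 0, 1, 42, 127]
print(to_hex(packed))  # 8100012a7f
```

A 40-dimension `bit` vector therefore becomes 5 bytes, which matches the `dim/8` size requirement stated above.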
diff --git a/docs/reference/mapping/types/dense-vector.asciidoc b/docs/reference/mapping/types/dense-vector.asciidoc
@@ -183,11 +183,23 @@ The following mapping parameters are accepted:
 `element_type`::
 (Optional, string)
 The data type used to encode vectors. The supported data types are
-`float` (default) and `byte`. `float` indexes a 4-byte floating-point
-value per dimension. `byte` indexes a 1-byte integer value per dimension.
-Using `byte` can result in a substantially smaller index size with the
-trade off of lower precision. Vectors using `byte` require dimensions with
-integer values between -128 to 127, inclusive for both indexing and searching.
+`float` (default), `byte`, and `bit`.
+
+.Valid values for `element_type`
+[%collapsible%open]
+====
+`float`:::
+indexes a 4-byte floating-point
+value per dimension. This is the default value.
+
+`byte`:::
+indexes a 1-byte integer value per dimension.
+
+`bit`:::
+indexes a single bit per dimension. Useful for very high-dimensional vectors or models that specifically support bit vectors.
+NOTE: when using `bit`, the number of dimensions must be a multiple of 8 and must represent the number of bits.
+
+====
 
 `dims`::
 (Optional, integer)
@@ -205,7 +217,11 @@ API>>. Defaults to `true`.
 The vector similarity metric to use in kNN search. Documents are ranked by
 their vector field's similarity to the query vector. The `_score` of each
 document will be derived from the similarity, in a way that ensures scores are
-positive and that a larger score corresponds to a higher ranking. Defaults to `cosine`.
+positive and that a larger score corresponds to a higher ranking.
+Defaults to `l2_norm` when `element_type` is `bit`; otherwise defaults to `cosine`.
+
+NOTE: `bit` vectors only support `l2_norm` as their similarity metric.
+
 +
 ^*^ This parameter can only be specified when `index` is `true`.
 +
@@ -217,6 +233,9 @@ Computes similarity based on the L^2^ distance (also known as Euclidean
 distance) between the vectors. The document `_score` is computed as
 `1 / (1 + l2_norm(query, vector)^2)`.
 
+For `bit` vectors, instead of using `l2_norm`, the `hamming` distance between the vectors is used. The `_score`
+transformation is `(numBits - hamming(a, b)) / numBits`
+
 `dot_product`:::
 Computes the dot product of two unit vectors. This option provides an optimized way
 to perform cosine similarity. The constraints and computed score are defined
@@ -320,3 +339,112 @@ any issues, but features in technical preview are not subject to the support SLA
 of official GA features.
 
 `dense_vector` fields support <<synthetic-source,synthetic `_source`>> .
+
+[[dense-vector-index-bit]]
+==== Indexing & Searching bit vectors
+
+When `element_type: bit` is used, all vectors are treated as bit vectors. Bit vectors use only a single
+bit per dimension and are internally encoded as bytes. This can be useful for very high-dimensional vectors or models.
+
+When using `bit`, the number of dimensions must be a multiple of 8 and must represent the number of bits. Additionally,
+with `bit` vectors, the typical vector similarity values are effectively all scored the same way, that is, with `hamming` distance.
+
+Let's compare two `byte[]` arrays, each representing 40 individual bits.
+
+`[-127, 0, 1, 42, 127]` in bits `1000000100000000000000010010101001111111`
+`[127, -127, 0, 1, 42]` in bits `0111111110000001000000000000000100101010`
+
+When comparing these two bit vectors, we first take the {wikipedia}/Hamming_distance[`hamming` distance].
+
+`xor` result:
+```
+1000000100000000000000010010101001111111
+^
+0111111110000001000000000000000100101010
+=
+1111111010000001000000010010101101010101
+```
+
+Then, we gather the count of `1` bits in the `xor` result: `18`. To scale for scoring, we subtract from the total number
+of bits and divide by the total number of bits: `(40 - 18) / 40 = 0.55`. This would be the `_score` between these two
+vectors.
+
+Here is an example of indexing and searching bit vectors:
+
+[source,console]
+--------------------------------------------------
+PUT my-bit-vectors
+{
+  "mappings": {
+    "properties": {
+      "my_vector": {
+        "type": "dense_vector",
+        "dims": 40, <1>
+        "element_type": "bit"
+      }
+    }
+  }
+}
+--------------------------------------------------
+<1> The number of dimensions that represents the number of bits
+
+[source,console]
+--------------------------------------------------
+POST /my-bit-vectors/_bulk?refresh
+{"index": {"_id" : "1"}}
+{"my_vector": [127, -127, 0, 1, 42]} <1>
+{"index": {"_id" : "2"}}
+{"my_vector": "8100012a7f"} <2>
+--------------------------------------------------
+// TEST[continued]
+<1> 5 bytes representing the 40 bit dimensioned vector
+<2> A hexadecimal string representing the 40 bit dimensioned vector
+
+Then, when searching, you can use the `knn` query to search for similar bit vectors:
+
+[source,console]
+--------------------------------------------------
+POST /my-bit-vectors/_search?filter_path=hits.hits
+{
+  "query": {
+    "knn": {
+      "query_vector": [127, -127, 0, 1, 42],
+      "field": "my_vector"
+    }
+  }
+}
+--------------------------------------------------
+// TEST[continued]
+
+[source,console-result]
+----
+{
+    "hits": {
+        "hits": [
+            {
+                "_index": "my-bit-vectors",
+                "_id": "1",
+                "_score": 1.0,
+                "_source": {
+                    "my_vector": [
+                        127,
+                        -127,
+                        0,
+                        1,
+                        42
+                    ]
+                }
+            },
+            {
+                "_index": "my-bit-vectors",
+                "_id": "2",
+                "_score": 0.55,
+                "_source": {
+                    "my_vector": "8100012a7f"
+                }
+            }
+        ]
+    }
+}
+----
+
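The scoring arithmetic in the added documentation can be checked with a short Python sketch. It mirrors the documented transformation `(numBits - hamming(a, b)) / numBits` but is not the Elasticsearch implementation; the function names `hamming` and `bit_score` are hypothetical.

```python
def hamming(a, b):
    """Count the differing bits between two equal-length signed-byte vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    # Mask to 8 bits so negative Python ints behave like two's-complement bytes.
    return sum(bin((x ^ y) & 0xFF).count("1") for x, y in zip(a, b))


def bit_score(a, b):
    """Scale the Hamming distance into a score in [0, 1]."""
    num_bits = 8 * len(a)
    return (num_bits - hamming(a, b)) / num_bits


doc = [-127, 0, 1, 42, 127]
query = [127, -127, 0, 1, 42]
print(hamming(doc, query))      # 18
print(bit_score(doc, query))    # 0.55
print(bit_score(query, query))  # 1.0
```

The two scores agree with the `_score` values in the example response: `1.0` for the identical vector and `0.55` for the vector indexed from the hexadecimal string.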
diff --git a/docs/reference/vectors/vector-functions.asciidoc b/docs/reference/vectors/vector-functions.asciidoc
@@ -1,4 +1,3 @@
-[role="xpack"]
 [[vector-functions]]
 ===== Functions for vector fields
 
@@ -17,6 +16,8 @@ This is the list of available vector functions and vector access methods:
 6. <<vector-functions-accessing-vectors,`doc[<field>].vectorValue`>> – returns a vector's value as an array of floats
 7. <<vector-functions-accessing-vectors,`doc[<field>].magnitude`>> – returns a vector's magnitude
 
+NOTE: The `cosineSimilarity` and `dotProduct` functions are not supported for `bit` vectors.
+
 NOTE: The recommended way to access dense vectors is through the
 `cosineSimilarity`, `dotProduct`, `l1norm` or `l2norm` functions. Please note
 however, that you should call these functions only once per script. For example,
@@ -193,7 +194,7 @@ we added `1` in the denominator.
 ====== Hamming distance
 
 The `hamming` function calculates {wikipedia}/Hamming_distance[Hamming distance] between a given query vector and
-document vectors. It is only available for byte vectors.
+document vectors. It is only available for byte and bit vectors.
 
 [source,console]
 --------------------------------------------------
@@ -278,10 +279,14 @@ You can access vector values directly through the following functions:
 
 - `doc[<field>].vectorValue` – returns a vector's value as an array of floats
 
+NOTE: For `bit` vectors, this still returns a `float[]`, where each element represents 8 bits.
+
 - `doc[<field>].magnitude` – returns a vector's magnitude as a float
 (for vectors created prior to version 7.5 the magnitude is not stored.
 So this function calculates it anew every time it is called).
 
+NOTE: For `bit` vectors, this is just the square root of the sum of `1` bits.
+
 For example, the script below implements a cosine similarity using these
 two functions:
 
@@ -319,3 +324,14 @@ GET my-index-000001/_search
   }
 }
 --------------------------------------------------
+[[vector-functions-bit-vectors]]
+====== Bit vectors and vector functions
+
+When using `bit` vectors, not all the vector functions are available. The supported functions are:
+
+* <<vector-functions-hamming,`hamming`>> – calculates Hamming distance, the number of `1` bits in the bitwise XOR of the two vectors
+* <<vector-functions-l1,`l1norm`>> – calculates L^1^ distance, which is simply the `hamming` distance
+* <<vector-functions-l2,`l2norm`>> – calculates L^2^ distance, which is the square root of the `hamming` distance
+
+Currently, the `cosineSimilarity` and `dotProduct` functions are not supported for `bit` vectors.
+
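The relationships listed for `bit` vectors follow from each dimension being `0` or `1`: the absolute difference and the squared difference of two bits both equal their XOR, so the L^1^ distance is the Hamming distance and the L^2^ distance is its square root. The Python sketch below only models that arithmetic. The function names echo the script functions for readability but are hypothetical Python helpers, not the Painless API.

```python
import math


def popcount(vector):
    """Number of 1 bits in a vector stored as signed bytes."""
    return sum(bin(x & 0xFF).count("1") for x in vector)


def hamming(a, b):
    """Number of bit positions where the two vectors differ."""
    return popcount([x ^ y for x, y in zip(a, b)])


def l1norm(a, b):
    # Per bit, |a_i - b_i| equals a_i XOR b_i, so L1 is the Hamming distance.
    return hamming(a, b)


def l2norm(a, b):
    # Per bit, (a_i - b_i)^2 also equals a_i XOR b_i, so L2 is sqrt(hamming).
    return math.sqrt(hamming(a, b))


def magnitude(vector):
    # Square root of the number of 1 bits, as noted for doc[<field>].magnitude.
    return math.sqrt(popcount(vector))


doc = [-127, 0, 1, 42, 127]
query = [127, -127, 0, 1, 42]
print(l1norm(doc, query))  # 18
print(l2norm(doc, query))  # square root of 18
print(magnitude(query))    # square root of 13
```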