Adding binary Hamming distance as similarity option for byte vectors #13076
Conversation
I think we should fail much earlier when using a KnnFloatVectorField
with an inappropriate similarity function. Seems to me FieldType#setVectorAttributes
should validate the encoding & similarity function compatibility.
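A minimal sketch of that kind of early validation (a standalone illustration: the check is assumed to live in FieldType#setVectorAttributes, and the enum and method names mirror Lucene's, but this is not the actual implementation):

```java
// Hypothetical sketch: validate encoding/similarity compatibility at
// field-type setup time instead of failing later during comparison.
public class FieldTypeValidation {
  enum VectorEncoding { BYTE, FLOAT32 }
  enum VectorSimilarityFunction { EUCLIDEAN, DOT_PRODUCT, COSINE, BINARY_HAMMING_DISTANCE }

  static void setVectorAttributes(
      VectorEncoding encoding, int dimension, VectorSimilarityFunction similarity) {
    if (similarity == VectorSimilarityFunction.BINARY_HAMMING_DISTANCE
        && encoding != VectorEncoding.BYTE) {
      throw new IllegalArgumentException(
          "BINARY_HAMMING_DISTANCE is only supported for BYTE encoded vectors, got: " + encoding);
    }
    // ... remaining attribute setup elided
  }

  public static void main(String[] args) {
    // byte vectors with Hamming distance are accepted
    setVectorAttributes(VectorEncoding.BYTE, 8, VectorSimilarityFunction.BINARY_HAMMING_DISTANCE);
    try {
      // float vectors with Hamming distance are rejected up front
      setVectorAttributes(VectorEncoding.FLOAT32, 8, VectorSimilarityFunction.BINARY_HAMMING_DISTANCE);
    } catch (IllegalArgumentException expected) {
      System.out.println("rejected early"); // prints "rejected early"
    }
  }
}
```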
if (a.length != b.length) {
  throw new IllegalArgumentException("vector dimensions differ: " + a.length + "!=" + b.length);
}
return 1f / (1 + IMPL.binaryHammingDistance(a, b));
This should return IMPL.binaryHammingDistance(a, b). Users of this function will have to transform the distance to a score separately.
@Override
public float compare(byte[] v1, byte[] v2) {
  return binaryHammingDistance(v1, v2);
The score transformation for the distance should be here. Note squareDistance
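To illustrate the split being suggested, here is a hedged sketch (not Lucene's actual code): the raw distance primitive stays score-free, and compare() applies the 1 / (1 + distance) transformation, mirroring how EUCLIDEAN wraps squareDistance:

```java
// Hedged sketch: keep the distance primitive raw, and do the score
// transformation in compare(), the way EUCLIDEAN wraps squareDistance.
public class HammingSimilarity {
  // raw Hamming distance: number of differing bits between two byte vectors
  public static int binaryHammingDistance(byte[] a, byte[] b) {
    int distance = 0;
    for (int i = 0; i < a.length; i++) {
      distance += Integer.bitCount((a[i] ^ b[i]) & 0xFF);
    }
    return distance;
  }

  // the score transformation lives here, not in the distance primitive
  public static float compare(byte[] v1, byte[] v2) {
    return 1f / (1 + binaryHammingDistance(v1, v2));
  }

  public static void main(String[] args) {
    byte[] a = {0, 0};
    byte[] b = {(byte) 0xFF, 0};
    System.out.println(compare(a, a));               // prints 1.0 (identical vectors)
    System.out.println(binaryHammingDistance(a, b)); // prints 8 (8 differing bits)
  }
}
```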
import org.apache.lucene.util.hnsw.HnswGraphBuilder;
import org.junit.After;
import org.junit.Before;

/** Tests indexing of a knn-graph */
public class TestKnnGraph extends LuceneTestCase {
This is a fairly large refactor adding test coverage, but I think it detracts from this PR. Could you change all this back and instead restrict the similarity function when using floats?
There are so many abstract functions that it seems like this entire test case would need to be rewritten from the ground up if we were to 100% test byte vs. float.
I was torn between excluding hamming distance from float fields and adjusting the tests, to also slightly increase coverage. But yeah, you're right that this is a bit outside the scope of this PR. Will revert the changes and simply exclude the similarity when we know it'll throw.
public Set<VectorEncoding> supportedVectorEncodings() {
  return EnumSet.of(VectorEncoding.BYTE);
}
Suggested change, replacing:

public Set<VectorEncoding> supportedVectorEncodings() {
  return EnumSet.of(VectorEncoding.BYTE);
}

with:

public boolean supportsVectorEncoding(VectorEncoding encoding) {
  return encoding == VectorEncoding.BYTE;
}
I am not sure why this returns a set. Do we really do anything with the set except check membership?
/**
 * Defines which encodings are supported by the similarity function - used in tests to control
 * randomization
 *
 * @return a list of all supported VectorEncodings for the given similarity
 */
public Set<VectorEncoding> supportedVectorEncodings() {
  return EnumSet.of(VectorEncoding.BYTE, VectorEncoding.FLOAT32);
}
Do we do anything with the set except check membership? I think a simple boolean supportsVectorEncoding(VectorEncoding) function would be preferred.
case BINARY_HAMMING_DISTANCE -> throw new IllegalArgumentException(
    "Cannot use Hamming distance with scalar quantization");
};
It would be good if the error message included the literal BINARY_HAMMING_DISTANCE string, to indicate the similarity enumeration we failed at.
@pmpailis could you also push a
// Need to break up the total ByteVector as the result might not
// fit in a byte
var acc1 = total.castShape(ShortVector.SPECIES_512, 0);
var acc2 = total.castShape(ShortVector.SPECIES_512, 1);
Vector castShape() with part number > 0 really needs to be avoided. It is incredibly slow. Have you benchmarked non-mac machines with 256 or 512-bit vectors?
I'm confused about the use of the lookup table. Naively, I'd try to just xor + popcnt: https://docs.oracle.com/en/java/javase/21/docs/api/jdk.incubator.vector/jdk/incubator/vector/VectorOperators.html#XOR
I'm curious whether any explicit vector code is actually needed at all. Integer.bitCount() has autovectorization support in hotspot.
Even if it doesn't autovectorize, I suspect just gathering e.g. 4/8 bytes at a time with the BitUtil varhandle and using a single int/long xor + popcount would perform very well as a baseline.
Hi,

The native order PR was merged.
Hi,

@Override
public int binaryHammingDistance(byte[] a, byte[] b) {
  int distance = 0, i = 0;
  for (final int upperBound = a.length & ~(Long.BYTES - 1); i < upperBound; i += Long.BYTES) {
    distance += Long.bitCount(((long) BitUtil.VH_NATIVE_LONG.get(a, i) ^ (long) BitUtil.VH_NATIVE_LONG.get(b, i)) & 0xFFFFFFFFFFFFFFFFL);
  }
  for (final int upperBound = a.length & ~(Integer.BYTES - 1); i < upperBound; i += Integer.BYTES) {
    distance += Integer.bitCount(((int) BitUtil.VH_NATIVE_INT.get(a, i) ^ (int) BitUtil.VH_NATIVE_INT.get(b, i)) & 0xFFFFFFFF);
  }
  for (; i < a.length; i++) {
    distance += Integer.bitCount((a[i] ^ b[i]) & 0xFF);
  }
  return distance;
}

This one only uses the popcnt CPU instruction. I then ran your benchmark to compare the panama-vectorized one with my new implementation as above.

Please note: the SCALAR one here is the above variant. As this one is four (!) times faster than the Panama variant, there is no need to have a panama-vectorized one, and it looks like it is not working at all. I think the lookup table is a bad idea. To make it short:

This will be a short PR. Make sure to add more tests (there's only one for the VectorUtil method).
I am not sure if we really need the Integer tail. Maybe only implement the Long variant and the byte tail.
Seems to autovectorize just fine. I took Uwe's branch and dumped assembly on my AVX2 machine and see e.g. 256-bit xor and population count logic. I checked the logic in openjdk and it will use
I removed the integer tail and have seen no difference (I especially also looked at the non-aligned sizes):

@Override
public int binaryHammingDistance(byte[] a, byte[] b) {
  int distance = 0, i = 0;
  for (final int upperBound = a.length & ~(Long.BYTES - 1); i < upperBound; i += Long.BYTES) {
    distance += Long.bitCount(((long) BitUtil.VH_NATIVE_LONG.get(a, i) ^ (long) BitUtil.VH_NATIVE_LONG.get(b, i)) & 0xFFFFFFFFFFFFFFFFL);
  }
  // tail:
  for (; i < a.length; i++) {
    distance += Integer.bitCount((a[i] ^ b[i]) & 0xFF);
  }
  return distance;
}

I think this code is simplest and most effective. It may not be the best for vector
Here's my branch: main...uschindler:lucene:binary_hamming_distance
I can merge this into this branch, but the code cleanup and the removal of the useless vectorization and those (public!!!!) lookup tables needs to be done after the merge.
Thanks @uschindler , this is the way to go: the compiler does a good job. Java already has all the necessary logic here to autovectorize and use e.g.
I figured that the
This is my final code:

@Override
public int binaryHammingDistance(byte[] a, byte[] b) {
  int distance = 0, i = 0;
  for (final int upperBound = a.length & ~(Long.BYTES - 1); i < upperBound; i += Long.BYTES) {
    distance += Long.bitCount((long) BitUtil.VH_NATIVE_LONG.get(a, i) ^ (long) BitUtil.VH_NATIVE_LONG.get(b, i));
  }
  // tail:
  for (; i < a.length; i++) {
    distance += Integer.bitCount((a[i] ^ b[i]) & 0xFF);
  }
  return distance;
}
Thank you so much @rmuir & @uschindler for taking such a close look and also running benchmarks. 🙇 The reason I went with the lookup table was that there seemed to be some improvement on Neon compared to

I added the changes to use

Also ran the same experiments on a Xeon cloud instance with the following results:

where

As suggested, I'll proceed with adding this as the main and only implementation of hamming distance and remove both the Panama one and the leftovers from the existing implementation (i.e. the lookup table).
Please also add a test like the panama vs scalar one, where you compare the results of the varhandle variant with the simple byte-by-byte one from the tail loop. Make sure to use interesting vector lengths which are not multiples of 8. There is at the moment only one static test without randomization in TestVectorUtil.
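A sketch of what such a randomized consistency test could look like (standalone and hypothetical: VH_NATIVE_LONG is recreated here via MethodHandles since Lucene's BitUtil is not assumed to be on the classpath, and the lengths deliberately include non-multiples of 8):

```java
// Hypothetical randomized test sketch: the VarHandle-based implementation
// must agree with a trivial byte-by-byte reference on all lengths,
// including ones that are not multiples of Long.BYTES.
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;
import java.util.Random;

public class TestHammingConsistency {
  // stand-in for BitUtil.VH_NATIVE_LONG
  static final VarHandle VH_NATIVE_LONG =
      MethodHandles.byteArrayViewVarHandle(long[].class, ByteOrder.nativeOrder());

  static int hammingVarHandle(byte[] a, byte[] b) {
    int distance = 0, i = 0;
    for (final int upperBound = a.length & ~(Long.BYTES - 1); i < upperBound; i += Long.BYTES) {
      distance += Long.bitCount((long) VH_NATIVE_LONG.get(a, i) ^ (long) VH_NATIVE_LONG.get(b, i));
    }
    for (; i < a.length; i++) {
      distance += Integer.bitCount((a[i] ^ b[i]) & 0xFF);
    }
    return distance;
  }

  // reference: the simple byte-by-byte tail loop applied to the whole array
  static int hammingScalar(byte[] a, byte[] b) {
    int distance = 0;
    for (int i = 0; i < a.length; i++) {
      distance += Integer.bitCount((a[i] ^ b[i]) & 0xFF);
    }
    return distance;
  }

  public static void main(String[] args) {
    Random random = new Random(42);
    // interesting lengths: not just multiples of 8
    for (int len : new int[] {1, 7, 8, 9, 63, 64, 65, 207, 300}) {
      byte[] a = new byte[len];
      byte[] b = new byte[len];
      random.nextBytes(a);
      random.nextBytes(b);
      if (hammingVarHandle(a, b) != hammingScalar(a, b)) {
        throw new AssertionError("mismatch at length " + len);
      }
    }
    System.out.println("ok"); // prints "ok"
  }
}
```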
About NEON: Robert checked yesterday. There is a lot going on in Hotspot, and optimizations are added all the time. If NEON is slower on your machine, it might be that there's still some optimization missing. Also the Apple NEON machines are a bit limited in their capabilities, so they're not representative of the whole modern ARM infrastructure. Thanks anyway for verifying the results on x86. You can keep the benchmark and maybe also add one without the varhandle, to allow benchmarking the varhandle vs scalar one on older JDKs.
P.S. the long support for bit count was added recently on x86. We may also compare with the integer one using the integer var handle (that's easy to check). Maybe that performs better on NEON. In general, as 64-bit optimizations for integer operations are added to hotspot all the time, we should stay with longs.
Thanks for the suggestion @uschindler - will add the suggested variant to the benchmarks! To be honest, the reason I re-ran on x86 was mainly the vector performance differences (hence why I went for the Panama impl in the first place), trying to make sure that I wasn't imagining numbers 😅 . But good to see that we won't have to overcomplicate things, which is pretty much how I initially approached this (rather simple) issue :)
My question is why add this function when it's not that much faster than integer dot product? I see less than 20 percent improvement, which won't even translate to 20 percent indexing/search. The issue is that folks just want to add, add, add these functions, yet there is no way to remove any function from this list (they will scream "bwc"). So although this particular function is less annoying than others from a performance perspective, I'm -1 on adding it for this reason without any plans to address this.
A good way to get a new function in would be to actually improve our support and maintenance burden by first removing a horribly performing one such as cosine. That way we are actually improving rather than just piling on more code.
I think the idea is to have shorter vectors, and so it is faster. With Hamming you encode multiple dimensions per byte (a byte component is a vector of 8 dimensions). So when you want 512 dimensions, you need a byte vector dimension of 64 to encode that.
Because it provides different scores. Integer dot-product doesn't provide the same values (angle between vectors) and doesn't work for binary encoded data (vs. euclidean bit distance). Hamming distance is more like

And before you suggest "let's remove
If you are against this & will block it, then we need to provide a clean way for users to introduce their own similarities. I suggested making similarities pluggable in the past, but got shot down.
If hamming and cosine were comparable, then sure. But they are not. I do agree cosine should probably be removed (not because of hamming distance), but because dot_product exists. |
Can we do that for Lucene 10.0 ? |
@Override
public float compare(byte[] v1, byte[] v2) {
  return (1f / (1 + binaryHammingDistance(v1, v2)));
This depends on vector length, is this intended? I would have expected something like dimensions * 8 / (1 + distance). I know it is not relevant for scoring purposes, as it is a constant factor, but we have some normalization on other functions, too.
I see your point. The initial idea was to have the score bounded in (0, 1], so as to have a more natural way of interpreting it, i.e. 1 will always mean identical, and ~0 will mean that the two vectors are complements of each other (1/(1+dim)). If we are to scale the score based on the number of dimensions, we move this to (0, dimensions*8], which will effectively be the reverse of the distance. So, for example, if two vectors are identical they would have a score of dimensions * 8, whereas if one is the complement of the other, their score would be ~1 (dim/(1+dim)).

Don't have a strong opinion on this, happy to proceed with updating the normalization constant if you prefer.
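The two normalizations under discussion, side by side (illustrative helper names only, not part of the PR):

```java
// Illustrative comparison of the two score normalizations discussed above:
// bounded in (0, 1] vs scaled by the bit dimension.
public class HammingScores {
  // identical vectors score 1.0; complements approach 0
  public static float boundedScore(int distance) {
    return 1f / (1 + distance);
  }

  // identical vectors score dims * 8; complements approach ~1
  public static float dimensionScaledScore(int distance, int dims) {
    return (float) dims * 8 / (1 + distance);
  }

  public static void main(String[] args) {
    int dims = 8; // 8 bytes = 64 bits
    System.out.println(boundedScore(0));               // prints 1.0
    System.out.println(dimensionScaledScore(0, dims)); // prints 64.0
  }
}
```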
To me this looks fine now, benchmark on my Intel Laptop (Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz, 1992 MHz, 4 cores):
Benchmark (size) Mode Cnt Score Error Units
VectorUtilBenchmark.binaryHammingDistanceScalar 1 thrpt 15 257,630 ± 33,441 ops/us
VectorUtilBenchmark.binaryHammingDistanceScalar 128 thrpt 15 10,969 ± 0,473 ops/us
VectorUtilBenchmark.binaryHammingDistanceScalar 207 thrpt 15 7,462 ± 0,450 ops/us
VectorUtilBenchmark.binaryHammingDistanceScalar 256 thrpt 15 5,417 ± 0,845 ops/us
VectorUtilBenchmark.binaryHammingDistanceScalar 300 thrpt 15 4,762 ± 0,677 ops/us
VectorUtilBenchmark.binaryHammingDistanceScalar 512 thrpt 15 3,235 ± 0,048 ops/us
VectorUtilBenchmark.binaryHammingDistanceScalar 702 thrpt 15 2,397 ± 0,030 ops/us
VectorUtilBenchmark.binaryHammingDistanceScalar 1024 thrpt 15 1,637 ± 0,058 ops/us
VectorUtilBenchmark.binaryHammingDistanceVarHandle 1 thrpt 15 259,372 ± 13,421 ops/us
VectorUtilBenchmark.binaryHammingDistanceVarHandle 128 thrpt 15 58,047 ± 4,578 ops/us
VectorUtilBenchmark.binaryHammingDistanceVarHandle 207 thrpt 15 36,495 ± 0,949 ops/us
VectorUtilBenchmark.binaryHammingDistanceVarHandle 256 thrpt 15 40,539 ± 0,955 ops/us
VectorUtilBenchmark.binaryHammingDistanceVarHandle 300 thrpt 15 34,629 ± 0,211 ops/us
VectorUtilBenchmark.binaryHammingDistanceVarHandle 512 thrpt 15 23,137 ± 1,451 ops/us
VectorUtilBenchmark.binaryHammingDistanceVarHandle 702 thrpt 15 15,098 ± 2,335 ops/us
VectorUtilBenchmark.binaryHammingDistanceVarHandle 1024 thrpt 15 13,682 ± 0,189 ops/us
Deprecate it and warn of its imminent demise, or remove it? Either should be possible. For users, they would have to add code to normalize vectors and store the magnitude for vector reconstitution (if they are using cosine). This could be seen as a heavy lift, but I honestly don't know. I am fine with either option, though I think we should deprecate it, as existing indices that use the cosine similarity would still have to be read and searchable (and those vectors won't necessarily be normalized). Lucene 10 would need to read Lucene 9 indices, correct?
My question about supporting Lucene 9 indices is out of legit ignorance. I think we would still need to support reading and searching segments stored with Cosine in Lucene 10. But we could prevent NEW segments from being created using cosine. I am not sure how to do this off-hand, but I think it would be a good idea. |
In general, I'd like to rethink the pluggable VectorSimilarities (per field). IMHO, the VectorSimilarity class should NOT be an ENUM and instead be an SPI with a symbolic name (using

I'd like to open a new issue about this. @rmuir and I were a bit shocked about the increase of similarity functions in the last year. The vector similarity discussed here should then first go into the sandbox module, so we do not need to keep backwards compatibility.
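A rough sketch of the SPI direction being proposed (entirely hypothetical: a plain map stands in for a NamedSPILoader-style registry, and the interface shape is only a guess at what such an API could look like):

```java
// Hypothetical sketch: a similarity identified by a symbolic name and
// resolved via a registry, instead of an enum baked into the index format.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class SimilaritySpi {
  interface VectorSimilarity {
    String getName();                    // symbolic name written to the index
    float compare(byte[] v1, byte[] v2); // higher score = more similar
  }

  // stand-in for an SPI / NamedSPILoader lookup
  static final Map<String, VectorSimilarity> REGISTRY = new ConcurrentHashMap<>();

  static void register(VectorSimilarity sim) {
    REGISTRY.put(sim.getName(), sim);
  }

  static VectorSimilarity lookup(String name) {
    VectorSimilarity sim = REGISTRY.get(name);
    if (sim == null) {
      throw new IllegalArgumentException("unknown similarity: " + name);
    }
    return sim;
  }

  public static void main(String[] args) {
    register(new VectorSimilarity() {
      public String getName() { return "hamming"; }
      public float compare(byte[] v1, byte[] v2) {
        int distance = 0;
        for (int i = 0; i < v1.length; i++) {
          distance += Integer.bitCount((v1[i] ^ v2[i]) & 0xFF);
        }
        return 1f / (1 + distance);
      }
    });
    // resolved by name at read time, so new similarities don't force enum changes
    System.out.println(lookup("hamming").compare(new byte[] {0}, new byte[] {0})); // prints 1.0
  }
}
```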
Thanks Uwe, that's exactly what is needed. The problem I see is a very immature field (vector search) that has no way to add new features (distance functions) without permanently impacting backwards compatibility. Of course all the functions are "different". "Different" isn't enough for us to provide years of backwards compatibility.
I agree, an enum doesn't make sense. SPI with a name lookup seems best to me. Here is my original issue, which has since been closed: #12219. One difficulty is making sure the SPI interface is one we want (seems
All that can be part of the separate discussions. Thanks @uschindler & @rmuir !
Which of the current functions really need to be in core? I guess the problem I see is that there are 6 functions today, 3 float, 3 byte. The byte functions don't perform well and never will. They require 32 bits to return

Sure, if you have 10M vectors maybe you save a few megabytes, but if you have 10M vectors, that is such a drop in the bucket compared to the rest of your hardware that it would be better to just have used faster floating-point vectors.

So my question: which of all these 6 functions really needs to be supported in core? I don't think it's needed to have a byte variant for every float function either (this PR shows that). So we shouldn't add functions "just because", but consider the cost.
I've been experimenting with various potential optimisations and variants for some of these distance computations, and also pushing on some limitations of the Panama Vector API (in part to feed back into the JDK API). We should really be able to compare these things either on or off heap. In that way, I agree with the comparison function being

Lucene core should not try to handle all possible variants of bit-size and distance function combinations, but rather support a subset of these along with the ability to extend. Extension allows different folk to experiment more easily in this evolving area - this is effectively what I'm doing locally. Successful and interesting experiments, when proven, can then be proposed separately on their own merit, maybe as a Lucene extension or misc package, or maybe not at all. Ultimately, Lucene should benefit from "best in class" here, but not have to accept each and every variant into core.

The addition of well thought out minimal extension points - for scoring - would be of long-term benefit to Lucene. I'm happy to work on such, since I've been hacking around this area locally for a while now.
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!
Closing in deference to extensibility added in: #13288 |
This PR adds support for binary Hamming distance as a similarity metric for byte vectors. The motivation is the increasing interest in applying hashing techniques to embeddings (in both text & image based search applications, e.g. Binary Passage Retriever) due to the much reduced size & performance gains that one might get. A natural way to compare binary vectors is Hamming distance: compute the population count of the XOR of two embeddings, i.e. count how many bits differ between the two binary vectors.

In Lucene, we can leverage the existing byte[] support and use that to store the binary vectors. The size of the byte vectors would be d / 8, or (d / 8) + 1 if d % 8 > 0, where d is the dimension of a binary vector. So, for example, a binary vector of 64 bits could use a KnnByteVectorField with vectorDimension=8. However, this transformation is currently outside the scope of this PR.

To compute the Hamming distance, since we have a bounded pool of values ranging from Byte.MIN_VALUE to Byte.MAX_VALUE, this PR makes use of a lookup table to retrieve the appropriate population count. Similarly, for the Panama implementation, we rely on the approach discussed here to compute the population count of the low & high bits of the vectors' XOR result, also using a lookup table. To convert the computed distance to a similarity score, we finally normalize through 1 / (1 + hamming_distance).

Benchmarks for the scalar & vectorized implementations running on my M2 Pro (Neon) dev machine:

BINARY_HAMMING_DISTANCE is currently only supported for byte[] vectors, so I had to slightly refactor some tests to distinguish between float and byte versions of knn-search.
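As a sketch of the packing described above (a hypothetical helper, not part of this PR): d bits are stored in d / 8 bytes, plus one if d % 8 > 0, so 64 binary dimensions fit in a KnnByteVectorField with vectorDimension=8.

```java
// Hypothetical helper illustrating the d-bits-into-bytes packing described
// in the PR description; bit i of the binary vector lands in byte i / 8.
public class BitPacking {
  public static byte[] pack(boolean[] bits) {
    byte[] packed = new byte[(bits.length + 7) / 8]; // d / 8, +1 if d % 8 > 0
    for (int i = 0; i < bits.length; i++) {
      if (bits[i]) {
        packed[i / 8] |= (byte) (1 << (i % 8));
      }
    }
    return packed;
  }

  public static void main(String[] args) {
    boolean[] bits = new boolean[64]; // a 64-dimension binary vector
    bits[0] = bits[9] = true;
    byte[] packed = pack(bits);
    System.out.println(packed.length); // prints 8 (64 bits -> 8 bytes)
    System.out.println(packed[0]);     // prints 1 (bit 0 set)
    System.out.println(packed[1]);     // prints 2 (bit 9 -> bit 1 of byte 1)
  }
}
```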