
Adding binary Hamming distance as similarity option for byte vectors #13076

Closed
wants to merge 11 commits

Conversation


@pmpailis pmpailis commented Feb 5, 2024

This PR adds support for binary Hamming distance as a similarity metric for byte vectors. The motivation behind this is the increasing interest in applying hashing techniques to embeddings (in both text and image based search applications, e.g. Binary Passage Retriever) due to their much reduced size and the performance gains one might get. A natural way to compare binary vectors is Hamming distance, computed as the population count of the XOR of two embeddings, i.e. counting how many bits differ between the two binary vectors.

In Lucene, we can leverage the existing byte[] support and use that to store the binary vectors. The size of the byte vectors would be d / 8, or (d / 8) + 1 if d % 8 > 0, where d is the dimension of a binary vector. So, for example, a binary vector of 64 bits could use a KnnByteVectorField with vectorDimension=8. However, this transformation is currently outside the scope of this PR.
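For illustration, packing a d-bit binary embedding into such a byte[] could look like the following sketch (a hypothetical helper, not part of this PR):

  static byte[] packBits(boolean[] bits) {
    // (d / 8) bytes, plus one extra byte if d % 8 > 0
    byte[] packed = new byte[(bits.length + 7) / 8];
    for (int i = 0; i < bits.length; i++) {
      if (bits[i]) {
        packed[i / 8] |= (byte) (1 << (i % 8));
      }
    }
    return packed;
  }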

To compute the Hamming distance, since we have a bounded pool of values ranging from Byte.MIN_VALUE to Byte.MAX_VALUE, this PR makes use of a lookup table to retrieve the appropriate population count. Similarly, for the Panama implementation, we rely on the approach discussed here to compute the population count of the low and high bits of the vectors' XOR result, using a lookup table as well.
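Conceptually, the scalar lookup-table approach boils down to something like this simplified sketch (not the PR's exact code):

  // Precompute the population count for every possible byte value once,
  // then sum table entries for the XOR of each byte pair.
  static final int[] BIT_COUNTS = new int[256];

  static {
    for (int i = 0; i < 256; i++) {
      BIT_COUNTS[i] = Integer.bitCount(i);
    }
  }

  static int hammingLookup(byte[] a, byte[] b) {
    int distance = 0;
    for (int i = 0; i < a.length; i++) {
      distance += BIT_COUNTS[(a[i] ^ b[i]) & 0xFF];
    }
    return distance;
  }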

To convert the computed distance to a similarity score, we finally normalize it through 1 / (1 + hamming_distance).

Benchmarks for the scalar & vectorized implementations running on my M2 Pro (Neon) dev machine:

Benchmark                                        (size)   Mode  Cnt    Score   Error   Units
VectorUtilBenchmark.binaryHammingDistanceScalar       1  thrpt   15  514.613 ± 6.496  ops/us
VectorUtilBenchmark.binaryHammingDistanceScalar       8  thrpt   15  216.716 ± 1.682  ops/us
VectorUtilBenchmark.binaryHammingDistanceScalar      16  thrpt   15  135.528 ± 1.606  ops/us
VectorUtilBenchmark.binaryHammingDistanceScalar      32  thrpt   15   76.745 ± 0.654  ops/us
VectorUtilBenchmark.binaryHammingDistanceScalar      50  thrpt   15   52.226 ± 0.444  ops/us
VectorUtilBenchmark.binaryHammingDistanceScalar      64  thrpt   15   41.246 ± 0.139  ops/us
VectorUtilBenchmark.binaryHammingDistanceScalar     100  thrpt   15   29.119 ± 0.086  ops/us
VectorUtilBenchmark.binaryHammingDistanceScalar     128  thrpt   15   22.639 ± 0.138  ops/us
VectorUtilBenchmark.binaryHammingDistanceScalar     207  thrpt   15   14.382 ± 0.058  ops/us
VectorUtilBenchmark.binaryHammingDistanceScalar     256  thrpt   15   11.813 ± 0.060  ops/us
VectorUtilBenchmark.binaryHammingDistanceScalar     300  thrpt   15   10.253 ± 0.257  ops/us
VectorUtilBenchmark.binaryHammingDistanceScalar     512  thrpt   15    6.145 ± 0.015  ops/us
VectorUtilBenchmark.binaryHammingDistanceScalar     702  thrpt   15    4.461 ± 0.133  ops/us
VectorUtilBenchmark.binaryHammingDistanceScalar    1024  thrpt   15    3.091 ± 0.003  ops/us
VectorUtilBenchmark.binaryHammingDistanceVector       1  thrpt   15  499.861 ± 0.476  ops/us
VectorUtilBenchmark.binaryHammingDistanceVector       8  thrpt   15  191.430 ± 3.243  ops/us
VectorUtilBenchmark.binaryHammingDistanceVector      16  thrpt   15  298.697 ± 4.448  ops/us
VectorUtilBenchmark.binaryHammingDistanceVector      32  thrpt   15  222.700 ± 5.461  ops/us
VectorUtilBenchmark.binaryHammingDistanceVector      50  thrpt   15  129.853 ± 0.325  ops/us
VectorUtilBenchmark.binaryHammingDistanceVector      64  thrpt   15  156.657 ± 4.337  ops/us
VectorUtilBenchmark.binaryHammingDistanceVector     100  thrpt   15   83.879 ± 1.864  ops/us
VectorUtilBenchmark.binaryHammingDistanceVector     128  thrpt   15   88.028 ± 2.156  ops/us
VectorUtilBenchmark.binaryHammingDistanceVector     207  thrpt   15   43.573 ± 1.085  ops/us
VectorUtilBenchmark.binaryHammingDistanceVector     256  thrpt   15   49.415 ± 0.865  ops/us
VectorUtilBenchmark.binaryHammingDistanceVector     300  thrpt   15   34.535 ± 0.408  ops/us
VectorUtilBenchmark.binaryHammingDistanceVector     512  thrpt   15   25.953 ± 0.197  ops/us
VectorUtilBenchmark.binaryHammingDistanceVector     702  thrpt   15   17.483 ± 0.033  ops/us
VectorUtilBenchmark.binaryHammingDistanceVector    1024  thrpt   15   13.305 ± 0.085  ops/us

BINARY_HAMMING_DISTANCE is currently only supported for byte[] vectors so I had to slightly refactor some tests to distinguish between float and byte versions of knn-search.

@pmpailis pmpailis marked this pull request as draft February 5, 2024 12:57
@pmpailis pmpailis marked this pull request as ready for review February 5, 2024 13:15
Member

@benwtrent benwtrent left a comment

I think we should fail much earlier when using a KnnFloatVectorField with an inappropriate similarity function. Seems to me FieldType#setVectorAttributes should validate the encoding & similarity function compatibility.
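A minimal sketch of that kind of early validation (hypothetical code, assuming FieldType#setVectorAttributes keeps its dimensions/encoding/similarity parameters, and that BINARY_HAMMING_DISTANCE is the constant proposed in this PR):

  public void setVectorAttributes(
      int numDimensions, VectorEncoding encoding, VectorSimilarityFunction similarity) {
    if (encoding == VectorEncoding.FLOAT32
        && similarity == VectorSimilarityFunction.BINARY_HAMMING_DISTANCE) {
      throw new IllegalArgumentException(
          "Similarity " + similarity + " is only supported for " + VectorEncoding.BYTE + " vectors");
    }
    // ... existing attribute assignment ...
  }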

if (a.length != b.length) {
  throw new IllegalArgumentException("vector dimensions differ: " + a.length + "!=" + b.length);
}
return 1f / (1 + IMPL.binaryHammingDistance(a, b));
Member

This should return IMPL.binaryHammingDistance(a, b).

The users of this function will have to transform the distance to a score separately.


@Override
public float compare(byte[] v1, byte[] v2) {
  return binaryHammingDistance(v1, v2);
Member

The score transformation for the distance should be here. Note squareDistance
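In other words, roughly the following (mirroring how the euclidean/square-distance similarity applies its score transform inside compare(); this is also what a later revision in this thread ends up doing):

  @Override
  public float compare(byte[] v1, byte[] v2) {
    return 1f / (1 + binaryHammingDistance(v1, v2));
  }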

import org.apache.lucene.util.hnsw.HnswGraphBuilder;
import org.junit.After;
import org.junit.Before;

/** Tests indexing of a knn-graph */
public class TestKnnGraph extends LuceneTestCase {
Member

This is a fairly large refactor adding test coverage, but I think it detracts from this PR. Could you change all this back and restrict the similarity function when using floats?

There are so many abstract functions, etc., that it seems like this entire test case should be rewritten from the ground up if we were to 100% test byte vs. float for it.

Author

I was torn between excluding hamming distance from float fields and adjusting the tests, to also slightly increase coverage. But yeah, you're right that this is a bit outside the scope of this PR. I will revert the changes and simply exclude the similarity when we know it will throw.

Comment on lines 121 to 123
public Set<VectorEncoding> supportedVectorEncodings() {
return EnumSet.of(VectorEncoding.BYTE);
}
Member

Suggested change
public Set<VectorEncoding> supportedVectorEncodings() {
return EnumSet.of(VectorEncoding.BYTE);
}
public boolean supportsVectorEncoding(VectorEncoding encoding) {
  return encoding == VectorEncoding.BYTE;
}

I am not sure why this returns a set. Do we really do anything with the set except check membership?

Comment on lines 146 to 155

/**
* Defines which encodings are supported by the similarity function - used in tests to control
* randomization
*
* @return a list of all supported VectorEncodings for the given similarity
*/
public Set<VectorEncoding> supportedVectorEncodings() {
return EnumSet.of(VectorEncoding.BYTE, VectorEncoding.FLOAT32);
}
Member

Do we do anything with the set except check membership? I think a simple bool supportsVectorEncoding(VectorEncoding) function would be preferred.

Comment on lines 43 to 45
case BINARY_HAMMING_DISTANCE -> throw new IllegalArgumentException(
"Cannot use Hamming distance with scalar quantization");
};
Member

It would be good if the error message had the literal BINARY_HAMMING_DISTANCE string to indicate the similarity enumeration we failed at.
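For example (illustrative wording only):

  case BINARY_HAMMING_DISTANCE -> throw new IllegalArgumentException(
      "Cannot use BINARY_HAMMING_DISTANCE with scalar quantization");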

@benwtrent
Member

@pmpailis could you also push a CHANGES.txt update? It would be under New Features for Lucene 9.10.0.

// Need to break up the total ByteVector as the result might not
// fit in a byte
var acc1 = total.castShape(ShortVector.SPECIES_512, 0);
var acc2 = total.castShape(ShortVector.SPECIES_512, 1);
Member

Vector castShape() with part number > 0 really needs to be avoided. It is incredibly slow. Have you benchmarked non-mac machines with 256 or 512-bit vectors?

@rmuir
Member

rmuir commented Feb 5, 2024

I'm confused about the use of a lookup table. naively, i'd try to just xor + popcnt:

https://docs.oracle.com/en/java/javase/21/docs/api/jdk.incubator.vector/jdk/incubator/vector/VectorOperators.html#XOR
https://docs.oracle.com/en/java/javase/21/docs/api/jdk.incubator.vector/jdk/incubator/vector/VectorOperators.html#BIT_COUNT

I'm curious if any explicit vector code is needed actually at all. Integer.bitCount() has autovectorization support in hotspot.
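For reference, a rough sketch of that XOR + BIT_COUNT idea with the Panama Vector API might look like the following (an illustrative, unbenchmarked sketch, not code from this PR; it uses jdk.incubator.vector.ByteVector, VectorOperators, and VectorSpecies and needs --add-modules jdk.incubator.vector):

  static int hammingXorPopcntSketch(byte[] a, byte[] b) {
    final VectorSpecies<Byte> species = ByteVector.SPECIES_PREFERRED;
    long distance = 0;
    int i = 0;
    for (; i < species.loopBound(a.length); i += species.length()) {
      ByteVector va = ByteVector.fromArray(species, a, i);
      ByteVector vb = ByteVector.fromArray(species, b, i);
      // XOR the byte lanes, reinterpret as 64-bit lanes, count bits per lane, then sum.
      distance += va.lanewise(VectorOperators.XOR, vb)
          .reinterpretAsLongs()
          .lanewise(VectorOperators.BIT_COUNT)
          .reduceLanes(VectorOperators.ADD);
    }
    for (; i < a.length; i++) { // scalar tail
      distance += Integer.bitCount((a[i] ^ b[i]) & 0xFF);
    }
    return (int) distance;
  }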

@rmuir
Member

rmuir commented Feb 5, 2024

even if it doesn't autovectorize, i suspect just gathering e.g. 4/8 bytes at a time with BitUtil varhandle and using single int/long xor + popcount would perform very well as a baseline.

@uschindler
Contributor

Hi,
I don't want to discuss the sense/nonsense of this distance, but the implementation could be made very simple, and then we may not even need a Panama Vector variant:

@uschindler
Contributor

The native order PR was merged.

@uschindler
Contributor

uschindler commented Feb 5, 2024

Hi,
I modified the scalar variant like this:

  @Override
  public int binaryHammingDistance(byte[] a, byte[] b) {
    int distance = 0, i = 0;
    for (final int upperBound = a.length & ~(Long.BYTES - 1); i < upperBound; i += Long.BYTES) {
      distance += Long.bitCount(((long) BitUtil.VH_NATIVE_LONG.get(a, i) ^ (long) BitUtil.VH_NATIVE_LONG.get(b, i)) & 0xFFFFFFFFFFFFFFFFL);
    }
    for (final int upperBound = a.length & ~(Integer.BYTES - 1); i < upperBound; i += Integer.BYTES) {
      distance += Integer.bitCount(((int) BitUtil.VH_NATIVE_INT.get(a, i) ^ (int) BitUtil.VH_NATIVE_INT.get(b, i)) & 0xFFFFFFFF);
    }
    for (; i < a.length; i++) {
      distance += Integer.bitCount((a[i] ^ b[i]) & 0xFF);
    }
    return distance;
  }

This one only uses the popcnt CPU instruction. I then ran your benchmark to compare the Panama-vectorized one with my new implementation as above:

Benchmark                                        (size)   Mode  Cnt    Score    Error   Units
VectorUtilBenchmark.binaryHammingDistanceScalar       1  thrpt   15  258,511 ± 17,969  ops/us
VectorUtilBenchmark.binaryHammingDistanceScalar     128  thrpt   15   62,364 ±  0,723  ops/us
VectorUtilBenchmark.binaryHammingDistanceScalar     207  thrpt   15   40,302 ±  0,703  ops/us
VectorUtilBenchmark.binaryHammingDistanceScalar     256  thrpt   15   42,025 ±  0,891  ops/us
VectorUtilBenchmark.binaryHammingDistanceScalar     300  thrpt   15   35,065 ±  3,125  ops/us
VectorUtilBenchmark.binaryHammingDistanceScalar     512  thrpt   15   24,391 ±  1,987  ops/us
VectorUtilBenchmark.binaryHammingDistanceScalar     702  thrpt   15   17,505 ±  0,684  ops/us
VectorUtilBenchmark.binaryHammingDistanceScalar    1024  thrpt   15   13,806 ±  0,102  ops/us
VectorUtilBenchmark.binaryHammingDistanceVector       1  thrpt   15  231,651 ±  9,975  ops/us
VectorUtilBenchmark.binaryHammingDistanceVector     128  thrpt   15   16,760 ±  0,251  ops/us
VectorUtilBenchmark.binaryHammingDistanceVector     207  thrpt   15   10,317 ±  0,251  ops/us
VectorUtilBenchmark.binaryHammingDistanceVector     256  thrpt   15    8,887 ±  0,559  ops/us
VectorUtilBenchmark.binaryHammingDistanceVector     300  thrpt   15    7,466 ±  0,345  ops/us
VectorUtilBenchmark.binaryHammingDistanceVector     512  thrpt   15    4,706 ±  0,080  ops/us
VectorUtilBenchmark.binaryHammingDistanceVector     702  thrpt   15    3,062 ±  0,566  ops/us
VectorUtilBenchmark.binaryHammingDistanceVector    1024  thrpt   15    2,404 ±  0,024  ops/us

Please note: the SCALAR one here is the above variant. As this one is four (!) times faster than the Panama variant, there is no need for a Panama-vectorized one, and it looks like it is not working well at all. I think the lookup table is a bad idea.

To make it short:

  • Remove the impl and the table from VectorUtilSupport (scalar and vectorized)
  • Implement my above method directly in the VectorUtil class

This will be a short PR. Make sure to add more tests (there's only one for the VectorUtil method).

@uschindler
Contributor

I am not sure if we really need the Integer tail. Maybe only implement the Long variant and the tail.

@rmuir
Member

rmuir commented Feb 5, 2024

Seems to autovectorize just fine, i took uwe's branch and dumped assembly on my AVX2 machine and see e.g. 256-bit xor and population count logic. I checked the logic in openjdk and it will use vpopcntdq on AVX-512 if available, etc. So this solution is much better than some explicit vector stuff because it will do the right thing depending on CPU.

...
   0.35%            0x00007fffe0141fa3:   vmovdqu 0x10(%rax,%r8,1),%ymm9
                    0x00007fffe0141faa:   vpxor  0x10(%rdx,%r8,1),%ymm9,%ymm9
   0.04%            0x00007fffe0141fb1:   movabs $0xf0f0f0f,%r8
   0.35%            0x00007fffe0141fbb:   vmovq  %r8,%xmm10
                    0x00007fffe0141fc0:   vpbroadcastd %xmm10,%ymm10
                    0x00007fffe0141fc5:   vpsrlw $0x4,%ymm9,%ymm11
                    0x00007fffe0141fcb:   vpand  %ymm10,%ymm11,%ymm11
   0.40%            0x00007fffe0141fd0:   vpand  %ymm10,%ymm9,%ymm10
                    0x00007fffe0141fd5:   vmovdqu -0x59829d(%rip),%ymm12        # Stub::popcount_lut
                                                                              ;   {external_word}
                    0x00007fffe0141fdd:   vpshufb %ymm10,%ymm12,%ymm10
   0.02%            0x00007fffe0141fe2:   vpshufb %ymm11,%ymm12,%ymm11
   0.48%            0x00007fffe0141fe7:   vpaddb %ymm10,%ymm11,%ymm11
                    0x00007fffe0141fec:   vpxor  %ymm12,%ymm12,%ymm12
                    0x00007fffe0141ff1:   vpsadbw %ymm12,%ymm11,%ymm10
   0.07%            0x00007fffe0141ff6:   vpermilps $0x8,%ymm10,%ymm9
   0.35%            0x00007fffe0141ffc:   vpermpd $0x8,%ymm9,%ymm9
                    0x00007fffe0142002:   vpaddd %xmm9,%xmm1,%xmm1 
...

@uschindler
Contributor

uschindler commented Feb 5, 2024

I removed the integer tail and saw no difference (I especially also looked at the non-aligned sizes):

  @Override
  public int binaryHammingDistance(byte[] a, byte[] b) {
    int distance = 0, i = 0;
    for (final int upperBound = a.length & ~(Long.BYTES - 1); i < upperBound; i += Long.BYTES) {
      distance += Long.bitCount(((long) BitUtil.VH_NATIVE_LONG.get(a, i) ^ (long) BitUtil.VH_NATIVE_LONG.get(b, i)) & 0xFFFFFFFFFFFFFFFFL);
    }
    // tail:
    for (; i < a.length; i++) {
      distance += Integer.bitCount((a[i] ^ b[i]) & 0xFF);
    }
    return distance;
  }

I think this code is the simplest and most effective. It may not be the best for vectors with (length % 8) == 5...7, but I think we can live with that.

@uschindler
Contributor

Here's my branch: main...uschindler:lucene:binary_hamming_distance

I can merge this into this branch, but the code cleanup and removal of useless vectorization and those (public!!!!) lookup tables needs to be done after the merge.

@rmuir
Member

rmuir commented Feb 5, 2024

Thanks @uschindler , this is the way to go: compiler does a good job. java already has all the necessary logic here to autovectorize and use e.g. vpopcntdq or AVX2 lookup-table counting algorithm depending on the cpu features detected.

@uschindler
Contributor

I figured that the & 0xFFFF.... is useless. You only need it when widening into int. Will update my branch and paste code here.

@uschindler
Contributor

This is my final code:

  @Override
  public int binaryHammingDistance(byte[] a, byte[] b) {
    int distance = 0, i = 0;
    for (final int upperBound = a.length & ~(Long.BYTES - 1); i < upperBound; i += Long.BYTES) {
      distance += Long.bitCount((long) BitUtil.VH_NATIVE_LONG.get(a, i) ^ (long) BitUtil.VH_NATIVE_LONG.get(b, i));
    }
    // tail:
    for (; i < a.length; i++) {
      distance += Integer.bitCount((a[i] ^ b[i]) & 0xFF);
    }
    return distance;
  }

@pmpailis
Author

pmpailis commented Feb 6, 2024

Thank you so much @rmuir & @uschindler for taking such a close look and also running benchmarks. 🙇 The reason I went with the lookup table was that there seemed to be some improvement on Neon compared to Integer.bitCount (I hadn't checked using a VarHandle, to be fair), and although I wasn't fond of the explicit lookup table either, in case we went ahead with something like that I was hoping to discuss a better alternative (also, the vector-based results seem quite different).

I added the changes to use VarHandle and re-ran the benchmarks. The following are from my local dev machine (Neon):

Benchmark                                             (size)   Mode  Cnt    Score    Error   Units
VectorUtilBenchmark.binaryHammingDistanceIntBitCount       1  thrpt   15  488.021 ±  4.800  ops/us
VectorUtilBenchmark.binaryHammingDistanceIntBitCount     128  thrpt   15    5.896 ±  0.038  ops/us
VectorUtilBenchmark.binaryHammingDistanceIntBitCount     207  thrpt   15    4.420 ±  0.065  ops/us
VectorUtilBenchmark.binaryHammingDistanceIntBitCount     256  thrpt   15    3.589 ±  0.032  ops/us
VectorUtilBenchmark.binaryHammingDistanceIntBitCount     300  thrpt   15    3.123 ±  0.040  ops/us
VectorUtilBenchmark.binaryHammingDistanceIntBitCount     512  thrpt   15    1.854 ±  0.017  ops/us
VectorUtilBenchmark.binaryHammingDistanceIntBitCount     702  thrpt   15    1.348 ±  0.045  ops/us
VectorUtilBenchmark.binaryHammingDistanceIntBitCount    1024  thrpt   15    0.938 ±  0.015  ops/us

VectorUtilBenchmark.binaryHammingDistanceLookupTable       1  thrpt   15  502.334 ± 16.595  ops/us
VectorUtilBenchmark.binaryHammingDistanceLookupTable     128  thrpt   15   18.142 ±  0.508  ops/us
VectorUtilBenchmark.binaryHammingDistanceLookupTable     207  thrpt   15   11.611 ±  0.367  ops/us
VectorUtilBenchmark.binaryHammingDistanceLookupTable     256  thrpt   15    9.426 ±  0.124  ops/us
VectorUtilBenchmark.binaryHammingDistanceLookupTable     300  thrpt   15    7.932 ±  0.254  ops/us
VectorUtilBenchmark.binaryHammingDistanceLookupTable     512  thrpt   15    4.762 ±  0.116  ops/us
VectorUtilBenchmark.binaryHammingDistanceLookupTable     702  thrpt   15    3.532 ±  0.018  ops/us
VectorUtilBenchmark.binaryHammingDistanceLookupTable    1024  thrpt   15    2.425 ±  0.016  ops/us

VectorUtilBenchmark.binaryHammingDistanceVarHandle         1  thrpt   15  473.315 ±  5.442  ops/us
VectorUtilBenchmark.binaryHammingDistanceVarHandle       128  thrpt   15   27.318 ±  0.152  ops/us
VectorUtilBenchmark.binaryHammingDistanceVarHandle       207  thrpt   15   16.651 ±  0.540  ops/us
VectorUtilBenchmark.binaryHammingDistanceVarHandle       256  thrpt   15   14.506 ±  0.046  ops/us
VectorUtilBenchmark.binaryHammingDistanceVarHandle       300  thrpt   15   12.170 ±  0.023  ops/us
VectorUtilBenchmark.binaryHammingDistanceVarHandle       512  thrpt   15    7.478 ±  0.020  ops/us
VectorUtilBenchmark.binaryHammingDistanceVarHandle       702  thrpt   15    5.157 ±  0.314  ops/us
VectorUtilBenchmark.binaryHammingDistanceVarHandle      1024  thrpt   15    3.677 ±  0.085  ops/us

VectorUtilBenchmark.binaryHammingDistanceVector            1  thrpt   15  491.316 ± 14.116  ops/us
VectorUtilBenchmark.binaryHammingDistanceVector          128  thrpt   15   87.343 ±  2.689  ops/us
VectorUtilBenchmark.binaryHammingDistanceVector          207  thrpt   15   43.176 ±  1.220  ops/us
VectorUtilBenchmark.binaryHammingDistanceVector          256  thrpt   15   48.915 ±  0.477  ops/us
VectorUtilBenchmark.binaryHammingDistanceVector          300  thrpt   15   34.555 ±  0.326  ops/us
VectorUtilBenchmark.binaryHammingDistanceVector          512  thrpt   15   26.251 ±  0.284  ops/us
VectorUtilBenchmark.binaryHammingDistanceVector          702  thrpt   15   17.679 ±  0.204  ops/us
VectorUtilBenchmark.binaryHammingDistanceVector         1024  thrpt   15   13.717 ±  0.056  ops/us

I also ran the same experiments on a Xeon cloud instance, with the following results:

Benchmark                                             (size)   Mode  Cnt    Score   Error   Units
VectorUtilBenchmark.binaryHammingDistanceIntBitCount       1  thrpt   15  407.490 ± 1.681  ops/us
VectorUtilBenchmark.binaryHammingDistanceIntBitCount     128  thrpt   15   13.283 ± 0.033  ops/us
VectorUtilBenchmark.binaryHammingDistanceIntBitCount     207  thrpt   15    8.201 ± 0.194  ops/us
VectorUtilBenchmark.binaryHammingDistanceIntBitCount     256  thrpt   15    6.775 ± 0.124  ops/us
VectorUtilBenchmark.binaryHammingDistanceIntBitCount     300  thrpt   15    5.658 ± 0.159  ops/us
VectorUtilBenchmark.binaryHammingDistanceIntBitCount     512  thrpt   15    3.488 ± 0.099  ops/us
VectorUtilBenchmark.binaryHammingDistanceIntBitCount     702  thrpt   15    2.588 ± 0.046  ops/us
VectorUtilBenchmark.binaryHammingDistanceIntBitCount    1024  thrpt   15    1.866 ± 0.009  ops/us

VectorUtilBenchmark.binaryHammingDistanceLookupTable       1  thrpt   15  319.515 ± 0.776  ops/us
VectorUtilBenchmark.binaryHammingDistanceLookupTable     128  thrpt   15   16.192 ± 0.222  ops/us
VectorUtilBenchmark.binaryHammingDistanceLookupTable     207  thrpt   15    9.828 ± 0.057  ops/us
VectorUtilBenchmark.binaryHammingDistanceLookupTable     256  thrpt   15    7.082 ± 0.044  ops/us
VectorUtilBenchmark.binaryHammingDistanceLookupTable     300  thrpt   15    6.120 ± 0.090  ops/us
VectorUtilBenchmark.binaryHammingDistanceLookupTable     512  thrpt   15    4.043 ± 0.058  ops/us
VectorUtilBenchmark.binaryHammingDistanceLookupTable     702  thrpt   15    2.625 ± 0.047  ops/us
VectorUtilBenchmark.binaryHammingDistanceLookupTable    1024  thrpt   15    1.954 ± 0.008  ops/us

VectorUtilBenchmark.binaryHammingDistanceVarHandle         1  thrpt   15  344.508 ± 1.039  ops/us
VectorUtilBenchmark.binaryHammingDistanceVarHandle       128  thrpt   15  101.425 ± 1.319  ops/us
VectorUtilBenchmark.binaryHammingDistanceVarHandle       207  thrpt   15   56.693 ± 6.604  ops/us
VectorUtilBenchmark.binaryHammingDistanceVarHandle       256  thrpt   15   76.473 ± 0.201  ops/us
VectorUtilBenchmark.binaryHammingDistanceVarHandle       300  thrpt   15   58.439 ± 1.204  ops/us
VectorUtilBenchmark.binaryHammingDistanceVarHandle       512  thrpt   15   50.839 ± 1.050  ops/us
VectorUtilBenchmark.binaryHammingDistanceVarHandle       702  thrpt   15   42.945 ± 0.974  ops/us
VectorUtilBenchmark.binaryHammingDistanceVarHandle      1024  thrpt   15   38.331 ± 0.215  ops/us

VectorUtilBenchmark.binaryHammingDistanceVector512         1  thrpt   15  281.455 ± 1.110  ops/us
VectorUtilBenchmark.binaryHammingDistanceVector512       128  thrpt   15   31.618 ± 0.277  ops/us
VectorUtilBenchmark.binaryHammingDistanceVector512       207  thrpt   15   19.928 ± 0.091  ops/us
VectorUtilBenchmark.binaryHammingDistanceVector512       256  thrpt   15   16.684 ± 0.066  ops/us
VectorUtilBenchmark.binaryHammingDistanceVector512       300  thrpt   15   11.351 ± 0.065  ops/us
VectorUtilBenchmark.binaryHammingDistanceVector512       512  thrpt   15    8.520 ± 0.179  ops/us
VectorUtilBenchmark.binaryHammingDistanceVector512       702  thrpt   15    5.596 ± 0.012  ops/us
VectorUtilBenchmark.binaryHammingDistanceVector512      1024  thrpt   15    4.352 ± 0.021  ops/us

VectorUtilBenchmark.binaryHammingDistanceVector256         1  thrpt   15  280.541 ± 3.963  ops/us
VectorUtilBenchmark.binaryHammingDistanceVector256       128  thrpt   15   22.965 ± 0.386  ops/us
VectorUtilBenchmark.binaryHammingDistanceVector256       207  thrpt   15   14.085 ± 0.278  ops/us
VectorUtilBenchmark.binaryHammingDistanceVector256       256  thrpt   15   12.248 ± 0.180  ops/us
VectorUtilBenchmark.binaryHammingDistanceVector256       300  thrpt   15   10.086 ± 0.220  ops/us
VectorUtilBenchmark.binaryHammingDistanceVector256       512  thrpt   15    6.216 ± 0.022  ops/us
VectorUtilBenchmark.binaryHammingDistanceVector256       702  thrpt   15    4.288 ± 0.064  ops/us
VectorUtilBenchmark.binaryHammingDistanceVector256      1024  thrpt   15    3.164 ± 0.007  ops/us

VectorUtilBenchmark.binaryHammingDistanceVector128         1  thrpt   15  281.373 ± 1.142  ops/us
VectorUtilBenchmark.binaryHammingDistanceVector128       128  thrpt   15   27.610 ± 0.741  ops/us
VectorUtilBenchmark.binaryHammingDistanceVector128       207  thrpt   15   16.567 ± 0.165  ops/us
VectorUtilBenchmark.binaryHammingDistanceVector128       256  thrpt   15   14.946 ± 0.381  ops/us
VectorUtilBenchmark.binaryHammingDistanceVector128       300  thrpt   15   11.887 ± 0.032  ops/us
VectorUtilBenchmark.binaryHammingDistanceVector128       512  thrpt   15    7.735 ± 0.108  ops/us
VectorUtilBenchmark.binaryHammingDistanceVector128       702  thrpt   15    5.430 ± 0.120  ops/us
VectorUtilBenchmark.binaryHammingDistanceVector128      1024  thrpt   15    3.870 ± 0.083  ops/us

where VarHandle clearly outperforms all other solutions.

As suggested, I'll proceed with adding this as the main and only implementation of hamming distance and remove both the Panama one and the leftovers from the existing implementation (i.e. lookup table).

@uschindler
Contributor

Please also add a test like the panama vs scalar one where you compare the results of the varhandle variant with the simple byte-by-byte one from the tail loop. Make sure to use interesting vector lengths which are not multiples of 8.

There is at the moment only one static test without randomization in TestVectorUtil.
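A rough sketch of the kind of randomized check being asked for (hypothetical test code in the LuceneTestCase style; binaryHammingDistance is the method proposed in this PR):

  public void testBinaryHammingDistanceMatchesNaive() {
    // deliberately include lengths that are not multiples of 8 to exercise the tail loop
    int dim = 1 + random().nextInt(200);
    byte[] a = new byte[dim];
    byte[] b = new byte[dim];
    random().nextBytes(a);
    random().nextBytes(b);
    int expected = 0;
    for (int i = 0; i < dim; i++) {
      expected += Integer.bitCount((a[i] ^ b[i]) & 0xFF);
    }
    assertEquals(expected, VectorUtil.binaryHammingDistance(a, b));
  }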

@uschindler
Contributor

uschindler commented Feb 6, 2024

About NEON: Robert checked yesterday. There is a lot going on in Hotspot and optimizations are added all the time.

If NEON is slower on your machine, it might be that there's still some optimization missing. Also, the Apple NEON machines are a bit limited in their capabilities, so they're not representative of the whole modern ARM infrastructure.

Thanks anyways for verifying the results on x86.

You can keep the benchmark and maybe also add one without the varhandle to allow benchmarking the varhandle vs scalar one on older jdks.

@uschindler
Contributor

P.S. the long support for bit count was added recently on x86. We may also compare with the integer one using the integer var handle (that's easy to check). Maybe that performs better on Neon.

In general, as 64-bit optimizations for integer operations are added to Hotspot all the time, we should stay with longs.

@pmpailis
Author

pmpailis commented Feb 6, 2024

Thanks for the suggestion @uschindler - will add the suggested variant to the benchmarks! To be honest, the reason I re-ran on x86 was mainly the vector performance differences (hence why I went for the Panama impl in the first place), trying to make sure that I wasn't imagining numbers 😅. But good to see that we won't have to overcomplicate things, as I pretty much did when I initially approached this (rather simple) issue :)

@rmuir
Member

rmuir commented Feb 6, 2024

My question is why add this function when it's not that much faster than integer dot product? I see less than 20 percent improvement, which won't even translate to 20 percent indexing/search.

The issue is that folks just want to add, add, add these functions yet there are no ways to remove any function from this list ( they will scream "bwc" ).

So although this particular function is less annoying than others from a performance perspective, I'm -1 on adding it for this reason without any plans to address this.

@rmuir
Member

rmuir commented Feb 6, 2024

A good way to get in a new function would be to actually improve our support o&m by removing a horribly performing one such as cosine first. That way we are actually improving rather than just piling on more code.

@uschindler
Contributor

uschindler commented Feb 6, 2024

My question is why add this function when it's not that much faster than integer dot product? I see less than 20 percent improvement, which won't even translate to 20 percent indexing/search.

I think the idea is to have shorter vectors and so it is faster. With Hamming you encode multiple dimensions per byte (a byte component is a vector of 8 dimensions). So when you want 512 dimensions, you need a byte vector dimension of 64 to encode that.

@benwtrent
Member

My question is why add this function when it's not that much faster than integer dot product?

Because it provides different scores. Integer dot-product doesn't provide the same values (angle between vectors) and doesn't work for binary encoded data (vs. euclidean bit distance).

Hamming distance is more like euclidean. It is possible to do "hamming distance things" now, if users specifically give [0, 1, 0, 1, 1...] and use euclidean, but this has obvious drawbacks (8x more vector operations and vector dims are 8x bigger).

And before you suggest "lets remove euclidean then", they are not compatible other than users providing literal 1s/0s.

The issue is that folks just want to add, add, add these functions yet there are no ways to remove any function from this list ( they will scream "bwc" ).

If you are against this & will block it, then we need to provide a clean way for users to introduce their own similarities.

I suggested making similarities pluggable in the past, but got shot down.

A good way to get in a new function would be to actually improve our support o&m by removing a horribly performing one such as cosine first. That way we are actually improving rather than just piling on more code.

If hamming and cosine were comparable, then sure. But they are not.

I do agree cosine should probably be removed (not because of hamming distance), but because dot_product exists.

@uschindler
Contributor

I do agree cosine should probably be removed (not because of hamming distance), but because dot_product exists.

Can we do that for Lucene 10.0?


@Override
public float compare(byte[] v1, byte[] v2) {
  return (1f / (1 + binaryHammingDistance(v1, v2)));
Contributor

This depends on the vector length; is this intended? I would have expected something like dimensions * 8 / (1 + distance). I know it is not relevant for scoring purposes, as it is a constant factor, but we have some normalization on other functions, too.

Author

I see your point. The initial idea was to have the score bounded in (0, 1] so as to have a more "natural" way of interpreting it, i.e. 1 will always mean identical, and ~0 will mean that the two vectors are complements of each other (1/(1+dim)). If we were to scale the score based on the number of dimensions, we would move this to (0, dimensions*8], which would effectively be the reverse of the distance. So, for example, if two vectors are identical they would have a score of dimensions * 8, whereas if one is the complement of the other, their score would be ~1 (dim/(1+dim)).
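For instance, taking a 64-byte vector (512 bits): with the current normalization, identical vectors score 1/(1+0) = 1 and exact complements score 1/(1+512) ≈ 0.002, whereas with the dimension-scaled variant identical vectors would score 512 and exact complements 512/513 ≈ 1.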

Don't have a strong opinion on this, happy to proceed with updating the normalization constant if you prefer.

Contributor

@uschindler uschindler left a comment

To me this looks fine now, benchmark on my Intel Laptop (Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz, 1992 MHz, 4 cores):

Benchmark                                           (size)   Mode  Cnt    Score    Error   Units
VectorUtilBenchmark.binaryHammingDistanceScalar          1  thrpt   15  257,630 ± 33,441  ops/us
VectorUtilBenchmark.binaryHammingDistanceScalar        128  thrpt   15   10,969 ±  0,473  ops/us
VectorUtilBenchmark.binaryHammingDistanceScalar        207  thrpt   15    7,462 ±  0,450  ops/us
VectorUtilBenchmark.binaryHammingDistanceScalar        256  thrpt   15    5,417 ±  0,845  ops/us
VectorUtilBenchmark.binaryHammingDistanceScalar        300  thrpt   15    4,762 ±  0,677  ops/us
VectorUtilBenchmark.binaryHammingDistanceScalar        512  thrpt   15    3,235 ±  0,048  ops/us
VectorUtilBenchmark.binaryHammingDistanceScalar        702  thrpt   15    2,397 ±  0,030  ops/us
VectorUtilBenchmark.binaryHammingDistanceScalar       1024  thrpt   15    1,637 ±  0,058  ops/us
VectorUtilBenchmark.binaryHammingDistanceVarHandle       1  thrpt   15  259,372 ± 13,421  ops/us
VectorUtilBenchmark.binaryHammingDistanceVarHandle     128  thrpt   15   58,047 ±  4,578  ops/us
VectorUtilBenchmark.binaryHammingDistanceVarHandle     207  thrpt   15   36,495 ±  0,949  ops/us
VectorUtilBenchmark.binaryHammingDistanceVarHandle     256  thrpt   15   40,539 ±  0,955  ops/us
VectorUtilBenchmark.binaryHammingDistanceVarHandle     300  thrpt   15   34,629 ±  0,211  ops/us
VectorUtilBenchmark.binaryHammingDistanceVarHandle     512  thrpt   15   23,137 ±  1,451  ops/us
VectorUtilBenchmark.binaryHammingDistanceVarHandle     702  thrpt   15   15,098 ±  2,335  ops/us
VectorUtilBenchmark.binaryHammingDistanceVarHandle    1024  thrpt   15   13,682 ±  0,189  ops/us

@benwtrent
Member

Can we do that for Lucene 10.0?

Deprecate it and warn of its imminent demise, or remove it?

Either should be possible. For users, they would have to add code to normalize vectors and store the magnitude for vector reconstitution (if they are using cosine). This could be seen as a heavy lift, but I honestly don't know.

I am fine with either option, though I think we would be deprecating it, as existing indices that use the cosine similarity would still have to be readable and searchable (and those vectors won't necessarily be normalized).

Lucene 10 would need to read Lucene 9 indices, correct?

@benwtrent
Member

My question about supporting Lucene 9 indices is out of legit ignorance. I think we would still need to support reading and searching segments stored with Cosine in Lucene 10. But we could prevent NEW segments from being created using cosine. I am not sure how to do this off-hand, but I think it would be a good idea.

@rmuir @uschindler

@uschindler
Contributor

In general, I'd like to rethink the pluggable VectorSimilarities (per field). IMHO, the VectorSimilarity class should NOT be an ENUM and instead be an SPI with a symbolic name (using NamedSPILoader for the lookup), and the name should be stored in FieldInfo.
I don't know how the current serialization to fieldinfos is done, but if it just stores the ENUM ordinal number we have a problem anyways (we can't remove constants then). If this is the case it would be top priority to change from ordinals to SPI names, because we can't remove enum constants if only the ordinal is used. For backwards compatibility we should have a hardcoded mapping of the old lookup keys in the older fieldinfos format.
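Purely as an illustration of the shape such an SPI might take (hypothetical code, loosely modeled on how other named SPIs in Lucene use NamedSPILoader; nothing like this exists as written):

  public abstract class VectorSimilarity implements NamedSPILoader.NamedSPI {
    private final String name;

    protected VectorSimilarity(String name) {
      // symbolic name, stored in FieldInfo instead of an enum ordinal
      this.name = name;
    }

    @Override
    public String getName() {
      return name;
    }

    /** Looks up a similarity implementation by its SPI name. */
    public static VectorSimilarity forName(String name) {
      return Holder.LOADER.lookup(name);
    }

    public abstract float compare(float[] v1, float[] v2);

    public abstract float compare(byte[] v1, byte[] v2);

    private static final class Holder {
      static final NamedSPILoader<VectorSimilarity> LOADER =
          new NamedSPILoader<>(VectorSimilarity.class);
    }
  }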

I'd like to open a new issue about this. @rmuir and I were a bit shocked about the increase of similarity functions in the last year.

This vector similarity discussed here should then first go into the sandbox module, so we do not need to keep backwards compatibility.

@rmuir
Member

rmuir commented Feb 6, 2024

Thanks Uwe, that's exactly what is needed. The problem I see is a very immature field (vector search) that has no way to add new features (distance functions) without permanently impacting backwards compatibility.

Of course all the functions are "different". "Different" isn't enough for us to provide years of backwards compatibility.

@benwtrent
Member

IMHO, the VectorSimilarity class should NOT be an ENUM and instead be an SPI with a symbolic name (using NamedSPILoader for the lookup) and the name should be stored in FieldInfo.

I agree, enum doesn't make sense. SPI with a name lookup seems best to me. Here is my original issue that has since been closed. #12219

One difficulty is making sure the SPI interface is one we want (float[] & byte[] seem too restrictive?). Some other work from @ChrisHegarty (#12703) shows that we may want to move away from float[], float[] and potentially toward an interface like:

score(int vectorOrdinal1, int vectorOrdinal2)
score(float[] queryVector, int vectorOrdinal)
score(byte[] queryVector, int vectorOrdinal)

All that can be part of the separate discussions. Thanks @uschindler & @rmuir !

@rmuir
Member

rmuir commented Feb 6, 2024

which of the current functions really need to be in core? I guess the problem I see is that there are 6 functions today, 3 float, 3 byte.

The byte functions don't perform well and never will. They require 32 bits to return an int result, so they aren't any faster than 32-bit floats, just more overhead.

Sure, if you have 10M vectors maybe you save a few megabytes, but if you have 10M vectors, that is such a drop in the bucket compared to the rest of your hardware, that it would be better to just have used faster floating-point vectors.

So my question: which of all these 6 functions really needs to be supported in core? I don't think its needed to have a byte variant for every float function either (this PR shows that). So we shouldn't add functions "just because", but consider the cost.

@ChrisHegarty
Contributor

One difficulty is making sure the SPI interface is one we want (float[] & byte[] seem too restrictive?). Some other work from @ChrisHegarty (#12703) shows that we may want to move away from float[], float[] and potentially toward an interface like:

score(int vectorOrdinal1, int vectorOrdinal2)
score(float[] queryVector, int vectorOrdinal)
score(byte[] queryVector, int vectorOrdinal)

I've been experimenting with various potential optimisations and variants for some of these distance computations, and also pushing on some limitations of the Panama Vector API (in part to feed back into the JDK API). We should really be able to compare these things either on or off heap. In that way, I agree with the comparison function being score(int ord1, int ord2). It should not matter where the data actually is.

Lucene core should not try to handle all possible variants of bit-size and distance function combinations, but rather support a subset of such along with the ability to extend. Extension allows different folk to experiment more easily in this evolving area - this is effectively what I'm doing locally. Successful and interesting experiments, when proven, can then be proposed separately on their own merit, maybe as a Lucene extension or misc package or maybe not at all. Ultimately, Lucene should benefit from "best in class" here, but not have to accept each and every variant into core. The addition of well thought out minimal extension points - for scoring - would be of long term benefit to Lucene.

I'm happy to work on such, since I've been hacking around this area locally for a while now.


This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

@benwtrent
Member

Closing in deference to extensibility added in: #13288
