Core: Adds Utility Class for Implementing ZOrdering #3966

RussellSpitzer · 2022-01-24T22:45:05Z

First Chunk of Code For ZOrdering.

Common functions and code that can be shared amongst all execution engines. Transforms of Iceberg types into byte representations as well as interleaving code.

Closes #3960

emkornfield · 2022-01-27T04:47:43Z

core/src/main/java/org/apache/iceberg/util/ZOrderByteUtils.java

+   * To fix this, flip the sign bit so that all negatives are ordered before positives. This essentially
+   * shifts the 0 value so that we don't break our ordering when we cross the new 0 value.
+   */
+  public static byte[] orderIntLikeBytes(byte[] intBytes, int size) {


is it worth documenting assumed endianness and contract for nulls (always first in the sort order?)

I should probably mark that I assume Java byte representations, good point

Changed all the methods to use java primitives so we should be ok here now, since the endianness is now Java's representation.

emkornfield · 2022-01-27T04:53:32Z

core/src/main/java/org/apache/iceberg/util/ZOrderByteUtils.java

+    if (floatBytes == null) {
+      return new byte[size];
+    }
+    if ((floatBytes[0] & (1 << 7)) == 0) {


not sure if it is worth handling, but can NaN's on the JVM have different representations and do you want to ensure they are sorted together?

maybe documenting the assumption that NaN values are normalized to Double.NaN

Yeah I can doc that, I think hopefully it will not come up very often. In the spec we already have a discussion on -/+ Nan #2891 I'm ok with different weird nans end up in random places at least for now

emkornfield

LGTM. This seems like a good first approach. Have you benchmarked the overhead of it? There are definitely more sophisticated approaches for the interleave but I think they get a little complicated when you start assuming different width bytes fields for interleaving.

emkornfield · 2022-01-27T07:00:40Z

core/src/test/java/org/apache/iceberg/util/TestZOrderByteUtil.java

+      Assert.assertTrue(String.format(
+          "Ordering of ints should match ordering of bytes, %s ~ %s -> %s != %s ~ %s -> %s ",
+          aFloat, bFloat, floatCompare, Arrays.toString(aBytes), Arrays.toString(bBytes), byteCompare),
+          (floatCompare ^ byteCompare) >= 0);


nit: This way of asserting the same comparison is a little bit opaque to me. Comparing equality on Integer.signum would be clearer at least for me.

build.gradle

rdblue · 2022-01-28T01:13:38Z

core/src/main/java/org/apache/iceberg/util/ZOrderByteUtils.java

+   * Which means floats can be treated as sign magnitude integers which can then be converted into lexicographically
+   * comparable bytes
+   */
+  public static byte[] orderFloatLikeBytes(byte[] floatBytes, int size) {


Is the purpose of using byte[] so that we don't have a version for float and a version for double?

I use byte in all the functions since I'm assuming in some engines we can get byte representations without a conversion (like in spark). The "size" is so that I don't have a float and double version

We should make sure that assumption holds because converting floats and doubles is probably much faster if we use full values rather than individual bytes. The HBase OrderedBytes implementation for double is this:

long asLong = Double.doubleToLongBits(doubleValue); asLong ^= ((asLong >> (Long.SIZE - 1)) | Long.MIN_VALUE);

That avoids looping over the bytes and avoids an if statement in a tight loop by producing the correct value to XOR with to optionally the non-sign flip bits or not. That is probably much faster than fetching and flipping the bits of each byte individually, if we are starting off with a float or double.

Also, doubleToLongBits will normalize NaN values for you.

We can just use the Hbase code here, the Bytes assumption is more a part of my UDF implementation on the Spark side. Otherwise I have to generate a udf of specific types based on the schema of column's being chosen which ... I think I may be able to do. I'll check it out

Or Do something inside catalyst ... which also may be a better option

Actually I think I can still do this with duds for the prototype, Will code all this up Monday

I think that Catalyst should do okay with this because it uses Unsafe to pull values out of memory. I believe that happens by just treating an offset like its already a long or double, so it isn't doing a byte-wise operation. And for testing that also works better with GenericInternalRow, which actually stores the JVM types.

rdblue · 2022-01-28T01:32:42Z

core/src/main/java/org/apache/iceberg/util/ZOrderByteUtils.java

+
+      // Find next column with a byte we can use
+      do {
+        if (++sourceColumn == columnsBinary.length) {


I think it's usually a bad idea to use the output value of a ++ or -- operator. It's just hard to understand what's going on and I think prone to errors.

Removed and added more clear comments about what we are doing in this algorithm

emkornfield · 2022-01-28T01:43:06Z

core/src/main/java/org/apache/iceberg/util/ZOrderByteUtils.java

+   * @param columnsBinary an array of byte arrays, none of which are empty
+   * @return their bits interleaved
+   */
+  public static byte[] interleaveBits(byte[][] columnsBinary) {


I didn't mention it at first because it is an optimization but an alternative to this is a builder pattern that has specific methods for int/long/float/double/byte[] and a predertmined number of dimensions and a bit-width for each dimension. then the addFloat/addInt methods could be called in the correct sequence. I don't know if the JVM is smart enough to compile the ByteBuffer.of(...).put pattern to simple casts, but with the builder methods these could be avoided.

I think that would be a good thing for a future improvement. If you see the Spark request has some benchmarks and even with wrapping all these functions in UDFS and applying them to rows that way, it's still only about 2~3 x slower than sort with as many expressions. So I think the perf is probably ok to start with.

nastra · 2022-02-03T18:16:03Z

core/src/main/java/org/apache/iceberg/util/ZOrderByteUtils.java

+
+  /**
+   * Interleave bits using a naive loop.
+   * @param columnsBinary an array of byte arrays, none of which are empty


nit: is the none of which are empty commit still valid given that there are tests for empty arrays?

I'll fix up this comment, it's pretty unclear at the moment

nastra · 2022-02-03T18:22:14Z

core/src/test/java/org/apache/iceberg/util/TestZOrderByteUtil.java

+      short bShort = (short) (random.nextInt() % (Short.MAX_VALUE + 1));
+      int longCompare = Integer.signum(Long.compare(aShort, bShort));
+      byte[] aBytes = ZOrderByteUtils.longToOrderBytes(aShort);
+      byte[] bBytes = ZOrderByteUtils.longToOrderBytes(bShort);


I think those 2 should both be shortToOrderBytes

nastra · 2022-02-03T18:22:45Z

core/src/test/java/org/apache/iceberg/util/TestZOrderByteUtil.java

+      long bByte = (byte) (random.nextInt() % (Byte.MAX_VALUE + 1));
+      int longCompare = Integer.signum(Long.compare(aByte, bByte));
+      byte[] aBytes = ZOrderByteUtils.longToOrderBytes(aByte);
+      byte[] bBytes = ZOrderByteUtils.longToOrderBytes(bByte);


I think those 2 should be tinyintToOrderedBytes

nastra · 2022-02-03T18:25:14Z

core/src/test/java/org/apache/iceberg/util/TestZOrderByteUtil.java

+   * algorithms are identically wrong or are both identically correct.
+   */
+  @Test
+  public void testInterleaveRandomExamples() {


takes 17-18 seconds to run on my local machine, just wondering whether we want to have such long running unit tests

I'd probably @Ignore this test, but it's good to see it working.

I can cut down the number of example :) I just really wanted to have some kind of fuzzing since I know we are going to want to optimize the interleave example with fancier byte tricks in the future.

Changed the number of trials run so now this takes a few hundred ms. The longest test in the suite now is the String compare test which is around 1 second on my 2019 MBP

rdblue · 2022-02-06T23:43:34Z

core/src/main/java/org/apache/iceberg/util/ZOrderByteUtils.java

+   * shifts the 0 value so that we don't break our ordering when we cross the new 0 value.
+   */
+  public static byte[] intToOrderedBytes(int val) {
+    ByteBuffer bytes = ByteBuffer.allocate(Integer.BYTES);


I'm okay with this if you want to get this through quickly, but in the long term I think we should avoid the allocation cost. What I'd recommend since this is a util method is to add a reuse argument:

// this should probably live in the ByteBuffers util class public static ByteBuffer reuseOrAllocate(ByteBuffer reuse, int length) { if (reuse.hasArray() && reuse.arrayOffset() == 0 && reuse.capacity() == length) { reuse.position(0); reuse.limit(length); return reuse; } else { return ByteBuffer.allocate(length); } } public static ByteBuffer intToOrderedBytes(int val, ByteBuffer reused) { ByteBuffer bytes = reuseOrAllocate(reused, 4); bytes.putInt(val ^ 0x80000000); return bytes; }

Then the caller can always call ByteBuffer.array() to get the underlying bytes, but it can keep reusing the space:

v1Bytes = intToOrderedBytes(v1, v1Bytes); v2Bytes = intToOrderedBytes(v2, v2Bytes); zorderBytes = zorder(v1Bytes, v2Bytes, zorderBytes);

I think this is worthwhile to get in now, it isn't a huge change and should let us think a bit more about the Spark implementation's memory efficiency

Actually I wonder if we should have reuseOrAllocate or just reuse and throw an error if it cannot. Basically give two versions of the orderedByteFunctions. One which attempts to reuse and throws an error if the Buffer cannot be reused and one that allocates a new buffer if no buffer is passed. That way the user knows their buffer is being re-used and the behavior isn't changing at runtime.

Implemented this but didn't allow for the "orAllocate" option. Caller needs to create the new allocation if they want to use it.

rdblue · 2022-02-06T23:54:45Z

core/src/main/java/org/apache/iceberg/util/ZOrderByteUtils.java

+   */
+  public static byte[] stringToOrderedBytes(String val, int length) {
+    ByteBuffer bytes = ByteBuffer.allocate(length);
+    if (val != null) {


If you update this to accept a reused buffer, then we need to remember to zero out the bytes.

rdblue · 2022-02-06T23:56:25Z

core/src/main/java/org/apache/iceberg/util/ZOrderByteUtils.java

+    while (interleaveByte < interleavedSize) {
+      // Take what we have, Get the source Bit of the source Byte, move it to the interleaveBit position
+      interleavedBytes[interleaveByte] =
+          (byte) (interleavedBytes[interleaveByte] |


Why not use |= instead of adding this every time?

rdblue · 2022-02-06T23:59:37Z

core/src/main/java/org/apache/iceberg/util/ZOrderByteUtils.java

+      // Take what we have, Get the source Bit of the source Byte, move it to the interleaveBit position
+      interleavedBytes[interleaveByte] =
+          (byte) (interleavedBytes[interleaveByte] |
+              (columnsBinary[sourceColumn][sourceByte] & 1 << sourceBit) >> sourceBit << interleaveBit);


Shouldn't this use >>> sourceBit so that the sign bit isn't propagated? or are we guaranteed that won't happen because this gets converted into an int so it is too sparse to worry?

I actually haven't thought this through, This is one of the reasons I added the big fuzzing test of the interleave code since I have a hard time thinking through byte path in my head. I can switch to >>> though since I think that is correct

rdblue · 2022-02-07T00:03:54Z

core/src/main/java/org/apache/iceberg/util/ZOrderByteUtils.java

+        ++sourceColumn;
+        if (sourceColumn == columnsBinary.length) {
+          sourceColumn = 0;
+          if (--sourceBit == -1) {


I think it's much easier to read and reason about algorithms like this if you don't use the return value of ++ or -- operators.

rdblue · 2022-02-07T00:08:24Z

core/src/main/java/org/apache/iceberg/util/ZOrderByteUtils.java

+            sourceBit = 7;
+          }
+        }
+      } while (columnsBinary[sourceColumn].length <= sourceByte && interleaveByte < interleavedSize);


Does it make sense to have interleaveBytes < interleavedSize here? All that's going to happen is the first time through the loop, sourceColumn will be checked. But after that, if the interleave is finished then this will exit and the outer loop will exit. Then sourceColumn isn't used because the bytes are returned. So I think it would be equivalent to do this check above the do/while:

if (interleaveBit == -1) { ... } if (interleaveByte < interleavedSize) { break; } do { ... } while (columnsBinary[sourceColumn].length <= sourceByte)

You could also use an if block to avoid the break if that's your thing.

Yeah I'm not a big break person, but I think it complicated the code a bit since I wrote without them. Let me see what I can do to clean this up a bit

core/src/main/java/org/apache/iceberg/util/ZOrderByteUtils.java

rdblue · 2022-02-07T00:23:04Z

core/src/test/java/org/apache/iceberg/util/TestZOrderByteUtil.java

+  }
+
+  @Test
+  public void testIntOrdering() {


Good tests.

rdblue · 2022-02-07T00:35:56Z

core/src/test/java/org/apache/iceberg/util/TestZOrderByteUtil.java

+  }
+
+  @Test
+  public void testInterleaveMixedBits() {


I'm not sure that this is the correct behavior in all cases.

I think that this is the correct behavior for strings because the length of the string doesn't matter. a is always less than bbb. However, this is not the right behavior for numbers. For example:

short v1 = 0xFFFF int v2 = 0x0000FFFF byte[] result = ZOrderByteUtils.interleaveBits(new byte[] { shortToOrderedBytes(v1), intToOrderedBytes(v2) }

Result is going to be: 0b1010101010101010101010101010101011111111. All values of v1 interleave with just the upper bits of v2. Is that the intended behavior?

This may be handy when trying to handle magnitude problems, assuming that the magnitude is always reflected in the type. But you could easily have a case where v1 and v2 are both in the range of 0x0000 to 0xFFFF and then this behavior would not actually produce a useful zorder.

Result is going to be: 0b1010101010101010101010101010101011111111. All values of v1 interleave with just the upper bits of v2. Is that the intended behavior?

nit: I think the result fails to account for *ToBytes calls which flip the first leading bits. Which should make the byte ordering similar to strings IIUC

This may be handy when trying to handle magnitude problems, assuming that the magnitude is always reflected in the type. But you could easily have a case where v1 and v2 are both in the range of 0x0000 to 0xFFFF and then this behavior would not actually produce a useful z-order.

Not sure if the nit above addresses your concerns and what exactly you mean by not a useful z-order, could you expand on them? My understanding is that as long as the input bytes that get interleaved have an equivalent lexicographic order to the original inputs it is a z-order and should maintain the clustering properties z-orders provide, if you then compare the output bytes lexicographically. The only other principled option I could see here would be to zero pad all numeric values to the same number of bytes, but without thinking too deeply about it I think this just takes extra space.

(strings are always expected to be padded to a fixed length)

Thinking about this some more, I think the padding (casting to highest bit-width) is the right thing to do (below are tables showing 1 x 2 bit examples (I think what @rdblue meant by useful vs not-useful can be seen by following). The top and outer left most columns represent the input numbers

Existing (this is essentially normal sorting by each element):

bits: l₀, t₁, t₀ 0 1 2 3

0 0 1 2 3

1 4 5 6 7

With padding:

bits: l₁, t₁, l₀, t₀ 0 1 2 3

0 0 1 4 5

1 (padded as 0b01) 2 3 6 7

We have been talking about this a bit, but I think the only way to really solve this is to know the range of all column values before getting their binary values and shifting then.

I think at this point in the algorithm (during interleaving) we don't have enough information to know whether we can bit shift a value into higher order bits but I do agree eventually we should have a representation that packs the information into the highest bits possible. Ideally I think this is a responsibility of the "toOrderedBinary" which I would like to have a "min/max" parameters in the future so we can shift and tighten the domain if possible.

I think at this point in the algorithm (during interleaving) we don't have enough information to know whether we can bit shift a value into higher order bits but I do agree eventually we should have a representation that packs the information into the highest bits possible. Ideally I think this is a responsibility of the "toOrderedBinary" which I would like to have a "min/max" parameters in the future so we can shift and tighten the domain if possible.

I agree. I think this might imply that we always want an equal number of bits on each input to interleave (need to think about this some more)? I guess falling back to traditional sorting isn't a bad fallback but we are wasting compute to make that happen.

I think the result fails to account for *ToBytes calls which flip the first leading bits. Which should make the byte ordering similar to strings IIUC

Yes, I didn't do the sign bit flip, but the result is pretty much the same. All of the relevant bits of the short are interleaved with the irrelevant bits of the int and you effectively get an expensive way to sort by the short and then by the int.

I think it makes the most sense to cast everything to the widest type. I think that is the behavior most people would expect: if you zorder a short with an int, it happens by magnitude not bit position.

I have the general feeling that different columns sharing the same magnitude values but different types is more of a special case but I don't have any empirical evidence for that. I'm fine with doing the upcasting but maybe it makes more sense to add it in as part of a more complicated builder as suggested by @emkornfield.

Like perhaps we have a class that takes all the column types (and in the future stats?) and it chooses the correct ordered byte representations and returns a ZOrder method for those specific input parameters.

There's another simple option that I think solves an additional challenge...

I'd propose that we cast everything to float so we don't have to worry about choosing a type and converting everything to it. That also gives us a more compact representation that is better for packing multiple columns. That's because floating point values are represented as scientific notation with base-2. The exponent encodes how many positions the fraction was shifted so that its first digit is a 1. Alternatively, you can think of it as encoding how many bits at the start of the number are 0, then storing a normalized fraction.

That effectively compresses the most significant parts of a number into the highest bits. For float32, there's a sign bit, 8 bits of exponent, and 23 bits of fraction (mantissa).

Converting all the values to floating point would solve the immediate problem because all numbers would use the same representation, giving what I think is the expected behavior (same result regardless of type). But it also means that we can fit more columns in 32 bits or 64 bits, which are going to be much faster for comparison once we pass them off to Spark because we could use native ints or longs rather than byte arrays.

For example, if we took 4 integers and interleaved them, it would take 128 bits. But if we convert those to float first, then the most significant parts can fit in half that: 16 bits per value can carry the sign bit, the full 8-bit exponent, and 7 bits of the fraction. That's about 128 partitions of the number space per exponent value.

If you don't want to lose so much of the value, then fewer columns do better. And we can always fall back to producing the full byte array if we want.

What do you guys think?

rdblue · 2022-02-07T00:37:10Z

core/src/test/java/org/apache/iceberg/util/TestZOrderByteUtil.java

+    test[1] = new byte[]{OOOOOOOI, OOOOOOOO, IIIIIIII};
+    test[2] = new byte[]{OOOOOOOI};
+    test[3] = new byte[]{OOOOOOOI};
+    byte[] expected = new byte[]{


You could argue that the last byte of test[0] should be ignored. Would it be better to do that so that you can zorder strings and know that you'll get just the prefix? Or do you think that this should be configured with a byte range in the strings?

My base implementation of this for spark just uses a fixed byte size for strings since each column has to contribute the same number of bytes every time. I put this responsibility in the toOrderedBytes function and not in the interleave itself. We need all entries from the same column to be the same length for the following situation

Imagine columns A,B, and C where A is between 0 and 2 bits, but B and C are always 2 bits. You could get ZValues like

(AA, BB, CC) - >ABCABC
(A_, BB, CC) - >ABCBC
(__, BB, CC) -> BCBC

The sorting is now dependent on the length of the column value and not the value itself. Shorter A's being clustered differently than longer A's.

As for in the trailing situation, I think it's worthwhile to keep those bits as well since we can just use them for hierarchal sort within clusters. You can imagine the interleave bits creating multidimensional clustered groups and the final bits producing a hierarchal sort within those groups based on the unshared portion of the longest element.

I do think in the future we may want to save space here and then we could trim the max length of a contributed byte array to the length of the second longest contributor.

One thing you could do since it doesn't matter is just break out of the interleave loop if moving to the next column doesn't change the column index. It would be nice to not allocate the space as well.

I think a good way to do this may be to change the api slightly to

public static byte[] interleaveBits(byte[][] columnsBinary) { return interleaveBits(columnsBinary, Arrays.stream(columnsBinary).mapToInt(column -> column.length).max().getAsInt()); } /** * Interleave bits using a naive loop. Variable length inputs are allowed but to get a consistent ordering it is * required that every column contribute the same number of bytes in each invocation. Bits are interleaved from all * columns that have a bit available at that position. Once a Column has no more bits to produce it is skipped in the * interleaving. * @param columnsBinary an array of ordered byte representations of the columns being ZOrdered * @param interleavedSize the number of bytes to use in the output * @return the columnbytes interleaved */ public static byte[] interleaveBits(byte[][] columnsBinary, int interleavedSize) {

So that instead of calculating the output-size, the caller can just pass in the number of bytes it wants out of the algorithm. This handles bailing out earlier as well as not allocating as much. I can leave the original signature for the test suite.

+1 to asking for a specific interleaved size. You could also have versions that produce int or long if you only want 4 or 8 bytes. That would work really well with Spark because you wouldn't have to store this in the variable-length part of UnsafeRow.

rdblue

I'd like to see less array allocation since that's needlessly expensive. I also think there's an open question about how the interleave should behave. The interleave looks correct to me and it may just come down to how we use it in practice.

findepi · 2022-02-07T16:47:52Z

core/src/main/java/org/apache/iceberg/util/ZOrderByteUtils.java

+    ByteBuffer bytes = ByteBuffer.allocate(length);
+    if (val != null) {
+      int maxLength = Math.min(length, val.length());
+      bytes.put(val.getBytes(), 0, maxLength);


The encoding must be fixed, and it must be UTF-8, if we want the ordering of the produced bytes be coherent with value ordering.

Also byte order will match string/varchar order only if bytes are treated as unsigned.
I guess he unsignedness applies to other types, so maybe it's documented already, i didn't notice.

findepi · 2022-02-07T16:50:23Z

core/src/main/java/org/apache/iceberg/util/ZOrderByteUtils.java

+  public static byte[] stringToOrderedBytes(String val, int length) {
+    ByteBuffer bytes = ByteBuffer.allocate(length);
+    if (val != null) {
+      int maxLength = Math.min(length, val.length());


val.length() is not "encoding-aware". Maybe add a comment that it doesn't matter.

Actually, don't add a comment. We must take byte-length into account, otherwise we would be truncating short non-all-ASCII string values.

... an example would be String "ą", single character, so length=1.
We would copy only the first byte of the two-byte sequence (0xC4 0x85) for the letter.

OH sorry I thought you were talking about something else, yeah let me switch this to the byte length of the encoded string

RussellSpitzer · 2022-02-07T21:33:50Z

Thanks everyone for taking a look, I tried to take into account the major points and I know we have lots of room for research and improvement in the future. Please let me know if you have any more feedback that we should address for the initial version of the feature

findepi · 2022-02-08T12:30:19Z

core/src/main/java/org/apache/iceberg/util/ZOrderByteUtils.java

@@ -108,54 +108,70 @@ private ZOrderByteUtils() {
   * Strings are lexicographically sortable BUT if different byte array lengths will
   * ruin the Z-Ordering. (ZOrder requires that a given column contribute the same number of bytes every time).
   * This implementation just uses a set size to for all output byte representations. Truncating longer strings
-   * and right padding 0 for shorter strings.
+   * and right padding 0 for shorter strings. Requires UTF8 (or ASCII) encoding for ordering guarantees to hold.


Please don't add a comment. Just replace String.getBytes() with String.getBytes(java.nio.charset.StandardCharsets#UTF_8).

Of course, i don't know the rules of the project, but otherwise i don't see why we would want to rely on JVM's default encoding. This is z-order, not sorting, but z-order should be consistent with sorting, and varchar value ordering is codepoint-based (in the absence of collation which doesn't exist in Iceberg world). The UTF-8 encoding produces bytes that, if treated as unsigned, lead to the same order as codepoint-based lexicographical ordering.

I agree. We should ensure that we use UTF-8 to convert to bytes.

Also a patch for the test interleave method length calculation

rdblue · 2022-02-08T16:33:17Z

core/src/main/java/org/apache/iceberg/util/ZOrderByteUtils.java

+  public static byte[] tinyintToOrderedBytes(byte val, ByteBuffer reuse) {
+    ByteBuffer bytes = ByteBuffers.reuse(reuse, Byte.BYTES);
+    bytes.put((byte) (val ^ (0x80)));
+    return bytes.array();


Can we return the ByteBuffer instead since we're passing in a ByteBuffer?

rdblue · 2022-02-08T16:43:12Z

core/src/main/java/org/apache/iceberg/util/ZOrderByteUtils.java

+    if (val != null) {
+      int maxLength = Math.min(length, val.length());
+      // We may truncate mid-character
+      bytes.put(val.getBytes(StandardCharsets.UTF_8), 0, maxLength);


I think @findepi has a point about length. In addition, this call to getBytes is going to allocate a byte array. It doesn't help to reuse a buffer if we are going to do an allocation anyway.

One thing you can do is to use CharsetEncoder.encode(CharBuffer, ByteBuffer, boolean). That way, you'd get a CharBuffer wrapping the original string without reallocating and encode into the buffer that's passed in. You'd also get information about the result, but you probably don't really care what happens. Underflow means the string is short and overflow means that you ran out of space. Neither one really affects zorder much.

rdblue · 2022-02-08T16:43:32Z

core/src/main/java/org/apache/iceberg/util/ZOrderByteUtils.java

+   */
+  static byte[] interleaveBits(byte[][] columnsBinary) {
+    return interleaveBits(columnsBinary,
+        Arrays.stream(columnsBinary).mapToInt(column -> column.length).sum());


Should we only allow this when all the columns have the same length?

I don't think so? Shouldn't we be allowing columns to contribute different amounts of bytes?

This is one of the difficulties for working with strings and other types since not only are we going to need more bytes for strings, but their most significant bytes may be rather late in the string.

For example

"www.iceberg.org/index"
"www.iceberg.org/spark"

Only have a differing byte at position 17.

This is one of my reluctances to require similar contributions of byte sizes until we have a better "toOrderedBytes" function.

Should we only allow this when all the columns have the same length?

z_order(an_integer, a_bigint, 50_bytes_of_string) -- each column has different byte length.

@findepi, while those do have different lengths, is it valuable to interleave them directly? We had a discussion above about interleaving short and int. I think that the behavior most people would expect is to interleave by numeric magnitude, not the actual bytes. Basically, if two columns have the same numeric distribution, they should interleave well regardless of the type widths.

If you agree there, then what you'd do is cast the integer in your example to a bigint. Then you have 2 8-byte buffers and 1 50-byte buffer. The remainder of the 50 bytes after the first 8 doesn't matter.

There are cases where you may want to use different lengths, but to me you should be calling the version with a specified length for those.

cast the integer in your example to a bigint

that works, for numbers at leasts

The remainder of the 50 bytes after the first 8 doesn't matter.

It does:

z_order(a_short, z_order_interleaving) -- it's unlikely that 2 first bytes of a string suffice for anything useful (ht for urls)

unless it's covered somewhere else:
if you order by z_order_interleaving, c1, c2, c3, then it z_order_interleaving can be shorter

@findepi, I see your point. If you still want the string ordering then you have to preserve the string. You're right about that. I was focusing on interleaving the ordering across the columns, but it's of course necessary to still sort by the remaining bytes in the longest column!

…arsetEncoder.encode

findepi · 2022-02-08T21:10:28Z

core/src/main/java/org/apache/iceberg/util/ZOrderByteUtils.java

+ * Some implementation is taken from
+ * https://github.com/apache/hbase/blob/master/hbase-common/src/main/java/org/apache/hadoop/hbase/util/OrderedBytes.java
+ */
+public class ZOrderByteUtils {


It should be documented somewhere the produced byte[] should be compared lexicographically, with bytes as unsigned.

Yep, I will add this

findepi · 2022-02-08T21:11:32Z

core/src/main/java/org/apache/iceberg/util/ZOrderByteUtils.java

+   * This implementation just uses a set size to for all output byte representations. Truncating longer strings
+   * and right padding 0 for shorter strings.
+   */
+  public static byte[] stringToOrderedBytes(String val, int length, ByteBuffer reuse) {


How is the length going to be calculated in practice?
Is it going to be a fixed prefix, like 128 or something?

The debate we were having before was wether to limit it to the max common length of columns, or whether to let it go beyond that.

Like with (A, B, CC, DDD)

Do you return

A. : All bytes that are available
ABCDCDD

B. : All bytes that can be interleaved with at least one other column
ABCDCD

Or

C. : All bytes that can be interleaved with all other columns
ABCD

My current implementation in Spark just does A which is the sum of all column lengths, but we could do B and save some space at the cost of losing a bit of single column ordering. I don't think C actually makes a lot of sense unless we do some hind of bin hashing and actually generate bytes all of the same size.

In the our Spark implementation (see the follow up to this pr) we just set a length for the moment. Mostly to save us from unbounded strings increasing the size of our z-order algorithm and making sure the column always contributes the same number of bytes.

In the future I hope we can use statistics to do something smarter, or maybe not even use truncation and use some binning function or something like that based on the range of values in the column.

A. : All bytes that are available

i think this is what we should do (Maybe with some max length for safety) because

z-order on 1 column should be consistent with sort on that column

adding a low-cardinality, short column (like boolean, or tinyint, or varchar(1) if we had it), should not reduce z-order sorting strength; so one short columns should not stop us from ordering longer columns

findepi · 2022-02-08T21:13:33Z

core/src/main/java/org/apache/iceberg/util/ZOrderByteUtils.java

+  public static byte[] stringToOrderedBytes(String val, int length) {
+    ByteBuffer bytes = ByteBuffer.allocate(length);
+    if (val != null) {
+      int maxLength = Math.min(length, val.length());


... an example would be String "ą", single character, so length=1.
We would copy only the first byte of the two-byte sequence (0xC4 0x85) for the letter.

Core: Adds Utility Class for Implementing ZOrdering

ade1a1f

github-actions bot added build core labels Jan 24, 2022

Fix JavaDoc

781a121

RussellSpitzer mentioned this pull request Jan 26, 2022

Spark: Spark3 ZOrder Rewrite Strategy #3983

Merged

emkornfield reviewed Jan 27, 2022

View reviewed changes

emkornfield approved these changes Jan 27, 2022

View reviewed changes

rdblue reviewed Jan 28, 2022

View reviewed changes

build.gradle Outdated Show resolved Hide resolved

rdblue reviewed Jan 28, 2022

View reviewed changes

emkornfield reviewed Jan 28, 2022

View reviewed changes

RussellSpitzer added 4 commits January 31, 2022 12:31

Switch Implementations to work on Primitives instead of ByteArrays

fda817c

Clean up RandomStringUtilUsage

c6954e6

Fix JavaDoc

3f6fc92

Add Functions for Smaller Types

5ccf8f4

nastra reviewed Feb 3, 2022

View reviewed changes

rdblue reviewed Feb 6, 2022

View reviewed changes

rdblue reviewed Feb 7, 2022

View reviewed changes

core/src/main/java/org/apache/iceberg/util/ZOrderByteUtils.java Outdated Show resolved Hide resolved

rdblue reviewed Feb 7, 2022

View reviewed changes

core/src/test/java/org/apache/iceberg/util/TestZOrderByteUtil.java

}

@Test

public void testIntOrdering() {

Copy link

Contributor

rdblue Feb 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good tests.

rdblue reviewed Feb 7, 2022

View reviewed changes

rdblue requested changes Feb 7, 2022

View reviewed changes

findepi reviewed Feb 7, 2022

View reviewed changes

Updates for reviewer comments

30c4633

github-actions bot added the API label Feb 7, 2022

Specify Output Size

10e561c

findepi reviewed Feb 8, 2022

View reviewed changes

Fix Encoding

74d20a4

Also a patch for the test interleave method length calculation

rdblue reviewed Feb 8, 2022

View reviewed changes

Methods return ByteBuffers, Strings are efit into our buffer using Ch…

848de3b

…arsetEncoder.encode

findepi reviewed Feb 8, 2022

View reviewed changes

RussellSpitzer added 2 commits February 8, 2022 21:49

Remove unused string length

83586f1

Update docs

57e1462

RussellSpitzer closed this Apr 22, 2022

Core: Adds Utility Class for Implementing ZOrdering #3966

Core: Adds Utility Class for Implementing ZOrdering #3966

Conversation

RussellSpitzer commented Jan 24, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

emkornfield left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rdblue Feb 7, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

emkornfield Feb 7, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

emkornfield Feb 7, 2022 • edited Loading

Choose a reason for hiding this comment

RussellSpitzer Feb 7, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rdblue Feb 7, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rdblue left a comment

Choose a reason for hiding this comment

findepi Feb 7, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

RussellSpitzer commented Feb 7, 2022

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rdblue Feb 8, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

RussellSpitzer Feb 8, 2022 • edited Loading

RussellSpitzer commented Jan 24, 2022 •

edited

Loading

rdblue Feb 7, 2022 •

edited

Loading

emkornfield Feb 7, 2022 •

edited

Loading

emkornfield Feb 7, 2022 •

edited

Loading

RussellSpitzer Feb 7, 2022 •

edited

Loading

rdblue Feb 7, 2022 •

edited

Loading

findepi Feb 7, 2022 •

edited

Loading

rdblue Feb 8, 2022 •

edited

Loading

RussellSpitzer Feb 8, 2022 •

edited

Loading