-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Core: Adds Utility Class for Implementing ZOrdering #3966
Conversation
* To fix this, flip the sign bit so that all negatives are ordered before positives. This essentially | ||
* shifts the 0 value so that we don't break our ordering when we cross the new 0 value. | ||
*/ | ||
public static byte[] orderIntLikeBytes(byte[] intBytes, int size) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it worth documenting assumed endianness and contract for nulls (always first in the sort order?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I should probably mark that I assume Java byte representations, good point
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed all the methods to use java primitives so we should be ok here now, since the endianness is now Java's representation.
if (floatBytes == null) { | ||
return new byte[size]; | ||
} | ||
if ((floatBytes[0] & (1 << 7)) == 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure if it is worth handling, but can NaN's on the JVM have different representations and do you want to ensure they are sorted together?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe documenting the assumption that NaN values are normalized to Double.NaN
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I can doc that, I think hopefully it will not come up very often. In the spec we already have a discussion on -/+ Nan #2891 I'm ok with different weird nans end up in random places at least for now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. This seems like a good first approach. Have you benchmarked the overhead of it? There are definitely more sophisticated approaches for the interleave but I think they get a little complicated when you start assuming different width bytes fields for interleaving.
Assert.assertTrue(String.format( | ||
"Ordering of ints should match ordering of bytes, %s ~ %s -> %s != %s ~ %s -> %s ", | ||
aFloat, bFloat, floatCompare, Arrays.toString(aBytes), Arrays.toString(bBytes), byteCompare), | ||
(floatCompare ^ byteCompare) >= 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: This way of asserting the same comparison is a little bit opaque to me. Comparing equality on Integer.signum would be clearer at least for me.
* Which means floats can be treated as sign magnitude integers which can then be converted into lexicographically | ||
* comparable bytes | ||
*/ | ||
public static byte[] orderFloatLikeBytes(byte[] floatBytes, int size) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the purpose of using byte[]
so that we don't have a version for float and a version for double?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I use byte in all the functions since I'm assuming in some engines we can get byte representations without a conversion (like in spark). The "size" is so that I don't have a float and double version
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should make sure that assumption holds because converting floats and doubles is probably much faster if we use full values rather than individual bytes. The HBase OrderedBytes implementation for double is this:
long asLong = Double.doubleToLongBits(doubleValue);
asLong ^= ((asLong >> (Long.SIZE - 1)) | Long.MIN_VALUE);
That avoids looping over the bytes and avoids an if statement in a tight loop by producing the correct value to XOR with to optionally the non-sign flip bits or not. That is probably much faster than fetching and flipping the bits of each byte individually, if we are starting off with a float or double.
Also, doubleToLongBits
will normalize NaN values for you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can just use the Hbase code here, the Bytes assumption is more a part of my UDF implementation on the Spark side. Otherwise I have to generate a udf of specific types based on the schema of column's being chosen which ... I think I may be able to do. I'll check it out
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or Do something inside catalyst ... which also may be a better option
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually I think I can still do this with duds for the prototype, Will code all this up Monday
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that Catalyst should do okay with this because it uses Unsafe to pull values out of memory. I believe that happens by just treating an offset like its already a long or double, so it isn't doing a byte-wise operation. And for testing that also works better with GenericInternalRow, which actually stores the JVM types.
|
||
// Find next column with a byte we can use | ||
do { | ||
if (++sourceColumn == columnsBinary.length) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's usually a bad idea to use the output value of a ++ or -- operator. It's just hard to understand what's going on and I think prone to errors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed and added more clear comments about what we are doing in this algorithm
* @param columnsBinary an array of byte arrays, none of which are empty | ||
* @return their bits interleaved | ||
*/ | ||
public static byte[] interleaveBits(byte[][] columnsBinary) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't mention it at first because it is an optimization but an alternative to this is a builder pattern that has specific methods for int/long/float/double/byte[] and a predertmined number of dimensions and a bit-width for each dimension. then the addFloat/addInt methods could be called in the correct sequence. I don't know if the JVM is smart enough to compile the ByteBuffer.of(...).put pattern to simple casts, but with the builder methods these could be avoided.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that would be a good thing for a future improvement. If you see the Spark request has some benchmarks and even with wrapping all these functions in UDFS and applying them to rows that way, it's still only about 2~3 x slower than sort with as many expressions. So I think the perf is probably ok to start with.
|
||
/** | ||
* Interleave bits using a naive loop. | ||
* @param columnsBinary an array of byte arrays, none of which are empty |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: is the none of which are empty
commit still valid given that there are tests for empty arrays?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll fix up this comment, it's pretty unclear at the moment
short bShort = (short) (random.nextInt() % (Short.MAX_VALUE + 1)); | ||
int longCompare = Integer.signum(Long.compare(aShort, bShort)); | ||
byte[] aBytes = ZOrderByteUtils.longToOrderBytes(aShort); | ||
byte[] bBytes = ZOrderByteUtils.longToOrderBytes(bShort); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think those 2 should both be shortToOrderBytes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
long bByte = (byte) (random.nextInt() % (Byte.MAX_VALUE + 1)); | ||
int longCompare = Integer.signum(Long.compare(aByte, bByte)); | ||
byte[] aBytes = ZOrderByteUtils.longToOrderBytes(aByte); | ||
byte[] bBytes = ZOrderByteUtils.longToOrderBytes(bByte); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think those 2 should be tinyintToOrderedBytes
* algorithms are identically wrong or are both identically correct. | ||
*/ | ||
@Test | ||
public void testInterleaveRandomExamples() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
takes 17-18 seconds to run on my local machine, just wondering whether we want to have such long running unit tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd probably @Ignore
this test, but it's good to see it working.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can cut down the number of example :) I just really wanted to have some kind of fuzzing since I know we are going to want to optimize the interleave example with fancier byte tricks in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed the number of trials run so now this takes a few hundred ms. The longest test in the suite now is the String compare test which is around 1 second on my 2019 MBP
* shifts the 0 value so that we don't break our ordering when we cross the new 0 value. | ||
*/ | ||
public static byte[] intToOrderedBytes(int val) { | ||
ByteBuffer bytes = ByteBuffer.allocate(Integer.BYTES); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm okay with this if you want to get this through quickly, but in the long term I think we should avoid the allocation cost. What I'd recommend since this is a util method is to add a reuse
argument:
// this should probably live in the ByteBuffers util class
public static ByteBuffer reuseOrAllocate(ByteBuffer reuse, int length) {
if (reuse.hasArray() && reuse.arrayOffset() == 0 && reuse.capacity() == length) {
reuse.position(0);
reuse.limit(length);
return reuse;
} else {
return ByteBuffer.allocate(length);
}
}
public static ByteBuffer intToOrderedBytes(int val, ByteBuffer reused) {
ByteBuffer bytes = reuseOrAllocate(reused, 4);
bytes.putInt(val ^ 0x80000000);
return bytes;
}
Then the caller can always call ByteBuffer.array()
to get the underlying bytes, but it can keep reusing the space:
v1Bytes = intToOrderedBytes(v1, v1Bytes);
v2Bytes = intToOrderedBytes(v2, v2Bytes);
zorderBytes = zorder(v1Bytes, v2Bytes, zorderBytes);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is worthwhile to get in now, it isn't a huge change and should let us think a bit more about the Spark implementation's memory efficiency
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually I wonder if we should have reuseOrAllocate or just reuse and throw an error if it cannot. Basically give two versions of the orderedByteFunctions. One which attempts to reuse and throws an error if the Buffer cannot be reused and one that allocates a new buffer if no buffer is passed. That way the user knows their buffer is being re-used and the behavior isn't changing at runtime.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Implemented this but didn't allow for the "orAllocate" option. Caller needs to create the new allocation if they want to use it.
*/ | ||
public static byte[] stringToOrderedBytes(String val, int length) { | ||
ByteBuffer bytes = ByteBuffer.allocate(length); | ||
if (val != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you update this to accept a reused buffer, then we need to remember to zero out the bytes.
while (interleaveByte < interleavedSize) { | ||
// Take what we have, Get the source Bit of the source Byte, move it to the interleaveBit position | ||
interleavedBytes[interleaveByte] = | ||
(byte) (interleavedBytes[interleaveByte] | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not use |=
instead of adding this every time?
// Take what we have, Get the source Bit of the source Byte, move it to the interleaveBit position | ||
interleavedBytes[interleaveByte] = | ||
(byte) (interleavedBytes[interleaveByte] | | ||
(columnsBinary[sourceColumn][sourceByte] & 1 << sourceBit) >> sourceBit << interleaveBit); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this use >>> sourceBit
so that the sign bit isn't propagated? or are we guaranteed that won't happen because this gets converted into an int so it is too sparse to worry?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually haven't thought this through, This is one of the reasons I added the big fuzzing test of the interleave code since I have a hard time thinking through byte path in my head. I can switch to >>> though since I think that is correct
++sourceColumn; | ||
if (sourceColumn == columnsBinary.length) { | ||
sourceColumn = 0; | ||
if (--sourceBit == -1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's much easier to read and reason about algorithms like this if you don't use the return value of ++ or -- operators.
sourceBit = 7; | ||
} | ||
} | ||
} while (columnsBinary[sourceColumn].length <= sourceByte && interleaveByte < interleavedSize); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it make sense to have interleaveBytes < interleavedSize
here? All that's going to happen is the first time through the loop, sourceColumn
will be checked. But after that, if the interleave is finished then this will exit and the outer loop will exit. Then sourceColumn
isn't used because the bytes are returned. So I think it would be equivalent to do this check above the do/while:
if (interleaveBit == -1) {
...
}
if (interleaveByte < interleavedSize) {
break;
}
do { ... } while (columnsBinary[sourceColumn].length <= sourceByte)
You could also use an if block to avoid the break
if that's your thing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I'm not a big break person, but I think it complicated the code a bit since I wrote without them. Let me see what I can do to clean this up a bit
core/src/main/java/org/apache/iceberg/util/ZOrderByteUtils.java
Outdated
Show resolved
Hide resolved
} | ||
|
||
@Test | ||
public void testIntOrdering() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good tests.
} | ||
|
||
@Test | ||
public void testInterleaveMixedBits() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure that this is the correct behavior in all cases.
I think that this is the correct behavior for strings because the length of the string doesn't matter. a
is always less than bbb
. However, this is not the right behavior for numbers. For example:
short v1 = 0xFFFF
int v2 = 0x0000FFFF
byte[] result = ZOrderByteUtils.interleaveBits(new byte[] { shortToOrderedBytes(v1), intToOrderedBytes(v2) }
Result is going to be: 0b1010101010101010101010101010101011111111
. All values of v1 interleave with just the upper bits of v2. Is that the intended behavior?
This may be handy when trying to handle magnitude problems, assuming that the magnitude is always reflected in the type. But you could easily have a case where v1 and v2 are both in the range of 0x0000
to 0xFFFF
and then this behavior would not actually produce a useful zorder.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Result is going to be: 0b1010101010101010101010101010101011111111. All values of v1 interleave with just the upper bits of v2. Is that the intended behavior?
nit: I think the result fails to account for *ToBytes
calls which flip the first leading bits. Which should make the byte ordering similar to strings IIUC
This may be handy when trying to handle magnitude problems, assuming that the magnitude is always reflected in the type. But you could easily have a case where v1 and v2 are both in the range of 0x0000 to 0xFFFF and then this behavior would not actually produce a useful z-order.
Not sure if the nit above addresses your concerns and what exactly you mean by not a useful z-order, could you expand on them? My understanding is that as long as the input bytes that get interleaved have an equivalent lexicographic order to the original inputs it is a z-order and should maintain the clustering properties z-orders provide, if you then compare the output bytes lexicographically. The only other principled option I could see here would be to zero pad all numeric values to the same number of bytes, but without thinking too deeply about it I think this just takes extra space.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(strings are always expected to be padded to a fixed length)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking about this some more, I think the padding (casting to highest bit-width) is the right thing to do (below are tables showing 1 x 2 bit examples (I think what @rdblue meant by useful vs not-useful can be seen by following). The top and outer left most columns represent the input numbers
Existing (this is essentially normal sorting by each element):
bits: l0, t1, t0 | 0 | 1 | 2 | 3 |
---|---|---|---|---|
0 | 0 | 1 | 2 | 3 |
1 | 4 | 5 | 6 | 7 |
With padding:
bits: l1, t1, l0, t0 | 0 | 1 | 2 | 3 |
---|---|---|---|---|
0 | 0 | 1 | 4 | 5 |
1 (padded as 0b01 ) |
2 | 3 | 6 | 7 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have been talking about this a bit, but I think the only way to really solve this is to know the range of all column values before getting their binary values and shifting then.
I think at this point in the algorithm (during interleaving) we don't have enough information to know whether we can bit shift a value into higher order bits but I do agree eventually we should have a representation that packs the information into the highest bits possible. Ideally I think this is a responsibility of the "toOrderedBinary" which I would like to have a "min/max" parameters in the future so we can shift and tighten the domain if possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think at this point in the algorithm (during interleaving) we don't have enough information to know whether we can bit shift a value into higher order bits but I do agree eventually we should have a representation that packs the information into the highest bits possible. Ideally I think this is a responsibility of the "toOrderedBinary" which I would like to have a "min/max" parameters in the future so we can shift and tighten the domain if possible.
I agree. I think this might imply that we always want an equal number of bits on each input to interleave (need to think about this some more)? I guess falling back to traditional sorting isn't a bad fallback but we are wasting compute to make that happen.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the result fails to account for *ToBytes calls which flip the first leading bits. Which should make the byte ordering similar to strings IIUC
Yes, I didn't do the sign bit flip, but the result is pretty much the same. All of the relevant bits of the short are interleaved with the irrelevant bits of the int and you effectively get an expensive way to sort by the short and then by the int.
I think it makes the most sense to cast everything to the widest type. I think that is the behavior most people would expect: if you zorder a short with an int, it happens by magnitude not bit position.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have the general feeling that different columns sharing the same magnitude values but different types is more of a special case but I don't have any empirical evidence for that. I'm fine with doing the upcasting but maybe it makes more sense to add it in as part of a more complicated builder as suggested by @emkornfield.
Like perhaps we have a class that takes all the column types (and in the future stats?) and it chooses the correct ordered byte representations and returns a ZOrder method for those specific input parameters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's another simple option that I think solves an additional challenge...
I'd propose that we cast everything to float so we don't have to worry about choosing a type and converting everything to it. That also gives us a more compact representation that is better for packing multiple columns. That's because floating point values are represented as scientific notation with base-2. The exponent encodes how many positions the fraction was shifted so that its first digit is a 1. Alternatively, you can think of it as encoding how many bits at the start of the number are 0, then storing a normalized fraction.
That effectively compresses the most significant parts of a number into the highest bits. For float32, there's a sign bit, 8 bits of exponent, and 23 bits of fraction (mantissa).
Converting all the values to floating point would solve the immediate problem because all numbers would use the same representation, giving what I think is the expected behavior (same result regardless of type). But it also means that we can fit more columns in 32 bits or 64 bits, which are going to be much faster for comparison once we pass them off to Spark because we could use native ints or longs rather than byte arrays.
For example, if we took 4 integers and interleaved them, it would take 128 bits. But if we convert those to float first, then the most significant parts can fit in half that: 16 bits per value can carry the sign bit, the full 8-bit exponent, and 7 bits of the fraction. That's about 128 partitions of the number space per exponent value.
If you don't want to lose so much of the value, then fewer columns do better. And we can always fall back to producing the full byte array if we want.
What do you guys think?
test[1] = new byte[]{OOOOOOOI, OOOOOOOO, IIIIIIII}; | ||
test[2] = new byte[]{OOOOOOOI}; | ||
test[3] = new byte[]{OOOOOOOI}; | ||
byte[] expected = new byte[]{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could argue that the last byte of test[0]
should be ignored. Would it be better to do that so that you can zorder strings and know that you'll get just the prefix? Or do you think that this should be configured with a byte range in the strings?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My base implementation of this for spark just uses a fixed byte size for strings since each column has to contribute the same number of bytes every time. I put this responsibility in the toOrderedBytes function and not in the interleave itself. We need all entries from the same column to be the same length for the following situation
Imagine columns A,B, and C where A is between 0 and 2 bits, but B and C are always 2 bits. You could get ZValues like
(AA, BB, CC) - >ABCABC
(A_, BB, CC) - >ABCBC
(__, BB, CC) -> BCBC
The sorting is now dependent on the length of the column value and not the value itself. Shorter A's being clustered differently than longer A's.
As for in the trailing situation, I think it's worthwhile to keep those bits as well since we can just use them for hierarchal sort within clusters. You can imagine the interleave bits creating multidimensional clustered groups and the final bits producing a hierarchal sort within those groups based on the unshared portion of the longest element.
I do think in the future we may want to save space here and then we could trim the max length of a contributed byte array to the length of the second longest contributor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing you could do since it doesn't matter is just break out of the interleave loop if moving to the next column doesn't change the column index. It would be nice to not allocate the space as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a good way to do this may be to change the api slightly to
public static byte[] interleaveBits(byte[][] columnsBinary) {
return interleaveBits(columnsBinary,
Arrays.stream(columnsBinary).mapToInt(column -> column.length).max().getAsInt());
}
/**
* Interleave bits using a naive loop. Variable length inputs are allowed but to get a consistent ordering it is
* required that every column contribute the same number of bytes in each invocation. Bits are interleaved from all
* columns that have a bit available at that position. Once a Column has no more bits to produce it is skipped in the
* interleaving.
* @param columnsBinary an array of ordered byte representations of the columns being ZOrdered
* @param interleavedSize the number of bytes to use in the output
* @return the columnbytes interleaved
*/
public static byte[] interleaveBits(byte[][] columnsBinary, int interleavedSize) {
So that instead of calculating the output-size, the caller can just pass in the number of bytes it wants out of the algorithm. This handles bailing out earlier as well as not allocating as much. I can leave the original signature for the test suite.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to asking for a specific interleaved size. You could also have versions that produce int or long if you only want 4 or 8 bytes. That would work really well with Spark because you wouldn't have to store this in the variable-length part of UnsafeRow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to see less array allocation since that's needlessly expensive. I also think there's an open question about how the interleave should behave. The interleave looks correct to me and it may just come down to how we use it in practice.
ByteBuffer bytes = ByteBuffer.allocate(length); | ||
if (val != null) { | ||
int maxLength = Math.min(length, val.length()); | ||
bytes.put(val.getBytes(), 0, maxLength); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The encoding must be fixed, and it must be UTF-8, if we want the ordering of the produced bytes be coherent with value ordering.
Also byte order will match string/varchar order only if bytes are treated as unsigned.
I guess he unsignedness applies to other types, so maybe it's documented already, i didn't notice.
public static byte[] stringToOrderedBytes(String val, int length) { | ||
ByteBuffer bytes = ByteBuffer.allocate(length); | ||
if (val != null) { | ||
int maxLength = Math.min(length, val.length()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
val.length()
is not "encoding-aware". Maybe add a comment that it doesn't matter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, don't add a comment. We must take byte-length into account, otherwise we would be truncating short non-all-ASCII string values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
... an example would be String "ą", single character, so length=1.
We would copy only the first byte of the two-byte sequence (0xC4 0x85) for the letter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OH sorry I thought you were talking about something else, yeah let me switch this to the byte length of the encoded string
Thanks everyone for taking a look, I tried to take into account the major points and I know we have lots of room for research and improvement in the future. Please let me know if you have any more feedback that we should address for the initial version of the feature |
@@ -108,54 +108,70 @@ private ZOrderByteUtils() { | |||
* Strings are lexicographically sortable BUT if different byte array lengths will | |||
* ruin the Z-Ordering. (ZOrder requires that a given column contribute the same number of bytes every time). | |||
* This implementation just uses a set size to for all output byte representations. Truncating longer strings | |||
* and right padding 0 for shorter strings. | |||
* and right padding 0 for shorter strings. Requires UTF8 (or ASCII) encoding for ordering guarantees to hold. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please don't add a comment. Just replace String.getBytes()
with String.getBytes(java.nio.charset.StandardCharsets#UTF_8)
.
Of course, i don't know the rules of the project, but otherwise i don't see why we would want to rely on JVM's default encoding. This is z-order, not sorting, but z-order should be consistent with sorting, and varchar value ordering is codepoint-based (in the absence of collation which doesn't exist in Iceberg world). The UTF-8 encoding produces bytes that, if treated as unsigned, lead to the same order as codepoint-based lexicographical ordering.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. We should ensure that we use UTF-8 to convert to bytes.
Also a patch for the test interleave method length calculation
public static byte[] tinyintToOrderedBytes(byte val, ByteBuffer reuse) { | ||
ByteBuffer bytes = ByteBuffers.reuse(reuse, Byte.BYTES); | ||
bytes.put((byte) (val ^ (0x80))); | ||
return bytes.array(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we return the ByteBuffer instead since we're passing in a ByteBuffer?
if (val != null) { | ||
int maxLength = Math.min(length, val.length()); | ||
// We may truncate mid-character | ||
bytes.put(val.getBytes(StandardCharsets.UTF_8), 0, maxLength); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think @findepi has a point about length. In addition, this call to getBytes
is going to allocate a byte array. It doesn't help to reuse a buffer if we are going to do an allocation anyway.
One thing you can do is to use CharsetEncoder.encode(CharBuffer, ByteBuffer, boolean)
. That way, you'd get a CharBuffer
wrapping the original string without reallocating and encode into the buffer that's passed in. You'd also get information about the result, but you probably don't really care what happens. Underflow means the string is short and overflow means that you ran out of space. Neither one really affects zorder much.
*/ | ||
static byte[] interleaveBits(byte[][] columnsBinary) { | ||
return interleaveBits(columnsBinary, | ||
Arrays.stream(columnsBinary).mapToInt(column -> column.length).sum()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we only allow this when all the columns have the same length?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so? Shouldn't we be allowing columns to contribute different amounts of bytes?
This is one of the difficulties for working with strings and other types since not only are we going to need more bytes for strings, but their most significant bytes may be rather late in the string.
For example
"www.iceberg.org/index"
"www.iceberg.org/spark"
Only have a differing byte at position 17.
This is one of my reluctances to require similar contributions of byte sizes until we have a better "toOrderedBytes" function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we only allow this when all the columns have the same length?
z_order(an_integer, a_bigint, 50_bytes_of_string)
-- each column has different byte length.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@findepi, while those do have different lengths, is it valuable to interleave them directly? We had a discussion above about interleaving short and int. I think that the behavior most people would expect is to interleave by numeric magnitude, not the actual bytes. Basically, if two columns have the same numeric distribution, they should interleave well regardless of the type widths.
If you agree there, then what you'd do is cast the integer in your example to a bigint. Then you have 2 8-byte buffers and 1 50-byte buffer. The remainder of the 50 bytes after the first 8 doesn't matter.
There are cases where you may want to use different lengths, but to me you should be calling the version with a specified length for those.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cast the integer in your example to a bigint
that works, for numbers at leasts
The remainder of the 50 bytes after the first 8 doesn't matter.
It does:
z_order(a_short, z_order_interleaving)
-- it's unlikely that 2 first bytes of a string suffice for anything useful (ht
for urls)
unless it's covered somewhere else:
if you order by z_order_interleaving, c1, c2, c3
, then it z_order_interleaving
can be shorter
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@findepi, I see your point. If you still want the string ordering then you have to preserve the string. You're right about that. I was focusing on interleaving the ordering across the columns, but it's of course necessary to still sort by the remaining bytes in the longest column!
…arsetEncoder.encode
* Some implementation is taken from | ||
* https://github.com/apache/hbase/blob/master/hbase-common/src/main/java/org/apache/hadoop/hbase/util/OrderedBytes.java | ||
*/ | ||
public class ZOrderByteUtils { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should be documented somewhere the produced byte[] should be compared lexicographically, with bytes as unsigned.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, I will add this
* This implementation just uses a set size to for all output byte representations. Truncating longer strings | ||
* and right padding 0 for shorter strings. | ||
*/ | ||
public static byte[] stringToOrderedBytes(String val, int length, ByteBuffer reuse) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How is the length
going to be calculated in practice?
Is it going to be a fixed prefix, like 128 or something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The debate we were having before was wether to limit it to the max common length of columns, or whether to let it go beyond that.
Like with (A, B, CC, DDD)
Do you return
A. : All bytes that are available
ABCDCDD
B. : All bytes that can be interleaved with at least one other column
ABCDCD
Or
C. : All bytes that can be interleaved with all other columns
ABCD
My current implementation in Spark just does A which is the sum of all column lengths, but we could do B and save some space at the cost of losing a bit of single column ordering. I don't think C actually makes a lot of sense unless we do some hind of bin hashing and actually generate bytes all of the same size.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the our Spark implementation (see the follow up to this pr) we just set a length for the moment. Mostly to save us from unbounded strings increasing the size of our z-order algorithm and making sure the column always contributes the same number of bytes.
In the future I hope we can use statistics to do something smarter, or maybe not even use truncation and use some binning function or something like that based on the range of values in the column.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A. : All bytes that are available
i think this is what we should do (Maybe with some max length for safety) because
- z-order on 1 column should be consistent with sort on that column
- adding a low-cardinality, short column (like boolean, or tinyint, or varchar(1) if we had it), should not reduce z-order sorting strength; so one short columns should not stop us from ordering longer columns
public static byte[] stringToOrderedBytes(String val, int length) { | ||
ByteBuffer bytes = ByteBuffer.allocate(length); | ||
if (val != null) { | ||
int maxLength = Math.min(length, val.length()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
... an example would be String "ą", single character, so length=1.
We would copy only the first byte of the two-byte sequence (0xC4 0x85) for the letter.
First Chunk of Code For ZOrdering.
Common functions and code that can be shared amongst all execution engines. Transforms of Iceberg types into byte representations as well as interleaving code.
Closes #3960