-
Notifications
You must be signed in to change notification settings - Fork 438
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add fixed(L) type to variant spec #481
Conversation
cc @emkornfield @gene-db and @RussellSpitzer |
VariantEncoding.md
Outdated
@@ -399,6 +399,7 @@ The Decimal type contains a scale, but no precision. The implied precision of a | |||
| Timestamp | timestamp with time zone | `22` | TIMESTAMP(isAdjustedToUTC=true, NANOS) | 8-byte little-endian | | |||
| TimestampNTZ | timestamp without time zone | `23` | TIMESTAMP(isAdjustedToUTC=false, NANOS) | 8-byte little-endian | | |||
| UUID | uuid | `24` | UUID | 16-byte big-endian | | |||
| Fixed(L) | Byte array of length L | `25` | FIXED_LEN_BYTE_ARRAY[L] | 4 byte little-endian size L, followed by length-L big-endian bytes | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why using big-endian
bytes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need the size? Shouldn't that be in the type description?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
big-endian bytes: this is to keep in sync with the others like UUID which is a fixed(16). And it makes sense to write a bytes in big endian since the engine can write the bytes in the buffer in order, not requiring buffering the whole string.
The required size: I initially avoided adding the fixed(L) type because I believed we couldn't support fixed(L) if we try to include L in the type description, as there wouldn't be enough bits available to represent the length, given that only 5 bits are allocated for the type.
The way here to add the fixed(L) type is to add the length in the value field - we are duplicating the length for each value but I don't see other ways.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-
I'm not sure endianness makes sense for fixed(L), endianess only applies to multi-bytes structures? Fixed(L) each bytes is independent.
-
I think the current proposal is reasonable and matches how things like decimal with arbitrary precisions are encoded. It is also consistent with string representation, if we are worried about overhead of 4 bytes then we could use a variable width encoding schema (or have two types Short-fixed(L) with 1 byte and fixed(L) with 4 buytes. Unfortunately, IIUC we can't have a 'short-fixed L' like we have for string because I think we are already use the entire number range there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Re: Size - makes sense, i forgot we are storing type information for every row anyway so it probably makes more sense to store it here in the value section of the Variant.
I agree with @emkornfield that we probably don't need to specify endianess.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah. Make sense that we don't need endianness.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@emkornfield number range - you mean the type range? We can have from 0 - 31 with a few left. I think it makes sense to to add Short-fixed(L). Let me know if you agree.
we can't have a 'short-fixed L' like we have for string because I think we are already use the entire number range there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@emkornfield number range - you mean the type range? We can have from 0 - 31 with a few left. I think it makes sense to to add Short-fixed(L). Let me know if you agree.
we can't have a 'short-fixed L' like we have for string because I think we are already use the entire number range there.
I was talking about the range for "basic type", I'm OK adding it as a separate "primitive type" though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh. I see. Yeah. We have used all of them for "basic type". We can't add like short-string.
Let me make the change and share with you.
bf150b8
to
ccdd571
Compare
ccdd571
to
a5d8502
Compare
@aihuaxu I am not sure why we need this type in the Variant binary encoding. Doesn't this just duplicate the |
Yeah. You are right. The storage is the same as the binary since we can't have the length in the type description. Really no need to add that. We can just use binary to store fixed(L). cc @RussellSpitzer and @emkornfield |
So I think the difference here is semantics. I don't have a strong opinion one way or another, but both Parquet natively and iceberg distinguish between bytes and Fixed(L). One could argue thisis purely for optimization purposes, and when we are paying the cost anyway of storing individual lengths per field they are equivelant. Ultimately, the place where this would make a difference is shredding where Fixed(L) can be mapped to FLBA which would save some amount of storage. I don't feel too strongly one way or another on adding the type. |
I think this is one of those cases where an engine shredding a variable length binary could decide whether the shredded type can become a fixed length when shredding. So it probably doesn't matter if untyped matter has multiple representations So I think i'm more of a +0 now on the idea since I think a shredder could do the optimization even if it doesn't know the actual type isn't fixed. |
For variant, I agree that there isn't much value here. Every variant can contain a different fixed length so there isn't a significant difference from binary. I'd avoid adding this and increasing the size of the spec and the number of types that must be supported. |
Yes, I agree. There isn't much value in the semi-structured Variant world, where there is no "global schema" of Variant values. I'd prefer to keep the spec more simple, and just use the Variant |
If we want to allow this type of shredding lets make sure to update the spec. CC @rdblue |
For now, I'm closing the request based on the discussion above. |
Rationale for this change
What changes are included in this PR?
fixed(L) type is not added for variant.
Update the spec to add the presentation for fixed(L): the type is fixed and the value includes the length and byte array.
Do these changes have PoC implementations?
Closes #480