Add fixed(L) type to variant spec #481

aihuaxu · 2025-01-16T02:03:39Z

Rationale for this change

What changes are included in this PR?

The fixed(L) type has not yet been added to Variant.
Update the spec to add the representation for fixed(L): the type is fixed, and the value includes the length followed by the byte array.

Do these changes have PoC implementations?

Closes #480

aihuaxu · 2025-01-16T02:06:37Z

cc @emkornfield @gene-db and @RussellSpitzer

wgtmac · 2025-01-16T15:18:14Z

VariantEncoding.md

@@ -399,6 +399,7 @@ The Decimal type contains a scale, but no precision. The implied precision of a
 | Timestamp            | timestamp with time zone   | `22`    | TIMESTAMP(isAdjustedToUTC=true, NANOS)       | 8-byte little-endian                                                                        |
 | TimestampNTZ         | timestamp without time zone | `23`    | TIMESTAMP(isAdjustedToUTC=false, NANOS)      | 8-byte little-endian                                                                        |
 | UUID                 | uuid                        | `24`    | UUID                         | 16-byte big-endian                                                                         |
+| Fixed(L)             | Byte array of length L | `25`    | FIXED_LEN_BYTE_ARRAY[L]     |  4 byte little-endian size L, followed by length-L big-endian bytes    |


Why use big-endian bytes?

Why do we need the size? Shouldn't that be in the type description?

Big-endian bytes: this keeps it in sync with other types such as UUID, which is a fixed(16). It also makes sense to write the bytes in big-endian order, since the engine can write them into the buffer in order without buffering the whole string.

The required size: I initially avoided adding the fixed(L) type because I believed we couldn't support fixed(L) if we tried to include L in the type description, as there wouldn't be enough bits available to represent the length, given that only 5 bits are allocated for the type.

The approach here is to put the length in the value field. This duplicates the length for each value, but I don't see another way.
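
To make the proposed layout concrete, here is a minimal Python sketch of the value payload as described in the diff row above: a 4-byte little-endian length L followed by the L raw bytes. The helper names `encode_fixed` and `decode_fixed` are hypothetical, and the sketch deliberately leaves out the value header that carries the type ID.

```python
import struct


def encode_fixed(data: bytes) -> bytes:
    """Hypothetical helper: 4-byte little-endian length, then the raw bytes."""
    return struct.pack("<I", len(data)) + data


def decode_fixed(buf: bytes) -> bytes:
    """Read the length prefix, then return exactly that many bytes."""
    (length,) = struct.unpack_from("<I", buf, 0)
    payload = buf[4:4 + length]
    if len(payload) != length:
        raise ValueError("buffer is shorter than the declared length")
    return payload


encoded = encode_fixed(b"\x01\x02\x03")
print(encoded.hex())  # 03000000010203
```

Because the length travels with every value, each fixed(L) value costs 4 extra bytes in this layout regardless of L.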

I'm not sure endianness makes sense for fixed(L); endianness only applies to multi-byte structures? In Fixed(L), each byte is independent.

I think the current proposal is reasonable and matches how things like decimal with arbitrary precisions are encoded. It is also consistent with the string representation. If we are worried about the overhead of 4 bytes, we could use a variable-width encoding scheme (or have two types: Short-fixed(L) with a 1-byte length and fixed(L) with a 4-byte length). Unfortunately, IIUC we can't have a 'short-fixed L' like we have for string because I think we already use the entire number range there.

Re: Size - makes sense, I forgot we are storing type information for every row anyway, so it probably makes more sense to store it here in the value section of the Variant.

I agree with @emkornfield that we probably don't need to specify endianness.

Yeah. Makes sense that we don't need endianness.

@emkornfield number range - you mean the type range? We can have from 0 - 31 with a few left. I think it makes sense to add Short-fixed(L). Let me know if you agree.

I was talking about the range for "basic type", I'm OK adding it as a separate "primitive type" though.

Oh. I see. Yeah. We have used all of them for "basic type". We can't add something like short-string.

Let me make the change and share with you.
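
As a rough illustration of the overhead trade-off discussed above, the following Python sketch compares a 4-byte length prefix with a hypothetical Short-fixed(L) that uses a 1-byte prefix. The function name `encode_length_prefixed` and the 255-byte cutoff are assumptions made for this sketch; no such type was ever specified in this thread.

```python
import struct


def encode_length_prefixed(data: bytes, short: bool) -> bytes:
    """Hypothetical encoder: 1-byte prefix if short, else 4-byte little-endian."""
    if short:
        if len(data) > 255:
            raise ValueError("a 1-byte length prefix can only describe L <= 255")
        return struct.pack("<B", len(data)) + data
    return struct.pack("<I", len(data)) + data


sixteen_bytes = bytes(16)
print(len(encode_length_prefixed(sixteen_bytes, short=True)))   # 17
print(len(encode_length_prefixed(sixteen_bytes, short=False)))  # 20
```

For small values the short form saves 3 bytes per value, at the cost of one more primitive type ID.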

gene-db · 2025-01-17T19:30:30Z

@aihuaxu I am not sure why we need this type in the Variant binary encoding. Doesn't this just duplicate the binary type? We don't need two different ways to store binary data.

aihuaxu · 2025-01-18T03:42:42Z

Yeah. You are right. The storage is the same as the binary since we can't have the length in the type description. Really no need to add that. We can just use binary to store fixed(L). cc @RussellSpitzer and @emkornfield

emkornfield · 2025-01-18T15:05:12Z

So I think the difference here is semantics. I don't have a strong opinion one way or another, but both Parquet natively and Iceberg distinguish between bytes and Fixed(L). One could argue this is purely for optimization purposes, and when we are paying the cost anyway of storing individual lengths per field they are equivalent. Ultimately, the place where this would make a difference is shredding, where Fixed(L) can be mapped to FLBA, which would save some amount of storage.

I don't feel too strongly one way or another on adding the type.

RussellSpitzer · 2025-01-21T17:47:20Z

I think this is one of those cases where an engine shredding a variable-length binary could decide whether the shredded type can become a fixed length when shredding. So it probably doesn't matter if untyped matter has multiple representations.

So I think I'm more of a +0 now on the idea since I think a shredder could do the optimization even if it doesn't know the actual type isn't fixed.
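
The shredder-side optimization described here can be sketched in a few lines of Python. This is only an illustration under the assumption that the shredder sees all binary values for a field before choosing a column type; `common_fixed_length` is a hypothetical helper, not part of any Parquet or Variant API.

```python
from typing import Iterable, Optional


def common_fixed_length(values: Iterable[bytes]) -> Optional[int]:
    """Return L if every value has the same length L, otherwise None."""
    lengths = {len(v) for v in values}
    if len(lengths) == 1:
        return lengths.pop()
    return None  # mixed lengths (or no values): keep variable-length binary


print(common_fixed_length([b"abcd", b"wxyz"]))  # 4
print(common_fixed_length([b"abcd", b"xy"]))    # None
```

When the helper returns a length L, an engine could choose a fixed-length column of that width for the shredded field; otherwise it would fall back to variable-length binary.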

rdblue · 2025-01-21T21:16:22Z

For variant, I agree that there isn't much value here. Every variant can contain a different fixed length so there isn't a significant difference from binary. I'd avoid adding this and increasing the size of the spec and the number of types that must be supported.

gene-db · 2025-01-21T21:55:37Z

Yes, I agree. There isn't much value in the semi-structured Variant world, where there is no "global schema" of Variant values. I'd prefer to keep the spec simpler, and just use the Variant binary type for bytes.

emkornfield · 2025-01-22T17:20:35Z

> I think this is one of those cases where an engine shredding a variable length binary could decide whether the shredded type can become a fixed length when shredding. So it probably doesn't matter if untyped matter has multiple representations

If we want to allow this type of shredding, let's make sure to update the spec. CC @rdblue

aihuaxu · 2025-01-24T00:15:19Z

For now, I'm closing the request based on the discussion above.

wgtmac reviewed Jan 16, 2025

aihuaxu force-pushed the add-fixed-type-variant branch from bf150b8 to ccdd571 Compare January 16, 2025 21:26

Add fixed(L) type to variant spec

a5d8502

aihuaxu force-pushed the add-fixed-type-variant branch from ccdd571 to a5d8502 Compare January 16, 2025 23:41

aihuaxu requested review from emkornfield, RussellSpitzer and wgtmac January 16, 2025 23:42

aihuaxu closed this Jan 24, 2025
