Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add fixed(L) type to variant spec #481

Closed
wants to merge 1 commit into from

Conversation

aihuaxu
Copy link
Contributor

@aihuaxu aihuaxu commented Jan 16, 2025

Rationale for this change

What changes are included in this PR?

fixed(L) type is not added for variant.
Update the spec to add the presentation for fixed(L): the type is fixed and the value includes the length and byte array.

Do these changes have PoC implementations?

Closes #480

@aihuaxu
Copy link
Contributor Author

aihuaxu commented Jan 16, 2025

cc @emkornfield @gene-db and @RussellSpitzer

@@ -399,6 +399,7 @@ The Decimal type contains a scale, but no precision. The implied precision of a
| Timestamp | timestamp with time zone | `22` | TIMESTAMP(isAdjustedToUTC=true, NANOS) | 8-byte little-endian |
| TimestampNTZ | timestamp without time zone | `23` | TIMESTAMP(isAdjustedToUTC=false, NANOS) | 8-byte little-endian |
| UUID | uuid | `24` | UUID | 16-byte big-endian |
| Fixed(L) | Byte array of length L | `25` | FIXED_LEN_BYTE_ARRAY[L] | 4 byte little-endian size L, followed by length-L big-endian bytes |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why using big-endian bytes?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need the size? Shouldn't that be in the type description?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

big-endian bytes: this is to keep in sync with the others like UUID which is a fixed(16). And it makes sense to write a bytes in big endian since the engine can write the bytes in the buffer in order, not requiring buffering the whole string.

The required size: I initially avoided adding the fixed(L) type because I believed we couldn't support fixed(L) if we try to include L in the type description, as there wouldn't be enough bits available to represent the length, given that only 5 bits are allocated for the type.

The way here to add the fixed(L) type is to add the length in the value field - we are duplicating the length for each value but I don't see other ways.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • I'm not sure endianness makes sense for fixed(L), endianess only applies to multi-bytes structures? Fixed(L) each bytes is independent.

  • I think the current proposal is reasonable and matches how things like decimal with arbitrary precisions are encoded. It is also consistent with string representation, if we are worried about overhead of 4 bytes then we could use a variable width encoding schema (or have two types Short-fixed(L) with 1 byte and fixed(L) with 4 buytes. Unfortunately, IIUC we can't have a 'short-fixed L' like we have for string because I think we are already use the entire number range there.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re: Size - makes sense, i forgot we are storing type information for every row anyway so it probably makes more sense to store it here in the value section of the Variant.

I agree with @emkornfield that we probably don't need to specify endianess.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. Make sense that we don't need endianness.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@emkornfield number range - you mean the type range? We can have from 0 - 31 with a few left. I think it makes sense to to add Short-fixed(L). Let me know if you agree.

we can't have a 'short-fixed L' like we have for string because I think we are already use the entire number range there.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@emkornfield number range - you mean the type range? We can have from 0 - 31 with a few left. I think it makes sense to to add Short-fixed(L). Let me know if you agree.

we can't have a 'short-fixed L' like we have for string because I think we are already use the entire number range there.

I was talking about the range for "basic type", I'm OK adding it as a separate "primitive type" though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh. I see. Yeah. We have used all of them for "basic type". We can't add like short-string.

Let me make the change and share with you.

@aihuaxu aihuaxu force-pushed the add-fixed-type-variant branch from bf150b8 to ccdd571 Compare January 16, 2025 21:26
@gene-db
Copy link
Contributor

gene-db commented Jan 17, 2025

@aihuaxu I am not sure why we need this type in the Variant binary encoding. Doesn't this just duplicate the binary type? We don't need two different ways to store binary data.

@aihuaxu
Copy link
Contributor Author

aihuaxu commented Jan 18, 2025

@aihuaxu I am not sure why we need this type in the Variant binary encoding. Doesn't this just duplicate the binary type? We don't need two different ways to store binary data.

Yeah. You are right. The storage is the same as the binary since we can't have the length in the type description. Really no need to add that. We can just use binary to store fixed(L). cc @RussellSpitzer and @emkornfield

@emkornfield
Copy link
Contributor

@aihuaxu I am not sure why we need this type in the Variant binary encoding. Doesn't this just duplicate the binary type? We don't need two different ways to store binary data.

Yeah. You are right. The storage is the same as the binary since we can't have the length in the type description. Really no need to add that. We can just use binary to store fixed(L). cc @RussellSpitzer and @emkornfield

So I think the difference here is semantics. I don't have a strong opinion one way or another, but both Parquet natively and iceberg distinguish between bytes and Fixed(L). One could argue thisis purely for optimization purposes, and when we are paying the cost anyway of storing individual lengths per field they are equivelant. Ultimately, the place where this would make a difference is shredding where Fixed(L) can be mapped to FLBA which would save some amount of storage.

I don't feel too strongly one way or another on adding the type.

@RussellSpitzer
Copy link
Member

RussellSpitzer commented Jan 21, 2025

@aihuaxu I am not sure why we need this type in the Variant binary encoding. Doesn't this just duplicate the binary type? We don't need two different ways to store binary data.

Yeah. You are right. The storage is the same as the binary since we can't have the length in the type description. Really no need to add that. We can just use binary to store fixed(L). cc @RussellSpitzer and @emkornfield

So I think the difference here is semantics. I don't have a strong opinion one way or another, but both Parquet natively and iceberg distinguish between bytes and Fixed(L). One could argue thisis purely for optimization purposes, and when we are paying the cost anyway of storing individual lengths per field they are equivelant. Ultimately, the place where this would make a difference is shredding where Fixed(L) can be mapped to FLBA which would save some amount of storage.

I don't feel too strongly one way or another on adding the type.

I think this is one of those cases where an engine shredding a variable length binary could decide whether the shredded type can become a fixed length when shredding. So it probably doesn't matter if untyped matter has multiple representations

So I think i'm more of a +0 now on the idea since I think a shredder could do the optimization even if it doesn't know the actual type isn't fixed.

@rdblue
Copy link
Contributor

rdblue commented Jan 21, 2025

For variant, I agree that there isn't much value here. Every variant can contain a different fixed length so there isn't a significant difference from binary. I'd avoid adding this and increasing the size of the spec and the number of types that must be supported.

@gene-db
Copy link
Contributor

gene-db commented Jan 21, 2025

For variant, I agree that there isn't much value here. Every variant can contain a different fixed length so there isn't a significant difference from binary. I'd avoid adding this and increasing the size of the spec and the number of types that must be supported.

Yes, I agree. There isn't much value in the semi-structured Variant world, where there is no "global schema" of Variant values. I'd prefer to keep the spec more simple, and just use the Variant binary type for bytes.

@emkornfield
Copy link
Contributor

I think this is one of those cases where an engine shredding a variable length binary could decide whether the shredded type can become a fixed length when shredding. So it probably doesn't matter if untyped matter has multiple representations

If we want to allow this type of shredding lets make sure to update the spec. CC @rdblue

@aihuaxu
Copy link
Contributor Author

aihuaxu commented Jan 24, 2025

For now, I'm closing the request based on the discussion above.

@aihuaxu aihuaxu closed this Jan 24, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

MINOR: add fixed(L) data type to Variant spec
6 participants