Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

persist: Add FixedSizeBytesStats #27856

Merged

Conversation

ParkMyCar
Copy link
Member

@ParkMyCar ParkMyCar commented Jun 24, 2024

This PR adds a new variant of BytesStats called FixedSizedBytesStats. It's very similar to AtomicBytesStats in that they can't be trimmed, but they're explicitly for types that implement the new FixedSizeCodec.

Motivation

Related to #24830

As we evolve our encodings in Persist, what bytes are recorded will change. For example, today stats for ScalarType::Time are AtomicBytesStats of a ProtoNaiveTime, but with our new columnar encodings it will be a PackedNaiveTime. We need a way to distinguish between the two.

An alternative is adding a "kind" field to AtomicBytesStats, I originally implemented this but we determined the better approach would be a whole separate type which allows for a cleaner separation.

Checklist

Copy link
Contributor

@bkirwi bkirwi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC we discussed using a part-level version number to manage migrations at the actual data level; wondering if that would make sense here.

Personally I kind of appreciate the PackedBytesStats idea... while in many ways it is similar to the atomic stats as you mention, there are important differences: it has a meaningful sort, and it can be computed off the columnar representation instead of the raw data. (One option would be to call it FixedLengthBytesStats or whatever, which stresses that while the proto types are similar we're dealing with a different sort of thing at the arrow level.)

src/repr/src/row/encoding.rs Outdated Show resolved Hide resolved
@ParkMyCar
Copy link
Member Author

IIRC we discussed using a part-level version number to manage migrations at the actual data level; wondering if that would make sense here.

For sure! I thought about adding a version number to stats, and it definitely should be possible, just a bit heavier weight. In the meantime I realized at the moment the only type of stats that we really need to evolve are the AtomicBytesStats and just adding a "kind" here was faster than adding a version.

Personally I kind of appreciate the PackedBytesStats idea... while in many ways it is similar to the atomic stats as you mention, there are important differences: it has a meaningful sort, and it can be computed off the columnar representation instead of the raw data. (One option would be to call it FixedLengthBytesStats or whatever, which stresses that while the proto types are similar we're dealing with a different sort of thing at the arrow level.)

Totally fair, and it is a cleaner separation between the older stats types and the newer types. I'll update this PR to add a PackedBytesStats instead of adding a "kind" to AtomicBytesStats

@ParkMyCar ParkMyCar force-pushed the persist/stats-add-kind-to-atomic-byte branch from 682cf9f to 5849291 Compare June 25, 2024 15:54
@ParkMyCar ParkMyCar changed the title perist: Add "kind" to AtomicBytesStats perist: Add FixedSizeBytesStats Jun 25, 2024
@antiguru antiguru changed the title perist: Add FixedSizeBytesStats persist: Add FixedSizeBytesStats Jun 26, 2024
Copy link
Contributor

@bkirwi bkirwi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🙏

@ParkMyCar ParkMyCar merged commit 32430f0 into MaterializeInc:main Jun 26, 2024
76 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants