persist: Refactor stats away from Box<dyn DynStats> and instead use an enum #27783

Merged 3 commits on Jun 24, 2024

@ParkMyCar (Member) commented Jun 20, 2024

This PR refactors stats handling away from Box<dyn DynStats> and toward a concrete type called ColumnarStats, which is structured very similarly to the ProtoDynStats message type, i.e. it has an inner enum of all the stats types that we can match against.

It should not result in any behavioral change: the statistics_stability test, which asserts that the generated statistics match a snapshot, still passes.

Previously the flow was ProtoDynStats -> Box<dyn DynStats> -downcast-> T::Stats. The flow in this PR is still very similar, ProtoDynStats -> ColumnarStats -> T::Stats, but now ColumnarStats contains an enum ColumnStatsKind that we can match against in a follow-up PR.
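The ColumnarStats -> T::Stats step can be sketched roughly as below. This is a minimal illustration, not the code from this PR: the `values` field name, the `ColumnStatsKind` variants, `PrimitiveStats`, and the `as_u64` helper are all assumptions made for the example. Only `ColumnarStats`, `ColumnStatsKind`, `ColumnNullStats`, and the `nulls` field are named in the PR itself.

```rust
// Hypothetical, simplified stand-ins for the types discussed in this PR.
#[derive(Debug, Clone, PartialEq)]
pub struct PrimitiveStats<T> {
    pub lower: T,
    pub upper: T,
}

// Assumed variants; the real enum covers every stats type.
#[derive(Debug, Clone, PartialEq)]
pub enum ColumnStatsKind {
    U64(PrimitiveStats<u64>),
    String(PrimitiveStats<String>),
    None,
}

#[derive(Debug, Clone, PartialEq)]
pub struct ColumnNullStats {
    pub count: usize,
}

#[derive(Debug, Clone, PartialEq)]
pub struct ColumnarStats {
    /// Expected to be `None` if the associated column is non-nullable.
    pub nulls: Option<ColumnNullStats>,
    /// Assumed field name for the inner enum.
    pub values: ColumnStatsKind,
}

impl ColumnarStats {
    /// Recover concrete stats by matching on the enum, where the old flow
    /// would have downcast a `Box<dyn DynStats>`.
    pub fn as_u64(&self) -> Option<&PrimitiveStats<u64>> {
        match &self.values {
            ColumnStatsKind::U64(stats) => Some(stats),
            _ => None,
        }
    }
}
```

A mismatch between the requested type and the stored variant surfaces as `None` here, much as a failed downcast would, but the full set of possible stats kinds is now visible in the type.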

Motivation

Related to #24830
Related to #27084

As we write structured data for all of our column types, we'll evolve the kinds of stats we keep. Currently the kinds of stats are 1:1 with the implementations of trait Data, which has an associated type Stats. Today we downcast a Box<dyn DynStats> to this associated Stats type, but this doesn't work if two different kinds of stats might exist for a column type. Also, I recently reworked our column encoders away from the Data trait, and thus away from the associated Stats type as well.

We don't do it in this PR, but we're now set up to match a column type against a ColumnStatsKind enum, which will make it easy to evolve stats over time.
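To illustrate what matching a column type against a stats kind might look like once a column type can carry more than one kind of stats, here is a small sketch. Everything in it is hypothetical: `ColumnType`, `StatsKind`, the `StringLen` variant, and `describe` are invented for the example and are not part of this PR or the follow-up.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ColumnType {
    U64,
    String,
}

#[derive(Debug, Clone, PartialEq)]
pub enum StatsKind {
    U64 { lower: u64, upper: u64 },
    String { lower: String, upper: String },
    // A hypothetical second kind of stats for string columns, added later.
    StringLen { min_len: usize, max_len: usize },
}

/// Match on the (column type, stats kind) pair. A column type may accept
/// several stats kinds; any other combination is reported as an error.
pub fn describe(col: ColumnType, stats: &StatsKind) -> Result<String, String> {
    match (col, stats) {
        (ColumnType::U64, StatsKind::U64 { lower, upper }) => {
            Ok(format!("u64 in [{}, {}]", lower, upper))
        }
        (ColumnType::String, StatsKind::String { lower, upper }) => {
            Ok(format!("string in [{:?}, {:?}]", lower, upper))
        }
        (ColumnType::String, StatsKind::StringLen { min_len, max_len }) => {
            Ok(format!("string length in [{}, {}]", min_len, max_len))
        }
        (col, stats) => Err(format!(
            "stats {:?} do not apply to column type {:?}",
            stats, col
        )),
    }
}
```

With a 1:1 associated type, the `StringLen` arm would have no place to live. With an enum, supporting a new kind of stats means adding a variant and a match arm.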

Alternatives

There are two alternatives to match-ing between column type and stats type.

  1. For a given column type, downcast to known stats types until we find a match. This works, but feels very non-Rusty.
  2. Provide a version number with each Part and then given a version and column type, downcast to a specific kind of stats. This approach doesn't require a refactor but I think we'd still end up with a big match statement?
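For comparison, alternative 1 (downcast to known stats types until one matches) might look roughly like this. It is a sketch using `std::any::Any` directly; `U64Stats`, `StrStats`, and `lower_bound_via_downcast` are hypothetical names, and the real `DynStats` trait is not modeled here.

```rust
use std::any::Any;

// Hypothetical concrete stats types for two column types.
#[derive(Debug, PartialEq)]
pub struct U64Stats {
    pub lower: u64,
    pub upper: u64,
}

#[derive(Debug, PartialEq)]
pub struct StrStats {
    pub lower: String,
    pub upper: String,
}

/// Try each known stats type in turn until a downcast succeeds.
/// Nothing forces this chain to stay in sync with the set of stats types:
/// a type that is missing here silently falls through to `None`.
pub fn lower_bound_via_downcast(stats: &dyn Any) -> Option<String> {
    if let Some(s) = stats.downcast_ref::<U64Stats>() {
        return Some(s.lower.to_string());
    }
    if let Some(s) = stats.downcast_ref::<StrStats>() {
        return Some(s.lower.clone());
    }
    None
}
```

When calling this with a `Box<dyn Any>`, pass `&*boxed` so the trait object inside the box is examined. The contrast with the enum approach is that a `match` without a wildcard arm fails to compile when a variant is left unhandled, while the chain above does not.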

@ParkMyCar ParkMyCar requested review from a team as code owners June 20, 2024 21:10
@ParkMyCar ParkMyCar requested a review from bkirwi June 20, 2024 21:10
#[derive(Debug, Clone)]
pub struct ColumnarStats {
    /// Expected to be `None` if the associated column is non-nullable.
    pub nulls: Option<ColumnNullStats>,
Contributor

How would you feel about inlining the null count here, and setting it to zero for a non-nullable type?

Seems like it could simplify some downstream code, and I'm not sure the distinction between None and Some(0) is buying us any safety...

Member Author

I nested the inner struct so we could differentiate between columns that could have nulls but do not contain any (Some(0)) and non-nullable columns (None). I don't know if making that distinction is necessary though, so I'm more than happy to flatten it!

Contributor

I don't know if making that distinction is necessary though [...]

Right - I suspected it's not, but you're the one who will know for sure!

Member Author

I'm not going to inline the null count for now; in a stacked PR, being able to make the extra assertions is kind of nice. We can definitely change this later on if we want!
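The trade-off discussed in this thread can be sketched as follows. The actual assertions in the stacked PR are not shown here, so the checks below are assumptions about the kind of validation each shape allows; `NullStats`, `check_nested`, and `check_flat` are hypothetical names.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct NullStats {
    pub count: usize,
}

/// Nested form: `None` means "non-nullable column", `Some(n)` means
/// "nullable column with n nulls". Both mismatches are detectable.
pub fn check_nested(nullable: bool, nulls: Option<NullStats>) -> Result<usize, String> {
    match (nullable, nulls) {
        (true, Some(n)) => Ok(n.count),
        (false, None) => Ok(0),
        (true, None) => Err("nullable column is missing null stats".to_string()),
        (false, Some(_)) => Err("non-nullable column has null stats".to_string()),
    }
}

/// Flattened form: a plain count, zero for non-nullable columns.
/// Only a non-zero count on a non-nullable column can be detected.
pub fn check_flat(nullable: bool, null_count: usize) -> Result<usize, String> {
    if !nullable && null_count != 0 {
        Err("non-nullable column reports nulls".to_string())
    } else {
        Ok(null_count)
    }
}
```

The flattened form is simpler for downstream code, as suggested in the review. The nested form additionally lets a caller notice stats that were written for a column with different nullability.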

@ParkMyCar ParkMyCar merged commit f9818bb into MaterializeInc:main Jun 24, 2024
76 checks passed