persist: Refactor stats away from `Box<dyn DynStats>` and instead use an enum #27783

ParkMyCar · 2024-06-20T21:10:31Z

This PR refactors stats handling away from Box<dyn DynStats> and toward a concrete type called ColumnarStats, which is structured very similarly to the ProtoDynStats message type: it has an inner enum of all the stats types that we can match against.

It should not result in any behavioral change: the statistics_stability test, which asserts that the generated statistics match a snapshot, still passes.

Previously the flow was ProtoDynStats -> Box<dyn DynStats> -downcast-> T::Stats. The flow in this PR is very similar, ProtoDynStats -> ColumnarStats -> T::Stats, but ColumnarStats now contains an enum, ColumnStatsKind, that we can match against in a follow-up PR.
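As a rough illustration of the ColumnarStats -> T::Stats step, here is a minimal sketch. The variant names and payloads of ColumnStatsKind, the I64Stats stand-in, and the try_as_i64 helper are all assumptions made up for this example; the real definitions in src/persist-types/src/stats.rs are richer.

```rust
// Hypothetical, simplified stand-ins for the types named in this PR.
#[derive(Debug, Clone, PartialEq)]
pub struct ColumnNullStats {
    pub count: usize,
}

// Variant names and payloads are assumptions for illustration only.
#[derive(Debug, Clone, PartialEq)]
pub enum ColumnStatsKind {
    Primitive { lower: i64, upper: i64 },
    Bytes { lower: Vec<u8>, upper: Vec<u8> },
    None,
}

#[derive(Debug, Clone, PartialEq)]
pub struct ColumnarStats {
    /// Expected to be `None` if the associated column is non-nullable.
    pub nulls: Option<ColumnNullStats>,
    pub values: ColumnStatsKind,
}

/// Hypothetical stand-in for one column type's concrete `Stats`.
#[derive(Debug, PartialEq)]
pub struct I64Stats {
    pub lower: i64,
    pub upper: i64,
}

impl ColumnarStats {
    /// ColumnarStats -> T::Stats by matching on the enum instead of downcasting.
    pub fn try_as_i64(&self) -> Option<I64Stats> {
        match &self.values {
            ColumnStatsKind::Primitive { lower, upper } => Some(I64Stats {
                lower: *lower,
                upper: *upper,
            }),
            // Any other kind of stats does not describe an i64 column.
            ColumnStatsKind::Bytes { .. } | ColumnStatsKind::None => None,
        }
    }
}
```

Because the match is exhaustive, adding a new variant to the enum would make the compiler point at every conversion that needs updating.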

Motivation

Related to #24830
Related to #27084

As we write structured data for all of our column types, we'll evolve the kinds of stats we keep. Currently the kinds of stats are 1:1 with the implementations of trait Data, which has an associated type Stats. Today we downcast a Box<dyn DynStats> to this associated Stats type, but that doesn't work if two different kinds of stats might exist for a column type. Also, I recently reworked our column encoders away from the Data trait, and thus away from the associated Stats type as well.

We don't do it in this PR, but we're now set up to match a column type against a ColumnStatsKind enum, which will make it easy to evolve stats over time.
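To make the downcast limitation concrete, here is a minimal sketch of the old shape. The DynStats and Data traits below are cut-down stand-ins, and StringCol, StringStatsV1 and StringStatsV2 are hypothetical names invented for the example, not types from the codebase.

```rust
use std::any::Any;

// Cut-down stand-in for the `DynStats` trait; the real trait has more methods.
pub trait DynStats: std::fmt::Debug {
    fn as_any(&self) -> &dyn Any;
}

// Two hypothetical kinds of stats that could exist for the same column type.
#[derive(Debug, PartialEq)]
pub struct StringStatsV1 {
    pub lower: String,
    pub upper: String,
}

#[derive(Debug, PartialEq)]
pub struct StringStatsV2 {
    pub prefix: String,
}

impl DynStats for StringStatsV1 {
    fn as_any(&self) -> &dyn Any {
        self
    }
}

impl DynStats for StringStatsV2 {
    fn as_any(&self) -> &dyn Any {
        self
    }
}

// Cut-down stand-in for `trait Data`: exactly one `Stats` type per column type.
pub trait Data {
    type Stats: DynStats + 'static;
}

pub struct StringCol;

impl Data for StringCol {
    type Stats = StringStatsV1;
}

/// Box<dyn DynStats> -downcast-> T::Stats. Only succeeds for the single
/// associated `Stats` type, so a second kind of stats is silently rejected.
pub fn downcast_stats<T: Data>(stats: &dyn DynStats) -> Option<&T::Stats> {
    stats.as_any().downcast_ref::<T::Stats>()
}
```

In this sketch, stats written as StringStatsV2 can never be recovered through StringCol, because the associated type pins the downcast target to StringStatsV1.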

Alternatives

There are two alternatives to matching between column type and stats type:

1. For a given column type, downcast to known stats types until we find a match. This works, but feels very non-Rusty.
2. Provide a version number with each Part and then given a version and column type, downcast to a specific kind of stats. This approach doesn't require a refactor but I think we'd still end up with a big match statement?
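For comparison, the first alternative might look roughly like the sketch below. StatsA, StatsB, Found and probe are hypothetical names used only for illustration.

```rust
use std::any::Any;

// Two hypothetical stats types a column might carry.
#[derive(Debug, PartialEq)]
pub struct StatsA {
    pub max: i64,
}

#[derive(Debug, PartialEq)]
pub struct StatsB {
    pub max_len: usize,
}

#[derive(Debug, PartialEq)]
pub enum Found<'a> {
    A(&'a StatsA),
    B(&'a StatsB),
}

/// Try each known stats type in turn until one downcast succeeds.
pub fn probe(stats: &dyn Any) -> Option<Found<'_>> {
    if let Some(a) = stats.downcast_ref::<StatsA>() {
        Some(Found::A(a))
    } else if let Some(b) = stats.downcast_ref::<StatsB>() {
        Some(Found::B(b))
    } else {
        // Nothing forces this chain to cover every stats type that exists.
        None
    }
}
```

Unlike a match on an enum, the compiler cannot check that this chain is exhaustive, which is one way to read the "non-Rusty" concern.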

Checklist

This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
This PR includes the following user-facing behavior changes:
- N/a

bkirwi · 2024-06-20T22:05:39Z

src/persist-types/src/stats.rs

+#[derive(Debug, Clone)]
+pub struct ColumnarStats {
+    /// Expected to be `None` if the associated column is non-nullable.
+    pub nulls: Option<ColumnNullStats>,


How would you feel about inlining the null count here, and setting it to zero for a non-nullable type?

Seems like it could simplify some downstream code, and I'm not sure the distinction between None and Some(0) is buying us any safety...

ParkMyCar · 2024-06-21

I nested the inner struct so we could differentiate between columns that could have nulls but do not contain any (Some(0)) and non-nullable columns (None). I don't know if making that distinction is necessary though, so I'm more than happy to flatten it!

bkirwi · 2024-06-21

> I don't know if making that distinction is necessary though [...]

Right - I suspected it's not, but you're the one who will know for sure!

ParkMyCar · 2024-06-24

I'm not going to inline the null count for now; in a stacked PR, being able to make the extra assertions is kind of nice. Can definitely change this later on though if we want!
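A small sketch of the distinction discussed in this thread, assuming ColumnNullStats carries a single count field (the field name is an assumption here). The describe_nulls and flattened_count helpers are hypothetical.

```rust
// Assumed shape of the nested struct; the field name is a guess for illustration.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct ColumnNullStats {
    pub count: usize,
}

/// Nested form: `None` and `Some(0)` carry different meanings.
pub fn describe_nulls(nulls: Option<ColumnNullStats>) -> &'static str {
    match nulls {
        None => "non-nullable column",
        Some(ColumnNullStats { count: 0 }) => "nullable column, no nulls present",
        Some(_) => "nullable column, some nulls present",
    }
}

/// Flattened form suggested in review: non-nullable columns report zero nulls.
pub fn flattened_count(nulls: Option<ColumnNullStats>) -> usize {
    nulls.map_or(0, |n| n.count)
}
```

Flattening collapses the first two cases into the same value, which is simpler downstream but removes the ability to assert that a non-nullable column never carries null stats.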

src/persist-types/src/stats/primitive.rs

src/storage-types/src/stats.rs

ParkMyCar added 2 commits June 20, 2024 16:26

start, refactor away from DynStats

93e95b3

add comment to DynStats trait

1e3fb8f

ParkMyCar requested review from a team as code owners June 20, 2024 21:10

ParkMyCar requested a review from bkirwi June 20, 2024 21:10

bkirwi reviewed Jun 20, 2024

small refactor

852cf34

ParkMyCar force-pushed the persist/remove-dyn-stats branch from a0d3bfe to 852cf34 Compare June 21, 2024 15:36

bkirwi approved these changes Jun 21, 2024

ParkMyCar merged commit f9818bb into MaterializeInc:main Jun 24, 2024
76 checks passed

materialize-bot mentioned this pull request Jun 27, 2024

release: v0.106.0 required reviews #27930

Closed

7 tasks

bkirwi Jun 20, 2024

ParkMyCar Jun 21, 2024

bkirwi Jun 21, 2024

ParkMyCar Jun 24, 2024
