-
Notifications
You must be signed in to change notification settings - Fork 466
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
persist: Refactor stats away from Box<dyn DynStats>
and instead use an enum
#27783
persist: Refactor stats away from Box<dyn DynStats>
and instead use an enum
#27783
Conversation
#[derive(Debug, Clone)] | ||
pub struct ColumnarStats { | ||
/// Expected to be `None` if the associated column is non-nullable. | ||
pub nulls: Option<ColumnNullStats>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How would you feel about inlining the null count here, and setting it to zero for a non-nullable type?
Seems like it could simplify some downstream code, and I'm not sure the distinction between None
and Some(0)
is buying us any safety...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I nested the inner struct so we could differentiate between columns that could have nulls but does not contain any (Some(0)
) vs non-nullable columns (None
). I don't know if making that distinction is necessary though, so I'm more than happy to flatten it!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know if making that distinction is necessary though [...]
Right - I suspected it's not, but you're the one who will know for sure!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not going to inline the null count for now, in a stacked PR being able to make the extra assertions is kind of nice. Can definitely change this later on though if we want!
a0d3bfe
to
852cf34
Compare
This PR refactors stats handling away from using
Box<dyn DynStats>
and to using a concrete type calledColumnarStats
, which is structured very similar to theProtoDynStats
message type, i.e. it has an innerenum
of all the stats types that we can match against.It should not result in any behavioral change, the
statistics_stability
test that asserts the generated statistics match a snapshot, is still passing.Previously the flow worked like
ProtoDynStats
->Box<dyn DynStats>
-downcast->T::Stats
. The flow in this PR is still very similar,ProtoDynStats
->ColumnarStats
->T::Stats
, but nowColumnarStats
contains an enumColumnStatsKind
that we can match against in a follow up PR.Motivation
Related to #24830
Related to #27084
As we write structured data for all of our column types we'll evolve the kind of stats we keep. Currently the kind of stats are 1:1 with the implementation of
trait Data
that has an associatedtype Stats
. Today we downcast aBox<dyn DynStats>
to this associatedStats
type, but this doesn't work if there are possibly two different kinds of stats that might exist for a column type. Also, I recently re-worked our column encoders away from theData
trait, and thus the associatedStats
type as well.We don't do it in this PR, but we're now setup to
match
a column type against aColumnStatsKind
enum which will make it easy to evolve stats over time.Alternatives
There are two alternatives to
match
-ing between column type and stats type.Part
and then given a version and column type, downcast to a specific kind of stats. This approach doesn't require a refactor but I think we'd still end up with a bigmatch
statement?Checklist
$T ⇔ Proto$T
mapping (possibly in a backwards-incompatible way), then it is tagged with aT-proto
label.