persist: Stats for the new Columnar encoders #27857
4e8367f to 2e21e03
the only type we do this for at the moment is jsonb.
Do we have plans to do this for other types?
Right now I suspect this trait is not pulling its weight; it's trying to support two mostly-disjoint use cases and is ~never used from a generic context. (For example, nearly all the include impls seem to be dead code?) If this is just an intermediate state, all good! But otherwise I suspect it would be more straightforward / concise to replace the trait with a set of freestanding functions and some JSON-specific business.
Second, it makes stats collection eager.
Appreciate this!
@@ -305,26 +305,31 @@ pub trait ColumnDecoder<T> {
pub trait ColumnEncoder<T> {
    /// Type of column that this encoder returns when finalized.
    type FinishedColumn: arrow::array::Array + Debug + 'static;
    /// Type of statistics this encoder returns when finalized.
    type FinishedStats: DynStats + 'static;
All callers of finish immediately convert the result of this to ColumnarStats. (Which is great - love that new type.) Could we make finish return ColumnarStats directly and skip the intermediate / associated type?
Good thought! I tried to do this originally, but what throws a small wrench in it is OptionStats. It's nice to have the implementation for OptionStats require that the inner type impl Into<ColumnStatsKind>, to prevent nested OptionStats. Not an impossible problem to solve, but a little tricky.
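To illustrate the constraint described above, here is a minimal sketch of how a bound on the inner type can rule out nested optional stats at compile time. Every type here is a simplified, hypothetical stand-in that only borrows the names from this thread (ColumnarStats, ColumnStatsKind, OptionStats); the fields and the PrimitiveStats struct are assumptions, not the definitions in the PR.

```rust
/// Hypothetical stand-in: the set of non-optional stats kinds.
#[derive(Debug, Clone, PartialEq)]
pub enum ColumnStatsKind {
    Primitive { lower: i64, upper: i64 },
}

/// Hypothetical stats for a primitive column.
#[derive(Debug, Clone, PartialEq)]
pub struct PrimitiveStats {
    pub lower: i64,
    pub upper: i64,
}

impl From<PrimitiveStats> for ColumnStatsKind {
    fn from(s: PrimitiveStats) -> Self {
        ColumnStatsKind::Primitive { lower: s.lower, upper: s.upper }
    }
}

/// Hypothetical wrapper that adds a null count to some inner stats.
#[derive(Debug, Clone, PartialEq)]
pub struct OptionStats<S> {
    pub none: usize,
    pub some: S,
}

/// Hypothetical final stats type: an optional null count plus one kind.
#[derive(Debug, Clone, PartialEq)]
pub struct ColumnarStats {
    pub nulls: Option<usize>,
    pub values: ColumnStatsKind,
}

// The conversion only exists when the inner type converts into a
// ColumnStatsKind. OptionStats itself has no such conversion, so
// OptionStats<OptionStats<_>> cannot become a ColumnarStats.
impl<S: Into<ColumnStatsKind>> From<OptionStats<S>> for ColumnarStats {
    fn from(s: OptionStats<S>) -> Self {
        ColumnarStats { nulls: Some(s.none), values: s.some.into() }
    }
}
```

With this shape, converting an OptionStats<PrimitiveStats> compiles, while converting an OptionStats<OptionStats<PrimitiveStats>> is rejected by the type checker. Returning ColumnarStats directly from finish would need some other way to express that restriction.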
src/persist-types/src/stats2.rs
Outdated
/// We collect stats for all primitive types in exactly the same way. This
/// macro de-duplicates some of that logic.
///
/// Note: If at any point someone finds this macro too complext, they should
- /// Note: If at any point someone finds this macro too complext, they should
+ /// Note: If at any point someone finds this macro too complex, they should
(I don't find it that complex, but I think it's a bit of a smell that we're generating all these include impls etc. that are never called...)
Totally fair, at the moment I wanted to maintain parity with the other trait impls, but can totally revisit this
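For readers following this thread without the diff open, the general pattern under discussion (a macro stamping out the same stats logic for each primitive type) might look roughly like the sketch below. The PrimitiveBounds trait and impl_primitive_bounds macro are hypothetical names invented for illustration; they are not the trait or macro in stats2.rs.

```rust
/// Hypothetical trait: compute lower/upper bounds over a slice of values.
pub trait PrimitiveBounds: Sized {
    /// Returns `None` for an empty slice, otherwise `(min, max)`.
    fn bounds(values: &[Self]) -> Option<(Self, Self)>;
}

/// Generates an identical `PrimitiveBounds` impl for each listed type.
macro_rules! impl_primitive_bounds {
    ($($t:ty),* $(,)?) => {
        $(
            impl PrimitiveBounds for $t {
                fn bounds(values: &[Self]) -> Option<(Self, Self)> {
                    let lower = values.iter().copied().min()?;
                    let upper = values.iter().copied().max()?;
                    Some((lower, upper))
                }
            }
        )*
    };
}

// One invocation covers every integer type we care about in this sketch.
impl_primitive_bounds!(u8, i32, i64, u64);
```

The trade-off raised above applies to this sketch too: the macro emits an impl for every listed type whether or not anything calls it.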
update stats2::col_values
3e2b695 to 743ef5e
I was thinking of incrementally collecting stats for all types that we persist some encoded version of, e.g. the
Good point! I'll keep the trait for now, but see how it evolves over the next few PRs or so.
This PR adds statistics to the DatumColumnarEncoder in repr/src/row/encoding2.rs, and maintains parity with the existing statistics.
There are two functional changes in this PR.
First, we add a new trait ColumnarStatisticsBuilder which allows us to incrementally collect statistics on a column. The only type we do this for at the moment is jsonb. Currently, to collect stats on a jsonb column we decode all of the data through protobuf and then collect stats. Incrementally collecting stats while we encode removes the need to do this.
Second, it makes stats collection eager. This isn't that big of a change because we always compute stats anyway; this just changes the functional flow a bit.
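As a rough illustration of what "incrementally collecting stats while we encode" means, the sketch below updates min/max bounds on every append, so that finish can return the column and its stats together without a second pass over the data. The IncrementalStats trait, MinMaxStats, StringStatsBuilder, and StringEncoder are all hypothetical and simplified to strings; the actual trait and encoders in this PR differ.

```rust
/// Hypothetical stats: lexicographic bounds plus a value count.
#[derive(Debug, Clone, PartialEq, Default)]
pub struct MinMaxStats {
    pub lower: Option<String>,
    pub upper: Option<String>,
    pub count: usize,
}

/// Hypothetical builder trait: fold values in one at a time.
pub trait IncrementalStats<T: ?Sized> {
    type Finished;
    fn include(&mut self, value: &T);
    fn finish(self) -> Self::Finished;
}

#[derive(Default)]
pub struct StringStatsBuilder {
    stats: MinMaxStats,
}

impl IncrementalStats<str> for StringStatsBuilder {
    type Finished = MinMaxStats;

    fn include(&mut self, value: &str) {
        self.stats.count += 1;
        // Tighten each bound only when the new value falls outside it.
        if self.stats.lower.as_deref().map_or(true, |lo| value < lo) {
            self.stats.lower = Some(value.to_owned());
        }
        if self.stats.upper.as_deref().map_or(true, |hi| value > hi) {
            self.stats.upper = Some(value.to_owned());
        }
    }

    fn finish(self) -> MinMaxStats {
        self.stats
    }
}

/// Hypothetical encoder that owns a stats builder alongside its buffer.
#[derive(Default)]
pub struct StringEncoder {
    values: Vec<String>,
    stats: StringStatsBuilder,
}

impl StringEncoder {
    /// Stats are updated eagerly, at the same moment the value is encoded.
    pub fn append(&mut self, value: &str) {
        self.stats.include(value);
        self.values.push(value.to_owned());
    }

    /// Returns the finished column together with its stats.
    pub fn finish(self) -> (Vec<String>, MinMaxStats) {
        (self.values, self.stats.finish())
    }
}
```

In this sketch the stats are always computed, whether or not the caller uses them, which mirrors the eager behavior described as the second change.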
Motivation
Progress towards https://github.com/MaterializeInc/database-issues/issues/7411
Tips for reviewer
This PR is split into separate commits which could be reviewed separately:
ColumnarStatsBuilder trait, and a few new structs
Statistics as an associated type on the Schema2 trait
RowColumnarEncoder::finish
jsonb
Checklist
$T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.