Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

persist: Stats for the new Columnar encoders #27857

Merged
merged 8 commits into from
Jun 27, 2024

Conversation

ParkMyCar
Copy link
Member

@ParkMyCar ParkMyCar commented Jun 24, 2024

This PR adds statistics to the DatumColumnarEncoder in repr/src/row/encoding2.rs, and maintains parity with the existing statistics.

There is two functional changes in this PR.

First is we add a new trait ColumnarStatisticsBuilder which allows us to incrementally collect statistics on a column, the only type we do this for at the moment is jsonb. Currently to collect stats on a jsonb column we decode all of the data through protobuf then collect stats. Incrementally collecting stats while we encode prevents the need to do this.

Second, it makes stats collection eager. This isn't that big of a change because we do always compute stats, this just changes the functional flow a bit.

Motivation

Progress towards https://github.com/MaterializeInc/database-issues/issues/7411

Tips for reviewer

This PR is split into separate commits which could be reviewed separately:

  1. Introduce the new ColumnarStatsBuilder trait, and a few new structs
  2. Adds Statistics as an associated type on the Schema2 trait
  3. Compute and return stats with RowColumnarEncoder::finish
  4. Return (lower, upper) bounds for supported stat types, add test
  5. Benchmarks for jsonb

Checklist

@ParkMyCar ParkMyCar force-pushed the persist/stats2 branch 2 times, most recently from 4e8367f to 2e21e03 Compare June 26, 2024 15:36
@ParkMyCar ParkMyCar marked this pull request as ready for review June 26, 2024 15:36
@ParkMyCar ParkMyCar requested review from a team as code owners June 26, 2024 15:36
Copy link
Contributor

@bkirwi bkirwi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the only type we do this for at the moment is jsonb.

Do we have plans to do this for other types?

Right now I suspect this trait is not pulling its weight; it's trying to support two mostly-disjoint use cases and is ~never used from a generic context. (For example, nearly all the include impls seem to be dead code?) If this is just an intermediate state, all good! But otherwise I suspect it would be more straightforward / concise to replace the trait with a set of freestanding functions and some JSON-specific business.

Second, it makes stats collection eager.

Appreciate this!

@@ -305,26 +305,31 @@ pub trait ColumnDecoder<T> {
pub trait ColumnEncoder<T> {
/// Type of column that this encoder returns when finalized.
type FinishedColumn: arrow::array::Array + Debug + 'static;
/// Type of statistics this encoder returns when finalized.
type FinishedStats: DynStats + 'static;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All callers of finish immediately convert the result of this to ColumnarStats. (Which is great - love that new type.) Could we make finish return ColumnarStats directly and skip the intermediate / associated type?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good thought! I tried to do this originally but what throws a small wrench in it is OptionStats. It's nice to have the implementation for OptionStats require the the inner type impl Into<ColumnStatsKind> to prevent nested OptionStats. Not an impossible problem to solve but a little tricky

/// We collect stats for all primitive types in exactly the same way. This
/// macro de-duplicates some of that logic.
///
/// Note: If at any point someone finds this macro too complext, they should
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Note: If at any point someone finds this macro too complext, they should
/// Note: If at any point someone finds this macro too complex, they should

(I don't find it that complex, but I think it's a bit of a smell that we're generating all these include impls etc. that are never called...)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Totally fair, at the moment I wanted to maintain parity with the other trait impls, but can totally revisit this

src/repr/src/stats2.rs Outdated Show resolved Hide resolved
@ParkMyCar
Copy link
Member Author

ParkMyCar commented Jun 27, 2024

Do we have plans to do this for other types?

I was thinking of incrementally collecting stats for all types that we persist some encoded version of, e.g. the FixedSizeCodec types. I didn't include this here to make the change smaller.

Right now I suspect this trait is not pulling its weight; it's trying to support two mostly-disjoint use cases and is ~never used from a generic context. (For example, nearly all the include impls seem to be dead code?) If this is just an intermediate state, all good! But otherwise I suspect it would be more straightforward / concise to replace the trait with a set of freestanding functions and some JSON-specific business.

Good point! I'll keep the trait for now, but see how it evolves over the next few PRs or so

@ParkMyCar ParkMyCar merged commit cf93d01 into MaterializeInc:main Jun 27, 2024
192 of 194 checks passed
@ParkMyCar ParkMyCar deleted the persist/stats2 branch October 16, 2024 13:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants