Support Apache Parquet format for cloud storage sinks #59819

Closed
amruss opened this issue Feb 4, 2021 · 2 comments · Fixed by #89451
Labels
C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) E-starter Might be suitable for a starter project for new employees or team members. T-cdc

Comments

amruss (Contributor) commented Feb 4, 2021

Describe the solution you'd like
We have a user request to support the Parquet output format for changefeeds into S3 buckets.

Additional context
Apache Parquet requires row groups (blocks) to be finalized before they can be written, so we will need to micro-batch changes before sending them through the changefeed. Ideally we would include an option that gives control over the size and timing of micro-batch transmissions.

Specifically, we may want to limit this to initial_scan = 'only', as we did with CSV.
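To make the micro-batching idea concrete, here is a minimal sketch, not CockroachDB's actual implementation: buffer incoming rows and close a batch (e.g. one Parquet row group) once either a row-count or an age threshold is reached. All names and thresholds below are hypothetical.

```python
import time

class MicroBatcher:
    """Buffers rows and flushes them as one finished batch (e.g. a
    Parquet row group) when a size or age threshold is reached."""

    def __init__(self, flush, max_rows=1000, max_age_secs=1.0, clock=time.monotonic):
        self.flush = flush              # callback that writes one finished batch
        self.max_rows = max_rows        # size threshold
        self.max_age_secs = max_age_secs  # timing threshold
        self.clock = clock
        self.buffer = []
        self.oldest = None              # timestamp of the first buffered row

    def add(self, row):
        if self.oldest is None:
            self.oldest = self.clock()
        self.buffer.append(row)
        if (len(self.buffer) >= self.max_rows
                or self.clock() - self.oldest >= self.max_age_secs):
            self.close_batch()

    def close_batch(self):
        """Emit whatever is buffered as one complete batch."""
        if self.buffer:
            self.flush(self.buffer)
            self.buffer = []
            self.oldest = None

batches = []
b = MicroBatcher(batches.append, max_rows=3, max_age_secs=60.0)
for i in range(7):
    b.add({"id": i})
b.close_batch()   # flush the final partial batch at end of scan
print([len(batch) for batch in batches])  # → [3, 3, 1]
```

The `max_rows`/`max_age_secs` knobs correspond to the "size/timing of micro-batch transmissions" control suggested above.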

Jira issue: CRDB-3227

Epic CRDB-15071

amruss added the C-enhancement label on Feb 4, 2021
shermanCRL added the E-starter label on Apr 14, 2021
lancel66 commented
Re-upping this: two large customers have asked for this feature, as they make heavy use of Parquet within their organizations.

blathers-crl bot commented Aug 12, 2022

cc @cockroachdb/cdc

biradarganesh25 added a commit that referenced this issue Oct 18, 2022
Parquet is a columnar storage format; its main advantages are compression and the ability to read only a subset of columns for a query:

https://parquet.apache.org/docs/overview/

This commit adds support for emitting changefeeds in parquet format. Currently the parquet format can be used only with initial_scan = 'only'.

Release note (enterprise change): CHANGEFEEDS now have limited support
for the parquet format.

Resolves: #59819
biradarganesh25 added further commits that referenced this issue on Oct 18, Oct 25, and Oct 27, 2022, with the same commit message.
craig bot pushed a commit that referenced this issue Oct 27, 2022
89451: cdc: add parquet support to CDC r=biradarganesh25 a=biradarganesh25

Parquet is a columnar storage format; its main advantages are compression and the ability to read only a subset of columns for a query:

https://parquet.apache.org/docs/overview/

This commit adds support for emitting changefeeds in parquet format.

Release note (enterprise change): CHANGEFEEDS now have support
for the parquet format.

Resolves: #59819

90769: metrics: add tsdb persistence to AggHistogram r=rafiss,aadityasondhi a=dhartunian

Previously, `AggHistogram` instances would not persist their quantiles to tsdb due to a missing interface implementation of `metrics.WindowedHistogram`. This PR adds a trivial implementation that delegates to the aggregate histogram instance within the struct.

This is relatively safe to do even though an `AggHistogram` could contain many children because we are only exporting a single set of aggregate quantiles per-`AggHistogram`. The children are only iterated over via the `PrometheusIterable` interface which is used by the prometheus exporter, but not by the metrics recorder.
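The delegation described above can be sketched as follows. This is an illustrative Python sketch of the pattern, not CockroachDB's Go code: the names `AggHistogram` and the windowed-quantile method mirror the PR description, but the implementation details are assumptions. Every sample recorded through a child also lands in the single aggregate, so the windowed interface can simply forward to the aggregate instead of iterating children.

```python
class Histogram:
    """Stand-in for an individual histogram; records raw values."""
    def __init__(self):
        self.values = []

    def record(self, v):
        self.values.append(v)

    def quantile(self, q):
        if not self.values:
            return 0.0
        s = sorted(self.values)
        idx = min(int(q * len(s)), len(s) - 1)
        return s[idx]

class AggHistogram:
    """Aggregates many child histograms. The children are only needed
    for per-child export; the windowed interface delegates to the
    single aggregate, exporting one set of quantiles per AggHistogram."""
    def __init__(self):
        self.agg = Histogram()
        self.children = []

    def child(self):
        c = Histogram()
        self.children.append(c)
        return c

    def record(self, child, v):
        child.record(v)
        self.agg.record(v)   # every sample also lands in the aggregate

    # Trivial "WindowedHistogram"-style implementation: delegate.
    def windowed_quantile(self, q):
        return self.agg.quantile(q)

h = AggHistogram()
a, b = h.child(), h.child()
for v in (10, 20, 30):
    h.record(a, v)
h.record(b, 40)
print(h.windowed_quantile(0.5))  # → 30, computed over all children's samples
```

This mirrors why the change is cheap even with many children: the delegate reads one pre-merged aggregate rather than walking the child list.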

Release note (bug fix, ops change): Previously, certain aggregate histograms would appear in `_status/vars` but not be available for graphing in the DB Console. These are now made available. They include changefeed-related histograms, and row-level-TTL histograms.

Epic: None

90777: sqlliveness: get rid of sqlinstance.Provider  r=andreimatei a=andreimatei

This interface, together with its implementation, was the worst. It combined an AddressResolver with allocating an instance ID, which are two things that have very little to do with one another. The implementation also did a third thing: it remembered the allocated instance ID, seemingly giving this object state, although nobody benefited from that. The implementation also hid clues about how the instance ID was created, namely with the help of a session (i.e. a sqlliveness.Instance). This connection between the sqlliveness package and the sqlinstance package was quite hidden (for example, it was not apparent in this Provider interface), and it deserves to be more visible.

This patch gets rid of both the interface and the implementation, in favor of breaking it down into its two main parts. The SQLServer now has an instancestorage.Reader and an instancestorage.Storage; these are concrete types, so there's no more interface hell. These fields are nil for single-tenant servers instead of being dummy interface implementations, and that seems to be fine, so the need for dummy interfaces is reduced.

Release note: None
Epic: None

Co-authored-by: Ganeshprasad Rajashekhar Biradar <ganeshprasad.biradar@cockroachlabs.com>
Co-authored-by: David Hartunian <davidh@cockroachlabs.com>
Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
craig bot closed this as completed in f29c829 on Oct 27, 2022