Support Apache Parquet format for cloud storage sinks #59819

Closed
amruss opened this issue Feb 4, 2021 · 2 comments · Fixed by #89451
Labels
C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) E-starter Might be suitable for a starter project for new employees or team members. T-cdc

Comments

amruss (Contributor) commented Feb 4, 2021

Describe the solution you'd like
We have a user request to support the Parquet output format for changefeeds into S3 buckets.

Additional context
Apache Parquet requires row groups (blocks) to be finalized before they can be written, so we will need to micro-batch changes before sending them through the changefeed. Ideally we would include an option that gives control over the size and timing of micro-batch transmissions.

Specifically, we may want to limit this to initial_scan = 'only', as we did with CSV.
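To make the micro-batching idea concrete, here is a minimal sketch, not CockroachDB's actual implementation: buffer incoming rows and close a batch (e.g. one Parquet row group) once either a row-count or an age threshold is reached. All names and thresholds below are hypothetical.

```python
import time

class MicroBatcher:
    """Buffers rows and flushes them as one finished batch (e.g. a
    Parquet row group) when a size or age threshold is reached."""

    def __init__(self, flush, max_rows=1000, max_age_secs=1.0, clock=time.monotonic):
        self.flush = flush              # callback that writes one finished batch
        self.max_rows = max_rows        # size threshold
        self.max_age_secs = max_age_secs  # timing threshold
        self.clock = clock
        self.buffer = []
        self.oldest = None              # timestamp of the first buffered row

    def add(self, row):
        if self.oldest is None:
            self.oldest = self.clock()
        self.buffer.append(row)
        if (len(self.buffer) >= self.max_rows
                or self.clock() - self.oldest >= self.max_age_secs):
            self.close_batch()

    def close_batch(self):
        """Emit whatever is buffered as one complete batch."""
        if self.buffer:
            self.flush(self.buffer)
            self.buffer = []
            self.oldest = None

batches = []
b = MicroBatcher(batches.append, max_rows=3, max_age_secs=60.0)
for i in range(7):
    b.add({"id": i})
b.close_batch()   # flush the final partial batch at end of scan
print([len(batch) for batch in batches])  # → [3, 3, 1]
```

The `max_rows`/`max_age_secs` knobs correspond to the "size/timing of micro-batch transmissions" control suggested above.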

Jira issue: CRDB-3227

Epic CRDB-15071

amruss added the C-enhancement label on Feb 4, 2021
shermanCRL added the E-starter label on Apr 14, 2021
lancel66 commented
Re-upping this: two large customers have asked for this feature, as they make heavy use of Parquet within their organizations.

blathers-crl bot commented Aug 12, 2022

cc @cockroachdb/cdc

biradarganesh25 added a commit that referenced this issue Oct 18, 2022
Parquet is a columnar storage format; its main advantages are compression and the ability to read only a subset of columns for a query:

https://parquet.apache.org/docs/overview/

This commit adds support for emitting changefeeds in parquet format. Currently the parquet format can be used only with initial_scan = 'only'.

Release note (enterprise change): CHANGEFEEDS now have limited support
for the parquet format.

Resolves: #59819
biradarganesh25 added further commits that referenced this issue on Oct 18, Oct 25, and Oct 27, 2022, with the same commit message.
craig bot pushed a commit that referenced this issue Oct 27, 2022
89451: cdc: add parquet support to CDC r=biradarganesh25 a=biradarganesh25

Parquet is a columnar storage format; its main advantages are compression and the ability to read only a subset of columns for a query:

https://parquet.apache.org/docs/overview/

This commit adds support for emitting changefeeds in parquet format.

Release note (enterprise change): CHANGEFEEDS now have support
for the parquet format.

Resolves: #59819

90769: metrics: add tsdb persistence to AggHistogram r=rafiss,aadityasondhi a=dhartunian

Previously, `AggHistogram` instances would not persist their quantiles to tsdb due to a missing interface implementation of `metrics.WindowedHistogram`. This PR adds a trivial implementation that delegates to the aggregate histogram instance within the struct.

This is relatively safe to do even though an `AggHistogram` could contain many children because we are only exporting a single set of aggregate quantiles per-`AggHistogram`. The children are only iterated over via the `PrometheusIterable` interface which is used by the prometheus exporter, but not by the metrics recorder.
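The delegation described above can be sketched as follows. This is an illustrative Python sketch of the pattern, not CockroachDB's Go code: the names `AggHistogram` and the windowed-quantile method mirror the PR description, but the implementation details are assumptions. Every sample recorded through a child also lands in the single aggregate, so the windowed interface can simply forward to the aggregate instead of iterating children.

```python
class Histogram:
    """Stand-in for an individual histogram; records raw values."""
    def __init__(self):
        self.values = []

    def record(self, v):
        self.values.append(v)

    def quantile(self, q):
        if not self.values:
            return 0.0
        s = sorted(self.values)
        idx = min(int(q * len(s)), len(s) - 1)
        return s[idx]

class AggHistogram:
    """Aggregates many child histograms. The children are only needed
    for per-child export; the windowed interface delegates to the
    single aggregate, exporting one set of quantiles per AggHistogram."""
    def __init__(self):
        self.agg = Histogram()
        self.children = []

    def child(self):
        c = Histogram()
        self.children.append(c)
        return c

    def record(self, child, v):
        child.record(v)
        self.agg.record(v)   # every sample also lands in the aggregate

    # Trivial "WindowedHistogram"-style implementation: delegate.
    def windowed_quantile(self, q):
        return self.agg.quantile(q)

h = AggHistogram()
a, b = h.child(), h.child()
for v in (10, 20, 30):
    h.record(a, v)
h.record(b, 40)
print(h.windowed_quantile(0.5))  # → 30, computed over all children's samples
```

This mirrors why the change is cheap even with many children: the delegate reads one pre-merged aggregate rather than walking the child list.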

Release note (bug fix, ops change): Previously, certain aggregate histograms would appear in `_status/vars` but not be available for graphing in the DB Console. These are now made available. They include changefeed-related histograms, and row-level-TTL histograms.

Epic: None

90777: sqlliveness: get rid of sqlinstance.Provider  r=andreimatei a=andreimatei

This interface, together with its implementation, was the worst. It combined an AddressResolver with allocating an instance ID, which are two things that have very little to do with one another. The implementation also did a third thing: it remembered the allocated instance ID, seemingly giving this object state, although nobody benefited from that. The implementation also hid clues about how the instance ID was created, namely with the help of a session (i.e. a sqlliveness.Instance). This connection between the sqlliveness package and the sqlinstance package was quite hidden (for example, it was not apparent in this Provider interface), and it deserves to be more visible.

This patch gets rid of both the interface and the implementation, in favor of breaking it down into its two main parts. The SQLServer now has an instancestorage.Reader and an instancestorage.Storage; these are concrete types, so there's no more interface hell. These fields are nil for single-tenant servers instead of being dummy interface implementations, and that seems to be fine, so the need for dummy interfaces is reduced.

Release note: None
Epic: None

Co-authored-by: Ganeshprasad Rajashekhar Biradar <ganeshprasad.biradar@cockroachlabs.com>
Co-authored-by: David Hartunian <davidh@cockroachlabs.com>
Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
craig bot closed this as completed in f29c829 on Oct 27, 2022