Support Apache Parquet format for cloud storage sinks #59819
Labels
C-enhancement
Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
E-starter
Might be suitable for a starter project for new employees or team members.
T-cdc
amruss added the C-enhancement label on Feb 4, 2021.
shermanCRL added the E-starter label on Apr 14, 2021.
Re-upping this: two large customers have asked for this feature, as they make heavy use of it within their organizations.
cc @cockroachdb/cdc
biradarganesh25 added a commit that referenced this issue on Oct 18, 2022:

Parquet is a columnar storage format whose main advantages are compression and the ability to read only a subset of columns for a query: https://parquet.apache.org/docs/overview/ This commit adds support for emitting changefeeds in parquet format. Currently, the parquet format can only be used with initial_scan = 'only'. Release note (enterprise change): CHANGEFEEDS now have limited support for the parquet format. Resolves: #59819
biradarganesh25 added six further commits that referenced this issue (on Oct 18, Oct 25, and Oct 27, 2022), each with the same commit message as above.
craig bot pushed a commit that referenced this issue on Oct 27, 2022:

89451: cdc: add parquet support to CDC r=biradarganesh25 a=biradarganesh25

Parquet is a columnar storage format whose main advantages are compression and the ability to read only a subset of columns for a query: https://parquet.apache.org/docs/overview/ This commit adds support for emitting changefeeds in parquet format. Release note (enterprise change): CHANGEFEEDS now have support for the parquet format. Resolves: #59819

90769: metrics: add tsdb persistence to AggHistogram r=rafiss,aadityasondhi a=dhartunian

Previously, AggHistogram instances would not persist their quantiles to tsdb due to a missing interface implementation of metrics.WindowedHistogram. This PR adds a trivial implementation that delegates to the aggregate histogram instance within the struct. This is relatively safe to do even though an AggHistogram could contain many children, because we only export a single set of aggregate quantiles per AggHistogram. The children are only iterated over via the PrometheusIterable interface, which is used by the prometheus exporter but not by the metrics recorder. Release note (bug fix, ops change): Previously, certain aggregate histograms would appear in _status/vars but not be available for graphing in the DB Console. These are now made available. They include changefeed-related histograms and row-level-TTL histograms. Epic: None

90777: sqlliveness: get rid of sqlinstance.Provider r=andreimatei a=andreimatei

This interface, together with its implementation, combined an AddressResolver with allocating an instance ID, which are two things that have very little to do with one another. The implementation also did a third thing: it remembered the allocated instance ID, seemingly giving this object state, although nobody benefited from that. The implementation was also hiding clues about how the instance ID was created, namely with the help of a session (i.e. a sqlliveness.Instance). This connection between the sqlliveness package and the sqlinstance package was quite hidden (for example, it was not apparent in this Provider interface), and it deserves to be more visible. This patch gets rid of both the interface and the implementation, in favor of breaking out the two main parts. The SQLServer now has an instancestorage.Reader and an instancestorage.Storage; these are concrete types, so there's no more interface hell. These fields are nil for single-tenant deployments instead of dummy interface implementations, and that seems to be fine, so the need for dummy interfaces was reduced. Release note: None Epic: None

Co-authored-by: Ganeshprasad Rajashekhar Biradar <ganeshprasad.biradar@cockroachlabs.com> Co-authored-by: David Hartunian <davidh@cockroachlabs.com> Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
Describe the solution you'd like
We have a user request to support the Parquet output format for changefeeds into S3 buckets.
Additional context
Apache Parquet requires blocks (row groups) to be complete before they can be written out, so we will need to micro-batch changes before sending them through the changefeed. Ideally we would include an option giving control over the size and timing of micro-batch transmissions.
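To illustrate the kind of micro-batching described above, here is a minimal sketch, not CockroachDB's actual implementation: rows are buffered and flushed as a batch once either a row-count limit or a time limit is hit, and each flushed batch would then be written as one Parquet block. The class and parameter names are illustrative assumptions.

```python
import time

class MicroBatcher:
    """Buffer changefeed rows; flush when max_rows or max_delay_s is hit.

    Illustrative sketch only; names and structure are assumptions,
    not CockroachDB's actual changefeed internals.
    """

    def __init__(self, max_rows, max_delay_s, flush):
        self.max_rows = max_rows        # size knob for a micro-batch
        self.max_delay_s = max_delay_s  # timing knob for a micro-batch
        self.flush = flush              # callback that would write one Parquet block
        self.buf = []
        self.first_ts = None

    def add(self, row):
        if not self.buf:
            # Start the delay clock when the first row of a batch arrives.
            self.first_ts = time.monotonic()
        self.buf.append(row)
        if (len(self.buf) >= self.max_rows
                or time.monotonic() - self.first_ts >= self.max_delay_s):
            self.flush_now()

    def flush_now(self):
        # Emit the buffered rows as one batch and reset the buffer.
        if self.buf:
            self.flush(list(self.buf))
            self.buf.clear()

batches = []
mb = MicroBatcher(max_rows=3, max_delay_s=1.0, flush=batches.append)
for row in ["r1", "r2", "r3", "r4"]:
    mb.add(row)
mb.flush_now()  # drain the tail batch before closing the file
# batches is now [["r1", "r2", "r3"], ["r4"]]
```

A real implementation would also need to flush on resolved timestamps and on shutdown; the explicit flush_now call here covers the drain-on-close case.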
Specifically, we may want to limit this to initial_scan = 'only', as we did with CSV.
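Under that limitation, creating such a changefeed would look roughly like the following, based on CockroachDB's documented CREATE CHANGEFEED syntax; the table name and bucket URL are placeholders:

```sql
CREATE CHANGEFEED FOR TABLE orders
  INTO 's3://my-bucket/changefeed-output?AUTH=implicit'
  WITH format = parquet, initial_scan = 'only';
```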
Jira issue: CRDB-3227
Epic: CRDB-15071