Add column.include.list for debezium column selection support #20055
Affected Connector Report
Connector | Version | Changelog | Publish |
---|---|---|---|
source-alloydb | 1.0.17 | ✅ | ✅ |
source-alloydb-strict-encrypt | 1.0.17 | ✅ | 🔵 (ignored) |
source-bigquery | 0.2.3 | ✅ | ✅ |
source-clickhouse | 0.1.14 | ✅ | ✅ |
source-clickhouse-strict-encrypt | 0.1.14 | ✅ | 🔵 (ignored) |
source-cockroachdb | 0.1.18 | ✅ | ✅ |
source-cockroachdb-strict-encrypt | 0.1.18 | ✅ | 🔵 (ignored) |
source-db2 | 0.1.16 | ✅ | ✅ |
source-db2-strict-encrypt | 0.1.16 | ✅ | 🔵 (ignored) |
source-dynamodb | 0.1.0 | ✅ | ✅ |
source-e2e-test | 2.1.3 | ✅ | ✅ |
source-e2e-test-cloud | 2.1.1 | ⚠ (doc not found) | ⚠ (not in seed) |
source-elasticsearch | 0.1.1 | ✅ | ✅ |
source-jdbc | 0.3.5 | ⚠ (doc not found) | ⚠ (not in seed) |
source-kafka | 0.2.3 | ✅ | ✅ |
source-mongodb-strict-encrypt | 0.1.19 | ⚠ (doc not found) | 🔵 (ignored) |
source-mongodb-v2 | 0.1.19 | ✅ | ✅ |
source-mssql | 0.4.25 | ✅ | ✅ |
source-mssql-strict-encrypt | 0.4.25 | ✅ | 🔵 (ignored) |
source-mysql | 1.0.15 | ✅ | ✅ |
source-mysql-strict-encrypt | 1.0.15 | ✅ | 🔵 (ignored) |
source-oracle | 0.3.21 | ✅ | ✅ |
source-oracle-strict-encrypt | 0.3.21 | ✅ | 🔵 (ignored) |
source-postgres | 1.0.31 | ✅ | ✅ |
source-postgres-strict-encrypt | 1.0.31 | ✅ | 🔵 (ignored) |
source-redshift | 0.3.15 | ✅ | ✅ |
source-scaffold-java-jdbc | 0.1.0 | ⚠ (doc not found) | ⚠ (not in seed) |
source-sftp | 0.1.2 | ✅ | ✅ |
source-snowflake | 0.1.26 | ✅ | ✅ |
source-tidb | 0.2.1 | ✅ | ✅ |
- See "Actionable Items" below for how to resolve warnings and errors.
✅ Destinations (0)
Connector | Version | Changelog | Publish |
---|
- See "Actionable Items" below for how to resolve warnings and errors.
✅ Other Modules (0)
Actionable Items
Category | Status | Actionable Item |
---|---|---|
Version | ❌ mismatch | The version of the connector is different from its normal variant. Please bump the version of the connector. |
Version | ⚠ doc not found | The connector does not seem to have a documentation file. This can be normal (e.g. a basic connector like source-jdbc is not published or documented). Please double-check to make sure that it is not a bug. |
Changelog | ⚠ doc not found | The connector does not seem to have a documentation file. This can be normal (e.g. a basic connector like source-jdbc is not published or documented). Please double-check to make sure that it is not a bug. |
Changelog | ❌ changelog missing | There is no changelog for the current version of the connector. If you are the author of the current version, please add a changelog. |
Publish | ⚠ not in seed | The connector is not in the seed file (e.g. source_definitions.yaml), so its publication status cannot be checked. This can be normal (e.g. some connectors are cloud-specific, and only listed in the cloud seed file). Please double-check to make sure that it is not a bug. |
Publish | ❌ diff seed version | The connector exists in the seed file, but the latest version is not listed there. This usually means that the latest version is not published. Please use the /publish command to publish the latest version. |
/test connector=connectors/source-postgres
Build Failed
/test connector=connectors/source-mssql
Build Passed
/test connector=connectors/source-mysql
…column names, including regex control characters (comma, dot, asterisk etc.)
/test connector=connectors/source-postgres
Build Passed
/test connector=connectors/source-mysql
…sources-in-debezium' into 19701-use-column-filter-for-cdc-sources-in-debezium
  return catalog.getStreams().stream()
      .filter(s -> s.getSyncMode() == SyncMode.INCREMENTAL)
      .map(ConfiguredAirbyteStream::getStream)
      .map(stream -> stream.getNamespace() + "." + stream.getName())
      // debezium needs commas escaped to split properly
-     .map(x -> StringUtils.escape(x, new char[] {','}, "\\,"))
+     .map(x -> StringUtils.escape(Pattern.quote(x), ",".toCharArray(), "\\,"))
Note: Pattern.quote() surrounds the string with \Q … \E, which makes it literal for the purpose of regex matching. Regex matching is how Debezium collects all tables and columns for sync.
This is painful to read but necessary, as names may include any Unicode character, including all regex control characters.
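To illustrate the note above, here is a minimal sketch of what quoting does to a name that contains regex control characters. It uses only the JDK: `String.replace` stands in for the `StringUtils.escape` call in the diff, and the class name `QuoteDemo` and the sample names are hypothetical.

```java
import java.util.regex.Pattern;

class QuoteDemo {

  // Wrap the name in \Q...\E so it matches literally, then escape commas
  // so that a comma-separated include list can still be split.
  // String.replace is a plain-JDK stand-in for StringUtils.escape.
  static String quoteForIncludeList(String name) {
    return Pattern.quote(name).replace(",", "\\,");
  }

  public static void main(String[] args) {
    String name = "public.t*"; // '.' and '*' are regex control characters

    // Unquoted, the name behaves as a pattern and matches other strings.
    System.out.println(Pattern.matches(name, "publicXttt")); // true

    // Quoted, it matches only itself.
    String quoted = quoteForIncludeList(name);
    System.out.println(quoted); // \Qpublic.t*\E
    System.out.println(Pattern.matches(quoted, "publicXttt")); // false
    System.out.println(Pattern.matches(quoted, name)); // true

    // A comma inside a name is escaped after quoting.
    System.out.println(quoteForIncludeList("public.a,b")); // \Qpublic.a\,b\E
  }
}
```

The order matters in this sketch: quoting first and escaping commas second mirrors the nesting of the calls in the changed line.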
/test connector=connectors/source-postgres
Build Passed
/test connector=connectors/source-mysql
Build Passed
The only character that won't work in the name of a table or a column is a backslash.
Can we add a test where a table has multiple columns but we only pass a subset of those columns in the catalog, and assert that the data read contains only the columns we passed? This test should cover both CDC (snapshot + incremental) and non-CDC.
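The core assertion of such a test could look roughly like the sketch below. It is not the connector's test code: records are modeled as plain maps, and `ColumnSubsetCheck` and `onlySelectedColumns` are hypothetical names. A real CDC test would also have to account for any metadata fields the connector adds to each record.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

class ColumnSubsetCheck {

  // Returns true when every record exposes exactly the selected columns.
  static boolean onlySelectedColumns(List<Map<String, Object>> records, Set<String> selected) {
    return records.stream().allMatch(r -> r.keySet().equals(selected));
  }

  public static void main(String[] args) {
    // The table has id, name and email, but the catalog selects only id and name.
    Set<String> selected = Set.of("id", "name");

    List<Map<String, Object>> filtered = List.of(Map.of("id", 1, "name", "a"));
    List<Map<String, Object>> unfiltered = List.of(Map.of("id", 1, "name", "a", "email", "x"));

    System.out.println(onlySelectedColumns(filtered, selected)); // true
    System.out.println(onlySelectedColumns(unfiltered, selected)); // false
  }
}
```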
- final String tableWhitelist = getTableWhitelist(catalog);
- props.setProperty("table.include.list", tableWhitelist);
+ props.setProperty("table.include.list", getTableIncludelist(catalog));
Just curious: does scoping the table and column include lists to only the streams/tables we are interested in lead to any performance gains?
It should, since data that is out of scope will be omitted by Debezium, hopefully on the DB side.
Because this is a new feature for Airbyte as a whole and today we sync every column, we don't have a point of comparison to know just how much of a gain this is.
/test connector=connectors/source-postgres
Build Passed
…sources-in-debezium' into 19701-use-column-filter-for-cdc-sources-in-debezium
/test connector=connectors/source-mysql
Build Passed
What
This change adds column selection to CDC sync in mysql, mssql and postgres sources.
How
By configuring column.include.list during CDC sync startup to match whatever is in the catalog, Debezium will only sync the requested columns of each table.
To support all allowed characters, I had to get all the escaping right.
Case in point: Postgres allows any character in the name of a database, schema, table or column. This includes regex control characters, which can interfere with Debezium's matching of tables and columns.
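As a rough sketch of the approach described above, the following builds both include lists from a selection of columns and sets them as properties. It is an illustration under stated assumptions, not the code in this PR: the catalog is modeled as a plain map from "schema.table" to selected column names, `String.replace` stands in for `StringUtils.escape`, and `IncludeListSketch` and its methods are hypothetical. Only the property names `table.include.list` and `column.include.list` come from the change itself.

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class IncludeListSketch {

  // Quote a name for literal regex matching, then escape commas
  // (plain-JDK stand-in for the StringUtils.escape call in the diff).
  static String quote(String name) {
    return Pattern.quote(name).replace(",", "\\,");
  }

  // One quoted entry per selected table.
  static String tableIncludeList(Map<String, List<String>> selected) {
    return selected.keySet().stream()
        .map(IncludeListSketch::quote)
        .collect(Collectors.joining(","));
  }

  // One quoted "schema.table.column" entry per selected column.
  static String columnIncludeList(Map<String, List<String>> selected) {
    return selected.entrySet().stream()
        .flatMap(e -> e.getValue().stream().map(col -> e.getKey() + "." + col))
        .map(IncludeListSketch::quote)
        .collect(Collectors.joining(","));
  }

  static Properties debeziumProps(Map<String, List<String>> selected) {
    Properties props = new Properties();
    props.setProperty("table.include.list", tableIncludeList(selected));
    props.setProperty("column.include.list", columnIncludeList(selected));
    return props;
  }

  public static void main(String[] args) {
    // Hypothetical selection: two columns of one table, one with a dot in its name.
    Map<String, List<String>> selected = Map.of("public.users", List.of("id", "na.me"));
    debeziumProps(selected).forEach((k, v) -> System.out.println(k + "=" + v));
  }
}
```

For the selection above, the column list comes out as `\Qpublic.users.id\E,\Qpublic.users.na.me\E`, so the dot inside `na.me` is matched literally rather than as a wildcard.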
🚨 User Impact 🚨
Anything that works today should work the same going forward.
When a future UI supports selecting individual table columns in the catalog, CDC sync should support it out of the box, just as standard sync does.