cdc: Metamorphic roachtests #111066

Open
miretskiy opened this issue Sep 21, 2023 · 1 comment
Labels
A-cdc Change Data Capture C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-cdc

Comments

miretskiy commented Sep 21, 2023

Implement metamorphic roachtests to increase confidence in feature coverage.

Some of the ideas are:

  • Run a changefeed export with different formats: verify that the data is the same.
  • Run a CDC export and a backup: verify that the data is the same.
  • Ensure that different export formats produce the same output.
  • Ensure that regular changefeeds emit the same data in CSV and JSON formats.
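
A format-equivalence check along these lines could be sketched as follows. This is only an illustration, not the project's test code: the function names are hypothetical, the JSON input is assumed to be newline-delimited objects that carry the row under an `after` key, and the CSV input is assumed to have no header and a known column order. Values are compared as strings, so the sketch is only meaningful for integer and string columns.

```python
import csv
import io
import json


def rows_from_csv(text, columns):
    """Parse header-less CSV text into a set of rows (duplicates collapse)."""
    reader = csv.reader(io.StringIO(text))
    return {tuple(zip(columns, row)) for row in reader}


def rows_from_json(text, columns):
    """Parse newline-delimited JSON; each line is ASSUMED to be {"after": {...}}."""
    rows = set()
    for line in text.splitlines():
        if not line.strip():
            continue
        after = json.loads(line)["after"]
        # CSV carries no types, so compare everything as strings.
        rows.add(tuple((c, str(after[c])) for c in columns))
    return rows


def same_data(csv_text, json_text, columns):
    """Return True when both outputs describe the same set of rows."""
    return rows_from_csv(csv_text, columns) == rows_from_json(json_text, columns)
```

Comparing sets rather than lists ignores both ordering and repeated rows, which is the property a comparison needs when the two outputs may contain duplicates.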

Jira issue: CRDB-31748

@miretskiy miretskiy added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Sep 21, 2023
blathers-crl bot commented Nov 3, 2023

cc @cockroachdb/cdc

@blathers-crl blathers-crl bot added the A-cdc Change Data Capture label Nov 3, 2023
wenyihu6 added a commit to wenyihu6/cockroach that referenced this issue Nov 16, 2023
Prior to this commit, roachtest/cdc relied solely on periodic checks of changefeed status and latency. This patch takes
the first step toward introducing a metamorphic testing framework.

Because there is not yet a way to evaluate the correctness of the output files directly, this new approach runs two
changefeeds with different configurations, retrieves the output files they produce in the roachtest, and compares their
data outputs.

Because the changefeed output may contain duplicates, the test follows these steps:
1. create two empty tables with the same schema as the workload tables
2. convert parquet data to datums
3. execute `UPSERT` statements on the tables with the datums to eliminate duplicates
4. confirm the identical content of the two tables by checking their fingerprints
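
The deduplicate-then-compare idea in the steps above can be sketched in a few lines. This is a stand-in under stated assumptions, not the roachtest implementation: an in-memory SQLite database replaces the cluster, `INSERT OR REPLACE` plays the role of `UPSERT`, the rows are assumed to be already decoded from parquet, the two-column schema is made up, and the fingerprint is a plain SHA-256 over the ordered rows rather than the database's own fingerprinting.

```python
import hashlib
import sqlite3


def load_deduplicated(conn, table, rows):
    """Create an empty table and upsert rows so repeated keys collapse."""
    # Hypothetical schema; the real test copies the workload table's schema.
    conn.execute(f"CREATE TABLE {table} (id INTEGER PRIMARY KEY, val TEXT)")
    conn.executemany(
        f"INSERT OR REPLACE INTO {table} (id, val) VALUES (?, ?)", rows
    )


def fingerprint(conn, table):
    """Hash the table contents in primary-key order (illustrative only)."""
    digest = hashlib.sha256()
    for row in conn.execute(f"SELECT id, val FROM {table} ORDER BY id"):
        digest.update(repr(row).encode())
    return digest.hexdigest()


conn = sqlite3.connect(":memory:")
# Two feeds carrying the same data, with different ordering and duplicates.
feed_a = [(1, "a"), (2, "b"), (2, "b"), (3, "c")]
feed_b = [(3, "c"), (1, "a"), (1, "a"), (2, "b")]
load_deduplicated(conn, "out_a", feed_a)
load_deduplicated(conn, "out_b", feed_b)
match = fingerprint(conn, "out_a") == fingerprint(conn, "out_b")
```

Loading through an upsert means the comparison does not need to know how many times each row was emitted, only what the final table content is.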

Limitations with this approach include:
- This solution currently works only for parquet files. (A round-trip conversion between the parquet data format and
   datums is guaranteed. Other data formats are more complicated.)
- INSERT is the only operation involved.
- Because the output files are large, the test randomly selects a single target table for the changefeeds.
- Currently, the changefeeds use the same configurations. However, we plan to change this soon, following a discussion
   to determine the specific configurations that will be randomized.

Part of: cockroachdb#111066

Release note: None
wenyihu6 added a commit to wenyihu6/cockroach that referenced this issue Feb 15, 2024
craig bot pushed a commit that referenced this issue Feb 16, 2024
114504: roachtest/tests: introduce metamorphic testing to cdc r=jayshrivastava a=wenyihu6


Co-authored-by: Wenyi Hu <wenyi@cockroachlabs.com>
wenyihu6 added a commit to wenyihu6/cockroach that referenced this issue Feb 21, 2024