implement 1s1t DATs for destination-bigquery #27852

Merged: 54 commits, Jul 13, 2023

@edgao edgao commented Jun 29, 2023

What

Closes #27778. This PR adds test cases for all 4 sync modes (full refresh overwrite / append; incremental append / dedup) and fixes a few bugs that these tests uncovered.

Test failures look like this:
[image: test failure output]

Test cases that I chose not to import from DATs:

  • a few that were redundant (testSecondSync, various incremental/normalization-related things)
  • stuff that's purely about the non-write parts of the protocol (spec, check, etc)
  • testSyncVeryBigRecords - this literally only runs for redshift, and we want to change that behavior anyway. We can implement this anew when the time comes
  • custom dbt transformation stuff
  • testStressPerformance - I think this is actually redundant with the new performance test harness
  • some old v1 data types stuff that still exists in DATs (infinity/nan handling)

How

Define a new BaseTypingDedupingTest class. It's kind of like DestinationAcceptanceTest, but more specialized to 1s1t. I ended up doing a similar Base -> Abstract -> concrete inheritance chain; it seems like that's the best we can do with JUnit. At the very least, the destination-specific classes are relatively simple.... :(
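
A rough sketch of what a Base -> Abstract -> concrete chain like this can look like, in plain Java without JUnit. Every name here (the `*Sketch` classes, `getImageName`, `generateConfig`, `getConfigPath`, the image tag, and the config paths) is a hypothetical stand-in, not the actual API of BaseTypingDedupingTest.

```java
// Base layer: owns the shared flow and declares destination-specific hooks.
abstract class BaseTypingDedupingSketch {
  protected abstract String getImageName();   // hypothetical hook
  protected abstract String generateConfig(); // hypothetical hook

  // Shared logic that every destination inherits unchanged.
  final String describeRun() {
    return getImageName() + " with " + generateConfig();
  }
}

// Abstract layer: everything common to one destination family.
abstract class AbstractBigQuerySketch extends BaseTypingDedupingSketch {
  protected abstract String getConfigPath(); // the only thing that varies per loading mode

  @Override
  protected String getImageName() {
    return "airbyte/destination-bigquery:dev"; // assumed image tag
  }

  @Override
  protected String generateConfig() {
    return "config from " + getConfigPath();
  }
}

// Concrete layer: one small class per loading mode.
final class BigQueryGcsSketch extends AbstractBigQuerySketch {
  @Override
  protected String getConfigPath() {
    return "secrets/hypothetical-gcs-config.json"; // made-up path
  }
}

final class BigQueryStandardInsertsSketch extends AbstractBigQuerySketch {
  @Override
  protected String getConfigPath() {
    return "secrets/hypothetical-standard-config.json"; // made-up path
  }
}
```

The point of the middle layer is that each concrete class shrinks to a single override, which matches the "relatively simple" destination-specific classes described above.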

There are also new CI secrets in GSM. These are just duplicates of existing secrets, but with the 1s1t config option enabled.

Bugfixes:

  • inserting new records from raw to final should only check _ab_cdc_deleted_at in dedup syncs, and only if _ab_cdc_deleted_at exists in the schema
  • deduping on composite PKs now works
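
The two bugfixes above can be illustrated with a small sketch. This is not the code in BigQuerySqlGenerator: the class, enum, and method names are hypothetical, and the SQL fragments are simplified placeholders chosen only to show the conditions involved.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

final class DedupSqlSketch {
  // Hypothetical stand-in for the destination sync mode.
  enum SyncMode { OVERWRITE, APPEND, APPEND_DEDUP }

  static final String CDC_COLUMN = "_ab_cdc_deleted_at";

  // Only emit a CDC-deletion filter for dedup syncs, and only when the
  // stream's schema actually declares the CDC column.
  static String cdcDeleteFilter(SyncMode mode, Set<String> schemaColumns) {
    boolean applies = mode == SyncMode.APPEND_DEDUP && schemaColumns.contains(CDC_COLUMN);
    return applies ? "AND `" + CDC_COLUMN + "` IS NULL" : "";
  }

  // Composite primary keys: partition on every key column, not just the first.
  static String partitionBy(List<String> primaryKey) {
    return primaryKey.stream()
        .map(column -> "`" + column + "`")
        .collect(Collectors.joining(", "));
  }
}
```

For example, `partitionBy(List.of("id1", "id2"))` yields a two-column partition clause, which is the shape a composite key such as (id1, id2) in the test fixtures would need.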

Recommended reading order

  1. RecordDiffer - feel free to just skim this; it does a lot of JsonNode comparisons but nothing particularly interesting. Just take note of what the constructor does, plus verifySyncResult / diffRawTableRecords / diffFinalTableRecords.
  2. BigQuerySqlGeneratorIntegrationTest - refactor to use RecordDiffer + add toJsonRecords
  3. BaseTypingDedupingTest.java + AbstractBigQueryTypingDedupingTest + BigQueryGcsTypingDedupingTest + BigQueryStandardInsertsTypingDedupingTest
    1. Start with the abstract methods; take a look at the bigquery implementations
    2. Those implementations depend on some minor tweaks in BigQueryDestinationTestUtils, and BigQueryDestination
    3. Then go through the junit-annotated methods (@BeforeEach, @AfterEach, @Test)
    4. The tests depend on the schema.json + *.jsonl files in the resources directory
    5. Glance very quickly at the stuff around setupProcessFactory - this was mostly copied from DestinationAcceptanceTest.
  4. BigQuerySqlGenerator - this is some simple tweaks to the T+D logic
  5. build.gradle + settings.gradle + airbyte-integration-test-java.gradle
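
For readers who skim RecordDiffer as suggested in item 1, the general idea of diffing expected against actual records can be sketched as follows. This is an illustration only: it uses plain maps instead of Jackson JsonNode, and `RecordDiffSketch`, `diff`, and the `idField` parameter are hypothetical names rather than the class's real interface.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.TreeMap;
import java.util.TreeSet;

final class RecordDiffSketch {
  /** Returns one readable line per difference; an empty list means the records match. */
  static List<String> diff(List<Map<String, Object>> expected,
                           List<Map<String, Object>> actual,
                           String idField) {
    Map<String, Map<String, Object>> expectedById = index(expected, idField);
    Map<String, Map<String, Object>> actualById = index(actual, idField);

    // Walk the union of identifiers in sorted order so the output is stable.
    TreeSet<String> ids = new TreeSet<>(expectedById.keySet());
    ids.addAll(actualById.keySet());

    List<String> problems = new ArrayList<>();
    for (String id : ids) {
      Map<String, Object> e = expectedById.get(id);
      Map<String, Object> a = actualById.get(id);
      if (a == null) {
        problems.add("missing record " + id);
      } else if (e == null) {
        problems.add("unexpected record " + id);
      } else {
        // Compare field by field, including fields present on only one side.
        TreeSet<String> fields = new TreeSet<>(e.keySet());
        fields.addAll(a.keySet());
        for (String field : fields) {
          if (!Objects.equals(e.get(field), a.get(field))) {
            problems.add("record " + id + ", field " + field
                + ": expected " + e.get(field) + " but got " + a.get(field));
          }
        }
      }
    }
    return problems;
  }

  private static Map<String, Map<String, Object>> index(List<Map<String, Object>> records, String idField) {
    Map<String, Map<String, Object>> byId = new TreeMap<>();
    for (Map<String, Object> record : records) {
      byId.put(String.valueOf(record.get(idField)), record);
    }
    return byId;
  }
}
```

Collecting every difference into one list, rather than failing on the first mismatch, is what makes a test failure report readable when several records are wrong at once.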

🚨 User Impact 🚨

none, this only touches test code + 1s1t code

@octavia-squidington-iii octavia-squidington-iii added the area/connectors Connector related issues label Jun 29, 2023
@github-actions

github-actions bot commented Jun 29, 2023

Before Merging a Connector Pull Request

Wow! What a great pull request you have here! 🎉

To merge this PR, ensure the following has been done/considered for each connector added or updated:

  • PR name follows PR naming conventions
  • Breaking changes are considered. If a Breaking Change is being introduced, ensure an Airbyte engineer has created a Breaking Change Plan and you've followed all steps in the Breaking Changes Checklist
  • Connector version has been incremented in the Dockerfile and metadata.yaml according to our Semantic Versioning for Connectors guidelines
  • Secrets in the connector's spec are annotated with airbyte_secret
  • All documentation files are up to date. (README.md, bootstrap.md, docs.md, etc...)
  • Changelog updated in docs/integrations/<source or destination>/<name>.md with an entry for the new version. See changelog example
  • The connector tests are passing in CI
  • You've updated the connector's metadata.yaml file (new!)
  • If set, you've ensured the icon is present in the platform-internal repo. (Docs)

If the checklist is complete, but the CI check is failing,

  1. Check for hidden checklists in your PR description

  2. Toggle the github label checklist-action-run on/off to re-run the checklist CI.

@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.

✅ edgao
❌ octavia-approvington

@edgao edgao marked this pull request as ready for review July 3, 2023 19:32
@edgao edgao requested a review from a team as a code owner July 3, 2023 19:32
@evantahler evantahler left a comment

I really like the pattern of a 'raw' and a 'final' file for each test, which we can do for every destination

@@ -0,0 +1,5 @@
{"_airbyte_extracted_at": "1970-01-01T00:00:02Z", "_airbyte_data": {"id1": 1, "id2": 200, "updated_at": "2000-01-02T00:00:00Z", "_ab_cdc_deleted_at": null, "name": "Alice", "address": {"city": "Seattle", "state": "WA"}}}
// Keep the record that deleted Bob, but delete the other records associated with id=(1, 201)

You can have comments in JSONl?!


you can if you're the one doing the parsing :P https://github.com/airbytehq/airbyte/pull/27852/files#diff-411f8457a8ec3a20dd71664fb4d47faed575ca681a68d041382d6768cea32aadR427

afaik jsonl doesn't have a "real" standard syntax, but vs code at least has syntax highlighting for this comment style. So I figure it's good enough 🚛
[image: VS Code syntax highlighting of a JSONL file with comments]
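
A minimal sketch of the "you can if you're the one doing the parsing" idea, separate from the parser linked above (which is not reproduced here). It assumes comments occupy whole lines starting with `//`, as in the fixture shown in this thread; `CommentedJsonl` and `recordLines` are hypothetical names, and the surviving lines would still need to go through a real JSON parser afterwards.

```java
import java.util.List;
import java.util.stream.Collectors;

final class CommentedJsonl {
  /** Returns only the lines that should be handed to a JSON parser. */
  static List<String> recordLines(String fileContents) {
    return fileContents.lines()
        .map(String::strip)
        // Drop blank lines and whole-line // comments; everything else is a record.
        .filter(line -> !line.isEmpty() && !line.startsWith("//"))
        .collect(Collectors.toList());
  }
}
```

Because the check is applied to the start of the stripped line only, a `//` inside a JSON string value (for example a URL) does not cause the record to be dropped.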

@edgao edgao marked this pull request as ready for review July 11, 2023 21:39
@edgao

edgao commented Jul 11, 2023

watching the DATs run for real and will merge if they pass.

@octavia-squidington-iii

destination-bigquery-denormalized test report (commit 9d3275aa9a) - ✅

⏲️ Total pipeline duration: 19mn15s

Step Result
Validate airbyte-integrations/connectors/destination-bigquery-denormalized/metadata.yaml
Connector version semver check
QA checks
Build connector tar
Build destination-bigquery-denormalized docker image for platform linux/x86_64
./gradlew :airbyte-integrations:connectors:destination-bigquery-denormalized:integrationTest

🔗 View the logs here

Please note that tests are only run on PR ready for review. Please set your PR to draft mode to not flood the CI engine and upstream service on following commits.
You can run the same pipeline locally on this branch with the airbyte-ci tool with the following command

airbyte-ci connectors --name=destination-bigquery-denormalized test

@octavia-squidington-iii

destination-bigquery test report (commit 9d3275aa9a) - ✅

⏲️ Total pipeline duration: 16.77s

Step Result

🔗 View the logs here

Please note that tests are only run on PR ready for review. Please set your PR to draft mode to not flood the CI engine and upstream service on following commits.
You can run the same pipeline locally on this branch with the airbyte-ci tool with the following command

airbyte-ci connectors --name=destination-bigquery test

@octavia-squidington-iii

destination-bigquery test report (commit 9d3275aa9a) - ✅

⏲️ Total pipeline duration: 29mn27s

Step Result
Validate airbyte-integrations/connectors/destination-bigquery/metadata.yaml
Connector version semver check
QA checks
Build connector tar
Build destination-bigquery docker image for platform linux/x86_64
Build airbyte/normalization:dev
./gradlew :airbyte-integrations:connectors:destination-bigquery:integrationTest

@edgao

edgao commented Jul 13, 2023

connectors base build was broken on master; merged latest and hopefully that will fix

@octavia-squidington-iii

destination-bigquery-denormalized test report (commit 9d3275aa9a) - ✅

⏲️ Total pipeline duration: 18mn35s

Step Result
Validate airbyte-integrations/connectors/destination-bigquery-denormalized/metadata.yaml
Connector version semver check
QA checks
Build connector tar
Build destination-bigquery-denormalized docker image for platform linux/x86_64
./gradlew :airbyte-integrations:connectors:destination-bigquery-denormalized:integrationTest

@octavia-squidington-iii

destination-bigquery test report (commit b32f17a3d1) - ✅

⏲️ Total pipeline duration: 03mn08s

Step Result
Validate airbyte-integrations/connectors/destination-bigquery/metadata.yaml
Connector version semver check
QA checks
Build connector tar
Build destination-bigquery docker image for platform linux/x86_64
Build airbyte/normalization:dev
./gradlew :airbyte-integrations:connectors:destination-bigquery:integrationTest

@octavia-squidington-iii

destination-bigquery-denormalized test report (commit b32f17a3d1) - ✅

⏲️ Total pipeline duration: 17mn25s

Step Result
Validate airbyte-integrations/connectors/destination-bigquery-denormalized/metadata.yaml
Connector version semver check
QA checks
Build connector tar
Build destination-bigquery-denormalized docker image for platform linux/x86_64
./gradlew :airbyte-integrations:connectors:destination-bigquery-denormalized:integrationTest

@edgao edgao merged commit bf65992 into edgao/1s1t_redeploy Jul 13, 2023
14 checks passed
@edgao edgao deleted the edgao/1s1t/dats branch July 13, 2023 15:20
octavia-approvington added a commit that referenced this pull request Jul 14, 2023
* Revert "Revert "Destination Bigquery: Scaffolding for destinations v2 (#27268)""

This reverts commit 348c577.

* version bumps+changelog

* Speed up BQ by having 2 queries, and not an OR (#27981)

* 🐛 Destination Bigquery: fix bug in standard inserts for syncs >10K records (#27856)

* only run t+d code if it's enabled

* dockerfile+changelog

* remove changelog entry

* Destinations V2: handle optional fields for `object` and `array` types (#27898)

* catch null schema

* fix null properties

* clean up

* consolidate + add more tests

* try catch

* empty json test

* Automated Commit - Formatting Changes

* remove todo

* destination bigquery: misc updates to 1s1t code (#28057)

* switch to checkedconsumer

* add unit test for buildColumnId

* use flag

* restructure prefix check

* fix build

* more type-parsing fixes (#28100)

* more type-parsing fixes

* handle duplicates

* Automated Commit - Format and Process Resources Changes

* add tests for asColumns

* Automated Commit - Format and Process Resources Changes

* log warnings instead of throwing exception

* better log message

* error level

---------

Co-authored-by: edgao <edgao@users.noreply.github.com>

* Automated Commit - Formatting Changes

* Improve protocol type parsing (#28126)

* Automated Commit - Formatting Changes

* Change from T&D every 10k records to an increasing time based interval (#28130)

* fifteen minute t&d

* add typing and deduping operation valve for increased intervals of typing and deduping

* Automated Commit - Format and Process Resources Changes

* resolve bizarre merge conflict

* Automated Commit - Format and Process Resources Changes

---------

Co-authored-by: jbfbell <jbfbell@users.noreply.github.com>

* Simplify and speed up CDC delete support [DestinationsV2] (#28029)

* Simplify and speed up CDC delete support [DestinationsV2]

* better QUOTE

* spotbugs?

* recompile dbt image for local arch and use that when building images

* things compile, but tests fail

* tests working-ish

* comment

* fix logic to re-insert deleted records for cursor comparison.

tests pass!

* remove comment

* Skip CDC re-include logic if there are no CDC columns

* stop hardcoding pk (#28092)

* wip

* remove TODOs

---------

Co-authored-by: Edward Gao <edward.gao@airbyte.io>

* update method name

* Automated Commit - Formatting Changes

* depend on pinned normalization version

* implement 1s1t DATs for destination-bigquery (#27852)

* intiial implementation

* Automated Commit - Formatting Changes

* add second sync to test

* do concurrent things

* Automated Commit - Formatting Changes

* clarify comment

* minor tweaks

* more stuff

* Automated Commit - Formatting Changes

* minor cleanup

* lots of fixes

* handle sql vs json null better
* verify extra columns
* only check deleted_at if in DEDUP mode and the column exists
* add full refresh append test case

* Automated Commit - Formatting Changes

* add tests for the remaining sync modes

* Automated Commit - Formatting Changes

* readability stuff

* Automated Commit - Formatting Changes

* add test for gcs mode

* remove static fields

* Automated Commit - Formatting Changes

* add more test cases, tweak test scaffold

* cleanup

* Automated Commit - Formatting Changes

* extract recorddiffer

* and use it in the sql generator test

* fix

* comment

* naming+comment

* one more comment

* better assert

* remove unnecessary thing

* one last thing

* Automated Commit - Formatting Changes

* enable concurrent execution on all java integration tests

* add test for default namespace

* Automated Commit - Formatting Changes

* implement a 2-stream test

* Automated Commit - Formatting Changes

* extract methods

* invert jsonNodesNotEquivalent

* Automated Commit - Formatting Changes

* fix conditional

* pull out diffSingleRecord

* Automated Commit - Formatting Changes

* handle nulls correctly

* remove raw-specific handling; break up methods

* Automated Commit - Formatting Changes

---------

Co-authored-by: edgao <edgao@users.noreply.github.com>
Co-authored-by: octavia-approvington <octavia-approvington@users.noreply.github.com>

* Destinations V2: move create raw tables earlier (#28255)

* move create raw tables

* better log message

* stop building normalization (#28256)

* fix ability to run tests

* disable incremental t+d for now

* Automated Commit - Formatting Changes

---------

Co-authored-by: Evan Tahler <evan@airbyte.io>
Co-authored-by: Cynthia Yin <cynthia@airbyte.io>
Co-authored-by: cynthiaxyin <cynthiaxyin@users.noreply.github.com>
Co-authored-by: edgao <edgao@users.noreply.github.com>
Co-authored-by: Joe Bell <joseph.bell@airbyte.io>
Co-authored-by: jbfbell <jbfbell@users.noreply.github.com>
Co-authored-by: octavia-approvington <octavia-approvington@users.noreply.github.com>
efimmatytsin pushed a commit to scentbird/airbyte that referenced this pull request Jul 27, 2023
Development

Successfully merging this pull request may close these issues.

Write 1s1t DAT tests
6 participants