🐛 fixed OOM error when splitting a stream into several files #7074

Merged
andriikorotkov merged 8 commits into master from akorotkov|copy-s3-fix on Oct 18, 2021

Conversation

andriikorotkov
Contributor

What

The AWS S3 staging COPY writes records from different tables into the same raw table.

How

Write every 10,000,000 records to a new file. This makes the files easier to work with and speeds up writing large volumes of data. In addition, with a large number of records, the COPY request that loads a file's records into the staging table is less likely to fail with QUERY_TIMEOUT.
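
As a rough illustration of what the 10,000,000-record limit means for file counts, the arithmetic can be sketched as below. This is not code from the PR: the class name `StagingFileMath` and the method `filesNeeded` are hypothetical, and only the limit itself comes from the description above.

```java
// Hypothetical helper, not taken from the PR: shows how a per-file
// record limit translates into a number of staging files.
final class StagingFileMath {

  // Per-file limit mentioned in the PR description.
  static final long MAX_RECORDS_PER_FILE = 10_000_000L;

  private StagingFileMath() {}

  // Ceiling division: a partially filled last file still counts as a file.
  static long filesNeeded(final long totalRecords, final long maxPerFile) {
    if (totalRecords < 0 || maxPerFile <= 0) {
      throw new IllegalArgumentException("totalRecords must be >= 0 and maxPerFile > 0");
    }
    return (totalRecords + maxPerFile - 1) / maxPerFile;
  }
}
```

Under these assumptions, a stream of 1,000,000,000 records would be split across 100 files, and each COPY would then load at most 10,000,000 records.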

[Screenshot: 2021-10-15_15-48]

@andriikorotkov andriikorotkov temporarily deployed to more-secrets October 15, 2021 13:10 Inactive
@github-actions github-actions bot added the area/connectors Connector related issues label Oct 15, 2021
@andriikorotkov
Contributor Author

andriikorotkov commented Oct 15, 2021

/test connector=connectors/destination-snowflake

🕑 connectors/destination-snowflake https://github.com/airbytehq/airbyte/actions/runs/1346052847
✅ connectors/destination-snowflake https://github.com/airbytehq/airbyte/actions/runs/1346052847
Python tests coverage:

	 ---------- coverage: platform linux, python 3.8.10-final-0 -----------
	 Name                                                              Stmts   Miss  Cover
	 -------------------------------------------------------------------------------------
	 main_dev_transform_catalog.py                                         3      3     0%
	 main_dev_transform_config.py                                          3      3     0%
	 normalization/__init__.py                                             4      0   100%
	 normalization/destination_type.py                                    12      0   100%
	 normalization/transform_catalog/__init__.py                           2      0   100%
	 normalization/transform_catalog/catalog_processor.py                143     77    46%
	 normalization/transform_catalog/destination_name_transformer.py     120      6    95%
	 normalization/transform_catalog/reserved_keywords.py                 11      0   100%
	 normalization/transform_catalog/stream_processor.py                 370    218    41%
	 normalization/transform_catalog/table_name_registry.py              174     34    80%
	 normalization/transform_catalog/transform.py                         45     26    42%
	 normalization/transform_catalog/utils.py                             33      7    79%
	 normalization/transform_config/__init__.py                            2      0   100%
	 normalization/transform_config/transform.py                         140     29    79%
	 -------------------------------------------------------------------------------------
	 TOTAL                                                              1062    403    62%

@andriikorotkov
Contributor Author

andriikorotkov commented Oct 15, 2021

/test connector=connectors/destination-redshift

🕑 connectors/destination-redshift https://github.com/airbytehq/airbyte/actions/runs/1346053543
✅ connectors/destination-redshift https://github.com/airbytehq/airbyte/actions/runs/1346053543

@jrhizor jrhizor temporarily deployed to more-secrets October 15, 2021 13:16 Inactive
@jrhizor jrhizor temporarily deployed to more-secrets October 15, 2021 13:16 Inactive
@tuliren
Contributor

tuliren commented Oct 15, 2021

@andriikorotkov, the previous PR #6949 is titled as a fix for the bug that writes multiple streams to the same table. However, does it actually fix this bug? There was no unit test in that PR to show that the bug had been fixed. Could you add unit tests in this PR?

@andriikorotkov andriikorotkov marked this pull request as ready for review October 15, 2021 17:49
@andriikorotkov andriikorotkov changed the title bug AWS S3 Staging COPY is writing records from different table in the same raw table 🐛 fixed OOM error when splitting a stream into several files Oct 15, 2021
@andriikorotkov
Contributor Author

@tuliren, you are right, this is not entirely true. I saw that contributors in the troubleshooting group hit an OOM error at this point. I tested this again with a large amount of data (as you can see in the screenshots) and fixed it quickly. If you have time, I would like to hear your opinion on this fix and any remarks you have.

@andriikorotkov andriikorotkov temporarily deployed to more-secrets October 15, 2021 18:23 Inactive
@andriikorotkov andriikorotkov temporarily deployed to more-secrets October 15, 2021 18:32 Inactive
@andriikorotkov
Contributor Author

andriikorotkov commented Oct 15, 2021

/test connector=connectors/destination-redshift

🕑 connectors/destination-redshift https://github.com/airbytehq/airbyte/actions/runs/1347114192
✅ connectors/destination-redshift https://github.com/airbytehq/airbyte/actions/runs/1347114192

@jrhizor jrhizor temporarily deployed to more-secrets October 15, 2021 18:50 Inactive
@andriikorotkov
Contributor Author

andriikorotkov commented Oct 15, 2021

/test connector=connectors/destination-snowflake

🕑 connectors/destination-snowflake https://github.com/airbytehq/airbyte/actions/runs/1347213500
✅ connectors/destination-snowflake https://github.com/airbytehq/airbyte/actions/runs/1347213500

@jrhizor jrhizor temporarily deployed to more-secrets October 15, 2021 19:24 Inactive
@andriikorotkov
Contributor Author

andriikorotkov commented Oct 15, 2021

/test connector=connectors/destination-snowflake

🕑 connectors/destination-snowflake https://github.com/airbytehq/airbyte/actions/runs/1347573288

@andriikorotkov andriikorotkov temporarily deployed to more-secrets October 15, 2021 21:36 Inactive
@jrhizor jrhizor temporarily deployed to more-secrets October 15, 2021 21:37 Inactive
@andriikorotkov andriikorotkov temporarily deployed to more-secrets October 16, 2021 08:15 Inactive
@tuliren
Contributor

tuliren commented Oct 16, 2021

Thanks for adding the testSyncWithBillionRecords test case. However, it looks like it takes too long to run. I think it's fine not to include a load test here. We will handle stress testing elsewhere.

@andriikorotkov
Contributor Author

andriikorotkov commented Oct 16, 2021

/test connector=connectors/destination-snowflake

🕑 connectors/destination-snowflake https://github.com/airbytehq/airbyte/actions/runs/1348878751
✅ connectors/destination-snowflake https://github.com/airbytehq/airbyte/actions/runs/1348878751

@andriikorotkov
Contributor Author

andriikorotkov commented Oct 16, 2021

/test connector=connectors/destination-redshift

🕑 connectors/destination-redshift https://github.com/airbytehq/airbyte/actions/runs/1348878868
✅ connectors/destination-redshift https://github.com/airbytehq/airbyte/actions/runs/1348878868

@andriikorotkov andriikorotkov temporarily deployed to more-secrets October 16, 2021 09:52 Inactive
@jrhizor jrhizor temporarily deployed to more-secrets October 16, 2021 09:53 Inactive
@jrhizor jrhizor temporarily deployed to more-secrets October 16, 2021 09:53 Inactive
@andriikorotkov
Contributor Author

@tuliren, I marked the test with a large number of records as disabled.

@tuliren tuliren temporarily deployed to more-secrets October 18, 2021 09:30 Inactive
@tuliren
Contributor

tuliren commented Oct 18, 2021

/test connector=connectors/destination-snowflake

🕑 connectors/destination-snowflake https://github.com/airbytehq/airbyte/actions/runs/1353995297
✅ connectors/destination-snowflake https://github.com/airbytehq/airbyte/actions/runs/1353995297

@tuliren
Contributor

tuliren commented Oct 18, 2021

/test connector=connectors/destination-redshift

🕑 connectors/destination-redshift https://github.com/airbytehq/airbyte/actions/runs/1353995661
✅ connectors/destination-redshift https://github.com/airbytehq/airbyte/actions/runs/1353995661

Contributor

@tuliren tuliren left a comment

I separated the filename generation logic into its own class, StagingFilenameGenerator, so that the logic can be unit tested. It turns out that there was an off-by-one bug in the original code, which has now been fixed.
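
For readers who have not opened the diff, here is a minimal sketch of what an isolated, unit-testable filename generator could look like. It is an assumption-laden illustration, not the actual StagingFilenameGenerator: the class name `RotatingFilenameSketch`, its method, and the `<stream>_<index>` naming scheme are all hypothetical. The rollover check is a typical place for an off-by-one in this kind of logic, although the sketch does not claim to reproduce the bug found in this PR.

```java
// Hypothetical sketch, not the StagingFilenameGenerator from this PR.
// It hands out a staging filename per record and rolls over to a new
// file once the current one holds maxRecordsPerFile records.
final class RotatingFilenameSketch {

  private final String streamName;
  private final long maxRecordsPerFile;
  private long recordsInCurrentFile = 0;
  private int fileIndex = 0;

  RotatingFilenameSketch(final String streamName, final long maxRecordsPerFile) {
    if (maxRecordsPerFile <= 0) {
      throw new IllegalArgumentException("maxRecordsPerFile must be positive");
    }
    this.streamName = streamName;
    this.maxRecordsPerFile = maxRecordsPerFile;
  }

  // Call once per record, before writing it. Returns the file the record belongs to.
  String filenameForNextRecord() {
    // Roll over only when the current file is already full; checking after
    // the increment instead would end each file one record early.
    if (recordsInCurrentFile == maxRecordsPerFile) {
      fileIndex++;
      recordsInCurrentFile = 0;
    }
    recordsInCurrentFile++;
    // Assumed naming scheme: <stream>_<zero-padded index>.
    return streamName + "_" + String.format("%05d", fileIndex);
  }
}
```

Because the class holds no I/O, a unit test can use a tiny limit (for example 2) and assert exactly which record triggers the switch to the next file, without generating millions of records.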

@jrhizor jrhizor temporarily deployed to more-secrets October 18, 2021 09:32 Inactive
@jrhizor jrhizor temporarily deployed to more-secrets October 18, 2021 09:32 Inactive
@andriikorotkov
Contributor Author

andriikorotkov commented Oct 18, 2021

/publish connector=connectors/destination-snowflake

🕑 connectors/destination-snowflake https://github.com/airbytehq/airbyte/actions/runs/1354194548
✅ connectors/destination-snowflake https://github.com/airbytehq/airbyte/actions/runs/1354194548

@jrhizor jrhizor temporarily deployed to more-secrets October 18, 2021 10:30 Inactive
@andriikorotkov
Contributor Author

andriikorotkov commented Oct 18, 2021

/publish connector=connectors/destination-redshift

🕑 connectors/destination-redshift https://github.com/airbytehq/airbyte/actions/runs/1354324703
✅ connectors/destination-redshift https://github.com/airbytehq/airbyte/actions/runs/1354324703

@jrhizor jrhizor temporarily deployed to more-secrets October 18, 2021 11:08 Inactive
@andriikorotkov andriikorotkov merged commit ad2b6db into master Oct 18, 2021
@andriikorotkov andriikorotkov deleted the akorotkov|copy-s3-fix branch October 18, 2021 12:00
schlattk pushed a commit to schlattk/airbyte that referenced this pull request Jan 4, 2022
…hq#7074)

* UPDATED DESTINATION JDBC

* fixed remarks

* fixed remarks

* added new test for testing largeNumberRecords

* updated test

* fixed remarks

* Separate staging filename generator and add tests

Co-authored-by: Liren Tu <tuliren.git@outlook.com>
5 participants