Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize Snowflake load_file using native COPY INTO #544

Merged
merged 17 commits into from
Jul 26, 2022
Merged

Conversation

tatiana
Copy link
Collaborator

@tatiana tatiana commented Jul 15, 2022

Fix: #430
closes: #481

Reduced the time to load to Snowflake by 20% for 5GB datasets (from 24.46 min to 5.49 min). Further details in the PR results file.

Co-authored-by: Ankit Chaurasia ankit.chaurasia@astronomer.io

@tatiana tatiana force-pushed the issue-492-last branch 2 times, most recently from 507ba2a to 9aacba0 Compare July 18, 2022 10:55
@tatiana tatiana changed the title WIP: Add Snowflake load_file optimization Optimize Snowflake load_file using native COPY INTO Jul 18, 2022
@tatiana tatiana marked this pull request as ready for review July 18, 2022 12:10
@codecov
Copy link

codecov bot commented Jul 18, 2022

Codecov Report

Merging #544 (17c69ac) into main (a0d1912) will increase coverage by 0.08%.
The diff coverage is 96.29%.

❗ Current head 17c69ac differs from pull request most recent head 4737231. Consider uploading reports for the commit 4737231 to get more accurate results

@@            Coverage Diff             @@
##             main     #544      +/-   ##
==========================================
+ Coverage   92.59%   92.68%   +0.08%     
==========================================
  Files          40       40              
  Lines        1594     1613      +19     
  Branches      205      206       +1     
==========================================
+ Hits         1476     1495      +19     
  Misses         93       93              
  Partials       25       25              
Impacted Files Coverage Δ
src/astro/databases/snowflake.py 95.69% <96.29%> (+0.48%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a0d1912...4737231. Read the comment docs.

Copy link
Contributor

@sunank200 sunank200 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me with a minor comment. I think code coverage might need to be improved.

:param source_file: File from which we need to transfer data
:param target_table: Table to which the content of the file will be loaded to
:param if_exists: Strategy used to load (currently supported: "append" or "replace")
:param native_support_kwargs: kwargs to be used by method involved in native support flow
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be it useful for users to know what are different native_support_kwargs supported? or may be reference to COPY INTO documentation?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sunank200, please, have a look now!
I had already added the reference to the official docs on the bottom of the docstring, do you think I should add some other? Or place it in another place?

Copy link
Contributor

@sunank200 sunank200 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test is calling wrong method i assume. It should have called load_file_to_table_natively(). Thats the reason coverage shows this test not being run.

tests/databases/test_snowflake.py Show resolved Hide resolved
@tatiana tatiana force-pushed the issue-492-last branch 2 times, most recently from 0d603ac to d5cd2cd Compare July 19, 2022 13:59
@tatiana
Copy link
Collaborator Author

tatiana commented Jul 26, 2022

We added the benchmark results to the PR and fixed the broken tests.

Two checks are failing probably because of pypa/pip#11298 - which affects all branches. I just re-run the tests for a branch which was passing 4h ago, that wasn't changed and the same tests failed.

I've contacted TP to get his advice, and in the meantime, we'll merge this PR.

@tatiana tatiana merged commit 461e63e into main Jul 26, 2022
@tatiana tatiana deleted the issue-492-last branch July 26, 2022 01:11
"""


@aql.dataframe(identifiers_as_lower=False)
@aql.dataframe()
Copy link
Collaborator Author

@tatiana tatiana Jul 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should be able to handle capitalization more consistently across dataframes / loading file after implementing: #564

@@ -67,11 +67,11 @@ def example_amazon_s3_snowflake_transform():
)

temp_table_1 = aql.load_file(
input_file=File(path=f"{s3_bucket}/ADOPTION_CENTER_1.csv"),
input_file=File(path=f"{s3_bucket}/ADOPTION_CENTER_1_unquoted.csv"),
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The CSV was very odd beforehand in the format:

"header1","header2"
"value1","value2"

The COPY INTO command wasn't acceptable in this format, and we decided that this file could be cleaned beforehand. Users may face similar issues in future.

| snowflake | hundred_kb | 9.19s | 45.4MB | 2.55s | 120.0ms |
| snowflake | ten_mb | 10.9s | 47.51MB | 2.58s | 160.0ms |
| snowflake | one_gb | 1.07min | 47.94MB | 8.7s | 5.67s |
| snowflake | five_gb | 5.49min | 53.69MB | 18.76s | 1.6s |
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll log a ticket for us to further investigate why we weren't able to run the benchmark with 10 GB for this implementation.

# Snowflake doesn't handle well mixed capitalisation of column name chars
# we are handling this more gracefully in a separate PR
if dataframe is not None:
dataframe.columns.str.upper()
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should be able to handle capitalization more consistently across dataframes / loading file after implementing: #564

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants