Allow users to disable schema check and creation on load_file #1922

Merged
tatiana merged 8 commits into main from issue-1921
May 5, 2023

tatiana (Collaborator) commented May 5, 2023

Support running load_file without checking if the table schema exists or trying to create it.

Recently a user reported that the cost of checking if the schema exists is very high for Snowflake:
"I have a (load_file) task that took 1:36 minutes to run, and it was 1:30 running the information schema query."
This is likely happening for other databases as well.

Introduce two ways of disabling schema checks:

  1. On a per-task basis, by exposing the argument `schema_exists` in `aql.load_file`.
    When this argument is True, the SDK will neither check whether the schema exists nor try to create it.
    It is False by default, in which case the Python SDK behaves as it did in 1.6 (running the schema check and, if needed, trying to create the schema).

  2. Globally, by exposing the Airflow configuration `load_table_schema_exists` in the `[astro-sdk]` section. This can also be set using the environment variable `AIRFLOW__ASTRO_SDK__LOAD_TABLE_SCHEMA_EXISTS`. The per-task argument from option 1 overrides the global configuration.
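
The precedence between the two options can be sketched in plain Python. This is a minimal illustration, not the SDK's implementation: the helper names `global_schema_exists`, `resolve_schema_exists` and `load_file_steps` are hypothetical, using `None` to mean "argument not passed" is an assumption, and the boolean parsing is a simplified stand-in for Airflow's configuration handling. Only the environment variable name comes from the description above.

```python
import os
from typing import List, Mapping, Optional

# Name taken from the PR description.
ENV_VAR = "AIRFLOW__ASTRO_SDK__LOAD_TABLE_SCHEMA_EXISTS"


def global_schema_exists(env: Mapping[str, str] = os.environ) -> bool:
    """Read the global default; unset means False (the 1.6 behaviour).

    Assumption: simplified parsing that accepts a few common truthy spellings.
    """
    return env.get(ENV_VAR, "false").strip().lower() in {"1", "t", "true"}


def resolve_schema_exists(
    task_value: Optional[bool], env: Mapping[str, str] = os.environ
) -> bool:
    """The per-task argument wins; otherwise fall back to the global setting."""
    if task_value is not None:
        return task_value
    return global_schema_exists(env)


def load_file_steps(schema_exists: bool) -> List[str]:
    """Hypothetical list of steps a load would run for a given setting."""
    steps: List[str] = []
    if not schema_exists:
        # These are the steps the user reported as expensive on Snowflake.
        steps += ["check schema", "create schema if missing"]
    steps.append("load file into table")
    return steps
```

With the variable set to `True` and no per-task argument, `resolve_schema_exists(None, env)` yields `True` and the schema steps are skipped; passing `False` on a task restores them for that task only.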

Closes: #1921

codecov bot commented May 5, 2023

Codecov Report

Patch coverage: 100.00% and project coverage change: -0.51 ⚠️

Comparison is base (af36feb) 85.31% compared to head (0499044) 84.81%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1922      +/-   ##
==========================================
- Coverage   85.31%   84.81%   -0.51%     
==========================================
  Files         104      104              
  Lines        5952     5959       +7     
  Branches      677      678       +1     
==========================================
- Hits         5078     5054      -24     
- Misses        735      762      +27     
- Partials      139      143       +4     
| Flag | Coverage Δ |
|---|---|
| PythonSDK | 92.36% <100.00%> (-0.72%) ⬇️ |


| Impacted Files | Coverage Δ |
|---|---|
| python-sdk/src/astro/databases/base.py | 91.86% <100.00%> (-1.60%) ⬇️ |
| python-sdk/src/astro/databases/databricks/delta.py | 85.71% <100.00%> (+0.09%) ⬆️ |
| python-sdk/src/astro/settings.py | 100.00% <100.00%> (ø) |
| python-sdk/src/astro/sql/operators/load_file.py | 98.21% <100.00%> (+1.81%) ⬆️ |

... and 3 files with indirect coverage changes


@tatiana tatiana marked this pull request as ready for review May 5, 2023 10:09
utkarsharma2 (Collaborator) commented
@tatiana Do you think a global setting as a default can be helpful? like an env variable?

tatiana (Collaborator, Author) commented May 5, 2023

> @tatiana Do you think a global setting as a default can be helpful? like an env variable?

@utkarsharma2 that's a great idea, I'll add it to this PR.

tatiana (Collaborator, Author) commented May 5, 2023

@utkarsharma2 I added the global config. Please let me know your thoughts!

@tatiana tatiana merged commit 74a6894 into main May 5, 2023
@tatiana tatiana deleted the issue-1921 branch May 5, 2023 13:02
tatiana added a commit that referenced this pull request May 5, 2023
This is a follow-up for #1922. In that PR we allowed users to skip
schema check & creation for `aql.load_file`, but we missed the fact that
`aql.transform` and `aql.transform_file` had the same issue. This PR
aims to address this limitation.

Changes included in this PR:
* Rename config `load_table_schema_exists` to `assume_schema_exists`
* Rename (`load_file`) argument `schema_exists` to
`assume_schema_exists`
* Refactor where the check for `assume_schema_exists` happens. Before,
it happened only inside the `load_file_to_table`. Now, it is part of
`create_schema_if_applicable`. This makes this feature available in the
`aql.transform` task as well
* Rename `Database.create_schema_if_needed` to
`Database.create_schema_if_applicable`
* Expose `assume_schema_exists` in `aql.transform`
* Release 1.7.0a2
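
The refactor described in this commit message can be sketched as follows. This is a hedged illustration only: `SketchDatabase` and `run_transform` are hypothetical names, the method signatures are assumptions, and the recorded strings stand in for real queries. The point it shows is that once the `assume_schema_exists` check lives inside `create_schema_if_applicable`, every caller skips the schema lookup, not just `load_file_to_table`.

```python
from typing import List, Optional, Set


class SketchDatabase:
    """Hypothetical stand-in for a database wrapper; records the queries it would run."""

    def __init__(self, existing_schemas: Set[str]) -> None:
        self.existing_schemas = set(existing_schemas)
        self.queries: List[str] = []

    def schema_exists(self, schema: str) -> bool:
        # Stand-in for the expensive information-schema lookup.
        self.queries.append(f"check schema {schema}")
        return schema in self.existing_schemas

    def create_schema_if_applicable(
        self, schema: Optional[str], assume_schema_exists: bool = False
    ) -> None:
        # The flag is checked here, so every caller gets the shortcut.
        if assume_schema_exists or schema is None:
            return
        if not self.schema_exists(schema):
            self.queries.append(f"create schema {schema}")
            self.existing_schemas.add(schema)

    def load_file_to_table(
        self, table: str, schema: str, assume_schema_exists: bool = False
    ) -> None:
        self.create_schema_if_applicable(schema, assume_schema_exists)
        self.queries.append(f"load into {schema}.{table}")

    def run_transform(
        self, table: str, schema: str, assume_schema_exists: bool = False
    ) -> None:
        # Hypothetical transform path; it reuses the same schema helper.
        self.create_schema_if_applicable(schema, assume_schema_exists)
        self.queries.append(f"transform into {schema}.{table}")


db = SketchDatabase({"analytics"})
db.load_file_to_table("orders", "analytics", assume_schema_exists=True)
db.run_transform("daily", "analytics", assume_schema_exists=True)
print(db.queries)  # no "check schema" entries are recorded
```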