Create and run accurate SQL statements when using ExecutionMode.AIRFLOW_ASYNC
#1474
@pankajkoti I'm very excited that we now have a more reliable way of calculating the full dbt SQL query. This approach fixes #1260 and solves many of the async tickets we have open.
Monkey-patching always carries a risk, but it is worth it at this stage.
It would be great if, either as part of this PR or as a priority follow-up PR, we had an efficient way of testing that the monkey patching works in multiple versions of dbt, including the latest releases, and that the transformation is not executed when we run the dbt command. I believe this must be done before we release this feature in 1.9.0.
I've logged two follow-up tickets that are relevant:
- one is to consider the re-introduction of the compile task, if that means we can avoid having dbt installed in all dbt worker nodes, and executing the command in most Cosmos tasks: Re-evaluate adding compile task when using ExecutionMode.AIRFLOW_ASYNC #1477
- the other is to support TestBehavior.BUILD: Support TestBehavior.BUILD when using ExecutionMode.AIRFLOW_ASYNC #1476
It would be great if these could be accomplished before the 1.9.0 release, but I'm also happy with us sticking to this approach if time does not allow further analysis / implementation.
cc: @joppevos for visibility on the ongoing work
LGTM 👍
@pankajkoti Congratulations on the outstanding work in this PR and on your patience in addressing and fixing each bug that popped up during the development of this feature. I can't wait to see this feature used in production.
In addition to the feedback that I gave previously, there are two change requests:
- This comment: https://github.com/astronomer/astronomer-cosmos/pull/1474/files#r1941988073
- And this comment: https://github.com/astronomer/astronomer-cosmos/pull/1474/files#r1941991943
Given the size of this PR and all the challenges already overcome, I do not want my design requests to block its merging. So, your PR is approved. However, please create a follow-up ticket so the interfaces can be simplified as soon as possible. Other tasks planned for the 1.9 release will depend on these interface changes, so please prioritise this ticket over any other work so we can wrap this up.
Co-authored-by: Tatiana Al-Chueyr <tatiana.alchueyr@gmail.com>
Overview
This PR introduces a reliable way to extract the SQL statements run by dbt-core so that Airflow asynchronous operators can use them. It fixes the experimental BQ implementation of ExecutionMode.AIRFLOW_ASYNC introduced in Cosmos 1.7 (#1230).
Previously, in #1230, we attempted to understand how dbt-core runs --full-refresh for BQ, and we hard-coded the SQL header in Cosmos as an experimental feature. Since then, we realised that this approach was prone to errors (e.g. #1260) and that it is unrealistic for Cosmos to try to recreate the logic of how dbt-core and its adapters generate all the SQL statements for different operations, data warehouses, and types of materialisation.
With this PR, we use dbt-core to create the complete SQL statements without dbt-core running those transformations. This enables better compatibility with various dbt-core features while ensuring correctness in running models.
The drawback of the current approach is that it relies on monkey patching, a technique used to dynamically update the behaviour of a piece of code at run-time. Cosmos monkey patches the dbt-core adapter methods at the point where they would normally execute SQL statements: Cosmos modifies this behaviour so that the SQL statements are written to disk without performing any operations on the actual data warehouse.
The main risk of this strategy is that dbt changes its interface. For this reason, we logged the follow-up ticket #1489 to make sure we test the latest version of dbt and its adapters and confirm the monkey patching works as expected regardless of the version being used. That said, since the method being monkey patched is part of the dbt-core interface with its adapters, we believe the risk of breaking changes is low.
The other challenge with the current approach is that every Cosmos task relies on the following:
- dbt-core being installed alongside the Airflow installation
- dbtRunner logic
We have logged a follow-up ticket to evaluate the possibility of overcoming these challenges: #1477
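The monkey-patching idea described above can be pictured with a minimal sketch. It does not use the real dbt classes: FakeConnectionManager and sql_written_to_disk are hypothetical stand-ins invented for illustration, and the (response, table) return shape is an assumption, not the dbt adapter contract. The sketch only shows the general technique of swapping an execute method for one that writes the SQL to a file and then restoring the original.

```python
from contextlib import contextmanager
from pathlib import Path


class FakeConnectionManager:
    """Hypothetical stand-in for an adapter connection manager (not a dbt class)."""

    def execute(self, sql, auto_begin=False, fetch=False):
        # In a real adapter this is where the statement would reach the warehouse.
        raise RuntimeError("would run SQL against the warehouse")


@contextmanager
def sql_written_to_disk(target_dir: Path):
    """Temporarily patch execute so SQL is appended to a file instead of being run."""
    sql_file = target_dir / "captured.sql"
    original_execute = FakeConnectionManager.execute

    def _write_instead_of_execute(self, sql, auto_begin=False, fetch=False):
        with sql_file.open("a") as handle:
            handle.write(sql.rstrip() + "\n")
        # Assumed placeholder return value: a response marker and no result rows.
        return "mocked", None

    FakeConnectionManager.execute = _write_instead_of_execute
    try:
        yield sql_file
    finally:
        # Always restore the original method, even if the caller raised.
        FakeConnectionManager.execute = original_execute
```

Inside the with block, any call to FakeConnectionManager().execute(...) lands in captured.sql; outside it, the original behaviour is back. Restoring the method in a finally clause is one way to limit how far the patch leaks, which matters given the interface-change risk noted above.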
Key Changes
- _mock_bigquery_adapter() overrides BigQueryConnectionManager.execute, ensuring SQL is only written to the target directory and skipping execution in the warehouse.
- AbstractDbtBaseOperator:
  - AbstractDbtBaseOperator inherited BaseOperator, causing conflicts when used with BigQueryInsertJobOperator with our ExecutionMode.AIRFLOW_ASYNC classes and the interface built in "Add structure to support multiple db for async operator execution" #1483.
  - It is replaced by AbstractDbtBase (no longer inheriting BaseOperator), requiring explicit BaseOperator initialization in all derived operators.
  - Derived operators that now initialise BaseOperator:
    - DbtAzureContainerInstanceBaseOperator
    - DbtDockerBaseOperator
    - DbtGcpCloudRunJobBaseOperator
    - DbtKubernetesBaseOperator
- _add_dbt_compile_task is removed. It previously pre-generated SQL and uploaded it to remote storage, and subsequent tasks downloaded this compiled SQL for their execution. dbt run is now directly invoked in each task using the mocked adapter to generate the full SQL.
Issue updates
The PR fixes the following issues:
- ExecutionMode.AIRFLOW_ASYNC query #1260
- ExecutionMode.AIRFLOW_ASYNC #1271
- --full-refresh when using ExecutionMode.AIRFLOW_ASYNC #1265
- ExecutionMode.AIRFLOW_ASYNC #1264
ExecutionMode.AIRFLOW_ASYNC too with this PR
Example DAG showing ExecutionMode.AIRFLOW_ASYNC deferring tasks and the dynamic query submitted in the logs
Next Steps & Considerations:
- ExecutionMode.AIRFLOW_ASYNC by seeking feedback from users by testing alpha https://github.com/astronomer/astronomer-cosmos/releases/tag/astronomer-cosmos-v1.9.0a5 created with changes from this PR.
- Re-evaluate adding compile task when using ExecutionMode.AIRFLOW_ASYNC #1477: compare the efficiency of generating SQL dynamically vs. pre-compiling and uploading SQL via a separate task.
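The AbstractDbtBase refactoring listed under Key Changes can be pictured with the sketch below. Every class here is a hypothetical stand-in written for illustration: BaseOperator and InsertJobOperator mimic the roles of the Airflow base operator and a provider operator such as BigQueryInsertJobOperator, and DbtRunAsyncLikeOperator is not a Cosmos class. The sketch assumes only what the description states: the dbt base no longer inherits BaseOperator, so each derived operator initialises its operator parent explicitly.

```python
class BaseOperator:
    """Hypothetical stand-in for the Airflow BaseOperator."""

    def __init__(self, task_id, **kwargs):
        self.task_id = task_id


class InsertJobOperator(BaseOperator):
    """Hypothetical stand-in for a provider operator that already extends BaseOperator."""

    def __init__(self, configuration, **kwargs):
        super().__init__(**kwargs)
        self.configuration = configuration


class AbstractDbtBase:
    """Holds dbt-specific state only; deliberately not a BaseOperator subclass."""

    def __init__(self, project_dir, **kwargs):
        self.project_dir = project_dir


class DbtRunAsyncLikeOperator(InsertJobOperator, AbstractDbtBase):
    """Combines a provider operator with the dbt base; each parent is initialised explicitly."""

    def __init__(self, task_id, project_dir, **kwargs):
        AbstractDbtBase.__init__(self, project_dir=project_dir)
        InsertJobOperator.__init__(self, task_id=task_id, configuration={}, **kwargs)


class DbtLocalLikeOperator(AbstractDbtBase, BaseOperator):
    """A derived operator that must now call BaseOperator.__init__ itself."""

    def __init__(self, task_id, project_dir, **kwargs):
        AbstractDbtBase.__init__(self, project_dir=project_dir)
        BaseOperator.__init__(self, task_id=task_id, **kwargs)
```

With this shape, the dbt base can be mixed into a class that already descends from BaseOperator through another operator, at the cost of every derived operator having to spell out its own BaseOperator initialisation.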