Since version 1.1, Cosmos creates Airflow inlets and outlets for every dbt model/seed/snapshot task, which allows end users to leverage Airflow's data-aware scheduling.
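For context, this is a minimal sketch of what data-aware scheduling looks like from the end-user side: a downstream DAG scheduled on one of the datasets Cosmos emits as an outlet. The DAG id and dataset URI are illustrative, and the `Dataset` import assumes Airflow 2.4+:

```python
import pendulum

from airflow.datasets import Dataset  # available since Airflow 2.4
from airflow.decorators import dag, task

# Hypothetical dataset URI, in the style Cosmos emits as an outlet.
stg_customers = Dataset("postgres://0.0.0.0:5432/postgres.public.stg_customers")


@dag(schedule=[stg_customers], start_date=pendulum.datetime(2024, 1, 1), catchup=False)
def downstream_of_cosmos():
    @task
    def consume():
        # Runs whenever a dataset event is recorded for stg_customers.
        print("stg_customers was updated")

    consume()


downstream_of_cosmos()
```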
In the past, Cosmos identified these inlets and outlets using URIs that were not representative of the dataset being created. The one advantage of that approach was that the identifiers could be created during DAG parsing/processing time.
This changed in the 1.1 release, when we decided to adopt the OpenLineage naming convention to describe the Airflow Datasets created by Cosmos (inlets/outlets). They became something similar to: "postgres://0.0.0.0:5432/postgres.public.stg_customers". The downside of this approach was that we started using the library openlineage-integration-common, which can only create the resource URIs after the dbt command has run, since it currently relies on dbt-core artefacts. This means we started creating inlets/outlets during task execution.
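For clarity, the OpenLineage naming convention for Postgres combines a namespace (scheme, host, port) with a fully qualified table name. A hedged sketch of how such a URI is shaped (the helper below is hypothetical, for illustration only; Cosmos derives the real value via openlineage-integration-common from dbt artefacts):

```python
def openlineage_postgres_uri(host: str, port: int, database: str, schema: str, table: str) -> str:
    """Build a dataset URI following the OpenLineage naming convention for
    Postgres: namespace "postgres://{host}:{port}" plus the fully qualified
    name "{database}.{schema}.{table}".

    Hypothetical helper for illustration, not Cosmos code.
    """
    return f"postgres://{host}:{port}/{database}.{schema}.{table}"


# Matches the example URI above:
assert (
    openlineage_postgres_uri("0.0.0.0", 5432, "postgres", "public", "stg_customers")
    == "postgres://0.0.0.0:5432/postgres.public.stg_customers"
)
```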
A side effect of this change was that we hit a limitation: Airflow <= 2.9 was not designed to support setting inlets and outlets during task execution, which resulted in this long-standing issue: #522
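To illustrate the limitation, in Airflow <= 2.9 the scheduler only honours outlets declared statically at DAG-parsing time, as in the sketch below (DAG id, task, and URI are illustrative, not Cosmos's actual implementation):

```python
import pendulum

from airflow import DAG
from airflow.datasets import Dataset
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="cosmos_style_producer",  # hypothetical DAG id
    start_date=pendulum.datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
):
    # Outlets declared here, at parse time, are registered by the scheduler.
    run_model = BashOperator(
        task_id="run_stg_customers",
        bash_command="dbt run --select stg_customers",
        outlets=[Dataset("postgres://0.0.0.0:5432/postgres.public.stg_customers")],
    )
    # Appending to task.outlets from inside execute() at runtime does not
    # emit dataset events in Airflow <= 2.9, which is the limitation above.
```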
Another side effect was that, since we started relying on task execution to determine the Airflow dataset identifier, we didn't expose a method for end users to easily determine it. More context in #1036. The community raises this very often, and we created an issue in Airflow about it: apache/airflow#34206
After several discussions with @uranusjr, he proposed introducing the concept of DatasetAliases in Airflow 2.10. @Lee-W worked on this: apache/airflow#40478
This feature will be released as part of Airflow 2.10.
The goal of this epic is to leverage Airflow DatasetAliases in Cosmos, so that:
- users can clearly see the datasets created by Cosmos, during task execution, in the Airflow UI
- we can have non-OpenLineage Dataset Aliases that can be added during DAG parsing time, and expose methods for users to retrieve these (see the sketch after this list)
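For reference, this is roughly how the new Airflow 2.10 DatasetAlias API lets a task declare an alias at parse time and attach the concrete dataset to it at execution time. The alias name, URI, and task are illustrative, not Cosmos's actual implementation:

```python
import pendulum

from airflow.datasets import Dataset, DatasetAlias
from airflow.decorators import dag, task


@dag(start_date=pendulum.datetime(2024, 1, 1), schedule=None, catchup=False)
def dataset_alias_demo():
    # The alias is declared at parse time, so the scheduler knows about it
    # even though the concrete dataset is only resolved during execution.
    @task(outlets=[DatasetAlias("cosmos_stg_customers")])
    def run_model(*, outlet_events):
        # At execution time, attach the resolved dataset to the alias;
        # this records a dataset event that downstream DAGs can react to.
        outlet_events[DatasetAlias("cosmos_stg_customers")].add(
            Dataset("postgres://0.0.0.0:5432/postgres.public.stg_customers")
        )

    run_model()


dataset_alias_demo()
```

A downstream DAG can then use `schedule=[DatasetAlias("cosmos_stg_customers")]` and is triggered once the alias resolves to a dataset event, without knowing the concrete URI at parse time.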
Initially planned tasks (more to be added as part of the PoC ticket):
- `DatasetAlias` in Cosmos #1135

Description co-authored by @tatiana @pankajastro
I made significant progress on this task, as can be seen in PR #1217.
Yesterday, I implemented the changes to the code itself (no tests, just a quick PoC).
Today, I validated the change and made a minor adjustment to make it work.
The change works as expected with the Astro CLI. Using Airflow standalone doesn't work so well; I connected with Wei about this, and he'll investigate further.
I was able to see the Datasets/Dataset Aliases in the Airflow UI.
I was also able to see a DAG being triggered. I'll share more information on this soon.