-
Notifications
You must be signed in to change notification settings - Fork 182
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support using DatasetAlias
and fix orphaning unreferenced dataset
#1217
Conversation
✅ Deploy Preview for sunny-pastelito-5ecb04 ready!
To edit notification comments on pull requests, go to your Netlify site configuration. |
0cc9774
to
46c86a2
Compare
26ba5f8
to
f2c8cf1
Compare
04cc589
to
d9bbc02
Compare
…oduce the airflow standalone issue we're experiencing
7ca4067
to
1a32630
Compare
DatasetAlias
and fix orphaning unreferenced dataset
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #1217 +/- ##
==========================================
+ Coverage 95.78% 95.82% +0.04%
==========================================
Files 65 66 +1
Lines 3745 3783 +38
==========================================
+ Hits 3587 3625 +38
Misses 158 158
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, excited for this one! 👏🏽
New Features * Introduction of experimental support to run dbt BQ models using Airflow deferrable operators by @pankajkoti @pankajastro @tatiana in #1224 #1230. This is a first step in this journey and we would really appreciate feedback from the community. For more information, check the documentation: https://astronomer.github.io/astronomer-cosmos/getting_started/execution-modes.html#airflow-async-experimental This work has been inspired by the talk "Airflow at Monzo: Evolving our data platform as the bank scales" by @jonathanrainer @ed-sparkes given at Airflow Summit 2023: https://airflowsummit.org/sessions/2023/airflow-at-monzo-evolving-our-data-platform-as-the-bank-scales/. * Support using ``DatasetAlias`` and fix orphaning unreferenced dataset by @tatiana in #1217 #1240 Documentation: https://astronomer.github.io/astronomer-cosmos/configuration/scheduling.html#data-aware-scheduling * Add GCP_CLOUD_RUN_JOB execution mode by @ags-de #1153 Learn more about it: https://astronomer.github.io/astronomer-cosmos/getting_started/gcp-cloud-run-job.html Enhancements * Create single virtualenv when ``DbtVirtualenvBaseOperator`` has ``virtualenv_dir=None`` and ``is_virtualenv_dir_temporary=True`` by @kesompochy in #1200 * Consistently handle build and imports in ``cosmos/__init__.py`` by @tatiana in #1215 * Add enum constants to init for direct import by @fabiomx in #1184 Bug fixes * URL encode dataset names to support multibyte characters by @t0momi219 in #1198 * Fix invalid argument (``full_refresh``) passed to DbtTestAwsEksOperator (and others) by @johnhoran in #1175 * Fix ``printer_width`` arg type in ``DbtProfileConfigVars`` by @jessicaschueler in #1191 * Fix task owner fallback by @jmaicher in #1195 Docs * Add scarf to readme and docs for website analytics by @cmarteepants in #1221 * Add ``virtualenv_dir`` param to ``ExecutionConfig`` docs by @pankajkoti in #1173 * Give credits to @LennartKloppenburg in CHANGELOG.rst by @tatiana #1174 * Refactor docs for async mode execution by @pankajkoti in #1241 Others * Remove PR branch added for testing a change in CI in #1224 by @pankajkoti in #1233 * Fix CI wrt broken coverage upload artifact @pankajkoti in #1210 * Fix CI issues - Upgrade actions/upload-artifact & actions/download-artifact to v4 and set min version for packaging by @pankajkoti in #1208 * Resolve CI failures for Apache Airflow 2.7 jobs by @pankajkoti in #1182 * CI: Update GCP manifest file path based on new secret update by @pankajkoti in #1237 * Pre-commit hook updates in #1176 #1186, #1186, #1201, #1219, #1231
New Features * Introduction of experimental support to run dbt BQ models using Airflow deferrable operators by @pankajkoti @pankajastro @tatiana in #1224 #1230. This is a first step in this journey and we would really appreciate feedback from the community. For more information, check the documentation: https://astronomer.github.io/astronomer-cosmos/getting_started/execution-modes.html#airflow-async-experimental This work has been inspired by the talk "Airflow at Monzo: Evolving our data platform as the bank scales" by @jonathanrainer @ed-sparkes given at Airflow Summit 2023: https://airflowsummit.org/sessions/2023/airflow-at-monzo-evolving-our-data-platform-as-the-bank-scales/. * Support using ``DatasetAlias`` and fix orphaning unreferenced dataset by @tatiana in #1217 #1240 Documentation: https://astronomer.github.io/astronomer-cosmos/configuration/scheduling.html#data-aware-scheduling * Add GCP_CLOUD_RUN_JOB execution mode by @ags-de #1153 Learn more about it: https://astronomer.github.io/astronomer-cosmos/getting_started/gcp-cloud-run-job.html Enhancements * Create single virtualenv when ``DbtVirtualenvBaseOperator`` has ``virtualenv_dir=None`` and ``is_virtualenv_dir_temporary=True`` by @kesompochy in #1200 * Consistently handle build and imports in ``cosmos/__init__.py`` by @tatiana in #1215 * Add enum constants to init for direct import by @fabiomx in #1184 Bug fixes * URL encode dataset names to support multibyte characters by @t0momi219 in #1198 * Fix invalid argument (``full_refresh``) passed to DbtTestAwsEksOperator (and others) by @johnhoran in #1175 * Fix ``printer_width`` arg type in ``DbtProfileConfigVars`` by @jessicaschueler in #1191 * Fix task owner fallback by @jmaicher in #1195 Docs * Add scarf to readme and docs for website analytics by @cmarteepants in #1221 * Add ``virtualenv_dir`` param to ``ExecutionConfig`` docs by @pankajkoti in #1173 * Give credits to @LennartKloppenburg in CHANGELOG.rst by @tatiana #1174 * Refactor docs for async mode execution by @pankajkoti in #1241 Others * Remove PR branch added for testing a change in CI in #1224 by @pankajkoti in #1233 * Fix CI wrt broken coverage upload artifact @pankajkoti in #1210 * Fix CI issues - Upgrade actions/upload-artifact & actions/download-artifact to v4 and set min version for packaging by @pankajkoti in #1208 * Resolve CI failures for Apache Airflow 2.7 jobs by @pankajkoti in #1182 * CI: Update GCP manifest file path based on new secret update by @pankajkoti in #1237 * Pre-commit hook updates in #1176 #1186, #1186, #1201, #1219, #1231
**New Features** * Support using ``DatasetAlias`` and fix orphaning unreferenced dataset by @tatiana in #1217 #1240 Documentation: https://astronomer.github.io/astronomer-cosmos/configuration/scheduling.html#data-aware-scheduling * Add GCP_CLOUD_RUN_JOB execution mode by @ags-de #1153 Learn more about it: https://astronomer.github.io/astronomer-cosmos/getting_started/gcp-cloud-run-job.html * Introduction of experimental support to run dbt BQ models using Airflow deferrable operators by @pankajkoti @pankajastro @tatiana in #1224 #1230. This is the first step in the journey of running dbt resources with native Airflow, and we would appreciate feedback from the community. For more information, check the documentation: https://astronomer.github.io/astronomer-cosmos/getting_started/execution-modes.html#airflow-async-experimental This work has been inspired by the talk "Airflow at Monzo: Evolving our data platform as the bank scales" by @jonathanrainer @ed-sparkes given at Airflow Summit 2023: https://airflowsummit.org/sessions/2023/airflow-at-monzo-evolving-our-data-platform-as-the-bank-scales/. **Enhancements** * Create single virtualenv when ``DbtVirtualenvBaseOperator`` has ``virtualenv_dir=None`` and ``is_virtualenv_dir_temporary=True`` by @kesompochy in #1200 * Consistently handle build and imports in ``cosmos/__init__.py`` by @tatiana in #1215 * Add enum constants to init for direct import by @fabiomx in #1184 **Bug fixes** * URL encode dataset names to support multibyte characters by @t0momi219 in #1198 * Fix invalid argument (``full_refresh``) passed to DbtTestAwsEksOperator (and others) by @johnhoran in #1175 * Fix ``printer_width`` arg type in ``DbtProfileConfigVars`` by @jessicaschueler in #1191 * Fix task owner fallback by @jmaicher in #1195 **Docs** * Add scarf to readme and docs for website analytics by @cmarteepants in #1221 * Add ``virtualenv_dir`` param to ``ExecutionConfig`` docs by @pankajkoti in #1173 * Give credits to @LennartKloppenburg in CHANGELOG.rst by @tatiana #1174 * Refactor docs for async mode execution by @pankajkoti in #1241 Others * Remove PR branch added for testing a change in CI in #1224 by @pankajkoti in #1233 * Fix CI wrt broken coverage upload artifact @pankajkoti in #1210 * Fix CI issues - Upgrade actions/upload-artifact & actions/download-artifact to v4 and set min version for packaging by @pankajkoti in #1208 * Resolve CI failures for Apache Airflow 2.7 jobs by @pankajkoti in #1182 * CI: Update GCP manifest file path based on new secret update by @pankajkoti in #1237 * Pre-commit hook updates in #1176 #1186, #1186, #1201, #1219, #1231 --------- Co-authored-by: Pankaj Koti <pankajkoti699@gmail.com>
Context
Cosmos versions between 1.1 and 1.6 supported automatically emitting Airflow Datasets when using
ExecutionMode.LOCAL
. Although the datasets generated by these versions of Cosmos could be utilised for dataset-aware scheduling, the implementation had long-standing issues, as described in #522.The main problems were:
These issues were caused by Cosmos defining task outlets and inlets during Airflow task execution, a feature only partially supported before Airflow 2.10: apache/airflow#34206.
Solution
Airflow 2.10 has introduced the concept of
DatasetAlias
, as described in the official docs, so operators can dynamically define inlets and outlets during task execution.This PR uses Airflow
DatasetAlias
, when possible (Airflow 2.10 or above), and does two things:DatasetAlias
to every LocalOperator/VirtualenvOperator subclassDataset
as outlets during theLocalOperator
subclasses execution, associating them to the desiredDataset
instance.DatasetAlias
names, programaticallyCaveats
This feature relies on
DatasetAlias
, only available in Airflow 2.10 and above. If users use previous versions of Airflow, Cosmos behaves like it did before, and the issues described in this task are not solved.airflow dags test
Although the feature described in this PR works well when scheduling DAGs, triggering them via the UI, or using
airflow dags trigger
, it does not work when users attempt to usedags test
ordag.test()
. When trying to test the DAG, these commands fail with ansqlalchemy.orm. etc.FlushError
. This is a known issue from Airflow 2.10.0 and Airflow 2.10.1 when declaring DatasetAliases, as described in apache/airflow#42495.To mitigate this second problem, we've introduced a new Airflow variable,
AIRFLOW__COSMOS__ENABLE_DATASET_ALIAS
, that allows users to disable using dataset aliases when running Cosmos. We'd recommend users who face thesqlalchemy.orm. etc.FlushError
in their tests to set this configuration toFalse
only for running tests - until the issue is solved in Airflow. When this configuration is set toFalse
, Cosmos behaves as before the DatasetAlias feature was introduced.How this feature was validated
Manually triggering this DAG
bq_profile_example
:And seeing that the dataset-scheduled DAG
dataset_triggered_dag
dependent on the first was triggered:Checking Airflow UI:
Related tickets
Closes: #522
Closes: #1119