Add docs related to DatasetAlias
tatiana committed Sep 30, 2024
1 parent 1a32630 commit 43c9ff0
Showing 2 changed files with 52 additions and 3 deletions.
8 changes: 6 additions & 2 deletions cosmos/operators/local.py
@@ -493,10 +493,14 @@ def register_dataset(self, new_inlets: list[Dataset], new_outlets: list[Dataset]
        Register a list of datasets as outlets of the current task, when possible.

        Until Airflow 2.7, there was not a better interface to associate outlets to a task during execution.
-       This works before Airflow 2.10 with a few limitations, as described in the ticket:
+       This works in Cosmos with versions before Airflow 2.10 with a few limitations, as described in the ticket:
        https://github.com/astronomer/astronomer-cosmos/issues/522
-       In Airflow 2.10.0 and 2.10.1, we are not able to test Airflow DAGs powered with DatasetAlias.
+       Since Airflow 2.10, Cosmos uses DatasetAlias by default to generate datasets. This resolves the limitations
+       described above.
+       The only limitation is that with Airflow 2.10.0 and 2.10.1, the `airflow dags test` command will not work
+       with DatasetAlias:
+       https://github.com/apache/airflow/issues/42495
        """
        if AIRFLOW_VERSION < Version("2.10") or not settings.enable_dataset_alias:
47 changes: 46 additions & 1 deletion docs/configuration/scheduling.rst
@@ -26,7 +26,7 @@ Data-Aware Scheduling

`Apache Airflow® <https://airflow.apache.org/>`_ 2.4 introduced the concept of `scheduling based on Datasets <https://airflow.apache.org/docs/apache-airflow/stable/authoring-and-scheduling/datasets.html>`_.

-By default, if Airflow 2.4 or higher is used, Cosmos emits `Airflow Datasets <https://airflow.apache.org/docs/apache-airflow/stable/concepts/datasets.html>`_ when running dbt projects. This allows you to use Airflow's data-aware scheduling capabilities to schedule your dbt projects. Cosmos emits datasets using the OpenLineage URI format, as detailed in the `OpenLineage Naming Convention <https://github.com/OpenLineage/OpenLineage/blob/main/spec/Naming.md>`_.
+By default, if using Airflow 2.4 or higher, Cosmos emits `Airflow Datasets <https://airflow.apache.org/docs/apache-airflow/stable/concepts/datasets.html>`_ when running dbt projects. This allows you to use Airflow's data-aware scheduling capabilities to schedule your dbt projects. Cosmos emits datasets using the OpenLineage URI format, as detailed in the `OpenLineage Naming Convention <https://github.com/OpenLineage/OpenLineage/blob/main/spec/Naming.md>`_.

Cosmos calculates these URIs during task execution using the `OpenLineage Integration Common <https://pypi.org/project/openlineage-integration-common/>`_ library. For example, a dbt model ``customers`` materialized in a local Postgres database is emitted as a dataset like ``postgres://0.0.0.0:5432/postgres.public.customers``.

@@ -62,3 +62,48 @@ Then, you can use Airflow's data-aware scheduling capabilities to schedule ``my_
)
In this scenario, ``project_one`` runs once a day and ``project_two`` runs immediately after ``project_one``. You can view these dependencies in Airflow's UI.
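
As a minimal sketch of this pattern (the dataset URI and DAG name below are illustrative; Cosmos derives the actual URIs from your dbt project and profile), a downstream DAG can subscribe to one of the datasets Cosmos emits:

.. code-block:: python

    from airflow.datasets import Dataset
    from airflow.decorators import dag, task

    # Hypothetical URI: Cosmos computes the real value from the dbt profile,
    # following the OpenLineage naming convention (scheme://host:port/db.schema.table).
    customers = Dataset("postgres://0.0.0.0:5432/postgres.public.customers")

    @dag(schedule=[customers], catchup=False)
    def downstream_of_customers():
        @task
        def notify():
            print("customers model was refreshed")

        notify()

    downstream_of_customers()

Any Airflow version from 2.4 onwards supports list-of-datasets schedules like this.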

Known Limitations
.................

Airflow 2.9 and below
_____________________

If using Cosmos with Airflow 2.9 or below, users will experience the following issues:

- The task inlets and outlets generated by Cosmos will not be seen in the Airflow UI
- The scheduler logs will contain many messages saying "Orphaning unreferenced dataset"

Example of scheduler logs:

.. code-block::

    scheduler | [2023-09-08T10:18:34.252+0100] {scheduler_job_runner.py:1742} INFO - Orphaning unreferenced dataset 'postgres://0.0.0.0:5432/postgres.public.stg_customers'
    scheduler | [2023-09-08T10:18:34.252+0100] {scheduler_job_runner.py:1742} INFO - Orphaning unreferenced dataset 'postgres://0.0.0.0:5432/postgres.public.stg_payments'
    scheduler | [2023-09-08T10:18:34.252+0100] {scheduler_job_runner.py:1742} INFO - Orphaning unreferenced dataset 'postgres://0.0.0.0:5432/postgres.public.stg_orders'
    scheduler | [2023-09-08T10:18:34.252+0100] {scheduler_job_runner.py:1742} INFO - Orphaning unreferenced dataset 'postgres://0.0.0.0:5432/postgres.public.customers'

References about the root cause of these issues:

- https://github.com/astronomer/astronomer-cosmos/issues/522
- https://github.com/apache/airflow/issues/34206


Airflow 2.10.0 and 2.10.1
_________________________

If using Cosmos with Airflow 2.10.0 or 2.10.1, the two issues previously described are resolved, since Cosmos uses ``DatasetAlias``
to support the dynamic creation of datasets during task execution. However, users may face ``sqlalchemy.orm.exc.FlushError``
errors if they attempt to run Cosmos-powered DAGs using ``airflow dags test`` with these versions.
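
For context, the following is a minimal sketch of the Airflow 2.10 ``DatasetAlias`` mechanism Cosmos relies on, not Cosmos's actual implementation (the alias, DAG, and dataset names are made up). An alias is declared statically on the task, and the concrete dataset behind it is only attached at runtime:

.. code-block:: python

    from airflow.datasets import Dataset, DatasetAlias
    from airflow.decorators import dag, task

    @dag(schedule=None, catchup=False)
    def alias_example():
        # The alias is known at parse time; which dataset it resolves to
        # is decided inside the task, during execution.
        @task(outlets=[DatasetAlias("my_dbt_outputs")])
        def run_model(*, outlet_events):
            outlet_events[DatasetAlias("my_dbt_outputs")].add(
                Dataset("postgres://0.0.0.0:5432/postgres.public.customers")
            )

        run_model()

    alias_example()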

We've reported this issue and it will be resolved in future versions of Airflow:

- https://github.com/apache/airflow/issues/42495

To overcome this limitation in local tests until the Airflow community resolves the issue, we introduced the configuration
``AIRFLOW__COSMOS__ENABLE_DATASET_ALIAS``, which is ``True`` by default. Users who want to run ``airflow dags test`` without hitting
``sqlalchemy.orm.exc.FlushError`` can set this configuration to ``False``. It can also be set in the ``airflow.cfg`` file:

.. code-block::

    [cosmos]
    enable_dataset_alias = False
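
For example, the flag can be passed as an environment variable for a single local test run (the DAG id and date below are placeholders):

.. code-block::

    AIRFLOW__COSMOS__ENABLE_DATASET_ALIAS=False airflow dags test my_cosmos_dag 2024-09-30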
