Add descriptions (#503)
Co-authored-by: Razvan-Daniel Mihai <84674+razvan@users.noreply.github.com>
fhennig and razvan authored Sep 13, 2024
1 parent 2f1e91a commit aaf61ca
Showing 15 changed files with 64 additions and 25 deletions.
@@ -1,4 +1,5 @@
= First steps
:description: Set up an Apache Airflow cluster using Stackable Operator, PostgreSQL, and Redis. Run and monitor example workflows (DAGs) via the web UI or command line.

Once you have followed the steps in the xref:getting_started/installation.adoc[] section to install the Operator and its dependencies, you can now deploy an Airflow cluster and its dependencies. Afterwards you can <<_verify_that_it_works, verify that it works>> by running and tracking an example DAG.

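For orientation, a minimal `AirflowCluster` resource of the kind deployed in this guide could look like the following sketch.
The resource name, product version and the name of the credentials Secret are assumptions for illustration; the linked getting-started pages contain the authoritative manifests.

[source,yaml]
----
apiVersion: airflow.stackable.tech/v1alpha1
kind: AirflowCluster
metadata:
  name: airflow  # hypothetical name
spec:
  image:
    productVersion: 2.9.2  # assumed version, use one listed as supported
  clusterConfig:
    loadExamples: true  # ship the Airflow example DAGs
    credentialsSecret: airflow-credentials  # Secret holding the admin user and DB/Redis connection strings
  webservers:
    roleGroups:
      default:
        replicas: 1
  celeryExecutors:
    roleGroups:
      default:
        replicas: 2
  schedulers:
    roleGroups:
      default:
        replicas: 1
----
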
4 changes: 3 additions & 1 deletion docs/modules/airflow/pages/getting_started/index.adoc
@@ -1,6 +1,8 @@
= Getting started
:description: Get started with the Stackable Operator for Apache Airflow by installing the operator, SQL database, and Redis, then setting up and running your first DAG.

This guide will get you started with Airflow using the Stackable Operator.
It will guide you through the installation of the Operator as well as an SQL database and Redis instance for trial usage, setting up your first Airflow cluster and connecting to it, and viewing and running one of the example workflows (called DAGs, Directed Acyclic Graphs).

== Prerequisites for this guide

@@ -1,4 +1,5 @@
= Installation
:description: Install the Stackable operator for Apache Airflow with PostgreSQL, Redis, and required components using Helm or stackablectl.

On this page you will install the Stackable Airflow Operator, the software that Airflow depends on - PostgreSQL and Redis - as well as the commons, secret and listener operators, which are required by all Stackable Operators.

4 changes: 2 additions & 2 deletions docs/modules/airflow/pages/index.adoc
@@ -1,6 +1,6 @@
= Stackable Operator for Apache Airflow
:description: The Stackable Operator for Apache Airflow manages Airflow clusters on Kubernetes, supporting custom workflows, executors, and external databases for efficient orchestration.
:keywords: Stackable Operator, Apache Airflow, Kubernetes, k8s, operator, job pipeline, scheduler, ETL
:airflow: https://airflow.apache.org/
:dags: https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/dags.html
:k8s-crs: https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/
4 changes: 3 additions & 1 deletion docs/modules/airflow/pages/required-external-components.adoc
@@ -1,6 +1,8 @@
= Required external components
:description: Airflow requires PostgreSQL, MySQL, or SQLite for database support, and Redis for Celery executors. MSSQL has experimental support.

Airflow requires an SQL database to operate.
The https://airflow.apache.org/docs/apache-airflow/stable/installation/prerequisites.html[Airflow documentation] specifies:

Fully supported for production usage:

@@ -1,6 +1,10 @@
= Applying Custom Resources
:description: Learn to apply custom resources in Airflow, such as Spark jobs, using Kubernetes connections, roles, and modular DAGs with git-sync integration.

Airflow can be used to apply custom resources from within a cluster.
An example of this could be a SparkApplication job that is to be triggered by Airflow.
The steps below describe how this can be done.
The DAG will consist of modularized Python files and will be provisioned using the git-sync facility.

== Define an in-cluster Kubernetes connection

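One way to provide such a connection is as an Airflow connection environment variable.
The sketch below assumes the `envOverrides` mechanism described in xref:usage-guide/overrides.adoc[]; the connection id and its extras are illustrative only.

[source,yaml]
----
spec:
  webservers:
    envOverrides:
      # Airflow reads connections from AIRFLOW_CONN_<CONN_ID> environment variables;
      # the JSON form used here requires Airflow 2.3 or newer, and "in_cluster" makes
      # the Kubernetes hook use the pod's own service account.
      AIRFLOW_CONN_KUBERNETES_IN_CLUSTER: '{"conn_type": "kubernetes", "extra": {"in_cluster": true}}'
  # the same override would normally be repeated for the schedulers and celeryExecutors roles
----
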
@@ -38,7 +42,9 @@ include::example$example-airflow-spark-clusterrolebinding.yaml[]

== DAG code

Now for the DAG itself.
The job to be started is a modularized DAG that starts a one-off Spark job that calculates the value of pi.
The file structure fetched to the root git-sync folder looks like this:

----
dags
@@ -57,12 +63,15 @@ The Spark job will calculate the value of pi using one of the example scripts th
include::example$example-pyspark-pi.yaml[]
----

This will be called from within a DAG by using the connection that was defined earlier.
It will be wrapped by the `KubernetesHook` that the Airflow Kubernetes provider makes available https://github.com/apache/airflow/blob/main/airflow/providers/cncf/kubernetes/operators/spark_kubernetes.py[here].
There are two classes that are used to:

* start the job
* monitor the status of the job

The classes `SparkKubernetesOperator` and `SparkKubernetesSensor` are located in two different Python modules as they will typically be used for all custom resources and thus are best decoupled from the DAG that calls them.
This also demonstrates that modularized DAGs can be used for Airflow jobs as long as all dependencies exist in or below the root folder pulled by git-sync.

[source,python]
----
@@ -100,6 +109,7 @@ TIP: A full example of the above is used as an integration test https://github.c

== Logging

As mentioned above, the logs are available from the webserver UI if the jobs run with the `celeryExecutor`.
If the SDP logging mechanism has been deployed, log information can also be retrieved from the Vector backend (e.g. OpenSearch):

image::airflow_dag_log_opensearch.png[Opensearch]
3 changes: 3 additions & 0 deletions docs/modules/airflow/pages/usage-guide/index.adoc
@@ -1 +1,4 @@
= Usage guide
:description: Practical instructions to make the most out of the Stackable operator for Apache Airflow.

Practical instructions to make the most out of the Stackable operator for Apache Airflow.
7 changes: 5 additions & 2 deletions docs/modules/airflow/pages/usage-guide/listenerclass.adoc
@@ -1,8 +1,11 @@
= Service exposition with ListenerClasses
:description: Configure Airflow service exposure with ListenerClasses: cluster-internal, external-unstable, or external-stable.

Airflow offers a web UI and an API, both of which are exposed by the webserver process under the `webserver` role.
The Operator deploys a service called `<name>-webserver` (where `<name>` is the name of the AirflowCluster) through which Airflow can be reached.

This service can have three different types: `cluster-internal`, `external-unstable` and `external-stable`.
Read more about the types in the xref:concepts:service-exposition.adoc[service exposition] documentation at platform level.

This is how the listener class is configured:
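
The following is a sketch only; it assumes the setting lives under `spec.clusterConfig.listenerClass`, which may differ between operator versions, so check the CRD reference.

[source,yaml]
----
spec:
  clusterConfig:
    # one of: cluster-internal (default), external-unstable, external-stable
    listenerClass: external-unstable
----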

1 change: 1 addition & 0 deletions docs/modules/airflow/pages/usage-guide/logging.adoc
@@ -1,4 +1,5 @@
= Log aggregation
:description: Forward Airflow logs to a Vector aggregator by configuring the ConfigMap and enabling the log agent.

The logs can be forwarded to a Vector log aggregator by providing a discovery ConfigMap for the aggregator and by enabling the log agent:
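
The sketch below illustrates this; the ConfigMap name is a placeholder and the field names follow the common Stackable logging convention, so verify them against the Airflow CRD reference.

[source,yaml]
----
spec:
  clusterConfig:
    vectorAggregatorConfigMapName: vector-aggregator-discovery  # discovery ConfigMap of the aggregator
  webservers:
    config:
      logging:
        enableVectorAgent: true  # run the Vector log agent for this role
----
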
5 changes: 3 additions & 2 deletions docs/modules/airflow/pages/usage-guide/monitoring.adoc
@@ -1,4 +1,5 @@
= Monitoring
:description: Airflow instances export Prometheus metrics for monitoring.

The managed Airflow instances are automatically configured to export Prometheus metrics.
See xref:operators:monitoring.adoc[] for more details.
20 changes: 15 additions & 5 deletions docs/modules/airflow/pages/usage-guide/mounting-dags.adoc
@@ -1,6 +1,8 @@
= Mounting DAGs
:description: Mount DAGs in Airflow via ConfigMap for single DAGs or use git-sync for multiple DAGs. git-sync pulls from a Git repo and handles updates automatically.

DAGs can be mounted by using a `ConfigMap` or `git-sync`.
This is best illustrated with an example of each, shown in the sections below.

== via `ConfigMap`

@@ -23,13 +25,18 @@ include::example$example-airflow-dags-configmap.yaml[]

WARNING: If a DAG mounted via ConfigMap consists of modularized files, then using the standard location is mandatory, as Python will use this as a "root" folder when looking for referenced files.

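For illustration, such a ConfigMap might look like the sketch below; the names and the trivial DAG are made up, and the referenced example file remains authoritative.

[source,yaml]
----
apiVersion: v1
kind: ConfigMap
metadata:
  name: airflow-dags  # hypothetical name, referenced from the AirflowCluster definition
data:
  hello_dag.py: |
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    # A single-file DAG; modularized DAGs must keep the standard DAGs folder as their root.
    with DAG(dag_id="hello_dag", start_date=datetime(2024, 1, 1), schedule=None, catchup=False) as dag:
        BashOperator(task_id="hello", bash_command="echo hello")
----
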
The advantage of this approach is that a DAG can be provided "in-line", as it were.
This becomes cumbersome when multiple DAGs are to be made available in this way, as each one has to be mapped individually.
For multiple DAGs it is probably easier to expose them all via a mounted volume, which is shown below.

== via `git-sync`

=== Overview

https://github.com/kubernetes/git-sync/tree/v4.2.1[git-sync] is a command that pulls a git repository into a local directory and is supplied as a sidecar container for use within Kubernetes.
The Stackable implementation is a wrapper around this such that the binary and image requirements are included in the Stackable Airflow product images and do not need to be specified or handled in the `AirflowCluster` custom resource.
Internal details such as image names and volume mounts are handled by the operator, so that only the repository and synchronization details are required.
An example of this usage is given in the next section.

=== Example

@@ -51,6 +58,9 @@ include::example$example-airflow-gitsync.yaml[]
<11> Git-sync settings can be provided inline, although some of these (`--dest`, `--root`) are specified internally in the operator and will be ignored if provided by the user. Git-config settings can also be specified, although a warning will be logged if `safe.directory` is specified, as this is defined internally and should not be defined by the user.


IMPORTANT: The example above shows a _list_ of git-sync definitions, with a single element.
This is to avoid breaking changes in future releases.
Currently, only one such git-sync definition is considered and processed.
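
For orientation, a single-element list could look like the sketch below; the field names are assumed from the structure described above and from the referenced example, so consult the CRD reference before relying on them.

[source,yaml]
----
spec:
  clusterConfig:
    dagsGitSync:  # a list, of which currently only the first entry is processed
      - repo: https://github.com/example-org/airflow-dags  # hypothetical repository
        branch: main
        gitFolder: dags  # sub-folder of the repository containing the DAGs
        depth: 1
        wait: 20s
        credentialsSecret: git-credentials  # optional, for private repositories
        gitSyncConf:
          --rev: HEAD  # additional git-sync flags; --dest and --root are set by the operator
----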

NOTE: git-sync can be used with DAGs that make use of Python modules, as Python will be configured to use the git-sync target folder as the "root" location when looking for referenced files.
See the xref:usage-guide/applying-custom-resources.adoc[] example for more details.
6 changes: 3 additions & 3 deletions docs/modules/airflow/pages/usage-guide/overrides.adoc
@@ -1,10 +1,10 @@

= Configuration & Environment Overrides
:description: Airflow supports configuration and environment variable overrides per role or role group, with role group settings taking precedence. Be cautious with overrides.

The cluster definition also supports overriding configuration properties and environment variables, either per role or per role group, where the more specific override (role group) has precedence over the less specific one (role).

IMPORTANT: Overriding certain properties which are set by the operator (such as the HTTP port) can interfere with the operator and can lead to problems.
Additionally, for Airflow it is recommended that each component has the same configuration: not all components use each setting, but some things - such as external endpoints - need to be consistent for things to work as expected.
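
As a sketch of the mechanism (the chosen setting is only an example; any Airflow setting can be addressed through its `AIRFLOW__<SECTION>__<KEY>` environment variable):

[source,yaml]
----
spec:
  webservers:
    envOverrides:
      AIRFLOW__WEBSERVER__AUTO_REFRESH_INTERVAL: "10"  # applies to every role group of the role
    roleGroups:
      default:
        envOverrides:
          AIRFLOW__WEBSERVER__AUTO_REFRESH_INTERVAL: "5"  # the role-group value takes precedence
----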

== Configuration Properties

@@ -13,7 +13,7 @@ Airflow exposes an environment variable for every Airflow configuration setting,
As Airflow can be configured with python code too, arbitrary code can be added to the `webserver_config.py`.
You can use either `EXPERIMENTAL_FILE_HEADER` to add code to the top or `EXPERIMENTAL_FILE_FOOTER` to add to the bottom.

IMPORTANT: This is an experimental feature.

[source,yaml]
----
7 changes: 5 additions & 2 deletions docs/modules/airflow/pages/usage-guide/security.adoc
@@ -1,18 +1,21 @@
= Security
:description: Airflow supports authentication via Web UI or LDAP, with role-based access control managed by Flask AppBuilder, and LDAP users assigned default roles.

== Authentication

Every user has to authenticate themselves before using Airflow, and there are several ways of doing this.

=== Webinterface

The default setting is to view and manually set up users via the Webserver UI.
Note the blue "+" button where users can be added directly:

image::airflow_security.png[Airflow Security menu]

=== LDAP

Airflow supports xref:concepts:authentication.adoc[authentication] of users against an LDAP server.
This requires setting up an AuthenticationClass for the LDAP server.
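
An AuthenticationClass for an LDAP server might look roughly like the sketch below; hostname, port, search base and SecretClass are placeholders, and the authentication concepts page linked above documents the full set of fields.

[source,yaml]
----
apiVersion: authentication.stackable.tech/v1alpha1
kind: AuthenticationClass
metadata:
  name: ldap  # referenced by name from the AirflowCluster
spec:
  provider:
    ldap:
      hostname: openldap.default.svc.cluster.local  # placeholder LDAP server
      port: 1389
      searchBase: ou=users,dc=example,dc=org
      bindCredentials:
        secretClass: airflow-ldap-bind  # SecretClass that provides the bind user and password
----
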
The AuthenticationClass is then referenced in the AirflowCluster resource as follows:

[source,yaml]
@@ -1,4 +1,5 @@
= Resource Requests
:description: Find out about minimal HA Airflow requirements for CPU and memory, with defaults for schedulers, Celery executors, and webservers using Kubernetes resource limits.

include::home:concepts:stackable_resource_requests.adoc[]

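As a hedged illustration of the common Stackable pattern, per-role resources might be set as follows (the values are placeholders, not recommendations):

[source,yaml]
----
spec:
  celeryExecutors:
    config:
      resources:
        cpu:
          min: 500m
          max: "2"
        memory:
          limit: 2Gi
----
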
@@ -1,4 +1,5 @@
= Using Kubernetes executors
:description: Configure Kubernetes executors in Airflow to dynamically create pods for tasks, replacing Celery executors and bypassing Redis for job routing.

Instead of using the Celery workers you can let Airflow run the tasks using Kubernetes executors, where pods are created dynamically as needed without jobs being routed through a Redis queue to the workers.

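A sketch of how this might be selected in the cluster definition follows; it assumes the executor is chosen by using a `kubernetesExecutors` key in place of `celeryExecutors`, so check the CRD reference for the exact field names.

[source,yaml]
----
spec:
  # no celeryExecutors section (and no Redis required): executor pods are created on demand
  kubernetesExecutors:
    config:
      resources:
        memory:
          limit: 1Gi  # applied to the dynamically created executor pods
----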
