more improvements
fhennig committed Sep 24, 2024
1 parent fe4b6a8 commit f228abb
Showing 9 changed files with 32 additions and 30 deletions.
13 changes: 8 additions & 5 deletions docs/modules/airflow/pages/getting_started/first_steps.adoc
@@ -28,7 +28,7 @@ It should be a long random string of bytes.

`connections.sqlalchemyDatabaseUri` must contain the connection string to the SQL database storing the Airflow metadata.

`connections.celeryResultBackend` must contain the connection string to the SQL database storing the job metadata (in the example above we are using the same postgresql database for both).
`connections.celeryResultBackend` must contain the connection string to the SQL database storing the job metadata (the example above uses the same PostgreSQL database for both).

`connections.celeryBrokerUrl` must contain the connection string to the Redis instance used for queuing the jobs submitted to the airflow executor(s).
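
For orientation, the Secret referenced by these keys might look roughly like the following. This is a minimal sketch only: the Secret name, hostnames and credentials are placeholder assumptions, and only the connection-related keys discussed above are shown.

[source,yaml]
----
apiVersion: v1
kind: Secret
metadata:
  name: simple-airflow-credentials  # placeholder name, not taken from the guide
type: Opaque
stringData:
  connections.secretKey: thisISaSECRET_1234  # a long random string of bytes
  connections.sqlalchemyDatabaseUri: postgresql+psycopg2://airflow:airflow@airflow-postgresql/airflow
  connections.celeryResultBackend: db+postgresql://airflow:airflow@airflow-postgresql/airflow
  connections.celeryBrokerUrl: redis://:airflow@airflow-redis-master:6379/0
----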

@@ -69,7 +69,9 @@ It is set to `true` here as the example DAGs are used when verifying the install
* the `spec.clusterConfig.exposeConfig` key is optional and defaults to `false`. It is set to `true` only as an aid to verify the configuration and should never be used as such in anything other than test or demo clusters.
* the previously created secret must be referenced in `spec.clusterConfig.credentialsSecret`.

NOTE: Please note that the version you need to specify for `spec.image.productVersion` is the desired version of Apache Airflow. You can optionally specify the `spec.image.stackableVersion` to a certain release like `23.11.0` but it is recommended to leave it out and use the default provided by the operator. For a list of available versions please check our https://repo.stackable.tech/#browse/browse:docker:v2%2Fstackable%airflow%2Ftags[image registry].
NOTE: The version you need to specify for `spec.image.productVersion` is the desired version of Apache Airflow.
You can optionally pin `spec.image.stackableVersion` to a specific release like `23.11.0`, but it is recommended to leave it out and use the default provided by the operator.
Check our https://repo.stackable.tech/#browse/browse:docker:v2%2Fstackable%2Fairflow%2Ftags[image registry] for a list of available versions.
It should generally be safe to simply use the latest version that is available.

This creates the actual Airflow cluster.
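
The AirflowCluster resource itself is pulled in from the example files and is not visible in this diff. As a rough orientation only, such a resource might look like the sketch below; the API version, product version and Secret name are assumptions and should be taken from the actual example, not from here.

[source,yaml]
----
apiVersion: airflow.stackable.tech/v1alpha1   # assumed API group/version
kind: AirflowCluster
metadata:
  name: airflow
spec:
  image:
    productVersion: 2.9.2          # the desired Apache Airflow version (illustrative)
    # stackableVersion: 23.11.0    # optional; usually better left unset
  clusterConfig:
    loadExamples: true             # ships the example DAGs used for verification
    exposeConfig: true             # only for test or demo clusters
    credentialsSecret: simple-airflow-credentials  # must reference the previously created Secret
----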
@@ -124,15 +126,16 @@ If you prefer to interact directly with the API instead of using the web interface
[source,bash]
include::example$getting_started/code/getting_started.sh[tag=enable-dag]
A DAG can then be triggered by providing the DAG name (in this case, `example_trigger_target_dag`). The response identifies the DAG identifier, which we can parse out of the JSON like this:
A DAG can then be triggered by providing the DAG name (in this case, `example_trigger_target_dag`).
The response contains the DAG run identifier, which can be parsed out of the JSON like this:
[source,bash]
include::example$getting_started/code/getting_started.sh[tag=run-dag]
If we read this identifier into a variable such as `dag_id` (or replace it manually in the command) we can run this command to access the status of the DAG run:
If this identifier is stored in a variable such as `dag_id` (or replaced manually in the command), you can run the following command to access the status of the DAG run:
[source,bash]
include::example$getting_started/code/getting_started.sh[tag=check-dag]
====
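
The commands pulled in above come from the getting-started script and are not visible in this diff. As a rough sketch of the same three steps against the Airflow 2 stable REST API (the host, port and credentials below are assumptions; adjust them to your setup):

[source,bash]
----
# Assumed webserver address and admin credentials; adjust to your setup.
airflow_url=http://localhost:8080
auth="admin:admin"

# Enable (unpause) the example DAG.
curl -s --user "$auth" -H 'Content-Type: application/json' \
  -X PATCH "$airflow_url/api/v1/dags/example_trigger_target_dag?update_mask=is_paused" \
  -d '{"is_paused": false}'

# Trigger a DAG run and parse its identifier out of the JSON response.
dag_id=$(curl -s --user "$auth" -H 'Content-Type: application/json' \
  -X POST "$airflow_url/api/v1/dags/example_trigger_target_dag/dagRuns" \
  -d '{}' | jq -r '.dag_run_id')

# Check the status of that DAG run.
curl -s --user "$auth" \
  "$airflow_url/api/v1/dags/example_trigger_target_dag/dagRuns/$dag_id" | jq -r '.state'
----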

== What's next

Look at the xref:usage-guide/index.adoc[] to find out more about configuring your Airflow cluster and loading your own DAG files.
Look at the xref:usage-guide/index.adoc[] to find out more about configuring your Airflow Stacklet and loading your own DAG files.
2 changes: 1 addition & 1 deletion docs/modules/airflow/pages/getting_started/index.adoc
Expand Up @@ -12,7 +12,7 @@ You need:
* kubectl
* Helm

Resource sizing depends on cluster type(s), usage and scope, but as a starting point we recommend a minimum of the following resources for this operator:
Resource sizing depends on cluster type(s), usage and scope, but as a minimum starting point the following resources are recommended for this operator:

include::partial$hardware-requirements.adoc[]

4 changes: 2 additions & 2 deletions docs/modules/airflow/pages/getting_started/installation.adoc
@@ -7,7 +7,7 @@ Install the Stackable Airflow operator, the software that Airflow depends on --
== Required external components: PostgreSQL and Redis

PostgreSQL is required by Airflow to store metadata about DAG runs, and Redis is required by the Celery executor to schedule and/or queue DAG jobs.
They are components that may well already be available for customers, in which case we treat them here as pre-requisites for an Airflow cluster and hence as part of the installation process.
They are components that may well already be available for customers, in which case they are treated as prerequisites for an Airflow cluster and hence as part of the installation process.
Install these components using Helm.
Note that specific versions are declared:

@@ -26,7 +26,7 @@ include::example$getting_started/code/getting_started.sh[tag=helm-add-bitnami-re

WARNING: Do not use this setup in production!
Supported databases and versions are listed on the xref:required-external-components.adoc[required external components] page for this operator.
Please follow the instructions of those components for a production setup.
Follow the instructions of those components for a production setup.

== Stackable operators

3 changes: 2 additions & 1 deletion docs/modules/airflow/pages/required-external-components.adoc
@@ -1,8 +1,9 @@
= Required external components
:description: Airflow requires PostgreSQL, MySQL, or SQLite for database support, and Redis for Celery executors. MSSQL has experimental support.
:airflow-prerequisites: https://airflow.apache.org/docs/apache-airflow/stable/installation/prerequisites.html

Airflow requires an SQL database to operate.
The https://airflow.apache.org/docs/apache-airflow/stable/installation/prerequisites.html[Airflow documentation] specifies:
The {airflow-prerequisites}[Airflow documentation] specifies:

Fully supported for production usage:

@@ -1,18 +1,19 @@
= Applying Custom Resources
:description: Learn to apply custom resources in Airflow, such as Spark jobs, using Kubernetes connections, roles, and modular DAGs with git-sync integration.
:airflow-managing-connections: https://airflow.apache.org/docs/apache-airflow/stable/howto/connection.html

Airflow can be used to apply custom resources from within a cluster.
An example of this could be a SparkApplication job that is to be triggered by Airflow.
The steps below describe how this can be done.
The DAG consists of modularized Python files and is provisioned using the git-sync facility.
Airflow can apply custom resources from within a cluster, such as triggering a Spark job by applying a SparkApplication resource.
The steps below outline this process.
The DAG consists of modularized Python files and is provisioned using the git-sync feature.

== Define an in-cluster Kubernetes connection

To start a Spark job, Airflow needs to be able to communicate with Kubernetes and an in-cluster connection is required for this, which can be created from within the Webserver UI (note that the "in cluster configuration" box is ticked):
To start a Spark job, Airflow must communicate with Kubernetes, requiring an in-cluster connection.
This can be created through the Webserver UI by enabling the "in cluster configuration" setting:

image::airflow_connection_ui.png[Airflow Connections]
image::airflow_connection_ui.png[A screenshot of the 'Edit connection' window with the 'in cluster configuration' tick box ticked]

Alternatively, the connection can be https://airflow.apache.org/docs/apache-airflow/stable/howto/connection.html[defined] by an environment variable in URI format:
Alternatively, the connection can be {airflow-managing-connections}[defined] using an environment variable in URI format:

[source]
AIRFLOW_CONN_KUBERNETES_IN_CLUSTER: "kubernetes://?__extra__=%7B%22extra__kubernetes__in_cluster%22%3A+true%2C+%22extra__kubernetes__kube_config%22%3A+%22%22%2C+%22extra__kubernetes__kube_config_path%22%3A+%22%22%2C+%22extra__kubernetes__namespace%22%3A+%22%22%7D"
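
One way to provide this environment variable to the Airflow roles is via the platform's override mechanism. The snippet below is only a sketch and assumes that `envOverrides` is supported on the affected roles; check the operator's override documentation for the exact mechanism.

[source,yaml]
----
# Sketch only: the role names follow the AirflowCluster spec, but the
# envOverrides placement is an assumption. The URI value is the one shown above.
spec:
  webservers:
    envOverrides:
      AIRFLOW_CONN_KUBERNETES_IN_CLUSTER: "kubernetes://?__extra__=..."
  celeryExecutors:
    envOverrides:
      AIRFLOW_CONN_KUBERNETES_IN_CLUSTER: "kubernetes://?__extra__=..."
----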
@@ -42,9 +43,8 @@ include::example$example-airflow-spark-clusterrolebinding.yaml[]

== DAG code

Now for the DAG itself.
The job to be started is a modularized DAG that uses starts a one-off Spark job that calculates the value of pi.
The file structure fetched to the root git-sync folder looks like this:
For the DAG itself, the job is a modularized DAG that starts a one-off Spark job to calculate the value of pi.
The file structure, fetched to the root git-sync folder, looks like this:

----
dags
6 changes: 2 additions & 4 deletions docs/modules/airflow/pages/usage-guide/logging.adoc
@@ -1,8 +1,7 @@
= Log aggregation
:description: Forward Airflow logs to a Vector aggregator by configuring the ConfigMap and enabling the log agent.

The logs can be forwarded to a Vector log aggregator by providing a discovery
ConfigMap for the aggregator and by enabling the log agent:
The logs can be forwarded to a Vector log aggregator by providing a discovery ConfigMap for the aggregator and by enabling the log agent:

[source,yaml]
----
@@ -38,5 +37,4 @@ spec:
level: INFO
----

Further information on how to configure logging, can be found in
xref:concepts:logging.adoc[].
Further information on how to configure logging can be found in xref:concepts:logging.adoc[].
@@ -2,4 +2,4 @@

This section of the documentation is intended for the operations teams that maintain a Stackable Data Platform installation.

Please read the xref:concepts:operations/index.adoc[Concepts page on Operations] that contains the necessary details to operate the platform in a production environment.
Read the xref:concepts:operations/index.adoc[concepts page on operations] that contains the necessary details to operate the platform in a production environment.
@@ -2,14 +2,14 @@

You can configure the permitted Pod disruptions for Airflow nodes as described in xref:concepts:operations/pod_disruptions.adoc[].

Unless you configure something else or disable our PodDisruptionBudgets (PDBs), we write the following PDBs:
Unless you configure something else or disable the default PodDisruptionBudgets (PDBs), the operator writes the following PDBs:

== Schedulers
We only allow a single scheduler to be offline at any given time, regardless of the number of replicas or `roleGroups`.
Only a single scheduler is allowed to be offline at any given time, regardless of the number of replicas or `roleGroups`.

== Webservers
We only allow a single webserver to be offline at any given time, regardless of the number of replicas or `roleGroups`.
Only a single webserver is allowed to be offline at any given time, regardless of the number of replicas or `roleGroups`.

== Executors
* In the case of Celery executors, we only allow a single executor to be offline at any given time, regardless of the number of replicas or `roleGroups`.
* In the case of Kubernetes executors, we don't deploy any PDB, as it's Airflows responsibility to take care of the executor Pods.
* In the case of Celery executors, only a single executor is allowed to be offline at any given time, regardless of the number of replicas or `roleGroups`.
* In the case of Kubernetes executors, no PDB is deployed, as it is Airflow's responsibility to take care of the executor Pods.
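
For illustration, the PDB written for the scheduler role might look roughly like the one below. The name and label selector are assumptions about what the operator generates; the relevant point is that `maxUnavailable` is set to 1.

[source,yaml]
----
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: airflow-scheduler            # hypothetical name
spec:
  maxUnavailable: 1                  # at most one scheduler Pod may be disrupted
  selector:
    matchLabels:
      app.kubernetes.io/name: airflow
      app.kubernetes.io/instance: airflow      # the AirflowCluster name
      app.kubernetes.io/component: scheduler   # assumed role label
----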
@@ -1,7 +1,7 @@
= Using Kubernetes executors
:description: Configure Kubernetes executors in Airflow to dynamically create pods for tasks, replacing Celery executors and bypassing Redis for job routing.

Instead of using the Celery workers you can let Airflow run the tasks using Kubernetes executors, where pods are created dynamically as needed without jobs being routed through a redis queue to the workers.
Instead of using the Celery workers you can let Airflow run the tasks using Kubernetes executors, where Pods are created dynamically as needed without jobs being routed through a Redis queue to the workers.

To achieve this, swap `spec.celeryExecutors` with `spec.kubernetesExecutors`.
E.g. you would change the following example
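
The full example is truncated in this view. As a rough, illustrative sketch of the kind of change involved (replica counts are arbitrary and the exact field layout should be taken from the operator's CRD reference):

[source,yaml]
----
# Before (illustrative): Celery executors with a fixed number of workers.
spec:
  celeryExecutors:
    roleGroups:
      default:
        replicas: 2

# After (illustrative): Kubernetes executors, no static worker replicas.
spec:
  kubernetesExecutors: {}
----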
