ea_airflow_util contains additional Airflow functionality used within EDU that falls outside the scope of edu_edfi_airflow.
Various Airflow callables have been defined in bespoke submodules under ea_airflow_util.callables. These are outlined below.
Airflow utility helpers used for argument-passing and parameter-checking in DAGs
When importing this submodule, be careful not to overwrite airflow in your namespace!
# Do not do this! This will overwrite `import airflow`.
from ea_airflow_util.callables import airflow
# Use one of these instead!
from ea_airflow_util.callables import airflow as airflow_util
from ea_airflow_util.callables.airflow import xcom_pull_template
Build an xcom_pull string for passing arguments between tasks. Either a task ID or a task operator can be passed. The default return key, return_value, is the final return value of the operator.
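For illustration, the rendered template string can be passed into a downstream operator's templated fields (a sketch; the exact signature of xcom_pull_template is assumed):
from ea_airflow_util.callables.airflow import xcom_pull_template

# Returns a Jinja template string along the lines of
# "{{ ti.xcom_pull(task_ids='pull_from_sftp', key='return_value') }}",
# which Airflow renders at runtime when passed into a templated field.
local_path = xcom_pull_template("pull_from_sftp")  # a task ID or task operator (exact usage assumed)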
Verify whether a value is defined in a passed parameter list, and raise an AirflowSkipException otherwise.
Raise an error if the parameter is not defined.
String-casing callables (i.e., snake_casing or CamelCasing)
Convert a string to string_case.
Convert the keys of a JSON record into snake_case. Raise an error if a name-collision occurs after formatting.
FTP- and SFTP-utility helpers
Download all files from an FTP to disk, optionally filtering on file-extension endings.
Google-sheets authentication and parsing helpers
Create a Google Sheets client populated with key data in an Airflow connection. The key data can be saved in a separate, linked file, or as a JSON structure in the connection. The Airflow connection key field can be specified; otherwise, both will be tried.
Call the Google Sheets API and retrieve a Spreadsheet based on a given URL. If the API rate limit has been reached, retry using a truncated exponential backoff strategy.
Parse a gspread worksheet and retrieve the relevant data.
Parse a Google spreadsheet and return a specific worksheet by index or name. If neither is specified, retrieve the zeroth worksheet.
Unified method for retrieving data from a Google survey and writing to disk as JSON lines.
JSON utility helpers. Most Airflow tasks write data to disk and database as JSON lines.
Write an iterator of dictionaries to an output path as JSON lines. Optional arguments customize the output.
Transform a CSV file to JSON lines. If output_path is not specified, rewrite the CSV file with a .jsonl file extension. Optional arguments customize the output.
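For reference, "JSON lines" means one serialized JSON object per line. A minimal sketch using only the standard library (not the package's own callables):
import json

records = [{"student_id": 1, "grade": 9}, {"student_id": 2, "grade": 10}]

# Write one JSON object per line, matching the format the callables above produce.
with open("students.jsonl", "w") as fp:
    for record in records:
        fp.write(json.dumps(record) + "\n")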
Helpers for getting data to S3
Note: There are some SQL/Snowflake callables that interface with S3.
Upload local files to S3. Optional arguments apply schema-checking and path-mutation.
Internal utility function for listing S3 keys. Note: this method uses a pre-instantiated S3 hook instead of a connection ID.
Helpers for interfacing with ShareFile
List object names in a specified ShareFile directory.
sharefile_to_disk(sharefile_conn_id, sharefile_path, local_path, ds_nodash, ts_nodash, delete_remote=False, file_pattern=None)
Transfer all files from a ShareFile folder to a local date-stamped directory, optionally deleting the remote copy.
Post a file or the contents of a directory to the specified ShareFile folder.
Copy a single file from S3 to ShareFile.
This package contains several callback functions which can be used with Slack webhooks to alert at task failures or successes, or when SLAs are missed. Each function takes the Slack Airflow connection ID as their primary argument. The contents of the callback messages are filled automatically via the DAG run context.
Airflow callbacks only accept expected arguments, not kwargs. Because these custom Slack callback functions expect the additional argument http_conn_id, this argument must be filled before applying the callbacks to the DAG. This can be done using the functools.partial() function, as follows:
from functools import partial
from ea_airflow_util.callables.slack import slack_alert_failure, slack_alert_success, slack_alert_sla_miss  # import path assumed

on_failure_callback = partial(slack_alert_failure, http_conn_id=HTTP_CONN_ID)
on_success_callback = partial(slack_alert_success, http_conn_id=HTTP_CONN_ID)
sla_miss_callback = partial(slack_alert_sla_miss, http_conn_id=HTTP_CONN_ID)
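The resulting partials are then supplied as ordinary Airflow callback arguments (a sketch; passing slack_conn_id to EACustomDAG, described later, achieves the same wiring automatically):
from airflow import DAG

dag = DAG(
    dag_id="example_dag",
    default_args={
        "on_failure_callback": on_failure_callback,   # task-level failure alerts
        "on_success_callback": on_success_callback,
    },
    sla_miss_callback=sla_miss_callback,              # SLA callbacks are DAG-level
)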
🔴 Task Failed.
Task: {task_id}
Dag: {dag_id}
Execution Time: {logical_date}
Log Url: {log_url}
✔ Task Succeeded.
Task: {task_id}
Dag: {dag_id}
Execution Time: {logical_date}
Log Url: {log_url}
🆘 SLA has been missed.
Task: {task_id}
Dag: {dag_id}
Execution Time: {logical_date}
Note: due to different definitions of task-failure/success callbacks and SLA callbacks, Log Url is unavailable in SLA callback messages. This will be investigated further and patched in a future update.
🔴 File did not download
Remote Path: {remote_path}
Local Path: {local_path}
Task: {task_id}
Dag: {dag_id}
Execution Time: {logical_date}
Log Url: {log_url}
Error: {error}
🔴 File did not upload to S3
File Path: {local_path}
File Key: {file_key}
Task: {task_id}
Dag: {dag_id}
Execution Time: {logical_date}
Log Url: {log_url}
Error: {error}
🔴 File did not insert to database
File Key: {file_key}
Dest Table: {table}
Task: {task_id}
Dag: {dag_id}
Execution Time: {logical_date}
Log Url: {log_url}
Error: {error}
🔴 File did not match expected spec
File Path: {local_path}
File Type: {file_type}
Exp. Cols: {cols_expected}
Found Cols: {cols_found}
Task: {task_id}
Dag: {dag_id}
Execution Time: {logical_date}
Log Url: {log_url}
🔴 File did not match file spec
File Path: {local_path}
Task: {task_id}
Dag: {dag_id}
Execution Time: {logical_date}
Log Url: {log_url}
Error: {error}
Helpers for getting data out of and into Snowflake
Copy data from Snowflake to local disk using a passed query. Optional arguments alter formatting and chunking when writing to disk.
Helpers for getting data out of and into different (non-Snowflake) SQL dialects
Copy data from MySQL to local disk.
s3_to_postgres(pg_conn_id, s3_conn_id, dest_table, column_customization, options, s3_key, s3_region, **kwargs)
Copy data from an S3 filepath into Postgres. Optional arguments alter table clean-up and import logic.
s3_dir_to_postgres(pg_conn_id, s3_conn_id, dest_table, column_customization, options, s3_key, s3_region, **kwargs)
Copy all files from an S3 directory into Postgres. Optional arguments alter table clean-up and import logic.
SSM ParameterStore helpers for extracting parameter strings from AWS.
This code is used exclusively in AWSParamStoreToAirflowDAG.
Utility methods for checking and updating Airflow variables
Update an Airflow variable with the specified value. A callable can be passed in value to update the variable in-place.
Compare the current value of a variable against a passed boolean condition. Raise an AirflowSkipException if the result is False. Always succeed if force is True.
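For example, an incrementer-style variable can be updated by passing a callable (a hypothetical sketch; the callable name, import path, and keyword names are assumptions):
from ea_airflow_util.callables import variable as variable_util  # import path assumed

# Passing a callable as the value updates the variable in place,
# e.g. bumping a counter variable after a successful run.
variable_util.update_variable("dbt_run_counter", value=lambda current: int(current) + 1)  # callable name and signature assumed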
Compressed-file utility methods
Extract zip files from a local_dir to an extract_dir, optionally filtering on filepath.
All DAGs defined in this package utilize the EACustomDAG behind the scenes. This means that unconventional DAG arguments like slack_conn_id can be passed to any DAG.
This is a DAG factory that pre-instantiates default arguments and UDMs (user-defined macros) used across our projects. By default, max_active_runs is set to 1, and catchup arguments are turned off. Any non-standard DAG kwargs are ignored.
If a Slack connection ID is passed through slack_conn_id, failure and SLA callbacks are automatically instantiated. This argument can also be accessed in UDMs under the key slack_conn_id.
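A minimal instantiation might look like this (a sketch; the import path is an assumption, and the remaining keyword arguments are standard DAG kwargs):
from ea_airflow_util import EACustomDAG  # import path assumed

dag = EACustomDAG(
    dag_id="example_dag",
    schedule_interval="@daily",
    slack_conn_id="slack_alerts",  # wires failure and SLA callbacks automatically
)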
The Airflow database backend does not remove historic records by default. This DAG removes data older than a specified number of retention days. Note that the DAG errors when attempting to remove data newer than 30 days.
Arguments:
Argument | Description |
---|---|
retention_days | number of days of log-data to preserve (default 90 ) |
dry_run | whether to complete a dry-run instead of a real run (default False ) |
verbose | whether to turn on verbose logging (default False ) |
Additional EACustomDAG arguments (e.g. slack_conn_id) can be passed as kwargs.
RunDbtDag is an Airflow DAG that completes a full DBT run with optional post-run behavior. Seed tables are fully refreshed, all models are run, and all tests are run. This emulates the behavior of a dbt build call, but with more control over parameters and failure states.
If all tests succeed, schemas are optionally swapped (e.g. from rc to prod). Additionally, DBT artifacts are optionally uploaded using the Brooklyn Data dbt_artifacts upload_dbt_artifacts_v2 operation.
Arguments:
Argument | Description |
---|---|
environment | environment name for the DAG label |
dbt_repo_path | path to the project /dbt folder |
dbt_target_name | name of the DBT target to select |
dbt_bin_path | path to the environment /dbt folder |
full_refresh | boolean flag for whether to apply the --full-refresh flag to incremental models (default False ) |
full_refresh_schedule | Cron schedule for when to automatically kick off a full refresh run |
opt_swap | boolean flag for whether to swap target schema with opt_dest_schema after each run (default False ) |
opt_dest_schema | optional destination schema to swap target schema with if opt_swap=True |
opt_swap_target | target used to rerun views if opt_swap=True (default opt_dest_schema) |
upload_artifacts | boolean flag for whether to upload DBT artifacts at the end of the run (default False ) |
dbt_incrementer_var | optional Airflow variable to increment after successful dbt run |
trigger_dags_on_run_success | optional list of DAGs to be triggered by a successful dbt run |
Additional EACustomDAG arguments (e.g. slack_conn_id) can be passed as kwargs.
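A representative instantiation using the arguments above (a sketch; the import path, file paths, and schedule are placeholders):
from ea_airflow_util import RunDbtDag  # import path assumed

run_dbt_dag = RunDbtDag(
    environment="prod",
    dbt_repo_path="/home/airflow/code/dbt",                  # project dbt folder (placeholder)
    dbt_target_name="prod",
    dbt_bin_path="/home/airflow/.virtualenvs/dbt/bin/dbt",   # environment dbt binary (placeholder)
    full_refresh=False,
    upload_artifacts=True,
    slack_conn_id="slack_alerts",                            # passed through to EACustomDAG
    schedule_interval="0 6 * * *",
)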
UpdateDbtDocsDag is an Airflow DAG that generates the three DBT docs metadata files and uploads them to a bucket on AWS S3. If an AWS CloudFront instance is pointed to this S3 bucket, a static website is built that is identical to the one generated by dbt docs generate.
Arguments:
Argument | Description |
---|---|
dbt_repo_path | path to the project /dbt folder |
dbt_target_name | name of the DBT target to select |
dbt_bin_path | path to the environment /dbt folder |
dbt_docs_s3_conn_id | S3 Airflow connection ID whose schema field defines the S3 bucket to which DBT documentation files are uploaded |
Additional EACustomDAG arguments (e.g. slack_conn_id) can be passed as kwargs.
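For example (a sketch; the import path and values are placeholders):
from ea_airflow_util import UpdateDbtDocsDag  # import path assumed

dbt_docs_dag = UpdateDbtDocsDag(
    dbt_repo_path="/home/airflow/code/dbt",
    dbt_target_name="prod",
    dbt_bin_path="/home/airflow/.virtualenvs/dbt/bin/dbt",
    dbt_docs_s3_conn_id="dbt_docs_s3",  # the S3 bucket is defined in the connection's schema field
    slack_conn_id="slack_alerts",
)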
DAG to run dbt snapshot
Arguments:
Argument | Description |
---|---|
dbt_repo_path | path to the project /dbt folder |
dbt_target_name | name of the DBT target to select |
dbt_bin_path | path to the environment /dbt folder |
Additional EACustomDAG arguments (e.g. slack_conn_id) can be passed as kwargs.
This DAG transfers data from an S3 bucket location into the Snowflake raw data lake. It should be used when data sources are not available from an Ed-Fi ODS but need to be brought into the data warehouse.
Arguments:
Argument | Description |
---|---|
tenant_code | ODS-tenant representation to be saved in Snowflake tables |
api_year | ODS API-year to be saved in Snowflake tables |
snowflake_conn_id | Airflow connection with Snowflake credentials |
database | database in which tables are found |
schema | schema in which tables are found |
data_source | table data source to copy data into ({data_source}__{resource_name} ) |
resource_names | array of table resource names to copy data into ({data_source}__{resource_name} ) |
transform_script | additional transformations to complete on data before transfer to Snowflake |
s3_source_conn_id | Airflow connection with S3 source credentials |
s3_dest_conn_id | Airflow connection with S3 destination credentials |
s3_dest_file_extension | new file extension under which to save transformed data |
pool | Airflow pool to use for copying tasks |
full_replace | boolean flag for whether to delete all data from the table before copying over (default False ) |
do_delete_from_source | boolean flag for whether to delete the data after copying over (default True ) |
Additional EACustomDAG arguments (e.g. slack_conn_id) can be passed as kwargs.
This DAG transfers data from an SFTP source into the Snowflake raw data lake. It should be used when data sources are not available from an Ed-Fi ODS but need to be brought into the data warehouse.
Arguments:
Argument | Description |
---|---|
s3_conn_id | Airflow connection with S3 credentials |
snowflake_conn_id | Airflow connection with Snowflake credentials |
database | database in which tables are found |
schema | schema in which tables are found |
pool | Airflow pool to use for copying tasks |
do_delete_from_local | boolean flag for whether to delete the data from the SFTP after copying over (default True ) |
Additional EACustomDAG arguments (e.g. slack_conn_id) can be passed as kwargs.
The Cloud Engineering and Integration team saves Ed-Fi ODS credentials as parameters in AWS Systems Manager Parameter Store. Each Stadium implementation has a shared SSM-prefix, which is further delineated by tenant-code and/or API year. There are three parameters associated with each ODS-connection:
{SSM_PREFIX}/{TENANT_CODE}/key
{SSM_PREFIX}/{TENANT_CODE}/secret
{SSM_PREFIX}/{TENANT_CODE}/url
Arguments:
Argument | Description |
---|---|
region_name | AWS region where parameters are stored |
connection_mapping | Optional one-to-one mapping between Parameter Store prefixes and ODS credentials |
prefix_year_mapping | Optional mapping between a shared SSM-prefix and a given Ed-Fi year for dynamic connections |
tenant_mapping | Optional mapping between tenant-code name in Parameter Store and its identity in Stadium in dynamic connections |
join_numbers | Optional boolean flag to strip underscores between district and number in dynamic connections (default True ) |
Additional EACustomDAG arguments (e.g. slack_conn_id) can be passed as kwargs.
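For example, using an explicit connection_mapping (a sketch; the import path is an assumption, and the mappings themselves are described in detail below):
from ea_airflow_util import AWSParamStoreToAirflowDAG  # import path assumed

param_store_dag = AWSParamStoreToAirflowDAG(
    region_name="us-east-1",
    connection_mapping={
        "/startingblocks/api/2223/sc-state": "edfi_scde_2023",
    },
    slack_conn_id="slack_alerts",
)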
There are three types of mappings that can be defined in the Parameter Store DAG. Arguments connection_mapping and prefix_year_mapping are mutually exclusive. Argument tenant_mapping is optional and is only applied if prefix_year_mapping is defined.
In Stadium implementations with fewer tenants, it is suggested to manually map the {SSM_PREFIX}/{TENANT_CODE} strings to their Ed-Fi connection name in Airflow using connection_mapping. For example:
connection_mapping = {
'/startingblocks/api/2122/sc-state': 'edfi_scde_2022',
'/startingblocks/api/2223/sc-state': 'edfi_scde_2023',
'/startingblocks/api/sc/state-2324': 'edfi_scde_2024',
}
In Stadium implementations with many tenants, an explicit one-to-one mapping between prefixes and connections may be untenable. In cases like these, the prefix_year_mapping argument maps shared SSM-prefixes to API years and dynamically builds Airflow credentials. For example:
prefix_year_mapping = {
'/startingblocks/api/districts-2122': 2022,
'/startingblocks/api/sc/districts-2223': 2023,
}
Connection pieces between the prefixes and url, key, and secret are assumed to be tenant-codes, and connections are built dynamically. Some standardization is always applied to inferred tenant-codes: spaces and dashes are converted to underscores. However, in the case that the dynamically-inferred tenant-code does not match its identity in Stadium, the tenant_mapping can be used to force a match. For example:
tenant_mapping = {
'fortmill': 'fort_mill',
'york-4' : 'fort_mill',
}
Using the example prefix_year_mapping and tenant_mapping defined above on the following Parameter Store keys will create a single Airflow connection: edfi_fort_mill_2023.
/startingblocks/api/sc/districts-2223/fortmill/url
/startingblocks/api/sc/districts-2223/fortmill/key
/startingblocks/api/sc/districts-2223/fortmill/secret
Finally, there is an optional boolean argument join_numbers that is turned on by default. When true, dynamically-inferred tenant-codes are standardized further to remove underscores between district name and number. For example, york_1 becomes york1.
When the tenant-code is not the penultimate element of the path, use the string {tenant_code} to automatically infer it for the mapping. For example, /ed-fi/apiClients/districts-2425-ds5/{tenant_code}/prod/Stadium will find parameters that match the path shape, but will label paths based on the inferred tenant_code.
Finally, this package contains a handful of custom operators and hooks to be used as an alternative to PythonOperators.
This operator extends Airflow's built-in S3FileTransformOperator to iterate over multiple files. In addition, the new dest_s3_file_extension argument provides greater transparency in output type. See the parent documentation for more information.
Arguments:
Argument | Description |
---|---|
source_s3_keys | array of S3 filepaths to transform |
dest_s3_prefix | destination S3 filepath in which to save transformed files (default: original filepath) |
dest_s3_file_extension | new file extension to give transformed files (default: original extension) |
select_expression | S3 select expression |
transform_script | location of the executable transformation script |
script_args | optional arguments to pass to the transformation script |
source_aws_conn_id | source s3 connection |
source_verify | whether to verify SSL certificates for S3 connection (default: SSL certificates are verified) |
dest_aws_conn_id | destination s3 connection |
dest_verify | whether to verify SSL certificates for S3 connection (default: SSL certificates are verified) |
replace | replace destination S3 key if it already exists (default True ) |
Additional Airflow operator args and kwargs can be passed during initialization.
This operator overrides DbtBaseOperator to allow us to pass the --args flag to run-operation.
This operation is the equivalent of dbt run-operation {op_name} --args '{json.dumps(arguments)}'
Arguments:
Argument | Description |
---|---|
op_name | name of the DBT macro to run in the operation |
arguments | argument dictionary to pass to the macro |
Additional Airflow operator args and kwargs can be passed during initialization.
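A hypothetical instantiation (the operator's class name and import path are assumptions; any parent DbtBaseOperator kwargs are passed as usual):
from ea_airflow_util import DbtRunOperationOperator  # class name and import path are assumptions

grant_permissions = DbtRunOperationOperator(
    task_id="grant_permissions",
    op_name="grant_usage_to_role",       # hypothetical macro name
    arguments={"role": "reporting"},     # serialized and passed to dbt via --args
    dag=dag,                             # assumes a surrounding DAG object
)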
This hook overrides SSHHook to interact with FTPs and SFTPs.
See parent documentation for input arguments and usage.
This hook is built for interacting with ShareFile servers.
Arguments:
Argument | Description |
---|---|
sharefile_conn_id | name of the Airflow connection with ShareFile metadata |
Note that the connection in Airflow must be configured in an unusual way:
- Host should be the API endpoint
- Schema should be the authentication URL
- Login/Password are filled out as normal
- Extra should be a dictionary structured as follows:
{"grant_type": "password", "client_id": client_id, "client_secret": client_secret}
Methods:
- get_conn()
- download(item_id, local_path)
- upload_file(folder_id, local_file)
- folder_id_from_path(folder_path)
- delete(item_id)
- get_path_id(path)
- item_info(id)
- find_files(folder_id)
- find_folders(folder_id)
- get_access_controls(item_id)
- get_user(user_id)
- get_children(item_id)
- file_to_memory(item_id)
- download_to_disk(item_id, local_path)
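A brief usage sketch (the hook's class name and import path are assumptions; the method names come from the list above, and the shape of the returned items is assumed to follow the ShareFile Items API):
from ea_airflow_util import SharefileHook  # class name and import path are assumptions

hook = SharefileHook(sharefile_conn_id="sharefile_default")

# Resolve a folder path to its ShareFile ID, then pull each file to disk.
folder_id = hook.folder_id_from_path("/Enrollment/2024-25")
for item in hook.find_files(folder_id):
    hook.download_to_disk(item["Id"], f"/tmp/sharefile/{item['Name']}")  # item keys assumed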
This operator transfers all files from a ShareFile folder to a local date-stamped directory, optionally deleting the remote copy.
Arguments:
Argument | Description |
---|---|
sharefile_conn_id | name of the Airflow connection with ShareFile metadata |
sharefile_path | the root directory to transfer |
local_path | local path to stream ShareFile files into |
delete_remote | boolean flag to delete original files on ShareFile (default False ) |
Additional Airflow operator args and kwargs can be passed during initialization.