
Split out test-iasworld-data workflow from test-dbt-models so that data integrity tests will fail properly #671

Merged
57 changes: 15 additions & 42 deletions .github/workflows/test_dbt_models.yaml
@@ -4,12 +4,16 @@ name: test-dbt-models
# process defined by the `build-and-test-dbt` workflow. It runs under two
# circumstances:
#
# 1. When manually triggered by the Data Department's Spark data extraction
# process upon completion. This process runs iasWorld data tests. See
# https://github.com/ccao-data/service-sqoop-iasworld
# 1. When manually triggered by a Data Department team member, in order to
# run specific dbt tests using the `select` or `selector` input (runs all
# data integrity tests by default, if both `select` and `selector`
# are blank)
#
# 2. On the cron schedule set below, which runs non-iasWorld data tests on
# a weekly schedule to alert the team of any problems proactively
# 2. On the cron schedule set below, which runs data integrity tests
# weekly in order to proactively alert the team of any problems. Since
# the cron schedule cannot pass input variables to the workflow, we
# configure the input variables to fall back to defaults that support
# the scheduled job, namely the data integrity tests
on:
workflow_dispatch:
inputs:
@@ -25,16 +29,12 @@ on:
over the above list of tests if both are present)
required: false
type: string
upload_test_results:
description: Upload test results to S3
required: false
default: false
type: boolean
schedule:
# Every Monday at 11am UTC (6am UTC-5)
- cron: '0 11 * * 1'

env:
PYTHONUNBUFFERED: "1"
UV_SYSTEM_PYTHON: 1

jobs:
@@ -52,7 +52,7 @@ jobs:
- name: Validate and parse input variables
id: parse-inputs
run: |
# Default to running all tests
# Default to running all non-iasWorld data tests
SELECT_OPTION="--selector select_data_test_non_iasworld"
if [[ -n "$SELECTOR" ]]; then
SELECT_OPTION="--selector $SELECTOR"
@@ -90,51 +90,24 @@ jobs:
DEFER_OPTION="--defer --state $STATE_DIR"
fi
# shellcheck disable=SC2086
python scripts/run_iasworld_data_tests.py \
--target "$TARGET" \
--output-dir ./qc_test_results/ \
dbt test --target "$TARGET" \
$SELECT_OPTION \
$SKIP_ARTIFACTS_OPTION \
$DEFER_OPTION
working-directory: ${{ env.PROJECT_DIR }}
shell: bash
env:
USER: ${{ github.triggering_actor }}
GIT_SHA: ${{ github.sha }}
GIT_REF: ${{ github.ref_name }}
GIT_AUTHOR: ${{ github.event.commits[0].author.name }}
CACHE_HIT: ${{ steps.cache.outputs.cache-hit }}
SELECT_OPTION: ${{ steps.parse-inputs.outputs.select-option }}
SKIP_ARTIFACTS_OPTION: ${{ inputs.upload_test_results && '--no-skip-artifacts' || '--skip-artifacts' }}

- name: Save test results to S3
if: inputs.upload_test_results
run: |
s3_prefix="s3://ccao-data-warehouse-us-east-1/qc"
local_prefix="qc_test_results/metadata"
for dir in "test_run" "test_run_result" "test_run_failing_row"; do
dirpath="${local_prefix}/${dir}"
if [ -e "$dirpath" ]; then
echo "Copying ${dirpath} metadata to S3"
aws s3 sync "$dirpath" "${s3_prefix}/${dir}"
fi
done

crawler_name="ccao-data-warehouse-qc-crawler"
aws glue start-crawler --name "$crawler_name"
echo "Triggered Glue crawler $crawler_name"
working-directory: ${{ env.PROJECT_DIR }}
shell: bash

- name: Get current time
if: failure()
run: echo "TIMESTAMP=$(date -u +"%Y-%m-%dT%H:%M:%S")" >> "$GITHUB_ENV"
shell: bash

# Only triggered when called by a bot. Otherwise, notifications
# go to whoever called the workflow
# Only triggered when run on a schedule. Otherwise, notifications go to
# whoever dispatched the workflow
jeancochrane marked this conversation as resolved.
- name: Send failure notification
if: github.event_name == 'workflow_dispatch' && github.triggering_actor == 'sqoop-bot[bot]' && failure()
if: github.event_name == 'schedule' && failure()
uses: ./.github/actions/publish_sns_topic
with:
sns_topic_arn: ${{ secrets.AWS_SNS_NOTIFICATION_TOPIC_ARN }}
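For the manual-trigger path described in the workflow comment above, a Data Department team member could dispatch `test-dbt-models` from the GitHub CLI. The sketch below only constructs the command string so its shape can be checked; the `gh workflow run` subcommand and its `-f key=value` input flag are standard GitHub CLI usage, and the selector value is the workflow's own default:

```shell
# Build (but do not execute) a manual-dispatch command for the
# test-dbt-models workflow. Running it for real requires an
# authenticated `gh` session in the repository.
build_dispatch_cmd() {
  local selector="$1"
  printf 'gh workflow run test_dbt_models.yaml -f selector=%s' "$selector"
}

# The workflow falls back to this selector when no inputs are given:
build_dispatch_cmd "select_data_test_non_iasworld"
```

Because the scheduled trigger cannot pass inputs, this kind of explicit dispatch is the only way to run a non-default `select` or `selector`.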
141 changes: 141 additions & 0 deletions .github/workflows/test_iasworld_data.yaml
@@ -0,0 +1,141 @@
name: test-iasworld-data
Contributor Author commented:
This workflow should be nearly identical to the old implementation of test-dbt-models, but GitHub can't pick up on the diff because we're keeping test-dbt-models and adapting it for a slightly more specific purpose. I'll call out every change I made in comments below.


# This workflow is not configured to run on PRs, because PRs have their own CI
# process defined by the `build-and-test-dbt` workflow. Instead, the Data
# Department's Spark data extraction process dispatches it via API once per
# day after finishing the daily ingest of iasWorld data. For more detail, see:
#
# https://github.com/ccao-data/service-spark-iasworld
Contributor Author commented on lines +3 to +8:
This comment is slightly different, to reflect the new conditions in which this workflow runs compared to the old test-dbt-models workflow.
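The API dispatch this comment refers to can be sketched as follows. The endpoint shape (`POST /repos/{owner}/{repo}/actions/workflows/{workflow_id}/dispatches`) is GitHub's documented workflow-dispatch API; the `OWNER/REPO` value is a placeholder, not the actual service configuration:

```shell
# Construct the workflow-dispatch endpoint that an external service
# (here, the Spark data extraction process) would POST to after the
# daily iasWorld ingest finishes.
build_dispatch_url() {
  local repo="$1" workflow="$2"
  echo "https://api.github.com/repos/${repo}/actions/workflows/${workflow}/dispatches"
}

build_dispatch_url "OWNER/REPO" "test_iasworld_data.yaml"
# An actual caller would POST to this URL with an auth token and a JSON
# body along the lines of:
#   {"ref": "master", "inputs": {"upload_test_results": "true"}}
```

The `ref` and `inputs` keys shown in the comment are the standard request-body fields for this endpoint; their values above are illustrative.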

on:
workflow_dispatch:
inputs:
select:
description: >
Optional space-separated list of tests to run (defaults to all
iasWorld data tests)
required: false
type: string
selector:
description: >
Optional dbt selector representing tests to run (takes precedence
over the above list of tests if both are present)
required: false
type: string
upload_test_results:
description: Upload test results to S3
required: false
default: false
type: boolean

env:
PYTHONUNBUFFERED: "1"
UV_SYSTEM_PYTHON: 1

jobs:
test-iasworld-data:
runs-on: ubuntu-latest
# These permissions are needed to interact with GitHub's OIDC Token endpoint
# so that we can authenticate with AWS
permissions:
id-token: write
contents: read
steps:
- name: Checkout
uses: actions/checkout@v4

- name: Validate and parse input variables
id: parse-inputs
run: |
# Default to no select option, which will fall back to the default
# behavior of the underlying Python script
SELECT_OPTION=""
Contributor Author commented on lines +49 to +51:
This block is different from the old implementation of test-dbt-models: Now that this workflow is dedicated to running iasWorld tests, we can fall back to the default behavior of the underlying run_iasworld_data_tests.py script, which means we can avoid passing in the --select or --selector options unless the caller specifies one of them.
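The fallback behavior described here amounts to a simple precedence rule: an explicit `selector` wins over `select`, and when both are empty the option stays blank so the Python script applies its own default. A standalone sketch of that rule (function name is illustrative, not part of the workflow):

```shell
# Resolve the dbt selection flag from the two optional workflow inputs.
# Precedence: selector > select > empty (defer to the script's default).
resolve_select_option() {
  local select="$1" selector="$2"
  local opt=""
  if [[ -n "$selector" ]]; then
    opt="--selector $selector"
  elif [[ -n "$select" ]]; then
    opt="--select $select"
  fi
  echo "$opt"
}

resolve_select_option "" "select_data_test_iasworld"
```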

if [[ -n "$SELECTOR" ]]; then
SELECT_OPTION="--selector $SELECTOR"
elif [[ -n "$SELECT" ]]; then
SELECT_OPTION="--select $SELECT"
fi
echo "Setting select option to '$SELECT_OPTION'"
echo "select-option=$SELECT_OPTION" >> "$GITHUB_OUTPUT"
shell: bash
env:
SELECT: ${{ inputs.select }}
SELECTOR: ${{ inputs.selector }}

- name: Setup dbt
uses: ./.github/actions/setup_dbt
with:
role-to-assume: ${{ secrets.AWS_IAM_ROLE_TO_ASSUME_ARN }}

- name: Restore dbt state cache
id: cache
uses: ./.github/actions/restore_dbt_cache
with:
path: ${{ env.PROJECT_DIR }}/${{ env.STATE_DIR }}
key: ${{ env.CACHE_KEY }}

- name: Install Python dependencies
working-directory: ${{ env.PROJECT_DIR }}
shell: bash
run: uv pip install ".[dbt_tests]"

- name: Run tests
run: |
DEFER_OPTION=""
if [[ "$CACHE_HIT" == 'true' ]]; then
DEFER_OPTION="--defer --state $STATE_DIR"
fi
# shellcheck disable=SC2086
python scripts/run_iasworld_data_tests.py \
--target "$TARGET" \
--output-dir ./qc_test_results/ \
$SELECT_OPTION \
$SKIP_ARTIFACTS_OPTION \
$DEFER_OPTION
working-directory: ${{ env.PROJECT_DIR }}
shell: bash
env:
Member commented:
question (non-blocking): Why pass these in as env vars rather than just interpolating them directly into the bash?

jeancochrane (Contributor Author) replied on Dec 6, 2024:
Good question! There's no functional reason, I just felt like it was clearer in this particular case. For USER, GIT_SHA, GIT_REF, and GIT_AUTHOR, the script expects these environment variables to be set, so we need to convert the GitHub variables to env vars anyway. For SELECT_OPTION, CACHE_HIT, and SKIP_ARTIFACTS_OPTION, I felt like the GitHub variable references were a bit long/esoteric, and it feels easier to me to read them as env vars. Definitely a weak preference on my part, I'm open to changing any of these if you think a GitHub variable would be clearer.
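Beyond readability, one commonly cited argument for the `env:` approach (it appears in GitHub's own security-hardening guidance) is safety: a value passed through the environment reaches bash purely as data, whereas a value interpolated with `${{ ... }}` is spliced into the script text before bash ever parses it. A minimal sketch of the data-vs-code distinction:

```shell
# A value that would be dangerous if substituted directly into a
# `run:` script, since it contains shell metacharacters:
UNTRUSTED='"; echo injected; "'

print_as_data() {
  # Received as an argument (or env var), bash never evaluates the
  # value as code; it is printed verbatim.
  printf '%s\n' "$1"
}

print_as_data "$UNTRUSTED"
```

With direct interpolation, the same value would be pasted into the script source, where the embedded quotes and `;` would be parsed as shell syntax.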

USER: ${{ github.triggering_actor }}
GIT_SHA: ${{ github.sha }}
GIT_REF: ${{ github.ref_name }}
GIT_AUTHOR: ${{ github.event.commits[0].author.name }}
CACHE_HIT: ${{ steps.cache.outputs.cache-hit }}
SELECT_OPTION: ${{ steps.parse-inputs.outputs.select-option }}
SKIP_ARTIFACTS_OPTION: ${{ inputs.upload_test_results && '--no-skip-artifacts' || '--skip-artifacts' }}

- name: Save test results to S3
if: inputs.upload_test_results
run: |
s3_prefix="s3://ccao-data-warehouse-us-east-1/qc"
local_prefix="qc_test_results/metadata"
for dir in "test_run" "test_run_result" "test_run_failing_row"; do
dirpath="${local_prefix}/${dir}"
if [ -e "$dirpath" ]; then
echo "Copying ${dirpath} metadata to S3"
aws s3 sync "$dirpath" "${s3_prefix}/${dir}"
fi
done

crawler_name="ccao-data-warehouse-qc-crawler"
aws glue start-crawler --name "$crawler_name"
echo "Triggered Glue crawler $crawler_name"
working-directory: ${{ env.PROJECT_DIR }}
shell: bash

- name: Get current time
if: failure()
run: echo "TIMESTAMP=$(date -u +"%Y-%m-%dT%H:%M:%S")" >> "$GITHUB_ENV"
shell: bash

# Only triggered when called by a bot. Otherwise, notifications
# go to whoever called the workflow
- name: Send failure notification
if: github.event_name == 'workflow_dispatch' && github.triggering_actor == 'sqoop-bot[bot]' && failure()
uses: ./.github/actions/publish_sns_topic
with:
sns_topic_arn: ${{ secrets.AWS_SNS_NOTIFICATION_TOPIC_ARN }}
subject: "iasWorld tests errored for workflow run: ${{ github.run_id }}"
body: |
iasWorld tests raised an error for workflow ${{ github.run_id }}, run on ${{ env.TIMESTAMP }} UTC
Contributor Author commented on lines +136 to +138:
This subject and body are now slightly different from the old implementation ("iasWorld tests" instead of "dbt tests") to make it clear that the message specifically relates to iasWorld tests.


Link to failing workflow:
https://github.com/${{ github.repository }}/actions/runs/${{ github.run_id }}