Add framework for instant email notifications of failing tests #579

@jeancochrane jeancochrane commented Sep 4, 2024

This PR adds support for a new meta.notify attribute in our dbt schema.yml syntax that controls automated email notifications for failing tests. It also introduces a new test, iasworld_pardat_spot_is_null, that uses this system, along with a seed representing existing test failures. Combined, these two features let us send real-time notifications when a new value starts failing the test, without spamming us with notifications for known rows that are already failing.

I'll forward you a sample email so that you can see what the notification looks like. Here are the logs for the workflow that generated that notification.

Here's an overview of how this system works:

Sending a notification for a failing test

  1. Get a dbt variable representing an SNS topic for the group of team members we want to email when this test fails
    • If no variable/topic combo exists yet for this group, a few config steps are required:
      1. Create the SNS topic
      2. Define a new empty dbt variable in dbt_project.yml that will store the ARN for the topic
      3. Store the ARN as a repo secret so that it stays hidden but remains accessible from GitHub workflows
      4. Update the scripts/run_iasworld_data_tests.py call in the test-dbt-models workflow to set the dbt variable you created in step 2 to the secret you created in step 3
    • If a variable/topic already exists for this group, no changes are necessary; just grab its dbt variable name
  2. Update the test definition to add a meta.notify attribute whose value is the dbt variable name from step 1
  3. When the test-dbt-models workflow runs, it will send a notification for any failures to the subscribers of the SNS topic from step 1
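
The lookup described in the steps above can be sketched in Python. This is a minimal illustration under stated assumptions, not the code in scripts/run_iasworld_data_tests.py: the `FailedTest` class, the `group_failures_by_topic` helper, and the example ARN are hypothetical, and only the meta.notify attribute and the data_test_iasworld_commercial_sns_topic variable name come from this PR.

```python
import typing
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class FailedTest:
    """Hypothetical stand-in for one failing dbt test result."""

    name: str
    meta: typing.Dict[str, str] = field(default_factory=dict)


def group_failures_by_topic(
    failures: typing.List[FailedTest],
    dbt_vars: typing.Dict[str, str],
) -> typing.Dict[str, typing.List[FailedTest]]:
    """Group failing tests by the SNS topic ARN that meta.notify resolves to."""
    failures_by_topic: typing.Dict[str, typing.List[FailedTest]] = defaultdict(list)
    for failure in failures:
        # meta.notify holds the *name* of a dbt variable, not the ARN itself
        var_name = failure.meta.get("notify")
        if not var_name:
            continue  # test is not configured for notifications
        topic_arn = dbt_vars.get(var_name)
        if not topic_arn:
            continue  # variable is empty, e.g. no secret was passed in
        failures_by_topic[topic_arn].append(failure)
    return dict(failures_by_topic)


# Example values below are made up for illustration
example_arn = "arn:aws:sns:us-east-1:000000000000:example-topic"
example_vars = {"data_test_iasworld_commercial_sns_topic": example_arn}
example_failures = [
    FailedTest(
        "iasworld_pardat_spot_is_null",
        {"notify": "data_test_iasworld_commercial_sns_topic"},
    ),
    FailedTest("test_without_notify"),
]
grouped = group_failures_by_topic(example_failures, example_vars)
```

Tests without a meta.notify attribute, or whose variable resolves to an empty value, are skipped rather than treated as errors, which matches the idea of an empty default variable in dbt_project.yml.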

Restricting failure notifications to remove known failures

  1. Add a seed representing the rows that we already know are failing for a given test
  2. Adjust the interface for the generic test to add support for an anti_join argument that can join to another model and only return failures that are not already represented in that model
  3. Update the test to supply the anti_join argument added in step 2 and point it to the seed added in step 1
  4. When the test runs, it will only fail for rows that do not have a representation in the seed from step 1
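
The anti-join semantics from the steps above can be illustrated with a short Python sketch. The real test is a dbt generic test written in SQL, so this only shows the filtering logic; the `anti_join` function and the `parid` and `taxyr` key columns are hypothetical names chosen for the example.

```python
import typing

Row = typing.Dict[str, typing.Any]


def anti_join(
    failing_rows: typing.List[Row],
    known_failures: typing.List[Row],
    key_columns: typing.Sequence[str],
) -> typing.List[Row]:
    """Return failing rows whose key does not appear in the known failures."""
    known_keys = {
        tuple(row[column] for column in key_columns) for row in known_failures
    }
    return [
        row
        for row in failing_rows
        if tuple(row[column] for column in key_columns) not in known_keys
    ]


# Rows already recorded in the seed (made-up values)
seed_rows = [{"parid": "001", "taxyr": "2024"}]
# Rows currently failing the test: one known, one new
current_failures = [
    {"parid": "001", "taxyr": "2024", "spot": "A"},
    {"parid": "002", "taxyr": "2024", "spot": "B"},
]
new_failures = anti_join(current_failures, seed_rows, ["parid", "taxyr"])
```

Only the row for parid 002 remains, so the test would fail (and notify) for that row alone.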

Note that we have sketched out a simplified design for this system in #595, but we're planning to merge this version anyway so that we can get started configuring tests for notifications as soon as possible.

@jeancochrane jeancochrane linked an issue Sep 4, 2024 that may be closed by this pull request
rross0 commented Sep 5, 2024

This is really cool stuff!

@@ -31,14 +31,16 @@ jobs:
         run: |
           python3 scripts/run_iasworld_data_tests.py \
             --target "$TARGET" \
-            --output-dir ./qc_test_results/
+            --output-dir ./qc_test_results/ \
+            --vars "{\"data_test_iasworld_commercial_sns_topic\": \"$COMMERCIAL_SNS_TOPIC\"}"
Contributor Author

One downside of this approach (setting topic ARNs as vars and overriding them with secrets during workflow execution) is that we'll have to update this bloated JSON dict every time we add a new topic. I don't think it's bad enough to outweigh the benefits of this approach, but it's a downside for sure.

Member

suggestion (non-blocking): Why not just make a jq call or similar that maps the dbt vars to their expected env var equivalent, such that if we add a new dbt var it will automatically search for the corresponding env var? E.g.

  • Adding data_test_iasworld_commercial_sns_topic searches for $COMMERCIAL_SNS_TOPIC
  • Adding data_test_iasworld_asmt_sns_topic searches for $ASMT_SNS_TOPIC
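
A minimal sketch of that mapping, written in Python rather than jq. It assumes the naming convention implied by the two examples above (strip the data_test_iasworld_ prefix and uppercase the rest); the `env_var_name_for` and `build_vars_json` helpers are hypothetical and not part of this PR.

```python
import json
import os
import typing

# Assumed shared prefix, inferred from the two variable names above
VAR_PREFIX = "data_test_iasworld_"


def env_var_name_for(dbt_var: str) -> str:
    """Map a dbt variable name to the env var expected to hold its value."""
    if not dbt_var.startswith(VAR_PREFIX):
        raise ValueError(f"Unexpected dbt variable name: {dbt_var}")
    return dbt_var[len(VAR_PREFIX):].upper()


def build_vars_json(
    dbt_vars: typing.Iterable[str],
    environ: typing.Optional[typing.Mapping[str, str]] = None,
) -> str:
    """Build the JSON string for --vars from whichever env vars are set."""
    environ = os.environ if environ is None else environ
    resolved = {}
    for dbt_var in dbt_vars:
        env_name = env_var_name_for(dbt_var)
        if env_name in environ:
            resolved[dbt_var] = environ[env_name]
    return json.dumps(resolved)


# Example with a fake environment; the ARN value is made up
fake_environ = {"COMMERCIAL_SNS_TOPIC": "arn:aws:sns:us-east-1:000000000000:commercial"}
vars_json = build_vars_json(
    ["data_test_iasworld_commercial_sns_topic", "data_test_iasworld_asmt_sns_topic"],
    fake_environ,
)
```

Variables with no matching env var are left out of the JSON, so dbt would fall back to the empty default defined in dbt_project.yml.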

Contributor Author

Hmm, this is an interesting idea, but I'm not totally convinced it's worth the complexity just yet. Let's plan to discuss in person tomorrow.

Contributor Author

@jeancochrane jeancochrane Sep 17, 2024

Based on our in-person discussion, I'm going to table this until we need to add more topics. I've added it to the future design in #595 to preserve this discussion.

Comment on lines 85 to 87
jq -r \
'.[] | "./.github/actions/publish_sns_topic/publish.sh \(.topic_arn|@sh) \(.subject|@sh) \(.body|@sh)"' \
"$failure_notifications_file" \
Contributor Author

This unholy jq command was the simplest way I could think of parsing failures by topic and queuing up each one to send, but I'm open to ideas that will be more understandable to laypeople!

Member

This is definitely a little bit cursed. What about doing it in Python? Also, let's add some conditionals on this (like those on L68) so that we don't accidentally spam folks if this workflow is triggered a bunch.

Contributor Author

Great ideas on both fronts! I gated this step behind a conditional in e5a615e and factored the esoteric jq call into a Python script in 8fbac89. Note that I decided to implement the gate via an input variable, so we'll have to update the Spark job in order to enable it. I think this is a feature rather than a bug, however, since it means we can merge this change without enabling it immediately.
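
For readers following along, here is a rough sketch of what a gated Python publisher could look like. It is not the script added in 8fbac89: the `send_failure_notifications` and `publish_with_boto3` functions and the `enabled` flag are hypothetical, and it assumes notifications are a list of objects with the keys `topic_arn`, `subject`, and `body`. The boto3 `publish` call with `TopicArn`, `Subject`, and `Message` is the standard SNS client API.

```python
import typing

Notification = typing.Dict[str, str]
Publisher = typing.Callable[[str, str, str], None]


def publish_with_boto3(topic_arn: str, subject: str, body: str) -> None:
    """Publish one message to SNS (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the sketch loads without boto3 installed

    boto3.client("sns").publish(TopicArn=topic_arn, Subject=subject, Message=body)


def send_failure_notifications(
    notifications: typing.List[Notification],
    enabled: bool,
    publish: Publisher = publish_with_boto3,
) -> int:
    """Send each notification if the gate is open; return the number sent."""
    if not enabled:
        # Gate closed: skip sending so repeated workflow runs cannot spam anyone
        return 0
    sent = 0
    for notification in notifications:
        publish(
            notification["topic_arn"],
            notification["subject"],
            notification["body"],
        )
        sent += 1
    return sent
```

Passing the publisher in as an argument keeps the loop testable without AWS access, and the `enabled` flag plays the role of the workflow input variable described above.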

@jeancochrane jeancochrane requested a review from dfsnow September 5, 2024 22:33
@jeancochrane
Contributor Author

Requesting @dfsnow for a preliminary look at the design before I polish things up and add docs.

Member

@dfsnow dfsnow left a comment

This is a great start! The variable passing stuff seems like a necessary evil, but the seed + anti-join stuff strikes me as a bit clunky. Let's brainstorm to see if we can't simplify things and then meet later this week. In the meantime, see my other comments for more minor actions.

Comment on lines 923 to 938
# Generate the notification body and send it to each SNS topic. Start
# by parsing out a message body and subject for each group of failures
# by topic into a list of objects with the keys `topic_arn`, `subject`,
# and `body`
failure_notifications: typing.List[typing.Dict] = []
for topic_arn, test_results in failures_by_topic.items():
    body = "The following iasWorld data tests are failing:"
    for test_result in test_results:
        body += f"\n\n{test_result.details}"
    failure_notifications.append(
        {
            "topic_arn": topic_arn,
            "subject": "iasWorld data tests failed",
            "body": body,
        }
    )
Member

Just so I understand correctly: test failures are grouped by topic, such that different failing tests would end up in the same single email, à la:

           - {table_name}: {description} ({status})
                * {fail1_key1}: {fail1_value1}, {fail1_key2}: {fail1_value2}
           - {table_name2}: {description} ({status})
                * {fail1_key1}: {fail1_value1}, {fail1_key2}: {fail1_value2}
                * {fail2_key1}: {fail2_value1}, {fail2_key2}: {fail2_value2}

Is that right?

Contributor Author

Yup, that's right! To make extra sure we're on the same page, "topic" here specifically means SNS topic, i.e. tests that are tagged with the same SNS topic representing a group of recipients will be merged into one email that goes out to those recipients, regardless of the category of the tests.
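
To make the merged-email layout concrete, here is a small Python sketch that renders one body for a single SNS topic. It follows the template in the question above rather than the actual `details` property in run_iasworld_data_tests.py, which is not shown here; the `format_test_details` and `build_body` helpers and the sample values are hypothetical.

```python
import typing


def format_test_details(
    table_name: str,
    description: str,
    status: str,
    failing_rows: typing.List[typing.Dict[str, str]],
) -> str:
    """Render one test as a header line plus one bullet per failing row."""
    lines = [f"- {table_name}: {description} ({status})"]
    for row in failing_rows:
        pairs = ", ".join(f"{key}: {value}" for key, value in row.items())
        lines.append(f"    * {pairs}")
    return "\n".join(lines)


def build_body(details: typing.List[str]) -> str:
    """Join the details of every failing test for one topic into one email body."""
    body = "The following iasWorld data tests are failing:"
    for detail in details:
        body += f"\n\n{detail}"
    return body


# Two different tests tagged with the same topic end up in a single body
example_body = build_body(
    [
        format_test_details(
            "pardat", "spot should be null", "fail",
            [{"parid": "001", "taxyr": "2024"}],
        ),
        format_test_details(
            "dweldat", "example description", "fail",
            [{"parid": "002", "taxyr": "2024"}, {"parid": "003", "taxyr": "2024"}],
        ),
    ]
)
```

Because grouping happens on the topic ARN, the two tests in the example share one email even though they check different tables.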

Comment on lines 923 to 926
# Generate the notification body and send it to each SNS topic. Start
# by parsing out a message body and subject for each group of failures
# by topic into a list of objects with the keys `topic_arn`, `subject`,
# and `body`
Member

suggestion (non-blocking): Let's add a call to action to the body as well. Something like "These were autogenerated by the Data Team. You will stop receiving these emails once the failures below are fixed" and then also our contact info at the bottom.

Contributor Author

I like that, done in c5096a8. I added our names instead of contact info, to avoid committing our email addresses to the source code; let me know if you don't mind the risk of spam and I can add email addresses in.

@jeancochrane jeancochrane changed the title [Do not merge] Add framework for instant email notifications of failing tests Add framework for instant email notifications of failing tests Sep 17, 2024
@jeancochrane jeancochrane changed the title Add framework for instant email notifications of failing tests [Do not merge] Add framework for instant email notifications of failing tests Sep 17, 2024
@jeancochrane jeancochrane changed the title [Do not merge] Add framework for instant email notifications of failing tests Add framework for instant email notifications of failing tests Sep 17, 2024
@jeancochrane jeancochrane changed the base branch from master to jeancochrane/gate-test-result-s3-upload-behind-workflow-variable September 17, 2024 20:36
@jeancochrane jeancochrane force-pushed the jeancochrane/572-add-automated-test-with-notifications-for-dweldatspot branch from b4a612c to d6ed838 Compare September 17, 2024 21:33

Successfully merging this pull request may close these issues.

Add automated test with notifications for dweldat.spot