-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add framework for instant email notifications of failing tests #579
Conversation
This is really cool stuff! |
@@ -31,14 +31,16 @@ jobs: | |||
run: | | |||
python3 scripts/run_iasworld_data_tests.py \ | |||
--target "$TARGET" \ | |||
--output-dir ./qc_test_results/ | |||
--output-dir ./qc_test_results/ \ | |||
--vars "{\"data_test_iasworld_commercial_sns_topic\": \"$COMMERCIAL_SNS_TOPIC\"}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One downside of this approach (setting topic ARNs as vars and overriding them with secrets during workflow execution) is that we'll have to update this bloated JSON dict every time we add a new topic. I don't think it's bad enough to outweigh the benefits of this approach, but it's a downside for sure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion (non-blocking): Why not just make a jq call or similar that maps the dbt vars to their expected env var equivalent, such that if we add a new dbt var it will automatically search for the corresponding env var? E.g.
- Add
data_test_iasworld_commercial_sns_topic
searches for$COMMERCIAL_SNS_TOPIC
- Add
data_test_iasworld_asmt_sns_topic
searches for$ASMT_SNS_TOPIC
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, this is an interesting idea, but I'm not totally convinced it's worth the complexity just yet. Let's plan to discuss in-person tomorrow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on our in-person discussion, I'm going to table this until we need to add more topics. I've added it to the future design in #595 to preserve this discussion.
jq -r \ | ||
'.[] | "./.github/actions/publish_sns_topic/publish.sh \(.topic_arn|@sh) \(.subject|@sh) \(.body|@sh)"' \ | ||
"$failure_notifications_file" \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This unholy jq command was the simplest way I could think of parsing failures by topic and queuing up each one to send, but I'm open to ideas that will be more understandable to laypeople!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is definitely a little bit cursed. What about doing it in Python? Also, let's add some conditionals on this (like those on L68) so that we don't accidentally spam folks if this workflow is triggered a bunch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great ideas on both fronts! I gated this step behind a conditional in e5a615e and factored the esoteric jq call into a Python script in 8fbac89. Note that I decided to implement the gate via an input variable, so we'll have to update the Spark job in order to enable it. I think this is a feature rather than a bug, however, since it means we can merge this change without enabling it immediately.
Requesting @dfsnow for a preliminary look at the design before I polish things up and add docs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a great start! The variable passing stuff seems like a necessary evil, but the seed + anti-join stuff strikes me as a bit clunky. Let's brainstorm to see if we can't simplify things and then meet later this week. In the meantime, see my other comments for more minor actions.
jq -r \ | ||
'.[] | "./.github/actions/publish_sns_topic/publish.sh \(.topic_arn|@sh) \(.subject|@sh) \(.body|@sh)"' \ | ||
"$failure_notifications_file" \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is definitely a little bit cursed. What about doing it in Python? Also, let's add some conditionals on this (like those on L68) so that we don't accidentally spam folks if this workflow is triggered a bunch.
# Generate the notification body and send it to each SNS topic. Start | ||
# by parsing out a message body and subject for each group of failures | ||
# by topic into a list of objects with the keys `topic_arn`, `subject`, | ||
# and `body` | ||
failure_notifications: typing.List[typing.Dict] = [] | ||
for topic_arn, test_results in failures_by_topic.items(): | ||
body = "The following iasWorld data tests are failing:" | ||
for test_result in test_results: | ||
body += f"\n\n{test_result.details}" | ||
failure_notifications.append( | ||
{ | ||
"topic_arn": topic_arn, | ||
"subject": "iasWorld data tests failed", | ||
"body": body, | ||
} | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just so I understand correctly: test failures are grouped by topic, such that different failing tests would end up in the same single email ala:
- {table_name}: {description} ({status})
* {fail1_key1}: {fail1_value1}, {fail1_key2}: {fail1_value2}
- {table_name2}: {description} ({status})
* {fail1_key1}: {fail1_value1}, {fail1_key2}: {fail1_value2}
* {fail2_key1}: {fail2_value1}, {fail2_key2}: {fail2_value2}
Is that right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, that's right! To make extra sure we're on the same page, "topic" here specifically means SNS topic, i.e. tests that are tagged with the same SNS topic representing a group of recipients will be merged into one email that goes out to those recipients, regardless of the category of the tests.
@@ -31,14 +31,16 @@ jobs: | |||
run: | | |||
python3 scripts/run_iasworld_data_tests.py \ | |||
--target "$TARGET" \ | |||
--output-dir ./qc_test_results/ | |||
--output-dir ./qc_test_results/ \ | |||
--vars "{\"data_test_iasworld_commercial_sns_topic\": \"$COMMERCIAL_SNS_TOPIC\"}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion (non-blocking): Why not just make a jq call or similar that maps the dbt vars to their expected env var equivalent, such that if we add a new dbt var it will automatically search for the corresponding env var? E.g.
- Add
data_test_iasworld_commercial_sns_topic
searches for$COMMERCIAL_SNS_TOPIC
- Add
data_test_iasworld_asmt_sns_topic
searches for$ASMT_SNS_TOPIC
# Generate the notification body and send it to each SNS topic. Start | ||
# by parsing out a message body and subject for each group of failures | ||
# by topic into a list of objects with the keys `topic_arn`, `subject`, | ||
# and `body` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion (non-blocking): Let's add a call to action to the body as well. Something like "These were autogenerated by the Data Team. You will stop receiving these emails once the failures below are fixed" and then also our contact info at the bottom.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like that, done in c5096a8. I added our names instead of contact info, to avoid committing our email addresses to the source code; let me know if you don't mind the risk of spam and I can add email addresses in.
b4a612c
to
d6ed838
Compare
This PR adds support for a new
meta.notify
attribute to our dbtschema.yml
syntax that can control automated email notifications for failing tests. It introduces a new testiasworld_pardat_spot_is_null
that uses this system, along with a seed representing existing test failures; when combined, these two features allow us to send realtime notifications when a new value starts failing the test, without spamming us with notifications for known rows that are already failing.I'll forward you a sample email so that you can see what the notification looks like. Here are the logs for the workflow that generated that notification.
Here's an overview for how this system works:
Sending a notification for a failing test
dbt_project.yml
that will store the ARN for the topicscripts/run_iasworld_data_tests.py
call in thetest-dbt-models
workflow to set the dbt variable you created in step 2 to the secret you created in step 3meta.notify
attribute whose value is the dbt variable name from step 1test-dbt-models
workflow runs, it will send a notification for any failures to the subscribers of the SNS topic from step 1Restricting failure notifications to remove known failures
anti_join
argument that can join to another model and only return failures that are not already represented in that modelanti_join
argument added in step 2 and point it to the seed added in step 1Note that we have sketched out a simplified design for this system in #595, but we're planning to merge this version anyway so that we can get started configuring tests for notifications as soon as possible.