-
Notifications
You must be signed in to change notification settings - Fork 291
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: improve Slack rate limiting logic when updating alert groups #5287
base: dev
Are you sure you want to change the base?
Conversation
…:grafana/oncall into jorlando/integration-slack-rate-limiting
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mostly type hint updates in this file (was poking around in here so decided to leave type hints just a bit better than I found them)
return ( | ||
self.rate_limited_in_slack_at is not None | ||
and self.rate_limited_in_slack_at + SLACK_RATE_LIMIT_TIMEOUT > timezone.now() | ||
) | ||
|
||
def start_send_rate_limit_message_task(self, delay=SLACK_RATE_LIMIT_DELAY): | ||
def start_send_rate_limit_message_task(self, error_message_verb: str, delay: int) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is currently only invoked in two spots, both of which have been updated (additionally, both of these spots that were invoking this function were already passing in a delay
, hence why I removed the default of SLACK_RATE_LIMIT_DELAY
)
@@ -61,8 +66,10 @@ class SlackMessage(models.Model): | |||
related_name="slack_messages", | |||
) | |||
|
|||
# ID of a latest celery task to update the message | |||
active_update_task_id = models.CharField(max_length=100, null=True, default=None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this field (conveniently) already existed, but it doesn't look like it was being used? (code search) in this PR I'm going to start reusing this field
def update_alert_groups_message(self) -> None: | ||
""" | ||
Schedule an update task for the associated alert group's Slack message, respecting the debounce interval. | ||
|
||
This method ensures that updates to the Slack message related to an alert group are not performed | ||
too frequently, adhering to the `ALERT_GROUP_UPDATE_DEBOUNCE_INTERVAL_SECONDS` debounce interval. | ||
It schedules a background task to update the message after the appropriate countdown. | ||
|
||
The method performs the following steps: | ||
- Checks if there's already an active update task (`active_update_task_id` is set). If so, exits to prevent | ||
duplicate scheduling. | ||
- Calculates the time since the last update (`last_updated` field) and determines the remaining time needed | ||
to respect the debounce interval. | ||
- Schedules the `update_alert_group_slack_message` task with the calculated countdown. | ||
- Stores the task ID in `active_update_task_id` to prevent multiple tasks from being scheduled. | ||
""" | ||
|
||
if not self.alert_group: | ||
logger.warning( | ||
f"skipping update_alert_groups_message as SlackMessage {self.pk} has no alert_group associated with it" | ||
) | ||
return | ||
elif self.active_update_task_id: | ||
logger.info( | ||
f"skipping update_alert_groups_message as SlackMessage {self.pk} has an active update task {self.active_update_task_id}" | ||
) | ||
return | ||
|
||
now = timezone.now() | ||
|
||
# we previously weren't updating the last_updated field for messages, so there will be cases | ||
# where the last_updated field is None | ||
last_updated = self.last_updated or now | ||
|
||
time_since_last_update = (now - last_updated).total_seconds() | ||
remaining_time = self.ALERT_GROUP_UPDATE_DEBOUNCE_INTERVAL_SECONDS - int(time_since_last_update) | ||
countdown = max(remaining_time, 10) | ||
|
||
logger.info( | ||
f"updating message for alert_group {self.alert_group.pk} in {countdown} seconds " | ||
f"(debounce interval: {self.ALERT_GROUP_UPDATE_DEBOUNCE_INTERVAL_SECONDS})" | ||
) | ||
|
||
task_id = celery_uuid() | ||
update_alert_group_slack_message.apply_async((self.pk,), countdown=countdown, task_id=task_id) | ||
self.active_update_task_id = task_id | ||
self.save(update_fields=["active_update_task_id"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
main change
if not alert_group_slack_message: | ||
logger.info( | ||
f"Skip updating alert group in Slack because alert_group {alert_group.pk} doesn't " | ||
"have a slack message associated with it" | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I decided to centralize all of the logic of conditionally updating the alert group's slack message to SlackMessage.update_alert_groups_message
(and also decided to drop this caching logic in favour of slack_message.active_update_task_id
+ celery task countdown
(which represents the "debounce interval"))
engine/apps/slack/tasks.py
Outdated
def update_alert_group_slack_message(slack_message_pk: int) -> None: | ||
""" | ||
Background task to update the Slack message for an alert group. | ||
|
||
This function is intended to be executed as a Celery task. It performs the following: | ||
- Compares the current task ID with the `active_update_task_id` stored in the `SlackMessage`. | ||
- If they do not match, it means a newer task has been scheduled, so the current task exits to prevent outdated updates. | ||
- Uses the `AlertGroupSlackService` to perform the actual update of the Slack message. | ||
- Upon successful completion, clears the `active_update_task_id` to allow future updates. | ||
|
||
Args: | ||
slack_message_pk (int): The primary key of the `SlackMessage` instance to update. | ||
""" | ||
|
||
from apps.slack.models import SlackMessage | ||
|
||
current_task_id = update_alert_group_slack_message.request.id | ||
|
||
return ( | ||
f"update_incident_slack_message rescheduled because of current task_id ({current_task_id})" | ||
f" for alert_group {alert_group_pk} doesn't exist in cache" | ||
logger.info( | ||
f"update_alert_group_slack_message for slack message {slack_message_pk} started with task_id {current_task_id}" | ||
) | ||
|
||
try: | ||
slack_message = SlackMessage.objects.get(pk=slack_message_pk) | ||
except SlackMessage.DoesNotExist: | ||
logger.warning(f"SlackMessage {slack_message_pk} doesn't exist") | ||
return | ||
|
||
if not current_task_id == slack_message.active_update_task_id: | ||
logger.warning( | ||
f"update_alert_group_slack_message skipped, because current_task_id ({current_task_id}) " | ||
f"does not equal to active_update_task_id ({slack_message.active_update_task_id}) " | ||
) | ||
if not current_task_id == cached_task_id: | ||
return ( | ||
f"update_incident_slack_message skipped, because of current task_id ({current_task_id})" | ||
f" doesn't equal to cached task_id ({cached_task_id}) for alert_group {alert_group_pk}" | ||
return | ||
|
||
alert_group = slack_message.alert_group | ||
if not alert_group: | ||
logger.warning( | ||
f"skipping update_alert_group_slack_message as SlackMessage {slack_message_pk} " | ||
"doesn't have an alert group associated with it" | ||
) | ||
return | ||
|
||
AlertGroupSlackService(slack_message.slack_team_identity).update_alert_group_slack_message(alert_group) | ||
|
||
slack_message.active_update_task_id = None | ||
slack_message.last_updated = timezone.now() | ||
slack_message.save(update_fields=["active_update_task_id", "last_updated"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
main change
|
||
|
||
@shared_dedicated_queue_retry_task(autoretry_for=(Exception,), retry_backoff=True) | ||
def update_incident_slack_message(slack_team_identity_pk: int, alert_group_pk: int) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this has been renamed to update_alert_group_slack_message
.. leaving this around to allow the celery workers to process any remaining update_incident_slack_message
tasks (before removing this in a subsequent PR)
@@ -29,7 +28,7 @@ def escalate_alert_group(alert_group_pk): | |||
except IndexError: | |||
return f"Alert group with pk {alert_group_pk} doesn't exist" | |||
|
|||
if not compare_escalations(escalate_alert_group.request.id, alert_group.active_escalation_id): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this function felt improperly named to me, as the only invocations of it are to compare celery task ids. Given that this function simply looks like this 👇, decide to just completely remove it in favour of a simple equality check:
def compare_escalations(request_id, active_escalation_id):
if request_id == active_escalation_id:
return True
return False
@@ -9,7 +9,6 @@ django-cors-headers==3.7.0 | |||
# pyroscope-io==0.8.1 | |||
django-dbconn-retry==0.1.7 | |||
django-debug-toolbar==4.1 | |||
django-deprecate-fields==0.1.1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a random change, but we discussed it here (since it's only a one-liner change + this library isn't used, decided to just squeeze it in this PR)
@pytest.fixture | ||
def slack_message_setup( | ||
make_organization_and_user_with_slack_identities, make_alert_receive_channel, make_alert_group, make_slack_message | ||
): | ||
def _slack_message_setup(cached_permalink): | ||
( | ||
organization, | ||
user, | ||
slack_team_identity, | ||
slack_user_identity, | ||
) = make_organization_and_user_with_slack_identities() | ||
integration = make_alert_receive_channel(organization) | ||
alert_group = make_alert_group(integration) | ||
|
||
return make_slack_message(alert_group, cached_permalink=cached_permalink) | ||
|
||
return _slack_message_setup | ||
|
||
|
||
@patch.object( | ||
SlackClient, | ||
"chat_getPermalink", | ||
return_value=build_slack_response({"ok": True, "permalink": "test_permalink"}), | ||
) | ||
@pytest.mark.django_db | ||
def test_slack_message_permalink(mock_slack_api_call, slack_message_setup): | ||
slack_message = slack_message_setup(cached_permalink=None) | ||
assert slack_message.permalink == "test_permalink" | ||
mock_slack_api_call.assert_called_once() | ||
|
||
|
||
@patch.object( | ||
SlackClient, | ||
"chat_getPermalink", | ||
side_effect=SlackAPIError(response=build_slack_response({"ok": False, "error": "message_not_found"})), | ||
) | ||
@pytest.mark.django_db | ||
def test_slack_message_permalink_error(mock_slack_api_call, slack_message_setup): | ||
slack_message = slack_message_setup(cached_permalink=None) | ||
assert slack_message.permalink is None | ||
mock_slack_api_call.assert_called_once() | ||
|
||
|
||
@patch.object( | ||
SlackClient, | ||
"chat_getPermalink", | ||
return_value=build_slack_response({"ok": True, "permalink": "test_permalink"}), | ||
) | ||
@pytest.mark.django_db | ||
def test_slack_message_permalink_cache(mock_slack_api_call, slack_message_setup): | ||
slack_message = slack_message_setup(cached_permalink="cached_permalink") | ||
assert slack_message.permalink == "cached_permalink" | ||
mock_slack_api_call.assert_not_called() | ||
|
||
|
||
@patch.object( | ||
SlackClient, | ||
"chat_getPermalink", | ||
return_value=build_slack_response({"ok": False, "error": "account_inactive"}), | ||
) | ||
@pytest.mark.django_db | ||
def test_slack_message_permalink_token_revoked(mock_slack_api_call, slack_message_setup): | ||
slack_message = slack_message_setup(cached_permalink=None) | ||
slack_message._slack_team_identity.detected_token_revoked = timezone.now() | ||
slack_message._slack_team_identity.save() | ||
assert slack_message._slack_team_identity is not None | ||
assert slack_message.permalink is None | ||
mock_slack_api_call.assert_not_called() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've simply combined the tests in engine/apps/slack/test_slack_message.py with this file (as their all tests related to the same model, SlackMessage
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see this comment
alert_group_3.resolve() | ||
with patch("apps.alerts.tasks.notify_user.compare_escalations", return_value=True): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm only touching this test file because I removed compare_escalations
in this PR (and hence need to update this mock)..
this test should probably be broken out into several separate tests (to avoid having to reset notification_bundle.notification_task_id
in between calls to send_bundled_notification
.. but I will leave that for the future)
What this PR does
Closes https://github.com/grafana/oncall-private/issues/2947
Which issue(s) this PR closes
Related to [issue link here]
Checklist
pr:no public docs
PR label added if not required)release:
). These labels dictate how your PR willshow up in the autogenerated release notes.