Adjust sidekiq health monitor alerts #6558

Open
hackartisan opened this issue Dec 4, 2024 · 0 comments

Our Sidekiq alerts cover two types of response cases that are somewhat in conflict with each other.

Too many retries: When a job has retried enough times that we're pretty confident it won't succeed, we want to communicate with ingesters about the resource. However, this communication is non-urgent and in practice is rolled into our weekly Honeybadger meeting. Sometimes the alert stays triggered for several days.

Realtime queue latency: When requests for PDFs outpace the realtime workers available for PDF generation jobs, the user experiences a progress bar that mysteriously does not progress. Even relatively small amounts of latency in this queue indicate a usability concern that we'd like to know about as we consider whether we have the right number of realtime workers. This alert could be immediately actionable; for example, if we experience latency here we could increase the number of workers for that queue.
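For reference, these two signals come from different parts of Sidekiq's queue API and can be read independently of each other. A minimal sketch (the "realtime" queue name and the retry-count cutoff are assumptions, not our actual configuration):

```ruby
require 'sidekiq/api'

# Non-urgent signal: jobs that have already retried many times and probably
# won't succeed. The cutoff of 10 retries is illustrative.
stuck_retries = Sidekiq::RetrySet.new.select { |job| job['retry_count'].to_i >= 10 }

# Potentially actionable signal: how long (in seconds) the oldest pending job
# has been waiting in the realtime queue.
realtime_latency = Sidekiq::Queue.new('realtime').latency

puts "jobs stuck in retry: #{stuck_retries.size}"
puts "realtime queue latency: #{realtime_latency.round(1)}s"
```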

There are two main problems with the way the alerts are set up:

  1. When the alert stays triggered due to retries, a realtime queue latency event won't fire a new alert, so we'll never know it happened.
  2. When the alert is triggered, it is not clear from the Datadog message whether it's firing due to retries (non-urgent, ignorable) or realtime queue latency (potentially actionable).

Things we could maybe do:

  • We could just remove retries from the alert and rely on our weekly process to review these. The Sidekiq documentation says that the default 25 retries take about 20 days, so if we ever missed a week we'd see it the next week. And if we missed two weeks we could look at the "dead jobs" list.
  • We could treat the realtime queue latency as a metric rather than a threshold: collect that value and present it somewhere we can consult it and look at its history (see the sketch after this list).
  • We could try to create another monitor, potentially inheriting from the Sidekiq one, with its own endpoint and its own alert.
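Along the lines of the second bullet, here is a minimal sketch of reporting the latency as a Datadog gauge, assuming the dogstatsd-ruby gem is available; the metric name, queue name, and agent address are illustrative, not our current configuration:

```ruby
require 'sidekiq/api'
require 'datadog/statsd' # dogstatsd-ruby gem (assumed available)

# Report realtime queue latency as a gauge rather than alerting on a
# threshold, so its history can be charted in Datadog. Run this on whatever
# schedule suits us (cron, a recurring job, etc.).
statsd = Datadog::Statsd.new('localhost', 8125)
statsd.gauge('sidekiq.realtime.latency_seconds', Sidekiq::Queue.new('realtime').latency)
statsd.close
```

Graphing that gauge would let us judge over time whether we have the right number of realtime workers, decoupled from the retry alert.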