Our Sidekiq alerts cover 2 types of response cases that are in some conflict.
- **Too many retries:** When a job has retried enough times that we're pretty confident it won't succeed, we want to communicate with ingesters about the resource. However, this communication is non-urgent and in practice is rolled into our weekly Honeybadger meeting, so the alert sometimes stays triggered for several days.
- **Realtime queue latency:** When requests for PDFs outpace the realtime workers available for PDF generation jobs, the user sees a progress bar that mysteriously does not progress. Even relatively small amounts of latency in this queue indicate a usability concern we'd like to know about as we consider whether we have the right number of realtime workers. This alert could be immediately actionable; for example, if we see latency here we could increase the number of workers for that queue.
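For reference, the two signals live in different parts of Sidekiq's API, so they can be read (and eventually alerted on) independently. A minimal sketch, assuming the PDF-generation queue is literally named `realtime` (adjust to our actual queue name):

```ruby
# Sketch: reading the two signals separately via Sidekiq's API.
# The queue name "realtime" is an assumption for illustration.
require "sidekiq/api"

retry_count      = Sidekiq::RetrySet.new.size             # failed jobs still scheduled to retry
realtime_latency = Sidekiq::Queue.new("realtime").latency # seconds the oldest waiting job has been enqueued

puts "jobs in retry set:      #{retry_count}"
puts "realtime queue latency: #{realtime_latency.round(1)}s"
```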
There are 2 main problems with the way the alerts are set up:
- When the alert stays triggered due to retries, a realtime queue latency event won't alert, and we'll never know it happened.
- When the alert is triggered, it is not clear from the Datadog message whether it's firing due to retries (non-urgent, ignorable) or realtime queue latency (potentially actionable).
Things we could maybe do:
- We could just remove retries from the alert and rely on our weekly process to review them. The Sidekiq documentation says the 25 retries take about 20 days, so if we ever missed a week we'd see it the next week, and if we missed 2 weeks we could look at the "dead jobs" list (see the console sketch after this list).
- We could treat the realtime queue latency as a metric rather than a threshold: collect the value and present it somewhere we can consult it and review its history (see the metric sketch after this list).
- We could try to create another monitor, potentially inheriting from the Sidekiq one, with its own endpoint and its own alert.
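If we did drop retries from the alert, the fallback review after a missed week or two could happen in a console against the dead set. A rough sketch; the exact fields available on each entry depend on the Sidekiq version:

```ruby
# Sketch: reviewing dead jobs in a console when the weekly review was missed.
# Entries in the DeadSet are jobs that exhausted all their retries.
require "sidekiq/api"

Sidekiq::DeadSet.new.each do |job|
  # klass, args, and error_message come from the job's underlying payload hash.
  puts "#{job.klass} #{job.args.inspect}: #{job["error_message"]}"
end
```

Treating latency as a metric could be as small as a scheduled job that reports the realtime queue latency as a gauge. A sketch assuming the dogstatsd-ruby gem and a Datadog agent listening on localhost; the metric name here is made up:

```ruby
# Sketch: reporting realtime queue latency as a Datadog gauge.
# Assumes dogstatsd-ruby and an agent on localhost:8125; run this on a schedule
# so the value builds up history we can graph and, later, alert on separately.
require "sidekiq/api"
require "datadog/statsd"

statsd  = Datadog::Statsd.new("localhost", 8125)
latency = Sidekiq::Queue.new("realtime").latency

statsd.gauge("sidekiq.realtime.queue_latency", latency)
statsd.close
```

The same gauge could also back the separate monitor idea above, since a threshold on this metric would alert on latency alone without being masked by retries.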