add and emit pool owner metadata for alerting #327
Conversation
Signed-off-by: Max Falk <gfalk@yelp.com>
minor nits, but nothing blocking :)
acceptance/srv-configs/clusterman-clusters/local-dev/default.mesos (Outdated, resolved)
clusterman/autoscaler/autoscaler.py (Outdated)
```diff
@@ -181,7 +181,7 @@ def run(self, dry_run: bool = False, timestamp: Optional[arrow.Arrow] = None) ->
         self.target_capacity_gauge.set(new_target_capacity, {"dry_run": dry_run})
         self.max_capacity_gauge.set(
             self.pool_manager.max_capacity,
-            {"dry_run": dry_run, "alert_on_max_capacity": self.pool_manager.alert_on_max_capacity},
+            {"dry_run": dry_run, "alert_on_max_capacity": self.pool_manager.alert_on_max_capacity, "team": self.pool_manager.pool_owner},
```
should we s/team/pool_owner to keep things consistent between config and metric labeling?
I knowingly set it to `team` because that's what our alertmanager automatically resolves for ticketing/alerting. If we don't set it to `team` here, we'll have to relabel in the prometheus query for the max_pool_capacity alerting. Still want to keep it at `pool_owner`? @nemacysts @jfongatyelp
oh, that's fine then - might be good to add that in a comment here tho since i'm not sure how many people are aware of that (or maybe i'm just the only one that didn't know this :p)
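A sketch of what that inline note might look like (the comment wording is hypothetical; the gauge call and labels are taken from the diff above):

```python
# NOTE: this label is intentionally named "team" rather than "pool_owner":
# our alertmanager automatically resolves the "team" label for
# ticketing/alerting, so emitting it directly avoids relabeling in the
# Prometheus query used for max_pool_capacity alerting.
self.max_capacity_gauge.set(
    self.pool_manager.max_capacity,
    {"dry_run": dry_run, "alert_on_max_capacity": self.pool_manager.alert_on_max_capacity, "team": self.pool_manager.pool_owner},
)
```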
Signed-off-by: Max Falk <gfalk@yelp.com>
clusterman/autoscaler/autoscaler.py (Outdated)
```python
            {
                "dry_run": dry_run,
                "alert_on_max_capacity": self.pool_manager.alert_on_max_capacity,
                "team": self.pool_manager.pool_owner,
```
@gmdfalk actually, I was just talking to @EmanElsaban about another alert we wanted to use internally and we realized that having a `team` label on all of the clusterman metrics would be useful - thoughts on adding the label to everything else in this PR?
@nemacysts Yeah, probably makes sense! In that case, I think we should be able to add the team label to `autoscaler._emit_requested_resource_metrics` and `drainer._emit_draining_metrics` to tag onto most if not all metrics emitted?
++ - if we have a centralized place that sounds even better!
I'm a little torn about whether or not this would be of use for the drainer metrics: is there any case where we're not at fault for the drainer not working and we'd want pool owners to deal with drainer pages?
although, i suppose it would be useful for filtering regardless :)
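A rough sketch of the centralized approach discussed above (everything except the method name and the `team` label is an assumption; the real signature and gauge bookkeeping live in the PR, not here):

```python
# Hypothetical sketch, not the PR's actual implementation.
def _emit_requested_resource_metrics(self, resource_totals, timestamp):
    # Build the shared label set once so every metric emitted below carries
    # the pool owner under the "team" key that alertmanager resolves.
    dimensions = {"team": self.pool_manager.pool_owner}
    for resource, total in resource_totals.items():
        # self._requested_resource_gauges is a stand-in name for whatever
        # per-resource gauges the autoscaler actually keeps.
        self._requested_resource_gauges[resource].set(total, dimensions)
```

The same `dimensions` dict could be threaded through `drainer._emit_draining_metrics`, which would also cover the filtering use case mentioned above.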
* master:
  CLUSTERMAN-812: upgrade k8s client library (#334)
  Revert "Revert "Add support to use in-cluster service account (#338)"" (#340)
  Revert "Add support to use in-cluster service account (#338)"
  Add support to use in-cluster service account (#338)
  Updated PATH variable in supervisod.conf
  Grant home dir permission to user nobody
  Failing on signal errors
  pin cryptography for acceptance venv
  bump yelp-batch to 11.2.7
  using right targz
  Adding new targz with metrics fix
  Adding old acceptance test
  Revert "Revert "Upgrading clusterman image to Ubuntu Jammy""
Signed-off-by: Max Falk <gfalk@yelp.com>
This reverts commit 2595b5e.
We'd like to have `pool_owner` metadata on each pool for both informational and alerting purposes. The new attribute on the pool config, `pool_owner`, defaults to `compute_infra`.

Signed-off-by: Max Falk <gfalk@yelp.com>
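For illustration, a minimal sketch of how the new attribute might be read (the `staticconf` call and the namespace value are assumptions; only the `pool_owner` key and the `compute_infra` default come from the description above):

```python
import staticconf

# Hypothetical pool namespace, following the <pool>.<scheduler> naming used
# by the acceptance config touched in this PR (default.mesos).
pool_namespace = "default.mesos"

# Read the new pool-level attribute, falling back to the documented default
# when a pool does not declare an owner.
pool_owner = staticconf.read_string("pool_owner", default="compute_infra", namespace=pool_namespace)
```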
Description
Please fill out!
Testing Done
Please fill out! Generally speaking, any new features should include additional unit or integration tests to ensure the behaviour is working correctly.