Track plan rejection history and automatically mark clients as ineligible #13421
New changelog entry (`@@ -0,0 +1,7 @@`):

```release-note:improvement
core: automatically mark clients with recurring plan rejections as ineligible
```

```release-note:improvement
metrics: emit `nomad.nomad.plan.rejection_tracker.node_score` metric for the number of times a node had a plan rejection within the past time window
```
```go
@@ -503,6 +503,10 @@ type ServerConfig struct {
	// This value is ignored.
	DefaultSchedulerConfig *structs.SchedulerConfiguration `hcl:"default_scheduler_config"`

	// PlanRejectionTracker configures the node plan rejection tracker that
	// detects potentially bad nodes.
	PlanRejectionTracker *PlanRejectionTracker `hcl:"plan_rejection_tracker"`

	// EnableEventBroker configures whether this server's state store
	// will generate events for its event stream.
	EnableEventBroker *bool `hcl:"enable_event_broker"`

@@ -548,6 +552,46 @@ type RaftBoltConfig struct {
	NoFreelistSync bool `hcl:"no_freelist_sync"`
}

// PlanRejectionTracker is used in servers to configure the plan rejection
// tracker.
type PlanRejectionTracker struct {
	// NodeThreshold is the number of times a node can have plan rejections
	// before it is marked as ineligible.
	NodeThreshold int `hcl:"node_threshold"`

	// NodeWindow is the time window used to track active plan rejections for
	// nodes.
	NodeWindow    time.Duration
	NodeWindowHCL string `hcl:"node_window" json:"-"`

	// ExtraKeysHCL is used by hcl to surface unexpected keys.
	ExtraKeysHCL []string `hcl:",unusedKeys" json:"-"`
}

func (p *PlanRejectionTracker) Merge(b *PlanRejectionTracker) *PlanRejectionTracker {
	if p == nil {
		return b
	}

	result := *p

	if b == nil {
		return &result
	}

	if b.NodeThreshold != 0 {
		result.NodeThreshold = b.NodeThreshold
	}

	if b.NodeWindow != 0 {
		result.NodeWindow = b.NodeWindow
	}
	if b.NodeWindowHCL != "" {
		result.NodeWindowHCL = b.NodeWindowHCL
	}
	return &result
}

// Search is used in servers to configure search API options.
type Search struct {
	// FuzzyEnabled toggles whether the FuzzySearch API is enabled. If not
```
```go
@@ -985,6 +1029,10 @@ func DefaultConfig() *Config {
		EventBufferSize: helper.IntToPtr(100),
		RaftProtocol:    3,
		StartJoin:       []string{},
		PlanRejectionTracker: &PlanRejectionTracker{
			NodeThreshold: 15,
			NodeWindow:    10 * time.Minute,
		},
```
**Review comment:** I'm also not sure about these defaults. The workers will retry the plan with exponential back-off, so depending on the cluster load and eval queue size it may take a while for the node to reach the threshold. So I picked these values based on a threshold number that is likely to indicate a real problem (it could maybe even be a little higher?), not just normal plan rejections, and a time window that encompasses a significant part of the rejection history but that will probably have expired by the time operators have detected, drained, and restarted the client. A short time window risks the threshold never being hit, and a large one risks a normal plan rejection triggering ineligibility since the node score can still be high.

**Reply:** Agreed. Let's try to dig up some real world … If we moved this into …

**Reply:** From our out-of-band discussion: let's leave this in the config file. If you're fiddling with it live, you can just mark bad nodes ineligible with existing commands and fiddle with these settings after your cluster is healthy. Hopefully we can find good enough defaults (not to mention root cause fixes!) that no one needs to worry about this anyway.
||
ServerJoin: &ServerJoin{ | ||
RetryJoin: []string{}, | ||
RetryInterval: 30 * time.Second, | ||
|
```go
@@ -1586,6 +1634,10 @@ func (s *ServerConfig) Merge(b *ServerConfig) *ServerConfig {
	result.EventBufferSize = b.EventBufferSize
	}

	if b.PlanRejectionTracker != nil {
		result.PlanRejectionTracker = result.PlanRejectionTracker.Merge(b.PlanRejectionTracker)
	}

	if b.DefaultSchedulerConfig != nil {
		c := *b.DefaultSchedulerConfig
		result.DefaultSchedulerConfig = &c
```
**Review comment:** I'm not sure about these names, but I wanted to hedge against us potentially needing to track plan rejections for other things, like evals or jobs. Feel free to modify them.