System scheduler blocked evals #5900

langmartin · 2019-06-27T17:44:57Z

System jobs now support blocking evaluations by node, in the case that a placement can't be made due to resource constraints. We keep a minimal set of blocked evals, indexed by job and by node, so they can be re-evaluated when node resources become available.

Changelog

fixes #5579

notnoop

I made a first pass, mostly by inspecting the code, and have some suggestions. While I follow the (un-)blocking logic here, I'd like to do a deeper review or have @preetapan do a closer examination.

notnoop · 2019-07-01T03:44:24Z

nomad/blocked_evals.go

+
+	// node maps a node id to the set of all blocked evals, mapped to their authentication tokens
+	// node map[string]map[*structs.Evaluation]string
+	node map[string]map[string]*wrappedEval


This seems to map the node ID and eval ID to the evaluation and its authentication token, rather than what the comment describes?

Also, I don't quite follow the field names, because they are in singular but refer to collections and don't indicate map structure.

I'd suggest clarifying the structure fields a bit more with a comment. Something like

systemEvals stores the blocked evaluations for system jobs, indexed by node and job to ease lookups in scheduler.

Then name the fields byJobID and byNodeID?

notnoop · 2019-07-01T03:47:58Z

nomad/blocked_evals.go

+}
+
+// setSystemEval creates the inner map if necessary
+func (b *BlockedEvals) setSystemEval(eval *structs.Evaluation, token string) {


setSystemEval and its siblings here operate purely on b.system and nothing else. I would consider having the functions be on `*systemEvals` and use the more conventional names `Get|Add|Remove`.
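A minimal sketch of what that could look like, assuming a simplified `Evaluation` type in place of `structs.Evaluation` and leaving out the authentication token that the real code stores next to each eval. The field names `byJobID` and `byNodeID` follow the naming suggested earlier in this review; everything else here is hypothetical.

```go
package main

import "fmt"

// Evaluation is a hypothetical stand-in for structs.Evaluation.
type Evaluation struct {
	ID     string
	JobID  string
	NodeID string
}

// systemEvals stores the blocked evaluations for system jobs, indexed by
// node and by job to ease lookups in the scheduler.
type systemEvals struct {
	// byJobID maps a job ID to a node ID to that job's blocked eval ID.
	byJobID map[string]map[string]string
	// byNodeID maps a node ID to an eval ID to the blocked eval.
	byNodeID map[string]map[string]*Evaluation
}

func newSystemEvals() *systemEvals {
	return &systemEvals{
		byJobID:  map[string]map[string]string{},
		byNodeID: map[string]map[string]*Evaluation{},
	}
}

// Add records eval under both indexes, creating inner maps as needed.
// An existing eval for the same job and node is replaced.
func (s *systemEvals) Add(eval *Evaluation) {
	if old, ok := s.Get(eval.JobID, eval.NodeID); ok {
		s.Remove(old)
	}
	if s.byJobID[eval.JobID] == nil {
		s.byJobID[eval.JobID] = map[string]string{}
	}
	if s.byNodeID[eval.NodeID] == nil {
		s.byNodeID[eval.NodeID] = map[string]*Evaluation{}
	}
	s.byJobID[eval.JobID][eval.NodeID] = eval.ID
	s.byNodeID[eval.NodeID][eval.ID] = eval
}

// Get returns the blocked eval for a job on a node, if there is one.
func (s *systemEvals) Get(jobID, nodeID string) (*Evaluation, bool) {
	evalID, ok := s.byJobID[jobID][nodeID]
	if !ok {
		return nil, false
	}
	eval, ok := s.byNodeID[nodeID][evalID]
	return eval, ok
}

// Remove deletes eval from both indexes and prunes empty inner maps.
func (s *systemEvals) Remove(eval *Evaluation) {
	delete(s.byJobID[eval.JobID], eval.NodeID)
	if len(s.byJobID[eval.JobID]) == 0 {
		delete(s.byJobID, eval.JobID)
	}
	delete(s.byNodeID[eval.NodeID], eval.ID)
	if len(s.byNodeID[eval.NodeID]) == 0 {
		delete(s.byNodeID, eval.NodeID)
	}
}

func main() {
	s := newSystemEvals()
	s.Add(&Evaluation{ID: "eval-1", JobID: "job-a", NodeID: "node-1"})
	if eval, ok := s.Get("job-a", "node-1"); ok {
		fmt.Println("blocked:", eval.ID)
	}
}
```

Keeping the inner-map bookkeeping inside these three methods means BlockedEvals never has to touch the nested maps directly.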

nomad/blocked_evals.go

notnoop · 2019-07-01T03:59:31Z

nomad/blocked_evals_test.go

-	}, func(err error) {
-		t.Fatalf("err: %s", err)
-	})
+	waitBrokerEnqueued(t, blocked, broker)


Thank you so much for extracting this code!

notnoop · 2019-07-01T04:08:12Z

nomad/blocked_evals_test.go


+func waitBrokerEnqueued(t *testing.T, blocked *BlockedEvals, broker *EvalBroker) {


Thank you so much for extracting this code! It was quite repetitive.

Looking at the function, it seems to assert that the blocked evaluations made it to the eval broker. So maybe requireBlockedEvalsEnqueued is more appropriate.

Also, is brokerStats.TotalReady always expected to be 1? Does it depend on how many evaluations are blocked in the test?

Also, while you are here, I'd suggest updating the error message to be more meaningful than "bad".

notnoop · 2019-07-01T04:13:17Z

scheduler/system_sched.go

@@ -390,3 +394,28 @@ func (s *SystemScheduler) computePlacements(place []allocTuple) error {

 	return nil
 }
+
+// addBlocked creates a new blocked eval for this job on this node
+// - Keep blocked evals in the scheduler, but just for grins


Not sure I understand this comment. Why do we do it for grins? Is this following a pattern used in non-system jobs? I'd suggest avoiding it if it's not needed.

notnoop · 2019-07-01T04:19:05Z

scheduler/system_sched.go

@@ -18,6 +18,7 @@ const (

 // SystemScheduler is used for 'system' jobs. This scheduler is
 // designed for services that should be run on every client.
+// One for each job, containing an allocation for each node


This is true for all schedulers I believe. Each scheduler instance is created to schedule a single eval.

notnoop · 2019-07-01T04:20:25Z

scheduler/system_sched.go

+	if s.blocked == nil {
+		s.blocked = map[string]*structs.Evaluation{}
+	}
+	s.blocked[node.ID] = blocked


I had a hard time following this code because the name blocked is reused to refer to both the blocked evaluation and the container of blocked evals.
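For example, something along these lines would keep the two apart. The names `blockedByNode` and `blockedEval` are only suggestions, and the `scheduler` and `Evaluation` types are simplified stand-ins for the real ones.

```go
package main

import "fmt"

// Evaluation is a hypothetical stand-in for structs.Evaluation.
type Evaluation struct {
	ID     string
	NodeID string
}

// scheduler is a simplified stand-in for the system scheduler.
type scheduler struct {
	// blockedByNode maps a node ID to the blocked eval created for it.
	blockedByNode map[string]*Evaluation
}

// addBlocked records blockedEval for the given node, creating the
// container lazily on first use.
func (s *scheduler) addBlocked(nodeID string, blockedEval *Evaluation) {
	if s.blockedByNode == nil {
		s.blockedByNode = map[string]*Evaluation{}
	}
	s.blockedByNode[nodeID] = blockedEval
}

func main() {
	s := &scheduler{}
	s.addBlocked("node-1", &Evaluation{ID: "eval-1", NodeID: "node-1"})
	fmt.Println(len(s.blockedByNode))
}
```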

notnoop

The code lgtm. I have a couple of clarifying questions before marking it as good.

Also, I'd highly encourage you to rebase and squash some changes. I personally use git blame a lot and do some repository archaeology when debugging issues or trying to understand some code; having a lot of commits for a relatively small number of changed lines adds speed bumps to that process, especially when some changes are renames of variables introduced in the PR. I would suggest either squashing before merging to have logical commits or using the "Squash & Merge" GitHub feature for merging.

notnoop · 2019-07-16T14:14:15Z

nomad/blocked_evals.go

+		return
+	}
+
+	// QUESTION is it dangerous to delete this first?


I believe this is fine - but @preetapan should know best. If we want to be extra safe, we may need to have EnqueueAll return error on failure.

This should be fine and follows a similar pattern as other places in eval broker where we need to unblock evals and re-enqueue them.

Currently EnqueueAll can silently enqueue nothing if this node loses leadership between the time this is called and the point where it needs to enqueue. But that looks like an expected case, handled by this method, which is called during leadership transitions:

nomad/nomad/leader.go

Line 290 in 9134274

func (s *Server) restoreEvals() error {

notnoop · 2019-07-16T14:19:51Z

nomad/blocked_evals_system.go

+// systemEvals are handled specially, each job may have a blocked eval on each node
+type systemEvals struct {
+	// byJob maps a jobID to a nodeID to that job's single blocked evalID on that node
+	byJob map[structs.NamespacedID]map[string]string


More of a question: Can a system job have multiple TaskGroups - would they always be associated with the same eval id?

The eval is always at the job level, but if the system job has multiple task groups the scheduler will create multiple allocations. We always evaluate every task group in the job in the reconciler.

notnoop · 2019-07-16T14:24:34Z

nomad/worker.go

@@ -254,6 +254,7 @@ func (w *Worker) invokeScheduler(snap *state.StateSnapshot, eval *structs.Evalua
 	}

 	// Create the scheduler, or use the special system scheduler
+	// QUESTION: does this mean "special core scheduler"


Yes! This is the special core scheduler that runs some administrative tasks (e.g. GC).

notnoop · 2019-07-16T14:26:15Z

nomad/blocked_evals.go

@@ -47,6 +47,11 @@ type BlockedEvals struct {
 	// classes.
 	escaped map[string]wrappedEval

+	// system is the set of system evaluations that failed to start on nodes because of
+	// resource constraints. Retried on any change for those nodes
+	// map[node.ID][token]eval


Is this type declaration needed? If so, maybe elaborate how one should read it?

Suggested change

// map[node.ID][token]eval

preetapan · 2019-07-16T15:58:04Z

nomad/blocked_evals.go

+
+	b.evalBroker.EnqueueAll(evals)
+}
+


You'll need to clear out the system evals from in memory state in the flush method. That gets called when the eval broker is disabled (when the node loses leadership).

Also add a test to eval_broker_test.go where some blocked evals are in the state map and when SetDisabled is called it should no longer be in memory.
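A minimal sketch of the behavior being asked for, using a simplified, hypothetical tracker rather than the real BlockedEvals type. Disabling is modeled here as `SetEnabled(false)`, which is an assumption about the shape of the API and not a claim about the actual method.

```go
package main

import (
	"fmt"
	"sync"
)

// Evaluation is a hypothetical stand-in for structs.Evaluation.
type Evaluation struct {
	ID     string
	NodeID string
}

// blockedEvals is a simplified stand-in for the blocked evals tracker.
type blockedEvals struct {
	mu      sync.Mutex
	enabled bool
	// system maps a node ID to an eval ID to the blocked system eval.
	system map[string]map[string]*Evaluation
}

// Block records a blocked system eval; it is a no-op while disabled.
func (b *blockedEvals) Block(eval *Evaluation) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if !b.enabled {
		return
	}
	if b.system == nil {
		b.system = map[string]map[string]*Evaluation{}
	}
	if b.system[eval.NodeID] == nil {
		b.system[eval.NodeID] = map[string]*Evaluation{}
	}
	b.system[eval.NodeID][eval.ID] = eval
}

// SetEnabled toggles the tracker and flushes in-memory state when it is
// disabled, as happens when the node loses leadership.
func (b *blockedEvals) SetEnabled(enabled bool) {
	b.mu.Lock()
	b.enabled = enabled
	b.mu.Unlock()
	if !enabled {
		b.Flush()
	}
}

// Flush clears all in-memory state, including the system evals.
func (b *blockedEvals) Flush() {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.system = map[string]map[string]*Evaluation{}
}

// numSystem returns the number of blocked system evals across all nodes.
func (b *blockedEvals) numSystem() int {
	b.mu.Lock()
	defer b.mu.Unlock()
	n := 0
	for _, evals := range b.system {
		n += len(evals)
	}
	return n
}

func main() {
	b := &blockedEvals{}
	b.SetEnabled(true)
	b.Block(&Evaluation{ID: "eval-1", NodeID: "node-1"})
	fmt.Println("before disable:", b.numSystem())
	b.SetEnabled(false)
	fmt.Println("after disable:", b.numSystem())
}
```

The requested test would follow the same shape: block a system eval, disable, and assert that nothing remains in memory.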

preetapan · 2019-07-16T16:34:43Z

nomad/blocked_evals.go

+
+	b.evalBroker.EnqueueAll(evals)
+}
+
 // watchCapacity is a long lived function that watches for capacity changes in
 // nodes and unblocks the correct set of evals.
 func (b *BlockedEvals) watchCapacity(stopCh <-chan struct{}, changeCh <-chan *capacityUpdate) {


We need a way to persist these such that a leadership transition causes the system evals to live on the new leader node.

The blocked evals are created through the FSM, so when a new leader is elected, they will be replayed locally into memory via restoreEvals in leader.go.

preetapan · 2019-07-18T14:24:57Z

@langmartin there's one outstanding suggestion #5900 (comment)

Otherwise LGTM

shantanugadgil · 2019-07-19T15:59:22Z

Will this solve #4267 ?

langmartin · 2019-07-19T16:37:31Z

Will this solve #4267 ?

I'll have to investigate that, it may.

github-actions · 2023-02-06T02:15:14Z

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

notnoop reviewed Jul 1, 2019


langmartin requested a review from notnoop July 10, 2019 22:14

preetapan added the 0.9.4 label Jul 15, 2019

notnoop reviewed Jul 16, 2019


preetapan reviewed Jul 16, 2019


langmartin force-pushed the f-system-sched-blocked-evals branch from 7e9de21 to 645be7e Compare July 17, 2019 19:25

preetapan approved these changes Jul 18, 2019


langmartin added 7 commits July 18, 2019 10:32

blocked_evals system evals indexed by job and node

4b93a08

system_sched submits failed evals as blocked

2d8bfb8

fsm attach UnblockNode on node updates

7e18bd6

blocked_evals_test Test_UnblockNode

e6820cd

blocked_evals reset system evals on Flush

5ee4ffb

worker comment system -> core

cdaec89

blocked_evals_test disable calls Flush

e0edd11

langmartin force-pushed the f-system-sched-blocked-evals branch from 0d8437f to e0edd11 Compare July 18, 2019 14:32

langmartin merged commit 5f44755 into master Jul 18, 2019

langmartin deleted the f-system-sched-blocked-evals branch July 18, 2019 20:13

preetapan mentioned this pull request Oct 11, 2019

Filtered evals should also be treated as blocked in system scheduler #6482

Open

notnoop mentioned this pull request Jul 9, 2020

System jobs are not rescheduled when resources become available #4072

Closed

notnoop mentioned this pull request Aug 10, 2021

system: re-evaluate node on feasability changes #11007

Merged

github-actions bot locked as resolved and limited conversation to collaborators Feb 6, 2023
