
Eval Broker: Prevent redundant enqueues when a node is not a leader #5699

Merged
merged 4 commits into master on May 15, 2019

Conversation

endocrimes
Copy link
Contributor

fixes #4670

Currently, when an eval broker is disabled, it still receives delayed enqueues via log application in the FSM. This causes an ever-growing heap of evaluations that will never be drained, which can cause memory issues in busier clusters or when a cluster runs for an extended period without a leader election.
This PR prevents the enqueuing of evaluations while the broker is disabled, and relies on the leader's restoreEvals routine to reconcile state during a leadership transition.

Existing dequeues during an Enabled->Disabled broker state transition are handled by the enqueueLocked function dropping evals.
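The guard described above can be sketched as follows. This is a minimal standalone illustration, not Nomad's actual implementation: Broker, Eval, and the method names are simplified stand-ins for structs.Evaluation and the real EvalBroker API.

```go
package main

import (
	"fmt"
	"sync"
)

// Eval is a hypothetical stand-in for structs.Evaluation.
type Eval struct{ ID string }

// Broker sketches the eval broker's enabled gate.
type Broker struct {
	l       sync.RWMutex
	enabled bool
	ready   []*Eval
}

// enqueueLocked drops evaluations while the broker is disabled, so a
// follower applying Raft log entries never accumulates work it cannot drain.
func (b *Broker) enqueueLocked(e *Eval) {
	if !b.enabled {
		return // not the leader: drop instead of queueing forever
	}
	b.ready = append(b.ready, e)
}

func (b *Broker) Enqueue(e *Eval) {
	b.l.Lock()
	defer b.l.Unlock()
	b.enqueueLocked(e)
}

func (b *Broker) SetEnabled(enabled bool) {
	b.l.Lock()
	defer b.l.Unlock()
	b.enabled = enabled
}

func (b *Broker) Len() int {
	b.l.RLock()
	defer b.l.RUnlock()
	return len(b.ready)
}

func main() {
	b := &Broker{}
	b.Enqueue(&Eval{ID: "a"}) // disabled: dropped
	b.SetEnabled(true)
	b.Enqueue(&Eval{ID: "b"}) // enabled: queued
	fmt.Println(b.Len()) // prints 1
}
```

On a real leader transition the dropped evaluations are not lost: restoreEvals repopulates the broker from state when leadership is acquired.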

Primarily a cleanup commit; however, there is currently a potential race
condition (one that I'm not sure we've ever actually hit) during a flapping
SetEnabled/Disabled state, where we may never correctly restart the eval
broker if it is called from multiple goroutines.
@endocrimes endocrimes marked this pull request as ready for review May 14, 2019 15:22
@endocrimes endocrimes requested a review from preetapan May 14, 2019 15:22
return nil, time.Time{}
}
nextEval := b.delayHeap.Peek()
b.l.RUnlock()
Member

The lock was originally released here explicitly, rather than using defer, to avoid holding on to it after peeking into the heap. I am not certain this was the root cause of the non-leader enqueues. Was this more of a clarity fix?

Contributor Author

Entirely a clarity fix - I can revert, I thought pulling out the eval would be fast enough that it’s not a big deal.

Member

It's fine to leave it as is; the new lines of execution included in the lock's scope are not that expensive.

@@ -778,13 +785,13 @@ func (b *EvalBroker) runDelayedEvalsWatcher(ctx context.Context, updateCh <-chan
// This peeks at the heap to return the top. If the heap is empty, this returns nil and zero time.
func (b *EvalBroker) nextDelayedEval() (*structs.Evaluation, time.Time) {
b.l.RLock()
defer b.l.RUnlock()
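The trade-off the reviewers discuss can be seen side by side. A minimal sketch with illustrative types (broker and its int slice are stand-ins, not Nomad's real delay heap):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// broker is a simplified stand-in for EvalBroker.
type broker struct {
	l    sync.RWMutex
	heap []int // pretend delay heap; index 0 is the top
}

// peekExplicit releases the read lock as soon as the peek is done,
// the style the original code used.
func (b *broker) peekExplicit() (int, time.Time) {
	b.l.RLock()
	top := b.heap[0]
	b.l.RUnlock()
	return top, time.Time{}
}

// peekDefer holds the read lock until return. Since only a cheap heap
// peek happens inside the critical section, the wider scope is harmless,
// which is the conclusion the reviewers reach.
func (b *broker) peekDefer() (int, time.Time) {
	b.l.RLock()
	defer b.l.RUnlock()
	return b.heap[0], time.Time{}
}

func main() {
	b := &broker{heap: []int{42}}
	v1, _ := b.peekExplicit()
	v2, _ := b.peekDefer()
	fmt.Println(v1, v2) // prints 42 42
}
```

The defer form is harder to get wrong on early-return paths, which is the usual reason to prefer it when the critical section is cheap.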

Member

Could we add a unit test in eval_broker_test.go? Suggestion: create two eval brokers with one of them enabled, then disable it while enabling the other one in another goroutine. The test should verify that the flush method drained everything on the previously enabled eval broker.

Contributor Author

It would be pretty hard to actually validate this on CI, because we set GOMAXPROCS to one.

We'd only need a single broker in a test, though, because they don't interact with each other?
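The single-broker version of the suggested test could look roughly like this. evalBroker and its methods are simplified stand-ins, not the real eval_broker API:

```go
package main

import (
	"fmt"
	"sync"
)

// evalBroker sketches only the behavior under test: disabling the
// broker flushes its queue, and enqueues while disabled are dropped.
type evalBroker struct {
	l       sync.Mutex
	enabled bool
	ready   []string
}

func (b *evalBroker) SetEnabled(enabled bool) {
	b.l.Lock()
	defer b.l.Unlock()
	b.enabled = enabled
	if !enabled {
		b.ready = nil // flush everything on Enabled->Disabled
	}
}

func (b *evalBroker) Enqueue(id string) {
	b.l.Lock()
	defer b.l.Unlock()
	if b.enabled {
		b.ready = append(b.ready, id)
	}
}

func (b *evalBroker) Stats() int {
	b.l.Lock()
	defer b.l.Unlock()
	return len(b.ready)
}

func main() {
	b := &evalBroker{}
	b.SetEnabled(true)
	b.Enqueue("eval-1")
	b.Enqueue("eval-2")
	before := b.Stats()

	// Disabling must drain the queue, and later enqueues must be dropped.
	b.SetEnabled(false)
	b.Enqueue("eval-3")
	fmt.Println(before, b.Stats()) // prints 2 0
}
```

A real test would assert the same transitions with testify and the actual EvalBroker constructor; the flapping-goroutine variant remains hard to exercise deterministically under GOMAXPROCS=1, as noted above.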

@preetapan preetapan left a comment (Member)

LGTM

@endocrimes endocrimes merged commit 781c94b into master May 15, 2019
@endocrimes endocrimes deleted the dani/b-eval-broker-lifetime branch May 15, 2019 22:31
@notnoop (Contributor) commented May 16, 2019

Code looks good, but a question:

This PR prevents the enqueuing of evaluations while we are disabled, and relies on the leader restoreEvals routine to handle reconciling state during a leadership transition.

How well tested is the path with restoreEvals and leader transitions? Do we need to do follow-up manual testing, by any chance?

@endocrimes (Contributor Author)

@notnoop I did a fair amount of manual testing - nomad/leader_test.go has less testing than I'd like, though.

@github-actions

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 10, 2023

Successfully merging this pull request may close these issues.

Increased total_waiting counter for eval broker on non leader nodes