Make number of scheduler workers reloadable #11593

angrycub · 2021-11-30T23:01:46Z

This PR:

- Enables the server.num_schedulers and server.enabled_schedulers values to be hot-reloadable.
- Adds an API that lets operators make temporary changes to these values on a per-server basis.
- Adds an API to get the status of the scheduler workers and their internal workloads.

This closes #11449

DerekStrickland

Beautiful work.

tgross

I've re-reviewed the API and HTTP agent sections and I'm putting that review up for you @angrycub. I'll re-review the worker/leader/server section next.

tgross · 2022-01-03T16:33:12Z

api/agent.go

+}
+
+// SetSchedulerWorkerConfig attempts to update the targeted agent's worker pool configuration
+func (a *Agent) SetSchedulerWorkerConfig(args SchedulerWorkerPoolArgs) (*SchedulerWorkerPoolArgs, error) {


I think we need WriteOptions here (and QueryOptions for GetSchedulerConfig above) to support ACLs and any HTTP params we might want in the future. We always seem to want them eventually, and adding them now means we don't have to add a SetSchedulerWorkerConfigWithOptions later on.

We can probably get away without returning a QueryMeta here, because everything QueryMeta is used for comes out of Raft? None of the other agent APIs in this file return one.
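To make the suggestion above concrete, here is a minimal sketch of a call that accepts an options parameter. The WriteOptions type below is a simplified local stand-in (not the real api.WriteOptions), newSetConfigRequest is a hypothetical helper, and the fields of SchedulerWorkerPoolArgs are assumed for illustration. The request is only built, never sent.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// WriteOptions is a simplified stand-in for the api package's write options.
// Only the fields needed for this illustration are included (assumption).
type WriteOptions struct {
	AuthToken string // ACL token to send with the request
	Region    string // optional region query parameter
}

// SchedulerWorkerPoolArgs reuses the type name from the diff above;
// the field set here is assumed for illustration.
type SchedulerWorkerPoolArgs struct {
	NumSchedulers     int
	EnabledSchedulers []string
}

// newSetConfigRequest is a hypothetical helper. Taking *WriteOptions up front
// lets callers pass an ACL token or query params without a second
// "...WithOptions" method being added later. A nil q is allowed.
func newSetConfigRequest(baseURL string, args SchedulerWorkerPoolArgs, q *WriteOptions) (*http.Request, error) {
	body, err := json.Marshal(args)
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPut, baseURL+"/v1/agent/schedulers/config", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	if q == nil {
		return req, nil
	}
	if q.AuthToken != "" {
		req.Header.Set("X-Nomad-Token", q.AuthToken)
	}
	if q.Region != "" {
		params := req.URL.Query()
		params.Set("region", q.Region)
		req.URL.RawQuery = params.Encode()
	}
	return req, nil
}

func main() {
	args := SchedulerWorkerPoolArgs{NumSchedulers: 4, EnabledSchedulers: []string{"batch", "service"}}
	opts := &WriteOptions{AuthToken: "example-token", Region: "global"}
	req, err := newSetConfigRequest("http://127.0.0.1:4646", args, opts)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(req.Method, req.URL.String())
}
```

Callers that don't need ACLs or extra params can pass nil for the options, so the signature stays stable as requirements grow.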

tgross · 2022-01-03T16:38:42Z

command/agent/agent_endpoint.go

+	switch req.Method {
+	case "PUT", "POST":
+		return s.UpdateScheduleWorkersConfig(resp, req)
+	case "GET":
+		return s.GetScheduleWorkersConfig(resp, req)


We're probably not consistent about this across the code base, but good to use the constants for new code at least:

Suggested change

-	switch req.Method {
-	case "PUT", "POST":
-		return s.UpdateScheduleWorkersConfig(resp, req)
-	case "GET":
-		return s.GetScheduleWorkersConfig(resp, req)
+	switch req.Method {
+	case http.MethodPut, http.MethodPost:
+		return s.UpdateScheduleWorkersConfig(resp, req)
+	case http.MethodGet:
+		return s.GetScheduleWorkersConfig(resp, req)
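
For reference, a small self-contained sketch of dispatching on the net/http method constants. The handler name and response bodies are hypothetical; only the constants and the switch shape come from the suggestion above.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// schedulerConfigHandler is a hypothetical stand-in for the agent endpoint.
// It dispatches on the request method using the net/http constants rather
// than string literals.
func schedulerConfigHandler(w http.ResponseWriter, r *http.Request) {
	switch r.Method {
	case http.MethodPut, http.MethodPost:
		fmt.Fprint(w, "update")
	case http.MethodGet:
		fmt.Fprint(w, "get")
	default:
		http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
	}
}

// call runs the handler against an in-memory request and returns the
// response status code and body.
func call(method string) (int, string) {
	req := httptest.NewRequest(method, "/v1/agent/schedulers/config", nil)
	rec := httptest.NewRecorder()
	schedulerConfigHandler(rec, req)
	return rec.Code, rec.Body.String()
}

func main() {
	for _, m := range []string{http.MethodGet, http.MethodPut, http.MethodDelete} {
		code, body := call(m)
		fmt.Printf("%s -> %d %q\n", m, code, body)
	}
}
```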

tgross · 2022-01-03T16:42:19Z

command/agent/agent_endpoint.go

+	}
+}
+
+func (s *HTTPServer) GetScheduleWorkersConfig(resp http.ResponseWriter, req *http.Request) (interface{}, error) {


This is the implementation for AgentSchedulerWorkerConfigRequest and isn't used in other packages, right? Usually we'll want to avoid exporting it (e.g. name it getScheduleWorkersConfig). The same applies to the update implementation.

angrycub · 2022-01-04

I think I'd done that to help the OpenAPI generator's auto-documentation process, but making them private is definitely more correct.

tgross · 2022-01-03T16:44:46Z

command/agent/agent_endpoint.go

+type agentSchedulerWorkerConfig struct {
+	ServerID          string   `json:"server_id,omitempty"`
+	NumSchedulers     int      `json:"num_schedulers"`
+	EnabledSchedulers []string `json:"enabled_schedulers"`
+}
+
+type agentSchedulerWorkersInfo struct {
+	ServerID   string                     `json:"server_id"`
+	Schedulers []agentSchedulerWorkerInfo `json:"schedulers"`
+}
+
+type agentSchedulerWorkerInfo struct {
+	ID                string   `json:"id"`
+	EnabledSchedulers []string `json:"enabled_schedulers"`
+	Started           string   `json:"started"`
+	Status            string   `json:"status"`
+	WorkloadStatus    string   `json:"workload_status"`
+}


Aren't these always going to be 1:1 with the api structs? I think you can import the api package here and use them directly.

tgross · 2022-01-03T16:55:54Z

command/agent/agent_endpoint_test.go

+			t.Run(tc.name, func(t *testing.T) {
+
+				req, err := http.NewRequest(tc.request.verb, "/v1/agent/schedulers/config", bytes.NewReader([]byte(tc.request.requestBody)))
+				require.Nil(t, err)


Nitpick: while require.Nil(t, err) and require.NoError(t, err) test the same thing, we tend to use NoError for clarity.

tgross · 2022-01-03T16:57:45Z

command/agent/agent_endpoint_test.go

@@ -1463,3 +1464,586 @@ func TestHTTP_XSS_Monitor(t *testing.T) {
 		})
 	}
 }
+


I love these giant table-driven tests!

tgross

Ok @angrycub I've looked through the second half of the work and this is looking great. I've left a few remarks about locking in the worker.go but I think other than that there's nothing too serious here.

tgross · 2022-01-03T18:38:20Z

nomad/leader_test.go

+	// this satisfies the require.Eventually test interface
+	checkPaused := func(count int) func() bool {
+		return func() bool {
+			workers := pausedWorkers()


Because this is all closures inside the test function, it's probably ok to inline the body of pausedWorkers here in the checkPaused function.

tgross · 2022-01-03T18:44:10Z

nomad/worker_test.go

+func TestWorker_WorkerInfo_String(t *testing.T) {
+	t.Parallel()
+	startTime := time.Date(2009, time.November, 10, 23, 0, 0, 0, time.UTC)
+	w := &Worker{
+		id:                "uuid",
+		start:             startTime,
+		status:            WorkerStarted,
+		workloadStatus:    WorkloadBackoff,
+		enabledSchedulers: []string{structs.JobTypeCore, structs.JobTypeBatch, structs.JobTypeSystem},
+	}
+	_, err := json.Marshal(w)
+	require.NoError(t, err)
+
+	require.Equal(t, `{"id":"uuid","enabled_schedulers":["_core","batch","system"],"started":"2009-11-10T23:00:00Z","status":"Started","workload_status":"Backoff"}`, fmt.Sprint(w.Info()))
+}


This test feels like it's just testing the stdlib's encoding/json package behavior... we can probably drop this one.

tgross · 2022-01-03T18:46:40Z

nomad/worker.go

+}
+
+// _newWorker creates a worker without calling its Start func. This is useful for testing.
+func newWorker(ctx context.Context, srv *Server, args SchedulerWorkerPoolArgs) (*Worker, error) {


This function doesn't ever return an error (which is pretty typical for newBlahblah functions), so I think we can drop that return value and then clean up all the cases where we're doing w, _ := newWorker(...)

tgross · 2022-01-03T18:56:08Z

nomad/worker.go

+
+// setWorkloadStatus is used internally to the worker to update the
+// status of the worker based updates from the workload.
+func (w *Worker) setWorkloadStatus(newStatus SchedulerWorkerStatus) {


This is a great idea for making the scheduler worker behavior more observable. Having the trace log here is great but I can totally see writing a bpftrace script that hooks this function to read the stack args and catch all the transitions too.

(I might have done this in its own PR, but as long as it's here now we might as well enjoy it.)

tgross · 2022-01-03T19:11:24Z

nomad/worker.go

 		w.pauseCond.Wait()
 	}
+
+	w.pauseLock.Unlock()


I think there's a race here if a Pause method call comes in after this line but before the status is set and we return. Even with the locks, the Pause method's status-setting calls could be interleaved such that the worker sets its status to WorkerPaused, sets the pause flag, and then its status is set to WorkerStarted.

Maybe we should move this up to the top of the function as a defer w.pauseLock.Unlock()?
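
A minimal sketch of the deferred-unlock shape being proposed, using a toy worker rather than the real Worker type. All names here (worker, maybeWait, the string statuses, runPauseResume) are stand-ins chosen for illustration. The point is that the lock stays held from the pause check through the final status write, so a concurrent Pause cannot land in between.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// worker is a toy stand-in for the scheduler worker (assumption).
type worker struct {
	mu     sync.Mutex
	cond   *sync.Cond
	paused bool
	status string
}

func newWorker() *worker {
	w := &worker{status: "Started"}
	w.cond = sync.NewCond(&w.mu)
	return w
}

// Pause sets the flag and the status under the same lock.
func (w *worker) Pause() {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.paused = true
	w.status = "Pausing"
}

// Resume clears the flag and wakes any waiter.
func (w *worker) Resume() {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.paused = false
	w.cond.Broadcast()
}

// maybeWait blocks while the worker is paused. Because the unlock is
// deferred, the final status write happens before the lock is released.
func (w *worker) maybeWait() {
	w.mu.Lock()
	defer w.mu.Unlock()
	if !w.paused {
		return
	}
	w.status = "Paused"
	for w.paused {
		w.cond.Wait() // releases mu while waiting, reacquires before returning
	}
	w.status = "Started"
}

func (w *worker) Status() string {
	w.mu.Lock()
	defer w.mu.Unlock()
	return w.status
}

// runPauseResume drives one pause/resume cycle and returns the final status.
func runPauseResume() string {
	w := newWorker()
	w.Pause()

	done := make(chan struct{})
	go func() {
		w.maybeWait()
		close(done)
	}()

	// Poll until the worker has parked itself in the paused state.
	for w.Status() != "Paused" {
		time.Sleep(time.Millisecond)
	}
	w.Resume()
	<-done
	return w.Status()
}

func main() {
	fmt.Println(runPauseResume())
}
```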

tgross · 2022-01-03T19:33:04Z

nomad/worker.go

+	defer func() {
+		w.setWorkloadStatus(WorkloadStopped)
+		w.markStopped()
+	}()
+	w.setStatus(WorkerStarted)
+	w.setWorkloadStatus(WorkloadRunning)


We pair up the status and workload status updates more often than not (see maybeWait above) and here we're not doing it atomically. It's probably safe here but easy to accidentally split up locked function calls later so that it's unsafe. So it might be a good idea to have a combined setStatus(workerStatus, workloadStatus) function that takes care of both and does the nice bit you've done where it only logs on change.
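
One possible shape for such a combined setter, as a sketch only. The type names, constants, and the transitions slice (standing in for trace logging) are assumptions and do not reflect the code that was merged.

```go
package main

import (
	"fmt"
	"sync"
)

// Hypothetical status types for illustration.
type WorkerStatus int
type WorkloadStatus int

const (
	WorkerStarted WorkerStatus = iota
	WorkerPaused
	WorkerStopped
)

const (
	WorkloadRunning WorkloadStatus = iota
	WorkloadBackoff
	WorkloadStopped
)

// Worker is a toy stand-in that only tracks the two statuses.
type Worker struct {
	mu             sync.Mutex
	status         WorkerStatus
	workloadStatus WorkloadStatus
	transitions    []string // stands in for trace log output
}

// setStatuses updates both statuses under a single lock acquisition so that
// readers never observe one updated without the other. A transition is
// recorded only when a value actually changes.
func (w *Worker) setStatuses(ws WorkerStatus, wl WorkloadStatus) {
	w.mu.Lock()
	defer w.mu.Unlock()

	if w.status != ws {
		w.transitions = append(w.transitions, fmt.Sprintf("status %d -> %d", w.status, ws))
		w.status = ws
	}
	if w.workloadStatus != wl {
		w.transitions = append(w.transitions, fmt.Sprintf("workload %d -> %d", w.workloadStatus, wl))
		w.workloadStatus = wl
	}
}

func main() {
	w := &Worker{}
	w.setStatuses(WorkerPaused, WorkloadBackoff)  // two changes recorded
	w.setStatuses(WorkerPaused, WorkloadBackoff)  // no change, nothing recorded
	w.setStatuses(WorkerStopped, WorkloadStopped) // two more changes recorded
	for _, t := range w.transitions {
		fmt.Println(t)
	}
}
```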

tgross · 2022-01-03T19:36:50Z

nomad/worker_test.go

+	w._start(testWorkload)
+	require.Eventually(t, w.IsStarted, longWait, tinyWait, "should have started")
+
+	go func() {
+		time.Sleep(tinyWait)
+		w.Pause()
+	}()
+	require.Eventually(t, w.IsPaused, longWait, tinyWait, "should have paused")
+
+	go func() {
+		time.Sleep(tinyWait)
+		w.Resume()
+	}()
+	require.Eventually(t, w.IsStarted, longWait, tinyWait, "should have restarted from pause")
+
+	go func() {
+		time.Sleep(tinyWait)
+		w.Stop()
+	}()
+	require.Eventually(t, w.IsStopped, longWait, tinyWait, "should have shutdown")
+}


👍 This test shows exactly why I like the API you have around "pausing" vs "paused"

tgross · 2022-01-03T19:47:24Z

nomad/server.go

@@ -1430,17 +1440,165 @@ func (s *Server) setupSerf(conf *serf.Config, ch chan serf.Event, path string) (
 	return serf.Create(conf)
 }

+// shouldReloadSchedulers checks the new config to determine if the scheduler worker pool
+// needs to be updated. If so, returns true and a pointer to a populated SchedulerWorkerPoolArgs
+func shouldReloadSchedulers(s *Server, newPoolArgs *SchedulerWorkerPoolArgs) (bool, *SchedulerWorkerPoolArgs) {


In retrospect we probably could have unconditionally drained and replaced the worker pool on reload; the API call is gated by ACLs so the risk of an operator DoS'ing their own scheduler seems low. And then we could come back in later to do this more clever logic in a future PR. But this is pretty nice and enables scale up/down in the future (as per your // TODO remark below).
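
To illustrate the kind of check being discussed, here is a rough sketch of a "has the pool configuration changed" comparison. The poolArgs type and both functions are hypothetical, and treating the enabled-scheduler list as order-insensitive is an assumption, not something the PR confirms.

```go
package main

import (
	"fmt"
	"sort"
)

// poolArgs is a hypothetical, trimmed-down version of the pool arguments.
type poolArgs struct {
	NumSchedulers     int
	EnabledSchedulers []string
}

// sameSchedulers reports whether two lists hold the same names, ignoring
// order. Inputs are copied before sorting so callers' slices are untouched.
func sameSchedulers(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	x := append([]string(nil), a...)
	y := append([]string(nil), b...)
	sort.Strings(x)
	sort.Strings(y)
	for i := range x {
		if x[i] != y[i] {
			return false
		}
	}
	return true
}

// shouldReload returns true when the proposed settings differ from the
// current ones, meaning the worker pool would need to be rebuilt.
func shouldReload(current, proposed poolArgs) bool {
	if current.NumSchedulers != proposed.NumSchedulers {
		return true
	}
	return !sameSchedulers(current.EnabledSchedulers, proposed.EnabledSchedulers)
}

func main() {
	current := poolArgs{NumSchedulers: 2, EnabledSchedulers: []string{"batch", "service"}}
	reordered := poolArgs{NumSchedulers: 2, EnabledSchedulers: []string{"service", "batch"}}
	scaled := poolArgs{NumSchedulers: 4, EnabledSchedulers: []string{"batch", "service"}}

	fmt.Println(shouldReload(current, reordered))
	fmt.Println(shouldReload(current, scaled))
}
```

The simpler alternative mentioned above, unconditionally draining and replacing the pool, would drop the comparison entirely at the cost of restarting workers on every reload.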

tgross

I added one last place where we could use setStatuses, but other than that this LGTM! Let's ship it!

tgross · 2022-01-06T13:51:03Z

nomad/worker.go

+		w.setWorkloadStatus(WorkloadStopped)
+		w.markStopped()


I think we missed this one:

Suggested change

-		w.setWorkloadStatus(WorkloadStopped)
-		w.markStopped()
+		w.setStatuses(WorkerStopped, WorkloadStopped)

github-actions · 2022-11-05T02:33:36Z

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

angrycub added 4 commits November 19, 2021 22:53

Working POC

1908187

Unexport setupNewWorkers; improve comments

0071e55

Added some VSCode codetours

763671a

Update shutdown to use context

339316a

angrycub requested a review from tgross November 30, 2021 23:01

angrycub self-assigned this Nov 30, 2021

angrycub added type/enhancement theme/scheduling labels Nov 30, 2021

DerekStrickland reviewed Dec 1, 2021

nomad/worker.go (outdated, resolved)

nomad/worker.go (outdated, resolved)

nomad/worker.go (resolved)

Apply suggestions from code review

1a985b3

Co-authored-by: Derek Strickland <1111455+DerekStrickland@users.noreply.github.com>

tgross reviewed Dec 1, 2021

nomad/server.go (outdated, resolved)

tgross reviewed Dec 1, 2021

nomad/server.go (resolved)

angrycub added 2 commits December 1, 2021 18:16

Implement GET for SchedulerWorker API + tests

22f93b7

Merge branch 'f-reload-num-schedulers' of github.com:hashicorp/nomad into f-reload-num-schedulers

16f9dd4

angrycub added 2 commits December 3, 2021 17:49

Wired API, refactors, more testing

48428e7

Merge branch 'main' into f-reload-num-schedulers

1258128

DerekStrickland reviewed Dec 6, 2021

command/agent/agent_endpoint_test.go (outdated, resolved)

angrycub added 2 commits December 6, 2021 18:32

Fix linter complaints

1845577

Updating worker to cache EnabledScheduler list

9c4e5c4

DerekStrickland reviewed Dec 7, 2021

nomad/worker.go (outdated, resolved)

tgross reviewed Dec 7, 2021

command/agent/agent_endpoint_test.go (outdated, resolved)

Refactor unsafe... func names to ...Locked

0d8b7ec

Adding API test for bad worker info

0417332

Add changelog message

420a158

typo in changelog 🤦

fd016de

tgross reviewed Jan 3, 2022

angrycub added 3 commits January 4, 2022 15:23

Incorporate API code review feedback

167c6a3

Incorporate api-docs feedback

f4f610b

Updates to worker/leader code from code review

689fa77

Fix test response type

982c397

angrycub requested a review from tgross January 5, 2022 21:44

tgross approved these changes Jan 6, 2022

Set both statuses in markStopped so they are atomic

7581957

angrycub merged commit 6e61606 into main Jan 6, 2022

angrycub deleted the f-reload-num-schedulers branch January 6, 2022 16:56

github-actions bot locked as resolved and limited conversation to collaborators Nov 5, 2022
