
node pools: apply node pool scheduler configuration #17598

Merged
merged 7 commits on Jun 22, 2023

Conversation

Contributor

@lgfa29 lgfa29 commented Jun 19, 2023

Merge global scheduler configuration with the job's node pool scheduler configuration when processing nodes.
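As a rough illustration of the merge described above (the types and field names below are simplified stand-ins, not Nomad's actual structs):

```go
// Simplified stand-ins for the global scheduler configuration and a node pool.
type SchedulerConfiguration struct {
	SchedulerAlgorithm            string // "binpack" or "spread"
	MemoryOversubscriptionEnabled bool
}

type NodePool struct {
	Name               string
	SchedulerAlgorithm string // empty means "inherit the global setting"
}

// WithNodePool returns a copy of the global configuration with the node
// pool's overrides applied; the original configuration is left untouched.
func (sc *SchedulerConfiguration) WithNodePool(pool *NodePool) *SchedulerConfiguration {
	merged := *sc
	if pool != nil && pool.SchedulerAlgorithm != "" {
		merged.SchedulerAlgorithm = pool.SchedulerAlgorithm
	}
	return &merged
}
```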

@lgfa29 lgfa29 added the theme/node-pools Issues related to node pools label Jun 19, 2023
@lgfa29 lgfa29 added this to the 1.6.0 milestone Jun 19, 2023
@lgfa29 lgfa29 requested review from shoenig and tgross June 19, 2023 20:44
Member

@tgross tgross left a comment


This looks really good @lgfa29! Just the problem with the previous-job version move to fix, I think.

helper/iterator/iterator.go (resolved)
Comment on lines 286 to 296
```go
// Fetch node pool and global scheduler configuration to determine how
// to configure the scheduler.
pool, err := s.state.NodePoolByName(ws, s.job.NodePool)
if err != nil {
	return false, fmt.Errorf("failed to get job node pool '%s': %v", s.job.NodePool, err)
}

_, schedConfig, err := s.state.SchedulerConfig()
if err != nil {
	return false, fmt.Errorf("failed to get scheduler configuration: %v", err)
}

s.stack.SetSchedulerConfiguration(schedConfig.WithNodePool(pool))
```
Member


There are two other places in the GenericScheduler where we call s.stack.SetJob with the previous version of the job as part of computing placements, and then restore it later. This will miss all those cases, which will break if the job is moved between node pools.

Because the scheduler config is now effectively tied to the node pool, and therefore to a specific version of the job, we should probably move the call to SetSchedulerConfiguration inside of SetJob so that they're always in sync.
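A rough sketch of that suggestion, with the stack internals reduced to the relevant call; lookupSchedulerConfig is a hypothetical helper standing in for the node pool and scheduler config lookups shown in the snippet above:

```go
// Hypothetical shape of SetJob after the suggested change: refreshing the
// scheduler configuration from the job's node pool whenever the job on the
// stack changes keeps the two from drifting apart, even when a previous job
// version is temporarily swapped in during placement.
func (s *GenericStack) SetJob(job *Job) {
	// ... existing per-job updates to the stack's iterators elided ...

	if cfg, err := s.lookupSchedulerConfig(job.NodePool); err == nil {
		s.SetSchedulerConfiguration(cfg)
	}
}
```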

Contributor Author


I was trying to write a test that would cover this scenario, but the other calls to SetJob were never triggered, so I looked for tests that did, and the only two I found were TestServiceSched_Migrate_CanaryStatus and TestSpreadPanicDowngrade.

Copying the structure of TestSpreadPanicDowngrade, I was able to trigger the downgradedJob != nil path, but I didn't see any difference between updating the scheduler configuration or not.

de31ec2 has the change to always update the scheduler config along with the job.

Contributor Author


Uploaded the test function here just in case: https://gist.github.com/lgfa29/76aa40859e8cc808a81e6bc32f3f61fa

Member


> but I didn't see any difference between updating the scheduler configuration or not.

Oh! It looks like that code path calls the selectNextOption method, which is querying the state store for the scheduler config again (ref generic_sched.go#L814). So it's completely ignoring that we've set the scheduler config in the stack earlier. We should make sure we're not querying the state directly anywhere else too.
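Roughly, the mismatch looks like the contrast below; both helpers are illustrative rather than Nomad's actual code:

```go
// Anti-pattern: a placement code path re-reads the scheduler configuration
// from the state store, which returns only the global settings and ignores
// whatever node-pool-aware configuration was already set on the stack.
func (s *GenericScheduler) schedulerConfigDirect() (*SchedulerConfiguration, error) {
	_, cfg, err := s.state.SchedulerConfig()
	return cfg, err
}

// Preferred: the stack keeps the merged (global + node pool) configuration
// set earlier via SetSchedulerConfiguration and hands it back on request,
// making it the single source of truth during placement.
func (s *GenericScheduler) schedulerConfigFromStack() *SchedulerConfiguration {
	return s.stack.SchedulerConfiguration() // hypothetical accessor
}
```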

Contributor Author


It made no difference because I was always setting the same job 🤦
de31ec2#diff-7da37ffb89a2df3f8fea4a048ca3bb21b799df290033c26766743e60bc217a48R708

I now have a difference in the node scores, which makes sense, but I still don't fully understand what's going on, and I think we may have a bug if the node pool or datacenters change?

Before the fix, the resulting plan looked like this:
https://gist.github.com/lgfa29/36ef5ba7ac4c7e2c798bfe29f047ec97#file-before_fix-txt

It's very hard to understand what's going on, but the main part is that there are two NodeAllocation entries: one for the first job version in the node pool that uses binpacking, and one for the updated job in the node pool with spread.

The scores for the alloc in the node in the binpacking pool look like this:

     ScoreMetaData: ([]*structs.NodeScoreMeta) (len=3 cap=3) {
      (*structs.NodeScoreMeta)(0x1400029fd40)(2a5436c6-6311-5ff3-c454-a6ca027bbd95 0.818219 map[binpack:0.8182193588025797 job-anti-affinity:0 node-affinity:0 node-reschedule-penalty:0]),
      (*structs.NodeScoreMeta)(0x1400029fcc0)(3df6cdd1-39fb-7854-513b-5251e83ebdf4 0.818219 map[binpack:0.8182193588025797 job-anti-affinity:0 node-affinity:0 node-reschedule-penalty:0]),
      (*structs.NodeScoreMeta)(0x1400029fdc0)(e67e2a90-9c39-75b9-3ea5-b83268ad3fc9 0.137793 map[binpack:0.6755864441534736 job-anti-affinity:-0.4 node-affinity:0 node-reschedule-penalty:0])
     },

Which I think is the main bug, because these values are quite high for essentially empty nodes, which indicates the spread algorithm was used to score them.

And we can see that by comparing them with the scores for the alloc in the node in the spread pool:

     ScoreMetaData: ([]*structs.NodeScoreMeta) (len=3 cap=3) {
      (*structs.NodeScoreMeta)(0x1400029fb80)(f02fca69-2de8-f319-e1a4-315c27bbd982 0.818219 map[binpack:0.8182193588025797 job-anti-affinity:0 node-affinity:0 node-reschedule-penalty:0]),
      (*structs.NodeScoreMeta)(0x1400029fc00)(08f7edaa-7a78-e0f8-96d1-7295b0daf003 0.818219 map[binpack:0.8182193588025797 job-anti-affinity:0 node-affinity:0 node-reschedule-penalty:0]),
      (*structs.NodeScoreMeta)(0x1400029fb00)(e67e2a90-9c39-75b9-3ea5-b83268ad3fc9 0.818219 map[binpack:0.8182193588025797 job-anti-affinity:0 node-affinity:0 node-reschedule-penalty:0])
     },

After the code change, things look better. The alloc in the binpack node looks like this:

     ScoreMetaData: ([]*structs.NodeScoreMeta) (len=4 cap=4) {
      (*structs.NodeScoreMeta)(0x140002cd760)(07567ea6-6c0a-1cb5-4984-26250bb27bca 0.181781 map[binpack:0.18178064119742032 job-anti-affinity:0 node-affinity:0 node-reschedule-penalty:0]),
      (*structs.NodeScoreMeta)(0x140002cd6e0)(4b06224f-ac54-c922-3245-0d732555b12f 0.181781 map[binpack:0.18178064119742032 job-anti-affinity:0 node-affinity:0 node-reschedule-penalty:0]),
      (*structs.NodeScoreMeta)(0x140002cd840)(7f96eb7a-cb60-5040-b722-7b6cd7193898 0.181781 map[binpack:0.18178064119742032 job-anti-affinity:0 node-affinity:0 node-reschedule-penalty:0]),
      (*structs.NodeScoreMeta)(0x140002cd7e0)(c18cd236-b02a-7118-d800-6193b7ae839a -0.037793 map[binpack:0.3244135558465265 job-anti-affinity:-0.4 node-affinity:0 node-reschedule-penalty:0])
     },

And the alloc in the spread node looks like this:

     ScoreMetaData: ([]*structs.NodeScoreMeta) (len=3 cap=3) {
      (*structs.NodeScoreMeta)(0x140002cd5a0)(7f96eb7a-cb60-5040-b722-7b6cd7193898 0.818219 map[binpack:0.8182193588025797 job-anti-affinity:0 node-affinity:0 node-reschedule-penalty:0]),
      (*structs.NodeScoreMeta)(0x140002cd620)(ebb9fade-cd87-cbbb-0229-e268f658789d 0.818219 map[binpack:0.8182193588025797 job-anti-affinity:0 node-affinity:0 node-reschedule-penalty:0]),
      (*structs.NodeScoreMeta)(0x140002cd520)(c18cd236-b02a-7118-d800-6193b7ae839a 0.818219 map[binpack:0.8182193588025797 job-anti-affinity:0 node-affinity:0 node-reschedule-penalty:0])
     },

Now, why are there four nodes in the binpack alloc? I'm still not sure, but while looking into it I realized that both sets of metrics had shared node IDs, which makes no sense since each alloc is for a job in a different node pool!

I think the problem is that we only call SetNodes() once at the beginning of the function, but we need to fetch the nodes again for the downgradedJob if its node pool or datacenters changed.
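A minimal sketch of that check; placementScope and needsNodeRefresh are hypothetical, trimmed-down stand-ins, and the order-sensitive comparison is enough for illustration:

```go
package scheduler

import "slices"

// placementScope is a trimmed-down stand-in for the job fields relevant to
// deciding whether the node set must be re-read.
type placementScope struct {
	NodePool    string
	Datacenters []string
}

// needsNodeRefresh reports whether the nodes fetched at the start of
// placement are still valid for a previous (downgraded) job version, or must
// be fetched again because its node pool or datacenters differ.
func needsNodeRefresh(current, downgraded placementScope) bool {
	if current.NodePool != downgraded.NodePool {
		return true
	}
	return !slices.Equal(current.Datacenters, downgraded.Datacenters)
}
```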

Contributor Author


Reading #8691 and c49359a helped me understand what this part of the code is covering.

I rebased this branch to squash the commits into a single commit with the (I hope 😅) proper fix: b99e8c5.

I think the problem with nodes not being reset also affects current versions of Nomad when datacenters change. Since this code is a bit different in main because of node pools, I will create fixes for the release branches manually.

Member


Nice detective work there. Sounds like a plan!

Contributor Author


I got an actual reproduction and described it in #17651

Contributor Author


Manual backports created as well:
#17652
#17653
#17654

Member

@shoenig shoenig left a comment


LGTM!

helper/iterator/iterator.go (resolved)
scheduler/generic_sched.go (outdated, resolved)
scheduler/scheduler_system.go (outdated, resolved)
scheduler/scheduler_test.go (outdated, resolved)
When a non-canary allocation fails during a canary deployment it must be replaced according to the original job specification, which includes resources, constraints, node pool, datacenters, etc.

Prior to this change, only the job was being reset in the stack, which could lead to the replacement being placed on nodes in the wrong node pool or datacenter and to nodes being scored with the wrong scheduler configuration.
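Very roughly, the fix makes that code path restore the whole placement context rather than just the job; the fragment below is illustrative only, and nodesForJob is a hypothetical helper returning the ready nodes in the job's datacenters and node pool:

```go
// Restore the full placement context for the previous job version,
// not just the job itself.
nodes, err := nodesForJob(s.state, downgradedJob)
if err != nil {
	return err
}
s.stack.SetNodes(nodes)
s.stack.SetJob(downgradedJob) // per the earlier change, this also refreshes the scheduler configuration
```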
Member

@tgross tgross left a comment


LGTM! :shipit:

scheduler/scheduler_system.go (outdated, resolved)
scheduler/stack.go (resolved)