
node pools: apply node pool scheduler configuration #17598

Merged
merged 7 commits on Jun 22, 2023

Conversation

Contributor

@lgfa29 lgfa29 commented Jun 19, 2023

Merge global scheduler configuration with the job's node pool scheduler configuration when processing nodes.
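As a rough illustration of the merge described above (the types and field names below are simplified stand-ins, not Nomad's actual structs):

```go
// Simplified stand-ins for the global scheduler configuration and a node pool.
type SchedulerConfiguration struct {
	SchedulerAlgorithm            string // "binpack" or "spread"
	MemoryOversubscriptionEnabled bool
}

type NodePool struct {
	Name               string
	SchedulerAlgorithm string // empty means "inherit the global setting"
}

// WithNodePool returns a copy of the global configuration with the node
// pool's overrides applied; the original configuration is left untouched.
func (sc *SchedulerConfiguration) WithNodePool(pool *NodePool) *SchedulerConfiguration {
	merged := *sc
	if pool != nil && pool.SchedulerAlgorithm != "" {
		merged.SchedulerAlgorithm = pool.SchedulerAlgorithm
	}
	return &merged
}
```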

@lgfa29 lgfa29 added the theme/node-pools Issues related to node pools label Jun 19, 2023
@lgfa29 lgfa29 added this to the 1.6.0 milestone Jun 19, 2023
@lgfa29 lgfa29 requested review from shoenig and tgross June 19, 2023 20:44
Member

@tgross tgross left a comment


This looks really good @lgfa29! Just the problem with the previous-job version move to fix, I think.

helper/iterator/iterator.go (resolved)
Comment on lines 286 to 296
```go
// Fetch node pool and global scheduler configuration to determine how
// to configure the scheduler.
pool, err := s.state.NodePoolByName(ws, s.job.NodePool)
if err != nil {
	return false, fmt.Errorf("failed to get job node pool '%s': %v", s.job.NodePool, err)
}

_, schedConfig, err := s.state.SchedulerConfig()
if err != nil {
	return false, fmt.Errorf("failed to get scheduler configuration: %v", err)
}

s.stack.SetSchedulerConfiguration(schedConfig.WithNodePool(pool))
```
Member


There are two other places in the GenericScheduler where we call s.stack.SetJob with the previous version of the job as part of computing placements, and then restore it later. This will miss all those cases, which will break if the job is moved between node pools.

Because the scheduler config is now effectively tied to the node pool, and therefore to a specific version of the job, we should probably move the call to SetSchedulerConfiguration inside of SetJob so that they're always in sync.
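A rough sketch of that suggestion, with the stack internals reduced to the relevant call; lookupSchedulerConfig is a hypothetical helper standing in for the node pool and scheduler config lookups shown in the snippet above:

```go
// Hypothetical shape of SetJob after the suggested change: refreshing the
// scheduler configuration from the job's node pool whenever the job on the
// stack changes keeps the two from drifting apart, even when a previous job
// version is temporarily swapped in during placement.
func (s *GenericStack) SetJob(job *Job) {
	// ... existing per-job updates to the stack's iterators elided ...

	if cfg, err := s.lookupSchedulerConfig(job.NodePool); err == nil {
		s.SetSchedulerConfiguration(cfg)
	}
}
```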

Contributor Author


I was trying to write a test that would cover this scenario, but the other calls to SetJob were never triggered, so I looked for tests that did, and the only two I found were TestServiceSched_Migrate_CanaryStatus and TestSpreadPanicDowngrade.

Copying the structure of TestSpreadPanicDowngrade, I was able to trigger the downgradedJob != nil path, but I didn't see any difference between updating the scheduler configuration or not.

de31ec2 has the change to always update the scheduler config along with the job.

Contributor Author


Uploaded the test function here just in case: https://gist.github.com/lgfa29/76aa40859e8cc808a81e6bc32f3f61fa

Member


> but I didn't see any difference between updating the scheduler configuration or not.

Oh! It looks like that code path calls the selectNextOption method, which is querying the state store for the scheduler config again (ref generic_sched.go#L814). So it's completely ignoring that we've set the scheduler config in the stack earlier. We should make sure we're not querying the state directly anywhere else too.
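Roughly, the mismatch looks like the contrast below; both helpers are illustrative rather than Nomad's actual code:

```go
// Anti-pattern: a placement code path re-reads the scheduler configuration
// from the state store, which returns only the global settings and ignores
// whatever node-pool-aware configuration was already set on the stack.
func (s *GenericScheduler) schedulerConfigDirect() (*SchedulerConfiguration, error) {
	_, cfg, err := s.state.SchedulerConfig()
	return cfg, err
}

// Preferred: the stack keeps the merged (global + node pool) configuration
// set earlier via SetSchedulerConfiguration and hands it back on request,
// making it the single source of truth during placement.
func (s *GenericScheduler) schedulerConfigFromStack() *SchedulerConfiguration {
	return s.stack.SchedulerConfiguration() // hypothetical accessor
}
```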

Contributor Author


It made no difference because I was always setting the same job 🤦
de31ec2#diff-7da37ffb89a2df3f8fea4a048ca3bb21b799df290033c26766743e60bc217a48R708

I now have a difference in the node scores, which makes sense, but I still don't fully understand what's going on, and I think we may have a bug if the node pool or datacenters change?

Before the fix, the resulting plan looked like this:
https://gist.github.com/lgfa29/36ef5ba7ac4c7e2c798bfe29f047ec97#file-before_fix-txt

It's very hard to understand what's going on, but the main part is that there are two NodeAllocation entries: one for the first job version in the node pool that uses binpacking, and one for the updated job in the node pool with spread.

The scores for the alloc in the node in the binpacking pool look like this:

     ScoreMetaData: ([]*structs.NodeScoreMeta) (len=3 cap=3) {
      (*structs.NodeScoreMeta)(0x1400029fd40)(2a5436c6-6311-5ff3-c454-a6ca027bbd95 0.818219 map[binpack:0.8182193588025797 job-anti-affinity:0 node-affinity:0 node-reschedule-penalty:0]),
      (*structs.NodeScoreMeta)(0x1400029fcc0)(3df6cdd1-39fb-7854-513b-5251e83ebdf4 0.818219 map[binpack:0.8182193588025797 job-anti-affinity:0 node-affinity:0 node-reschedule-penalty:0]),
      (*structs.NodeScoreMeta)(0x1400029fdc0)(e67e2a90-9c39-75b9-3ea5-b83268ad3fc9 0.137793 map[binpack:0.6755864441534736 job-anti-affinity:-0.4 node-affinity:0 node-reschedule-penalty:0])
     },

Which I think is the main bug, because these values are quite high for essentially empty nodes, which indicates the spread algorithm was used to score them.

And we can see that by comparing them with the scores for the alloc in the node in the spread pool:

     ScoreMetaData: ([]*structs.NodeScoreMeta) (len=3 cap=3) {
      (*structs.NodeScoreMeta)(0x1400029fb80)(f02fca69-2de8-f319-e1a4-315c27bbd982 0.818219 map[binpack:0.8182193588025797 job-anti-affinity:0 node-affinity:0 node-reschedule-penalty:0]),
      (*structs.NodeScoreMeta)(0x1400029fc00)(08f7edaa-7a78-e0f8-96d1-7295b0daf003 0.818219 map[binpack:0.8182193588025797 job-anti-affinity:0 node-affinity:0 node-reschedule-penalty:0]),
      (*structs.NodeScoreMeta)(0x1400029fb00)(e67e2a90-9c39-75b9-3ea5-b83268ad3fc9 0.818219 map[binpack:0.8182193588025797 job-anti-affinity:0 node-affinity:0 node-reschedule-penalty:0])
     },

After the code change, things look better. The alloc in the binpack node looks like this:

     ScoreMetaData: ([]*structs.NodeScoreMeta) (len=4 cap=4) {
      (*structs.NodeScoreMeta)(0x140002cd760)(07567ea6-6c0a-1cb5-4984-26250bb27bca 0.181781 map[binpack:0.18178064119742032 job-anti-affinity:0 node-affinity:0 node-reschedule-penalty:0]),
      (*structs.NodeScoreMeta)(0x140002cd6e0)(4b06224f-ac54-c922-3245-0d732555b12f 0.181781 map[binpack:0.18178064119742032 job-anti-affinity:0 node-affinity:0 node-reschedule-penalty:0]),
      (*structs.NodeScoreMeta)(0x140002cd840)(7f96eb7a-cb60-5040-b722-7b6cd7193898 0.181781 map[binpack:0.18178064119742032 job-anti-affinity:0 node-affinity:0 node-reschedule-penalty:0]),
      (*structs.NodeScoreMeta)(0x140002cd7e0)(c18cd236-b02a-7118-d800-6193b7ae839a -0.037793 map[binpack:0.3244135558465265 job-anti-affinity:-0.4 node-affinity:0 node-reschedule-penalty:0])
     },

And the alloc in the spread node looks like this:

     ScoreMetaData: ([]*structs.NodeScoreMeta) (len=3 cap=3) {
      (*structs.NodeScoreMeta)(0x140002cd5a0)(7f96eb7a-cb60-5040-b722-7b6cd7193898 0.818219 map[binpack:0.8182193588025797 job-anti-affinity:0 node-affinity:0 node-reschedule-penalty:0]),
      (*structs.NodeScoreMeta)(0x140002cd620)(ebb9fade-cd87-cbbb-0229-e268f658789d 0.818219 map[binpack:0.8182193588025797 job-anti-affinity:0 node-affinity:0 node-reschedule-penalty:0]),
      (*structs.NodeScoreMeta)(0x140002cd520)(c18cd236-b02a-7118-d800-6193b7ae839a 0.818219 map[binpack:0.8182193588025797 job-anti-affinity:0 node-affinity:0 node-reschedule-penalty:0])
     },

Now, why are there four nodes in the binpack alloc? I'm still not sure, but while looking into it I realized that both sets of metrics had shared node IDs, which makes no sense since each alloc is for a job in a different node pool!

I think the problem is that we only call SetNodes() once at the beginning of the function, but we need to fetch the nodes again for the downgradedJob if its node pool or datacenters changed.
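A minimal sketch of that check; placementScope and needsNodeRefresh are hypothetical, trimmed-down stand-ins, and the order-sensitive comparison is enough for illustration:

```go
package scheduler

import "slices"

// placementScope is a trimmed-down stand-in for the job fields relevant to
// deciding whether the node set must be re-read.
type placementScope struct {
	NodePool    string
	Datacenters []string
}

// needsNodeRefresh reports whether the nodes fetched at the start of
// placement are still valid for a previous (downgraded) job version, or must
// be fetched again because its node pool or datacenters differ.
func needsNodeRefresh(current, downgraded placementScope) bool {
	if current.NodePool != downgraded.NodePool {
		return true
	}
	return !slices.Equal(current.Datacenters, downgraded.Datacenters)
}
```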

Contributor Author


Reading #8691 and c49359a helped me understand what this part of the code is covering.

I rebased this branch to squash the commits into a single commit with the (I hope 😅) proper fix: b99e8c5.

I think the problem with nodes not being reset also affects current versions of Nomad when datacenters change. Since this code is a bit different in main because of node pools, I will create fixes for the release branches manually.

Member


Nice detective work there. Sounds like a plan!

Contributor Author


I got an actual reproduction and described it in #17651

Contributor Author


Manual backports created as well:
#17652
#17653
#17654

Member

@shoenig shoenig left a comment


LGTM!

helper/iterator/iterator.go (resolved)
scheduler/generic_sched.go (outdated, resolved)
scheduler/scheduler_system.go (outdated, resolved)
scheduler/scheduler_test.go (outdated, resolved)
When a non-canary allocation fails during a canary deployment it must be replaced according to the original job specification, which includes resources, constraints, node pool, datacenters, etc.

Prior to this change, only the job was being reset in the stack, which could lead to the replacement being placed on nodes in the wrong node pool or datacenter and to nodes being scored with the wrong scheduler configuration.
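Very roughly, the fix makes that code path restore the whole placement context rather than just the job; the fragment below is illustrative only, and nodesForJob is a hypothetical helper returning the ready nodes in the job's datacenters and node pool:

```go
// Restore the full placement context for the previous job version,
// not just the job itself.
nodes, err := nodesForJob(s.state, downgradedJob)
if err != nil {
	return err
}
s.stack.SetNodes(nodes)
s.stack.SetJob(downgradedJob) // per the earlier change, this also refreshes the scheduler configuration
```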
Member

@tgross tgross left a comment


LGTM! :shipit:

scheduler/scheduler_system.go (outdated, resolved)
scheduler/stack.go (resolved)