scheduler: stop allocs in unrelated nodes #11391

Merged
merged 5 commits into main from b-system-remove-dces
Oct 27, 2021
Conversation

notnoop
Contributor

@notnoop notnoop commented Oct 26, 2021

The system scheduler should leave allocs on draining nodes as-is, but stop allocs on nodes that are no longer part of the job's datacenters.

Previously, the scheduler did not make this distinction and left system job allocs intact if they were already running.

I've added a failing test first, which you can see in https://app.circleci.com/jobs/github/hashicorp/nomad/179661.

Fixes #11373

Mahmood Ali added 3 commits October 26, 2021 08:14
The system scheduler should leave allocs on draining nodes as-is, but
stop allocs on nodes that are no longer part of the job's datacenters.

Previously, the scheduler did not make the distinction and left system
job allocs intact if they were already running.
@notnoop notnoop self-assigned this Oct 26, 2021
Comment on lines -142 to 144

-	if _, ok := eligibleNodes[nodeID]; !ok {
+	if _, ok := notReadyNodes[nodeID]; ok {
 		goto IGNORE

Contributor Author

This is the source of the bug. Previously, if the node was not in the list of eligibleNodes, we ignored it. The assumption was that the node was draining or had been marked ineligible for scheduling. Now, we explicitly check whether the node is not ready but still in the DCs that the job targets.
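
To spell out the resulting behavior in code, here is a minimal sketch; the helper name and the returned strings are hypothetical, and the real change in diffSystemAllocsForNode uses a goto IGNORE label rather than return values:

// decideForNonEligibleNode is a hypothetical sketch (not the Nomad source) of the
// fixed decision for an existing system alloc whose node is not in eligibleNodes.
func decideForNonEligibleNode(nodeID string, notReadyNodes map[string]struct{}) string {
	if _, ok := notReadyNodes[nodeID]; ok {
		// The node is temporarily not ready (e.g. draining) but is still in
		// the job's datacenters, so the existing alloc is left as-is.
		return "ignore"
	}
	// Otherwise the node is no longer part of the job's datacenters, so the
	// alloc should be stopped.
	return "stop"
}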

@@ -65,7 +65,8 @@ func diffSystemAllocsForNode(
 	job *structs.Job, // job whose allocs are going to be diff-ed
 	nodeID string,
 	eligibleNodes map[string]*structs.Node,
-	taintedNodes map[string]*structs.Node, // nodes which are down or in drain (by node name)
+	notReadyNodes map[string]struct{}, // nodes that are not ready, e.g. draining
+	taintedNodes map[string]*structs.Node, // nodes which are down (by node name)

Contributor Author

The taintedNodes logic is actually confusing IMO. The taintedNodes function picks up nodes whose status is down (via ShouldDrainNode) as well as nodes marked for draining:

nomad/scheduler/util.go

Lines 351 to 377 in b0ce684

// taintedNodes is used to scan the allocations and then check if the
// underlying nodes are tainted, and should force a migration of the allocation.
// All the nodes returned in the map are tainted.
func taintedNodes(state State, allocs []*structs.Allocation) (map[string]*structs.Node, error) {
	out := make(map[string]*structs.Node)
	for _, alloc := range allocs {
		if _, ok := out[alloc.NodeID]; ok {
			continue
		}

		ws := memdb.NewWatchSet()
		node, err := state.NodeByID(ws, alloc.NodeID)
		if err != nil {
			return nil, err
		}

		// If the node does not exist, we should migrate
		if node == nil {
			out[alloc.NodeID] = nil
			continue
		}

		if structs.ShouldDrainNode(node.Status) || node.DrainStrategy != nil {
			out[alloc.NodeID] = node
		}
	}
	return out, nil
}

However, nodes that are up but marked for draining were already filtered out by readyNodesInDCs:

nomad/scheduler/util.go

Lines 277 to 313 in b0ce684

// readyNodesInDCs returns all the ready nodes in the given datacenters and a
// mapping of each data center to the count of ready nodes.
func readyNodesInDCs(state State, dcs []string) ([]*structs.Node, map[string]struct{}, map[string]int, error) {
	// Index the DCs
	dcMap := make(map[string]int, len(dcs))
	for _, dc := range dcs {
		dcMap[dc] = 0
	}

	// Scan the nodes
	ws := memdb.NewWatchSet()
	var out []*structs.Node
	notReady := map[string]struct{}{}
	iter, err := state.Nodes(ws)
	if err != nil {
		return nil, nil, nil, err
	}
	for {
		raw := iter.Next()
		if raw == nil {
			break
		}

		// Filter on datacenter and status
		node := raw.(*structs.Node)
		if !node.Ready() {
			notReady[node.ID] = struct{}{}
			continue
		}

		if _, ok := dcMap[node.Datacenter]; !ok {
			continue
		}

		out = append(out, node)
		dcMap[node.Datacenter]++
	}
	return out, notReady, dcMap, nil
}

So taintedNodes contains only the down nodes. Reasoning through the code is a bit complex, and I didn't feel confident restructuring that logic to be more explicit about node state.
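
To summarize the node sets discussed above in one place, here is a small runnable sketch; the helper, the plain bool maps (standing in for the scheduler's map[string]*structs.Node and map[string]struct{} types), the outcome strings, and the precedence between the cases are all illustrative assumptions, not the scheduler's actual control flow:

package main

import "fmt"

// decideForExistingAlloc is a hypothetical summary helper, not Nomad code. It
// mirrors the classification discussed in this thread for a node that hosts an
// existing system alloc.
func decideForExistingAlloc(nodeID string, eligible, tainted, notReady map[string]bool) string {
	switch {
	case eligible[nodeID]:
		return "keep" // ready node in one of the job's datacenters
	case tainted[nodeID]:
		return "migrate" // node is down or gone
	case notReady[nodeID]:
		return "ignore" // draining or otherwise not ready: leave the alloc as-is
	default:
		return "stop" // node is no longer part of the job's datacenters
	}
}

func main() {
	eligible := map[string]bool{"node-a": true}
	tainted := map[string]bool{"node-c": true}
	notReady := map[string]bool{"node-b": true}
	for _, id := range []string{"node-a", "node-b", "node-c", "node-d"} {
		fmt.Println(id, "=>", decideForExistingAlloc(id, eligible, tainted, notReady))
	}
}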

Contributor

Just to add a small nitpick on this, the comment says (by node name) but the code indexes them by ID.

Contributor Author

It's interesting that the structs.TerminalByNodeByName struct is actually keyed by node ID too. I'll update the comment here, and rename the struct in a follow-up PR.

Contributor

@lgfa29 lgfa29 left a comment

LGTM :shipit:

Just missing a changelog entry.

Mahmood Ali added 2 commits October 26, 2021 21:29
We use node ids rather than node names when hashing or grouping.
@notnoop notnoop merged commit 56a7cc6 into main Oct 27, 2021
@notnoop notnoop deleted the b-system-remove-dces branch October 27, 2021 14:04
@tgross tgross added this to the 1.2.0 milestone Nov 8, 2021
@github-actions

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 11, 2022
Successfully merging this pull request may close these issues.

System job keeps running after I try to remove it from a DC