Not placing system job allocations on new nodes #6960

Closed
jorgemarey opened this issue Jan 20, 2020 · 7 comments
@jorgemarey
Contributor

Nomad version

Nomad v0.10.2

Issue

When a new node registers with the cluster, a new allocation of a system job should be placed on it, but it isn't. There's an existing issue tracking this, but it's closed, which is why I created a new one.

I think this only happens when the system job has two (or more) task groups.

I was looking at the code and I think the problem occurs when computing the diff:

  • Here, required, the map returned by materializeTaskGroups, contains the following entries (see the sketch below):
  • job.taskgroup1[0] = tg1
  • job.taskgroup2[0] = tg2
That's why here, when calling diffAllocs and checking all the placed allocations on the nodes, it tries to place all task groups on all nodes. (Here, if a task group is not on the node it ends up in the diff.place list, even if it has a constraint against that node.)

In the logs I can see the following:

{"@level":"debug","@message":"dequeued evaluation","@module":"worker","@timestamp":"2020-01-20T10:58:41.029223Z","eval_id":"45bcf831-d9e4-0511-b182-e259ab20f729"}
{"@level":"debug","@message":"reconciled current state with desired state","@module":"worker.system_sched","@timestamp":"2020-01-20T10:59:01.073127Z","eval_id":"45bcf831-d9e4-0511-b182-e259ab20f729", "job_id":"my-job", "namespace":"default",
  "place":102,
  "ignore":100,
  "lost":2,
  "migrate":0,
  "stop":0,
  "update":0
}
{"@level":"error","@message":"failed to compute job allocations","@module":"worker.system_sched","@timestamp":"2020-01-20T10:58:41.030027Z","error":"could not find node \"4743664b-0822-c982-a508-5be2f5ac4bd6\"","eval_id":"45bcf831-d9e4-0511-b182-e259ab20f729","job_id":"my-job","namespace":"default"}

I have a cluster with 100 nodes: 50 with a meta value of blue and another 50 with a meta value of green. Each task group is constrained to exactly one of those meta values.
When I add a new node while another node is ineligible, the diff tries to place both task groups on all nodes and fails to do so.

I don't know if I explained myself correctly....

Reproduction steps

  • Create a job with 2 task groups (one constrained to one set of nodes, the other to a different set)
  • Mark a node that has an allocation of that job as ineligible
  • Add a new node (Nomad will fail to place a new allocation of the system job on it)
@drewbailey
Contributor

@jorgemarey I've just tested this with the latest changes from #6968 and it seems to be working now. Prior to the fix, the scheduler would return an error and exit early when it encountered the ineligible node; now it records a placement failure but continues scheduling on the remaining nodes.

@jorgemarey
Contributor Author

Hi @drewbailey. Thanks for the fix. With that I guess scheduling will continue, but I believe the code here is still not quite correct. In here:

	// Create the required task groups. 
	required := materializeTaskGroups(job)

	result := &diffResult{}
	for nodeID, allocs := range nodeAllocs {
		diff := diffAllocs(job, taintedNodes, required, allocs, terminalAllocs)

required is a map whose keys are <job>.<taskgroup>[0], so it has as many keys as there are task groups in the job. Then, for every node that has some allocation of this job deployed, a diff is computed. The problem is that in diffAllocs:

func diffAllocs(job *structs.Job, taintedNodes map[string]*structs.Node,
	required map[string]*structs.TaskGroup, allocs []*structs.Allocation,
	terminalAllocs map[string]*structs.Allocation) *diffResult {
	result := &diffResult{}

	// Scan the existing updates
	existing := make(map[string]struct{})
	for _, exist := range allocs {
		// Index the existing node
		name := exist.Name
		existing[name] = struct{}{}
		// ... omitted ...
	}

	// Scan the required groups
	for name, tg := range required {
		// Check for an existing allocation
		_, ok := existing[name]

		// Require a placement if no existing allocation. If there
		// is an existing allocation, we would have checked for a potential
		// update or ignore above.
		if !ok {
			result.place = append(result.place, allocTuple{
				Name:      name,
				TaskGroup: tg,
				Alloc:     terminalAllocs[name],
			})
		}
	}
	return result
}

diffAllocs generates a placement for every "required" allocation that is not on the node. But that is not correct in many cases: you can set a constraint so that a system job task group only deploys to a certain pool of nodes. The group is not required on every node, so those placements shouldn't happen.
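
To make that concrete, here is a minimal, self-contained sketch (local toy types, not Nomad's structs or the real diffAllocs) of the constraint-blind diff for a single node in the blue/green scenario above: the node already runs its own group, yet the diff still asks to place the other group, which the node's meta can never satisfy.

package main

import "fmt"

type taskGroup struct {
	name string
	meta string // the group's meta constraint: "blue" or "green"
}

func main() {
	// Two system-job task groups, each constrained to a different meta value.
	required := map[string]taskGroup{
		"job.taskgroup1[0]": {"taskgroup1", "blue"},
		"job.taskgroup2[0]": {"taskgroup2", "green"},
	}

	// A blue node that already runs its blue allocation.
	nodeMeta := "blue"
	existing := map[string]bool{"job.taskgroup1[0]": true}

	// Constraint-blind diff: every required group missing from the node
	// becomes a placement, regardless of constraints.
	for name, tg := range required {
		if existing[name] {
			continue
		}
		fmt.Printf("place %s (needs meta=%s, node has meta=%s)\n",
			name, tg.meta, nodeMeta)
	}
	// Output: place job.taskgroup2[0] (needs meta=green, node has meta=blue)
	// One such impossible placement is produced for every existing node.
}
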

You can see this in the logs, for example in what I pasted before: Nomad is generating placements that are not needed, roughly one impossible placement per existing node, which is how place climbs to 102.

{"@level":"debug","@message":"reconciled current state with desired state","@module":"worker.system_sched","@timestamp":"2020-01-20T10:59:01.073127Z","eval_id":"45bcf831-d9e4-0511-b182-e259ab20f729", "job_id":"my-job", "namespace":"default",
  "place":102,
  "ignore":100,
  "lost":2,
  "migrate":0,
  "stop":0,
  "update":0
}

In this example, I have 50 nodes with node class A and another 50 with node class B. My job has 2 task groups, one constrained to class A and the other to class B. Say I disable eligibility on one node of class B and then add a new node of class A. As you can see in the log, Nomad will try to add a new allocation of every task group to every node, and that's not needed.

The fix should at least help to keep placing allocations on the new nodes, but maybe this "problem" leads to another set of issues?

I don't know if I explained the problem correctly....

Thanks!

@drewbailey
Contributor

drewbailey commented Jan 23, 2020

Hi @jorgemarey, the debug log line you shared actually occurs before placements are computed. The output of diffSystemAllocs has not yet taken constraints into account; that happens further down, in computePlacements (https://github.com/hashicorp/nomad/blob/v0.10.2/scheduler/system_sched.go#L286 and https://github.com/hashicorp/nomad/blob/v0.10.2/scheduler/stack.go#L267).
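
As a minimal, self-contained sketch of that two-phase flow (simplified toy types and logic, not Nomad's actual computePlacements or stack code): the constraint-blind diff below produces 4 placements for the three-node scenario in this thread, but the per-node feasibility check leaves only the single allocation that is actually created.

package main

import "fmt"

type group struct{ name, wantClass string }

type node struct {
	name, class string
	eligible    bool
	running     map[string]bool // groups already running on the node
}

func main() {
	groups := []group{{"cache", "class-1"}, {"cache2", "class-2"}}
	nodes := []node{
		{"client1", "class-1", true, map[string]bool{"cache": true}},
		{"client2", "class-2", false, map[string]bool{"cache2": true}},
		{"client3", "class-1", true, map[string]bool{}},
	}

	place := 0
	for _, n := range nodes {
		for _, g := range groups {
			if n.running[g.name] {
				continue // counted as "ignore" by the reconciler
			}
			place++ // constraint-blind diff: 4 placements in total
			// Placement phase: eligibility and constraints checked per node.
			if !n.eligible || n.class != g.wantClass {
				continue // filtered out; no allocation is created
			}
			fmt.Printf("alloc %s on %s\n", g.name, n.name) // only cache on client3
		}
	}
	fmt.Println("place =", place) // place = 4, matching the debug log below
}
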

Here is a quick example that I think outlines your scenario

job file

job "redis" {
  datacenters = ["dc1"]

  type = "system"

  # type = "service"
  group "cache2" {
    constraint {
      attribute = "${node.class}"
      value     = "class-2"
    }

    count = 1

    restart {
      attempts = 10
      interval = "5m"

      delay = "25s"
      mode  = "delay"
    }

    ephemeral_disk {
      size = 10
    }

    task "redis" {
      driver = "docker"

      config {
        image = "redis:3.2"

        port_map {
          db = 6379
        }
      }

      env {
        version = "3"
      }

      logs {
        max_files     = 1
        max_file_size = 9
      }

      resources {
        cpu    = 20 # MHz
        memory = 40 # MB

        network {
          mbits = 1
          port  "db"  {}
        }
      }
    }
  }

  group "cache" {
    constraint {
      attribute = "${node.class}"
      value     = "class-1"
    }

    count = 1

    restart {
      attempts = 10
      interval = "5m"

      delay = "25s"
      mode  = "delay"
    }

    ephemeral_disk {
      size = 10
    }

    task "redis" {
      driver = "docker"

      config {
        image = "redis:3.2"

        port_map {
          db = 6379
        }
      }

      env {
        version = "3"
      }

      logs {
        max_files     = 1
        max_file_size = 9
      }

      resources {
        cpu    = 20 # MHz
        memory = 40 # MB

        network {
          mbits = 1
          port  "db"  {}
        }
      }
    }
  }
}

output before adding a new node

→ nomad job status redis
ID            = redis
Name          = redis
Submit Date   = 2020-01-23T11:32:10-05:00
Type          = system
Priority      = 50
Datacenters   = dc1
Namespace     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
cache       0       0         1        0       0         0
cache2      0       0         1        0       0         0

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created  Modified
76f6c5a4  511a2b4a  cache2      0        run      running  4s ago   3s ago
9d635a25  2c943a10  cache       0        run      running  4s ago   3s ago


→ nomad node status
ID        DC   Name     Class    Drain  Eligibility  Status
2c943a10  dc1  client1  class-1  false  eligible     ready
511a2b4a  dc1  client2  class-2  false  eligible     ready

mark one node ineligible
nomad node eligibility -disable 511

after adding a new node of class-1, I have 3 running allocations as expected

→ nomad job status redis
ID            = redis
Name          = redis
Submit Date   = 2020-01-23T11:32:10-05:00
Type          = system
Priority      = 50
Datacenters   = dc1
Namespace     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
cache       1       0         2        0       0         0
cache2      0       0         1        0       0         0

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created   Modified
77d52111  463a123c  cache       0        run      running  5s ago    4s ago
76f6c5a4  511a2b4a  cache2      0        run      running  2m9s ago  2m8s ago
9d635a25  2c943a10  cache       0        run      running  2m9s ago  2m8s ago
→ nomad node status
ID        DC   Name     Class    Drain  Eligibility  Status
463a123c  dc1  client3  class-1  false  eligible     ready
2c943a10  dc1  client1  class-1  false  eligible     ready
511a2b4a  dc1  client2  class-2  false  ineligible   ready

so even though the reconcile debug output shows a higher placement count, we still only end up with one new allocation.

    2020-01-23T11:34:16.879-0500 [DEBUG] worker.system_sched: reconciled current state with desired state: eval_id=faae8047-413d-efa8-fbf8-f21c573232d1 job_id=redis namespace=default place=4 update=0 migrate=0 stop=0 ignore=2 lost=0

Let me know if that's not what you are expecting, thanks!

@jorgemarey
Contributor Author

Hi @drewbailey. Yes, I was expecting this behaviour from the fix, and I know it works. It was simply a little weird that a placement was generated for something that shouldn't be there.

With this behaviour in the fix, using the example you provided: in the code here, wouldn't a new element be added to the failedTGAllocs map? That would give the eval a status with some failed allocations, yet nothing actually failed; the placement shouldn't have been generated in the first place.

Anyway, if it works fine and doesn't conflict with other parts, I guess that's ok.

Thanks for the fix!

@drewbailey
Contributor

@jorgemarey, currently in master there would only be a failedTGAllocs entry for a running service job on an ineligible node that Nomad attempted to update (through a deployment, a new node, etc.). We agree that this shouldn't be the case and that it should be ignored instead.

We are working on a potential solution here #6996

@jorgemarey
Contributor Author

That sounds great! Thanks!

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 13, 2022