Not placing system job allocations on new nodes #6960

Closed
jorgemarey opened this issue Jan 20, 2020 · 7 comments
@jorgemarey
Contributor

Nomad version

Nomad v0.10.2

Issue

When a new node registers with the cluster, a new allocation of a system job should be placed on it, but it isn't. There's an existing issue tracking this, but it's closed, which is why I created a new one.

I think this only happens when the system job has two (or more) task groups.

I was looking at the code and I think the problem occurs when computing the diff:

  • Here, required, the map returned by materializeTaskGroups, contains the following entries (see the sketch below):
  • job.taskgroup1[0] = tg1
  • job.taskgroup2[0] = tg2
That's why here, when calling diffAllocs and checking all the placed allocations on the nodes, it tries to place all task groups on all nodes. (Here, if a task group is not on the node it ends up in the diff.place list, even if it has a constraint against that node.)

In the logs I can see the following:

{"@level":"debug","@message":"dequeued evaluation","@module":"worker","@timestamp":"2020-01-20T10:58:41.029223Z","eval_id":"45bcf831-d9e4-0511-b182-e259ab20f729"}
{"@level":"debug","@message":"reconciled current state with desired state","@module":"worker.system_sched","@timestamp":"2020-01-20T10:59:01.073127Z","eval_id":"45bcf831-d9e4-0511-b182-e259ab20f729", "job_id":"my-job", "namespace":"default",
  "place":102,
  "ignore":100,
  "lost":2,
  "migrate":0,
  "stop":0,
  "update":0
}
{"@level":"error","@message":"failed to compute job allocations","@module":"worker.system_sched","@timestamp":"2020-01-20T10:58:41.030027Z","error":"could not find node \"4743664b-0822-c982-a508-5be2f5ac4bd6\"","eval_id":"45bcf831-d9e4-0511-b182-e259ab20f729","job_id":"my-job","namespace":"default"}

I have a cluster with 100 nodes: 50 with a meta value of blue and another 50 with a meta value of green. Each task group is constrained to exactly one of those meta values.
When I add a new node while another node is ineligible, the diff tries to place both task groups on all nodes and fails to do so.

I don't know if I explained myself correctly....

Reproduction steps

  • Create a job with 2 task groups (one constrained to one set of nodes, the other to a different set)
  • Mark a node that has an allocation of that job as ineligible
  • Add a new node (Nomad will fail to place a new allocation of the system job on it)
@drewbailey
Contributor

@jorgemarey I've just tested this with the latest changes from #6968 and it seems to be working now. Prior to the fix, the scheduler would return an error and exit early when it encountered the ineligible node; now it records a placement failure but continues scheduling on the remaining nodes.

@jorgemarey
Contributor Author

Hi @drewbailey. Thanks for the fix. With that I guess scheduling will continue, but I believe the code here is still not quite correct. In here:

	// Create the required task groups. 
	required := materializeTaskGroups(job)

	result := &diffResult{}
	for nodeID, allocs := range nodeAllocs {
		diff := diffAllocs(job, taintedNodes, required, allocs, terminalAllocs)

required is a map whose keys are <job>.<taskgroup>[0], so it has as many keys as there are task groups in the job. Then, for every node that has some allocation of this job deployed, a diff is computed. The problem is that in diffAllocs:

func diffAllocs(job *structs.Job, taintedNodes map[string]*structs.Node,
	required map[string]*structs.TaskGroup, allocs []*structs.Allocation,
	terminalAllocs map[string]*structs.Allocation) *diffResult {
	result := &diffResult{}

	// Scan the existing updates
	existing := make(map[string]struct{})
	for _, exist := range allocs {
		// Index the existing node
		name := exist.Name
		existing[name] = struct{}{}
		// ... omitted ...
	}

	// Scan the required groups
	for name, tg := range required {
		// Check for an existing allocation
		_, ok := existing[name]

		// Require a placement if no existing allocation. If there
		// is an existing allocation, we would have checked for a potential
		// update or ignore above.
		if !ok {
			result.place = append(result.place, allocTuple{
				Name:      name,
				TaskGroup: tg,
				Alloc:     terminalAllocs[name],
			})
		}
	}
	return result
}

diffAllocs generates a placement for every "required" allocation that is not on the node. But that is not correct in many cases: you can set a constraint so that a system job task group only deploys to a certain pool of nodes. The group is not required on every node, so those placements shouldn't happen.
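
To make that concrete, here is a minimal, self-contained sketch (local toy types, not Nomad's structs or the real diffAllocs) of the constraint-blind diff for a single node in the blue/green scenario above: the node already runs its own group, yet the diff still asks to place the other group, which the node's meta can never satisfy.

package main

import "fmt"

type taskGroup struct {
	name string
	meta string // the group's meta constraint: "blue" or "green"
}

func main() {
	// Two system-job task groups, each constrained to a different meta value.
	required := map[string]taskGroup{
		"job.taskgroup1[0]": {"taskgroup1", "blue"},
		"job.taskgroup2[0]": {"taskgroup2", "green"},
	}

	// A blue node that already runs its blue allocation.
	nodeMeta := "blue"
	existing := map[string]bool{"job.taskgroup1[0]": true}

	// Constraint-blind diff: every required group missing from the node
	// becomes a placement, regardless of constraints.
	for name, tg := range required {
		if existing[name] {
			continue
		}
		fmt.Printf("place %s (needs meta=%s, node has meta=%s)\n",
			name, tg.meta, nodeMeta)
	}
	// Output: place job.taskgroup2[0] (needs meta=green, node has meta=blue)
	// One such impossible placement is produced for every existing node.
}
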

You can see this in the logs, for example in what I pasted before: Nomad is generating placements that are not needed, roughly one impossible placement per existing node, which is how place climbs to 102.

{"@level":"debug","@message":"reconciled current state with desired state","@module":"worker.system_sched","@timestamp":"2020-01-20T10:59:01.073127Z","eval_id":"45bcf831-d9e4-0511-b182-e259ab20f729", "job_id":"my-job", "namespace":"default",
  "place":102,
  "ignore":100,
  "lost":2,
  "migrate":0,
  "stop":0,
  "update":0
}

In this example, I have 50 nodes with node class A and another 50 with node class B. My job has 2 task groups, one constrained to class A and the other to class B. Say I disable eligibility on one node of class B and then add a new node of class A. As you can see in the log, Nomad will try to add a new allocation of every task group to every node, and that's not needed.

The fix should at least help to keep placing allocations on the new nodes, but maybe this "problem" leads to another set of issues?

I don't know if I explained the problem correctly....

Thanks!

@drewbailey
Contributor

drewbailey commented Jan 23, 2020

Hi @jorgemarey, the debug log line you shared actually occurs before placements are computed. The output of diffSystemAllocs has not yet taken constraints into account; that happens further down, in computePlacements (https://github.com/hashicorp/nomad/blob/v0.10.2/scheduler/system_sched.go#L286 and https://github.com/hashicorp/nomad/blob/v0.10.2/scheduler/stack.go#L267).
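
As a minimal, self-contained sketch of that two-phase flow (simplified toy types and logic, not Nomad's actual computePlacements or stack code): the constraint-blind diff below produces 4 placements for the three-node scenario in this thread, but the per-node feasibility check leaves only the single allocation that is actually created.

package main

import "fmt"

type group struct{ name, wantClass string }

type node struct {
	name, class string
	eligible    bool
	running     map[string]bool // groups already running on the node
}

func main() {
	groups := []group{{"cache", "class-1"}, {"cache2", "class-2"}}
	nodes := []node{
		{"client1", "class-1", true, map[string]bool{"cache": true}},
		{"client2", "class-2", false, map[string]bool{"cache2": true}},
		{"client3", "class-1", true, map[string]bool{}},
	}

	place := 0
	for _, n := range nodes {
		for _, g := range groups {
			if n.running[g.name] {
				continue // counted as "ignore" by the reconciler
			}
			place++ // constraint-blind diff: 4 placements in total
			// Placement phase: eligibility and constraints checked per node.
			if !n.eligible || n.class != g.wantClass {
				continue // filtered out; no allocation is created
			}
			fmt.Printf("alloc %s on %s\n", g.name, n.name) // only cache on client3
		}
	}
	fmt.Println("place =", place) // place = 4, matching the debug log below
}
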

Here is a quick example that I think outlines your scenario

job file

job "redis" {
  datacenters = ["dc1"]

  type = "system"

  # type = "service"
  group "cache2" {
    constraint {
      attribute = "${node.class}"
      value     = "class-2"
    }

    count = 1

    restart {
      attempts = 10
      interval = "5m"

      delay = "25s"
      mode  = "delay"
    }

    ephemeral_disk {
      size = 10
    }

    task "redis" {
      driver = "docker"

      config {
        image = "redis:3.2"

        port_map {
          db = 6379
        }
      }

      env {
        version = "3"
      }

      logs {
        max_files     = 1
        max_file_size = 9
      }

      resources {
        cpu    = 20 # MHz
        memory = 40 # MB

        network {
          mbits = 1
          port  "db"  {}
        }
      }
    }
  }

  group "cache" {
    constraint {
      attribute = "${node.class}"
      value     = "class-1"
    }

    count = 1

    restart {
      attempts = 10
      interval = "5m"

      delay = "25s"
      mode  = "delay"
    }

    ephemeral_disk {
      size = 10
    }

    task "redis" {
      driver = "docker"

      config {
        image = "redis:3.2"

        port_map {
          db = 6379
        }
      }

      env {
        version = "3"
      }

      logs {
        max_files     = 1
        max_file_size = 9
      }

      resources {
        cpu    = 20 # MHz
        memory = 40 # MB

        network {
          mbits = 1
          port  "db"  {}
        }
      }
    }
  }
}

output before adding a new node

→ nomad job status redis
ID            = redis
Name          = redis
Submit Date   = 2020-01-23T11:32:10-05:00
Type          = system
Priority      = 50
Datacenters   = dc1
Namespace     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
cache       0       0         1        0       0         0
cache2      0       0         1        0       0         0

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created  Modified
76f6c5a4  511a2b4a  cache2      0        run      running  4s ago   3s ago
9d635a25  2c943a10  cache       0        run      running  4s ago   3s ago


→ nomad node status
ID        DC   Name     Class    Drain  Eligibility  Status
2c943a10  dc1  client1  class-1  false  eligible     ready
511a2b4a  dc1  client2  class-2  false  eligible     ready

mark one node ineligible
nomad node eligibility -disable 511

after adding a new node of class-1, I have 3 running allocations as expected

→ nomad job status redis
ID            = redis
Name          = redis
Submit Date   = 2020-01-23T11:32:10-05:00
Type          = system
Priority      = 50
Datacenters   = dc1
Namespace     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
cache       1       0         2        0       0         0
cache2      0       0         1        0       0         0

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created   Modified
77d52111  463a123c  cache       0        run      running  5s ago    4s ago
76f6c5a4  511a2b4a  cache2      0        run      running  2m9s ago  2m8s ago
9d635a25  2c943a10  cache       0        run      running  2m9s ago  2m8s ago
→ nomad node status
ID        DC   Name     Class    Drain  Eligibility  Status
463a123c  dc1  client3  class-1  false  eligible     ready
2c943a10  dc1  client1  class-1  false  eligible     ready
511a2b4a  dc1  client2  class-2  false  ineligible   ready

so even though the reconcile debug output shows a higher placement count, we still only end up with one new allocation.

    2020-01-23T11:34:16.879-0500 [DEBUG] worker.system_sched: reconciled current state with desired state: eval_id=faae8047-413d-efa8-fbf8-f21c573232d1 job_id=redis namespace=default place=4 update=0 migrate=0 stop=0 ignore=2 lost=0

Let me know if that's not what you are expecting, thanks!

@jorgemarey
Contributor Author

Hi @drewbailey. Yes, I was expecting this behaviour from the fix, and I know it works. It was simply a little weird that a placement was generated for something that shouldn't be there.

With this behaviour in the fix, using the example you provided: in the code here, wouldn't a new element be added to the failedTGAllocs map? That would give the eval a status with some failed allocations, yet nothing actually failed; the placement shouldn't have been generated in the first place.

Anyway, if it works fine and doesn't conflict with other parts, I guess that's ok.

Thanks for the fix!

@drewbailey
Contributor

@jorgemarey, currently in master there would only be a failedTGAllocs entry for a running service job on an ineligible node that Nomad attempted to update (through a deployment, a new node, etc.). We agree that this shouldn't be the case and that it should be ignored instead.

We are working on a potential solution here #6996

@jorgemarey
Contributor Author

That sounds great! Thanks!

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 13, 2022