Nomad performs in-place updates when datacenter is changed. #10746

Closed
angrycub opened this issue Jun 11, 2021 · 2 comments · Fixed by #10864
Labels
stage/accepted (Confirmed, and intend to work on. No timeline commitment though.), type/bug

Comments

@angrycub
Contributor

Nomad version

Nomad v1.1.1 (7feec97)
Reproduced in Nomad 0.11.8 as well

Issue

Nomad always performs an in-place update if datacenter is the only change to a job. This can make an allocation misreport the datacenter it is in, and can allow a user to update a job to an invalid datacenter. These changes would only be seen if the allocations were restarted.

Reproduction steps

Start a nomad dev agent

nomad agent -dev

Run e1.nomad

nomad run e1.nomad
==> Monitoring evaluation "358a1582"
    Evaluation triggered by job "example"
    Allocation "ec6e2e28" created: node "f7bc1f2d", group "cache"
==> Monitoring evaluation "358a1582"
    Evaluation within deployment: "64221904"
    Allocation "ec6e2e28" status changed: "pending" -> "running" (Tasks are running)
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "358a1582" finished with status "complete"
nomad status example 
Elided output:
ID            = example
Datacenters   = dc1
Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created  Modified
ec6e2e28  f7bc1f2d  cache       0        run      running  49s ago  38s ago
ID            = example
Name          = example
Submit Date   = 2021-06-11T09:10:07-04:00
Type          = service
Priority      = 50
Datacenters   = dc1
Namespace     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
cache       0       0         1        0       0         0

Latest Deployment
ID          = 64221904
Status      = successful
Description = Deployment completed successfully

Deployed
Task Group  Desired  Placed  Healthy  Unhealthy  Progress Deadline
cache       1        1       1        0          2021-06-11T09:20:17-04:00

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created  Modified
ec6e2e28  f7bc1f2d  cache       0        run      running  49s ago  38s ago

Plan e2.nomad

Notice that it says it's going to change the datacenter via an in-place update.

nomad plan e2.nomad
+/- Job: "example"
+/- Datacenters {
  + Datacenters: "🤯"
  - Datacenters: "dc1"
    }
    Task Group: "cache" (1 in-place update)
      Task: "redis"

Scheduler dry-run:
- All tasks successfully allocated.

Job Modify Index: 209741
To submit the job with version verification run:

nomad job run -check-index 209741 e2.nomad

When running the job with the check-index flag, the job will only be run if the
job modify index given matches the server-side version. If the index has
changed, another user has modified the job and the plan's results are
potentially invalid.

Run e2.nomad

nomad run e2.nomad
==> Monitoring evaluation "e68ee2dd"
    Evaluation triggered by job "example"
    Allocation "ec6e2e28" modified: node "f7bc1f2d", group "cache"
==> Monitoring evaluation "e68ee2dd"
    Evaluation within deployment: "3fdfd618"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "e68ee2dd" finished with status "complete"
nomad status example      

Allocation says that it's v1 now.

Elided output:
ID            = example
Datacenters   = 🤯
Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created    Modified
ec6e2e28  f7bc1f2d  cache       1        run      running  1m18s ago  29s ago
ID            = example
Name          = example
Submit Date   = 2021-06-11T09:10:55-04:00
Type          = service
Priority      = 50
Datacenters   = 🤯
Namespace     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
cache       0       0         1        0       0         0

Latest Deployment
ID          = 3fdfd618
Status      = running
Description = Deployment is running

Deployed
Task Group  Desired  Placed  Healthy  Unhealthy  Progress Deadline
cache       1        1       0        0          2021-06-11T09:20:55-04:00

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created    Modified
ec6e2e28  f7bc1f2d  cache       1        run      running  1m18s ago  29s ago
If you stop the alloc...
nomad alloc stop ec6e2e28
==> Monitoring evaluation "884146c3"
    Evaluation triggered by job "example"
==> Monitoring evaluation "884146c3"
    Evaluation within deployment: "3fdfd618"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "884146c3" finished with status "complete" but failed to place all allocations:
    Task Group "cache" (failed to place 1 allocation):
      * No nodes were eligible for evaluation
      * No nodes are available in datacenter "🤯"
    Evaluation "3b863d97" waiting for additional capacity to place remainder

Expected Result

Because "🤯" was invalid in my case (I also tested with an invalid plain-text datacenter identifier), I expected either an error or a create/destroy update that then became a blocked alloc.

Actual Result

Nomad performed an in-place update: the allocation reported version 1 and the new (invalid) datacenter without being rescheduled, and placement only failed once the allocation was stopped.

Job files

e1.nomad

job "example" {
  datacenters = ["dc1"]

  group "cache" {
    task "redis" {
      driver = "docker"

      config {
        image = "redis:3.2"
      }
    }
  }
}

e2.nomad

job "example" {
  datacenters = ["🤯"]

  group "cache" {
    task "redis" {
      driver = "docker"

      config {
        image = "redis:3.2"
      }
    }
  }
}
@notnoop added the stage/accepted (Confirmed, and intend to work on. No timeline commitment though.) label on Jun 15, 2021
@tgross
Member

tgross commented Jul 7, 2021

I've tested this against a cluster with 3 DCs and found that even if we update the job to a DC that exists, the update still happens in-place and the allocation isn't moved.

$ nomad node status
ID        DC   Name   Class   Drain  Eligibility  Status
285e9280  dc3  test3  <none>  false  eligible     ready
59b67f1e  dc2  test1  <none>  false  eligible     ready
44b88859  dc1  test2  <none>  false  eligible     ready

$ nomad job status ex
...
Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created  Modified
048bf0b2  44b88859  web         0        run      running  26s ago  15s ago

$ nomad plan example.nomad
+/- Job: "example"
+/- Datacenters {
  + Datacenters: "dc2"
  - Datacenters: "dc1"
    }
+/- Task Group: "web" (1 in-place update)
  +/- Task: "httpd" (forces in-place update)
...

$ nomad job run ./example.nomad
==> 2021-07-07T09:31:44-04:00: Monitoring evaluation "1d14c0d9"
    2021-07-07T09:31:44-04:00: Evaluation triggered by job "example"
==> 2021-07-07T09:31:45-04:00: Monitoring evaluation "1d14c0d9"
    2021-07-07T09:31:45-04:00: Evaluation within deployment: "abc2a176"
    2021-07-07T09:31:45-04:00: Allocation "048bf0b2" modified: node "44b88859", group "web"
...

$ nomad job status ex
...
Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created   Modified
048bf0b2  44b88859  web         1        run      running  1m6s ago  13s ago

One of the tricky bits here is that the tasksUpdated function in the scheduler checks the task group and not the allocation, so it has no knowledge of what datacenters a task group is currently using in the case where there are multiple.
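
For reference, here is a minimal, self-contained sketch (not the Nomad source) of the set comparison the patch below performs via helper.CompareSliceSetString: it can detect that the datacenters set has changed, but it says nothing about which datacenter a given allocation is actually running in.

package main

import "fmt"

// compareSliceSetString is a simplified stand-in for helper.CompareSliceSetString:
// it reports whether a and b contain the same set of strings, ignoring order.
func compareSliceSetString(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	set := make(map[string]struct{}, len(a))
	for _, s := range a {
		set[s] = struct{}{}
	}
	for _, s := range b {
		if _, ok := set[s]; !ok {
			return false
		}
	}
	return true
}

func main() {
	oldDCs := []string{"dc1", "dc2", "dc3"}
	newDCs := []string{"dc1", "dc2"}
	// The set changed, so a check based only on this comparison forces a
	// destructive update, even if the allocation is already running in dc2.
	fmt.Println(compareSliceSetString(oldDCs, newDCs)) // false
}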

The following patch triggers a destructive update if any member of the datacenters field changes:

diff --git a/scheduler/util.go b/scheduler/util.go
index 3bf944373..b9a220acc 100644
--- a/scheduler/util.go
+++ b/scheduler/util.go
@@ -7,6 +7,7 @@ import (

        log "github.com/hashicorp/go-hclog"
        memdb "github.com/hashicorp/go-memdb"
+       "github.com/hashicorp/nomad/helper"
        "github.com/hashicorp/nomad/nomad/structs"
 )

@@ -346,6 +347,10 @@ func tasksUpdated(jobA, jobB *structs.Job, taskGroup string) bool {
        a := jobA.LookupTaskGroup(taskGroup)
        b := jobB.LookupTaskGroup(taskGroup)

+       if !helper.CompareSliceSetString(jobA.Datacenters, jobB.Datacenters) {
+               return true
+       }
+
        // If the number of tasks do not match, clearly there is an update
        if len(a.Tasks) != len(b.Tasks) {
                return true

But unfortunately that will cause unnecessary destructive allocation updates. For example, if our job starts with datacenters = ["dc1", "dc2", "dc3"] and is placed in "dc2":

$ nomad node status
ID        DC   Name   Class   Drain  Eligibility  Status
54af0727  dc3  test3  <none>  false  eligible     ready
23f73009  dc1  test2  <none>  false  eligible     ready
f3029061  dc2  test1  <none>  false  eligible     ready

$ nomad job status ex
Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created  Modified
5954f0f4  f3029061  web         0        run      running  20s ago  9s ago

We can update to datacenters = ["dc1", "dc2"], which should not cause a new allocation to be created, but it does:

$ nomad plan example.nomad
+/- Job: "example"
-   Datacenters {
      Datacenters: "dc1"
      Datacenters: "dc2"
    - Datacenters: "dc3"
    }
+/- Task Group: "web" (1 create/destroy update)
  +/- Task: "httpd" (forces in-place update)
...

$ nomad job run ./example.nomad
==> 2021-07-07T09:59:24-04:00: Monitoring evaluation "c6f62163"
    2021-07-07T09:59:24-04:00: Evaluation triggered by job "example"
==> 2021-07-07T09:59:25-04:00: Monitoring evaluation "c6f62163"
    2021-07-07T09:59:25-04:00: Evaluation within deployment: "1a86e02b"
    2021-07-07T09:59:25-04:00: Allocation "2f206a34" created: node "23f73009", group "web"
...

$ nomad job status ex
...
Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created    Modified
2f206a34  23f73009  web         1        run      running   25s ago    7s ago
5954f0f4  f3029061  web         0        stop     complete  2m27s ago  19s ago

But if we hack in a check of the datacenters in the calling functions:

diff --git a/scheduler/util.go b/scheduler/util.go
index 3bf944373..865803720 100644
--- a/scheduler/util.go
+++ b/scheduler/util.go
@@ -7,6 +7,7 @@ import (

        log "github.com/hashicorp/go-hclog"
        memdb "github.com/hashicorp/go-memdb"
+       "github.com/hashicorp/nomad/helper"
        "github.com/hashicorp/nomad/nomad/structs"
 )

@@ -700,6 +701,11 @@ func inplaceUpdate(ctx Context, eval *structs.Evaluation, job *structs.Job,
                        continue
                }

+               // The alloc is on a node that's now in an ineligible DC
+               if !helper.SliceStringContains(job.Datacenters, node.Datacenter) {
+                       continue
+               }
+
                // Set the existing node as the base set
                stack.SetNodes([]*structs.Node{node})

@@ -983,6 +989,11 @@ func genericAllocUpdateFn(ctx Context, stack Stack, evalID string) allocUpdateTy
                        return false, true, nil
                }

+               // The alloc is on a node that's now in an ineligible DC
+               if !helper.SliceStringContains(newJob.Datacenters, node.Datacenter) {
+                       return false, true, nil
+               }
+
                // Set the existing node as the base set
                stack.SetNodes([]*structs.Node{node})

Let's try that one:

$ nomad node status
ID        DC   Name   Class   Drain  Eligibility  Status
97e8087e  dc3  test3  <none>  false  eligible     ready
3912c593  dc2  test1  <none>  false  eligible     ready
905192cc  dc1  test2  <none>  false  eligible     ready

$ nomad job run ./example.nomad
==> 2021-07-07T10:13:17-04:00: Monitoring evaluation "fb5339f2"
    2021-07-07T10:13:17-04:00: Evaluation triggered by job "example"
==> 2021-07-07T10:13:18-04:00: Monitoring evaluation "fb5339f2"
    2021-07-07T10:13:18-04:00: Evaluation within deployment: "12549d63"
    2021-07-07T10:13:18-04:00: Allocation "72c9b99f" created: node "3912c593", group "web"

The alloc is in "dc2", so if we change to datacenters = ["dc1", "dc2"], we get an in-place update as we'd expect:

$ nomad plan example.nomad
+/- Job: "example"
-   Datacenters {
      Datacenters: "dc1"
      Datacenters: "dc2"
    - Datacenters: "dc3"
    }
+/- Task Group: "web" (1 in-place update)
  +/- Task: "httpd" (forces in-place update)

But if we change to datacenters = ["dc1", "dc3"], we get the destructive update we expect:

$ nomad plan example.nomad
+/- Job: "example"
-   Datacenters {
      Datacenters: "dc1"
    - Datacenters: "dc2"
      Datacenters: "dc3"
    }
+/- Task Group: "web" (1 create/destroy update)
  +/- Task: "httpd" (forces in-place update)

I'm not wild about where that function lands in terms of the design of the scheduler and testability, but I'll open a PR with this patch for further discussion of those details.

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions bot locked as resolved and limited conversation to collaborators on Oct 17, 2022