
Job update always leads to allocation migration (node-reschedule-penalty) #5856

Closed
capone212 opened this issue Jun 19, 2019 · 11 comments · Fixed by #6781

Comments

@capone212
Contributor

capone212 commented Jun 19, 2019

Nomad version

Output from nomad version
Nomad v0.9.3

Operating system and Environment details

3 test Linux boxes in client+server mode

Issue

After changing a single Meta or env var for a single task in the job file, Nomad re-schedules the whole job, with several tasks, to another node. I think this behavior is new (compared to older versions) and suboptimal.

Digging with the console tool, I found the following. Originally the job, with a single task group and 2 tasks in the group, was allocated on node 4aae63e4. Then I changed a single env var value and ran nomad job plan:

nomad job plan --verbose job.hcl 

+/- Job: "test.DeviceIpint"
+/- Task Group: "default" (1 create/destroy update)
  +/- Task: "DeviceIpint.1" (forces create/destroy update)
    +/- Env[RestartTag]: "tag2" => "tag3"
      Task: "VideDecoder.2"

Then I updated the job with nomad job run job.hcl. As a result, Nomad moved the allocation to another node (9e3a3d78).

To understand why, I looked at the alloc status, and it seems it's due to node-reschedule-penalty.

nomad alloc status -verbose <alloc_id>

---skipped some info----
Placement Metrics
Node                                  node-reschedule-penalty  node-affinity  binpack  job-anti-affinity  final score
9e3a3d78-261f-2b52-ddc8-4a770196a325  0                        0              0.498    0                  0.498
3b36f877-4790-bc64-84a5-6e74d0cd4167  0                        0              0.403    0                  0.403
4aae63e4-6614-2e55-0168-c1b12cc992df  -1                       0              0.417    0                  -0.292
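
For reference, the final score column appears to be the average of a node's non-zero component scores; that's an inference from the rows above, not documented behavior. A quick check of the arithmetic for node 4aae63e4:

package main

import "fmt"

// finalScore averages the non-zero scoring components for a node. The
// formula is inferred from the placement metrics table above, not taken
// from Nomad documentation.
func finalScore(components map[string]float64) float64 {
	sum, n := 0.0, 0
	for _, s := range components {
		if s != 0 {
			sum += s
			n++
		}
	}
	if n == 0 {
		return 0
	}
	return sum / float64(n)
}

func main() {
	// Components for node 4aae63e4 from the table above.
	score := finalScore(map[string]float64{
		"binpack":                 0.417,
		"node-reschedule-penalty": -1,
	})
	fmt.Printf("%.4f\n", score) // -0.2915, shown rounded as -0.292 above
}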

I think node-reschedule-penalty should not be taken into account in this case (because the task is not failing); it is also clear that restarting the allocation on the same node is more appropriate here than migrating to another node.

I would like Nomad to try to place allocations on the same server by default. Is there any flag or trick I can use to force Nomad not to move allocations?

@Dirrk

Dirrk commented Jun 19, 2019

job "docs" {
  group "example" {
    ephemeral_disk {
      migrate = true
      size    = "500"
      sticky  = true
    }
  }
}

I attach ephemeral_disk to jobs that I want to stay on the same host. From the docs:

Specifies that Nomad should make a best-effort attempt to place the updated allocation on the same machine. This will move the local/ and alloc/data directories to the new allocation.

https://www.nomadproject.io/docs/job-specification/ephemeral_disk.html

@preetapan
Contributor

@capone212 You can add an affinity or constraint on the set of nodes you want the job to run on. I would recommend against pinning to the same node using a constraint, because then if that node goes down the job can't run anywhere else until it's back.

As for the node-reschedule-penalty - could you share the output of nomad alloc status -json for all three allocs? It should only add a penalty if a previous alloc for the same job failed on that node. The JSON output should help us debug further.

@capone212
Contributor Author

Hi @preetapan, thanks for your response!
Please find the requested info at the following link: https://gist.github.com/capone212/280bb0d9cfdd11298eae8aed75fc0700

Please let me know if you need anything else. Note that I use tasks with a custom external driver implemented using the plugin interface.

@stale

stale bot commented Sep 18, 2019

Hey there

Since this issue hasn't had any activity in a while, we're going to automatically close it in 30 days. If you're still seeing this issue with the latest version of Nomad, please respond here and we'll keep this open and take another look at this.

Thanks!

@capone212
Contributor Author

Still actual

@kdsnice

kdsnice commented Oct 29, 2019

Still actual with Nomad v0.9.5

@tgross tgross added this to Needs Triage in Nomad - Community Issues Triage via automation Oct 29, 2019
@tgross tgross self-assigned this Nov 25, 2019
@tgross
Member

tgross commented Nov 25, 2019

I've verified this behavior on 0.10.2-rc1 as well. Reproduction steps on a cluster using our e2e setup (4 client nodes):

▶ nomad job init -short
▶ nomad job run example.nomad
==> Monitoring evaluation "85ea5dee"
    Evaluation triggered by job "example"
    Evaluation within deployment: "1cc9eeb6"
    Allocation "7b881420" created: node "ff8ed4ac", group "cache"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "85ea5dee" finished with status "complete"

▶ nomad job status example
...
Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created  Modified
7b881420  ff8ed4ac  cache       0        run      running  6s ago   1s ago

# Edit the job file to add an `env { version = "1" }` stanza to the task
▶ emacs example.nomad
...

▶ nomad job run example.nomad
==> Monitoring evaluation "89db0813"
    Evaluation triggered by job "example"
    Evaluation within deployment: "fc89ecc2"
    Allocation "686d1da1" created: node "1c61fbb7", group "cache"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "89db0813" finished with status "complete"

# note we've landed on 2 different nodes
▶ nomad job status example
...
Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created  Modified
686d1da1  1c61fbb7  cache       1        run      running   5s ago   1s ago
7b881420  ff8ed4ac  cache       0        stop     complete  37s ago  4s ago

# previous node has node-reschedule-penalty set
▶ nomad alloc status -verbose 686d1da1
ID                  = 686d1da1-cbfb-251c-b9e2-ccb85b4c8bf3
Eval ID             = 89db0813-e218-4ca4-ec6f-69e55dd7ce86
Name                = example.cache[0]
Node ID             = 1c61fbb7-df88-1737-c03f-7221fb8331eb
Node Name           = ip-172-31-29-84
Job ID              = example
Job Version         = 1
Client Status       = running
Client Description  = Tasks are running
Desired Status      = run
Desired Description = <none>
Created             = 2019-11-25T09:05:35-05:00
Modified            = 2019-11-25T09:05:49-05:00
Deployment ID       = fc89ecc2-a05b-4fa2-26d6-1f350b0d6ce0
Deployment Health   = healthy
Evaluated Nodes     = 2
Filtered Nodes      = 0
Exhausted Nodes     = 0
Allocation Time     = 72.165µs
Failures            = 0

Task "redis" is "running"
Task Resources
CPU        Memory           Disk     Addresses
2/500 MHz  6.3 MiB/256 MiB  300 MiB  db: 172.31.29.84:27806

Task Events:
Started At     = 2019-11-25T14:05:39Z
Finished At    = N/A
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                       Type        Description
2019-11-25T09:05:39-05:00  Started     Task started by client
2019-11-25T09:05:36-05:00  Driver      Downloading image
2019-11-25T09:05:36-05:00  Task Setup  Building Task Directory
2019-11-25T09:05:35-05:00  Received    Task received by client

Placement Metrics
Node                                  binpack  job-anti-affinity  node-affinity  node-reschedule-penalty  final score
1c61fbb7-df88-1737-c03f-7221fb8331eb  0.2      0                  0              0                        0.2
ff8ed4ac-6b01-f8e2-490c-8c3016a4745f  0.2      0                  0              -1                       -0.4
alloc 7b881420 JSON
{
  "AllocModifyIndex": 69,
  "AllocatedResources": {
    "Shared": {
      "DiskMB": 300,
      "Networks": null
    },
    "Tasks": {
      "redis": {
        "Cpu": {
          "CpuShares": 500
        },
        "Memory": {
          "MemoryMB": 256
        },
        "Networks": [
          {
            "CIDR": "",
            "Device": "eth0",
            "DynamicPorts": [
              {
                "Label": "db",
                "To": 0,
                "Value": 21304
              }
            ],
            "IP": "172.31.25.125",
            "MBits": 10,
            "Mode": "",
            "ReservedPorts": null
          }
        ]
      }
    }
  },
  "ClientDescription": "All tasks have completed",
  "ClientStatus": "complete",
  "CreateIndex": 58,
  "CreateTime": 1574690702968629200,
  "DeploymentID": "1cc9eeb6-b553-7d09-2686-e61f6478f85c",
  "DeploymentStatus": {
    "Canary": false,
    "Healthy": true,
    "ModifyIndex": 63,
    "Timestamp": "2019-11-25T14:05:17.313989414Z"
  },
  "DesiredDescription": "alloc is being updated due to job update",
  "DesiredStatus": "stop",
  "DesiredTransition": {
    "Migrate": null,
    "Reschedule": null
  },
  "EvalID": "85ea5dee-ab03-3bd2-3678-a4ea4ebae49a",
  "FollowupEvalID": "",
  "ID": "7b881420-729f-6fec-7d76-e43dd86d0352",
  "Job": {
    "Affinities": null,
    "AllAtOnce": false,
    "Constraints": null,
    "CreateIndex": 56,
    "Datacenters": [
      "dc1"
    ],
    "Dispatched": false,
    "ID": "example",
    "JobModifyIndex": 56,
    "Meta": null,
    "Migrate": null,
    "ModifyIndex": 57,
    "Name": "example",
    "Namespace": "default",
    "ParameterizedJob": null,
    "ParentID": "",
    "Payload": null,
    "Periodic": null,
    "Priority": 50,
    "Region": "global",
    "Reschedule": null,
    "Spreads": null,
    "Stable": false,
    "Status": "pending",
    "StatusDescription": "",
    "Stop": false,
    "SubmitTime": 1574690702955314200,
    "TaskGroups": [
      {
        "Affinities": null,
        "Constraints": null,
        "Count": 1,
        "EphemeralDisk": {
          "Migrate": false,
          "SizeMB": 300,
          "Sticky": false
        },
        "Meta": null,
        "Migrate": {
          "HealthCheck": "checks",
          "HealthyDeadline": 300000000000,
          "MaxParallel": 1,
          "MinHealthyTime": 10000000000
        },
        "Name": "cache",
        "Networks": null,
        "ReschedulePolicy": {
          "Attempts": 0,
          "Delay": 30000000000,
          "DelayFunction": "exponential",
          "Interval": 0,
          "MaxDelay": 3600000000000,
          "Unlimited": true
        },
        "RestartPolicy": {
          "Attempts": 2,
          "Delay": 15000000000,
          "Interval": 1800000000000,
          "Mode": "fail"
        },
        "Services": null,
        "Spreads": null,
        "Tasks": [
          {
            "Affinities": null,
            "Artifacts": null,
            "Config": {
              "image": "redis:3.2",
              "port_map": [
                {
                  "db": 6379
                }
              ]
            },
            "Constraints": null,
            "DispatchPayload": null,
            "Driver": "docker",
            "Env": {
              "version": "0"
            },
            "KillSignal": "",
            "KillTimeout": 5000000000,
            "Kind": "",
            "Leader": false,
            "LogConfig": {
              "MaxFileSizeMB": 10,
              "MaxFiles": 10
            },
            "Meta": null,
            "Name": "redis",
            "Resources": {
              "CPU": 500,
              "Devices": null,
              "DiskMB": 0,
              "IOPS": 0,
              "MemoryMB": 256,
              "Networks": [
                {
                  "CIDR": "",
                  "Device": "",
                  "DynamicPorts": [
                    {
                      "Label": "db",
                      "To": 0,
                      "Value": 0
                    }
                  ],
                  "IP": "",
                  "MBits": 10,
                  "Mode": "",
                  "ReservedPorts": null
                }
              ]
            },
            "Services": null,
            "ShutdownDelay": 0,
            "Templates": null,
            "User": "",
            "Vault": null,
            "VolumeMounts": null
          }
        ],
        "Update": {
          "AutoPromote": false,
          "AutoRevert": false,
          "Canary": 0,
          "HealthCheck": "checks",
          "HealthyDeadline": 300000000000,
          "MaxParallel": 1,
          "MinHealthyTime": 10000000000,
          "ProgressDeadline": 600000000000,
          "Stagger": 30000000000
        },
        "Volumes": null
      }
    ],
    "Type": "service",
    "Update": {
      "AutoPromote": false,
      "AutoRevert": false,
      "Canary": 0,
      "HealthCheck": "",
      "HealthyDeadline": 0,
      "MaxParallel": 1,
      "MinHealthyTime": 0,
      "ProgressDeadline": 0,
      "Stagger": 30000000000
    },
    "VaultToken": "",
    "Version": 0
  },
  "JobID": "example",
  "Metrics": {
    "AllocationTime": 106081,
    "ClassExhausted": null,
    "ClassFiltered": null,
    "CoalescedFailures": 0,
    "ConstraintFiltered": null,
    "DimensionExhausted": null,
    "NodesAvailable": {
      "dc1": 2
    },
    "NodesEvaluated": 2,
    "NodesExhausted": 0,
    "NodesFiltered": 0,
    "QuotaExhausted": null,
    "ScoreMetaData": [
      {
        "NodeID": "1c61fbb7-df88-1737-c03f-7221fb8331eb",
        "NormScore": 0.20002653302990245,
        "Scores": {
          "job-anti-affinity": 0,
          "node-reschedule-penalty": 0,
          "node-affinity": 0,
          "binpack": 0.20002653302990245
        }
      },
      {
        "NodeID": "ff8ed4ac-6b01-f8e2-490c-8c3016a4745f",
        "NormScore": 0.20002653302990245,
        "Scores": {
          "job-anti-affinity": 0,
          "node-reschedule-penalty": 0,
          "node-affinity": 0,
          "binpack": 0.20002653302990245
        }
      }
    ],
    "Scores": null
  },
  "ModifyIndex": 72,
  "ModifyTime": 1574690735900604400,
  "Name": "example.cache[0]",
  "Namespace": "default",
  "NextAllocation": "686d1da1-cbfb-251c-b9e2-ccb85b4c8bf3",
  "NodeID": "ff8ed4ac-6b01-f8e2-490c-8c3016a4745f",
  "NodeName": "ip-172-31-25-125",
  "PreemptedAllocations": null,
  "PreemptedByAllocation": "",
  "PreviousAllocation": "",
  "RescheduleTracker": null,
  "Resources": {
    "CPU": 500,
    "Devices": null,
    "DiskMB": 300,
    "IOPS": 0,
    "MemoryMB": 256,
    "Networks": [
      {
        "CIDR": "",
        "Device": "eth0",
        "DynamicPorts": [
          {
            "Label": "db",
            "To": 0,
            "Value": 21304
          }
        ],
        "IP": "172.31.25.125",
        "MBits": 10,
        "Mode": "",
        "ReservedPorts": null
      }
    ]
  },
  "Services": null,
  "TaskGroup": "cache",
  "TaskResources": {
    "redis": {
      "CPU": 500,
      "Devices": null,
      "DiskMB": 0,
      "IOPS": 0,
      "MemoryMB": 256,
      "Networks": [
        {
          "CIDR": "",
          "Device": "eth0",
          "DynamicPorts": [
            {
              "Label": "db",
              "To": 0,
              "Value": 21304
            }
          ],
          "IP": "172.31.25.125",
          "MBits": 10,
          "Mode": "",
          "ReservedPorts": null
        }
      ]
    }
  },
  "TaskStates": {
    "redis": {
      "Events": [
        {
          "Details": {},
          "DiskLimit": 0,
          "DiskSize": 0,
          "DisplayMessage": "Task received by client",
          "DownloadError": "",
          "DriverError": "",
          "DriverMessage": "",
          "ExitCode": 0,
          "FailedSibling": "",
          "FailsTask": false,
          "GenericSource": "",
          "KillError": "",
          "KillReason": "",
          "KillTimeout": 0,
          "Message": "",
          "RestartReason": "",
          "SetupError": "",
          "Signal": 0,
          "StartDelay": 0,
          "TaskSignal": "",
          "TaskSignalReason": "",
          "Time": 1574690702985412600,
          "Type": "Received",
          "ValidationError": "",
          "VaultError": ""
        },
        {
          "Details": {
            "message": "Building Task Directory"
          },
          "DiskLimit": 0,
          "DiskSize": 0,
          "DisplayMessage": "Building Task Directory",
          "DownloadError": "",
          "DriverError": "",
          "DriverMessage": "",
          "ExitCode": 0,
          "FailedSibling": "",
          "FailsTask": false,
          "GenericSource": "",
          "KillError": "",
          "KillReason": "",
          "KillTimeout": 0,
          "Message": "Building Task Directory",
          "RestartReason": "",
          "SetupError": "",
          "Signal": 0,
          "StartDelay": 0,
          "TaskSignal": "",
          "TaskSignalReason": "",
          "Time": 1574690702991575300,
          "Type": "Task Setup",
          "ValidationError": "",
          "VaultError": ""
        },
        {
          "Details": {
            "image": "redis:3.2"
          },
          "DiskLimit": 0,
          "DiskSize": 0,
          "DisplayMessage": "Downloading image",
          "DownloadError": "",
          "DriverError": "",
          "DriverMessage": "Downloading image",
          "ExitCode": 0,
          "FailedSibling": "",
          "FailsTask": false,
          "GenericSource": "",
          "KillError": "",
          "KillReason": "",
          "KillTimeout": 0,
          "Message": "",
          "RestartReason": "",
          "SetupError": "",
          "Signal": 0,
          "StartDelay": 0,
          "TaskSignal": "",
          "TaskSignalReason": "",
          "Time": 1574690703021879300,
          "Type": "Driver",
          "ValidationError": "",
          "VaultError": ""
        },
        {
          "Details": {},
          "DiskLimit": 0,
          "DiskSize": 0,
          "DisplayMessage": "Task started by client",
          "DownloadError": "",
          "DriverError": "",
          "DriverMessage": "",
          "ExitCode": 0,
          "FailedSibling": "",
          "FailsTask": false,
          "GenericSource": "",
          "KillError": "",
          "KillReason": "",
          "KillTimeout": 0,
          "Message": "",
          "RestartReason": "",
          "SetupError": "",
          "Signal": 0,
          "StartDelay": 0,
          "TaskSignal": "",
          "TaskSignalReason": "",
          "Time": 1574690707311462700,
          "Type": "Started",
          "ValidationError": "",
          "VaultError": ""
        },
        {
          "Details": {
            "kill_timeout": "5s"
          },
          "DiskLimit": 0,
          "DiskSize": 0,
          "DisplayMessage": "Sent interrupt. Waiting 5s before force killing",
          "DownloadError": "",
          "DriverError": "",
          "DriverMessage": "",
          "ExitCode": 0,
          "FailedSibling": "",
          "FailsTask": false,
          "GenericSource": "",
          "KillError": "",
          "KillReason": "",
          "KillTimeout": 5000000000,
          "Message": "",
          "RestartReason": "",
          "SetupError": "",
          "Signal": 0,
          "StartDelay": 0,
          "TaskSignal": "",
          "TaskSignalReason": "",
          "Time": 1574690735483180800,
          "Type": "Killing",
          "ValidationError": "",
          "VaultError": ""
        },
        {
          "Details": {
            "oom_killed": "false",
            "exit_code": "0",
            "signal": "0"
          },
          "DiskLimit": 0,
          "DiskSize": 0,
          "DisplayMessage": "Exit Code: 0",
          "DownloadError": "",
          "DriverError": "",
          "DriverMessage": "",
          "ExitCode": 0,
          "FailedSibling": "",
          "FailsTask": false,
          "GenericSource": "",
          "KillError": "",
          "KillReason": "",
          "KillTimeout": 0,
          "Message": "",
          "RestartReason": "",
          "SetupError": "",
          "Signal": 0,
          "StartDelay": 0,
          "TaskSignal": "",
          "TaskSignalReason": "",
          "Time": 1574690735795506400,
          "Type": "Terminated",
          "ValidationError": "",
          "VaultError": ""
        },
        {
          "Details": {},
          "DiskLimit": 0,
          "DiskSize": 0,
          "DisplayMessage": "Task successfully killed",
          "DownloadError": "",
          "DriverError": "",
          "DriverMessage": "",
          "ExitCode": 0,
          "FailedSibling": "",
          "FailsTask": false,
          "GenericSource": "",
          "KillError": "",
          "KillReason": "",
          "KillTimeout": 0,
          "Message": "",
          "RestartReason": "",
          "SetupError": "",
          "Signal": 0,
          "StartDelay": 0,
          "TaskSignal": "",
          "TaskSignalReason": "",
          "Time": 1574690735811976700,
          "Type": "Killed",
          "ValidationError": "",
          "VaultError": ""
        }
      ],
      "Failed": false,
      "FinishedAt": "2019-11-25T14:05:35.819294074Z",
      "LastRestart": null,
      "Restarts": 0,
      "StartedAt": "2019-11-25T14:05:07.311467296Z",
      "State": "dead"
    }
  }
}
alloc 686d1da1 JSON
{
  "AllocModifyIndex": 69,
  "AllocatedResources": {
    "Shared": {
      "DiskMB": 300,
      "Networks": null
    },
    "Tasks": {
      "redis": {
        "Cpu": {
          "CpuShares": 500
        },
        "Memory": {
          "MemoryMB": 256
        },
        "Networks": [
          {
            "CIDR": "",
            "Device": "eth0",
            "DynamicPorts": [
              {
                "Label": "db",
                "To": 0,
                "Value": 27806
              }
            ],
            "IP": "172.31.29.84",
            "MBits": 10,
            "Mode": "",
            "ReservedPorts": null
          }
        ]
      }
    }
  },
  "ClientDescription": "Tasks are running",
  "ClientStatus": "running",
  "CreateIndex": 69,
  "CreateTime": 1574690735466495500,
  "DeploymentID": "fc89ecc2-a05b-4fa2-26d6-1f350b0d6ce0",
  "DeploymentStatus": {
    "Canary": false,
    "Healthy": true,
    "ModifyIndex": 76,
    "Timestamp": "2019-11-25T14:05:49.356320114Z"
  },
  "DesiredDescription": "",
  "DesiredStatus": "run",
  "DesiredTransition": {
    "Migrate": null,
    "Reschedule": null
  },
  "EvalID": "89db0813-e218-4ca4-ec6f-69e55dd7ce86",
  "FollowupEvalID": "",
  "ID": "686d1da1-cbfb-251c-b9e2-ccb85b4c8bf3",
  "Job": {
    "Affinities": null,
    "AllAtOnce": false,
    "Constraints": null,
    "CreateIndex": 56,
    "Datacenters": [
      "dc1"
    ],
    "Dispatched": false,
    "ID": "example",
    "JobModifyIndex": 67,
    "Meta": null,
    "Migrate": null,
    "ModifyIndex": 67,
    "Name": "example",
    "Namespace": "default",
    "ParameterizedJob": null,
    "ParentID": "",
    "Payload": null,
    "Periodic": null,
    "Priority": 50,
    "Region": "global",
    "Reschedule": null,
    "Spreads": null,
    "Stable": false,
    "Status": "running",
    "StatusDescription": "",
    "Stop": false,
    "SubmitTime": 1574690735115202300,
    "TaskGroups": [
      {
        "Affinities": null,
        "Constraints": null,
        "Count": 1,
        "EphemeralDisk": {
          "Migrate": false,
          "SizeMB": 300,
          "Sticky": false
        },
        "Meta": null,
        "Migrate": {
          "HealthCheck": "checks",
          "HealthyDeadline": 300000000000,
          "MaxParallel": 1,
          "MinHealthyTime": 10000000000
        },
        "Name": "cache",
        "Networks": null,
        "ReschedulePolicy": {
          "Attempts": 0,
          "Delay": 30000000000,
          "DelayFunction": "exponential",
          "Interval": 0,
          "MaxDelay": 3600000000000,
          "Unlimited": true
        },
        "RestartPolicy": {
          "Attempts": 2,
          "Delay": 15000000000,
          "Interval": 1800000000000,
          "Mode": "fail"
        },
        "Services": null,
        "Spreads": null,
        "Tasks": [
          {
            "Affinities": null,
            "Artifacts": null,
            "Config": {
              "image": "redis:3.2",
              "port_map": [
                {
                  "db": 6379
                }
              ]
            },
            "Constraints": null,
            "DispatchPayload": null,
            "Driver": "docker",
            "Env": {
              "version": "1"
            },
            "KillSignal": "",
            "KillTimeout": 5000000000,
            "Kind": "",
            "Leader": false,
            "LogConfig": {
              "MaxFileSizeMB": 10,
              "MaxFiles": 10
            },
            "Meta": null,
            "Name": "redis",
            "Resources": {
              "CPU": 500,
              "Devices": null,
              "DiskMB": 0,
              "IOPS": 0,
              "MemoryMB": 256,
              "Networks": [
                {
                  "CIDR": "",
                  "Device": "",
                  "DynamicPorts": [
                    {
                      "Label": "db",
                      "To": 0,
                      "Value": 0
                    }
                  ],
                  "IP": "",
                  "MBits": 10,
                  "Mode": "",
                  "ReservedPorts": null
                }
              ]
            },
            "Services": null,
            "ShutdownDelay": 0,
            "Templates": null,
            "User": "",
            "Vault": null,
            "VolumeMounts": null
          }
        ],
        "Update": {
          "AutoPromote": false,
          "AutoRevert": false,
          "Canary": 0,
          "HealthCheck": "checks",
          "HealthyDeadline": 300000000000,
          "MaxParallel": 1,
          "MinHealthyTime": 10000000000,
          "ProgressDeadline": 600000000000,
          "Stagger": 30000000000
        },
        "Volumes": null
      }
    ],
    "Type": "service",
    "Update": {
      "AutoPromote": false,
      "AutoRevert": false,
      "Canary": 0,
      "HealthCheck": "",
      "HealthyDeadline": 0,
      "MaxParallel": 1,
      "MinHealthyTime": 0,
      "ProgressDeadline": 0,
      "Stagger": 30000000000
    },
    "VaultToken": "",
    "Version": 1
  },
  "JobID": "example",
  "Metrics": {
    "AllocationTime": 72165,
    "ClassExhausted": null,
    "ClassFiltered": null,
    "CoalescedFailures": 0,
    "ConstraintFiltered": null,
    "DimensionExhausted": null,
    "NodesAvailable": {
      "dc1": 2
    },
    "NodesEvaluated": 2,
    "NodesExhausted": 0,
    "NodesFiltered": 0,
    "QuotaExhausted": null,
    "ScoreMetaData": [
      {
        "NodeID": "1c61fbb7-df88-1737-c03f-7221fb8331eb",
        "NormScore": 0.20002653302990245,
        "Scores": {
          "binpack": 0.20002653302990245,
          "job-anti-affinity": 0,
          "node-reschedule-penalty": 0,
          "node-affinity": 0
        }
      },
      {
        "NodeID": "ff8ed4ac-6b01-f8e2-490c-8c3016a4745f",
        "NormScore": -0.39998673348504876,
        "Scores": {
          "node-affinity": 0,
          "binpack": 0.20002653302990245,
          "job-anti-affinity": 0,
          "node-reschedule-penalty": -1
        }
      }
    ],
    "Scores": null
  },
  "ModifyIndex": 76,
  "ModifyTime": 1574690749450031000,
  "Name": "example.cache[0]",
  "Namespace": "default",
  "NextAllocation": "",
  "NodeID": "1c61fbb7-df88-1737-c03f-7221fb8331eb",
  "NodeName": "ip-172-31-29-84",
  "PreemptedAllocations": null,
  "PreemptedByAllocation": "",
  "PreviousAllocation": "7b881420-729f-6fec-7d76-e43dd86d0352",
  "RescheduleTracker": null,
  "Resources": {
    "CPU": 500,
    "Devices": null,
    "DiskMB": 300,
    "IOPS": 0,
    "MemoryMB": 256,
    "Networks": [
      {
        "CIDR": "",
        "Device": "eth0",
        "DynamicPorts": [
          {
            "Label": "db",
            "To": 0,
            "Value": 27806
          }
        ],
        "IP": "172.31.29.84",
        "MBits": 10,
        "Mode": "",
        "ReservedPorts": null
      }
    ]
  },
  "Services": null,
  "TaskGroup": "cache",
  "TaskResources": {
    "redis": {
      "CPU": 500,
      "Devices": null,
      "DiskMB": 0,
      "IOPS": 0,
      "MemoryMB": 256,
      "Networks": [
        {
          "CIDR": "",
          "Device": "eth0",
          "DynamicPorts": [
            {
              "Label": "db",
              "To": 0,
              "Value": 27806
            }
          ],
          "IP": "172.31.29.84",
          "MBits": 10,
          "Mode": "",
          "ReservedPorts": null
        }
      ]
    }
  },
  "TaskStates": {
    "redis": {
      "Events": [
        {
          "Details": {},
          "DiskLimit": 0,
          "DiskSize": 0,
          "DisplayMessage": "Task received by client",
          "DownloadError": "",
          "DriverError": "",
          "DriverMessage": "",
          "ExitCode": 0,
          "FailedSibling": "",
          "FailsTask": false,
          "GenericSource": "",
          "KillError": "",
          "KillReason": "",
          "KillTimeout": 0,
          "Message": "",
          "RestartReason": "",
          "SetupError": "",
          "Signal": 0,
          "StartDelay": 0,
          "TaskSignal": "",
          "TaskSignalReason": "",
          "Time": 1574690735475090000,
          "Type": "Received",
          "ValidationError": "",
          "VaultError": ""
        },
        {
          "Details": {
            "message": "Building Task Directory"
          },
          "DiskLimit": 0,
          "DiskSize": 0,
          "DisplayMessage": "Building Task Directory",
          "DownloadError": "",
          "DriverError": "",
          "DriverMessage": "",
          "ExitCode": 0,
          "FailedSibling": "",
          "FailsTask": false,
          "GenericSource": "",
          "KillError": "",
          "KillReason": "",
          "KillTimeout": 0,
          "Message": "Building Task Directory",
          "RestartReason": "",
          "SetupError": "",
          "Signal": 0,
          "StartDelay": 0,
          "TaskSignal": "",
          "TaskSignalReason": "",
          "Time": 1574690736017497600,
          "Type": "Task Setup",
          "ValidationError": "",
          "VaultError": ""
        },
        {
          "Details": {
            "image": "redis:3.2"
          },
          "DiskLimit": 0,
          "DiskSize": 0,
          "DisplayMessage": "Downloading image",
          "DownloadError": "",
          "DriverError": "",
          "DriverMessage": "Downloading image",
          "ExitCode": 0,
          "FailedSibling": "",
          "FailsTask": false,
          "GenericSource": "",
          "KillError": "",
          "KillReason": "",
          "KillTimeout": 0,
          "Message": "",
          "RestartReason": "",
          "SetupError": "",
          "Signal": 0,
          "StartDelay": 0,
          "TaskSignal": "",
          "TaskSignalReason": "",
          "Time": 1574690736039536600,
          "Type": "Driver",
          "ValidationError": "",
          "VaultError": ""
        },
        {
          "Details": {},
          "DiskLimit": 0,
          "DiskSize": 0,
          "DisplayMessage": "Task started by client",
          "DownloadError": "",
          "DriverError": "",
          "DriverMessage": "",
          "ExitCode": 0,
          "FailedSibling": "",
          "FailsTask": false,
          "GenericSource": "",
          "KillError": "",
          "KillReason": "",
          "KillTimeout": 0,
          "Message": "",
          "RestartReason": "",
          "SetupError": "",
          "Signal": 0,
          "StartDelay": 0,
          "TaskSignal": "",
          "TaskSignalReason": "",
          "Time": 1574690739354516700,
          "Type": "Started",
          "ValidationError": "",
          "VaultError": ""
        }
      ],
      "Failed": false,
      "FinishedAt": null,
      "LastRestart": null,
      "Restarts": 0,
      "StartedAt": "2019-11-25T14:05:39.354520859Z",
      "State": "running"
    }
  }
}

jobspec
job "example" {
  datacenters = ["dc1"]

  group "cache" {
    task "redis" {
      driver = "docker"

      env {
        version = "1"
      }

      config {
        image = "redis:3.2"

        port_map {
          db = 6379
        }
      }

      resources {
        cpu    = 500
        memory = 256

        network {
          mbits = 10
          port  "db"  {}
        }
      }
    }
  }
}

I'm digging into why this is happening, but my initial look at scheduler/generic_sched.go#L573 suggests we're unconditionally adding prevAllocation.NodeID to the list of penalized nodes. But I may be misunderstanding what prevAllocation is in this context, so I'm going to work up a quick test to walk myself through it.
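
In simplified terms, the suspect pattern looks like the sketch below. This is a paraphrase with stand-in types, not the actual Nomad source; getSelectOptions, SelectOptions, and the field names are approximations of the code under discussion.

// Simplified sketch of the placement-option logic around
// scheduler/generic_sched.go#L573. Types are minimal stand-ins for
// Nomad's internal structs, not the real definitions.

type RescheduleEvent struct {
	PrevNodeID string
}

type Allocation struct {
	NodeID           string
	ClientStatus     string // e.g. "running", "failed", "complete"
	RescheduleEvents []RescheduleEvent
}

type SelectOptions struct {
	PenaltyNodeIDs map[string]struct{}
}

func getSelectOptions(prev *Allocation) *SelectOptions {
	opts := &SelectOptions{PenaltyNodeIDs: map[string]struct{}{}}
	if prev != nil {
		// Unconditional: the previous node is penalized even when the
		// alloc there is healthy and is only being replaced because the
		// job was updated. This is what pushes updated allocs onto new
		// nodes.
		opts.PenaltyNodeIDs[prev.NodeID] = struct{}{}

		// Nodes from earlier reschedule attempts are penalized as well.
		for _, ev := range prev.RescheduleEvents {
			opts.PenaltyNodeIDs[ev.PrevNodeID] = struct{}{}
		}
	}
	return opts
}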

@tgross tgross added this to the unscheduled milestone Nov 25, 2019
@tgross tgross moved this from Needs Triage to In Progress in Nomad - Community Issues Triage Nov 25, 2019
@tgross
Member

tgross commented Nov 25, 2019

I just ran a test with that line commented out to at least verify that this is where the node penalty is coming from, and that results in the new allocation not being moved, just as we'd expect.

Looks like that path was originally introduced in 5ecb789. The unit tests for the scheduler still all pass with that change, so that suggests there are some unexercised code paths there.

▶ nomad job run example.nomad
==> Monitoring evaluation "8fb4bba0"
    Evaluation triggered by job "example"
    Evaluation within deployment: "1e731e13"
    Allocation "c3aa1e43" created: node "e30d500b", group "cache"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "8fb4bba0" finished with status "complete"

▶ nomad job status example
...
Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created  Modified
c3aa1e43  e30d500b  cache       0        run      running  5s ago   2s ago

▶ nomad job run example.nomad
==> Monitoring evaluation "0d5b56d5"
    Evaluation triggered by job "example"
    Evaluation within deployment: "7e744299"
    Allocation "4d37ba01" created: node "e30d500b", group "cache"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "0d5b56d5" finished with status "complete"

▶ nomad job status example
...
Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created  Modified
4d37ba01  e30d500b  cache       1        run      running   5s ago   4s ago
c3aa1e43  e30d500b  cache       0        stop     complete  24s ago  5s ago

▶ nomad alloc status -verbose 4d37ba01
ID                  = 4d37ba01-9cec-f93f-a681-0d22bfce895c
Eval ID             = 0d5b56d5-a1d1-21bd-9ed7-bf032608fdb7
Name                = example.cache[0]
Node ID             = e30d500b-0480-63e5-2077-0d07abd8226c
Node Name           = ip-172-31-25-80
Job ID              = example
Job Version         = 1
Client Status       = running
Client Description  = Tasks are running
Desired Status      = run
Desired Description = <none>
Created             = 2019-11-25T10:09:16-05:00
Modified            = 2019-11-25T10:09:27-05:00
Deployment ID       = 7e744299-c05a-e959-3bc4-94bcaeac2165
Deployment Health   = healthy
Evaluated Nodes     = 2
Filtered Nodes      = 0
Exhausted Nodes     = 0
Allocation Time     = 93.481µs
Failures            = 0

Task "redis" is "running"
Task Resources
CPU        Memory           Disk     Addresses
2/500 MHz  6.3 MiB/256 MiB  300 MiB  db: 172.31.25.80:29354

Task Events:
Started At     = 2019-11-25T15:09:17Z
Finished At    = N/A
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                       Type        Description
2019-11-25T10:09:17-05:00  Started     Task started by client
2019-11-25T10:09:17-05:00  Task Setup  Building Task Directory
2019-11-25T10:09:16-05:00  Received    Task received by client

Placement Metrics
Node                                  binpack  job-anti-affinity  node-affinity  node-reschedule-penalty  final score
0ad76a7b-ab6e-569a-dfa9-83417379c8db  0.2      0                  0              0                        0.2
e30d500b-0480-63e5-2077-0d07abd8226c  0.2      0                  0              0                        0.2

@tgross
Member

tgross commented Nov 26, 2019

I removed that line and did some tests on the behavior around failing allocations, and it looks like leaving it out removes the node-reschedule-penalty for failed allocations, which we don't want either. So that block might need to be reworked to pick up the penalty only on failure, not on updates.
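
In the same simplified terms as the earlier sketch (again a paraphrase reusing those stand-in types, not the shipped patch), the rework would gate the previous-node penalty on the alloc's client status:

// Reworked sketch: penalize the previous node only when the prior alloc
// actually failed there, so a plain job update keeps its placement.
func getSelectOptionsReworked(prev *Allocation) *SelectOptions {
	opts := &SelectOptions{PenaltyNodeIDs: map[string]struct{}{}}
	if prev != nil {
		// Only a failed alloc makes its node less attractive; a healthy
		// alloc replaced by a create/destroy update adds no penalty.
		if prev.ClientStatus == "failed" {
			opts.PenaltyNodeIDs[prev.NodeID] = struct{}{}
		}

		// Still penalize nodes from earlier reschedule attempts, since a
		// group of hosts may be failing together (e.g. an AZ outage).
		for _, ev := range prev.RescheduleEvents {
			opts.PenaltyNodeIDs[ev.PrevNodeID] = struct{}{}
		}
	}
	return opts
}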

tgross added a commit that referenced this issue Nov 26, 2019
Fixes #5856

When the scheduler looks for a placement for an allocation that's
replacing another allocation, it's supposed to penalize the previous
node if the allocation had been rescheduled or failed. But we're
currently always penalizing the node, which leads to unnecessary
migrations on job update.

This commit leaves in place the existing behavior where if the
previous alloc was itself rescheduled, its previous nodes are also
penalized. This is conservative but the right behavior especially on
larger clusters where a group of hosts might be having correlated
trouble (like an AZ failure).
@tgross
Member

tgross commented Nov 26, 2019

I've opened #6781 for review which I think should fix this.

@tgross tgross modified the milestones: unscheduled, 0.10.3 Nov 26, 2019
@tgross tgross moved this from In Progress to In Review in Nomad - Community Issues Triage Nov 26, 2019
tgross added a commit that referenced this issue Dec 3, 2019
Fixes #5856

When the scheduler looks for a placement for an allocation that's
replacing another allocation, it's supposed to penalize the previous
node if the allocation had been rescheduled or failed. But we're
currently always penalizing the node, which leads to unnecessary
migrations on job update.

This commit leaves in place the existing behavior where if the
previous alloc was itself rescheduled, its previous nodes are also
penalized. This is conservative but the right behavior especially on
larger clusters where a group of hosts might be having correlated
trouble (like an AZ failure).

Co-Authored-By: Michael Schurter <mschurter@hashicorp.com>
Nomad - Community Issues Triage automation moved this from In Review to Done Dec 3, 2019
@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 16, 2022