Completed batch jobs with leader task and second task switch to pending state after node drain #3210

Closed
hamann opened this issue Sep 14, 2017 · 1 comment · Fixed by #3217

hamann commented Sep 14, 2017

Nomad version

Nomad v0.6.3
Also reproducible with v0.6.0

Operating system and Environment details

CoreOS 1492.1.0 and/or an Alpine VM in Docker for Mac

Issue

If we drain the agent on which the batch job was allocated, and the job is already in the dead state, the job switches back to pending.
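
In short, the status flips from dead back to pending as soon as the drain is enabled (a minimal sketch, assuming the job below is registered as leadertest):

$ nomad status leadertest | grep '^Status'       # "dead" once the batch run has completed
$ nomad node-drain -enable -self -yes
$ nomad status leadertest | grep '^Status'       # now "pending" again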

Reproduction steps

This is reproducible in a multi-master/-agent cluster, but also with a single Nomad agent running in dev mode.

# nomad command
$ ps ax|grep nomad | head -1
    1 root       0:04 /usr/local/bin/nomad agent -dev -config /nomad.hcl

# nomad config
$ cat /nomad.hcl
data_dir = "/var/lib/docker/volumes/nextjournal_nomad/_data"
log_level = "INFO"
bind_addr = "172.17.0.1"

server {
  enabled = true
}

advertise {
  http = "172.17.0.1"
  rpc = "172.17.0.1"
  serf = "172.17.0.1"
}

client {
  enabled = true
  options {
    "docker.cleanup.image" = false
    "docker.privileged.enabled" = true
    "docker.volumes.enabled" = true
  }
  meta {
    "nextjournal_dir" = "/home/hamann/workspace/nextjournal/nextjournal.com"
    "environment" = "development"
  }
}

consul {
  address = "172.17.0.1:8500"
}

$ nomad version
Nomad v0.6.3

# jobfile
$ cat leadertest.json
{
    "Job": {
        "Region": "global",
        "ID": "leadertest",
        "Name": "leadertest",
        "Type": "batch",
        "Priority": 50,
        "Datacenters": [
            "dc1"
        ],
        "TaskGroups": [
            {
                "Name": "leader-group",
                "RestartPolicy": {
                  "Attempts": 0,
                  "Mode": "fail"
                },
                "Tasks": [
                      {
                        "Name": "shutdown",
                        "Driver": "docker",
                        "Config": {
                            "image": "alpine:latest",
                            "command": "/bin/sleep",
                            "args": ["60"]
                        },
                        "Resources": {
                            "CPU": 100,
                            "MemoryMB": 10,
                            "DiskMB": 10,
                            "IOPS": 0
                        },
                        "LogConfig": {
                            "MaxFileSizeMB": 10,
                            "MaxFiles": 10
                        }
                    },
                    {
                        "Name": "leader",
                        "Driver": "docker",
                        "Leader": true,
                        "Config": {
                            "image": "alpine:latest",
                            "command": "/bin/sleep",
                            "args": ["20"]
                        },
                        "Resources": {
                            "CPU": 20,
                            "MemoryMB": 256,
                            "DiskMB": 10,
                            "IOPS": 0
                        },
                        "LogConfig": {
                           "MaxFiles": 10,
                           "MaxFileSizeMB": 10
                        }
                    }
                ]
            }
        ]
    }
}
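
Note on the jobspec: the task marked "Leader": true causes Nomad to shut down the remaining tasks in the group once it exits, so the allocation completes after the 20-second leader sleep. The per-task events can be inspected with alloc-status (sketch; the alloc ID d4a85539 comes from the status output further below and will differ on your run):

$ nomad alloc-status d4a85539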

$ nomad status
No running jobs

# post json job file
$ nomad node-drain -disable -self -yes && curl -XPUT $NOMAD_ADDR/v1/system/gc && curl -XPOST -d@leadertest.json $NOMAD_ADDR/v1/jobs
{"EvalID":"d8591d98-85e7-ef5a-805e-444ddc9911d5","EvalCreateIndex":48,"JobModifyIndex":47,"Warnings":"","Index":48,"LastContact":0,"KnownLeader":false}

# wait ~25 seconds

$ nomad status
ID          Type   Priority  Status  Submit Date
leadertest  batch  50        dead    09/14/17 10:17:01 UTC

$ nomad status leadertest
ID            = leadertest
Name          = leadertest
Submit Date   = 09/14/17 10:17:01 UTC
Type          = batch
Priority      = 50
Datacenters   = dc1
Status        = dead
Periodic      = false
Parameterized = false

Summary
Task Group    Queued  Starting  Running  Failed  Complete  Lost
leader-group  0       0         0        0       1         0

Allocations
ID        Node ID   Task Group    Version  Desired  Status    Created At
d4a85539  8b1bca7f  leader-group  0        run      complete  09/14/17 10:17:01 UTC

$ nomad node-drain -enable -self -yes

$ nomad status
ID          Type   Priority  Status   Submit Date
leadertest  batch  50        pending  09/14/17 10:17:01 UTC

$ nomad status leadertest
ID            = leadertest
Name          = leadertest
Submit Date   = 09/14/17 10:17:01 UTC
Type          = batch
Priority      = 50
Datacenters   = dc1
Status        = pending
Periodic      = false
Parameterized = false

Summary
Task Group    Queued  Starting  Running  Failed  Complete  Lost
leader-group  1       0         0        0       1         0

Placement Failure
Task Group "leader-group":
  * No nodes were eligible for evaluation
  * No nodes are available in datacenter "dc1"

Allocations
ID        Node ID   Task Group    Version  Desired  Status    Created At
d4a85539  8b1bca7f  leader-group  0        stop     complete  09/14/17 10:17:01 UTC
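
At this point the node itself is still draining, which is why the rescheduled group cannot be placed. A quick check (sketch; output trimmed, the Drain field should read true):

$ nomad node-status -self | grep -i drain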
dadgar added a commit that referenced this issue Sep 14, 2017

This PR fixes:
* An issue in which a node-drain that contains a complete batch alloc would cause a replacement
* An issue in which allocations with the same name during a scale down/stop event wouldn't be properly stopped.
* An issue in which batch allocations from previous job versions may not have been stopped properly.

Fixes #3210
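
On a build that contains the fix, repeating the drain against an already completed leadertest job should leave its status at dead rather than pending; a sketch of that re-check (not run against the patched build here):

$ nomad node-drain -enable -self -yes
$ nomad status leadertest | grep '^Status'       # expected: dead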
github-actions bot commented Dec 8, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 8, 2022