Completed batch jobs with leader task and second task switch to pending state after node drain #3210

Closed
hamann opened this issue Sep 14, 2017 · 1 comment · Fixed by #3217

hamann commented Sep 14, 2017

Nomad version

Nomad v0.6.3
Also reproducible with v0.6.0

Operating system and Environment details

CoreOS 1492.1.0 and/or an Alpine VM in Docker for Mac

Issue

If we drain the agent on which the batch job was allocated, and the job is already in the dead state, the job switches back to pending.
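
In short, the status flips from dead back to pending as soon as the drain is enabled (a minimal sketch, assuming the job below is registered as leadertest):

$ nomad status leadertest | grep '^Status'       # "dead" once the batch run has completed
$ nomad node-drain -enable -self -yes
$ nomad status leadertest | grep '^Status'       # now "pending" again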

Reproduction steps

This is reproducible in a multi-master/-agent cluster, but also with a single Nomad agent running in dev mode.

# nomad command
$ ps ax|grep nomad | head -1
    1 root       0:04 /usr/local/bin/nomad agent -dev -config /nomad.hcl

# nomad config
$ cat /nomad.hcl
data_dir = "/var/lib/docker/volumes/nextjournal_nomad/_data"
log_level = "INFO"
bind_addr = "172.17.0.1"

server {
  enabled = true
}

advertise {
  http = "172.17.0.1"
  rpc = "172.17.0.1"
  serf = "172.17.0.1"
}

client {
  enabled = true
  options {
    "docker.cleanup.image" = false
    "docker.privileged.enabled" = true
    "docker.volumes.enabled" = true
  }
  meta {
    "nextjournal_dir" = "/home/hamann/workspace/nextjournal/nextjournal.com"
    "environment" = "development"
  }
}

consul {
  address = "172.17.0.1:8500"
}

$ nomad version
Nomad v0.6.3

# jobfile
$ cat leadertest.json
{
    "Job": {
        "Region": "global",
        "ID": "leadertest",
        "Name": "leadertest",
        "Type": "batch",
        "Priority": 50,
        "Datacenters": [
            "dc1"
        ],
        "TaskGroups": [
            {
                "Name": "leader-group",
                "RestartPolicy": {
                  "Attempts": 0,
                  "Mode": "fail"
                },
                "Tasks": [
                      {
                        "Name": "shutdown",
                        "Driver": "docker",
                        "Config": {
                            "image": "alpine:latest",
                            "command": "/bin/sleep",
                            "args": ["60"]
                        },
                        "Resources": {
                            "CPU": 100,
                            "MemoryMB": 10,
                            "DiskMB": 10,
                            "IOPS": 0
                        },
                        "LogConfig": {
                            "MaxFileSizeMB": 10,
                            "MaxFiles": 10
                        }
                    },
                    {
                        "Name": "leader",
                        "Driver": "docker",
                        "Leader": true,
                        "Config": {
                            "image": "alpine:latest",
                            "command": "/bin/sleep",
                            "args": ["20"]
                        },
                        "Resources": {
                            "CPU": 20,
                            "MemoryMB": 256,
                            "DiskMB": 10,
                            "IOPS": 0
                        },
                        "LogConfig": {
                           "MaxFiles": 10,
                           "MaxFileSizeMB": 10
                        }
                    }
                ]
            }
        ]
    }
}
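
Note on the jobspec: the task marked "Leader": true causes Nomad to shut down the remaining tasks in the group once it exits, so the allocation completes after the 20-second leader sleep. The per-task events can be inspected with alloc-status (sketch; the alloc ID d4a85539 comes from the status output further below and will differ on your run):

$ nomad alloc-status d4a85539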

$ nomad status
No running jobs

# post json job file
$ nomad node-drain -disable -self -yes && curl -XPUT $NOMAD_ADDR/v1/system/gc && curl -XPOST -d@leadertest.json $NOMAD_ADDR/v1/jobs
{"EvalID":"d8591d98-85e7-ef5a-805e-444ddc9911d5","EvalCreateIndex":48,"JobModifyIndex":47,"Warnings":"","Index":48,"LastContact":0,"KnownLeader":false}

# wait ~25 seconds

$ nomad status
ID          Type   Priority  Status  Submit Date
leadertest  batch  50        dead    09/14/17 10:17:01 UTC

$ nomad status leadertest
ID            = leadertest
Name          = leadertest
Submit Date   = 09/14/17 10:17:01 UTC
Type          = batch
Priority      = 50
Datacenters   = dc1
Status        = dead
Periodic      = false
Parameterized = false

Summary
Task Group    Queued  Starting  Running  Failed  Complete  Lost
leader-group  0       0         0        0       1         0

Allocations
ID        Node ID   Task Group    Version  Desired  Status    Created At
d4a85539  8b1bca7f  leader-group  0        run      complete  09/14/17 10:17:01 UTC

$ nomad node-drain -enable -self -yes

$ nomad status
ID          Type   Priority  Status   Submit Date
leadertest  batch  50        pending  09/14/17 10:17:01 UTC

$ nomad status leadertest
ID            = leadertest
Name          = leadertest
Submit Date   = 09/14/17 10:17:01 UTC
Type          = batch
Priority      = 50
Datacenters   = dc1
Status        = pending
Periodic      = false
Parameterized = false

Summary
Task Group    Queued  Starting  Running  Failed  Complete  Lost
leader-group  1       0         0        0       1         0

Placement Failure
Task Group "leader-group":
  * No nodes were eligible for evaluation
  * No nodes are available in datacenter "dc1"

Allocations
ID        Node ID   Task Group    Version  Desired  Status    Created At
d4a85539  8b1bca7f  leader-group  0        stop     complete  09/14/17 10:17:01 UTC
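
At this point the node itself is still draining, which is why the rescheduled group cannot be placed. A quick check (sketch; output trimmed, the Drain field should read true):

$ nomad node-status -self | grep -i drain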
dadgar added a commit that referenced this issue Sep 14, 2017

This PR fixes:
* An issue in which a node-drain that contains a complete batch alloc would cause a replacement
* An issue in which allocations with the same name during a scale down/stop event wouldn't be properly stopped.
* An issue in which batch allocations from previous job versions may not have been stopped properly.

Fixes #3210
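
On a build that contains the fix, repeating the drain against an already completed leadertest job should leave its status at dead rather than pending; a sketch of that re-check (not run against the patched build here):

$ nomad node-drain -enable -self -yes
$ nomad status leadertest | grep '^Status'       # expected: dead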
github-actions bot commented Dec 8, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 8, 2022