dead task in job after node fail #1558

Closed
tantra35 opened this issue Aug 10, 2016 · 6 comments

tantra35 (Contributor) commented Aug 10, 2016

Nomad version

Nomad v0.4.1-dev
commit: 044e067

Issue

After an issue in our test infrastructure (failure of one server), some jobs have a dead task which doesn't get restarted (nomad run doesn't help to make Nomad update the state of the dead task):

# nomad status townshipDynamoNode
ID          = townshipDynamoNode
Name        = townshipDynamoNode
Type        = service
Priority    = 50
Datacenters = test
Status      = running
Periodic    = false

Summary
Task Group          Queued  Starting  Running  Failed  Complete  Lost
townshipDynamoNode  0       0         1        0       1         0

Allocations
ID        Eval ID   Node ID   Task Group          Desired  Status
dc917914  23ef25df  6660088f  townshipDynamoNode  run      running

Then we look at alloc-status:

nomad alloc-status dc917914
ID            = dc917914
Eval ID       = 23ef25df
Name          = townshipDynamoNode.townshipDynamoNode[0]
Node ID       = 6660088f
Job ID        = townshipDynamoNode
Client Status = running

Task "fluend" is "dead"
Task Resources
CPU      Memory   Disk     IOPS  Addresses
100 MHz  300 MiB  300 MiB  0

Recent Events:
Time                   Type        Description
08/09/16 16:18:29 MSK  Killed      Task successfully killed
08/09/16 16:18:28 MSK  Restarting  Task restarting in 15.987723567s
08/09/16 16:18:28 MSK  Terminated  Exit Code: 0
08/09/16 16:18:27 MSK  Started     Task started by client
08/09/16 16:18:27 MSK  Received    Task received by client

Task "townshipDynamoNode" is "running"
Task Resources
CPU         Memory           Disk     IOPS  Addresses
18/100 MHz  130 MiB/800 MiB  300 MiB  0     appport: 172.16.7.4:24295

Recent Events:
Time                   Type                      Description
08/09/16 16:28:51 MSK  Started                   Task started by client
08/09/16 16:26:32 MSK  Downloading Artifacts     Client is downloading artifacts
08/09/16 16:26:15 MSK  Restarting                Task restarting in 17.452112152s
08/09/16 16:26:15 MSK  Failed Artifact Download  GET error: Get http://social.playrix.local/playrix-webworker-t19.tar.gz: dial tcp 172.16.4.220:80: getsockopt: connection refused
08/09/16 16:26:15 MSK  Downloading Artifacts     Client is downloading artifacts
08/09/16 16:25:58 MSK  Restarting                Task restarting in 16.941858047s
08/09/16 16:25:58 MSK  Failed Artifact Download  GET error: Get http://social.playrix.local/playrix-webworker-t19.tar.gz: dial tcp 172.16.4.220:80: getsockopt: connection refused
08/09/16 16:25:58 MSK  Downloading Artifacts     Client is downloading artifacts
08/09/16 16:25:41 MSK  Restarting                Task restarting in 16.491063421s
08/09/16 16:25:41 MSK  Failed Artifact Download  GET error: Get http://social.playrix.local/playrix-webworker-t19.tar.gz: dial tcp 172.16.4.220:80: getsockopt: no route to host

It seems that Nomad moved the job from the dead node (we shut that node down from the Linux console with shutdown -P now) but made a wrong decision: it carried over the "fluend" task state from the dead node and never updated it on the live node.
Our job file looks like this:

job "townshipDynamoNode" {
    region = "global"
    datacenters = ["test"]
    type = "service"

    priority = 50

    constraint {
        attribute = "${attr.kernel.name}"
        value = "linux"
        distinct_hosts = true
    }

    update {
        stagger = "10s"
        max_parallel = 1
    }

    group "townshipDynamoNode" {
        count = 1

        task "townshipDynamoNode" {
            driver = "docker"

            artifact {
                source = "http://social.playrix.local/playrix-webworker-t19.tar.gz"

                options {
                        archive=false
                }
            }

            config {
                image = "playrix/webworker:t19"
                load = ["playrix-webworker-t19.tar.gz"]
                network_mode = "plrx-aws"
                hostname = "townshipDynamoNode"
                command = "/sbin/init_plrx"
                args = ["-c", "/opt/startup.sh"]

                port_map {
                    appport = 8124
                }
            }

            env {
                APPNAME = "townshipDynamoNode"
            }

            service {
                name = "townshipDynamoNode"
                port = "appport"
                check {
                    name = "alive"
                    type = "tcp"
                    interval = "10s"
                    timeout = "2s"
                }
            }

            service {
                name = "townshipDynamoNode-codedeploy"
            }

            logs {
                max_files = 3
                max_file_size = 10
            }

            resources {
                memory = 800

                network {
                    mbits = 10
                    port "appport" {}
                }
            }
        }

        task "fluend"
        {
            driver = "raw_exec"

            config {
                command = "/usr/sbin/td-agent"
                args = ["--no-supervisor", "-c", "/etc/td-agent/td-agent.conf"]
            }

            env {
                APPNAME = "townshipDynamoNode"
            }

            logs {
                max_files = 3
                max_file_size = 10
            }

            resources {
                memory = 300
                cpu = 100
            }
        }
    }
}

diptanu (Contributor) commented Aug 11, 2016

@tantra35 Can you please paste the Nomad server and client logs? Also, can you share the steps to reproduce this?

tantra35 (Contributor, Author) commented:

Here are the logs from the client that accepted the job from the failed node:

Aug  9 16:21:22 server3 nomad[30985]: raft: Rejecting vote request from 192.168.30.6:4647 since we have a leader: 192.168.30.4:4647
Aug  9 16:21:24 server3 nomad[30985]: raft: Rejecting vote request from 192.168.30.6:4647 since we have a leader: 192.168.30.4:4647
Aug  9 16:21:25 server3 nomad[30985]: worker: failed to dequeue evaluation: rpc error: eval broker disabled
Aug  9 16:21:25 server3 nomad[30985]: worker: failed to dequeue evaluation: rpc error: eval broker disabled
Aug  9 16:21:25 server3 nomad[30985]: raft: Rejecting vote request from 192.168.30.6:4647 since we have a leader: 192.168.30.4:4647
Aug  9 16:21:26 server3 nomad[30985]: raft: Rejecting vote request from 192.168.30.1:4647 since we have a leader: 192.168.30.4:4647
Aug  9 16:21:26 server3 nomad[30985]: raft: Heartbeat timeout from "192.168.30.4:4647" reached, starting election
Aug  9 16:21:26 server3 nomad[30985]: raft: Failed to make RequestVote RPC to 192.168.30.6:4647: read tcp 192.168.30.3:56588->192.168.30.6:4647: read: connection reset by peer
Aug  9 16:21:26 server3 nomad[30985]: raft: Failed to make RequestVote RPC to 192.168.31.220:4647: EOF
Aug  9 16:21:27 server3 nomad[30985]: worker: failed to dequeue evaluation: rpc error: eval broker disabled
Aug  9 16:21:34 server3 consul[2459]: memberlist: Push/Pull with gardenscapesDynamo failed: dial tcp 172.16.32.28:8301: getsockopt: no route to host
Aug  9 16:23:14 server3 consul[2459]: memberlist: Push/Pull with zooMMMDynamo failed: dial tcp 172.16.32.31:8301: i/o timeout
Aug  9 16:24:14 server3 consul[2459]: memberlist: Push/Pull with townshipMacDynamoNode failed: dial tcp 172.16.32.24:8301: getsockopt: no route to host
Aug  9 16:21:27 server3 nomad[30985]: worker: failed to dequeue evaluation: rpc error: eval broker disabled
Aug  9 16:27:17 server3 nomad[30985]: raft: Heartbeat timeout from "192.168.30.4:4647" reached, starting election
Aug  9 16:27:19 server3 nomad[30985]: raft: Election timeout reached, restarting election
Aug  9 16:27:19 server3 nomad[30985]: worker: failed to dequeue evaluation: eval broker disabled
Aug  9 16:27:19 server3 nomad[30985]: raft: Failed to contact 192.168.30.4:4647 in 547.141822ms
Aug  9 16:27:20 server3 nomad[30985]: raft: Failed to contact 192.168.30.4:4647 in 1.007151026s
Aug  9 16:27:20 server3 nomad[30985]: raft: Failed to contact 192.168.30.4:4647 in 1.417681298s
Aug  9 16:27:29 server3 nomad[30985]: raft: Failed to make RequestVote RPC to 192.168.30.4:4647: read tcp 192.168.30.3:60590->192.168.30.4:4647: i/o timeout
Aug  9 16:27:29 server3 nomad[30985]: raft: Failed to AppendEntries to 192.168.30.4:4647: read tcp 192.168.30.3:56150->192.168.30.4:4647: i/o timeout
Aug  9 16:27:36 server3 nomad[30985]: raft: Failed to contact 192.168.30.4:4647 in 500.130773ms
Aug  9 16:27:38 server3 nomad[30985]: raft: Failed to contact 192.168.30.4:4647 in 500.168371ms

I don't know how to reproduce this. It has happened only once, but I agree it is not normal that a task ends up in the dead state without being restored, and that only a stop/run cycle returns everything to normal operation.
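
For reference, the only thing that returned everything to normal operation was a full stop/run cycle, roughly like this (the job file name below is just an example, not necessarily yours):

nomad stop townshipDynamoNode
nomad run townshipDynamoNode.hcl   # file name assumed, use your own job file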

diptanu (Contributor) commented Aug 11, 2016

@tantra35 We will look into this. The logs you have shared are from a Nomad server. Please paste the logs of the client where the task was restarted so that we can follow the chain of events.

tantra35 (Contributor, Author) commented:

In our test environment we run the server and client mixed on the same node, so the logs I posted are all that exist on the node where the job was placed.
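
For clarity, a combined agent of that sort has both the server and client stanzas enabled; a minimal sketch (values are illustrative, not our exact configuration):

# agent.hcl -- illustrative sketch only, not our real configuration
data_dir = "/var/lib/nomad"

server {
    enabled          = true
    bootstrap_expect = 3
}

client {
    enabled = true
}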

dadgar (Contributor) commented Aug 12, 2016

This is reproducible by having two tasks: one that fails its artifact download and one that starts successfully.
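
A minimal job along these lines should trigger it; this is only a sketch (the unreachable artifact URL and all names are made up), pairing a task whose artifact download keeps failing with a task that starts fine:

job "repro" {
    datacenters = ["dc1"]
    type = "service"

    group "repro" {
        task "good" {
            driver = "raw_exec"

            config {
                command = "/bin/sleep"
                args = ["3600"]
            }

            resources {
                cpu = 100
                memory = 64
            }
        }

        task "bad-artifact" {
            driver = "raw_exec"

            # This download is expected to fail and be retried by the client
            artifact {
                source = "http://127.0.0.1:1/does-not-exist.tar.gz"
            }

            config {
                command = "/bin/sleep"
                args = ["3600"]
            }

            resources {
                cpu = 100
                memory = 64
            }
        }
    }
}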

dadgar added this to the v0.5.0 milestone Aug 24, 2016
schmichael added a commit that referenced this issue Aug 25, 2016
The artifact fetching may be retried and succeed, so don't set the task as dead.

Fixes #1558
schmichael added a commit that referenced this issue Aug 26, 2016
github-actions (bot) commented:

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Dec 20, 2022