Nomad 0.5.1 rc2 panic - alloc runner #2089

Closed
justinwalz opened this issue Dec 13, 2016 · 9 comments

@justinwalz
Contributor

Hi,

There’s a problem with nomad-0.5.1-rc2, which I was testing for the bug fix (ref #2024). (I am a colleague of @sheldonkwok.)

All but one of our Nomad clients have exhibited this behavior - I’ve pasted the relevant logs for three unique cases. Happy to provide more info, just let me know what you need. This also happened for a wide variety of services (not specific to one job); I just replaced the service names with <app-name>.

-Justin

/cc @dadgar @diptanu

Nomad version

$ nomad -v
Nomad v0.5.1-rc2 ('6f2ccf22be738a31cb2153c7e43422c4ba9a0e3f+CHANGES')

Operating system and Environment details

$ uname -a
Linux ip-10-XXX-X-XX 4.4.0-53-generic #74-Ubuntu SMP Fri Dec 2 15:59:10 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

Reproduction steps

After around 6 hours of successfully running the release candidate, this occurred simultaneously on most machines.

Nomad Server logs (if appropriate)

At crash time, nothing unexpected (mostly this: nomad.heartbeat: node 'b1c470b9-3129-537e-ffba-10f5ebd0c1d2' TTL expired)

Nomad Client logs (if appropriate)

See below

Machine A

    2016/12/13 02:22:42.442520 [DEBUG] client: starting task runners for alloc '67911360-2fed-ddef-a738-a5cac87718b9'
    2016/12/13 02:22:42.442765 [DEBUG] client: starting task context for '<app-name>' (alloc '67911360-2fed-ddef-a738-a5cac87718b9')
panic: Alloc Runner already exists for alloc "67911360-2fed-ddef-a738-a5cac87718b9"

goroutine 141607 [running]:
panic(0xf50520, 0xc42044b610)
        /opt/go/src/runtime/panic.go:500 +0x1a1
github.com/hashicorp/nomad/client.(*Client).addAlloc(0xc420224500, 0xc425288e00, 0x0, 0xc45262c9c0, 0x20002)
        /opt/gopath/src/github.com/hashicorp/nomad/client/client.go:1757 +0x2e9
github.com/hashicorp/nomad/client.(*Client).blockForRemoteAlloc(0xc420224500, 0xc425288e00)
        /opt/gopath/src/github.com/hashicorp/nomad/client/client.go:1469 +0xf6
created by github.com/hashicorp/nomad/client.(*Client).runAllocs
        /opt/gopath/src/github.com/hashicorp/nomad/client/client.go:1401 +0xa78
    Loaded configuration from /etc/nomad/config.hcl
==> Starting Nomad agent...
==> Error starting agent: client setup failed: failed to restore state: 4 error(s) occurred:

* 1 error(s) occurred:

* task runner snapshot include nil Task
* 1 error(s) occurred:

* task runner snapshot include nil Task
* 1 error(s) occurred:

* task runner snapshot include nil Task
* 1 error(s) occurred:

* task runner snapshot include nil Task
    Loaded configuration from /etc/nomad/config.hcl
==> Starting Nomad agent...
....
    2016/12/13 02:23:40 [DEBUG] plugin: starting plugin: /usr/local/bin/nomad []string{"/usr/local/bin/nomad", "executor", "/var/lib/nomad/alloc/ba278674-a1b8-7e5c-8445-93e307c2eba9/<app-name>/<app-name>-executor.out"}
    2016/12/13 02:23:40 [DEBUG] plugin: waiting for RPC address for: /usr/local/bin/nomad
panic: Alloc Runner already exists for alloc "4c15ea9f-c405-bd16-ccfd-c51acb93a675"

goroutine 255055 [running]:
panic(0xf50520, 0xc45e65da20)
        /opt/go/src/runtime/panic.go:500 +0x1a1
github.com/hashicorp/nomad/client.(*Client).addAlloc(0xc420224500, 0xc425bcfe00, 0x0, 0xc46f22c960, 0xf46560)
        /opt/gopath/src/github.com/hashicorp/nomad/client/client.go:1757 +0x2e9
github.com/hashicorp/nomad/client.(*Client).blockForRemoteAlloc(0xc420224500, 0xc425bcfe00)
        /opt/gopath/src/github.com/hashicorp/nomad/client/client.go:1469 +0xf6
created by github.com/hashicorp/nomad/client.(*Client).runAllocs
        /opt/gopath/src/github.com/hashicorp/nomad/client/client.go:1401 +0xa78
    Loaded configuration from /etc/nomad/config.hcl
==> Starting Nomad agent...
==> Error starting agent: client setup failed: failed to restore state: 5 error(s) occurred:

* 1 error(s) occurred:

* task runner snapshot include nil Task
* 1 error(s) occurred:
....
    2016/12/13 02:26:46.497458 [DEBUG] client: starting task context for '<app-name>' (alloc '5b63e39d-606b-35c3-f5ca-1cd343409941')
    2016/12/13 02:26:47 [DEBUG] plugin: waiting for RPC address for: /usr/local/bin/nomad
    2016/12/13 02:26:47 [DEBUG] plugin: waiting for RPC address for: /usr/local/bin/nomad
    2016/12/13 02:26:50 [DEBUG] plugin: nomad: 2016/12/13 02:26:46 [DEBUG] plugin: plugin address: unix /tmp/plugin561372059
    2016/12/13 02:26:38.951576 [DEBUG] driver.docker: Setting default logging options to syslog and unix:///tmp/plugin993836761
panic: Alloc Runner already exists for alloc "745ccdcb-a58e-4ec4-2ec3-b2a2947c4995"

goroutine 40 [running]:
panic(0xf50520, 0xc43d6d6050)
        /opt/go/src/runtime/panic.go:500 +0x1a1
github.com/hashicorp/nomad/client.(*Client).addAlloc(0xc4202763c0, 0xc44eaecf00, 0x0, 0xc45b599320, 0xc427edbe01)
        /opt/gopath/src/github.com/hashicorp/nomad/client/client.go:1757 +0x2e9
github.com/hashicorp/nomad/client.(*Client).runAllocs(0xc4202763c0, 0xc447a1d8c0)
        /opt/gopath/src/github.com/hashicorp/nomad/client/client.go:1415 +0x73c
github.com/hashicorp/nomad/client.(*Client).run(0xc4202763c0)
        /opt/gopath/src/github.com/hashicorp/nomad/client/client.go:927 +0x119
created by github.com/hashicorp/nomad/client.NewClient
        /opt/gopath/src/github.com/hashicorp/nomad/client/client.go:266 +0xc4a
    Loaded configuration from /etc/nomad/config.hcl
==> Starting Nomad agent...
==> Error starting agent: client setup failed: failed to restore state: 1 error(s) occurred:

* 1 error(s) occurred:

* task runner snapshot include nil Task
    Loaded configuration from /etc/nomad/config.hcl
==> Starting Nomad agent...
==> Error starting agent: client setup failed: failed to restore state: 1 error(s) occurred:

* 1 error(s) occurred:

* task runner snapshot include nil Task
    Loaded configuration from /etc/nomad/config.hcl
==> Starting Nomad agent...
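
For context, the panic message reads like a sanity check firing when the client tries to register a runner for an allocation ID it is already tracking. A minimal Go sketch of that kind of guard, with illustrative names only (this is not the actual Nomad client code):

    package main

    import (
        "fmt"
        "sync"
    )

    // allocRunner is a stand-in for the per-allocation runner the client starts.
    type allocRunner struct {
        allocID string
    }

    // client tracks one runner per allocation ID, guarded by a mutex.
    type client struct {
        allocLock sync.Mutex
        allocs    map[string]*allocRunner
    }

    // addAlloc registers a runner for an allocation. If two code paths (for
    // example a regular update and a blocked remote-alloc fetch) both try to
    // add the same allocation, the second call trips the guard, which is the
    // shape of the "Alloc Runner already exists" panic in the logs above.
    func (c *client) addAlloc(id string) {
        c.allocLock.Lock()
        defer c.allocLock.Unlock()
        if _, ok := c.allocs[id]; ok {
            panic(fmt.Sprintf("Alloc Runner already exists for alloc %q", id))
        }
        c.allocs[id] = &allocRunner{allocID: id}
    }

    func main() {
        c := &client{allocs: make(map[string]*allocRunner)}
        c.addAlloc("67911360-2fed-ddef-a738-a5cac87718b9")
        c.addAlloc("67911360-2fed-ddef-a738-a5cac87718b9") // duplicate add panics
    }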

When trying to remove the alloc_dir of one with issues…

$ rm -rf 67911360-2fed-ddef-a738-a5cac87718b9/*
rm: cannot remove '67911360-2fed-ddef-a738-a5cac87718b9/<app-name>/secrets': Device or resource busy

When trying to remove the entire /var/lib/nomad directory (all four were different apps).

rm: cannot remove '/var/lib/nomad/alloc/a0308523-7aca-14a9-c8da-371e9476ded7/<app-name>/secrets': Device or resource busy
rm: cannot remove '/var/lib/nomad/alloc/b7e86d10-cd14-fea2-2938-d9ead3bfaf55/<app-name>/secrets': Device or resource busy
rm: cannot remove '/var/lib/nomad/alloc/fed5826c-e158-5c4e-011b-6c0b5f899e15/<app-name>/secrets': Device or resource busy
rm: cannot remove '/var/lib/nomad/alloc/9ae836e5-3a99-de8e-d0d4-7ff9d0ec7129/<app-name>/secrets': Device or resource busy
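
For what it's worth, the "Device or resource busy" errors are the usual symptom of trying to delete a directory that is still an active mount point (the per-task secrets directory appears to be mounted separately from the rest of the alloc dir), so it has to be unmounted before the files can be removed. A minimal Go sketch of that cleanup order, assuming a Linux host and a hypothetical path (this is not Nomad's own teardown code):

    package main

    import (
        "log"
        "os"
        "syscall"
    )

    func main() {
        // Hypothetical path; in the output above the busy paths looked like
        // /var/lib/nomad/alloc/<alloc-id>/<app-name>/secrets.
        secretsDir := "/var/lib/nomad/alloc/67911360-2fed-ddef-a738-a5cac87718b9/app/secrets"

        // Detach the mount first; deleting files under a live mount point is
        // what produces "device or resource busy".
        if err := syscall.Unmount(secretsDir, 0); err != nil {
            log.Printf("unmount %s: %v", secretsDir, err)
        }

        // With the mount detached, the directory tree can be removed.
        if err := os.RemoveAll(secretsDir); err != nil {
            log.Fatalf("remove %s: %v", secretsDir, err)
        }
    }
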
@dadgar
Contributor

dadgar commented Dec 13, 2016

Hey @justinwalz

Thanks for filing this! Would you mind getting the logs of the clients/servers from when they were upgraded to when this behavior was exhibited?

Thanks,
Alex

@justinwalz
Contributor Author

I just rotated the machines out to get our cluster back online. For the most part, the logs looked fine; see the original post for the client machine errors.

Server logs are really long. Grepping for ERR and WARN, you see:

    2016/12/13 02:22:43 [ERR] yamux: keepalive failed: session shutdown
    2016/12/13 02:23:44 [ERR] yamux: keepalive failed: session shutdown
    2016/12/13 02:23:44.072512 [ERR] nomad.rpc: multiplex conn accept failed: read tcp 10.XXX.4.6:4647->10.XXX.4.74:45566: read: connection reset by peer
    2016/12/13 02:26:51.906399 [ERR] nomad.rpc: multiplex conn accept failed: read tcp 10.XXX.4.6:4647->10.XXX.4.36:34800: read: connection reset by peer
    2016/12/12 17:33:28 [WARN] serf: Shutdown without a Leave
    2016/12/12 17:33:29 [WARN] memberlist: Refuting an alive message
    2016/12/12 17:38:12 [WARN] raft: Rejecting vote request from 10.XXX.0.7:4647 since we have a leader: 10.XXX.0.6:4647
    2016/12/12 17:38:12 [WARN] raft: Heartbeat timeout from "10.XXX.0.6:4647" reached, starting election
    2016/12/12 18:59:43.563509 [WARN] nomad.heartbeat: node 'd88caec6-d937-4293-259e-abdaadc1b9b8' TTL expired
    2016/12/12 18:59:45.202016 [WARN] nomad.heartbeat: node '914801c9-19f3-def7-a1e2-8c36a58b0837' TTL expired
    2016/12/12 18:59:48.835383 [WARN] nomad.heartbeat: node '020ba9cb-aad9-740a-cab1-3f83b4908762' TTL expired
    2016/12/12 18:59:53.299938 [WARN] nomad.heartbeat: node '0769cb75-1c13-6262-b1f2-f57fe9fbf270' TTL expired
    2016/12/13 02:22:14.561692 [WARN] nomad.heartbeat: node 'b1c470b9-3129-537e-ffba-10f5ebd0c1d2' TTL expired
    2016/12/13 02:23:01.215183 [WARN] nomad.heartbeat: node 'ef779ca4-8f67-73e7-81b4-40444938b8f3' TTL expired
    2016/12/13 02:23:21.977661 [WARN] nomad.heartbeat: node 'b1c470b9-3129-537e-ffba-10f5ebd0c1d2' TTL expired
    2016/12/13 02:24:05.955934 [WARN] nomad.heartbeat: node 'b1c470b9-3129-537e-ffba-10f5ebd0c1d2' TTL expired
    2016/12/13 02:24:28.000693 [WARN] nomad.heartbeat: node 'b8d9a37d-9300-e26f-3374-764f51d8d094' TTL expired
    2016/12/13 02:24:50.868729 [WARN] nomad.heartbeat: node 'b8d9a37d-9300-e26f-3374-764f51d8d094' TTL expired
    2016/12/13 02:25:49.823150 [WARN] nomad.heartbeat: node 'b8d9a37d-9300-e26f-3374-764f51d8d094' TTL expired
    2016/12/13 02:26:51.565979 [WARN] nomad.heartbeat: node 'b8d9a37d-9300-e26f-3374-764f51d8d094' TTL expired
    2016/12/13 02:27:14.042747 [WARN] nomad.heartbeat: node 'b8d9a37d-9300-e26f-3374-764f51d8d094' TTL expired

@dadgar
Contributor

dadgar commented Dec 13, 2016

What I wanted to get a glimpse of is what the clients looked like before. Did they just come up, were they just restarted, or were they just sitting there when the panics happened?

@justinwalz
Contributor Author

Pretty much just sitting there. They had been up for about 6 hours, everything was running smoothly (host metrics and file usage), and there wasn't much in the logs up until that point.

If/when it happens again I'll zip it up and attach.

@sheldonkwok
Contributor

I'm also getting this error on Nomad v0.5.0:

* failed to remove the secret dir "/var/lib/ssi/nomad/alloc/02797923-bac1-25ba-15db-e0aed3c50964/service/secrets": remove /var/lib/ssi/nomad/alloc/02797923-bac1-25ba-15db-e0aed3c50964/service/secrets: device or resource busy
* remove /var/lib/ssi/nomad/alloc/02797923-bac1-25ba-15db-e0aed3c50964/service/secrets: device or resource busy
    2016/12/13 19:58:53.489177 [DEBUG] client: terminating runner for alloc '02797923-bac1-25ba-15db-e0aed3c50964'
    2016/12/13 19:58:53.548062 [INFO] driver.docker: stopped container ea1e81a77b2acf0e937379c2d80e1d680081988c4513b71d7d711ef1e482df4d
    2016/12/13 19:58:53.548097 [DEBUG] driver.docker: error collecting stats from container ea1e81a77b2acf0e937379c2d80e1d680081988c4513b71d7d711ef1e482df4d: io: read/write on closed pipe
    2016/12/13 19:58:53 [DEBUG] plugin: /usr/local/bin/nomad: plugin process exited
    2016/12/13 19:58:53.550752 [INFO] driver.docker: started container c1244d070adc9c6b96d9bf41afbd7022ffd30a581e6e055bf6c4dc88209b281e
    2016/12/13 19:58:53.554661 [WARN] client: error fetching stats of task epl-eagle-server: stats collection hasn't started yet
    2016/12/13 19:58:53.575353 [ERR] client: failed to destroy context for alloc '331fde94-8c85-11e2-cb82-b2b4b4807fe4': 2 error(s) occurred:

@dadgar
Contributor

dadgar commented Dec 13, 2016

@justinwalz #2096

This PR should fix it. We cut a release with this fix; would you mind testing with it?

https://releases.hashicorp.com/nomad/0.5.2-rc1/

@justinwalz
Contributor Author

@dadgar Sure, I'll roll it out to our test env right now.

@justinwalz
Contributor Author

@dadgar To follow up, things look alright with dev running 0.5.2-rc1. We'll continue to monitor, but I will close for the time being. Thanks.

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 17, 2022