Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[v0.9.6] Nomad client panics after alloc exec #6728

Closed
daktari opened this issue Nov 19, 2019 · 6 comments
Closed

[v0.9.6] Nomad client panics after alloc exec #6728

daktari opened this issue Nov 19, 2019 · 6 comments

Comments

@daktari
Copy link
Contributor

daktari commented Nov 19, 2019

Nomad version

v0.9.6

Operating system and Environment details

Amazon Linux 2 - 4.14.146-119.123.amzn2.x86_64

Issue

Panic after alloc exec. This occurs in several nodes and this Issue #6332 kicks in making our services to go to critical state.

Nov 18 16:10:14 w-a6e382ff-0013 nomad: {"@level":"info","@message":"task exec session starting","@module":"client","@timestamp":"2019-11-18T16:10:14.806617Z","access_token_id":"","access_token_name":"","alloc_id":"fdb1a3c0-f2ea-fefa-8fd4-9f0906babbcc","command":["bash"],"exec_id":"3fb00116-5c6b-1644-467e-9f20c1bac283","task":"my-awesome-task","tty":true}
Nov 18 16:10:14 w-a6e382ff-0013 nomad: panic: runtime error: invalid memory address or nil pointer dereference
Nov 18 16:10:14 w-a6e382ff-0013 nomad: [signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x125b24a]
Nov 18 16:10:14 w-a6e382ff-0013 nomad: goroutine 21159946 [running]:
Nov 18 16:10:14 w-a6e382ff-0013 nomad: github.com/hashicorp/nomad/client/allocrunner/taskrunner.(*DriverHandle).ExecStreaming(0x0, 0x2498c80, 0xc0060f4c40, 0xc00612b770, 0x1, 0x1, 0x249b301, 0x2482940, 0xc002e5fd40, 0xc0003792d0, ...)
Nov 18 16:10:14 w-a6e382ff-0013 nomad: /opt/gopath/src/github.com/hashicorp/nomad/client/allocrunner/taskrunner/driver_handle.go:73 +0x3a
Nov 18 16:10:14 w-a6e382ff-0013 nomad: github.com/hashicorp/nomad/client/allocrunner/taskrunner.(*DriverHandle).ExecStreaming-fm(0x2498c80, 0xc0060f4c40, 0xc00612b770, 0x1, 0x1, 0x1, 0x2482940, 0xc002e5fd40, 0x128, 0x7f44e4c5e6c0)
Nov 18 16:10:14 w-a6e382ff-0013 nomad: /opt/gopath/src/github.com/hashicorp/nomad/client/allocrunner/taskrunner/task_runner.go:1350 +0x8c
Nov 18 16:10:14 w-a6e382ff-0013 nomad: github.com/hashicorp/nomad/client.(*Allocations).execImpl(0xc0003002f8, 0xc00072d180, 0xc00058cc00, 0xc0002f5c20, 0x24, 0x0, 0x0, 0x0)
Nov 18 16:10:14 w-a6e382ff-0013 nomad: /opt/gopath/src/github.com/hashicorp/nomad/client/alloc_endpoint.go:266 +0x9a0
Nov 18 16:10:14 w-a6e382ff-0013 nomad: github.com/hashicorp/nomad/client.(*Allocations).exec(0xc0003002f8, 0x7f44e4bd02c0, 0xc0017b40c0)
Nov 18 16:10:14 w-a6e382ff-0013 nomad: /opt/gopath/src/github.com/hashicorp/nomad/client/alloc_endpoint.go:150 +0x236
Nov 18 16:10:14 w-a6e382ff-0013 nomad: github.com/hashicorp/nomad/client.(*Allocations).exec-fm(0x7f44e4bd02c0, 0xc0017b40c0)
Nov 18 16:10:14 w-a6e382ff-0013 nomad: /opt/gopath/src/github.com/hashicorp/nomad/client/alloc_endpoint.go:29 +0x3e
Nov 18 16:10:14 w-a6e382ff-0013 nomad: github.com/hashicorp/nomad/client.(*Client).handleStreamingConn(0xc0003aa240, 0x24a6e60, 0xc0017b40c0)
Nov 18 16:10:14 w-a6e382ff-0013 nomad: /opt/gopath/src/github.com/hashicorp/nomad/client/rpc.go:367 +0x5de
Nov 18 16:10:14 w-a6e382ff-0013 nomad: github.com/hashicorp/nomad/client.(*Client).handleConn(0xc0003aa240, 0x24a6e60, 0xc0017b40c0)
Nov 18 16:10:14 w-a6e382ff-0013 nomad: /opt/gopath/src/github.com/hashicorp/nomad/client/rpc.go:299 +0x22a
Nov 18 16:10:14 w-a6e382ff-0013 nomad: created by github.com/hashicorp/nomad/client.(*Client).listenConn
Nov 18 16:10:14 w-a6e382ff-0013 nomad: /opt/gopath/src/github.com/hashicorp/nomad/client/rpc.go:275 +0x78
Nov 18 16:10:14 w-a6e382ff-0013 systemd: nomad.service: main process exited, code=exited, status=2/INVALIDARGUMENT
Nov 18 16:10:14 w-a6e382ff-0013 systemd: Unit nomad.service entered failed state.
Nov 18 16:10:14 w-a6e382ff-0013 systemd: nomad.service failed.
Nov 18 16:10:15 w-a6e382ff-0013 systemd: nomad.service holdoff time over, scheduling restart.

Reproduction steps

I can't reproduce it.

Nomad Client logs (if appropriate)

This was the alloc that panics nomad-client.

image

It is weird that clients tried to kill this alloc several hours after this alloc was Killed.

Nov 18 15:22:51 w-a6e382ff-0013 nomad: {"@level":"info","@message":"task exec session starting","@module":"client","@timestamp":"2019-11-18T15:22:51.262120Z","access_token_id":"","access_token_name":"","alloc_id":"fdb1a3c0-f2ea-fefa-8fd4-9f0906babbcc","command":["bash"],"exec_id":"01ef9d91-30e0-5dce-95d2-e45cafb0fbdf","task":"my-awesome-task","tty":true}
Nov 18 15:35:49 w-a6e382ff-0013 nomad: {"@level":"info","@message":"task exec session starting","@module":"client","@timestamp":"2019-11-18T15:35:49.301158Z","access_token_id":"","access_token_name":"","alloc_id":"fdb1a3c0-f2ea-fefa-8fd4-9f0906babbcc","command":["bash"],"exec_id":"66fede01-ce2a-1e46-e0f1-5dc815fa1f28","task":"my-awesome-task","tty":true}
Nov 18 16:00:09 w-a6e382ff-0013 nomad: {"@level":"info","@message":"task exec session starting","@module":"client","@timestamp":"2019-11-18T16:00:09.394583Z","access_token_id":"","access_token_name":"","alloc_id":"fdb1a3c0-f2ea-fefa-8fd4-9f0906babbcc","command":["bash"],"exec_id":"dc1b7bfe-a476-f490-eff0-da73cc996b0f","task":"my-awesome-task","tty":true}
Nov 18 16:06:26 w-a6e382ff-0013 consul: 2019/11/18 16:06:26 [INFO] agent: Deregistered service "_nomad-task-fdb1a3c0-f2ea-fefa-8fd4-9f0906babbcc-my-awesome-task-my-awesome-task-http"
Nov 18 16:06:28 w-a6e382ff-0013 nomad: {"@level":"error","@message":"failed to start stats collection for task with unrecoverable error","@module":"client.alloc_runner.task_runner.task_hook.stats_hook","@timestamp":"2019-11-18T16:06:28.522760Z","alloc_id":"fdb1a3c0-f2ea-fefa-8fd4-9f0906babbcc","error":"container stopped","task":"my-awesome-task"}
Nov 18 16:06:28 w-a6e382ff-0013 nomad: {"@level":"info","@message":"marking allocation for GC","@module":"client.gc","@timestamp":"2019-11-18T16:06:28.559039Z","alloc_id":"fdb1a3c0-f2ea-fefa-8fd4-9f0906babbcc"}
Nov 18 16:10:14 w-a6e382ff-0013 nomad: {"@level":"info","@message":"task exec session starting","@module":"client","@timestamp":"2019-11-18T16:10:14.806617Z","access_token_id":"","access_token_name":"","alloc_id":"fdb1a3c0-f2ea-fefa-8fd4-9f0906babbcc","command":["bash"],"exec_id":"3fb00116-5c6b-1644-467e-9f20c1bac283","task":"my-awesome-task","tty":true}
Nov 18 16:10:15 w-a6e382ff-0013 nomad: {"@level":"info","@message":"marking allocation for GC","@module":"client.gc","@timestamp":"2019-11-18T16:10:15.663708Z","alloc_id":"fdb1a3c0-f2ea-fefa-8fd4-9f0906babbcc"}
Nov 19 00:33:34 w-a6e382ff-0013 nomad: {"@level":"info","@message":"garbage collecting allocation","@module":"client.gc","@timestamp":"2019-11-19T00:33:34.485140Z","alloc_id":"fdb1a3c0-f2ea-fefa-8fd4-9f0906babbcc","reason":"new allocations and over max (50)"}
Nov 19 00:33:34 w-a6e382ff-0013 nomad: {"@level":"info","@message":"marking allocation for GC","@module":"client.gc","@timestamp":"2019-11-19T00:33:34.491385Z","alloc_id":"fdb1a3c0-f2ea-fefa-8fd4-9f0906babbcc"}
Nov 19 14:48:42 w-a6e382ff-0013 nomad: {"@level":"info","@message":"garbage collecting allocation","@module":"client.gc","@timestamp":"2019-11-19T14:48:42.383720Z","alloc_id":"fdb1a3c0-f2ea-fefa-8fd4-9f0906babbcc","reason":"new allocations and over max (50)"}

Can this be related with #6461 ?

BR

@daktari
Copy link
Contributor Author

daktari commented Nov 19, 2019

Hi again

I am able to reproduce it.

Allocations
ID        Node ID   Task Group                 Version  Desired  Status    Created     Modified
094298d7  db54eca7  my-awesome-task            11       stop     complete  5h31m ago   1m59s ago

if I try to exec in this stopped task it crash nomad-client

nomad alloc exec 094298d7 bash
failed to exec into task: disconnected without receiving the exit code
Nov 19 16:53:33 w-a6e382ff-0013 nomad: {"@level":"info","@message":"task exec session starting","@module":"client","@timestamp":"2019-11-19T16:53:33.100994Z","access_token_id":"","access_token_name":"","alloc_id":"094298d7-84ce-cd1b-02d9-f869cf9009b7","command":["bash"],"exec_id":"d5cf8966-326e-4339-8757-726890937754","task":"my-awesome-task","tty":true}
Nov 19 16:53:33 w-a6e382ff-0013 nomad: panic: runtime error: invalid memory address or nil pointer dereference
Nov 19 16:53:33 w-a6e382ff-0013 nomad: [signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x125b24a]
Nov 19 16:53:33 w-a6e382ff-0013 nomad: goroutine 806752 [running]:
Nov 19 16:53:33 w-a6e382ff-0013 nomad: github.com/hashicorp/nomad/client/allocrunner/taskrunner.(*DriverHandle).ExecStreaming(0x0, 0x2498c80, 0xc003e45ac0, 0xc002b0df50, 0x1, 0x1, 0x249b301, 0x2482940, 0xc002f56e60, 0xc0003ed650, ...)
Nov 19 16:53:33 w-a6e382ff-0013 nomad: /opt/gopath/src/github.com/hashicorp/nomad/client/allocrunner/taskrunner/driver_handle.go:73 +0x3a
Nov 19 16:53:33 w-a6e382ff-0013 nomad: github.com/hashicorp/nomad/client/allocrunner/taskrunner.(*DriverHandle).ExecStreaming-fm(0x2498c80, 0xc003e45ac0, 0xc002b0df50, 0x1, 0x1, 0x1, 0x2482940, 0xc002f56e60, 0x128, 0x7f10986dc440)
Nov 19 16:53:33 w-a6e382ff-0013 nomad: /opt/gopath/src/github.com/hashicorp/nomad/client/allocrunner/taskrunner/task_runner.go:1350 +0x8c
Nov 19 16:53:33 w-a6e382ff-0013 nomad: github.com/hashicorp/nomad/client.(*Allocations).execImpl(0xc00000e3a8, 0xc004bad2c0, 0xc001b73800, 0xc00336f3b0, 0x24, 0x0, 0x0, 0x0)
Nov 19 16:53:33 w-a6e382ff-0013 nomad: /opt/gopath/src/github.com/hashicorp/nomad/client/alloc_endpoint.go:266 +0x9a0
Nov 19 16:53:33 w-a6e382ff-0013 nomad: github.com/hashicorp/nomad/client.(*Allocations).exec(0xc00000e3a8, 0x7f10985763a8, 0xc001aaa000)
Nov 19 16:53:33 w-a6e382ff-0013 nomad: /opt/gopath/src/github.com/hashicorp/nomad/client/alloc_endpoint.go:150 +0x236
Nov 19 16:53:33 w-a6e382ff-0013 nomad: github.com/hashicorp/nomad/client.(*Allocations).exec-fm(0x7f10985763a8, 0xc001aaa000)
Nov 19 16:53:33 w-a6e382ff-0013 nomad: /opt/gopath/src/github.com/hashicorp/nomad/client/alloc_endpoint.go:29 +0x3e
Nov 19 16:53:33 w-a6e382ff-0013 nomad: github.com/hashicorp/nomad/client.(*Client).handleStreamingConn(0xc0004ae240, 0x24a6e60, 0xc001aaa000)
Nov 19 16:53:33 w-a6e382ff-0013 nomad: /opt/gopath/src/github.com/hashicorp/nomad/client/rpc.go:367 +0x5de
Nov 19 16:53:33 w-a6e382ff-0013 nomad: github.com/hashicorp/nomad/client.(*Client).handleConn(0xc0004ae240, 0x24a6e60, 0xc001aaa000)
Nov 19 16:53:33 w-a6e382ff-0013 nomad: /opt/gopath/src/github.com/hashicorp/nomad/client/rpc.go:299 +0x22a
Nov 19 16:53:33 w-a6e382ff-0013 nomad: created by github.com/hashicorp/nomad/client.(*Client).listenConn
Nov 19 16:53:33 w-a6e382ff-0013 nomad: /opt/gopath/src/github.com/hashicorp/nomad/client/rpc.go:275 +0x78
Nov 19 16:53:33 w-a6e382ff-0013 systemd: nomad.service: main process exited, code=exited, status=2/INVALIDARGUMENT
Nov 19 16:53:33 w-a6e382ff-0013 systemd: Unit nomad.service entered failed state.
Nov 19 16:53:33 w-a6e382ff-0013 systemd: nomad.service failed.
Nov 19 16:53:33 w-a6e382ff-0013 systemd: nomad.service holdoff time over, scheduling restart.

If you need more info do not hesitate to contact me.

BR

@notnoop
Copy link
Contributor

notnoop commented Nov 19, 2019

Thanks for reaching out. It's indeed a bug that was fixed in 0.10.0 by #6065 . Let me know if you have any questions.

@daktari
Copy link
Contributor Author

daktari commented Nov 19, 2019

Thank you for your answer. We are going to backport this change to our current version before upgrade to 0.10.

Only one question. It is normal that client tried to kill a dead allocation twice?

Thanks again

@notnoop
Copy link
Contributor

notnoop commented Nov 19, 2019

Anytime!

Only one question. It is normal that client tried to kill a dead allocation twice?

It's not typical, but possible specially in tight races to kill an allocation. Once it's dead for awhile, I don't expect it to be killed again. Would love to have logs to provide better guidance.

@daktari
Copy link
Contributor Author

daktari commented Nov 20, 2019

Yeah, I will open a new issue with all the information that I have.

Thanks!

@github-actions
Copy link

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 16, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

2 participants