Nomad 1.4.2 - panic: runtime error: invalid memory address or nil pointer dereference #15189

Closed
ngcmac opened this issue Nov 8, 2022 · 5 comments · Fixed by #15192

Comments

ngcmac commented Nov 8, 2022

Nomad version

1.4.2

Operating system and Environment details

Ubuntu 22.04 AMD64
1 x Nomad Server (1.4.2)
3 x Nomad Clients (1.4.2)
CNI plugins (1.1.1), using some NFS volumes

Issue

The Nomad service panics in our CI/DEV cluster after upgrading to 1.4.2.
It happens on all the clients; the Nomad service keeps restarting.
Draining the affected node seems to stop the panic/restarts.

Reproduction steps

Restart the Nomad service (systemd)

Actual Result

Nomad service panics and keeps restarting

Nomad Client logs (if appropriate)

Nov 8 17:37:40 <computer_name> nomad[998943]: panic: runtime error: invalid memory address or nil pointer dereference
Nov 8 17:37:40 <computer_name> nomad[998943]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x28 pc=0x16b105d]
Nov 8 17:37:40 <computer_name> nomad[998943]: goroutine 1522 [running]:
Nov 8 17:37:40 <computer_name> nomad[998943]: github.com/hashicorp/nomad/client/allocrunner/taskrunner/template.(*TaskTemplateManager).SetDriverHandle(0x1?, {0x31c6440?, 0xc00282c780?})
Nov 8 17:37:40 <computer_name> nomad[998943]: #011github.com/hashicorp/nomad/client/allocrunner/taskrunner/template/template.go:201 +0x3d
Nov 8 17:37:40 <computer_name> nomad[998943]: github.com/hashicorp/nomad/client/allocrunner/taskrunner.(*templateHook).Poststart(0xc0019ac480?, {0x31c66e0?, 0xc00012f580?}, 0xc004aa8270?, 0xc000ae4870?)
Nov 8 17:37:40 <computer_name> nomad[998943]: #011github.com/hashicorp/nomad/client/allocrunner/taskrunner/template_hook.go:120 +0x3d
Nov 8 17:37:40 <computer_name> nomad[998943]: github.com/hashicorp/nomad/client/allocrunner/taskrunner.(*TaskRunner).poststart(0xc0019b4000)
Nov 8 17:37:40 <computer_name> nomad[998943]: #011github.com/hashicorp/nomad/client/allocrunner/taskrunner/task_runner_hooks.go:346 +0x63d
Nov 8 17:37:40 <computer_name> nomad[998943]: github.com/hashicorp/nomad/client/allocrunner/taskrunner.(*TaskRunner).Run(0xc0019b4000)
Nov 8 17:37:40 <computer_name> nomad[998943]: #011github.com/hashicorp/nomad/client/allocrunner/taskrunner/task_runner.go:606 +0x965
Nov 8 17:37:40 <computer_name> nomad[998943]: created by github.com/hashicorp/nomad/client/allocrunner.(*allocRunner).runTasks
Nov 8 17:37:40 <computer_name> nomad[998943]: #011github.com/hashicorp/nomad/client/allocrunner/alloc_runner.go:395 +0x6c
Nov 8 17:37:43 <computer_name> nomad[1000004]: ==> Loaded configuration from /opt/nomad/nomad.d/nomad-consul.json, /opt/nomad/nomad.d/nomad-docker.hcl, /opt/nomad/nomad.d/nomad-docker.json, /opt/nomad/nomad.d/nomad-tls.json, /opt/nomad/nomad.d/nomad.json, /opt/nomad/nomad.d/nomad_dogstatsd.json
Nov 8 17:37:43 <computer_name> nomad[1000004]: ==> Starting Nomad agent...
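
Editor's note: the trace above (a fault at the small address 0x28 inside `SetDriverHandle`) is consistent with a method being called on a nil pointer receiver and then touching one of its fields. The following is a minimal sketch of that failure mode only. The `manager` type, its fields, and `setHandle` are hypothetical stand-ins, not Nomad's actual code.

```go
package main

import (
	"fmt"
	"sync"
)

// manager is a hypothetical stand-in for a type whose methods
// assume a non-nil receiver. It is NOT Nomad's TaskTemplateManager.
type manager struct {
	name   string
	mu     sync.Mutex
	handle string
}

// setHandle locks a field of the receiver, so it dereferences m.
// Calling it on a nil *manager triggers a nil pointer dereference.
func (m *manager) setHandle(h string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.handle = h
}

// callOnNil invokes setHandle on a nil pointer and returns the
// recovered panic message instead of crashing the process.
func callOnNil() (msg string) {
	defer func() {
		if r := recover(); r != nil {
			msg = fmt.Sprint(r)
		}
	}()
	var m *manager // nil: nothing ever assigned it
	m.setHandle("exec-handle")
	return ""
}

func main() {
	fmt.Println("recovered:", callOnNil())
}
```

In Go the call itself succeeds on a nil receiver; the fault only occurs once the method reads or writes a field, which is why the panic surfaces inside the callee rather than at the call site.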

Member

shoenig commented Nov 8, 2022

Hi @ngcmac, can you specify which Nomad version the clients were upgraded from? Can you also provide agent logs from one of the panicking Nomad clients?

Author

ngcmac commented Nov 8, 2022

Hi @shoenig ,

Upgraded from 1.4.1 --> 1.4.2.
I also just replicated the issue in another cluster running on AWS EC2 Debian 11 arm64 (3 servers + n clients), upgraded from 1.3.5 --> 1.4.2.

This does not happen every time I restart the Nomad service, but it happens very frequently.
I was not able to reproduce the issue if the node is drained before the restart.

Attached are some logs starting from the first restart, where we can see the panic/restart loop.

nomad.log

Contributor

pkazmierczak commented Nov 9, 2022

Hi @ngcmac, would you be able to share the job file or at least the template that is being used?

Author

ngcmac commented Nov 9, 2022

Hi @pkazmierczak,

I attach a common job template used by most of our services.
As this happens on multiple nodes, it does not seem related to a single job/template. Also, I don't remember changing anything recently regarding templating.
The issue does not happen when a deploy occurs or a node is drained, only when the Nomad service is restarted (at least that is what I have experienced so far).

Our templates and jobs are pretty simple, using Consul, Vault and Consul Template.

Thanks
nomad-job-template.zip

shoenig added a commit that referenced this issue Nov 9, 2022
This PR protects access to `templateHook.templateManager` with its lock. So
far we have not been able to reproduce the panic - but it seems either Poststart
is running without a Prestart being run first (should be impossible), or the
Update hook is running concurrently with Poststart, nil-ing out the templateManager
in a race with Poststart.

Fixes #15189
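
Editor's note: the commit message describes guarding a shared pointer with a lock so that an update cannot nil it out while another hook is using it. The following is a minimal sketch of that general pattern, assuming a simplified shape. The `hook` and `manager` types, the method names `prestart`, `poststart`, and `update`, and the error text are all hypothetical; this is not the code from the referenced PR.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// manager is a hypothetical stand-in for the template manager.
type manager struct {
	handle string
}

func (m *manager) setHandle(h string) { m.handle = h }

// hook is a hypothetical stand-in for a task hook that owns a manager.
// mu guards every read and write of mgr.
type hook struct {
	mu  sync.Mutex
	mgr *manager
}

// prestart installs the manager.
func (h *hook) prestart() {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.mgr = &manager{}
}

// poststart hands a handle to the manager. The lock is held across both
// the nil check and the call, so update cannot swap the pointer in between.
func (h *hook) poststart(handle string) error {
	h.mu.Lock()
	defer h.mu.Unlock()
	if h.mgr == nil {
		return errors.New("manager not set")
	}
	h.mgr.setHandle(handle)
	return nil
}

// update discards the old manager and builds a new one under the same lock,
// so other hooks never observe the intermediate nil.
func (h *hook) update() {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.mgr = nil
	h.mgr = &manager{}
}

func main() {
	h := &hook{}
	fmt.Println("before prestart:", h.poststart("exec-handle"))

	h.prestart()
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(2)
		go func() { defer wg.Done(); h.update() }()
		go func() { defer wg.Done(); _ = h.poststart("exec-handle") }()
	}
	wg.Wait()
	fmt.Println("after concurrent calls:", h.poststart("exec-handle"))
}
```

The explicit nil check also covers the other case the commit message mentions, a poststart that runs without a preceding prestart: it returns an error instead of dereferencing nil. Running the sketch with `go run -race` is one way to check that the shared pointer is not accessed without the lock.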
shoenig added a commit that referenced this issue Nov 10, 2022
This PR protects access to `templateHook.templateManager` with its lock. So
far we have not been able to reproduce the panic - but it seems either Poststart
is running without a Prestart being run first (should be impossible), or the
Update hook is running concurrently with Poststart, nil-ing out the templateManager
in a race with Poststart.

Fixes #15189

Co-authored-by: Seth Hoenig <shoenig@duck.com>
@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Mar 11, 2023
jorgemarey pushed a commit to jorgemarey/nomad that referenced this issue Nov 1, 2023
This PR protects access to `templateHook.templateManager` with its lock. So
far we have not been able to reproduce the panic - but it seems either Poststart
is running without a Prestart being run first (should be impossible), or the
Update hook is running concurrently with Poststart, nil-ing out the templateManager
in a race with Poststart.

Fixes hashicorp#15189