panic/SIGSEGV in Docker driver, invalid memory address or nil pointer dereference #7738

Closed
martinb3 opened this issue Apr 17, 2020 · 1 comment · Fixed by #7749

Nomad version

Output from nomad version

Nomad v0.10.3+ent (0cd9d29eda9d9495b94161f38c4b2f275a7a088f)

Also docker version, in case it's involved:

docker version
Client: Docker Engine - Community
 Version:           19.03.6
 API version:       1.40
 Go version:        go1.12.16
 Git commit:        369ce74a3c
 Built:             Thu Feb 13 01:27:49 2020
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          19.03.6
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.12.16
  Git commit:       369ce74a3c
  Built:            Thu Feb 13 01:26:21 2020
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.2.13
  GitCommit:        7ad184331fa3e55e52b890ea95e65ba581ae3429
 runc:
  Version:          1.0.0-rc10
  GitCommit:        dc9208a3303feef5b3839f4323d9beb36df0a9dd
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683

Operating system and Environment details

Linux ip-10-0-74-196 4.15.0-1063-aws #67-Ubuntu SMP Mon Mar 2 07:24:29 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Issue

We're seeing segfaults on Nomad client instances that run multiple jobs with the default resource settings. Many containers get killed as they approach or exceed those default limits, so there's a lot of churn in this environment as jobs are killed and rescheduled.

As you can see from the dmesg output, it's also then failing to restart via systemd.

@notnoop had mentioned, in chat:

the issue seems that container is nil in https://github.com/hashicorp/nomad/blob/v0.10.3/drivers/docker/driver.go#L430
but that means that d.containerByName returned a nil container AND a nil error

which is done in https://github.com/hashicorp/nomad/blob/v0.10.3/drivers/docker/driver.go#L1124 but we don't handle that case

Hope that also helps.

Reproduction steps

We're simply allocating memory in a container until it gets killed.
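
The exact program we use isn't included here, but a workload of this kind can be sketched roughly as follows. Everything in it is an assumption made for illustration: the 16 MiB chunk size, the 4096-byte page stride, the `allocate` helper, and the command-line argument. By default it stops after a few chunks; passing `0` as the first argument removes the cap so the process grows until the container is killed.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

const (
	chunkSize = 16 << 20 // 16 MiB per allocation (arbitrary choice)
	pageSize  = 4096     // assumed page size, used only as the touch stride
)

// allocate grabs n buffers of size bytes each and writes one byte per page
// so the memory is actually resident. It returns the buffers so the caller
// can keep them live.
func allocate(n, size int) [][]byte {
	held := make([][]byte, 0, n)
	for i := 0; i < n; i++ {
		buf := make([]byte, size)
		for j := 0; j < len(buf); j += pageSize {
			buf[j] = 1
		}
		held = append(held, buf)
	}
	return held
}

func main() {
	// First argument: number of chunks to allocate; 0 means no cap.
	// The default is a small cap so an accidental run stays harmless.
	limit := 4
	if len(os.Args) > 1 {
		if v, err := strconv.Atoi(os.Args[1]); err == nil && v >= 0 {
			limit = v
		}
	}

	var held [][]byte
	for i := 0; limit == 0 || i < limit; i++ {
		held = append(held, allocate(1, chunkSize)...)
		fmt.Printf("allocated %d MiB\n", (len(held)*chunkSize)>>20)
	}
}
```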

Job file (if appropriate)

It's a pretty generic job file, with count = 1 and no memory/cpu reservations specified. I'm happy to provide it out of band if you think it's helpful.

Nomad Client logs (if appropriate)

Apr 14 15:03:00 ip-10-0-67-237 nomad[22553]: panic: runtime error: invalid memory address or nil pointer dereference
Apr 14 15:03:00 ip-10-0-67-237 nomad[22553]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x14882db]
Apr 14 15:03:00 ip-10-0-67-237 nomad[22553]: goroutine 388 [running]:
Apr 14 15:03:00 ip-10-0-67-237 nomad[22553]: github.com/hashicorp/nomad/drivers/docker.(*Driver).createContainer(0xc000197360, 0x2991e00, 0xc0003b3710, 0xc000b60380, 0x35, 0xc0002e4780, 0xc000444800, 0x0, 0x0, 0x0, ...)
Apr 14 15:03:00 ip-10-0-67-237 nomad[22553]:     /opt/gopath/src/github.com/hashicorp/nomad/drivers/docker/driver.go:430 +0x86b
Apr 14 15:03:00 ip-10-0-67-237 nomad[22553]: github.com/hashicorp/nomad/drivers/docker.(*Driver).StartTask(0xc000197360, 0xc000a39f00, 0xc0003cac10, 0x207e760, 0xc000f3d3b0, 0x0)
Apr 14 15:03:00 ip-10-0-67-237 nomad[22553]:     /opt/gopath/src/github.com/hashicorp/nomad/drivers/docker/driver.go:281 +0x654
Apr 14 15:03:00 ip-10-0-67-237 nomad[22553]: github.com/hashicorp/nomad/client/allocrunner/taskrunner.(*TaskRunner).runDriver(0xc00019a780, 0xc0003d5cc8, 0x3)
Apr 14 15:03:00 ip-10-0-67-237 nomad[22553]:     /opt/gopath/src/github.com/hashicorp/nomad/client/allocrunner/taskrunner/task_runner.go:738 +0xa67
Apr 14 15:03:00 ip-10-0-67-237 nomad[22553]: github.com/hashicorp/nomad/client/allocrunner/taskrunner.(*TaskRunner).Run(0xc00019a780)
Apr 14 15:03:00 ip-10-0-67-237 nomad[22553]:     /opt/gopath/src/github.com/hashicorp/nomad/client/allocrunner/taskrunner/task_runner.go:491 +0xbfd
Apr 14 15:03:00 ip-10-0-67-237 nomad[22553]: created by github.com/hashicorp/nomad/client/allocrunner.(*allocRunner).runTasks
Apr 14 15:03:00 ip-10-0-67-237 nomad[22553]:     /opt/gopath/src/github.com/hashicorp/nomad/client/allocrunner/alloc_runner.go:312 +0x98
Apr 14 15:03:00 ip-10-0-67-237 systemd[1]: nomad.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Apr 14 15:03:00 ip-10-0-67-237 systemd[1]: nomad.service: Failed with result 'exit-code'.
Apr 14 15:03:02 ip-10-0-67-237 systemd[1]: nomad.service: Service hold-off time over, scheduling restart.
Apr 14 15:03:02 ip-10-0-67-237 systemd[1]: nomad.service: Scheduled restart job, restart counter is at 5.
Apr 14 15:03:02 ip-10-0-67-237 systemd[1]: nomad.service: Start request repeated too quickly.
Apr 14 15:03:02 ip-10-0-67-237 systemd[1]: nomad.service: Failed with result 'exit-code'.
