0.11 panic deregistering job #7757

Closed
michaeldwan opened this issue Apr 20, 2020 · 5 comments

@michaeldwan
Contributor

Nomad version

0.11

Operating system and Environment details

Issue

Our Nomad servers are crashing while applying a delete job request from the raft log.

Based on our logs, a request to deregister and purge a job failed with a 500, though I can't find any more detail on why. Within a few seconds we began seeing errors like this on each server:

nomad.fsm: DeleteJob failed: error="job not found"
nomad.fsm: deregistering job failed: error="job not found"

Followed by panics:

Apr 17 21:14:39 nomad-do-nyc1-61d4 nomad[19436]: panic: runtime error: invalid memory address or nil pointer dereference
Apr 17 21:14:39 nomad-do-nyc1-61d4 nomad[19436]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x50 pc=0x17f190b]
Apr 17 21:14:39 nomad-do-nyc1-61d4 nomad[19436]: goroutine 103 [running]:
Apr 17 21:14:39 nomad-do-nyc1-61d4 nomad[19436]: github.com/hashicorp/nomad/nomad/state.(*StateStore).deleteJobFromPlugin(0xc000614120, 0x48beee, 0xc00ef8aa40, 0xc0028f30e0, 0xc010d2da40, 0x0)
Apr 17 21:14:39 nomad-do-nyc1-61d4 nomad[19436]:         github.com/hashicorp/nomad/nomad/state/state_store.go:1192 +0x37b
Apr 17 21:14:39 nomad-do-nyc1-61d4 nomad[19436]: github.com/hashicorp/nomad/nomad/state.(*StateStore).DeleteJobTxn(0xc000614120, 0x48beee, 0xc0035c26a0, 0x7, 0xc0035c2680, 0x8, 0xc00ef8aa40, 0x38061e0, 0xc010d2cfa0)
Apr 17 21:14:39 nomad-do-nyc1-61d4 nomad[19436]:         github.com/hashicorp/nomad/nomad/state/state_store.go:1502 +0xafb
Apr 17 21:14:39 nomad-do-nyc1-61d4 nomad[19436]: github.com/hashicorp/nomad/nomad.(*nomadFSM).handleJobDeregister(0xc000326ee0, 0x48beee, 0xc0035c2680, 0x8, 0xc0035c26a0, 0x7, 0x1, 0xc00ef8aa40, 0xc0081e7a38, 0x17dbfc1)
Apr 17 21:14:39 nomad-do-nyc1-61d4 nomad[19436]:         github.com/hashicorp/nomad/nomad/fsm.go:618 +0x1f6
Apr 17 21:14:39 nomad-do-nyc1-61d4 nomad[19436]: github.com/hashicorp/nomad/nomad.(*nomadFSM).applyDeregisterJob.func1(0xc00ef8aa40, 0x3142c01, 0xc00ef8aa40)
Apr 17 21:14:39 nomad-do-nyc1-61d4 nomad[19436]:         github.com/hashicorp/nomad/nomad/fsm.go:565 +0x78
Apr 17 21:14:39 nomad-do-nyc1-61d4 nomad[19436]: github.com/hashicorp/nomad/nomad/state.(*StateStore).WithWriteTransaction(0xc000614120, 0xc0081e7b10, 0x0, 0x0)
Apr 17 21:14:39 nomad-do-nyc1-61d4 nomad[19436]:         github.com/hashicorp/nomad/nomad/state/state_store.go:4873 +0x7b
Apr 17 21:14:39 nomad-do-nyc1-61d4 nomad[19436]: github.com/hashicorp/nomad/nomad.(*nomadFSM).applyDeregisterJob(0xc000326ee0, 0xc0040805a1, 0x4d, 0x4d, 0x48beee, 0x0, 0x0)
Apr 17 21:14:39 nomad-do-nyc1-61d4 nomad[19436]:         github.com/hashicorp/nomad/nomad/fsm.go:564 +0x1f2
Apr 17 21:14:39 nomad-do-nyc1-61d4 nomad[19436]: github.com/hashicorp/nomad/nomad.(*nomadFSM).Apply(0xc000326ee0, 0xc016b426e0, 0x577acc0, 0xbf9ea4efd4c4751b)
Apr 17 21:14:39 nomad-do-nyc1-61d4 nomad[19436]:         github.com/hashicorp/nomad/nomad/fsm.go:208 +0x42d
Apr 17 21:14:39 nomad-do-nyc1-61d4 nomad[19436]: github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*Raft).runFSM.func1(0xc016bd4120)
Apr 17 21:14:39 nomad-do-nyc1-61d4 nomad[19436]:         github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/fsm.go:90 +0x2c1
Apr 17 21:14:39 nomad-do-nyc1-61d4 nomad[19436]: github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*Raft).runFSM.func2(0xc000250a00, 0x40, 0x40)
Apr 17 21:14:39 nomad-do-nyc1-61d4 nomad[19436]:         github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/fsm.go:113 +0x75
Apr 17 21:14:39 nomad-do-nyc1-61d4 nomad[19436]: github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*Raft).runFSM(0xc000322900)
Apr 17 21:14:39 nomad-do-nyc1-61d4 nomad[19436]:         github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/fsm.go:219 +0x42f
Apr 17 21:14:39 nomad-do-nyc1-61d4 nomad[19436]: github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*raftState).goFunc.func1(0xc000322900, 0xc003ff2c30)
Apr 17 21:14:39 nomad-do-nyc1-61d4 nomad[19436]:         github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/state.go:146 +0x55
Apr 17 21:14:39 nomad-do-nyc1-61d4 nomad[19436]: created by github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*raftState).goFunc
Apr 17 21:14:39 nomad-do-nyc1-61d4 nomad[19436]:         github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/state.go:144 +0x66

Based on the stack trace, this line (and the comment above it) was the culprit: state_store.go:1192, in deleteJobFromPlugin. We're not using any CSI plugins, and this job had no other plugin configuration.

We were able to bring our servers back online with a patch that checked for nil task groups before ranging.
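
For readers unfamiliar with this kind of guard, here is a minimal sketch in Go. It is not the actual Nomad code or the actual patch: the Task, TaskGroup and Job types and the pluginIDs function are hypothetical stand-ins. One thing worth keeping in mind is that ranging over a nil slice is safe in Go, so a fault at a small address such as addr=0x50 usually means a field was read through a nil struct pointer. The sketch therefore guards the pointers themselves.

```go
package main

import "fmt"

// Task, TaskGroup and Job are hypothetical, simplified stand-ins for
// illustration; they are not Nomad's real structs.
type Task struct {
	Name     string
	PluginID string // empty when the task declares no plugin
}

type TaskGroup struct {
	Name  string
	Tasks []*Task
}

type Job struct {
	ID         string
	TaskGroups []*TaskGroup
}

// pluginIDs returns the plugin IDs referenced by a job. It checks every
// pointer before dereferencing it, so a missing (nil) job yields an empty
// result instead of a panic.
func pluginIDs(job *Job) []string {
	if job == nil {
		return nil
	}
	var ids []string
	// Ranging over a nil slice is safe in Go: the loop body never runs.
	for _, tg := range job.TaskGroups {
		if tg == nil {
			continue
		}
		for _, t := range tg.Tasks {
			if t == nil || t.PluginID == "" {
				continue
			}
			ids = append(ids, t.PluginID)
		}
	}
	return ids
}

func main() {
	var missing *Job
	fmt.Println(len(pluginIDs(missing)))       // prints 0
	fmt.Println(len(pluginIDs(&Job{ID: "a"}))) // prints 0

	job := &Job{ID: "b", TaskGroups: []*TaskGroup{
		nil,
		{Name: "g", Tasks: []*Task{{Name: "t", PluginID: "p1"}}},
	}}
	fmt.Println(pluginIDs(job)) // prints [p1]
}
```

Returning early on a nil job matters most in a replicated state machine: every server applies the same log entry, so an unguarded dereference crashes all of them in the same way.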

Reproduction steps

Job file (if appropriate)

Nomad Client logs (if appropriate)

Nomad Server logs (if appropriate)

I can share logs before and after the patch if you need.

@tgross
Member

tgross commented Apr 20, 2020

Hi @michaeldwan! Sorry to hear about this. I ran into this just an hour ago myself while working on #7708. Thanks for the PR!

@michaeldwan
Contributor Author

Out of curiosity, is there a way to remove an entry from the raft log that's preventing servers from starting?

@tgross
Member

tgross commented Apr 21, 2020

> Out of curiosity, is there a way to remove an entry from the raft log that's preventing servers from starting?

I can't say that I've ever seen it done, but it's not impossible. It's just much safer to fix it with a patch that's aware of the application's schema so that you don't get dangling cross-references.

Our raft implementation uses https://github.com/hashicorp/raft-boltdb as the backing store, and hypothetically it'd be possible to edit the backing store directly (on a stopped server, wiping out the other servers and syncing the results to them manually). There'd be a change to the "bucket" (table) for the object you want to edit and another for the index table for that object type. It'd be pretty dangerous, so you'd want to back up the store beforehand.

tgross added this to the 0.11.1 milestone Apr 22, 2020
@tgross
Member

tgross commented Apr 22, 2020

The patch for this will ship in the 0.11.1 release.

tgross closed this as completed Apr 22, 2020
@github-actions

github-actions bot commented Nov 8, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Nov 8, 2022