
token revocation 400s: panic: runtime error: invalid memory address or nil pointer dereference #6540

Closed
dekimsey opened this issue Oct 23, 2019 · 8 comments

Comments

@dekimsey
Collaborator

dekimsey commented Oct 23, 2019

Nomad version

Nomad v0.10.0 (25ee121d951939504376c70bf8d7950c1ddb6a82)

Operating system and Environment details

CentOS 7

Issue

I'm not really sure what happened, to be honest. I was trying to run a job with 'nomad run' when the servers crashed and lost leadership. It looks like Nomad was trying (and failing) to revoke a Vault token. The logs don't tell me what the token accessor is, so I can't check the Vault audit logs to see what might be failing. Vault appears to be operating normally (I do see successful accesses).
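For what it's worth, a minimal way to cross-check an accessor from the Vault side might look like the following. This is only a sketch, not something from my logs: it assumes a Vault token with sudo capability on auth/token/accessors, and the accessor value shown is a placeholder.

# List live token accessors (the endpoint requires sudo capability)
$ vault list auth/token/accessors
# Look up a single accessor to see what it belongs to (placeholder value below)
$ vault token lookup -accessor 8609694a-cdbc-db9b-d345-e782dbb562ed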

At this time I'm unable to restore cluster leadership and the servers won't stay up, so I'm not sure how to recover. From what I can tell, the leader that wins the election is unable to process the Raft log and crashes; a few seconds later a new leader is elected and then crashes as well. After that, the last remaining node cannot win an election and is stuck running, unable to complete one.

Reproduction steps

$ nomad run jobs/prometheus.nomad
==> No evaluation with id "dfe07a84-017b-0766-99d1-86d472265e93" found
$ nomad run -verbose jobs/prometheus.nomad
Error submitting job: Unexpected response code: 500 (No cluster leader)
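A quick way to watch the leadership flapping from the CLI (just a sketch, assuming the nomad binary can reach one of the server agents):

# Gossip members; the Leader column should show exactly one server as true
$ nomad server members
# Raft peer set; the State column should show exactly one leader
$ nomad operator raft list-peers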

Job file (if appropriate)

https://gist.github.com/dekimsey/1c7e761936bd3cfdfb749b65f43c11f7

Nomad Client logs (if appropriate)

Nomad Server logs (if appropriate)

# looked at logs and saw this:
 nomad[7351]: 2019-10-23T12:09:18.870-0500 [INFO ] nomad: cluster leadership acquired
 nomad[7351]: 2019-10-23T12:09:18.871-0500 [WARN ] nomad: raft: AppendEntries to {Voter 10.70.255.235:4647 10.70.255.235:4647} rejected, sending older logs (next: 1059856)
 nomad[7351]: 2019-10-23T12:09:18.871-0500 [INFO ] nomad: raft: pipelining replication to peer {Voter 10.70.255.230:4647 10.70.255.230:4647}
 nomad[7351]: 2019-10-23T12:09:18.888-0500 [WARN ] nomad.vault: failed to revoke tokens. Will reattempt until TTL: error="failed to revoke token (alloc: "4f88b0ba-2ee2-c684-116c-a32f20ec63c0", node: "606e8c9a-7732-f0ec-c425-24a44c98693b", task: "fabio"): Error making API request.
 nomad[7351]: URL: POST http://vault.service.consul:8200/v1/auth/token/revoke-accessor
 nomad[7351]: Code: 400. Errors:
 nomad[7351]: * 1 error occurred:
 nomad[7351]: * invalid accessor
 nomad[7351]: "
 nomad[7351]: 2019-10-23T12:09:19.125-0500 [INFO ] nomad: raft: pipelining replication to peer {Voter 10.70.255.235:4647 10.70.255.235:4647}
 nomad[7351]: panic: runtime error: invalid memory address or nil pointer dereference
 nomad[7351]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x17782af]
 nomad[7351]: goroutine 108 [running]:
 nomad[7351]: github.com/hashicorp/nomad/scheduler.genericAllocUpdateFn.func1(0xc0005b2380, 0xc0015c05a0, 0xc0010b0ee0, 0xc00118cf18, 0xc000c47440)
 nomad[7351]: /opt/gopath/src/github.com/hashicorp/nomad/scheduler/util.go:844 +0x78f
 nomad[7351]: github.com/hashicorp/nomad/scheduler.(*allocReconciler).computeUpdates(0xc0001de6e0, 0xc0010b0ee0, 0xc000c47440, 0x1, 0xc000c47440, 0xc000c47350)
 nomad[7351]: /opt/gopath/src/github.com/hashicorp/nomad/scheduler/reconcile.go:817 +0xf3
 nomad[7351]: github.com/hashicorp/nomad/scheduler.(*allocReconciler).computeGroup(0xc0001de6e0, 0xc001086e80, 0xf, 0xc000c472c0, 0xc000c46e00)
 nomad[7351]: /opt/gopath/src/github.com/hashicorp/nomad/scheduler/reconcile.go:373 +0x4ef
 nomad[7351]: github.com/hashicorp/nomad/scheduler.(*allocReconciler).Compute(0xc0001de6e0, 0xc0002f6f60)
 nomad[7351]: /opt/gopath/src/github.com/hashicorp/nomad/scheduler/reconcile.go:207 +0x173
 nomad[7351]: github.com/hashicorp/nomad/scheduler.(*GenericScheduler).computeJobAllocs(0xc0001de500, 0xc0015c05a0, 0xc002112dc0)
 nomad[7351]: /opt/gopath/src/github.com/hashicorp/nomad/scheduler/generic_sched.go:346 +0x394
 nomad[7351]: github.com/hashicorp/nomad/scheduler.(*GenericScheduler).process(0xc0001de500, 0x0, 0x2064b80, 0xc001204af0)
 nomad[7351]: /opt/gopath/src/github.com/hashicorp/nomad/scheduler/generic_sched.go:253 +0x36d
 nomad[7351]: github.com/hashicorp/nomad/scheduler.retryMax(0x5, 0xc000c33de0, 0xc000c33df0, 0xc, 0xffffffffffffffff)
 nomad[7351]: /opt/gopath/src/github.com/hashicorp/nomad/scheduler/util.go:271 +0x5d
 nomad[7351]: github.com/hashicorp/nomad/scheduler.(*GenericScheduler).Process(0xc0001de500, 0xc000019b00, 0x2ac40e0, 0xc0001ec360)
 nomad[7351]: /opt/gopath/src/github.com/hashicorp/nomad/scheduler/generic_sched.go:152 +0x2bf
 nomad[7351]: github.com/hashicorp/nomad/nomad.(*Worker).invokeScheduler(0xc0005b6310, 0xc000c466c0, 0xc000019b00, 0xc00085c420, 0x24, 0x0, 0x0)
 nomad[7351]: /opt/gopath/src/github.com/hashicorp/nomad/nomad/worker.go:268 +0x40a
 nomad[7351]: github.com/hashicorp/nomad/nomad.(*Worker).run(0xc0005b6310)
 nomad[7351]: /opt/gopath/src/github.com/hashicorp/nomad/nomad/worker.go:129 +0x2e6
 nomad[7351]: created by github.com/hashicorp/nomad/nomad.NewWorker
 nomad[7351]: /opt/gopath/src/github.com/hashicorp/nomad/nomad/worker.go:81 +0x150
 systemd[1]: nomad.service: main process exited, code=exited, status=2/INVALIDARGUMENT
 systemd[1]: Unit nomad.service entered failed state.
 systemd[1]: nomad.service failed.

Attempted to restart the node:

 nomad[45323]: ==> Loaded configuration from /etc/nomad/server.hcl, /etc/nomad/vault.hcl
 nomad[45323]: ==> Starting Nomad agent...
 nomad[45323]: ==> Nomad agent configuration:
 nomad[45323]: Advertise Addrs: HTTP: 10.70.255.234:4646; RPC: 10.70.255.234:4647; Serf: 10.70.255.234:4648
 nomad[45323]: Bind Addrs: HTTP: 0.0.0.0:4646; RPC: 0.0.0.0:4647; Serf: 0.0.0.0:4648
 nomad[45323]: Client: false
 nomad[45323]: Log Level: INFO
 nomad[45323]: Region: global (DC: devchi)
 nomad[45323]: Server: true
 nomad[45323]: Version: 0.10.0
 nomad[45323]: ==> Nomad agent started! Log data will stream in below:
 nomad[45323]: 2019-10-23T16:46:25.296-0500 [WARN ] agent.plugin_loader: skipping external plugins since plugin_dir doesn't exist: plugin_dir=/var/lib/nomad/plugins
 nomad[45323]: 2019-10-23T16:46:25.298-0500 [INFO ] agent: detected plugin: name=raw_exec type=driver plugin_version=0.1.0
 nomad[45323]: 2019-10-23T16:46:25.298-0500 [INFO ] agent: detected plugin: name=exec type=driver plugin_version=0.1.0
 nomad[45323]: 2019-10-23T16:46:25.298-0500 [INFO ] agent: detected plugin: name=qemu type=driver plugin_version=0.1.0
 nomad[45323]: 2019-10-23T16:46:25.298-0500 [INFO ] agent: detected plugin: name=java type=driver plugin_version=0.1.0
 nomad[45323]: 2019-10-23T16:46:25.298-0500 [INFO ] agent: detected plugin: name=docker type=driver plugin_version=0.1.0
 nomad[45323]: 2019-10-23T16:46:25.298-0500 [INFO ] agent: detected plugin: name=rkt type=driver plugin_version=0.1.0
 nomad[45323]: 2019-10-23T16:46:25.298-0500 [INFO ] agent: detected plugin: name=nvidia-gpu type=device plugin_version=0.1.0
 nomad[45323]: 2019-10-23T16:46:25.358-0500 [INFO ] nomad.vault: successfully renewed token: next_renewal=35h59m59.999979471s
 nomad[45323]: 2019-10-23T16:46:25.434-0500 [INFO ] nomad: raft: Restored from snapshot 470-1058215-1571827827334
 nomad[45323]: 2019-10-23T16:46:25.456-0500 [INFO ] nomad: raft: Initial configuration (index=1059856): [{Suffrage:Voter ID:10.70.255.230:4647 Address:10.70.255.230:4647} {Suffrage:Voter ID:10.70.255.234:4647 Address:10.70.255.234:4647} {Suffrage:Voter ID:10.70.255.235:4647 Address:10.70.255.235:4647}]
 nomad[45323]: 2019-10-23T16:46:25.456-0500 [INFO ] nomad: serf: Ignoring previous leave in snapshot
 nomad[45323]: 2019-10-23T16:46:25.456-0500 [INFO ] nomad: serf: Ignoring previous leave in snapshot
 nomad[45323]: 2019-10-23T16:46:25.456-0500 [INFO ] nomad: serf: Ignoring previous leave in snapshot
 nomad[45323]: 2019-10-23T16:46:25.456-0500 [INFO ] nomad: serf: Ignoring previous leave in snapshot
 nomad[45323]: 2019-10-23T16:46:25.456-0500 [INFO ] nomad: serf: Ignoring previous leave in snapshot
 nomad[45323]: 2019-10-23T16:46:25.456-0500 [INFO ] nomad: serf: Ignoring previous leave in snapshot
 nomad[45323]: 2019-10-23T16:46:25.456-0500 [INFO ] nomad: serf: Ignoring previous leave in snapshot
 nomad[45323]: 2019-10-23T16:46:25.458-0500 [INFO ] nomad: raft: Node at 10.70.255.234:4647 [Follower] entering Follower state (Leader: "")
 nomad[45323]: 2019-10-23T16:46:25.459-0500 [INFO ] nomad: serf: EventMemberJoin: devchi-inf-005.tw-test.net.global 10.70.255.234
 nomad[45323]: 2019-10-23T16:46:25.459-0500 [INFO ] nomad: starting scheduling worker(s): num_workers=2 schedulers="[service batch system _core]"
 nomad[45323]: 2019-10-23T16:46:25.459-0500 [INFO ] nomad: serf: Attempting re-join to previously known node: devchi-inf-006.tw-test.net.global: 10.70.255.230:4648
 nomad[45323]: 2019-10-23T16:46:25.459-0500 [INFO ] nomad: adding server: server="devchi-inf-005.tw-test.net.global (Addr: 10.70.255.234:4647) (DC: devchi)"
 nomad[45323]: 2019-10-23T16:46:25.462-0500 [INFO ] nomad: serf: EventMemberJoin: devchi-inf-006.tw-test.net.global 10.70.255.230
 nomad[45323]: 2019-10-23T16:46:25.462-0500 [INFO ] nomad: serf: Re-joined to previously known node: devchi-inf-006.tw-test.net.global: 10.70.255.230:4648
 nomad[45323]: 2019-10-23T16:46:25.462-0500 [INFO ] nomad: adding server: server="devchi-inf-006.tw-test.net.global (Addr: 10.70.255.230:4647) (DC: devchi)"
 nomad[45323]: 2019-10-23T16:46:26.343-0500 [WARN ] nomad: raft: Failed to get previous log: 1060216 log not found (last: 1060214)
 nomad[45323]: panic: runtime error: invalid memory address or nil pointer dereference
 nomad[45323]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x17782af]
 nomad[45323]: goroutine 105 [running]:
 nomad[45323]: github.com/hashicorp/nomad/scheduler.genericAllocUpdateFn.func1(0xc00200d340, 0xc0010883c0, 0xc0011ffc00, 0xc000642f18, 0xc0019c6210)
 nomad[45323]: /opt/gopath/src/github.com/hashicorp/nomad/scheduler/util.go:844 +0x78f
 nomad[45323]: github.com/hashicorp/nomad/scheduler.(*allocReconciler).computeUpdates(0xc0000c0f00, 0xc0011ffc00, 0xc0019c6210, 0x1, 0xc0019c6210, 0xc0019c60f0)
 nomad[45323]: /opt/gopath/src/github.com/hashicorp/nomad/scheduler/reconcile.go:817 +0xf3
 nomad[45323]: github.com/hashicorp/nomad/scheduler.(*allocReconciler).computeGroup(0xc0000c0f00, 0xc001054ba0, 0xf, 0xc0019c6060, 0xc0014f5a00)
 nomad[45323]: /opt/gopath/src/github.com/hashicorp/nomad/scheduler/reconcile.go:373 +0x4ef
 nomad[45323]: github.com/hashicorp/nomad/scheduler.(*allocReconciler).Compute(0xc0000c0f00, 0xc00155e5a0)
 nomad[45323]: /opt/gopath/src/github.com/hashicorp/nomad/scheduler/reconcile.go:207 +0x173
 nomad[45323]: github.com/hashicorp/nomad/scheduler.(*GenericScheduler).computeJobAllocs(0xc00043dea0, 0xc0010883c0, 0xc001f6e370)
 nomad[45323]: /opt/gopath/src/github.com/hashicorp/nomad/scheduler/generic_sched.go:346 +0x394
 nomad[45323]: github.com/hashicorp/nomad/scheduler.(*GenericScheduler).process(0xc00043dea0, 0x0, 0x2064b80, 0xc0016509a0)
 nomad[45323]: /opt/gopath/src/github.com/hashicorp/nomad/scheduler/generic_sched.go:253 +0x36d
 nomad[45323]: github.com/hashicorp/nomad/scheduler.retryMax(0x5, 0xc000643de0, 0xc000643df0, 0xc, 0xffffffffffffffff)
 nomad[45323]: /opt/gopath/src/github.com/hashicorp/nomad/scheduler/util.go:271 +0x5d
 nomad[45323]: github.com/hashicorp/nomad/scheduler.(*GenericScheduler).Process(0xc00043dea0, 0xc0013ca480, 0x2ac40e0, 0xc0004e0780)
 nomad[45323]: /opt/gopath/src/github.com/hashicorp/nomad/scheduler/generic_sched.go:152 +0x2bf
 nomad[45323]: github.com/hashicorp/nomad/nomad.(*Worker).invokeScheduler(0xc00042e230, 0xc0014f5020, 0xc0013ca480, 0xc000bfbec0, 0x24, 0x0, 0x0)
 nomad[45323]: /opt/gopath/src/github.com/hashicorp/nomad/nomad/worker.go:268 +0x40a
 nomad[45323]: github.com/hashicorp/nomad/nomad.(*Worker).run(0xc00042e230)
 nomad[45323]: /opt/gopath/src/github.com/hashicorp/nomad/nomad/worker.go:129 +0x2e6
 nomad[45323]: created by github.com/hashicorp/nomad/nomad.NewWorker
 nomad[45323]: /opt/gopath/src/github.com/hashicorp/nomad/nomad/worker.go:81 +0x150
 systemd[1]: nomad.service: main process exited, code=exited, status=2/INVALIDARGUMENT
 systemd[1]: Unit nomad.service entered failed state.
 systemd[1]: nomad.service failed.

Server Configuration

Environment:

OPTIONS=-data-dir=/var/lib/nomad -dc=devchi -config=/etc/nomad
VAULT_TOKEN=...

configuration:

bind_addr = "0.0.0.0"
leave_on_interrupt = true
server {
    enabled = true
    bootstrap_expect = 3
}
consul {
  token   = "..."
}
acl {
  enabled = true
}
vault {
  enabled = true
  address = "http://vault.service.consul:8200"
  allow_unauthenticated = true
  create_from_role = "nomad-cluster"
}
@dekimsey
Collaborator Author

Since this cluster is a development cluster, I'm going to leave it non-functional for the moment. If you have questions or would like more information, please let me know.

@schmichael
Member

schmichael commented Oct 23, 2019

So sorry you hit this @dekimsey, but thanks for reporting it! Obviously it's a top priority for us to address.

Is this a fresh cluster or did you upgrade from 0.9 or a 0.10 prerelease?

Do note that if you upgraded from 0.10.0-beta1 you'll need to set server.raft_protocol = 3 in your server config. I think mismatched Raft versions look different than this, though.
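For reference, that setting goes in the existing server stanza; a minimal sketch based on the configuration posted above (only relevant when upgrading from 0.10.0-beta1):

server {
  enabled          = true
  bootstrap_expect = 3
  # Only required when upgrading from 0.10.0-beta1, per the note above.
  raft_protocol    = 3
}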

@dekimsey
Collaborator Author

I think I did a rolling update from 0.8.4. No other changes were made to the configs. I think the Vault token may have expired (some jobs couldn't get their secrets), and I went and re-issued those after the upgrade.

@dekimsey
Collaborator Author

I was able to provision a new job earlier today and it worked just fine. The only difference I can see here is that I was running 'nomad run' on a job that had existed prior to the upgrade.

Note: I've updated the report to include server configuration settings.

@lgfa29
Contributor

lgfa29 commented Oct 23, 2019

Hi @dekimsey,

could you try running this version that @schmichael built with his patch and see if it solves the problem you are having?

@dekimsey
Collaborator Author

Thanks @lgfa29, I've gone home for the day. But I'll give it a try tomorrow morning!

Thank you all for the quick turnaround.

@dekimsey
Collaborator Author

That resolved it; the cluster started and appears to be healthy!

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Nov 17, 2022