
token revocation 400s: panic: runtime error: invalid memory address or nil pointer dereference #6540

Closed
dekimsey opened this issue Oct 23, 2019 · 8 comments

Comments

@dekimsey
Collaborator

dekimsey commented Oct 23, 2019

Nomad version

Nomad v0.10.0 (25ee121d951939504376c70bf8d7950c1ddb6a82)

Operating system and Environment details

CentOS 7

Issue

I'm not really sure what happened, to be honest. I was trying to run a job with 'nomad run' when the servers crashed and lost leadership. It looks like Nomad was trying (and failing) to revoke a Vault token. The logs don't tell me what the token accessor is, so I can't check the Vault audit logs to see what might be failing. Vault appears to be operating normally (I do see successful accesses).
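For what it's worth, a minimal way to cross-check an accessor from the Vault side might look like the following. This is only a sketch, not something from my logs: it assumes a Vault token with sudo capability on auth/token/accessors, and the accessor value shown is a placeholder.

# List live token accessors (the endpoint requires sudo capability)
$ vault list auth/token/accessors
# Look up a single accessor to see what it belongs to (placeholder value below)
$ vault token lookup -accessor 8609694a-cdbc-db9b-d345-e782dbb562ed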

At this time I'm unable to restore cluster leadership and the servers won't stay up, so I'm not sure how to recover. From what I can tell, the leader that wins the election is unable to process the Raft log and crashes; a few seconds later a new leader is elected and then crashes as well. After that, the last remaining node cannot win an election and is stuck running, unable to complete one.

Reproduction steps

$ nomad run jobs/prometheus.nomad
==> No evaluation with id "dfe07a84-017b-0766-99d1-86d472265e93" found
$ nomad run -verbose jobs/prometheus.nomad
Error submitting job: Unexpected response code: 500 (No cluster leader)
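A quick way to watch the leadership flapping from the CLI (just a sketch, assuming the nomad binary can reach one of the server agents):

# Gossip members; the Leader column should show exactly one server as true
$ nomad server members
# Raft peer set; the State column should show exactly one leader
$ nomad operator raft list-peers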

Job file (if appropriate)

https://gist.github.com/dekimsey/1c7e761936bd3cfdfb749b65f43c11f7

Nomad Client logs (if appropriate)

Nomad Server logs (if appropriate)

# looked at logs and saw this:
 nomad[7351]: 2019-10-23T12:09:18.870-0500 [INFO ] nomad: cluster leadership acquired
 nomad[7351]: 2019-10-23T12:09:18.871-0500 [WARN ] nomad: raft: AppendEntries to {Voter 10.70.255.235:4647 10.70.255.235:4647} rejected, sending older logs (next: 1059856)
 nomad[7351]: 2019-10-23T12:09:18.871-0500 [INFO ] nomad: raft: pipelining replication to peer {Voter 10.70.255.230:4647 10.70.255.230:4647}
 nomad[7351]: 2019-10-23T12:09:18.888-0500 [WARN ] nomad.vault: failed to revoke tokens. Will reattempt until TTL: error="failed to revoke token (alloc: "4f88b0ba-2ee2-c684-116c-a32f20ec63c0", node: "606e8c9a-7732-f0ec-c425-24a44c98693b", task: "fabio"): Error making API request.
 nomad[7351]: URL: POST http://vault.service.consul:8200/v1/auth/token/revoke-accessor
 nomad[7351]: Code: 400. Errors:
 nomad[7351]: * 1 error occurred:
 nomad[7351]: * invalid accessor
 nomad[7351]: "
 nomad[7351]: 2019-10-23T12:09:19.125-0500 [INFO ] nomad: raft: pipelining replication to peer {Voter 10.70.255.235:4647 10.70.255.235:4647}
 nomad[7351]: panic: runtime error: invalid memory address or nil pointer dereference
 nomad[7351]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x17782af]
 nomad[7351]: goroutine 108 [running]:
 nomad[7351]: github.com/hashicorp/nomad/scheduler.genericAllocUpdateFn.func1(0xc0005b2380, 0xc0015c05a0, 0xc0010b0ee0, 0xc00118cf18, 0xc000c47440)
 nomad[7351]: /opt/gopath/src/github.com/hashicorp/nomad/scheduler/util.go:844 +0x78f
 nomad[7351]: github.com/hashicorp/nomad/scheduler.(*allocReconciler).computeUpdates(0xc0001de6e0, 0xc0010b0ee0, 0xc000c47440, 0x1, 0xc000c47440, 0xc000c47350)
 nomad[7351]: /opt/gopath/src/github.com/hashicorp/nomad/scheduler/reconcile.go:817 +0xf3
 nomad[7351]: github.com/hashicorp/nomad/scheduler.(*allocReconciler).computeGroup(0xc0001de6e0, 0xc001086e80, 0xf, 0xc000c472c0, 0xc000c46e00)
 nomad[7351]: /opt/gopath/src/github.com/hashicorp/nomad/scheduler/reconcile.go:373 +0x4ef
 nomad[7351]: github.com/hashicorp/nomad/scheduler.(*allocReconciler).Compute(0xc0001de6e0, 0xc0002f6f60)
 nomad[7351]: /opt/gopath/src/github.com/hashicorp/nomad/scheduler/reconcile.go:207 +0x173
 nomad[7351]: github.com/hashicorp/nomad/scheduler.(*GenericScheduler).computeJobAllocs(0xc0001de500, 0xc0015c05a0, 0xc002112dc0)
 nomad[7351]: /opt/gopath/src/github.com/hashicorp/nomad/scheduler/generic_sched.go:346 +0x394
 nomad[7351]: github.com/hashicorp/nomad/scheduler.(*GenericScheduler).process(0xc0001de500, 0x0, 0x2064b80, 0xc001204af0)
 nomad[7351]: /opt/gopath/src/github.com/hashicorp/nomad/scheduler/generic_sched.go:253 +0x36d
 nomad[7351]: github.com/hashicorp/nomad/scheduler.retryMax(0x5, 0xc000c33de0, 0xc000c33df0, 0xc, 0xffffffffffffffff)
 nomad[7351]: /opt/gopath/src/github.com/hashicorp/nomad/scheduler/util.go:271 +0x5d
 nomad[7351]: github.com/hashicorp/nomad/scheduler.(*GenericScheduler).Process(0xc0001de500, 0xc000019b00, 0x2ac40e0, 0xc0001ec360)
 nomad[7351]: /opt/gopath/src/github.com/hashicorp/nomad/scheduler/generic_sched.go:152 +0x2bf
 nomad[7351]: github.com/hashicorp/nomad/nomad.(*Worker).invokeScheduler(0xc0005b6310, 0xc000c466c0, 0xc000019b00, 0xc00085c420, 0x24, 0x0, 0x0)
 nomad[7351]: /opt/gopath/src/github.com/hashicorp/nomad/nomad/worker.go:268 +0x40a
 nomad[7351]: github.com/hashicorp/nomad/nomad.(*Worker).run(0xc0005b6310)
 nomad[7351]: /opt/gopath/src/github.com/hashicorp/nomad/nomad/worker.go:129 +0x2e6
 nomad[7351]: created by github.com/hashicorp/nomad/nomad.NewWorker
 nomad[7351]: /opt/gopath/src/github.com/hashicorp/nomad/nomad/worker.go:81 +0x150
 systemd[1]: nomad.service: main process exited, code=exited, status=2/INVALIDARGUMENT
 systemd[1]: Unit nomad.service entered failed state.
 systemd[1]: nomad.service failed.

Attempted to restart the node:

 nomad[45323]: ==> Loaded configuration from /etc/nomad/server.hcl, /etc/nomad/vault.hcl
 nomad[45323]: ==> Starting Nomad agent...
 nomad[45323]: ==> Nomad agent configuration:
 nomad[45323]: Advertise Addrs: HTTP: 10.70.255.234:4646; RPC: 10.70.255.234:4647; Serf: 10.70.255.234:4648
 nomad[45323]: Bind Addrs: HTTP: 0.0.0.0:4646; RPC: 0.0.0.0:4647; Serf: 0.0.0.0:4648
 nomad[45323]: Client: false
 nomad[45323]: Log Level: INFO
 nomad[45323]: Region: global (DC: devchi)
 nomad[45323]: Server: true
 nomad[45323]: Version: 0.10.0
 nomad[45323]: ==> Nomad agent started! Log data will stream in below:
 nomad[45323]: 2019-10-23T16:46:25.296-0500 [WARN ] agent.plugin_loader: skipping external plugins since plugin_dir doesn't exist: plugin_dir=/var/lib/nomad/plugins
 nomad[45323]: 2019-10-23T16:46:25.298-0500 [INFO ] agent: detected plugin: name=raw_exec type=driver plugin_version=0.1.0
 nomad[45323]: 2019-10-23T16:46:25.298-0500 [INFO ] agent: detected plugin: name=exec type=driver plugin_version=0.1.0
 nomad[45323]: 2019-10-23T16:46:25.298-0500 [INFO ] agent: detected plugin: name=qemu type=driver plugin_version=0.1.0
 nomad[45323]: 2019-10-23T16:46:25.298-0500 [INFO ] agent: detected plugin: name=java type=driver plugin_version=0.1.0
 nomad[45323]: 2019-10-23T16:46:25.298-0500 [INFO ] agent: detected plugin: name=docker type=driver plugin_version=0.1.0
 nomad[45323]: 2019-10-23T16:46:25.298-0500 [INFO ] agent: detected plugin: name=rkt type=driver plugin_version=0.1.0
 nomad[45323]: 2019-10-23T16:46:25.298-0500 [INFO ] agent: detected plugin: name=nvidia-gpu type=device plugin_version=0.1.0
 nomad[45323]: 2019-10-23T16:46:25.358-0500 [INFO ] nomad.vault: successfully renewed token: next_renewal=35h59m59.999979471s
 nomad[45323]: 2019-10-23T16:46:25.434-0500 [INFO ] nomad: raft: Restored from snapshot 470-1058215-1571827827334
 nomad[45323]: 2019-10-23T16:46:25.456-0500 [INFO ] nomad: raft: Initial configuration (index=1059856): [{Suffrage:Voter ID:10.70.255.230:4647 Address:10.70.255.230:4647} {Suffrage:Voter ID:10.70.255.234:4647 Address:10.70.255.234:4647} {Suffrage:Voter ID:10.70.255.235:4647 Address:10.70.255.235:4647}]
 nomad[45323]: 2019-10-23T16:46:25.456-0500 [INFO ] nomad: serf: Ignoring previous leave in snapshot
 nomad[45323]: 2019-10-23T16:46:25.456-0500 [INFO ] nomad: serf: Ignoring previous leave in snapshot
 nomad[45323]: 2019-10-23T16:46:25.456-0500 [INFO ] nomad: serf: Ignoring previous leave in snapshot
 nomad[45323]: 2019-10-23T16:46:25.456-0500 [INFO ] nomad: serf: Ignoring previous leave in snapshot
 nomad[45323]: 2019-10-23T16:46:25.456-0500 [INFO ] nomad: serf: Ignoring previous leave in snapshot
 nomad[45323]: 2019-10-23T16:46:25.456-0500 [INFO ] nomad: serf: Ignoring previous leave in snapshot
 nomad[45323]: 2019-10-23T16:46:25.456-0500 [INFO ] nomad: serf: Ignoring previous leave in snapshot
 nomad[45323]: 2019-10-23T16:46:25.458-0500 [INFO ] nomad: raft: Node at 10.70.255.234:4647 [Follower] entering Follower state (Leader: "")
 nomad[45323]: 2019-10-23T16:46:25.459-0500 [INFO ] nomad: serf: EventMemberJoin: devchi-inf-005.tw-test.net.global 10.70.255.234
 nomad[45323]: 2019-10-23T16:46:25.459-0500 [INFO ] nomad: starting scheduling worker(s): num_workers=2 schedulers="[service batch system _core]"
 nomad[45323]: 2019-10-23T16:46:25.459-0500 [INFO ] nomad: serf: Attempting re-join to previously known node: devchi-inf-006.tw-test.net.global: 10.70.255.230:4648
 nomad[45323]: 2019-10-23T16:46:25.459-0500 [INFO ] nomad: adding server: server="devchi-inf-005.tw-test.net.global (Addr: 10.70.255.234:4647) (DC: devchi)"
 nomad[45323]: 2019-10-23T16:46:25.462-0500 [INFO ] nomad: serf: EventMemberJoin: devchi-inf-006.tw-test.net.global 10.70.255.230
 nomad[45323]: 2019-10-23T16:46:25.462-0500 [INFO ] nomad: serf: Re-joined to previously known node: devchi-inf-006.tw-test.net.global: 10.70.255.230:4648
 nomad[45323]: 2019-10-23T16:46:25.462-0500 [INFO ] nomad: adding server: server="devchi-inf-006.tw-test.net.global (Addr: 10.70.255.230:4647) (DC: devchi)"
 nomad[45323]: 2019-10-23T16:46:26.343-0500 [WARN ] nomad: raft: Failed to get previous log: 1060216 log not found (last: 1060214)
 nomad[45323]: panic: runtime error: invalid memory address or nil pointer dereference
 nomad[45323]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x17782af]
 nomad[45323]: goroutine 105 [running]:
 nomad[45323]: github.com/hashicorp/nomad/scheduler.genericAllocUpdateFn.func1(0xc00200d340, 0xc0010883c0, 0xc0011ffc00, 0xc000642f18, 0xc0019c6210)
 nomad[45323]: /opt/gopath/src/github.com/hashicorp/nomad/scheduler/util.go:844 +0x78f
 nomad[45323]: github.com/hashicorp/nomad/scheduler.(*allocReconciler).computeUpdates(0xc0000c0f00, 0xc0011ffc00, 0xc0019c6210, 0x1, 0xc0019c6210, 0xc0019c60f0)
 nomad[45323]: /opt/gopath/src/github.com/hashicorp/nomad/scheduler/reconcile.go:817 +0xf3
 nomad[45323]: github.com/hashicorp/nomad/scheduler.(*allocReconciler).computeGroup(0xc0000c0f00, 0xc001054ba0, 0xf, 0xc0019c6060, 0xc0014f5a00)
 nomad[45323]: /opt/gopath/src/github.com/hashicorp/nomad/scheduler/reconcile.go:373 +0x4ef
 nomad[45323]: github.com/hashicorp/nomad/scheduler.(*allocReconciler).Compute(0xc0000c0f00, 0xc00155e5a0)
 nomad[45323]: /opt/gopath/src/github.com/hashicorp/nomad/scheduler/reconcile.go:207 +0x173
 nomad[45323]: github.com/hashicorp/nomad/scheduler.(*GenericScheduler).computeJobAllocs(0xc00043dea0, 0xc0010883c0, 0xc001f6e370)
 nomad[45323]: /opt/gopath/src/github.com/hashicorp/nomad/scheduler/generic_sched.go:346 +0x394
 nomad[45323]: github.com/hashicorp/nomad/scheduler.(*GenericScheduler).process(0xc00043dea0, 0x0, 0x2064b80, 0xc0016509a0)
 nomad[45323]: /opt/gopath/src/github.com/hashicorp/nomad/scheduler/generic_sched.go:253 +0x36d
 nomad[45323]: github.com/hashicorp/nomad/scheduler.retryMax(0x5, 0xc000643de0, 0xc000643df0, 0xc, 0xffffffffffffffff)
 nomad[45323]: /opt/gopath/src/github.com/hashicorp/nomad/scheduler/util.go:271 +0x5d
 nomad[45323]: github.com/hashicorp/nomad/scheduler.(*GenericScheduler).Process(0xc00043dea0, 0xc0013ca480, 0x2ac40e0, 0xc0004e0780)
 nomad[45323]: /opt/gopath/src/github.com/hashicorp/nomad/scheduler/generic_sched.go:152 +0x2bf
 nomad[45323]: github.com/hashicorp/nomad/nomad.(*Worker).invokeScheduler(0xc00042e230, 0xc0014f5020, 0xc0013ca480, 0xc000bfbec0, 0x24, 0x0, 0x0)
 nomad[45323]: /opt/gopath/src/github.com/hashicorp/nomad/nomad/worker.go:268 +0x40a
 nomad[45323]: github.com/hashicorp/nomad/nomad.(*Worker).run(0xc00042e230)
 nomad[45323]: /opt/gopath/src/github.com/hashicorp/nomad/nomad/worker.go:129 +0x2e6
 nomad[45323]: created by github.com/hashicorp/nomad/nomad.NewWorker
 nomad[45323]: /opt/gopath/src/github.com/hashicorp/nomad/nomad/worker.go:81 +0x150
 systemd[1]: nomad.service: main process exited, code=exited, status=2/INVALIDARGUMENT
 systemd[1]: Unit nomad.service entered failed state.
 systemd[1]: nomad.service failed.

Server Configuration

Environment:

OPTIONS=-data-dir=/var/lib/nomad -dc=devchi -config=/etc/nomad
VAULT_TOKEN=...

configuration:

bind_addr = "0.0.0.0"
leave_on_interrupt = true
server {
    enabled = true
    bootstrap_expect = 3
}
consul {
  token   = "..."
}
acl {
  enabled = true
}
vault {
  enabled = true
  address = "http://vault.service.consul:8200"
  allow_unauthenticated = true
  create_from_role = "nomad-cluster"
}
@dekimsey
Collaborator Author

Since this cluster is a development cluster, I'm going to leave it non-functional for the moment. If you have questions or would like more information, please let me know.

@schmichael
Member

schmichael commented Oct 23, 2019

So sorry you hit this @dekimsey, but thanks for reporting it! Obviously it's a top priority for us to address.

Is this a fresh cluster or did you upgrade from 0.9 or a 0.10 prerelease?

Do note that if you upgraded from 0.10.0-beta1 you'll need to set server.raft_protocol = 3 in your server config. I think mismatched Raft versions look different than this, though.
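For reference, that setting goes in the existing server stanza; a minimal sketch based on the configuration posted above (only relevant when upgrading from 0.10.0-beta1):

server {
  enabled          = true
  bootstrap_expect = 3
  # Only required when upgrading from 0.10.0-beta1, per the note above.
  raft_protocol    = 3
}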

@dekimsey
Collaborator Author

I think I did a rolling update from 0.8.4. No other changes were made to the configs. I think the Vault token may have expired (some jobs couldn't get their secrets), and I went and re-issued those after the upgrade.

@dekimsey
Collaborator Author

I was able to provision a new job earlier today and it worked just fine. The only difference I can see here is that I was running 'nomad run' on a job that had existed prior to the upgrade.

Note: I've updated the report to include server configuration settings.

@lgfa29
Contributor

lgfa29 commented Oct 23, 2019

Hi @dekimsey,

could you try running this version that @schmichael built with his patch and see if it solves the problem you are having?

@dekimsey
Collaborator Author

Thanks @lgfa29, I've gone home for the day. But I'll give it a try tomorrow morning!

Thank you all for the quick turnaround.

@dekimsey
Collaborator Author

That resolved it; the cluster started and appears to be healthy!

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Nov 17, 2022