Nomad v1.4.3 crashes with 'panic: runtime error: invalid memory address or nil pointer dereference' #17310

Closed
brian-regrow opened this issue May 25, 2023 · 1 comment · Fixed by #17316

@brian-regrow

Nomad version

v1.4.3
Autoscaler version - v0.3.7

Operating system and Environment details

Ubuntu 20.04
GCP - 3 server cluster, ~180 client nodes, ~1300 allocations.
Servers are n2-standard-16 - 16 CPUs, 64 GB RAM, 200 GB SSD. CPU, memory, network, and disk usage are all low, around 20% at most.
Config files, logs leading to crash and startup logs attached.

Issue

Nomad server crashes, cluster loses a node.
The issue occurred as Nomad was scaling in a node class from 144 -> 126 over a period of 7m.
But this may be a red herring, as we regularly have large scaling events. Shortly afterwards, the impaired cluster scaled the node class from 150 -> 5 nodes over a period of 10m without issue.

Reproduction steps

Cannot reproduce; this is the first time it has happened.

Expected Result

Nomad server does not crash.

Actual Result

Nomad server crashes, cluster loses a node.

Job file (if appropriate)

Many jobs running; the large node class is running a CPU-intensive batch job.
The drain strategy is empty_ignore_system.

Nomad Server logs (if appropriate)

Will send logs and config files by email.

@tgross
Member

tgross commented May 25, 2023

Hi @brian-regrow! A crash is definitely a bug. I've had a quick look at the server logs you sent, and the crash happens when we're constructing the information about the node for the response back to the client (ref node_endpoint.go#L276). It appears that the node is missing from the state store, even though this is the Node.Register request that writes the node! As far as I can tell, this bug is still present in 1.5.6 as well.

The relevant redacted portion of your server logs is as follows:

May 25 02:50:54 $redacted nomad[5215]: panic: runtime error: invalid memory address or nil pointer dereference
May 25 02:50:54 $redacted nomad[5215]: [signal SIGSEGV: segmentation violation code=0x1 addr=0xc8 pc=0x1c087f0]
May 25 02:50:54 $redacted nomad[5215]: goroutine 167726362 [running]:
May 25 02:50:54 $redacted nomad[5215]: github.com/hashicorp/nomad/nomad.(*Node).constructNodeServerInfoResponse(0xc000820850, {0xc17e7cd8c0, 0x24}, 0xc147cf9570, 0xc16fda6c80)
May 25 02:50:54 $redacted nomad[5215]:         github.com/hashicorp/nomad/nomad/node_endpoint.go:276 +0x2b0
May 25 02:50:54 $redacted nomad[5215]: github.com/hashicorp/nomad/nomad.(*Node).Register(0xc000820850, 0xc1657d0660, 0xc16fda6c80)
May 25 02:50:54 $redacted nomad[5215]:         github.com/hashicorp/nomad/nomad/node_endpoint.go:202 +0xb2a
May 25 02:50:54 $redacted nomad[5215]: reflect.Value.call({0xc0007066c0?, 0xc001e7b0f8?, 0x7fedaaa163c8?}, {0x2b6754a, 0x4}, {0xc147cf9de0, 0x3, 0xc147cf9cd8?})
May 25 02:50:54 $redacted nomad[5215]:         reflect/value.go:584 +0x8c5
May 25 02:50:54 $redacted nomad[5215]: reflect.Value.Call({0xc0007066c0?, 0xc001e7b0f8?, 0x2a0a440?}, {0xc147cf9de0?, 0xc147cf9e30?, 0x989678?})
May 25 02:50:54 $redacted nomad[5215]:         reflect/value.go:368 +0xbc
May 25 02:50:54 $redacted nomad[5215]: net/rpc.(*service).call(0xc0c565d900, 0xc0005a2660?, 0xc0005a2660?, 0x0, 0xc0b97b2600, 0xc12aed0210?, {0x2a0a440?, 0xc1657d0660?, 0xc0ea466680?}, {0x27804c0, ...}, ...)
May 25 02:50:54 $redacted nomad[5215]:         net/rpc/server.go:382 +0x226
May 25 02:50:54 $redacted nomad[5215]: net/rpc.(*Server).ServeRequest(0xc147cf9f68?, {0x320e230, 0xc135358600})
May 25 02:50:54 $redacted nomad[5215]:         net/rpc/server.go:503 +0x18c
May 25 02:50:54 $redacted nomad[5215]: github.com/hashicorp/nomad/nomad.(*rpcHandler).handleNomadConn(0xc00064d1c0, {0x320d318, 0xc083040600}, {0x321a260?, 0xc0bbec5860}, 0x0?)
May 25 02:50:54 $redacted nomad[5215]:         github.com/hashicorp/nomad/nomad/rpc.go:418 +0x1bf
May 25 02:50:54 $redacted nomad[5215]: created by github.com/hashicorp/nomad/nomad.(*rpcHandler).handleMultiplexV2
May 25 02:50:54 $redacted nomad[5215]:         github.com/hashicorp/nomad/nomad/rpc.go:523 +0x4b8
May 25 02:50:55 $redacted systemd[1]: nomad.service: Main process exited, code=exited, status=2/INVALIDARGUMENT

My hunch is that during the scale in we deregistered a node from the state store concurrently with a fingerprint update from that same node. We'll investigate further and circle back here. Thanks!
