panic in 1.2.0 and 1.2.1 in scheduler for system jobs with class constraints #11563

Closed
dcarbone opened this issue Nov 24, 2021 · 8 comments · Fixed by #11565

@dcarbone

dcarbone commented Nov 24, 2021

Nomad version

Nomad v1.2.1 (719c53ac0ebee95d902faafe59a30422a091bc31)

Operating system and Environment details

Linux 5.11.0-1022-raspi #24-Ubuntu aarch64

Issue

Server nodes panic repeatedly some time after boot.

Reproduction steps

Unsure exactly. I've been experiencing random instability since upgrading to v1.2.1, and now the server never gets past the boot stage.

Nomad Server logs (if appropriate)

Nov 24 06:54:16 nomad[2559]: panic: assignment to entry in nil map
Nov 24 06:54:16 nomad[2559]: goroutine 85 [running]:
Nov 24 06:54:16 nomad[2559]: github.com/hashicorp/nomad/scheduler.mergeNodeFiltered(0x4002666d20, 0x4002666dc0)
Nov 24 06:54:16 nomad[2559]:         github.com/hashicorp/nomad/scheduler/scheduler_system.go:291 +0xdc
Nov 24 06:54:16 nomad[2559]: github.com/hashicorp/nomad/scheduler.(*SystemScheduler).computePlacements(0x40000e8840, {0x400038c880, 0x4, 0x4})
Nov 24 06:54:16 nomad[2559]:         github.com/hashicorp/nomad/scheduler/scheduler_system.go:341 +0x804
Nov 24 06:54:16 nomad[2559]: github.com/hashicorp/nomad/scheduler.(*SystemScheduler).computeJobAllocs(0x40000e8840)
Nov 24 06:54:16 nomad[2559]:         github.com/hashicorp/nomad/scheduler/scheduler_system.go:280 +0x930
Nov 24 06:54:16 nomad[2559]: github.com/hashicorp/nomad/scheduler.(*SystemScheduler).process(0x40000e8840)
Nov 24 06:54:16 nomad[2559]:         github.com/hashicorp/nomad/scheduler/scheduler_system.go:148 +0x4cc
Nov 24 06:54:16 nomad[2559]: github.com/hashicorp/nomad/scheduler.retryMax(0x5, 0x400054f808, 0x400054f7f8)
Nov 24 06:54:16 nomad[2559]:         github.com/hashicorp/nomad/scheduler/util.go:322 +0x44
Nov 24 06:54:16 nomad[2559]: github.com/hashicorp/nomad/scheduler.(*SystemScheduler).Process(0x40000e8840, 0x4000b16c00)
Nov 24 06:54:16 nomad[2559]:         github.com/hashicorp/nomad/scheduler/scheduler_system.go:94 +0x5e4
Nov 24 06:54:16 nomad[2559]: github.com/hashicorp/nomad/nomad.(*nomadFSM).reconcileQueuedAllocations(0x4000504540, 0xd0d25)
Nov 24 06:54:16 nomad[2559]:         github.com/hashicorp/nomad/nomad/fsm.go:1789 +0x47c
Nov 24 06:54:16 nomad[2559]: github.com/hashicorp/nomad/nomad.(*nomadFSM).applyReconcileSummaries(0x4000504540, {0x4001ee80b1, 0xa2, 0xa2}, 0xd0d25)
Nov 24 06:54:16 nomad[2559]:         github.com/hashicorp/nomad/nomad/fsm.go:895 +0x74
Nov 24 06:54:16 nomad[2559]: github.com/hashicorp/nomad/nomad.(*nomadFSM).Apply(0x4000504540, 0x400211eb90)
Nov 24 06:54:16 nomad[2559]:         github.com/hashicorp/nomad/nomad/fsm.go:231 +0x52c
Nov 24 06:54:16 nomad[2559]: github.com/hashicorp/raft.(*Raft).runFSM.func1(0x4002112d30)
Nov 24 06:54:16 nomad[2559]:         github.com/hashicorp/raft@v1.1.3-0.20200211192230-365023de17e6/fsm.go:90 +0x200
Nov 24 06:54:16 nomad[2559]: github.com/hashicorp/raft.(*Raft).runFSM.func2({0x400046c400, 0x40, 0x40})
Nov 24 06:54:16 nomad[2559]:         github.com/hashicorp/raft@v1.1.3-0.20200211192230-365023de17e6/fsm.go:113 +0x478
Nov 24 06:54:16 nomad[2559]: github.com/hashicorp/raft.(*Raft).runFSM(0x400093a000)
Nov 24 06:54:16 nomad[2559]:         github.com/hashicorp/raft@v1.1.3-0.20200211192230-365023de17e6/fsm.go:219 +0x278
Nov 24 06:54:16 nomad[2559]: github.com/hashicorp/raft.(*raftState).goFunc.func1(0x400093a000, 0x4000a66810)
Nov 24 06:54:16 nomad[2559]:         github.com/hashicorp/raft@v1.1.3-0.20200211192230-365023de17e6/state.go:146 +0x58
Nov 24 06:54:16 nomad[2559]: created by github.com/hashicorp/raft.(*raftState).goFunc
Nov 24 06:54:16 nomad[2559]:         github.com/hashicorp/raft@v1.1.3-0.20200211192230-365023de17e6/state.go:144 +0x60
@dcarbone
Author

dcarbone commented Nov 24, 2021

If it helps, I've been experiencing an issue very similar to #7743. I tried to experiment with the driver-host-path csi plugin over a year ago in an attempt to familiarize myself with utilizing CSI plugins with Nomad. This did not yield useful results, and so I tried to delete it. This yielded a very unfortunate scenario where now my servers spew several hundred lines of

Nov 24 07:06:24 nomad[2843]:     2021-11-24T07:06:24.091Z [ERROR] nomad.fsm: deregistering job failed: job=csi-plugin error="DeleteJob failed: deleting job from plugin: plugin missing: hostpath-plugin0 <nil>"

upon each boot, as well as when I assume some GC routine attempts to reap this now entirely stuck plugin job. The panic always happens after a few hundred of these have been spit out.

@dcarbone
Author

Additionally, downgrading back to v1.1.8 allows the servers to function once again. The fsm errors are still there, but the panic is gone.

dcarbone changed the title from "arm64 panic loop at startup" to "Server node arm64 panic loop at startup" on Nov 24, 2021
@tgross
Member

tgross commented Nov 24, 2021

Hi @dcarbone! Sorry to hear about your trouble. It looks like the panic bug was introduced in 41b853b which shipped in 1.2.0. When we're creating the AllocMetrics object, it's not getting correctly populated with its ClassFiltered map.

It looks like this won't just impact ARM64 and you were just the unlucky first reporter because your cluster has classes to filter. We'll get a patch up ASAP.
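
For readers unfamiliar with this class of bug, here is a minimal Go sketch of the failure mode and a defensive merge. It is not the actual Nomad patch: `allocMetric` and `mergeClassFiltered` are hypothetical stand-ins, and only the `ClassFiltered` field name is taken from the comment above. Reading from a nil map is safe in Go, but writing to one panics with "assignment to entry in nil map", which is the message in the reported trace.

```go
package main

import "fmt"

// allocMetric is a hypothetical, trimmed-down stand-in for the scheduler's
// metrics object; only the map involved in the panic is modeled.
type allocMetric struct {
	ClassFiltered map[string]int
}

// mergeClassFiltered adds src's per-class counts into dst.
// It allocates dst.ClassFiltered on demand, so a zero-value dst is safe.
func mergeClassFiltered(dst, src *allocMetric) {
	if dst == nil || src == nil {
		return
	}
	if dst.ClassFiltered == nil && len(src.ClassFiltered) > 0 {
		dst.ClassFiltered = make(map[string]int, len(src.ClassFiltered))
	}
	for class, n := range src.ClassFiltered {
		dst.ClassFiltered[class] += n
	}
}

func main() {
	var unset map[string]int
	// Reading a nil map is allowed and yields the zero value.
	fmt.Println(unset["foo"])
	// Writing would panic with "assignment to entry in nil map":
	// unset["foo"]++

	dst := &allocMetric{} // ClassFiltered was never initialized
	src := &allocMetric{ClassFiltered: map[string]int{"foo": 1}}
	mergeClassFiltered(dst, src)
	fmt.Println(dst.ClassFiltered["foo"])
}
```

The alternative is to initialize every map when the object is constructed. Guarding at the merge site, as sketched here, also covers objects that were built elsewhere without their maps.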

tgross changed the title from "Server node arm64 panic loop at startup" to "panic in scheduler on nil ClassFilter map" on Nov 24, 2021
tgross changed the title from "panic in scheduler on nil ClassFilter map" to "panic in 1.2.0+ in scheduler on nil ClassFilter map" on Nov 24, 2021
tgross self-assigned this Nov 24, 2021
@tgross
Member

tgross commented Nov 24, 2021

Ok, I was able to reproduce this on Nomad 1.2.0 in the following circumstances:

  • Some subset of nodes has a class
  • Some subset of nodes does not have that class
  • A system job requires a class

If the system job is rejected for all nodes or accepted for all nodes, we don't hit this code path, which probably explains why testing unfortunately didn't catch it. (One more reason to resurrect the prop testing PR #8832.) There's another map right after this point in the code that can probably be hit as well, so patching just this bug would undoubtedly reveal another panic there. I'll fix them both.
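
As an illustration of how property-style testing could surface this kind of bug, here is a small Go sketch using the standard library's `testing/quick` package. It does not reflect the contents of PR #8832; `mergeCounts` is a hypothetical helper, and the property checked is simply "merging never panics and every source count is added, whether or not the destination map was initialized". Because `testing/quick` does not generate nil maps on its own, the sketch uses an extra boolean input to force the nil case.

```go
package main

import (
	"fmt"
	"testing/quick"
)

// mergeCounts is a hypothetical nil-safe merge: it adds src's counts
// into dst, allocating dst first if it is nil, and returns the result.
func mergeCounts(dst, src map[string]int) map[string]int {
	if dst == nil {
		dst = make(map[string]int, len(src))
	}
	for k, n := range src {
		dst[k] += n
	}
	return dst
}

func main() {
	property := func(dstIsNil bool, dst, src map[string]int) (ok bool) {
		// Treat any panic as a property failure instead of crashing.
		defer func() {
			if recover() != nil {
				ok = false
			}
		}()
		if dstIsNil {
			dst = nil // force the uninitialized-map case
		}
		// Snapshot dst because mergeCounts mutates it in place.
		before := make(map[string]int, len(dst))
		for k, n := range dst {
			before[k] = n
		}
		out := mergeCounts(dst, src)
		for k, n := range src {
			if out[k] != before[k]+n {
				return false
			}
		}
		return true
	}

	if err := quick.Check(property, nil); err != nil {
		fmt.Println("property failed:", err)
		return
	}
	fmt.Println("property held")
}
```

Randomizing whether the destination is initialized exercises the mixed case described above, which fixed test inputs that accept or reject every node would not reach.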

To reproduce on a Vagrant box, run two Nomad processes. One server + client config without a node class:

log_level  = "debug"
data_dir   = "/var/nomad/data"
bind_addr  = "0.0.0.0"
plugin_dir = "/opt/nomad/plugins"

server {
  enabled          = true
  bootstrap_expect = 1
  raft_protocol    = 3
}

client {
  enabled = true
  # node_class = # not enabled!
}

And one client with a node class:

log_level  = "debug"
data_dir   = "/var/nomad-client01/data"
bind_addr  = "0.0.0.0"
plugin_dir = "/opt/nomad/plugins"

server {
  enabled = false
}

client {
  enabled    = true
  node_class = "foo"
  servers    = ["10.0.2.15:4647"]
}

ports {
  http = 5646
  rpc  = 5647
  serf = 5648
}

Then run the following jobspec:

job "example" {
  datacenters = ["dc1"]
  type        = "system"

  group "web" {

    constraint {
      attribute = "${node.class}"
      value     = "fuzz"
    }

    task "http" {
      driver = "docker"
      config {
        image   = "busybox:1"
        command = "httpd"
        args    = ["-v", "-f", "-p", "8001", "-h", "/var/www"]
      }
    }
  }
}

@tgross
Member

tgross commented Nov 24, 2021

@dcarbone I've opened #11565 with the patch. I'll update here when I have a better idea of when that'll ship.

tgross changed the title from "panic in 1.2.0+ in scheduler on nil ClassFilter map" to "panic in 1.2.0+ in scheduler for system jobs with class constraints" on Nov 24, 2021
tgross added this to the 1.2.2 milestone Nov 24, 2021
jrasell pinned this issue Nov 24, 2021
tgross modified the milestones: 1.2.3, 1.2.2 Nov 24, 2021
@tgross
Member

tgross commented Nov 24, 2021

Looks like we're on track to get this fixed a bit later today. Thanks again for the report, @dcarbone

@dcarbone
Author

awesome, thanks for the lightning fast fix!

tgross changed the title from "panic in 1.2.0+ in scheduler for system jobs with class constraints" to "panic in 1.2.0 and 1.2.1 in scheduler for system jobs with class constraints" on Nov 29, 2021
tgross unpinned this issue Dec 14, 2021
@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Oct 14, 2022