
Scheduler panic at evaluation time if there is a client with max_dynamic_port == min_dynamic_port #17585

Closed
stswidwinski opened this issue Jun 18, 2023 · 4 comments · Fixed by #17619


stswidwinski commented Jun 18, 2023

Nomad version

Tip (a development build from the main branch).

Operating system and Environment details

Unix; the exact OS doesn't matter.

Issue

The scheduler panics during evaluation if there exists a client whose max_dynamic_port is the same as its min_dynamic_port. The documentation does not specify whether these bounds are exclusive or inclusive, but the implementation of the rand function in Go does: math/rand.Intn panics when its argument is not positive.

The answer is: "don't misconfigure your clients" (and I agree), but it would be nice if the client itself validated the configuration and crashed if the configuration would produce unexpected results. A corollary of the above is that it seems hard to disable dynamic port allocation entirely.
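For context on why this is a panic at all: Go's math/rand.Intn is documented to panic when its argument is not positive, so a zero-width port range turns into a hard crash rather than a validation error. A minimal standalone sketch of the failure mode (not Nomad code; the exact arithmetic inside the scheduler is an assumption here):

package main

import (
	"fmt"
	"math/rand"
)

func main() {
	minPort, maxPort := 20000, 20000

	// With min == max the width of the range is zero, and rand.Intn(0)
	// panics with "invalid argument to Intn", the same error string
	// seen in the scheduler stack trace below.
	width := maxPort - minPort
	fmt.Println(minPort + rand.Intn(width))
}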

Reproduction steps

# Start a nomad server
$ cat server_config.hcl
data_dir = "/tmp/nomad/server"
log_level = "TRACE"
  
advertise {
  http = "127.0.0.1"
  rpc = "127.0.0.1"
  serf = "127.0.0.1"
}
  
server {
  enabled = true
  bootstrap_expect = 1
}
$ ./nomad agent -config server_config.hcl
 
# Start a nomad client
#
## It is important that max_dynamic_port == min_dynamic_port
$ cat client_config.hcl
data_dir = "/tmp/nomad/client-1"
log_level = "debug"
  
advertise {
  http = "127.0.0.1"
  rpc = "127.0.0.1"
  serf = "127.0.0.1"
}
  
ports {
  http = "9876"
  rpc = "9875"
  serf = "9874"
}
  
client {
  enabled = true
  servers = ["127.0.0.1"]
  max_dynamic_port = 20000
  min_dynamic_port = 20000
}
  
plugin "raw_exec" {
  config {
    enabled = true
  }
}
$ ./nomad agent -config client_config.hcl

The cluster will start without any issues:

$ ./nomad node status
ID        DC   Name             Class   Drain  Eligibility  Status
003e7e01  dc1  <name>           <none>  false  eligible     ready

Now, start a job that dynamically allocates a port:

$ cat one_time_job.hcl
job "example" {
  datacenters = ["dc1"]
  type = "batch"
 
  group "test-group" {
    network {
      port "http" {}
    }
 
    task "test-task" {
      driver = "raw_exec"
 
  
      config {
        command = "bash"
        args = [ "-c", "env; sleep 1000" ]
      }
    }
  }
}
$ ./nomad run one_time_job.hcl

The result is a scheduler panic:

    2023-06-18T09:40:57.428-0400 [ERROR] worker.batch_sched: processing eval panicked scheduler - please report this as a bug!: eval_id=f160c437-5fb2-9e4f-c92c-3f97fc3fe6e6 job_id=example namespace=default worker_id=f1734152-7264-9075-8725-2aa6ab63d4c5 eval_id=f160c437-5fb2-9e4f-c92c-3f97fc3fe6e6 error="invalid argument to Intn"
  stack_trace=
  | goroutine 229 [running]:
  | runtime/debug.Stack()
  | \truntime/debug/stack.go:24 +0x65
  | github.com/hashicorp/nomad/scheduler.(*GenericScheduler).Process.func1()
  | \tgit.luolix.top/hashicorp/nomad/scheduler/generic_sched.go:149 +0x66
  | panic({0x2758d80, 0x34c0bf0})
  | \truntime/panic.go:884 +0x213
  | math/rand.(*Rand).Intn(0x41ab50?, 0xc00115dc01?)
  | \tmath/rand/rand.go:179 +0x65
  | math/rand.Intn(...)
  | \tmath/rand/rand.go:358
  | github.com/hashicorp/nomad/nomad/structs.getDynamicPortsStochastic({0xc001542000, 0x2000, 0xf4e589?}, {0x0?, 0x0, 0x0?}, 0x4e20, 0x4e20, {0x0, 0x0, ...}, ...)
  | \tgit.luolix.top/hashicorp/nomad/nomad/structs/network.go:749 +0x325 
  | github.com/hashicorp/nomad/nomad/structs.(*NetworkIndex).AssignPorts(0xc00115e9b0, 0xc0003dc900)
  | \tgit.luolix.top/hashicorp/nomad/nomad/structs/network.go:560 +0xb4d 
  | github.com/hashicorp/nomad/scheduler.(*BinPackIterator).Next(0xc00074c000)
  | \tgit.luolix.top/hashicorp/nomad/scheduler/rank.go:297 +0xb3c
  | github.com/hashicorp/nomad/scheduler.(*JobAntiAffinityIterator).Next(0xc001404d20)
  | \tgit.luolix.top/hashicorp/nomad/scheduler/rank.go:590 +0x73
  | github.com/hashicorp/nomad/scheduler.(*NodeReschedulingPenaltyIterator).Next(0xc0007f3ec0)
  | \tgit.luolix.top/hashicorp/nomad/scheduler/rank.go:651 +0x2e
  | github.com/hashicorp/nomad/scheduler.(*NodeAffinityIterator).Next(0xc001404d70)
  | \tgit.luolix.top/hashicorp/nomad/scheduler/rank.go:723 +0x3f
  | github.com/hashicorp/nomad/scheduler.(*SpreadIterator).Next(0xc000d4ecc0)
  | \tgit.luolix.top/hashicorp/nomad/scheduler/spread.go:118 +0x3f
  | github.com/hashicorp/nomad/scheduler.(*PreemptionScoringIterator).Next(0xc00133ce40)
  | \tgit.luolix.top/hashicorp/nomad/scheduler/rank.go:818 +0x2e
  | github.com/hashicorp/nomad/scheduler.(*ScoreNormalizationIterator).Next(0xc00133ce60)
  | \tgit.luolix.top/hashicorp/nomad/scheduler/rank.go:782 +0x2e
  | github.com/hashicorp/nomad/scheduler.(*LimitIterator).nextOption(0xc000d4ee40)
  | \tgit.luolix.top/hashicorp/nomad/scheduler/select.go:60 +0x2a
  | github.com/hashicorp/nomad/scheduler.(*LimitIterator).Next(0xc000d4ee40)
  | \tgit.luolix.top/hashicorp/nomad/scheduler/select.go:39 +0x30
  | github.com/hashicorp/nomad/scheduler.(*MaxScoreIterator).Next(0xc0007f3fb0)
  | \tgit.luolix.top/hashicorp/nomad/scheduler/select.go:102 +0x4f
  | github.com/hashicorp/nomad/scheduler.(*GenericStack).Select(0xc0013c84e0, 0xc000b8bd40, 0xc00115f570)
  | \tgit.luolix.top/hashicorp/nomad/scheduler/stack.go:182 +0xed7
  | github.com/hashicorp/nomad/scheduler.(*GenericScheduler).selectNextOption(0xc000a8abb0, 0x34f1fe0?, 0xc00115f570)
  | \tgit.luolix.top/hashicorp/nomad/scheduler/generic_sched.go:813 +0x33
  | github.com/hashicorp/nomad/scheduler.(*GenericScheduler).computePlacements(0xc000a8abb0, {0x4cc7a48, 0x0, 0x0}, {0xc000f91110, 0x1, 0x1})
  | \tgit.luolix.top/hashicorp/nomad/scheduler/generic_sched.go:587 +0x948
  | github.com/hashicorp/nomad/scheduler.(*GenericScheduler).computeJobAllocs(0xc000a8abb0)
  | \tgit.luolix.top/hashicorp/nomad/scheduler/generic_sched.go:465 +0x140a
  | github.com/hashicorp/nomad/scheduler.(*GenericScheduler).process(0xc000a8abb0)
  | \tgit.luolix.top/hashicorp/nomad/scheduler/generic_sched.go:285 +0x4e5
  | github.com/hashicorp/nomad/scheduler.retryMax(0x2, 0xc00115fd28, 0xc00115fd18)
  | \tgit.luolix.top/hashicorp/nomad/scheduler/util.go:85 +0x52
  | github.com/hashicorp/nomad/scheduler.(*GenericScheduler).Process(0xc000a8abb0, 0xc001120a80)
  | \tgit.luolix.top/hashicorp/nomad/scheduler/generic_sched.go:184 +0x57f
  | github.com/hashicorp/nomad/nomad.(*Worker).invokeScheduler(0xc0005c40d0, 0xc0007f38c0, 0xc001120a80, {0xc000650cf0, 0x24})
  | \tgit.luolix.top/hashicorp/nomad/nomad/worker.go:623 +0x379
  | github.com/hashicorp/nomad/nomad.(*Worker).run(0xc0005c40d0, 0x12a05f200)
  | \tgit.luolix.top/hashicorp/nomad/nomad/worker.go:452 +0x5ad
  | created by github.com/hashicorp/nomad/nomad.(*Worker).Start
  | \tgit.luolix.top/hashicorp/nomad/nomad/worker.go:152 +0x65

Expected Result

No scheduler panic.

Actual Result

Scheduler panic.

Job file (if appropriate)

See above.

@tgross tgross added this to Needs Triage in Nomad - Community Issues Triage via automation Jun 20, 2023
@tgross tgross self-assigned this Jun 20, 2023
@tgross tgross moved this from Needs Triage to Triaging in Nomad - Community Issues Triage Jun 20, 2023

tgross commented Jun 20, 2023

Hi @stswidwinski! This is definitely a scheduler bug, as setting the min and max to be the same should result in there being exactly 1 port available. I'm working up a quick PR with the fix right now.

tgross added a commit that referenced this issue Jun 20, 2023
If the dynamic port range for a node is set so that the min is equal to the max,
there's only one port available and this passes config validation. But the
scheduler panics when it tries to pick a random port. Only add the randomness
when there's more than one to pick from.

Adds a test for the behavior but also adjusts the commentary on a couple of the
existing tests that made it seem like this case was already covered if you
didn't look too closely.

Fixes: #17585
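The shape of that guard, as a rough sketch (pickDynamicPort is a hypothetical helper name, not the literal patch in #17619; the range is treated as inclusive on both ends, per the comment above that min == max means exactly one available port):

package main

import (
	"fmt"
	"math/rand"
)

// pickDynamicPort sketches the guard described in the commit message:
// only reach for rand.Intn when the range actually holds more than one port.
func pickDynamicPort(minPort, maxPort int) int {
	if maxPort > minPort {
		// Both bounds are usable, so there are maxPort-minPort+1 choices.
		return minPort + rand.Intn(maxPort-minPort+1)
	}
	// min == max: exactly one port available, nothing to randomize.
	return minPort
}

func main() {
	fmt.Println(pickDynamicPort(20000, 20000)) // prints 20000, no panic
}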
stswidwinski commented Jun 20, 2023

> This is definitely a scheduler bug, as setting the min and max to be the same should result in there being exactly 1 port available.

How would one express a lack of available ports? Would one make max smaller than min?


tgross commented Jun 20, 2023

> How would one express a lack of available ports? Would one make max smaller than min?

That would fail config validation. I think you could do this by setting a small range and then reserving those same ports via the reserved_ports config.
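For example, something along these lines (an untested sketch; the values are illustrative) should leave the scheduler with no dynamic ports to hand out, since the only port in the dynamic range is also reserved:

client {
  enabled = true
  min_dynamic_port = 20000
  max_dynamic_port = 20000

  reserved {
    # The single port in the dynamic range is also reserved,
    # so no dynamic ports remain allocatable.
    reserved_ports = "20000"
  }
}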

Nomad - Community Issues Triage automation moved this from Triaging to Done Jun 20, 2023

tgross commented Jun 20, 2023

Fixed in #17619, which will ship in the upcoming 1.6.0 beta (plus backports when those ship with the GA).
