
Rebalance Nomad Scheduled Allocations #10039

Open

idrennanvmware opened this issue Feb 17, 2021 · 12 comments

@idrennanvmware
Contributor

idrennanvmware commented Feb 17, 2021

Nomad version

Nomad 1.0.3

Operating system and Environment details

PhotonOS3

Issue

Over time we have noticed allocations getting placed on the same node more and more (we use the spread algorithm as the default in the scheduler config). This has produced some interesting scenarios where system jobs are unable to run because dimensions are exhausted on a saturated node, even though other nodes have room for the allocations that are piled onto it.
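For context, the spread default is set through the operator scheduler configuration. A rough sketch of what that looks like via the HTTP API (the address is a placeholder, any ACL token is omitted, and the write replaces the whole config, so the preemption values should mirror your cluster's existing settings):

# read the current scheduler configuration
curl -s http://127.0.0.1:4646/v1/operator/scheduler/configuration

# set spread as the cluster-wide default (preemption values shown are the defaults)
curl -s -X PUT http://127.0.0.1:4646/v1/operator/scheduler/configuration \
  -d '{"SchedulerAlgorithm": "spread", "PreemptionConfig": {"SystemSchedulerEnabled": true, "BatchSchedulerEnabled": false, "ServiceSchedulerEnabled": false}}'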

Right now we manually restart the allocation group, which results in the allocations starting to spread as expected, but we are a little surprised by this behavior - in a few instances an application and all of its instances are on the same node.
EDIT: We have a strong suspicion about the culprit. We recently had a sizing operation done on some clusters, and ALL of the clusters affected by this sizing (and reboot) are displaying this imbalance. We will run tests to see if we can reproduce the behavior. More concerning is that applications seem to "stack up", like the screenshot below.

It may be coincidental, but we have also recently migrated all of these services to the service mesh - I'm not sure that's relevant, but I don't want to exclude any significant changes that coincide with the observed behavior.

In addition, we were wondering if there's a way to issue a rebalance command across the cluster so the scheduler can move allocations. It would be really helpful in scenarios where we do a rolling restart through a cluster (of the actual nodes themselves): each node, when drained, has its allocations started on the remaining nodes as expected, but at the end of all this we end up with one node (the last one to restart) significantly out of balance with the rest. In these scenarios we would really like to be able to trigger a rebalance to ensure the saturation described above does not happen.

Thanks!
Ian

@idrennanvmware
Contributor Author

[Screenshot: Screen Shot 2021-02-17 at 8.35.47 AM]

Here's an example we saw on a production server this morning. All allocations on the same client

@tgross
Member

tgross commented Feb 17, 2021

Hi @idrennanvmware! So there are really two parts to this issue: why the allocations are getting packed, and a feature request for a rebalance. If you're using spread scheduling, I would only expect to see allocations getting packed if the node drains and reboots were happening too close together: on small clusters you'll want to wait for each node to come back up, register, and become eligible before moving on (on large clusters there's more headroom to have a few draining at once).
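A minimal sketch of that pacing, using the standard drain commands (node IDs are placeholders):

# drain the node and wait for its allocations to be migrated elsewhere
nomad node drain -enable -deadline 10m -yes <node-id>

# ...reboot or resize the node...

# confirm the node has re-registered and is ready
nomad node status <node-id>

# a completed drain leaves the node ineligible, so re-enable scheduling
# before starting on the next node
nomad node eligibility -enable <node-id>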

If you take a look at nomad alloc status -verbose :alloc_id, it'll give you the metrics that the scheduler used for placement of that allocation. That might provide some insight into how they're all getting packed.

As far as the feature request goes, that seems reasonable and I think we have an open issue for that already that needs some roadmapping: #8368

@idrennanvmware
Contributor Author

@tgross - here's the alloc from one of the services all packed on the same node

ID                  = ec019321-2797-2398-93ed-8a46e80e5a0f
Eval ID             = 81baa965-3ddc-c506-2b46-bd7a7c4cb0db
Name                = device-log-job.device-log-api-group[1]
Node ID             = 63102b60-fe1d-5285-89e8-91deffc6d5e8
Node Name           = <redacted>
Job ID              = device-log-job
Job Version         = 10
Client Status       = running
Client Description  = Tasks are running
Desired Status      = run
Desired Description = <none>
Created             = 2021-02-16T22:04:15Z
Modified            = 2021-02-16T22:06:16Z
Evaluated Nodes     = 65
Filtered Nodes      = 64
Exhausted Nodes     = 0
Allocation Time     = 100.058µs
Failures            = 0

Allocation Addresses (mode = "bridge")
Label                          Dynamic  Address
*api_http                      yes      10.104.7.106:22996
*connect-proxy-device-log-api  yes      10.104.7.106:31209 -> 31209

Task "connect-proxy-device-log-api" (prestart sidecar) is "running"
Task Resources
CPU        Memory          Disk     Addresses
8/250 MHz  18 MiB/128 MiB  300 MiB  

Task Events:
Started At     = 2021-02-16T22:04:54Z
Finished At    = N/A
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                  Type        Description
2021-02-16T22:04:54Z  Started     Task started by client
2021-02-16T22:04:30Z  Driver      Downloading image
2021-02-16T22:04:27Z  Task Setup  Building Task Directory
2021-02-16T22:04:16Z  Received    Task received by client

Task "device-log-api-logging-task" is "running"
Task Resources
CPU         Memory          Disk     Addresses
19/100 MHz  35 MiB/100 MiB  300 MiB  

Task Events:
Started At     = 2021-02-16T22:04:56Z
Finished At    = N/A
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                  Type        Description
2021-02-16T22:04:56Z  Started     Task started by client
2021-02-16T22:04:54Z  Task Setup  Building Task Directory
2021-02-16T22:04:16Z  Received    Task received by client

Task "device-log-api-task" is "running"
Task Resources
CPU          Memory           Disk     Addresses
115/512 MHz  140 MiB/512 MiB  300 MiB  

Task Events:
Started At     = 2021-02-16T22:05:10Z
Finished At    = N/A
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                  Type        Description
2021-02-16T22:05:10Z  Started     Task started by client
2021-02-16T22:04:55Z  Driver      Downloading image
2021-02-16T22:04:54Z  Task Setup  Building Task Directory
2021-02-16T22:04:16Z  Received    Task received by client

Placement Metrics
  * Constraint "computed class ineligible": 64 nodes excluded by filter
Node                                  allocation-spread  binpack  job-anti-affinity  node-affinity  node-reschedule-penalty  final score
63102b60-fe1d-5285-89e8-91deffc6d5e8  -1                 0.094    -0.667             0              0                        -0.524
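(Reading that placement row: the final score appears to be the mean of the recorded component scores, (-1 + 0.094 - 0.667) / 3 ≈ -0.524. The allocation-spread score of -1 penalizes this node as intended, but with 64 of the 65 evaluated nodes excluded by the computed class filter, it was the only candidate left.)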

@tgross
Member

tgross commented May 3, 2021

Sorry it took me a while to get back to this one, @idrennanvmware. I do want to dig into those placement score metrics in more detail; I don't think we have as much test coverage as we'd like on how the spread scheduling config interacts with the other placement metrics. But I'm also noticing 64 nodes excluded because the computed class is ineligible. Is that expected for your environment?

@tgross tgross removed their assignment May 20, 2021
@idrennanvmware
Contributor Author

idrennanvmware commented May 21, 2021

@tgross - apologies here, too - been buried in a million things :)

To answer your question: yes, there could be a large number of excluded nodes, given that we have a mix of Windows and Linux, and Windows makes up the bulk of our "clients" even though those make very limited use of Nomad at this point (we do rely on it for some critical services like client-side load balancers, logging, and telemetry binaries).
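For reference, the per-node attributes that factor into the computed class (OS, kernel, node class) can be checked like this (the node ID is a placeholder):

# shows node attributes such as kernel.name, os.name, and the node class
nomad node status -verbose <node-id>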

@maziadi

maziadi commented Oct 14, 2021

@tgross: Any updates on this please?

@lgfa29
Contributor

lgfa29 commented Oct 15, 2021

@tgross: Any updates on this please?

Hi @maziadi 👋

No updates here yet. We'll let you know once we do.

@sstent

sstent commented Aug 11, 2022

+1 this would be a really great feature

@tjohnston-cd

tjohnston-cd commented Aug 22, 2022

+1, seeing this with a small cluster during rolling restarts (and similar scenarios) as mentioned in the OP.

Many of our jobs are long-running / static, so the overburdened clients will remain overburdened without intervention. We can reschedule jobs manually by stopping allocations that are on the overburdened clients, but it would be nice to have some kind of operator workflow to rebalance, as mentioned.
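For anyone else in the meantime, the manual workaround looks roughly like this (a sketch; the alloc ID is a placeholder):

# inspect where the allocation landed and the placement scores
nomad alloc status -verbose <alloc-id>

# stop it; the scheduler places a replacement, which can land on a
# less-loaded client when spread scheduling is in effect
nomad alloc stop <alloc-id>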

@maxim-design

+1

@mwang365

We are also experiencing an unexplained skewed distribution of allocations. A function to prevent that and a separate function to rebalance are essential!

@Nukesor

Nukesor commented Jan 8, 2024

We're currently running into the same scenario.
Once the cluster enters an unbalanced state, it doesn't recover from it, unless there's significant resource excess.

Our cluster setup:

  • We're running 3 master servers and 3 workers
  • There are multiple jobs running on the cluster
  • Each job has 3 allocs that are to be distributed equally across the cluster
  • While deploying a new version, we spin up canaries for all jobs, effectively doubling our workload.
    This is to allow fast, simultaneous deployments of all instances; sequential deployments would take hours.

How does it get unbalanced

If one starts from a clean slate, the allocs get allocated as intended and are equally distributed.

In our case, the first unbalanced state occurs while updating the Nomad/Consul cluster.
During the update, each worker instance gets shut down, updated, and started up again. During this time, the allocs on that machine get rescheduled onto the other two workers.

After the updates on all machines are done, the topology usually looks like this:

  • worker0: 1.5x normal load
  • worker1: 1.5x normal load
  • worker2: 0x normal load

This behavior is expected, but leads to follow-up problems.
When we start a new deployment, we double the workload as usual, which gets equally distributed this time around.

  • worker0: 2.5x normal load
  • worker1: 2.5x normal load
  • worker2: 1x normal load

If the workers cannot provide 2.5x the normal resources, allocs now get moved to worker2, which ends up with stacked allocs of the same jobs as a result.
This usually rebalances itself eventually, over the course of multiple deployments, but it leaves reliability degraded for quite some time.

To prevent this scenario, we either have to scale our servers to handle 2.5x the normal load instead of just 2x, or we have to wipe the cluster and start from a clean slate (which isn't an option).

I hope this gives some insight into why rebalancing would be nice: it would keep users from having to provision additional (otherwise unneeded) resources.

Projects
Status: Needs Roadmapping
Development

No branches or pull requests

9 participants