
Rebalance Nomad Scheduled Allocations #10039

Open

idrennanvmware opened this issue Feb 17, 2021 · 12 comments

@idrennanvmware
Contributor

idrennanvmware commented Feb 17, 2021

Nomad version

Nomad 1.0.3

Operating system and Environment details

PhotonOS3

Issue

Over time we have noticed allocations getting placed on the same node more and more (we use the spread algorithm as the default in the scheduler config). This has produced some interesting scenarios where system jobs are unable to run because dimensions are exhausted on a saturated node, even though other nodes have room for the allocations that are piled onto it.
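For context, the spread default is set through the operator scheduler configuration. A rough sketch of what that looks like via the HTTP API (the address is a placeholder, any ACL token is omitted, and the write replaces the whole config, so the preemption values should mirror your cluster's existing settings):

# read the current scheduler configuration
curl -s http://127.0.0.1:4646/v1/operator/scheduler/configuration

# set spread as the cluster-wide default (preemption values shown are the defaults)
curl -s -X PUT http://127.0.0.1:4646/v1/operator/scheduler/configuration \
  -d '{"SchedulerAlgorithm": "spread", "PreemptionConfig": {"SystemSchedulerEnabled": true, "BatchSchedulerEnabled": false, "ServiceSchedulerEnabled": false}}'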

Right now we manually restart the allocation group, which results in the allocations starting to spread as expected, but we are a little surprised by this behavior - in a few instances an application and all of its instances are on the same node.
EDIT: We have a strong suspicion about the culprit. We recently had a sizing operation done on some clusters, and ALL of the clusters affected by this sizing (and reboot) are displaying this imbalance. We will run tests to see if we can reproduce the behavior. More concerning is that applications seem to "stack up", like the screenshot below.

It may be coincidental, but we have also recently migrated all of these services to the service mesh - I'm not sure that's relevant, but I don't want to exclude any significant changes that coincide with the observed behavior.

In addition, we were wondering if there's a way to issue a rebalance command across the cluster so the scheduler can move allocations. It would be really helpful in scenarios where we do a rolling restart through a cluster (of the actual nodes themselves): each node, when drained, has its allocations started on the remaining nodes as expected, but at the end of all this we end up with one node (the last one to restart) significantly out of balance with the rest. In these scenarios we would really like to be able to trigger a rebalance to ensure the saturation described above does not happen.

Thanks!
Ian

@idrennanvmware
Contributor Author

[Screenshot: Screen Shot 2021-02-17 at 8.35.47 AM]

Here's an example we saw on a production server this morning. All allocations on the same client

@tgross
Member

tgross commented Feb 17, 2021

Hi @idrennanvmware! So there are really two parts to this issue: why the allocations are getting packed, and a feature request for a rebalance. If you're using spread scheduling, I would only expect to see allocations getting packed if the node drains and reboots were happening too close together: on small clusters you'll want to wait for each node to come back up, register, and become eligible before moving on (on large clusters there's more headroom to have a few draining at once).
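A minimal sketch of that pacing, using the standard drain commands (node IDs are placeholders):

# drain the node and wait for its allocations to be migrated elsewhere
nomad node drain -enable -deadline 10m -yes <node-id>

# ...reboot or resize the node...

# confirm the node has re-registered and is ready
nomad node status <node-id>

# a completed drain leaves the node ineligible, so re-enable scheduling
# before starting on the next node
nomad node eligibility -enable <node-id>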

If you take a look at nomad alloc status -verbose :alloc_id, it'll give you the metrics that the scheduler used for placement of that allocation. That might provide some insight into how they're all getting packed.

As far as the feature request goes, that seems reasonable and I think we have an open issue for that already that needs some roadmapping: #8368

@idrennanvmware
Contributor Author

@tgross - here's the alloc from one of the services all packed on the same node

ID                  = ec019321-2797-2398-93ed-8a46e80e5a0f
Eval ID             = 81baa965-3ddc-c506-2b46-bd7a7c4cb0db
Name                = device-log-job.device-log-api-group[1]
Node ID             = 63102b60-fe1d-5285-89e8-91deffc6d5e8
Node Name           = <redacted>
Job ID              = device-log-job
Job Version         = 10
Client Status       = running
Client Description  = Tasks are running
Desired Status      = run
Desired Description = <none>
Created             = 2021-02-16T22:04:15Z
Modified            = 2021-02-16T22:06:16Z
Evaluated Nodes     = 65
Filtered Nodes      = 64
Exhausted Nodes     = 0
Allocation Time     = 100.058µs
Failures            = 0

Allocation Addresses (mode = "bridge")
Label                          Dynamic  Address
*api_http                      yes      10.104.7.106:22996
*connect-proxy-device-log-api  yes      10.104.7.106:31209 -> 31209

Task "connect-proxy-device-log-api" (prestart sidecar) is "running"
Task Resources
CPU        Memory          Disk     Addresses
8/250 MHz  18 MiB/128 MiB  300 MiB  

Task Events:
Started At     = 2021-02-16T22:04:54Z
Finished At    = N/A
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                  Type        Description
2021-02-16T22:04:54Z  Started     Task started by client
2021-02-16T22:04:30Z  Driver      Downloading image
2021-02-16T22:04:27Z  Task Setup  Building Task Directory
2021-02-16T22:04:16Z  Received    Task received by client

Task "device-log-api-logging-task" is "running"
Task Resources
CPU         Memory          Disk     Addresses
19/100 MHz  35 MiB/100 MiB  300 MiB  

Task Events:
Started At     = 2021-02-16T22:04:56Z
Finished At    = N/A
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                  Type        Description
2021-02-16T22:04:56Z  Started     Task started by client
2021-02-16T22:04:54Z  Task Setup  Building Task Directory
2021-02-16T22:04:16Z  Received    Task received by client

Task "device-log-api-task" is "running"
Task Resources
CPU          Memory           Disk     Addresses
115/512 MHz  140 MiB/512 MiB  300 MiB  

Task Events:
Started At     = 2021-02-16T22:05:10Z
Finished At    = N/A
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                  Type        Description
2021-02-16T22:05:10Z  Started     Task started by client
2021-02-16T22:04:55Z  Driver      Downloading image
2021-02-16T22:04:54Z  Task Setup  Building Task Directory
2021-02-16T22:04:16Z  Received    Task received by client

Placement Metrics
  * Constraint "computed class ineligible": 64 nodes excluded by filter
Node                                  allocation-spread  binpack  job-anti-affinity  node-affinity  node-reschedule-penalty  final score
63102b60-fe1d-5285-89e8-91deffc6d5e8  -1                 0.094    -0.667             0              0                        -0.524
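(Reading that placement row: the final score appears to be the mean of the recorded component scores, (-1 + 0.094 - 0.667) / 3 ≈ -0.524. The allocation-spread score of -1 penalizes this node as intended, but with 64 of the 65 evaluated nodes excluded by the computed class filter, it was the only candidate left.)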

@tgross
Member

tgross commented May 3, 2021

Sorry it took me a while to get back to this one, @idrennanvmware. I do want to dig into those placement score metrics in more detail; I don't think we have as much test coverage as we'd like on how the spread scheduling config interacts with the other placement metrics. But I'm also noticing 64 nodes excluded because the computed class is ineligible. Is that expected for your environment?

@tgross tgross removed their assignment May 20, 2021
@idrennanvmware
Contributor Author

idrennanvmware commented May 21, 2021

@tgross - apologies here, too - been buried in a million things :)

To answer your question: yes, there could be a large number of excluded nodes, given that we have a mix of Windows and Linux, and Windows makes up the bulk of our "clients" even though those make very limited use of Nomad at this point (we do rely on it for some critical services like client-side load balancers, logging, and telemetry binaries).
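For reference, the per-node attributes that factor into the computed class (OS, kernel, node class) can be checked like this (the node ID is a placeholder):

# shows node attributes such as kernel.name, os.name, and the node class
nomad node status -verbose <node-id>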

@maziadi

maziadi commented Oct 14, 2021

@tgross: Any updates on this please?

@lgfa29
Contributor

lgfa29 commented Oct 15, 2021

@tgross: Any updates on this please?

Hi @maziadi 👋

No updates here yet. We'll let you know once we do.

@sstent

sstent commented Aug 11, 2022

+1 this would be a really great feature

@tjohnston-cd

tjohnston-cd commented Aug 22, 2022

+1, seeing this with a small cluster during rolling restarts (and similar scenarios) as mentioned in the OP.

Many of our jobs are long-running / static, so the overburdened clients will remain overburdened without intervention. We can reschedule jobs manually by stopping allocations that are on the overburdened clients, but it would be nice to have some kind of operator workflow to rebalance, as mentioned.
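For anyone else in the meantime, the manual workaround looks roughly like this (a sketch; the alloc ID is a placeholder):

# inspect where the allocation landed and the placement scores
nomad alloc status -verbose <alloc-id>

# stop it; the scheduler places a replacement, which can land on a
# less-loaded client when spread scheduling is in effect
nomad alloc stop <alloc-id>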

@maxim-design

+1

@mwang365

We are also experiencing an unexplained skewed distribution of allocations. A function to prevent that and a separate function to rebalance are essential!

@Nukesor

Nukesor commented Jan 8, 2024

We're currently running into the same scenario.
Once the cluster enters an unbalanced state, it doesn't recover from it, unless there's significant resource excess.

Our cluster setup:

  • We're running 3 master servers and 3 workers
  • There are multiple jobs running on the cluster
  • Each job has 3 allocs that are to be distributed equally across the cluster
  • While deploying a new version, we spin up canaries for all jobs, effectively doubling our workload.
    This is to allow fast, simultaneous deployments of all instances; sequential deployments would take hours.

How does it get unbalanced

If one starts from a clean slate, the allocs get allocated as intended and are equally distributed.

In our case, the first unbalanced state occurs while updating the Nomad/Consul cluster.
During the update, each worker instance gets shut down, updated, and started up again. During this time, the allocs on that machine get rescheduled onto the other two workers.

After the updates on all machines are done, the topology usually looks like this:

  • worker0: 1.5x normal load
  • worker1: 1.5x normal load
  • worker2: 0x normal load

This behavior is expected, but leads to follow-up problems.
When we start a new deployment, we double the workload as usual, which gets equally distributed this time around.

  • worker0: 2.5x normal load
  • worker1: 2.5x normal load
  • worker2: 1x normal load

If the workers cannot provide 2.5x the normal resources, allocs now get moved to worker2, which ends up with stacked allocs of the same jobs as a result.
This usually rebalances itself eventually, over the course of multiple deployments, but it leaves reliability degraded for quite some time.

To prevent this scenario, we either have to scale our servers to handle 2.5x the normal load instead of just 2x, or we have to wipe the cluster and start from a clean slate (which isn't an option).

I hope this gives some insight into why rebalancing would be nice: it would keep users from having to provision additional (otherwise unneeded) resources.

Projects
Status: Needs Roadmapping
Development

No branches or pull requests

9 participants