
New Ranker that would facilitate "fairly" spreading job groups to eligible nodes #3690

Closed
markpapadakis opened this issue Dec 27, 2017 · 21 comments


@markpapadakis

We are currently in the process of adopting Nomad and Consul. Aside from the following use case, which is, to the best of our knowledge, hard or impossible to address with the current Nomad semantics and available constraint/ranker implementations, we think Nomad is "perfect" for us.

Specifically, we have various classes of services that we currently spread across a fleet of physical nodes (say, 50), based on which node runs the fewest instances. We use a combination of home-grown tools and scripts to accomplish this today, but we would like to transition this (like everything else) to Nomad. Currently that is not possible, because the bin-packing-based ranker cannot accommodate that design.

One of those services is a multi-threaded application server. It dynamically resizes its thread pool and usually runs very hot (i.e. it keeps the CPU busy, and memory and I/O pressure is also high), which is what it was designed to do, given that each instance potentially handles thousands of requests. By spreading the instances across the fleet of nodes, we get to utilise them optimally.

Bin-packing would place, say, 5 instances on the first node, then move on to the next node until it reaches 5 there, and so on. Which is to say, some nodes will be extremely busy and saturated while all other nodes sit idle. The alternative would be to select, among those 50 nodes, the ones that pass the constraint checks, figure out which one runs the fewest app server instances, and place the job group there, which would pretty much solve the problem. (We have a few other such use cases; this is not specific to the app server.)

So, ideally, for us, there would be a(nother) constraint for the maximum number of instances of a specific job allowed on a node (in case we need to reserve capacity on some of those nodes for something else and only want to allow, say, up to 2 instances of the "application server" job group on them), and a new ranker stanza for selecting the spread ranker for that job (group?).

I understand a new ranker is currently in the works that would make this possible, and we are really looking forward to it.

Here's some pseudocode in C++ for how it could probably work:

auto eligible_nodes = job_group->filter_nodes_by_constraints(); // all nodes that pass the constraint checks

if (eligible_nodes.empty()) {
        // can't run it anywhere; maybe there should be some fallback stanza for this kind of situation?
} else {
        // pick the eligible node currently running the fewest instances of this job group
        const auto n = eligible_nodes.size();
        std::size_t best{0};
        auto best_cnt = eligible_nodes.front().instances_cnt(job_group_id);

        for (std::size_t i{1}; i != n; ++i) {
                const auto cnt = eligible_nodes[i].instances_cnt(job_group_id);

                if (cnt < best_cnt) {
                        best = i;
                        best_cnt = cnt;
                }
        }

        // the node with the fewest running instances of this job group wins
        return eligible_nodes[best];
}
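
And, purely to illustrate the job-spec side, a hypothetical sketch of the two pieces described above; neither a max_per_node operator nor a ranker stanza exists in Nomad, the names are made up for illustration:

group "app-server" {
  # hypothetical: cap the number of instances of this group per node
  constraint {
    operator = "max_per_node"   # made-up operator, for illustration only
    value    = "2"
  }

  # hypothetical: opt this group out of bin packing
  ranker {
    algorithm = "spread"        # made-up stanza, for illustration only
  }
}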
@jippi
Contributor

jippi commented Dec 27, 2017

Hi,

Maybe you could use distinct_property or one of the other constraints to ensure a proper spread of the job - that's how I've done it historically.

Example with a max of 2 of your job type per instance:

constraint {
  operator  = "distinct_property"
  attribute = "${node.unique.id}"
  value     = "2"
}

Or simply set the resource requirements high enough (e.g. 51% of the instance) so Nomad never co-locates them :)

@markpapadakis
Author

Hey @jippi,

I think distinct_property can be used to limit placement, so that if, e.g., it's set to 5 for a particular attribute, no more than 5 instances of a job/group can be scheduled on that node. But, unless I am missing something, it won't instruct the scheduler to spread allocations -- just to select a different node once, say, 5 job groups have been scheduled on a node.

Also, re: resource requirements, the nodes always run hot anyway, so that probably won't help.

Thanks

@jippi
Contributor

jippi commented Dec 27, 2017

In my experience, Nomad does a pretty good job of spreading allocs of the same type across multiple nodes out of the box - I would test with distinct_property and check for yourself.

In my experience, if you have a count of 5 and all 5 allocs fit on one instance, Nomad will not put all 5 on that instance anyway; it will try to ensure a spread within the same job, so that losing one node won't take down all allocs of that job. There is some anti-entropy going on as well.

I have jobs similar to yours, and with the above config example I've never seen Nomad put all its eggs in one basket :)

@markpapadakis
Author

This is mostly about making optimal use of the nodes, not so much about reducing the likelihood of service disruption from, say, placing multiple instances on a node, the node having issues, and losing whatever was served by it (though that's definitely important).

Thank you though. We will try that configuration and see how far it takes us :)

@samart

samart commented Dec 28, 2017

It would be nice to be able to choose a spread algorithm vs. bin packing.

distinct doesn't help when you have more jobs than nodes. Imagine some CPU-intensive batch job or process that you want to spread out rather than pile onto a single node, or simply that you don't want multiple different jobs scheduled on the same nodes - i.e. you want to spread the blast radius/impact of a node failure, or spare the same node from constant deployment churn. You're also not subject to a Docker engine failure on the one node Nomad keeps choosing, where all jobs go into a black hole because the node is bad and the jobs fail.

@schmichael
Member

Anti-affinity / spread algorithms are definitely something we intend to support in the future. I doubt it will be in the next (0.8) release, though. We've had interest from other users as well who are currently using distinct_property and distinct_hosts with some success. I think distinct_hosts may work for your use case, but I'll definitely leave this issue open, as being able to choose an alternative scheduling algorithm to bin packing is on our roadmap.
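
For reference, a minimal sketch of the distinct_hosts constraint mentioned above, using the standard constraint syntax; it forces every allocation of the job or group onto a different node:

constraint {
  # each alloc of this job/group must land on a distinct client node
  operator = "distinct_hosts"
  value    = "true"
}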

@linuxgood1230

linuxgood1230 commented Jun 21, 2018

It would be nice to be able to choose a spread algorithm vs. bin packing.
Commenting to follow the issue.

@ramm

ramm commented Jan 18, 2019

Oh. Indeed, that would be really nice to have, especially if you're using soft limits.

@jippi
Contributor

jippi commented Jan 18, 2019

spread and affinity will come in 0.9 :)
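
For the curious, a minimal sketch of what the 0.9 spread stanza is expected to look like; the attribute and weight below are just illustrative values:

spread {
  # prefer an even distribution of this job's allocs across node names
  attribute = "${node.unique.name}"
  weight    = 100
}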

@ramm

ramm commented Jan 21, 2019

@jippi that's still job- (and lower-) level, right?
We're talking about spreading all the tasks throughout the cluster, not just instances of one task.

"Spreading" as the opposite of "bin packing".

@jippi
Contributor

jippi commented Jan 21, 2019

It is per job, yep, but if all your jobs use spread it would basically be the same thing, or at least unblock your requirements until Nomad supports it at the cluster level. Generally, in my experience, Nomad is pretty good out of the box at not dumping 40 of the same alloc on the same box, so it's not something I've personally suffered issues from.

@ramm

ramm commented Jan 22, 2019

Not the same alloc. But if you have 150+ jobs with soft limits, you WILL end up with both overloaded and almost-free nodes because of bin packing.

@preetapan
Contributor

@ramm Spread at the cluster level is on the future roadmap; there's no ETA on that yet. We would need to introduce a way to configure it at the node class level, so that you can have a cluster where a set of nodes uses spread rather than bin packing for scoring placements.

@alexiri
Contributor

alexiri commented Apr 26, 2019

We're very interested in this feature as well. We thought the new spread parameter would do exactly this and are very disappointed that it doesn't.

Here's a graph of several dispatched jobs running on a 4-node cluster, each node represented with a different color. The job is spread by node.unique.name, which as you can tell has no effect whatsoever: starting at ~9:30 all jobs are scheduled on the same node, leaving the other 3 nodes totally idle.

[graph: allocations per node over time; after ~9:30 every job lands on a single node]

@alexiri
Contributor

alexiri commented Jun 24, 2019

Hi @preetapan, still no ETA on this?

@recursionbane

@alexiri, we just spent close to a week struggling to understand exactly this behavior. A cluster-wide spread is required for efficient oversubscription of resources for unary task specs in our use case.

Any ideas for exploiting existing stanzas (like the datacenter string, perhaps) to get a spread effect instead of bin packing?

@recursionbane

Spreading works when exploiting node_class affinity.

E.g., if you have ten clients with node_class as 0,1,2...,9, then you can set an affinity for each job/group/task to be int(rand(9)) for better spread.

Obviously, this is not ideal because your job submission mechanism now needs to be aware of the number of existing clients and their valid node_classes, but it works for our use case for now.
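
A rough sketch of that workaround, assuming clients register node_class values "0" through "9" and the job submitter has already picked a class at random (the "3" below is only an example):

affinity {
  # steer this job toward the randomly chosen node class
  attribute = "${node.class}"
  value     = "3"
  weight    = 50
}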

@alexiri
Contributor

alexiri commented Apr 1, 2020

Hi @preetapan, has there been any progress on this?

@tgross
Member

tgross commented Jul 31, 2020

In Nomad 0.11.2 we released the new spread scheduling option. See #7810 and the default_scheduler_config option

@tgross tgross closed this as completed Jul 31, 2020
@fwkz
Contributor

fwkz commented Aug 1, 2020

In Nomad 0.11.2 we released the new spread scheduling option. See #7810 and the default_scheduler_config option

I guess this feature was released in v0.12.0.
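
For anyone landing here later, a minimal sketch of enabling it in the server agent configuration, assuming a version that ships default_scheduler_config (0.12.0+):

server {
  enabled = true

  default_scheduler_config {
    scheduler_algorithm = "spread"   # default is "binpack"
  }
}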

@github-actions

github-actions bot commented Nov 4, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 4, 2022