
New Ranker that would facilitate "fairly" spreading job groups to eligible nodes #3690

Closed
markpapadakis opened this issue Dec 27, 2017 · 21 comments


@markpapadakis

We are currently in the process of adopting Nomad and Consul. Aside from the following use case, which is, to the best of our knowledge, hard or impossible to address with the current Nomad semantics and available constraint/ranker implementations, we think Nomad is "perfect" for us.

Specifically, we have various classes of services that we currently spread across a fleet of physical nodes (say, 50), based on which node runs the fewest instances. We use a combination of home-grown tools and scripts to accomplish this today, but we would like to transition this (like everything else) to Nomad. Currently that is not possible, because the bin-packing-based ranker cannot accommodate that design.

One of those services is a multi-threaded application server. It dynamically resizes its thread pool and usually runs very hot (i.e. it keeps the CPU busy, and memory and I/O pressure is also high), which is what it was designed to do, given that each instance potentially handles thousands of requests. By spreading the instances across the fleet of nodes, we get to utilise them optimally.

Bin-packing would place, say, 5 instances on the first node, then move on to the next node until it reaches 5 there, and so on. Which is to say, some nodes will be extremely busy and saturated while all other nodes sit idle. The alternative would be to select, among those 50 nodes, the ones that pass the constraint checks, figure out which one runs the fewest app server instances, and place the job group there, which would pretty much solve the problem. (We have a few other such use cases; this is not specific to the app server.)

So, ideally, for us, there would be a(nother) constraint for the maximum number of instances of a specific job allowed on a node (in case we need to reserve capacity on some of those nodes for something else and only want to allow, say, up to 2 instances of the "application server" job group on them), and a new ranker stanza for selecting the spread ranker for that job (group?).

I understand a new ranker is currently in the works that would make this possible, and we are really looking forward to it.

Here's some pseudocode in C++ for how it could probably work:

auto eligible_nodes = job_group->filter_nodes_by_constraints(); // all nodes that pass the constraint checks

if (eligible_nodes.empty()) {
        // can't run it anywhere; maybe there should be some fallback stanza for this kind of situation?
} else {
        // pick the eligible node currently running the fewest instances of this job group
        const auto n = eligible_nodes.size();
        std::size_t best{0};
        auto best_cnt = eligible_nodes.front().instances_cnt(job_group_id);

        for (std::size_t i{1}; i != n; ++i) {
                const auto cnt = eligible_nodes[i].instances_cnt(job_group_id);

                if (cnt < best_cnt) {
                        best = i;
                        best_cnt = cnt;
                }
        }

        // the node with the fewest running instances of this job group wins
        return eligible_nodes[best];
}
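
And, purely to illustrate the job-spec side, a hypothetical sketch of the two pieces described above; neither a max_per_node operator nor a ranker stanza exists in Nomad, the names are made up for illustration:

group "app-server" {
  # hypothetical: cap the number of instances of this group per node
  constraint {
    operator = "max_per_node"   # made-up operator, for illustration only
    value    = "2"
  }

  # hypothetical: opt this group out of bin packing
  ranker {
    algorithm = "spread"        # made-up stanza, for illustration only
  }
}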
@jippi
Contributor

jippi commented Dec 27, 2017

Hi,

Maybe you could use distinct_property or one of the other constraints to ensure a proper spread of the job - that's how I've done it historically.

Example with a max of 2 of your job type per instance:

constraint {
  operator  = "distinct_property"
  attribute = "${node.unique.id}"
  value     = "2"
}

Or simply set the resource requirements high enough (e.g. 51% of the instance) so Nomad never co-locates them :)

@markpapadakis
Author

Hey @jippi,

I think distinct_property can be used to limit placement, so that if, e.g., it's set to 5 for a particular attribute, no more than 5 instances of a job/group can be scheduled on that node. But, unless I am missing something, it won't instruct the scheduler to spread allocations -- just to select a different node once, say, 5 job groups have been scheduled on a node.

Also, re: resource requirements, the nodes always run hot anyway, so that probably won't help.

Thanks

@jippi
Contributor

jippi commented Dec 27, 2017

In my experience, Nomad does a pretty good job of spreading allocs of the same type across multiple nodes out of the box - I would test with distinct_property and check for yourself.

In my experience, if you have a count of 5 and all 5 allocs fit on one instance, Nomad will not put all 5 on that instance anyway; it will try to ensure a spread within the same job, so that losing one node won't take down all allocs of that job. There is some anti-entropy going on as well.

I have jobs similar to yours, and with the above config example I've never seen Nomad put all its eggs in one basket :)

@markpapadakis
Author

This is mostly about making optimal use of the nodes, not so much about reducing the likelihood of service disruption from, say, placing multiple instances on a node, the node having issues, and losing whatever was served by it (though that's definitely important).

Thank you though. We will try that configuration and see how far it takes us :)

@samart

samart commented Dec 28, 2017

It would be nice to be able to choose a spread algorithm vs. bin packing.

distinct doesn't help when you have more jobs than nodes. Imagine some CPU-intensive batch job or process that you want to spread out rather than pile onto a single node, or simply that you don't want multiple different jobs scheduled on the same nodes - i.e. you want to spread the blast radius/impact of a node failure, or spare the same node from constant deployment churn. You're also not subject to a Docker engine failure on the one node Nomad keeps choosing, where all jobs go into a black hole because the node is bad and the jobs fail.

@schmichael
Member

Anti-affinity / spread algorithms are definitely something we intend to support in the future. I doubt it will be in the next (0.8) release, though. We've had interest from other users as well who are currently using distinct_property and distinct_hosts with some success. I think distinct_hosts may work for your use case, but I'll definitely leave this issue open, as being able to choose an alternative scheduling algorithm to bin packing is on our roadmap.
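
For reference, a minimal sketch of the distinct_hosts constraint mentioned above, using the standard constraint syntax; it forces every allocation of the job or group onto a different node:

constraint {
  # each alloc of this job/group must land on a distinct client node
  operator = "distinct_hosts"
  value    = "true"
}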

@linuxgood1230

linuxgood1230 commented Jun 21, 2018

It would be nice to be able to choose a spread algorithm vs. bin packing.
Commenting to follow the issue.

@ramm

ramm commented Jan 18, 2019

Oh. Indeed, that would be really nice to have, especially if you're using soft limits.

@jippi
Contributor

jippi commented Jan 18, 2019

spread and affinity will come in 0.9 :)
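
For the curious, a minimal sketch of what the 0.9 spread stanza is expected to look like; the attribute and weight below are just illustrative values:

spread {
  # prefer an even distribution of this job's allocs across node names
  attribute = "${node.unique.name}"
  weight    = 100
}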

@ramm

ramm commented Jan 21, 2019

@jippi that's still job- (and lower-) level, right?
We're talking about spreading all the tasks throughout the cluster, not just instances of one task.

"Spreading" as the opposite of "bin packing".

@jippi
Contributor

jippi commented Jan 21, 2019

It is per job, yep, but if all your jobs use spread it would basically be the same thing, or at least unblock your requirements until Nomad supports it at the cluster level. Generally, in my experience, Nomad is pretty good out of the box at not dumping 40 of the same alloc on the same box, so it's not something I've personally suffered issues from.

@ramm

ramm commented Jan 22, 2019

Not the same alloc. But if you have 150+ jobs with soft limits, you WILL end up with both overloaded and almost-free nodes because of bin packing.

@preetapan
Contributor

@ramm Spread at the cluster level is on the future roadmap; there's no ETA on that yet. We would need to introduce a way to configure it at the node class level, so that you can have a cluster where a set of nodes uses spread rather than bin packing for scoring placements.

@alexiri
Contributor

alexiri commented Apr 26, 2019

We're very interested in this feature as well. We thought the new spread parameter would do exactly this and are very disappointed that it doesn't.

Here's a graph of several dispatched jobs running on a 4-node cluster, each node represented with a different color. The job is spread by node.unique.name, which as you can tell has no effect whatsoever: starting at ~9:30 all jobs are scheduled on the same node, leaving the other 3 nodes totally idle.

[graph: allocations per node over time; after ~9:30 every job lands on a single node]

@alexiri
Contributor

alexiri commented Jun 24, 2019

Hi @preetapan, still no ETA on this?

@recursionbane

@alexiri, we just spent close to a week struggling to understand exactly this behavior. A cluster-wide spread is required for efficient oversubscription of resources for unary task specs in our use case.

Any ideas for exploiting existing stanzas (like the datacenter string, perhaps) to get a spread effect instead of bin packing?

@recursionbane

Spreading works when exploiting node_class affinity.

E.g., if you have ten clients with node_class as 0,1,2...,9, then you can set an affinity for each job/group/task to be int(rand(9)) for better spread.

Obviously, this is not ideal because your job submission mechanism now needs to be aware of the number of existing clients and their valid node_classes, but it works for our use case for now.
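
A rough sketch of that workaround, assuming clients register node_class values "0" through "9" and the job submitter has already picked a class at random (the "3" below is only an example):

affinity {
  # steer this job toward the randomly chosen node class
  attribute = "${node.class}"
  value     = "3"
  weight    = 50
}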

@alexiri
Contributor

alexiri commented Apr 1, 2020

Hi @preetapan, has there been any progress on this?

@tgross
Member

tgross commented Jul 31, 2020

In Nomad 0.11.2 we released the new spread scheduling option. See #7810 and the default_scheduler_config option

@tgross tgross closed this as completed Jul 31, 2020
@fwkz
Contributor

fwkz commented Aug 1, 2020

In Nomad 0.11.2 we released the new spread scheduling option. See #7810 and the default_scheduler_config option

I guess this feature was released in v0.12.0.
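
For anyone landing here later, a minimal sketch of enabling it in the server agent configuration, assuming a version that ships default_scheduler_config (0.12.0+):

server {
  enabled = true

  default_scheduler_config {
    scheduler_algorithm = "spread"   # default is "binpack"
  }
}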

@github-actions

github-actions bot commented Nov 4, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 4, 2022