Simple share based scheduling #945
+1 Full disclosure: I am one of the colleagues @epipho mentioned. This would give us all the controls we need to balance a cluster, but without the weight of Kubernetes or Mesos (which are overkill for our needs).
+1 I am another of @epipho's colleagues, so I may be somewhat biased, but this solution would provide a good balance of ease of use and control, something that currently seems to be lacking in the space.
@epipho interesting proposal. I am a little unclear on how the job multiplier and share reservation would interact (e.g. do reservations always take precedence, and are multipliers only considered when reservations are absent or equal?). If you could also come up with an illustrative example, that would be really great.
I see it working something like the following. I chose the share counts and job multipliers to illustrate the math; in practice any number of jobs can have the same multiplier. Consider a cluster of two machines: one starts with 100 shares, the other with 77. They have been running for some time, and their current state looks like the table below.

| Machine | Shares | Job multipliers | Load |
|---------|--------|-----------------|------|
| 1 | 100 | 1, 2, 5, 10 | 18/100 = 0.18 |
| 2 | 77 | 3, 6, 7 | 16/77 ≈ 0.21 |

A new job is submitted with a job multiplier of 4. The scheduler first calculates what the load on each machine would be if the job were scheduled there. The load is only relative; no amount of current load will prevent the job from being scheduled somewhere.

- Machine 1: (1+2+5+10+4)/100 = 0.22
- Machine 2: (3+6+7+4)/77 ≈ 0.26

0.22 < 0.26, so machine 1 is chosen. Current state:

| Machine | Shares | Job multipliers | Load |
|---------|--------|-----------------|------|
| 1 | 100 | 1, 2, 5, 10, 4 | 22/100 = 0.22 |
| 2 | 77 | 3, 6, 7 | 16/77 ≈ 0.21 |

Another job is submitted, but this one reserves 25 shares and has a multiplier of 1.

- Machine 1: (1+2+5+10+4+1)/(100-25) ≈ 0.307
- Machine 2: (3+6+7+1)/(77-25) ≈ 0.327

Machine 1 still wins the job, but its current share count is reduced since our new job consumed 25 shares. If this job finishes or exits, the shares will be returned. Current state:

| Machine | Shares | Job multipliers | Load |
|---------|--------|-----------------|------|
| 1 | 75 (25 reserved) | 1, 2, 5, 10, 4, 1 | 23/75 ≈ 0.31 |
| 2 | 77 | 3, 6, 7 | 16/77 ≈ 0.21 |

From now until the shares are returned, machine 1 is treated as a machine with only 75 shares instead of 100. Again a job comes in, this time with a multiplier of 7. Same pattern as the first one:

- Machine 1: (1+2+5+10+4+1+7)/(100-25) = 0.40
- Machine 2: (3+6+7+7)/77 ≈ 0.30

This time machine 2 wins the job, leaving our cluster in the final state of:

| Machine | Shares | Job multipliers | Load |
|---------|--------|-----------------|------|
| 1 | 75 (25 reserved) | 1, 2, 5, 10, 4, 1 | 23/75 ≈ 0.31 |
| 2 | 77 | 3, 6, 7, 7 | 23/77 ≈ 0.30 |
Hopefully this helps flesh out how jobs with reserved shares interact with the scheduler. *edit: fixed the broken table markdown*
Ping. Looking for more feedback on this.
@epipho sorry for the delay; bcwaldon is away at the moment and I really want to sit down with him to chat about this a bit.
Not a problem, just wanted to make sure it wasn't lost. I would also be happy to schedule a time to jump on IRC to chat in more detail once bcwaldon is back.
FWIW, here is our fork that does something quite similar, but includes memory constraints as well: I think your comment on #922 is the way to go; that way this sort of thing could just be added in as part of the chain.
We had a bit of a shakeup over here, but we are ready to get started soon. I talked to several team members, including @polvi at re:Invent last week, and they seemed enthusiastic about the concept.
With all the scheduling chatter I thought it was time to put in an idea that my colleagues and I have been thinking about for a while.
This proposal is related to #922 and #747.
The idea behind this scheduler proposal is to extend the currently very simple "flatten the number of jobs" scheduler without attempting to solve any dynamic resource scheduling problems. I believe this method would meet a large number of users' scheduling needs and could serve as the default scheduler for fleet.
Machine Shares
Each machine receives a fixed number of shares (default 1024) as a parameter to fleet at startup, either as an env variable/flag (e.g. in the style of FLEET_ETCD_SERVERS) or as a "well-known" metadata entry. Except for share reservation (see below), the exact values do not matter; they are only used for relative weighting of the number of jobs on each node.
A node with 2048 shares would receive twice the number of jobs as a node with 1024 shares. The total share count should be modifiable via the API, either through the machine resource or the metadata API proposed in #555.
A node with zero shares is not eligible to receive any new (non-global) jobs. If the entire cluster is reporting zero shares, no new jobs can be scheduled.
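If the "well-known metadata entry" route were taken, a machine's shares could piggyback on fleet's existing metadata mechanism. A sketch of what that might look like, where the `shares` key name is purely hypothetical:

```ini
# /etc/fleet/fleet.conf — "shares" is a hypothetical metadata key,
# reusing fleet's existing metadata option for machine attributes
metadata="shares=2048"
```

The same value could equivalently be supplied through the corresponding environment variable (`FLEET_METADATA="shares=2048"`).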
Use cases
Job multiplier
A new X-Fleet option (Multiplier or Weight) for weighing jobs against each other. The multiplier is a floating-point option, defaulting to 1.0. This value is then used when scheduling to balance heavier jobs against lighter ones, by ensuring each machine carries approximately the same "weight" of units.
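As a concrete illustration, a unit using the proposed option might look like the following. `Multiplier` is the option proposed here, not one fleet currently understands, and the service itself is just a placeholder:

```ini
# heavy-worker.service — "Multiplier" is the proposed option, not an existing one
[Service]
ExecStart=/usr/bin/docker run --rm busybox sleep 3600

[X-Fleet]
Multiplier=4.0
```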
Use cases
Share reservation
A new X-Fleet option (ReserveShares) to consume a number of shares from the machine for the duration of the job's run. Once the job has exited, the shares are returned to the pool. While the shares are tied up, the machine appears to have fewer shares than normal to the regular scheduler: a machine with 1024 shares running a job that reserves 256 shares would appear as a 768-share machine.
A job that reserves shares cannot be started if no machine reports an appropriate number of unconsumed shares, with the exception of global jobs (see Global Jobs).
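A unit using the proposed reservation option might look like this. Again, `ReserveShares` (and `Multiplier`) are the options proposed above, and the batch command is a placeholder:

```ini
# batch.service — "ReserveShares" is the proposed option name
[Service]
ExecStart=/usr/bin/docker run --rm busybox /bin/run-batch

[X-Fleet]
ReserveShares=256
Multiplier=1.0
```

On a 1024-share machine, scheduling this unit would leave 768 shares visible to the scheduler until the job exits.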
Use cases
Global Jobs
Global jobs present an edge case for this system, particularly where share reservation is concerned. To resolve the edge case the scheduler would have the following rules.
Wrap-up
Overall I think this would be a good addition to the default fleet scheduler. All options are opt-in; if none are set, the behavior stays exactly as it is today, since all machines would have the same number of shares, all jobs would be weighted at 1.0, and no shares would be reserved.
I also think this is in line with the goals of fleet: it is easy to use and provides significant power without trying to be everything for everyone.
Investigations on what it would take to implement all three features are currently underway but I wanted to engage with the CoreOS developers and the community before I went too far down a path.
I look forward to any and all feedback.