Experimental planner PR #2: twindow replacement #233

Closed
wants to merge 6 commits into flux-framework:master from dongahn:planner_twindow

Conversation

@dongahn (Member) commented Jan 3, 2017

This second experimental PR is strictly for @lipari. I replaced the linear twindow search with the planner API, which should be more scalable.

I confirmed that all of the unit and integration tests pass with this commit. But I will ultimately want to run a larger emulation (e.g., borrowing from @SteVwonder's work) and make sure the planner-based scheduler produces schedules identical to those of the original twindow implementation.

As you and I discussed this morning, I decided to continue with this branch and add an initial version of scheduler-driven updates on top of it before requesting an official PR review. This is a bit of a plan change, as I initially planned to fork off a new matcher/visitor branch and implement the first version of the aggregate work there.

In any case, since I will have to work on other projects/tasks for a while, this is to inform you of my progress.

@coveralls

Coverage Status

Coverage increased (+1.0%) to 73.647% when pulling 4348bff on dongahn:planner_twindow into 8f43078 on flux-framework:master.

@codecov-io commented Jan 3, 2017

Codecov Report

Merging #233 into master will increase coverage by 1.59%.
The diff coverage is 68.39%.

@@            Coverage Diff             @@
##           master     #233      +/-   ##
==========================================
+ Coverage   55.05%   56.65%   +1.59%     
==========================================
  Files          25       30       +5     
  Lines        5184     6019     +835     
  Branches     1168     1389     +221     
==========================================
+ Hits         2854     3410     +556     
- Misses       1626     1811     +185     
- Partials      704      798      +94
Impacted Files                          Coverage Δ
resrc/resrc.h                           100% <ø> (ø)
src/common/libutil/interval_tree.c      0% <ø> (ø)
resrc/resrc_flow.c                      32.19% <ø> (ø)
src/common/libutil/rbtree.h             100% <100%> (ø)
resrc/planner.c                         64.29% <64.29%> (ø)
resrc/resrc_reqst.c                     74.54% <66.66%> (-0.76%)
src/common/libutil/rbtree_augmented.h   68.51% <68.51%> (ø)
src/common/libutil/rbtree.c             76.87% <76.87%> (ø)
resrc/resrc.c                           67.83% <80%> (-0.83%)
sched/sched.c                           57.04% <84.37%> (+0.74%)
... and 5 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7b95852...419038d.

@dongahn (Member Author) commented Jan 14, 2017

@lipari: I spent some time today mapping out a few initial steps for scheduler-driven aggregate updates. Here is what I sketched -- so that I can get your feedback if we have a chance to talk next week.

resrc

Start simple and place the planner-based aggregate structure on only a small set of resource types (see the sketch after this list):

  • At the resrc vertex of cluster type, maintain aggregates on nodes and cores under its subtree;
  • At each resrc vertex of node type, maintain aggregates on cores under its subtree;
  • Initialize and populate these aggregate planners at resrc generation time;
  • Avoid using graph elements (e.g., BW and power), for now.
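
To make the intent concrete, here is a minimal sketch of what attaching aggregate planners at resrc generation time could look like. The type and function names (planner_new, resrc_set_aggregate_planner) and the node/core totals are assumptions for illustration, not the actual planner.h or resrc.h API in this branch.

    #include <string.h>
    #include <stdint.h>

    /* Illustrative types/APIs only -- assumed for this sketch, not the
     * signatures in resrc/planner.h or resrc/resrc.h. */
    typedef struct planner planner_t;
    typedef struct resrc resrc_t;
    extern planner_t *planner_new (int64_t base_time, uint64_t duration,
                                   const uint64_t *totals,
                                   const char **types, size_t len);
    extern void resrc_set_aggregate_planner (resrc_t *r, planner_t *p);
    extern const char *resrc_type (resrc_t *r);

    /* Called on each vertex at resrc generation time: only cluster- and
     * node-level vertices get an aggregate planner. */
    static void attach_aggregate (resrc_t *vertex, int64_t t0,
                                  uint64_t horizon)
    {
        if (!strcmp (resrc_type (vertex), "cluster")) {
            const char *types[] = { "node", "core" };
            uint64_t totals[] = { 128, 128 * 16 };   /* subtree totals */
            resrc_set_aggregate_planner (vertex,
                planner_new (t0, horizon, totals, types, 2));
        } else if (!strcmp (resrc_type (vertex), "node")) {
            const char *types[] = { "core" };
            uint64_t totals[] = { 16 };
            resrc_set_aggregate_planner (vertex,
                planner_new (t0, horizon, totals, types, 1));
        }
    }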

resrc_reqst

Instead of computing the aggregate requirements of the resrc_reqst by walking its hierarchy at each match check, compute the node/core aggregate requirements upfront and stick them into resrc_reqst at certain levels (see the sketch after this list):

  • Stick the total node and core count requirement in aggregate to the top-level request vertex;
  • Stick the core count requirement in aggregate to the node-level request vertex.
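
A rough idea of what the precomputed aggregates on the request side could look like; the struct and field names below are hypothetical, not the actual resrc_reqst internals.

    #include <stdint.h>

    /* Hypothetical aggregate block attached to a request vertex; the real
     * request structure in this branch is resrc_reqst_t. */
    typedef struct {
        uint64_t nnodes;   /* total nodes required under this vertex */
        uint64_t ncores;   /* total cores required under this vertex */
    } req_aggregate_t;

    /* Compute the aggregates once, when the request is built, rather than
     * re-walking the request hierarchy at every match check.  The
     * top-level request vertex keeps both counts; a node-level request
     * vertex would keep only the core count. */
    static req_aggregate_t compute_job_aggregate (uint64_t nnodes,
                                                  uint64_t cores_per_node)
    {
        req_aggregate_t agg;
        agg.nnodes = nnodes;
        agg.ncores = nnodes * cores_per_node;
        return agg;
    }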

match_resource

Augment match_resource to query the aggregate planner and avoid unnecessary resrc tree descent (a hedged sketch follows).
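
A sketch of the pruning check that match_resource could perform; planner_avail_during, resrc_aggregate_planner, and the request layout are assumptions for illustration, not the planner API shipped in this PR.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stddef.h>

    /* Assumed handles and API, mirroring the sketches above. */
    typedef struct planner planner_t;
    typedef struct resrc resrc_t;
    extern planner_t *resrc_aggregate_planner (resrc_t *r);
    extern bool planner_avail_during (planner_t *p, int64_t at, uint64_t d,
                                      const uint64_t *request, size_t len);

    /* Return false to prune the entire subtree under 'vertex' without
     * descending into it; return true when a descent is still required. */
    static bool subtree_may_satisfy (resrc_t *vertex, uint64_t nnodes,
                                     uint64_t ncores, int64_t starttime,
                                     uint64_t walltime)
    {
        planner_t *p = resrc_aggregate_planner (vertex);
        if (!p)
            return true;   /* no aggregate info here: must walk the subtree */
        uint64_t request[] = { nnodes, ncores };
        /* If the aggregates under this vertex cannot cover the request in
         * [starttime, starttime + walltime), skip the whole subtree. */
        return planner_avail_during (p, starttime, walltime, request, 2);
    }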

Insertion point at which the scheduler updates the aggregate planners right after a job is successfully scheduled

I still need to do some hacking to find the appropriate place. I initially envisioned doing this at job tagging time, but I didn't have a chance to look at the code today.

Performance testers and target

  • Introduce a large-graph stress test -- in general we should benefit from adding another resrc unit test with a large cluster configuration and common walk patterns (e.g., a large graph with available resources varying from many to few);
  • Significantly speed up, and scale to, scheduling through many unsatisfiable high-priority jobs in the pending queue;
  • Significantly speed up, and scale to, the case where many jobs need to complete and free resources before a large job can be scheduled under the conservative backfill policy.

Correctness

I will certainly utilize existing tests and make sure all tests pass. But test cases to validate/verify the correctness of our scheduling algorithms are still somewhat limited, and it feels like we need some solutions there.

@lipari (Contributor) commented Jan 17, 2017

@dongahn, the resource matching implementation that currently exists is a brute-force approach. Its advantage is that it is very versatile in the complex resource requests it can match. The drawback is the performance penalty of visiting each child resource individually.
It is therefore ripe for optimization, and your efforts are welcome. My only request at this point is that you think of a way to maintain the old brute-force approach as a fall-back option. If we go too deep in converting to an aggregate solution, we may someday discover a resource request that is complex enough to be unsatisfiable using the optimized code, but possible using the original brute-force method.
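
One way to honor that request, sketched under assumed names (the toggle and the two match entry points below are hypothetical): keep both paths behind a runtime switch so the brute-force walk can be re-enabled for debugging and cross-checking.

    #include <stdbool.h>

    typedef struct resrc resrc_t;
    typedef struct resrc_reqst resrc_reqst_t;

    /* Hypothetical toggle, e.g. settable via a sched module option. */
    static bool use_aggregate_pruning = true;

    extern bool match_resource_brute_force (resrc_t *r, resrc_reqst_t *req);
    extern bool match_resource_with_aggregates (resrc_t *r, resrc_reqst_t *req);

    static bool match_resource (resrc_t *r, resrc_reqst_t *req)
    {
        if (use_aggregate_pruning)
            return match_resource_with_aggregates (r, req);
        /* Fall back to the original exhaustive walk when the optimized
         * path is disabled (or suspected of missing a valid match). */
        return match_resource_brute_force (r, req);
    }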

@dongahn (Member Author) commented Jan 17, 2017

@lipari: Thanks. Yes, this is a great idea for debugging and verification!

@coveralls

Coverage Status

Coverage increased (+1.0%) to 73.639% when pulling 9732495 on dongahn:planner_twindow into 8f43078 on flux-framework:master.

@coveralls

Coverage Status

Coverage increased (+1.0%) to 73.673% when pulling f62705b on dongahn:planner_twindow into 8f43078 on flux-framework:master.

Pull in the Linux red-black tree and interval tree, adapted
for user-space use.

The original repo is at https://github.com/markfasheh/interval-tree

Add the planner class, a simple API and efficient mechanisms to allow
a Flux scheduler to keep track of the state of resource aggregates
of a composite resource.

[From its header file]

Planner provides a simple API and efficient mechanisms to allow
a Flux scheduler to keep track of the state of resource aggregates
of a composite resource.

In a resource hierarchy used by flux-sched (e.g., hardware
hierarchy), a composite resource is represented as a tree graph
in which a higher-level vertex essentially has pointers to its
immediate child resources, each of which also has pointers to
its immediate children, etc. With such an organization, the
scheduler must essentially walk "all" of the vertices below any
composite resource in order to determine if the "sub-resources"
requirement can be met.

When the scheduler performs such walks excessively, in particular
on a large graph, this can quickly become a performance and
scalability bottleneck. Planner addresses this problem by allowing
the scheduler to track the "sub-resources" summary information
(i.e., aggregates) efficiently at each upper-level composite
resource vertex and to use this aggregate information to prune
unnecessary descent into the subtree.

Planner offers update and query APIs to support these schemes.
Through a planner API, the scheduler can ask a high-level composite
a question: "given a request of x, y, z "sub-resources" in aggregate
for d time units, when is the earliest time t at which this request
can be satisfied?"
Another example would be to answer, "from time t to t+d, does
this composite resource vertex have y, z sub-resources available
in aggregate?" By composing these queries at different levels in a
resource hierarchy, the scheduler can significantly reduce the
number of tree walks.  Ultimately, planner will be integrated
into our preorder tree-walk pruning filter in our future
visitor-pattern-based resource matching scheme.
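
As an illustration of the two query styles described above, here is a hedged sketch; the function names and signatures (planner_avail_time_first, planner_avail_during) are assumptions for this example, not necessarily the API exported by planner.h in this PR.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stddef.h>

    typedef struct planner planner_t;

    /* Assumed query API for illustration. */
    extern int64_t planner_avail_time_first (planner_t *p, uint64_t duration,
                                             const uint64_t *request,
                                             size_t len);
    extern bool planner_avail_during (planner_t *p, int64_t at,
                                      uint64_t duration,
                                      const uint64_t *request, size_t len);

    static void example_queries (planner_t *cluster_planner)
    {
        const uint64_t request[] = { 4, 64 };   /* e.g., 4 nodes, 64 cores */

        /* "Given a request of x, y sub-resources in aggregate for d time
         * units, when is the earliest time t it can be satisfied?" */
        int64_t t = planner_avail_time_first (cluster_planner, 3600,
                                              request, 2);

        /* "From time t to t+d, does this composite vertex have x, y
         * sub-resources available in aggregate?" */
        bool ok = planner_avail_during (cluster_planner, t, 3600, request, 2);
        (void)ok;
    }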

Replace resrc's twindow with the planner-based API.

Two things to note:

1. Because of an API mismatch between
resrc's "avail" functions and planner's, this commit
changes the signature of the resrc "avail" functions a bit
and adjusts the resrc test cases.

2. Because there is not yet a use case for resrc's copy
constructor, this commit doesn't make a deep copy of
the planner-based twindow member. The exact semantics
of the copy constructor need to be discussed.

Add get_resrc_reqst and move the logic that fills resrc_reqst into
this function.

Enhance the readability of schedule_job, which has grown somewhat large.

More importantly, this prepares us to encode resource aggregate
info without having to modify the main schedule_job function.
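
A minimal sketch of the refactoring shape described here, with assumed signatures (only get_resrc_reqst is named in the commit; the other helpers and the job type are placeholders): schedule_job delegates request construction to get_resrc_reqst, which then becomes the natural place to later attach aggregate info.

    #include <stdint.h>

    typedef struct resrc_reqst resrc_reqst_t;
    typedef struct flux_lwj flux_lwj_t;   /* job record, as in sched.c */

    /* Assumed helper: builds (and later can annotate with aggregates) the
     * resource request for a job at a given start time. */
    extern resrc_reqst_t *get_resrc_reqst (flux_lwj_t *job, int64_t starttime);
    extern void resrc_reqst_destroy (resrc_reqst_t *reqst);
    extern int schedule_with_reqst (flux_lwj_t *job, resrc_reqst_t *reqst);

    static int schedule_job (flux_lwj_t *job, int64_t starttime)
    {
        int rc = -1;
        resrc_reqst_t *reqst = get_resrc_reqst (job, starttime);
        if (reqst) {
            rc = schedule_with_reqst (job, reqst);
            resrc_reqst_destroy (reqst);
        }
        return rc;
    }
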
@coveralls

Coverage Status

Coverage increased (+0.8%) to 73.477% when pulling 419038d on dongahn:planner_twindow into 7b95852 on flux-framework:master.

@dongahn (Member Author) commented Sep 5, 2017

As discussed with @morrone and @tpatki today, I'm closing this in preparation for a new planner-only PR.
