Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for resource matching service #386

Merged
merged 18 commits into from
Sep 20, 2018

Conversation

dongahn
Copy link
Member

@dongahn dongahn commented Sep 1, 2018

This PR addresses all of the comments I had received on the posting of the previous experimental PR.

  • Refactor, augment, modify the resource scheduling infrastructure used by both the resource module and resource-query utility.

  • Add the resource matching service module that uses this infrastructure to populate the resource graph data store and expose four request handlers (match, cancel, info, and next_jobid) next_jobid is only needed for testing.

  • Add flux-resource.py as the front-end command that can interact with resource module. This is essential to drive our tests when we don't have the upcoming schedule loop service.

Don't merge this yet as we have a few more steps before landing this. The proposed remaining steps are:

  1. Add tests. They will focus on testing the handler and RPC logic as I prefer using resource-query for the general correctness and performance tests for the resource scheduling infrastructure (which I already have a bunch).

  2. Squash bugs we found during adding tests

  3. Land this

  4. Merge in @SteVwonder's PR resource: support for hwloc ingestion #385 so that resource module can be populated with hwloc data.

  5. Add rc1 and rc3 support with the hwloc mode being default. (Another PR).

If you want to play with this:

copy t/data/resource/jobspecs/basics/test001.yaml and t/data/resource/grugs/tiny.graphml into a directory first.

% flux start -s 1 -o,-S,log-filename=out
% flux module load resource grug-conf=tiny.graphml subsystems=containment policy=high
% flux resource match allocate test001.yaml
JOBID                STATUS               AT                   OVERHEAD (Secs)
0                    ALLOCATED            Now                  0.00181102752686
================================================================================
MATCHED RESOURCES:
      ---------------core35[1:x]
      ------------socket1[1:x]
      ---------node1[1:s]
      ------rack0[1:s]
      ---tiny0[1:s]

% flux resource match allocate test001.yaml
JOBID                STATUS               AT                   OVERHEAD (Secs)
1                    ALLOCATED            Now                  0.000900983810425
================================================================================
MATCHED RESOURCES:
      ---------------core17[1:x]
      ------------socket0[1:x]
      ---------node1[1:s]
      ------rack0[1:s]
      ---tiny0[1:s]

% flux resource match allocate test001.yaml
JOBID                STATUS               AT                   OVERHEAD (Secs)
2                    ALLOCATED            Now                  0.000859975814819
================================================================================
MATCHED RESOURCES:
      ---------------core35[1:x]
      ------------socket1[1:x]
      ---------node0[1:s]
      ------rack0[1:s]
      ---tiny0[1:s]

% flux resource match allocate test001.yaml
JOBID                STATUS               AT                   OVERHEAD (Secs)
3                    ALLOCATED            Now                  0.000686883926392
================================================================================
MATCHED RESOURCES:
      ---------------core17[1:x]
      ------------socket0[1:x]
      ---------node0[1:s]
      ------rack0[1:s]
      ---tiny0[1:s]

% flux resource match allocate test001.yaml
MATCHED RESOURCES: Not found

% flux resource info 3
JOBID                STATUS               AT                   OVERHEAD (Secs)
3                    ALLOCATED            Now                  0.000686883926392

% flux resource match allocate_orelse_reserve test001.yaml
JOBID                STATUS               AT                   OVERHEAD (Secs)
4                    RESERVED             3600                 0.00125002861023
================================================================================
MATCHED RESOURCES:
      ---------------core35[1:x]
      ------------socket1[1:x]
      ---------node1[1:s]
      ------rack0[1:s]
      ---tiny0[1:s]

@coveralls
Copy link

coveralls commented Sep 1, 2018

Coverage Status

Coverage increased (+0.2%) to 77.745% when pulling 170f972 on dongahn:matching-service into 3f36f33 on flux-framework:master.

@codecov-io
Copy link

codecov-io commented Sep 1, 2018

Codecov Report

Merging #386 into master will increase coverage by 0.12%.
The diff coverage is 72.67%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master    #386      +/-   ##
=========================================
+ Coverage   75.88%     76%   +0.12%     
=========================================
  Files          59      64       +5     
  Lines        9829   10175     +346     
=========================================
+ Hits         7459    7734     +275     
- Misses       2370    2441      +71
Impacted Files Coverage Δ
resource/schema/sched_data.cpp 61.76% <ø> (ø) ⬆️
resource/evaluators/scoring_api.cpp 69.02% <ø> (+0.7%) ⬆️
resource/policies/dfu_match_low_id_first.cpp 83.78% <ø> (ø) ⬆️
resource/traversers/dfu_impl.cpp 82.38% <ø> (+0.74%) ⬆️
resource/schema/infra_data.cpp 40.74% <ø> (+5.55%) ⬆️
resource/policies/base/matcher.hpp 100% <ø> (ø) ⬆️
resource/policies/dfu_match_locality.cpp 0% <ø> (ø) ⬆️
resource/evaluators/fold.hpp 14.28% <ø> (ø) ⬆️
resource/generators/gen.cpp 84.61% <ø> (+1.09%) ⬆️
resource/policies/dfu_match_high_id_first.cpp 86.48% <ø> (+2.7%) ⬆️
... and 29 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3f36f33...170f972. Read the comment docs.

@dongahn dongahn force-pushed the matching-service branch 3 times, most recently from ea54bfa to f5b4f1f Compare September 2, 2018 16:44
@dongahn
Copy link
Member Author

dongahn commented Sep 2, 2018

@SteVwonder: Ok, I added a few test cases and adjusted minor things (e.g., changing flux-resource.py to flux-resource. Of all the planned test cases, I decide to add resource load option tests after we merge your hwloc PR. In a near future, I will also need a valgrind test as well. When Travis CI turns green, this is ready for your review.

@dongahn
Copy link
Member Author

dongahn commented Sep 2, 2018

Requested @tpatki as a reviewer as well.

@dongahn
Copy link
Member Author

dongahn commented Sep 3, 2018

Found ./flux-sched-0.6.0-17-g2d7394d/_build/sub/t/t4001-match-allocate.output
expecting success: 
    flux module load resource grug-conf=${grug} \
subsystems=containment policy=high
flux-module: resource: not found in module search path
not ok 1 - loading resource module with a tiny machine config works

Somehow make distcheck doesn't find the new resource module (and also flux-resource command). make distcheck works on quartz so this could have something to do with the lib tool version.

I wonder if this is related to the fact that I have the containing directory specified in the dependency...

fluxmod_LTLIBRARIES = modules/resource.la

@tpatki
Copy link
Member

tpatki commented Sep 4, 2018

@dongahn: I'll test this tmrw in detail, just sending a note as a quick update.

@dongahn
Copy link
Member Author

dongahn commented Sep 6, 2018

Hmmm. I moved resource module build rules to its own directory (resource/modules/Makefile.am) But I am still getting:

Found ./flux-sched-0.6.0-19-g2961643/_build/sub/t/t4001-match-allocate.output
expecting success: 
    flux module load resource grug-conf=${grug} \
subsystems=containment policy=high
flux-module: resource: not found in module search path
not ok 1 - loading resource module with a tiny machine config works
expecting success: 
    flux resource match allocate ${jobspec} &&
    flux resource match allocate ${jobspec} &&
    flux resource match allocate ${jobspec} &&
    flux resource match allocate ${jobspec}
flux: `resource' is not a flux command.  See 'flux --help'
not ok 2 - match-allocate works with a 1-node, 1-socket jobspec

Probably missing something more fundamental...

@dongahn
Copy link
Member Author

dongahn commented Sep 6, 2018

Seems that I just needed to add resource/modules/.libs to FLUX_MODULE_PATH_PREPEND as make distcheck runs in-build check. Hopefully this will do.

FLUX_MODULE_PATH_PREPEND="${SHARNESS_BUILD_DIRECTORY}/sched/.libs"
FLUX_EXEC_PATH_PREPEND="${SHARNESS_BUILD_DIRECTORY}/sched"
FLUX_MODULE_PATH_PREPEND="${SHARNESS_BUILD_DIRECTORY}/sched/.libs:${SHARNESS_BUILD_DIRECTORY}/resource/modules/.libs"
FLUX_EXEC_PATH_PREPEND="${SHARNESS_BUILD_DIRECTORY}/sched:${SHARNESS_BUILD_DIRECTORY}/resource/modules"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since your flux-resource utility is a script and not a built program, I think you will want to point to ${SHARNESS_TEST_SRCDIR}/../resource/modules here and not to the "build" directory.

@dongahn
Copy link
Member Author

dongahn commented Sep 6, 2018

Ah. That makes a lot of sense and likely resolve the current issue. Thanks @grondo!

@grondo
Copy link
Contributor

grondo commented Sep 6, 2018

@dongahn, is the flux-resource utility meant to be a first class flux utility, or more of a test or "plumbing" utility? If the latter, I suggest renaming it (if it is installed) so that it does not conflict with a future, user-facing flux-resource utility or similar. If it is to be a first-class utility, we may have to discuss how to make a standardized UI or something so that other schedulers can also provide a flux-resource interface.

Does that make any sense?

@dongahn
Copy link
Member Author

dongahn commented Sep 6, 2018

It meant to be a utility for testing for now.

In terms of first class scheduler UIs, what I would like to do is to get an end to end demonstration between the new exec system, R, this matching service and the upcoming scheduler loop service first before standardizing user facing interfaces.

It is unclear if flux-resource should be the name for the ultimate UI though. I envision something like flux sched which interacts with both the scheduler loop and resource services to do status and control on the scheduler components. Then, other scheduler with a single module design can also easily satisfy the same UI.

For now, I'd be happy to change the name toflux-resource-query or flux-resource-test

FLUX_MODULE_PATH_PREPEND="${SHARNESS_BUILD_DIRECTORY}/sched/.libs"
FLUX_EXEC_PATH_PREPEND="${SHARNESS_BUILD_DIRECTORY}/sched"
FLUX_MODULE_PATH_PREPEND="${SHARNESS_BUILD_DIRECTORY}/sched/.libs:${SHARNESS_BUILD_DIRECTORY}/resource/modules/.libs"
FLUX_EXEC_PATH_PREPEND="${SHARNESS_BUILD_DIRECTORY}/sched:${SHARNESS_BUILD_DIRECTORY}/../resource/modules"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For FLUX_PATH_PREPEND, try s/SHARNESS_BUILD_DIRECTORY/SHARNESS_TEST_SRCDIR/.

Also, if the flux-resource utility is just for testing, you could save some trouble by placing it under ./t (we have some test command like this in flux-core)

@dongahn dongahn force-pushed the matching-service branch 3 times, most recently from 7024402 to 0f3d8b0 Compare September 7, 2018 00:18
Pick this up from flux-core.

Fix a side-effect where eval turns the semicolon
into a command separator.
Use the factory pattern to allow the upper layer to easily
instantiate an object of different match policy class.
Factor out the jobinfo data structure and
a utility function to a new source file
as this will be used by both resource-query
and upcoming resource module.

Adjust resource-query code as well.

Update Makefile.am
Name: resource.

Use the resource scheduler infrastructure code to populate
the resource graph database.

Use the request handlers in place of the control
loop in resource-query to receive various match
requests: e.g., match allocate and allocate_orelse_reserve.

Receive a jobspec directly from the RPC message and
return the matched R in response.

Note that this will likely change later, getting
some handle from the user (e.g., schedule loop service)
of this service to find the jobspec from KVS and
put the matched R into KVS instead of sending
it back in the response message.
Make a convenience library out of the resource
scheduling infrastructure.  This reduces
the compilation time.

Add rules for resource module, building it against
the convenience library.

Update rules for resource-query to build
against the same library
@dongahn dongahn force-pushed the matching-service branch 3 times, most recently from 6f2a716 to 01380e0 Compare September 11, 2018 22:11
@dongahn
Copy link
Member Author

dongahn commented Sep 11, 2018

Thank you all for reviewing this PR.

I decided to export the HOME environment variable in the new test scripts so that we don't have to hit this problem next time sharness.sh is pulled from the upstream.

Travis CI is green now.

In addition, I believe I've also addressed all of the great review comments from @grondo, @SteVwonder and @trws. Please let me know if there is anything else you want me to address.

From my perspective, this can go in.

Copy link
Member

@SteVwonder SteVwonder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One more minor change. Otherwise this looks ready to go in to me.

t/scripts/flux-resource Outdated Show resolved Hide resolved
Add this command primarily for testing.

Add several sub-commands matching the request handlers
within the resource module.

Use python's argparse module with Action class to
register a handler per each sub-commmand.

Support match allocate and allocate_orelse_reserve,
info and cancel sub-commands.
Need for both in-build testing and install testing.
Test the correctness of the match(-allocate) handler within
the resource module using flux-resource front-end command.

Add data/resource/jobspecs/basics/bad.yaml, a malformed jobspec
yaml file to test if flux-resource handles a Yaml exception
gracefully.

Use the original HOME environment variable as sharness
interferes with Python's module search path.
Python uses HOME to determine its local site-package path,
and changing this leads to an incorrect search path.

Update the Makefile rule.
Test the functionality of the match(-allocate_orelse_reserve)
handler within the resource module using flux-resource
front-end command.

Use the original HOME environment variable as sharness
interferes with Python's module search path.
Python uses HOME to determine its local site-package path,
and changing this leads to an incorrect search path.

The Makefile rule updated.
Test the functionality of the cancel and info handlers within
resource module using flux-resource front-end command.

Use the original HOME environment variable as sharness
interferes with Python's module search path.
Python uses HOME to determine its local site-package path,
and changing this leads to an incorrect search path.

Update Makefile rule.
@dongahn
Copy link
Member Author

dongahn commented Sep 12, 2018

OK. Forced a push!

Copy link
Member

@SteVwonder SteVwonder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @dongahn. LGTM! Anyone else have any feedback? If there is no more feedback, I will push the button tomorrow.

@dongahn
Copy link
Member Author

dongahn commented Sep 20, 2018

Ping?

@SteVwonder
Copy link
Member

My bad. Merging now.

@SteVwonder SteVwonder merged commit c27dc14 into flux-framework:master Sep 20, 2018
@dongahn
Copy link
Member Author

dongahn commented Sep 20, 2018

Thanks @SteVwonder!

@dongahn dongahn deleted the matching-service branch July 13, 2019 21:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants