Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cache R in the job manager #5472

Merged
merged 21 commits into from
Oct 25, 2023
Merged

cache R in the job manager #5472

merged 21 commits into from
Oct 25, 2023

Conversation

garlick
Copy link
Member

@garlick garlick commented Sep 26, 2023

Problem: jobtap plugins might need convenient access to R per #5471

Per discussion, this prototypes keeping a copy of R in the job manager so that it could potentially be offered to jobtap plugins. Some notes/caveats:

  • R is not deleted after the job becomes inactive, which could be a potential optimization to conserve memory on JGF systems
  • jobtap interfaces are unchanged (I thought @grondo might already have a good idea of how he wants to do that)
  • if this is acceptable, RFC 27 will need updates to sched.alloc and job-manager.sched-hello responses.
  • this is set up so that job manager restart could replay resource-update events on top of the original R loaded from the KVS
  • R could be passed to job-exec.start to save a lookup (would require an update to RFC 32). (Edit: oops I included a couple of commits that started this - maybe I'll just finish it and tack it on)

@garlick
Copy link
Member Author

garlick commented Sep 26, 2023

Oops one place where job->R won't be populated is with alloc-bypass.

I made some sloppy errors caught by CI and will force push with fixes in a sec as soon as tests pass locally.

@grondo
Copy link
Contributor

grondo commented Sep 26, 2023

Great!

R is not deleted after the job becomes inactive, which could be a potential optimization to conserve memory on JGF systems

A quick thought here - should the scheduling key of R be redacted in a similar vein to the environment attribute of jobspec? I don't see a use to keeping scheduler opaque data in memory of the job manager...

jobtap interfaces are unchanged (I thought @grondo might already have a good idea of how he wants to do that)

All I was thinking was to add "R" to the callback input "args" when it is available (i.e. callbacks after the alloc event)

@@ -177,8 +182,10 @@
}
}

static void start_response_cb (flux_t *h, flux_msg_handler_t *mh,
const flux_msg_t *msg, void *arg)
static void start_response_cb (flux_t *h,

Check warning

Code scanning / CodeQL

Poorly documented large function Warning

Poorly documented function: fewer than 2% comments for a function of 129 lines.
@grondo
Copy link
Contributor

grondo commented Sep 26, 2023

Oops one place where job->R won't be populated is with alloc-bypass.

I didn't check the code here yet, does the scheduler still commit R to the KVS and send it back in the alloc response in parallel, or does the job-manager now commit R? The alloc-bypass plugin has to commit the provided R to the KVS itself, but if the job-manager already does this on behalf of the scheduler, perhaps we just need a way to allow a plugin to assign R and then we can drop this step and the job manager will have the copy of R in this case. (Oh, wait, won't this be necessary anyway if R is sent to the job-exec module with the start request?)

@garlick
Copy link
Member Author

garlick commented Sep 26, 2023

A quick thought here - should the scheduling key of R be redacted in a similar vein to the environment attribute of jobspec? I don't see a use to keeping scheduler opaque data in memory of the job manager...

That was true before, but after this PR, the job-manager.sched-hello streaming responses include a copy of R from memory. My thought was if R has been modified, then we had better give the scheduler the updated copy. For example, if the expiration has changed, the scheduler needs to re-allocate the resources for the updated time. Plus it saves a KVS lookup.

I'm not sure if fluxion relies on this. Maybe (R is passed to queue->reconstruct() )
https://github.com/flux-framework/flux-sched/blob/master/qmanager/modules/qmanager_callbacks.cpp#L159

@garlick
Copy link
Member Author

garlick commented Sep 26, 2023

does the scheduler still commit R to the KVS and send it back in the alloc response in parallel

Yes. That seemed the safest since I think we've advertised the invariant that once the alloc event is posted to the eventlog, R can be read from the KVS.

The alloc-bypass plugin has to commit the provided R to the KVS itself, but if the job-manager already does this on behalf of the scheduler, perhaps we just need a way to allow a plugin to assign R and then we can drop this step and the job manager will have the copy of R in this case. (Oh, wait, won't this be necessary anyway if R is sent to the job-exec module with the start request?)

It does not already do it so alloc-bypass would need a way to assign R, and would need to retain the KVS commit also.

Edit: well actually that invariant could be maintained if we committed R from the job-manager, so that's not necessarily a reason to do it this way if we need to reconsider.

@grondo
Copy link
Contributor

grondo commented Sep 26, 2023

That was true before, but after this PR, the job-manager.sched-hello streaming responses include a copy of R from memory. My thought was if R has been modified, then we had better give the scheduler the updated copy. For example, if the expiration has changed, the scheduler needs to re-allocate the resources for the updated time. Plus it saves a KVS lookup.

All very true! However, it saves a KVS lookup at restart only, with a tradeoff of keeping JGF for every job (including inactive jobs) and we currently don't even support restart with running jobs (though I guess we do support reloading the scheduler). If the scheduler continued to lookup R in the KVS, but instead used the new job-info lookup which incorporates resource-update events, then this would just work and R could be redacted. I'm not saying that's the right solution, but it should be considered?

Yes. That seemed the safest since I think we've advertised the invariant that once the alloc event is posted to the eventlog, R can be read from the KVS.

True. Though it would presumably be easy to delay the alloc event in the job manager until the commit of R is complete. It actually seems fine to me do it this way for now though (probably faster too)

It does not already do it so alloc-bypass would need a way to assign R, and would need to retain the KVS commit also.

Yes, that is what I was proposing, a way for a plugin to assign R. Not sure what that would look like. Sounds like if we send R in the start request this will be necessary to keep alloc-bypass functionality though?

@garlick
Copy link
Member Author

garlick commented Sep 26, 2023

it saves a KVS lookup at restart only, with a tradeoff of keeping JGF for every job

True!

If the scheduler continued to lookup R in the KVS, but instead used the new job-info lookup which incorporates resource-update events, then this would just work and R could be redacted.

Good point, let's do it that way!

Regarding the interface to R for jobtap plugins, I have a commit ready to push that adds an "R" key to the jobtap input args. I was going to try also adding "R" to the output args just for alloc-bypass, then if it's found there and job->R is unset, letting it be set. Hopefully that's not too big of a foot-gun, but I figured I'd try the simple thing first.

@garlick
Copy link
Member Author

garlick commented Sep 26, 2023

Re-pushed with the following changes

  • added support for R in jobtap IN and OUT args
  • updated alloc-bypass to set R in OUT args
  • Include R in the job-exec.start request
  • Dropped the schedutil hello changes (R is fetched from the KVS, which we can easily change to job-info when it provides the modified R.
  • rebased on current master

@garlick
Copy link
Member Author

garlick commented Sep 26, 2023

instead used the new job-info lookup which incorporates resource-update events

That's not available yet, correct?

@grondo
Copy link
Contributor

grondo commented Sep 26, 2023

That's not available yet, correct?

PR posted in #5467 - unfortunately I just realized that job-info lookup RPC still returns the original R because the application of the resource-update events is done on the client side in flux-job. I wonder now if that was the right solution (though it does move processing out of the job-info module itself, it means users of the API have to apply the updates themselves? Maybe @chu11 can comment here.

Edit: In a way it would be nice if job-info.lookup just magically returned the updated R and jobspec.. however I did not look into this as deeply as @chu11 so not sure if there are roadblocks to doing it that way.

@chu11
Copy link
Member

chu11 commented Sep 27, 2023

PR posted in #5467 - unfortunately I just realized that job-info lookup RPC still returns the original R because the application of the resource-update events is done on the client side in flux-job. I wonder now if that was the right solution (though it does move processing out of the job-info module itself, it means users of the API have to apply the updates themselves? Maybe @chu11 can comment here.

apologies, there's actually two PRs so it may be confusing

#5467 - an update-watch service that will get the currently up to date "R" (w/ resource-update events applied) and then stream any new R changes that may occur in the future.

#5464 - flux job info <jobid> R gets the updated R, but does this on the client side. This is a mirror of flux job info <jobid> jobspec, to keep the burden off of the job-info module. This PR could be updated to use the service in #5467, but I elected to keep it out of the broker for the time being because these are likely "one off" lookups, vs the streaming updates are needed for various broker services. (and admittedly this PR was in development before issue #5451 was written )

Edit: In a way it would be nice if job-info.lookup just magically returned the updated R and jobspec.. however I did not look into this as deeply as @chu11 so not sure if there are roadblocks to doing it that way.

There is no roadblock that I can see. The entirety of why it was done client side in flux job info (and python equivalent code) is to keep the burden out of the job-info module. Some of my pro-con brainstorming here: #5411 (comment).

@grondo
Copy link
Contributor

grondo commented Oct 24, 2023

This PR would be useful in development of duration update for running jobs. The job-manager would not have to fetch R from the job-info service to apply a duration update. A cursory examination of this PR indicates it is in pretty good shape. @garlick, what remains to do here to remove the WIP? (Did we ever decide if the cached R should be freed for inactive jobs? Given that we have users wanting run 1M jobs, perhaps that would be a good idea...)

@garlick garlick changed the title WIP: cache R in the job manager cache R in the job manager Oct 24, 2023
@garlick
Copy link
Member Author

garlick commented Oct 24, 2023

This has been sitting for a while. A brief review didn't turn up any big gaps, that I caught anyway. Just force pushed with the following changes:

  • drop the scheduling key from R when the job transitions to INACTIVE state to save memory
  • drop the scheduling key from R when including it in the exec.start request
  • rebase on current master

@garlick
Copy link
Member Author

garlick commented Oct 24, 2023

Per offline discussion with @grondo, repushed with the R scheduling key redacted upon receipt. It's still in the KVS of course, and schedutil still looks it up there when the scheduler is reloaded.

Copy link
Contributor

@grondo grondo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Problem: libschedutil code sometimes breaks long function
parameter lists in blocks instead of one per line, which is
inconsistent with the modern flux code base.

Break function parameters one per line when the list cannot
fit on one line (except json pack/unpack key-value pairs).
Problem: alloc responses use a common function that is not
suited to adding one more optional parameter.

Refactor alloc.c to use a more general "pack" style utility
function.
Problem: schedutil_alloc_respond_success_pack() writes R to
the KVS then responds to the job-manager without R as required
by RFC 27, but in emerging cases, it may be handy to give jobtap
plugins access to R.

Include R in the alloc response.
Problem: resource_set_create uses a calloc() buffer without
checking for NULL.

Add check.
Problem: when R is passed to job-exec as a json object, it will
be convenient to have some rset interfaces to handle R as a
json object.

Add resource_set_create_fromjson ().
Add resource_set_get_json ().
Problem: there is no place to store R in the job manager
job object.

Add R to struct and call json_decref() from job destructor.
Problem: job_create_from_eventlog() does not accept R, but
this will be required once R is tracked by the job-manager.

Add R argument to job_create_from_eventlog().  This initializes
job->R before the eventlog is replayed, for the eventual support
of resource-update events.

Update unit test.
Problem: the scheduler now returns R in the sched.alloc
response, but the job manager ignores it.

Accept R and make it part of the job object.
Problem: lookup_job() repeats a block of code for each KVS
key it looks up, and adding one more raises the annoyance level.

Create local job data lookup helpers to reduce code duplication.
Problem: R is expected to be part of the job-manager job object
after allocation, but this is not the case for jobs loaded from
the KVS after a job manager restart.

Load R from the KVS, if available, and make it part of the job
object.
Problem: there is not a simple way to access the job manager's
in-memory copy of R for testing.

Allow R to be fetched with the job-manager.getattr RPC.
Problem: there are no tests that show the job manager holds
a copy of R.

Add a sharness test.  In the future it can cover resource-update
changes to R.
Problem: job-manager/start.c sometimes breaks long function
parameter lists in blocks instead of one per line, which is
inconsistent with the modern flux code base.

Break function parameters one per line when the list cannot
fit on one line (except json pack/unpack key-value pairs).
Problem: jobtap_call() defines 'rc' both at function scope
and within a local block.

Rename the variable in the local block so they don't confuse
anyone.
Problem: jobtap plugins may need access to R.

Add an "R" key to the input args if R has been allocated.
Problem: alloc-bypass needs to need set job->R.

Add an "R" key to the output args.  If present in output args
and preconditions are met, namely:
- callback is job.state.sched
- not already set in the job struct
then set job->R_redacted to the argument value.
Problem: alloc-bypass represents R internally as a string for
storage to the KVS, but soon we will need it in object form
for adding to the plugin output args.

Represent it internally as an object and use KVS interfaces
that accept an object.
Problem: alloc-bypass now needs to set R in the output args so
that the job manager's in-memory copy of R is set.

Set "R" in the output args of the job.state.sched callback.
Problem: job-exec fetches R from the KVS but we have it
now in the job manager so it could be included in the
exec.start request.

Add R to the exec.start request.
Problem: R is now present in the exec.start so job-exec
no longer needs to look it up.

Initialize job->R and critical ranks in the start request
handler rather than later in a continuation.
Problem: a job exec test that fakes an invalid R by
placing it in the KVS no longer works because R comes
directly from the job manager.

Remove the test for now.
@codecov
Copy link

codecov bot commented Oct 25, 2023

Codecov Report

Merging #5472 (4d7ee26) into master (fce7220) will decrease coverage by 0.06%.
The diff coverage is 73.29%.

@@            Coverage Diff             @@
##           master    #5472      +/-   ##
==========================================
- Coverage   83.47%   83.42%   -0.06%     
==========================================
  Files         487      487              
  Lines       81918    81990      +72     
==========================================
+ Hits        68381    68398      +17     
- Misses      13537    13592      +55     
Files Coverage Δ
src/common/libschedutil/ops.c 73.17% <100.00%> (ø)
src/common/libschedutil/ready.c 63.33% <100.00%> (ø)
src/modules/job-manager/job.c 89.16% <100.00%> (+0.20%) ⬆️
src/common/libschedutil/hello.c 70.00% <66.66%> (ø)
src/modules/job-manager/plugins/alloc-bypass.c 68.18% <83.33%> (ø)
src/modules/job-manager/restart.c 83.02% <96.15%> (+0.67%) ⬆️
src/modules/job-manager/start.c 72.78% <80.00%> (ø)
src/modules/job-exec/rset.c 88.34% <76.92%> (-1.87%) ⬇️
src/modules/job-manager/alloc.c 77.59% <62.50%> (-0.36%) ⬇️
src/modules/job-manager/getattr.c 65.15% <42.85%> (-0.96%) ⬇️
... and 3 more

... and 9 files with indirect coverage changes

@mergify mergify bot merged commit 9528de4 into flux-framework:master Oct 25, 2023
32 of 33 checks passed
@garlick garlick deleted the issue#5471 branch March 1, 2024 14:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants