Allow expiration of running jobs to be adjusted via `sched.expiration` RPC #1079

grondo · 2023-09-25T19:32:39Z

RFC 27 now defines a sched.expiration RPC which schedulers may implement to support adjustment to the duration/expiration of a running job.

This RPC should be implemented in Fluxion so that its internal endtime for jobs is synchronized with the job execution system when a job expiration is updated.

grondo · 2023-09-28T14:00:56Z

Summarizing a conversation from the team meeting: there is no need to reject a sched.expiration update that overlaps with an existing reservation (a job reservation, or a future DAT or downtime reservation). Since this expiration adjustment is considered "administrative", it should be applied to the job in question and the plan discarded. When DAT/downtime reservations are supported, those reservations should be allowed to "overlap" with jobs (the sysadmins will clean up running jobs in this case).

grondo · 2023-11-01T17:05:18Z

A WIP PR with flux-core support for updating the duration of running jobs is in flux-framework/flux-core#5522. It contains the implementation of the sched.expiration RPC to request the expiration update from the scheduler. Once that is merged, Fluxion will need support for that RPC, or the job manager update service will go ahead with the expiration update (it assumes the update is valid if the RPC fails with ENOSYS). This PR is planned to be merged before the Nov release.
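The ENOSYS fallback described above can be sketched as a small decision function. This is only an illustration, not the flux-core code: `update_decision` and `decide_expiration_update` are hypothetical names, and treating every error other than ENOSYS as a rejection is an assumption.

```cpp
#include <cassert>
#include <cerrno>

// Hypothetical result of consulting the scheduler about an expiration update.
enum class update_decision { apply, reject };

// rc:  0 if the sched.expiration RPC succeeded, -1 if it failed
// err: errno value reported with a failed RPC (ignored when rc == 0)
update_decision decide_expiration_update (int rc, int err)
{
    if (rc == 0)
        return update_decision::apply;  // scheduler accepted the update
    if (err == ENOSYS)
        return update_decision::apply;  // RPC not implemented: assume valid
    return update_decision::reject;     // assumption: any other error rejects
}
```

Under these assumptions, a scheduler that does not implement the RPC (ENOSYS) behaves the same as one that accepts every request, which is why updates proceed without Fluxion being notified.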

grondo · 2023-11-08T14:30:15Z

Update: flux-framework/flux-core#5522 was merged and released with flux-core v0.56.0. Until this issue is resolved, updates of running jobs will be allowed without notification to the Fluxion scheduler, which will have to adapt to the new time limits. In testing, this doesn't seem to be a critical issue, but it would probably be best if the sched.expiration RPC were supported by Fluxion eventually.

grondo · 2024-03-01T15:20:23Z

@milroy: As requested in this week's meeting, some references for implementation of this feature:

- The sched.expiration update RPC is defined in RFC 27.
- The initial implementation of the sched.expiration RPC in sched-simple can be found in flux-framework/flux-core@ae127d1.
- A longer description of how this all works is in the PR referenced above, "support duration update for running jobs" (flux-framework/flux-core#5522), as well as in the DURATION UPDATE OF A RUNNING JOB section of flux-update(1).

milroy · 2024-03-02T08:08:54Z

Thanks for consolidating the information @grondo!

milroy · 2024-03-10T08:24:10Z

RFC 27 states in Expiration:
"The request MAY fail, for example if: [...] The new expiration time would invalidate an advance reservation." Is an advance reservation a system reservation or a job reservation?

grondo · 2024-03-10T16:35:03Z

The intent is to prevent an expiration extension from overlapping an administrative reservation (however we end up implementing that), not just a normal reservation that's part of the current schedule plan (if that's what is meant by a job reservation).

However, I do recall @ryanday36 mentioning that admins can just kill jobs running on an administrative reservation if necessary, so maybe we don't actually need to worry about this for now?

milroy · 2024-03-23T23:25:50Z

I've implemented the basic expiration functionality in my fork, but am wondering how best to handle the RPC.

Using a simple relay like what's done with sched.resource-status is the most straightforward, but requires fairly extensive modification to the Fluxion planner. This is because the relay callback can be executed at any time in the scheduler loop, meaning that the allocated job that requires updated expiration can extend into one or more reservations (administrative or normal). The extension can render the reservations invalid, requiring them to be pushed back, which can create a cascade of reservation pushbacks. I've implemented much of the logic to handle the cascade and could complete the implementation with a bit more work.
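To make the cascade concrete, here is a toy model, not the Fluxion planner: it assumes a single resource with one timeline, reservations that keep their duration when moved, and hypothetical names (`span_t`, `push_back_cascade`). The real planner tracks many resources and would be considerably more involved.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical time span [start, end) for one reservation on one resource.
struct span_t {
    int64_t start;
    int64_t end;
};

// Push back every reservation that would overlap a running job whose
// expiration is extended to new_expiration. A moved reservation can in
// turn overlap the next one, which is the cascade. Returns the number of
// reservations that had to move.
int push_back_cascade (int64_t new_expiration,
                       std::vector<span_t> &reservations)
{
    std::sort (reservations.begin (),
               reservations.end (),
               [] (const span_t &a, const span_t &b) {
                   return a.start < b.start;
               });
    int moved = 0;
    int64_t busy_until = new_expiration;
    for (auto &r : reservations) {
        assert (r.end >= r.start);
        if (r.start < busy_until) {
            int64_t duration = r.end - r.start;
            r.start = busy_until;  // slide to the first free instant
            r.end = r.start + duration;
            ++moved;
        }
        busy_until = std::max (busy_until, r.end);
    }
    return moved;
}
```

For example, with reservations at [100, 150), [150, 200) and [400, 450), extending the job to 120 moves the first reservation to [120, 170), which then forces the second to [170, 220); the third is untouched.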

The other route we discussed is to handle the RPC after the scheduler loop, which guarantees that all reservations will be cleared. I have a working implementation in the Fluxion planner, and it's much simpler than dealing with reservation conflicts. However, handling the RPC is clumsier. I think what's needed is to check for a sched.resource-status RPC in the post_sched_loop in the qmanager_cb_t class and relay the RPC to Fluxion. I don't think qmanager_cb_t was designed for sending RPCs, though.
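One way to picture the post-loop approach is a queue that records requests as they arrive and applies them only once the scheduler loop has finished. The sketch below is a simplified model under stated assumptions: `deferred_expirations`, `enqueue`, and `apply_after_loop` are hypothetical names, job end times are kept in a plain map, and nothing here is the qmanager_cb_t interface.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical record of one requested expiration change.
struct expiration_request {
    uint64_t jobid;
    double expiration;
};

class deferred_expirations {
public:
    // Called when a request arrives: only record it, change nothing yet.
    void enqueue (uint64_t jobid, double expiration)
    {
        pending_.push_back ({jobid, expiration});
    }

    // Called once the scheduler loop is done (reservations cleared):
    // update the end time of each known running job, then empty the queue.
    // Requests for unknown job ids are skipped. Returns the number applied.
    std::size_t apply_after_loop (std::map<uint64_t, double> &endtimes)
    {
        std::size_t applied = 0;
        for (const auto &req : pending_) {
            auto it = endtimes.find (req.jobid);
            if (it != endtimes.end ()) {
                it->second = req.expiration;
                ++applied;
            }
        }
        pending_.clear ();
        return applied;
    }

    bool empty () const { return pending_.empty (); }

private:
    std::vector<expiration_request> pending_;
};
```

Because end times only change inside `apply_after_loop`, no reservation can be invalidated mid-loop in this model; the cost is that the requester waits until the loop completes for its answer.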

@trws or @grondo do you have suggestions?

grondo · 2024-03-25T00:07:12Z

> This is because the relay callback can be executed at any time in the scheduler loop,

I apologize, but I don't know much about the Fluxion planner. However, an RPC callback can't be invoked until you re-enter the Flux reactor, and I'd be a little surprised if this occurs in the middle of a scheduler loop. Feel free to correct me if I'm wrong.

Also, I just noticed that the RPC relay implementation in qmanager referenced above makes a blocking RPC get, so that is not a good example. Instead, flux_future_then(3) should be used to schedule handling the response and returning it to the original caller. I can make a PR for this, even though we're probably deprecating sched.resource-status anyway (flux-framework/flux-core#5796).

grondo mentioned this issue Sep 25, 2023

extend expiration (time limit) of running job flux-framework/flux-core#4175

Open

milroy self-assigned this Mar 2, 2024

milroy linked a pull request Mar 27, 2024 that will close this issue

[WIP] Add support for expiration update #1158

Open
