Allow expiration of running jobs to be adjusted via sched.expiration RPC #1079
Comments
Summarizing a conversation from the team meeting: There is no need to reject a |
A WIP PR with flux-core support for updating the duration of running jobs is in flux-framework/flux-core#5522. It contains the implementation of the sched.expiration RPC to request the expiration update from the scheduler. Once that is merged, Fluxion will need support for that RPC, or the job manager update service will go ahead with the expiration update (it assumes the update is valid if the RPC fails with |
Update: flux-framework/flux-core#5522 was merged and released with flux-core v0.56.0. Until this issue is resolved, updates of running jobs will be allowed without notification to the Fluxion scheduler, which will have to adapt to the new time limits. In testing, this doesn't seem to be a critical issue, but it would probably be best if the |
@milroy: As requested in this week's meeting, some references for implementation of this feature:
Thanks for consolidating the information @grondo! |
RFC 27 states in Expiration: |
The intent is to prevent an expiration extension from overlapping an administrative reservation (however we end up implementing that), not just a normal reservation that's part of the current schedule plan (if that's what is meant by a job reservation). However, I do recall @ryanday36 mentioning that admins can just kill jobs running on an administrative reservation if necessary, so maybe we don't actually need to worry about this for now? |
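To make the overlap concern concrete, here is a rough sketch of the check being described. It is not Fluxion or flux-core code: the `Reservation` type, the `extension_conflicts` helper, and the representation of resources as sets of names are all hypothetical, chosen only to illustrate the idea of rejecting an extension whose added time window collides with a reservation on the same resources.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reservation:
    """Hypothetical administrative reservation (illustration only)."""
    resources: frozenset  # names of reserved resources, e.g. {"node1"}
    start: float          # reservation start, seconds since epoch
    end: float            # reservation end, seconds since epoch


def extension_conflicts(job_resources, old_expiration, new_expiration, reservations):
    """Return the reservations that an expiration extension would overlap.

    Only the added window [old_expiration, new_expiration) is examined,
    since the job already holds its resources up to old_expiration.
    """
    if new_expiration <= old_expiration:
        return []  # shortening a job cannot create a new overlap
    conflicts = []
    for r in reservations:
        shares_resources = bool(job_resources & r.resources)
        overlaps_window = r.start < new_expiration and r.end > old_expiration
        if shares_resources and overlaps_window:
            conflicts.append(r)
    return conflicts
```

A scheduler could refuse the update when the returned list is non-empty, or, per the point about admins being able to kill jobs, skip the check entirely for now.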
I've implemented the basic Using a simple relay like what's done with The other route we discussed is to handle the RPC after the scheduler loop, which guarantees that all reservations will be cleared. I have a working implementation in the Fluxion planner, and it's much simpler than dealing with reservation conflicts. However, handling the RPC is clumsier. I think what's needed is to check for a |
I apologize, but I don't know much about the Fluxion planner. However, an RPC callback can't be invoked until you re-enter the Flux reactor, and I'd be a little surprised if this occurs in the middle of a scheduler loop. Feel free to correct me if I'm wrong. Also, I just noticed that the RPC relay implementation in qmanager referenced above makes a blocking RPC get, so that is not a good example. Instead |
RFC 27 now defines a sched.expiration RPC which schedulers may implement to support adjustment to the duration/expiration of a running job.
This RPC should be implemented in Fluxion so that its internal endtime for jobs is synchronized with the job execution system when a job expiration is updated.
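As a minimal sketch of what "synchronized" means here, the toy model below keeps a per-job end time and updates it when an expiration request arrives. It does not use the Flux API. The `ToyScheduler` class, its method names, and the response shape are hypothetical, and the request fields `id` and `expiration` are an assumption about the payload that should be checked against RFC 27.

```python
class ToyScheduler:
    """Toy stand-in for a scheduler's table of running jobs (illustration only)."""

    def __init__(self):
        self.endtime = {}  # jobid -> expiration, seconds since epoch

    def job_started(self, jobid, expiration):
        """Record the end time the scheduler planned for a newly started job."""
        self.endtime[jobid] = expiration

    def handle_expiration(self, request):
        """Apply an expiration update so the scheduler's end time matches
        the one the execution system will enforce.

        Assumed request shape: {"id": <jobid>, "expiration": <float>}.
        """
        jobid = request["id"]
        new_expiration = request["expiration"]
        if jobid not in self.endtime:
            # The scheduler has no allocation for this job, so it cannot adjust it.
            return {"ok": False, "errmsg": f"unknown job {jobid}"}
        self.endtime[jobid] = new_expiration
        return {"ok": True}
```

For example, after `job_started(42, 1000.0)`, a request `{"id": 42, "expiration": 1600.0}` leaves `endtime[42]` at `1600.0`, while a request for an untracked job is rejected and leaves the table unchanged. A real implementation would also need to validate the new end time against the schedule before accepting it.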