Idea: job-info: add streaming RPC to "watch" R #5451

Closed
grondo opened this issue Sep 13, 2023 · 10 comments

@grondo
Contributor

grondo commented Sep 13, 2023

As part of the proposed solution for #4175, resource-update events will be posted to the job eventlog to indicate a change in a job's expiration. Interested parties can then monitor the job eventlog to be notified of these updates and adjust accordingly. However, there are multiple components that will require a new eventlog watch RPC, including the job execution service, the job shell, and any scheduler of a Flux subinstance. This is 2-3 new watches on the job eventlog per job. Not to mention the amount of code that will need to be copied to each of these use cases, processing of every eventlog event to check for the rare resource-update event, etc.

Each of these entities already fetches R from the job-info service. It might be convenient if the job-info service offered an RPC to start watching R instead of just fetching it. This could replace the existing lookup RPC and in the common case would have just one response. The job-info module would then watch the eventlog of the job and notify the client of updates by sending a new R. Code would be simplified for each use case, and there would only be one new eventlog watch instead of (up to) 3. Plus, I imagine the job-info module already has code for monitoring and processing the job eventlog.

Thoughts?

BTW, as an experiment I added an eventlog watch to the job-exec module for every job. It seemed to impact throughput by 5-8%.

@chu11
Member

chu11 commented Sep 13, 2023

I think this is a great idea.

> The job-info module would then watch the eventlog of the job and notify the client of updates by sending a new R.

My first thought was that we could just watch R itself and forward the changes as they occur. job-info is effectively a convenience for determining the R KVS path and doing a guest access check. The major difference is just returning an error if R doesn't exist, so it'd be on the caller to only watch R when appropriate.

Edit: As I re-read the above, what I was really thinking was a watch service on any key or keys of the job, jobspec, R, etc. It'd be dumb for just R. I was thinking generally speaking watching the key directly vs watching the eventlog.

Dunno if you were thinking of watching the eventlog b/c you want extra smarts in there, i.e. if job not running wait, end stream when job ends, etc.

@grondo
Contributor Author

grondo commented Sep 13, 2023

> Dunno if you were thinking of watching the eventlog b/c you want extra smarts in there, i.e. if job not running wait, end stream when job ends, etc.

No, I had thought watching the eventlog is required to process jobspec-update and resource-update events and send the updated R, jobspec, etc. in the response. I don't think we plan to overwrite R in the KVS on updates at this point (and I'm pretty sure we don't for the jobspec)

@grondo
Contributor Author

grondo commented Sep 13, 2023

Not to say we couldn't update R in the kvs when changes occur... Note: I don't think the job manager currently even fetches a copy of R. I'll have to go review what our current plan was here.

@chu11
Member

chu11 commented Sep 13, 2023

> No, I had thought watching the eventlog is required to process jobspec-update and resource-update events and send the updated R, jobspec, etc. in the response. I don't think we plan to overwrite R in the KVS on updates at this point (and I'm pretty sure we don't for the jobspec)

Agh, I'm dumb. yeah, we're not updating R in the KVS ;-)

One thought, when I implemented:

#5428

I specifically avoided putting it into job-info b/c I didn't want to overburden job-info with a lot of parsing of the eventlog and updating jobspec all of the time. That's why it's done on the userland side.

But perhaps this discussion pushes the update back into job-info instead, as it's a "net win" service for all of flux collectively. We could make the service be a 1 time lookup (i.e. for flux job info) or streaming.

Do you imagine it will need smarts to wait for a job to be assigned resources? end the stream when the job is finished?

@grondo
Contributor Author

grondo commented Sep 13, 2023

I imagine it should end the stream when the job is finished, but I think it could return an error if R does not exist at the time of the query since the existing lookup does that now.

> But perhaps this discussion pushes the update back into job-info instead, as it's a "net win" service for all of flux collectively. We could make the service be a 1 time lookup (i.e. for flux job info) or streaming.

Yeah, that is what I was thinking. Though we should carefully consider before moving forward so as not to unnecessarily undo any of your previous work. I was focused on R here because we may have many components that need to watch for updates, vs just getting the updated R at any given time, which is more the use case for jobspec. (However, it makes sense that this is a use case for R as well -- if a user fetches R, it should reflect updates.)

We should also consider if updating R in the kvs is indeed the right approach, since that could simplify some of this...

@chu11
Member

chu11 commented Sep 13, 2023

Just spitballing here, but is there a reason to not update jobspec/R in the KVS? I assume it's a raciness issue / potential for different folks to have different views? Or perhaps who "owns" jobspec/R? Or a desire to keep the "original"?

Or now that I think about it while typing this, perhaps all of the above. And probably some other reasons too I'm not thinking of yet.

@garlick
Member

garlick commented Sep 13, 2023

Yeah, I'd second a vote for keeping the original R/jobspec in the KVS and building the "current" R/jobspec by applying job events and with a replay on a restart. The "job schema" just makes more sense to me this way. If R were updated in place, then the update events in the eventlog would not have a context to be meaningful.
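A minimal sketch of the "keep the original, replay the events" idea, in plain Python with no Flux bindings. The function name `apply_resource_updates` is hypothetical, and the sketch assumes that eventlog entries are dicts with `name` and `context` keys, that R keeps its expiration under `execution.expiration`, and that a `resource-update` context carries only a new `expiration`, which is the one kind of update discussed in this thread.

```python
import copy


def apply_resource_updates(original_r, eventlog):
    """Build the "current" R by replaying resource-update events in order.

    The original R is never modified, so each update event in the
    eventlog keeps a meaningful base to apply against.
    """
    current = copy.deepcopy(original_r)
    for event in eventlog:
        if event.get("name") != "resource-update":
            continue  # most events are unrelated to R
        context = event.get("context", {})
        # Assumption: expiration is the only field an update can change.
        if "expiration" in context:
            current["execution"]["expiration"] = context["expiration"]
    return current


# Illustrative data only; the values are made up.
original_r = {
    "version": 1,
    "execution": {
        "R_lite": [{"rank": "0", "children": {"core": "0-3"}}],
        "starttime": 1000.0,
        "expiration": 2000.0,
    },
}
eventlog = [
    {"timestamp": 1000.0, "name": "alloc"},
    {"timestamp": 1500.0, "name": "resource-update", "context": {"expiration": 3000.0}},
    {"timestamp": 1600.0, "name": "resource-update", "context": {"expiration": 3600.0}},
]
current_r = apply_resource_updates(original_r, eventlog)
```

Because the replay starts from a copy, the same function also works after a restart: run it again over the full eventlog and the result is the same "current" R.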

@chu11 chu11 self-assigned this Sep 14, 2023
@chu11
Member

chu11 commented Sep 15, 2023

question, should the very first reply from the service send the original R? Or should it reply with an updated R based on all resource-update events in the eventlog at the time of the initial call? I assume the latter is desired.

@grondo
Contributor Author

grondo commented Sep 15, 2023

Ooh, good question. I think your intuition is correct. If we want to show "history" at some point maybe we could add a flag.

@chu11
Member

chu11 commented Sep 15, 2023

note for future possible flag, UNIQ option. I think it's rare enough to not waste time doing big json-diffs in the normal case.
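One possible reading of such a UNIQ option, sketched in plain Python: only forward an R when it differs from the last one sent. The helper name `unique_responses` is hypothetical and nothing here reflects an existing job-info flag. The deep comparison on every update is the cost that the note above suggests avoiding in the normal case.

```python
def unique_responses(r_versions):
    """Yield each R only if it differs from the previously yielded one."""
    previous = object()  # sentinel that never compares equal to a dict
    for r in r_versions:
        if r == previous:  # deep comparison of nested dicts/lists
            continue
        previous = r
        yield r


# Illustrative data only: the second update did not change anything.
stream = [
    {"execution": {"expiration": 2000.0}},
    {"execution": {"expiration": 2000.0}},
    {"execution": {"expiration": 3000.0}},
]
sent = list(unique_responses(stream))
```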

chu11 added a commit to chu11/flux-core that referenced this issue Sep 22, 2023
Problem: In the future, several services will need to know a
job's resources and know the updates that would apply to them.
This would currently require users to read R, watch the eventlog,
and then apply `resource-update` events as they happen.  It would
be nice if a service did this as there will be multiple users.

Solution: Support a new job-info.update-watch streaming service.  It currently
supports only the key "R", but can be extended to other keys in the future.

The service will read R and the eventlog for a job and apply all resource-update
changes as needed.  This initial "R" will be sent back to the caller.  If the
job has completed, the RPC streaming service ends.  If not, the eventlog
will be watched for future resource-update events.  On each new
resource-update event, a new R will be streamed back to the caller.  This
continues until the job ends or the caller cancels the stream.

Fixes flux-framework#5451
chu11 added a commit to chu11/flux-core that referenced this issue Sep 27, 2023
Problem: In the future, several services will need to know a
job's resources and know the updates that would apply to them.
This would currently require users to read R, read the eventlog,
and then apply `resource-update` events to R.  Some other users
would also need to know when there are changes to R, necessitating
watching the eventlog for future resource-update changes.

It would be nice if a service did this as there will be multiple users.

Solution: Support a new job-info.update-lookup service and
job-info.update-watch streaming service.  It currently supports only the key "R",
but can be extended to other keys in the future.

The job-info.update-lookup service will read R and the eventlog for a job.  It
then applies all resource-update changes to R and returns the result.

The job-info.update-watch service will do the same as the above, but if
the job is not completed, it will continue to watch the eventlog for
future resource-update events.  On each new resource-update event, a new R
will be streamed back to the caller.  This continues until the job ends or the
caller cancels the stream.

Fixes flux-framework#5451
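
The response sequence described in this commit message can be sketched as a plain Python generator. This is a simulation of the described behavior, not the job-info implementation: `update_watch` and `apply_event` are hypothetical names, the event and R layouts are assumed, and treating a `clean` event as "the job ends" is an assumption made for illustration.

```python
import copy


def apply_event(current_r, event):
    """Apply one event to R in place; return True if R was updated."""
    if event.get("name") != "resource-update":
        return False
    context = event.get("context", {})
    # Assumption: expiration is the only updatable field.
    if "expiration" not in context:
        return False
    current_r["execution"]["expiration"] = context["expiration"]
    return True


def update_watch(original_r, past_events, future_events, end_event="clean"):
    """Yield the current R, then a new R per later resource-update.

    past_events: the eventlog as it exists when the request arrives.
    future_events: events that arrive afterwards, in order.
    """
    current = copy.deepcopy(original_r)
    ended = False
    for event in past_events:
        apply_event(current, event)
        ended = ended or event.get("name") == end_event
    yield copy.deepcopy(current)  # first response already includes past updates
    if ended:
        return  # job already finished: a single response, like a lookup
    for event in future_events:
        if event.get("name") == end_event:
            return  # job ended: close the stream
        if apply_event(current, event):
            yield copy.deepcopy(current)


# Illustrative data only; the values are made up.
r = {"version": 1, "execution": {"starttime": 1000.0, "expiration": 2000.0}}
past = [
    {"name": "alloc"},
    {"name": "resource-update", "context": {"expiration": 3000.0}},
]
future = [
    {"name": "resource-update", "context": {"expiration": 4000.0}},
    {"name": "finish"},
    {"name": "clean"},
]
expirations = [resp["execution"]["expiration"] for resp in update_watch(r, past, future)]
```

In this sketch a lookup is just the first response of the watch, which mirrors how the commit message describes update-watch as doing "the same as the above" before it continues to follow the eventlog.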
@mergify mergify bot closed this as completed in 531ada0 Oct 24, 2023