
job-info: support new update-lookup and update-watch service #5467

Merged: 9 commits into flux-framework:master on Oct 24, 2023

Conversation

@chu11 (Member) commented Sep 22, 2023

per discussion in #5451

Problem: In the future, several services will need to know a job's resources and know the updates that would apply to them.
This would currently require users to read R, watch the eventlog, and then apply resource-update events as they happen. It would be nice if a service did this as there will be multiple users.

Solution: Support a new job-info.update-watch streaming service. It currently supports only the key "R", but can be extended to other keys in the future.

The service will read R and the eventlog for a job and apply all resource-update changes as needed. This initial "R" will be sent back to the caller. If the job has completed, the streaming RPC ends. If not, the eventlog will be watched for future resource-update events. On each new resource-update event, a new R will be streamed back to the caller. This continues until the job ends or the caller cancels the stream.
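For illustration only (not code from this PR), a caller of the streaming service might look roughly like the sketch below. The request payload fields ("id", "key", "flags") are assumptions modeled on other job-info RPCs and may not match the final protocol.

#include <flux/core.h>
#include <stdio.h>

int main (void)
{
    flux_t *h = flux_open (NULL, 0);
    flux_jobid_t id = FLUX_JOBID_ANY;      /* replace with a real jobid */
    flux_future_t *f;
    const char *result;

    if (!h)
        return 1;
    /* FLUX_RPC_STREAMING marks this as a streaming request */
    f = flux_rpc_pack (h, "job-info.update-watch", FLUX_NODEID_ANY,
                       FLUX_RPC_STREAMING,
                       "{s:I s:s s:i}", "id", id, "key", "R", "flags", 0);
    /* First response: R with all prior resource-update events applied.
     * Each later response: a new R after another resource-update event.
     * The loop exits when the job ends or the watch is canceled. */
    while (f && flux_rpc_get (f, &result) == 0) {
        printf ("%s\n", result);
        flux_future_reset (f);             /* wait for the next response */
    }
    flux_future_destroy (f);
    flux_close (h);
    return 0;
}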


Notes:

  • It might be wise to add some type of "caching" of sorts as multiple callers may want to watch the same job's R value. I didn't do that at the moment. I figured that could be an optimization for the future.

  • Per discussion in #5451 ("Idea: job-info: add streaming RPC to "watch" R"), future options could include a "uniq" option as well as a "full history" option.

@chu11 (Member, Author) commented Sep 23, 2023

It might be wise to add some type of "caching" of sorts as multiple callers may want to watch the same job's R value. I didn't do that at the moment. I figured that could be an optimization for the future.

As I think about this more, is the caching critically important? It would definitely save some unnecessary lookups. Lemme see how hard it'll be to add right now.

Note: I think such support would be in a follow-up PR; let's look at this PR as a "first round implementation".

@chu11 chu11 force-pushed the issue5451_job_info_watch_R branch 10 times, most recently from 1b75bba to e01ce52 Compare September 26, 2023 04:31
@chu11 (Member, Author) commented Sep 26, 2023

Although a few parts were a little tricky, it wasn't horrible to add the "multiple watchers under 1 eventlog-watch" support.

The commits could be squashed at a later time, but for now I left the separate commits in so the review can (maybe) be easier.

* to:
* - execution.expiration
*/
if (!streq (key, "execution.expiration"))
Contributor:

Similar to the other PR, the actual key is just expiration

Member Author (chu11):

interesting, I now see the resource-update context is formatted differently in RFC21 than jobspec-update.

Did we have a specific reason for that? It seems possibly unwise long term for both contexts to be formatted differently? On the other hand I can understand that R should have far more limited update possibilities.

Contributor:

Did we have a specific reason for that? It seems possibly unwise long term for both contexts to be formatted differently?

I think it was done this way because jobspec and R are actually quite different - R is not a simple hierarchical document like jobspec. E.g., when grow/shrink or similar are added, it will be more useful to have the added or deleted R and not a wholesale replacement of execution.R_lite.

@garlick (Member) commented Sep 27, 2023

Do users want to see the whole history starting with the original R, or just current R + updates?

If the latter, would it make sense to have this RPC check the streaming bit, and if not set, return the current R and done? Then maybe #5464 and other users like sched hello in #5472 wouldn't need to replay the eventlog on their own?

(I haven't looked at this - just basing this comment on your summary in #5472 - so sorry if it's off base)

Possibly the method name would need adjustment then.

@chu11 (Member, Author) commented Sep 27, 2023

Do users want to see the whole history starting with the original R, or just current R + updates?

The assumption in this PR is the latter. It returns R in its current form (R + updates currently in eventlog), and then streams further updates to R that occur later.

I left whole history as a future option / flag, as it wasn't clear there was a use case at the moment (other than "would like to see it just to see it").

If the latter, would it make sense to have this RPC check the streaming bit, and if not set, return the current R and done? Then maybe #5464 and other users like sched hello in #5472 wouldn't need to replay the eventlog on their own?

That isn't a terrible idea, makes sense. I guess the (original) theory was that whoever called this service would also want to know if R ever changed in the future. But I'm guessing there are a few cases where that's not needed?

And #5464 could totally use this service. I initially just mirrored the "jobspec" equivalent code in #5464 because the idea was to not put too much burden on the broker service by parsing the eventlog a bunch. If we wanted to go down the path where we put the burden on the job-info module, we should probably support jobspec as well.

@chu11 (Member, Author) commented Sep 27, 2023

re-pushed, updating "execution.expiration" to "expiration", as I had misread the RFC

@chu11 (Member, Author) commented Sep 27, 2023

Hmmm, got this in one builder (gcc-12,content-s3,distcheck) with one of my new tests. I have no idea how an EAGAIN could even happen. Will have to think about this.

Edit: noticed a minor && chain issue below. Dunno if there's some raciness with the background processes.

  expecting success: 
  	flux submit --wait-event=start sleep inf > job8.id
  	${UPDATE_WATCH} $(cat job8.id) R > watch8A.out &
  	watchpidA=$! &&
  	wait_update_watchers 1 &&
  	kvspath=`flux job id --to=kvs $(cat job8.id)` &&
  	flux kvs eventlog append ${kvspath}.eventlog resource-update "{\"expiration\": 100.0}" &&
  	${WAITFILE} --count=2 --timeout=30 --pattern="expiration" watch8A.out
  	${UPDATE_WATCH} $(cat job8.id) R > watch8B.out &
  	watchpidB=$! &&
  	wait_update_watchers 2 &&
  	kvspath=`flux job id --to=kvs $(cat job8.id)` &&
  	flux kvs eventlog append ${kvspath}.eventlog resource-update "{\"expiration\": 200.0}" &&
  	${WAITFILE} --count=3 --timeout=30 --pattern="expiration" watch8A.out &&
  	${WAITFILE} --count=2 --timeout=30 --pattern="expiration" watch8B.out
  	${UPDATE_WATCH} $(cat job8.id) R > watch8C.out &
  	watchpidC=$!
  	wait_update_watchers 3
  	${WAITFILE} --count=1 --timeout=30 --pattern="expiration" watch8C.out &&
  	flux cancel $(cat job8.id) &&
  	wait $watchpidA &&
  	wait $watchpidB &&
  	wait $watchpidC &&
  	test $(cat watch8A.out | wc -l) -eq 3 &&
  	test $(cat watch8B.out | wc -l) -eq 2 &&
  	test $(cat watch8C.out | wc -l) -eq 1 &&
  	head -n1 watch8A.out | jq -e ".execution.expiration == 0.0" &&
  	head -n2 watch8A.out | tail -n1 | jq -e ".execution.expiration == 100.0" &&
  	tail -n1 watch8A.out | jq -e ".execution.expiration == 200.0" &&
  	head -n1 watch8B.out | jq -e ".execution.expiration == 100.0" &&
  	tail -n1 watch8B.out | jq -e ".execution.expiration == 200.0" &&
  	tail -n1 watch8C.out | jq -e ".execution.expiration == 200.0"
  
  {"version": 1, "execution": {"R_lite": [{"rank": "0", "children": {"core": "0"}}], "starttime": 0.0, "expiration": 0.0, "nodelist": ["fv-az505-771"]}}
  {"version": 1, "execution": {"R_lite": [{"rank": "0", "children": {"core": "0"}}], "starttime": 0.0, "expiration": 100.0, "nodelist": ["fv-az505-771"]}}
  {"version": 1, "execution": {"R_lite": [{"rank": "0", "children": {"core": "0"}}], "starttime": 0.0, "expiration": 0.0, "nodelist": ["fv-az505-771"]}}
  {"version": 1, "execution": {"R_lite": [{"rank": "0", "children": {"core": "0"}}], "starttime": 0.0, "expiration": 100.0, "nodelist": ["fv-az505-771"]}}
  {"version": 1, "execution": {"R_lite": [{"rank": "0", "children": {"core": "0"}}], "starttime": 0.0, "expiration": 200.0, "nodelist": ["fv-az505-771"]}}
  {"version": 1, "execution": {"R_lite": [{"rank": "0", "children": {"core": "0"}}], "starttime": 0.0, "expiration": 100.0, "nodelist": ["fv-az505-771"]}}
  {"version": 1, "execution": {"R_lite": [{"rank": "0", "children": {"core": "0"}}], "starttime": 0.0, "expiration": 200.0, "nodelist": ["fv-az505-771"]}}
  {"version": 1, "execution": {"R_lite": [{"rank": "0", "children": {"core": "0"}}], "starttime": 0.0, "expiration": 200.0, "nodelist": ["fv-az505-771"]}}
  update_watch_stream: job-info.update-watch: Resource temporarily unavailable
  update_watch_stream: job-info.update-watch: Resource temporarily unavailable
  not ok 8 - job-info: update watch with multiple watchers works

@grondo (Contributor) commented Sep 27, 2023

Do users want to see the whole history starting with the original R, or just current R + updates?

If the latter, would it make sense to have this RPC check the streaming bit, and if not set, return the current R and done?

Stepping back a bit, there are two use cases here:

  • A user that wants to see the current R + updates. The use case here is a user running flux job info R or a scheduler fetching R after a reload/restart, etc. As noted in #5472 (cache R in the job manager), it would be nice if job-info.lookup just returned this value so we don't even have to update callers...
  • A user that wants to see the current R and then follow updates as they occur. The use case here is any entity that is monitoring R for changes, e.g. for expiration updates. This includes the job-exec module (so it can update the timer at which it terminates the job), the job shell (to properly implement the --signal=SIG@TIME option), and the scheduler of a subinstance (to update the internal expiration of its resource set).

@chu11 chu11 force-pushed the issue5451_job_info_watch_R branch 2 times, most recently from bbca20b to 45f7e68 Compare September 27, 2023 23:19
@chu11 (Member, Author) commented Sep 27, 2023

Per offline discussion, I re-worked this PR.

  • Like before, there is a job-info.update-watch target that returns the R + updates result, then streams R updates as they occur. Multiple watches on the same jobid + R share the same internal eventlog-watch.

  • There is a job-info.update-lookup target, which looks up R and the eventlog and returns the current R result, and that's it. If there is already an update-watch for that jobid, it can return the cached result.

Because of all these changes, I had to squash my previous changes, so unfortunately there's just one giant commit adding update-lookup and update-watch; hopefully it's not too hard to review :P
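As another rough illustration (again not code from this PR, with the same assumed payload fields and the h/id variables from the earlier sketch), the non-streaming update-lookup target would be a one-shot RPC:

/* One-shot fetch of R with all resource-update events applied so far. */
flux_future_t *f;
const char *result;

if (!(f = flux_rpc_pack (h, "job-info.update-lookup", FLUX_NODEID_ANY, 0,
                         "{s:I s:s s:i}", "id", id, "key", "R", "flags", 0)))
    return -1;
if (flux_rpc_get (f, &result) < 0) {
    flux_future_destroy (f);
    return -1;
}
printf ("%s\n", result);                   /* current R; nothing to cancel */
flux_future_destroy (f);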

@chu11 chu11 changed the title job-info: support new update-watch service job-info: support new update-lookup and update-watch service Sep 28, 2023
@chu11 (Member, Author) commented Sep 28, 2023

Hmmm, several builders are consistently failing with:

Run docker run --rm --privileged aptman/qus -s -- -p --credential aarch64
Unable to find image 'aptman/qus:latest' locally
docker: Error response from daemon: Head "https://registry-1.docker.io/v2/aptman/qus/manifests/latest": EOF.
See 'docker run --help'.
Error: Process completed with exit code 125.

under the setup qemu-user-static chunk. Perhaps something happened after #5469 merged?

@grondo (Contributor) commented Sep 28, 2023

My guess is the docker manifest for that repo is corrupted. Same issue in flux-sched.

Problem: In t2231-job-info-eventlog-watch.t, there is a missing
pipe between an echo of a json payload and the 'rpc' helper tool,
so the payload was not actually being sent.  The test was just echoing
the name of the command.

Add a pipe between the echo and the 'rpc' helper tool so the correct
payload is sent and tested.
Problem: A test in t2231-job-info-eventlog-watch.t does not set the
streaming flag in an invalid input test.  This test happens to work
because the code's "is the input valid" check is done before the
"is the streaming flag set" test.  But for full correctness, the
streaming flag should be set.

Solution: Use the 'rpc_stream' tool in the test, which sets the
streaming flag.
Problem: In several headers in job-info, the comment indicating the
end of an ifdef block had the wrong macro name.

Correct the macro name in the comments.
Problem: A local function in job-info was not declared static.

Make it static.
Problem: It would be convenient if the several ctx destroy functions
preserved errno so it doesn't have to be done with a "save_errno"
pattern in cleanup paths.

Preserve errno in ctx destroy functions.  Clean up code in a few
locations where possible.
Problem: In guest_watch.c there is a utility function called
guest_msg_pack() that allows us to create a message using specific
credentials from another message.  It's useful when messages are sent
within the module and you want the original user's request credentials
to be copied into those new messages.

This function could be useful in future work.

Solution: Move the function into a new internal util library and
rename it cred_msg_pack().
@codecov codecov bot commented Oct 24, 2023

Codecov Report

Merging #5467 (9885340) into master (640134f) will decrease coverage by 0.05%.
The diff coverage is 75.39%.

@@            Coverage Diff             @@
##           master    #5467      +/-   ##
==========================================
- Coverage   83.51%   83.46%   -0.05%     
==========================================
  Files         485      487       +2     
  Lines       81543    81918     +375     
==========================================
+ Hits        68099    68375     +276     
- Misses      13444    13543      +99     
Files Coverage Δ
src/modules/job-info/guest_watch.c 73.72% <100.00%> (-1.15%) ⬇️
src/modules/job-info/watch.c 68.93% <88.88%> (+2.66%) ⬆️
src/modules/job-info/job-info.c 75.32% <82.35%> (+1.55%) ⬆️
src/modules/job-info/util.c 66.66% <66.66%> (ø)
src/modules/job-info/update.c 75.69% <75.69%> (ø)

... and 11 files with indirect coverage changes

Problem: In watch and guest_watch, there is some common eventlog
parsing helper code that could be useful in future job-info additions.

Solution: Move the helper function eventlog_parse_next() into
util and add a new helper function eventlog_parse_entry_chunk().
Update callers in watch.c and guest_watch.c appropriately.
Problem: In the future, several services will need to know a job's
resources and know the updates that would apply to them.  This would
currently require users to read R, read the eventlog, and then apply
`resource-update` events to R.  Some other users would also need to
know when there are changes to R, necessitating watching the eventlog
for future resource-update changes.

It would be nice if a service did this as there will be multiple users.

Solution: Support a new job-info.update-lookup service and
job-info.update-watch streaming service.  It currently supports only
the key "R", but can be extended to other keys in the future.

The job-info.update-lookup service will read R and the eventlog for
a job.  It then applies all resource-update changes to R and returns
the result.

The job-info.update-watch service will do the same as the above, but if
the job is not completed, it will continue to watch the eventlog for
future resource-update events.  On each new resource-update event,
a new R will be streamed back to the caller.  This continues until
the job ends or the caller cancels the stream.

Fixes flux-framework#5451
Problem: There is no coverage for the new job-info.update-lookup and
job-info.update-watch services.

Add tests in the new t2233-job-info-update.t file and add helper
tools update_lookup and update_watch_stream.
@grondo (Contributor) left a comment:

I've been basing work to handle R updates in the job-exec service and job shell on this PR, and this seems to be working. My inclination is to merge this PR now and make any needed changes and improvements as we need them.

@garlick (Member) left a comment:

I found a few things, but actually, if this is working, maybe it's fine to merge as is and open an issue to return and clean up the two bigger things I called out:

  • use eventlog class to parse eventlogs
  • use flux_msglist_disconnect() and flux_msglist_cancel() instead of duplicating that functionality.

Since it may take a bit of time to audit that code and make sure those changes don't break anything, merge now+fix later may be more expedient.

Comment on lines +35 to +42
if (!(payload = json_vpack_ex (NULL, 0, fmt, ap)))
goto error;
if (!(payloadstr = json_dumps (payload, JSON_COMPACT))) {
errno = ENOMEM;
goto error;
}
if (flux_msg_set_string (newmsg, payloadstr) < 0)
goto error;
Member:

This function is just moved in this PR, not created, but maybe for later:

The json_vpack_ex() + json_dumps() + flux_msg_set_string() sequence could be replaced with one call to flux_msg_vpack().
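A minimal sketch of that consolidation (not code from this PR; newmsg, fmt, and ap are the variables from the function shown above):

/* Pack the payload directly into the message, replacing the
 * json_vpack_ex() + json_dumps() + flux_msg_set_string() sequence. */
if (flux_msg_vpack (newmsg, fmt, ap) < 0)
    goto error;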

@@ -708,19 +708,6 @@ static int main_namespace_lookup (struct guest_watch_ctx *gw)
return rv;
}

static bool eventlog_parse_next (const char **pp, const char **tok,
@garlick (Member) commented Oct 24, 2023

Three comments on this:

  • use a different function prefix for local helpers since eventlog_ is already used by the external eventlog class
  • would it be better to use eventlog_decode() from that class to parse input into a JSON array and then iterate over that?
  • is a "chunk" the same as an "eventlog entry"? Just call it that if so.

@@ -170,6 +170,8 @@ job_info_la_SOURCES = \
job-info/watch.c \
Member:

Commit message for "job-info: support new update-lookup/watch service": capitalize "it" in the 4th paragraph of the body.

uc->type = type;
uc->id = id;
if (!(uc->key = strdup (key))) {
errno = ENOMEM;
Member:

strdup already sets ENOMEM on failure

Comment on lines +321 to +333
while (eventlog_parse_next (&input, &tok, &toklen)) {
json_t *entry = NULL;
const char *name;
json_t *context = NULL;
if (eventlog_parse_entry_chunk (uc->ctx->h,
tok,
toklen,
&entry,
&name,
&context) < 0) {
errmsg = "error parsing eventlog";
goto error;
}
Member:

Yeah, as noted earlier, it looks like this could be replaced with the following (with appropriate error handling):

eventlog = eventlog_decode (eventlog_str);
json_array_foreach (eventlog, index, entry) {
    eventlog_entry_parse (entry, NULL, &name, &context);
    ...
}
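Filled out slightly (still only a sketch; eventlog_str stands in for whatever the surrounding function actually uses, and errmsg and the error label are borrowed from the code shown above):

json_t *eventlog;
json_t *entry;
json_t *context;
const char *name;
size_t index;

if (!(eventlog = eventlog_decode (eventlog_str))) {
    errmsg = "error decoding eventlog";
    goto error;
}
json_array_foreach (eventlog, index, entry) {
    if (eventlog_entry_parse (entry, NULL, &name, &context) < 0) {
        json_decref (eventlog);
        errmsg = "error parsing eventlog entry";
        goto error;
    }
    /* e.g. apply the context of each "resource-update" event to R here */
}
json_decref (eventlog);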

Comment on lines +654 to +657
if (flux_respond_error (uc->ctx->h, cmpmsg, ENODATA, NULL) < 0)
flux_log_error (uc->ctx->h,
"%s: flux_respond_error",
__FUNCTION__);
Member:

Response should only be sent on cancel, not disconnect.


uc = zlist_first (ctx->update_watchers);
while (uc) {
update_watch_cancel (uc, msg, cancel);
Member:

Maybe it would be better to replace update_watch_cancel() with

if (cancel)
    flux_msglist_cancel (ctx->h, uc->msglist, msg);
else
    flux_msglist_disconnect (ctx->h, uc->msglist, msg);

@grondo (Contributor) commented Oct 24, 2023

I've just set MWP on behalf of @chu11

@mergify mergify bot merged commit fce7220 into flux-framework:master Oct 24, 2023
31 checks passed
@chu11 chu11 deleted the issue5451_job_info_watch_R branch November 20, 2023 01:48
@chu11 chu11 mentioned this pull request Nov 21, 2023