Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

flux-job: get updated version of R #5464

Merged

Conversation

chu11
Copy link
Member

@chu11 chu11 commented Sep 22, 2023

similar to #5428 this supports resource-update events in flux job info and the equivalent python bindings to look up the "viewed" R and not the "base" one stored in the KVS.

Just like #5463, the commits with new tests do not work and have XXX tagged to be filled in later when an actual way to update the R of a job exists.

I should also note that this does not use any solution from #5451 as A) I implemented this first :P, B) copied the implementation from the jobspec side, and C) I wanted to keep this processing out of broker. We could of course change the approach.

@chu11 chu11 force-pushed the issue5424_flux_job_info_updated_R branch 5 times, most recently from f59f269 to 9664379 Compare September 23, 2023 00:04
@grondo
Copy link
Contributor

grondo commented Sep 25, 2023

Just like #5463, the commits with new tests do not work and have XXX tagged to be filled in later when an actual way to update the R of a job exists.

Would it work to write a test jobtap plugin that simply emits a resource-update event for a job? I think there may be a bit of work to do before we get R updates working through normal channels, and I'd hate for this to have to unnecessarily wait on that...

@chu11
Copy link
Member Author

chu11 commented Sep 25, 2023

Would it work to write a test jobtap plugin that simply emits a resource-update event for a job? I think there may be a bit of work to do before we get R updates working through normal channels, and I'd hate for this to have to unnecessarily wait on that...

We could definitely do that in the short term. The main reason I didn't do that initially is because of the python tests.

But looking at the implementation of flux-jobtap i guess a half-simple job-manager RPC should take care of that.

@chu11 chu11 force-pushed the issue5424_flux_job_info_updated_R branch from 9664379 to 5b62e89 Compare September 26, 2023 21:21
@chu11 chu11 changed the title WIP flux-job: get updated version of R lux-job: get updated version of R Sep 26, 2023
@chu11 chu11 changed the title lux-job: get updated version of R flux-job: get updated version of R Sep 26, 2023
@chu11
Copy link
Member Author

chu11 commented Sep 26, 2023

re-pushed adding a resource-update-expiration plugin that outputs a resource-update event when a job is run. It does something stupid, just increasing the expiration by 72 hours, so we can check in tests that it has changed.

I have comments that say "you may want to update this to something smarter down the road", but the tests do the job for the time being. So I removed WIP.

@chu11 chu11 force-pushed the issue5424_flux_job_info_updated_R branch from 5b62e89 to d31d435 Compare September 26, 2023 21:27
Copy link
Contributor

@grondo grondo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

Seems like there was a miscommunication about the resource-update event -- it is not similar to jobspec-update in that only an expiration key is currently allowed, which sets a new execution.expiration. (Commented inline as I noted spots that need to be updated.)

Comment on lines 3504 to 3509
json_object_foreach (context, path, value) {
if (jpath_set (R, path, value) < 0)
log_err_exit ("Failed to update R");
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RFC 21 specifies that a resource-update event may only have a key of expiration, which updates the resource set expiration. I don't think this code will work here since the expiration is located in execution.expiration.

In the future other keys may indicate grow or shrink (or those may be different events) and those would likely not apply with jpath_set() either, so probably better to specifically handle expiration here and generate an error for any other key in the resource-update event context.

echo $jobid > updated_R.id &&
flux job info $jobid R | jq -e ".execution.expiration == 0.0" &&
kvspath=`flux job id --to=kvs ${jobid}` &&
flux kvs eventlog append ${kvspath}.eventlog resource-update "{\"execution.expiration\": 1000.0}" &&
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See previous comment. Only allowed key in resource-update event context is expiration.

Comment on lines 102 to 104
if event.name == "resource-update":
for key, value in event.context.items():
set_treedict(R, key, value)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above, need to handle expiration correctly here.

id,
"resource-update",
"{s:f}",
"execution.expiration", newexp) < 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As noted elsewhere, RFC 21 specifies expiration as the only allowed resource-update context key.

@chu11
Copy link
Member Author

chu11 commented Sep 27, 2023

Seems like there was a miscommunication about the resource-update event -- it is not similar to jobspec-update in that only an expiration key is currently allowed, which sets a new execution.expiration. (Commented inline as I noted spots that need to be updated.)

Doh! I had just completely misread the RFC!

re-pushed with changes, mostly if key == "expiration" ... checks.

diff --git a/src/bindings/python/flux/job/kvslookup.py b/src/bindings/python/flux/job/kvslookup.py
index 776b9afa8..276ec0863 100644
--- a/src/bindings/python/flux/job/kvslookup.py
+++ b/src/bindings/python/flux/job/kvslookup.py
@@ -101,7 +101,8 @@ def _get_updated_R(job_data):
         event = EventLogEvent(entry)
         if event.name == "resource-update":
             for key, value in event.context.items():
-                set_treedict(R, key, value)
+                if key == "expiration":
+                    set_treedict(R, "execution.expiration", value)
     return json.dumps(R, ensure_ascii=False)
 
 
diff --git a/src/cmd/flux-job.c b/src/cmd/flux-job.c
index 9629bfe9f..9c98b34de 100644
--- a/src/cmd/flux-job.c
+++ b/src/cmd/flux-job.c
@@ -3502,8 +3502,10 @@ void info_output_R (flux_future_t *f, struct info_ctx *ctx)
                 continue;
 
             json_object_foreach (context, path, value) {
-                if (jpath_set (R, path, value) < 0)
-                    log_err_exit ("Failed to update R");
+                if (streq (path, "expiration")) {
+                    if (jpath_set (R, "execution.expiration", value) < 0)
+                        log_err_exit ("Failed to update R");
+                }
             }
         }
 
diff --git a/t/job-manager/plugins/resource-update-expiration.c b/t/job-manager/plugins/resource-update-expiration.c
index cc3ace891..872d60011 100644
--- a/t/job-manager/plugins/resource-update-expiration.c
+++ b/t/job-manager/plugins/resource-update-expiration.c
@@ -60,7 +60,7 @@ static int run_cb (flux_plugin_t *p,
                                      id,
                                      "resource-update",
                                      "{s:f}",
-                                     "execution.expiration", newexp) < 0) {
+                                     "expiration", newexp) < 0) {
         flux_jobtap_raise_exception (p,
                                      FLUX_JOBTAP_CURRENT_JOB,
                                      "resource-update", 0,
diff --git a/t/t2230-job-info-lookup.t b/t/t2230-job-info-lookup.t
index e87c323ab..dad9d8746 100755
--- a/t/t2230-job-info-lookup.t
+++ b/t/t2230-job-info-lookup.t
@@ -112,8 +112,8 @@ test_expect_success 'flux job info applies resource updates' '
        echo $jobid > updated_R.id &&
        flux job info $jobid R | jq -e ".execution.expiration == 0.0" &&
        kvspath=`flux job id --to=kvs ${jobid}` &&
-       flux kvs eventlog append ${kvspath}.eventlog resource-update "{\"execution.expiration\": 1000.0}" &&
-       flux job info $jobid R | jq -e ".execution.expiration > 0.0" &&
+       flux kvs eventlog append ${kvspath}.eventlog resource-update "{\"expiration\": 1000.0}" &&
+       flux job info $jobid R | jq -e ".execution.expiration == 1000.0" &&
        flux job info $jobid eventlog &&
        flux cancel $jobid
 '

@grondo
Copy link
Contributor

grondo commented Sep 27, 2023

Given the comments in #5467 I wonder if we want to consider having the job-info module apply the updates here, instead of pushing the application down into clients? This would allow a scheduler fetching R to always get the updated copy and not have to apply the updates manually (essentially duplicating code). I'm not sure how much work that would be though...

@chu11
Copy link
Member Author

chu11 commented Sep 27, 2023

Given the comments in #5467 I wonder if we want to consider having the job-info module apply the updates here, instead of pushing the application down into clients? This would allow a scheduler fetching R to always get the updated copy and not have to apply the updates manually (essentially duplicating code). I'm not sure how much work that would be though...

I don't think it should be too much work, I think the big questions are:

  • if we do this for R, we should probably do it for jobspec too. I did architect things in job-info: support new update-lookup and update-watch service #5467 in preparation for the possibility

  • should we do this in job-info.lookup and leave job-info.update-watch to be just for update watching?

    • and job-info.lookup should also support a "base" option of sorts.

@grondo
Copy link
Contributor

grondo commented Sep 27, 2023

should we do this in job-info.lookup and leave job-info.update-watch to be just for update watching?

I'm not sure. job-info.lookup does support flags right? Would it be good to have update-watch functionality as a LOOKUP_WATCH flag similar to how the KVS API works? That being said, it isn't too much extra work to convert to a new RPC vs adding a flag, so I'm not sure I have a strong preference here... I wonder if @garlick has any thoughts.

Problem: In a helper function, an extra parameter was mistakenly passed
to AssertNotIn.

Remove that extra parameter.
Problem: In several check helper functions in python/t0014-job-kvslookup.py,
we pass in a jobid but forget to use it.

Check that the jobid exists within the looked up data within those
helper functions.
Problem: A few tests in python/t0014-job-kvslookup.py were
named / numbered inconsistently.  Although this did not affect
these specific tests, it could in the future as some tests do
have dependencies on other tests being run first.

Solution: Fix the names / numbering.
Problem: In python/t0014-job-kvslookup.py there is a single
function to check both R and J lookups.  In future tests, it would
be convenient to only check R or J but not both.

Split up the function into a R check and J check and update callers
accordingly.
Problem: Internally within flux, R can be updated via resource-update
events.  But when the user runs "flux job info" to get R, it is getting
the version stored in the KVS and may not be representative of the
"viewed" R.  The result may not be consistent other tools such as "flux jobs".

Solution: When retrieving R, also retrieve the job eventlog.  Update R
through "resource-update" events and output that as the final result.
Get the stored version of R only if the --base option is specified.

This change breaks several tests in t2232-job-info-security.t that
created a dummy value of R.  Instead of reading a dummy value of R,
create a "dummy" key and read that.

Fixes flux-framework#5425
Problem: There are no tests that exercise handling of `flux job info`
and resource-update events in the job eventlog and operation of the
`flux job info --base` option.

Add coverage in t2230-job-info-lookup.t.
Problem: The "flux job info" command will present an updated
view of "R" when it is looked up, or a "base" view when --base
is specified.  This is supported in the python job/kvslookup.py
module with jobspec, but not with R.

Support updated and base lookups of R as well as jobspec.
Problem: There is no coverage for the 'base' option and the "R"
key in job kvs lookup via Python.

Add some coverage in python/t0014-job-kvslookup.py. Use a jobtap
plugin to generate resource-update event for tests.
@grondo grondo force-pushed the issue5424_flux_job_info_updated_R branch from b8340b7 to ef87fab Compare October 25, 2023 15:02
Copy link
Contributor

@grondo grondo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Rebased and fixed up the test jobtap plugin to use R from job-manager.

@@ -285,20 +285,21 @@ test_expect_success 'flux job wait-event guest.exec.eventlog fails via -p (live

# for these tests we need to create a fake job eventlog in the KVS

# value of R irrelevant here, just need to lookup something
# value of "dummy" irrelevant here, just need to lookup something
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, these test changes should have come in a separate commit before the breaking changes with an explanation. However, it isn't critical and is just my preference, so this works for me.

Comment on lines 45 to 48
/* tests expecting an expiration greater than original. Instead
* of getting R and calculating future time, just add 72 hours to
* the current time.
*/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

R is now available in jobtap plugin args so this can be simplified. Will fix.

@codecov
Copy link

codecov bot commented Oct 25, 2023

Codecov Report

Merging #5464 (ef87fab) into master (9528de4) will increase coverage by 0.04%.
The diff coverage is 93.05%.

@@            Coverage Diff             @@
##           master    #5464      +/-   ##
==========================================
+ Coverage   83.43%   83.47%   +0.04%     
==========================================
  Files         487      487              
  Lines       81990    82061      +71     
==========================================
+ Hits        68406    68504      +98     
+ Misses      13584    13557      -27     
Files Coverage Δ
src/bindings/python/flux/job/kvslookup.py 98.12% <100.00%> (+0.36%) ⬆️
src/cmd/flux-job.c 87.60% <88.88%> (+0.02%) ⬆️

... and 6 files with indirect coverage changes

Copy link
Member

@garlick garlick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like it gets the job done and I'm fine with merging.
We should come back around and replace the one-off jobspec/R update code with calls to the new job-info.update-lookup RPC at some point though.

@grondo
Copy link
Contributor

grondo commented Oct 25, 2023

Thanks @garlick. I've set MWP on behalf of @chu11.

@mergify mergify bot merged commit 6745065 into flux-framework:master Oct 25, 2023
33 checks passed
@chu11 chu11 deleted the issue5424_flux_job_info_updated_R branch November 20, 2023 01:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants