add job update service and new flux-update(1) command #5409

grondo · 2023-08-24T17:06:13Z

This is a WIP, but posting early while working on tests and documentation in case there are comments on the overall design or other high level aspects of this proposal.

This PR adds a new job update service to the job manager. A new job-manager.update RPC accepts requests for job updates with a payload including the target jobid and an updates object, which has the same form as the jobspec-update context of RFC 21, except that period-delimited keys are not required to begin with the attributes. prefix.

In order to test if an update is allowed for a given key, the update service will attempt to call a job.update.KEY jobtap callback. If no plugin is registered for that callback topic, or if the plugin callback returns -1, then the update is rejected. This allows us to gradually and carefully enable updates, as well as allowing out-of-tree projects to allow updates to keys that they control (e.g. flux-accounting can allow updates to bank and project).

Once all keys in a payload have been "allowed" by jobtap plugin callbacks (if multiple keys are present, then they all must be allowed or they all fail), then the update service copies the current jobspec and applies the update, then validates the update by passing the updated jobspec to the job.validate plugin stack. This allows the update in full to be validated before it is applied.

Plugins can, however, notify the update service that the plugin already considers the update validated by setting a validated flag in the output arguments. This can be used to bypass further validation when it is unnecessary or if a user is being allowed to bypass limits for example. If there are multiple keys in the update request, then all keys must have the validated flag set to bypass validation.
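For readers following along, the allow-then-validate flow described above can be sketched roughly as follows. This is a Python illustration only, not the C code in update.c: process_update, set_dotted, and the (allowed, validated) callback shape are hypothetical names chosen for the sketch, and the validate argument stands in for the job.validate plugin stack.

```python
import copy
from typing import Callable, Dict, Optional, Tuple

# Hypothetical stand-in for a job.update.KEY callback: takes the proposed
# value and returns (allowed, validated).
UpdateCallback = Callable[[object], Tuple[bool, bool]]


def set_dotted(obj: dict, key: str, value: object) -> None:
    """Set a period-delimited key in a nested dict, creating parents."""
    *parents, leaf = key.split(".")
    for name in parents:
        obj = obj.setdefault(name, {})
    obj[leaf] = value


def process_update(
    jobspec: dict,
    updates: Dict[str, object],
    callbacks: Dict[str, UpdateCallback],
    validate: Callable[[dict], Optional[str]],
) -> dict:
    """Return an updated copy of jobspec, or raise ValueError if rejected."""
    all_validated = True
    for key, value in updates.items():
        callback = callbacks.get("job.update." + key)
        if callback is None:
            # No plugin registered for this key: the update is rejected.
            raise ValueError(f"update of {key} is not supported")
        allowed, validated = callback(value)
        if not allowed:
            # One denied key fails the whole request.
            raise ValueError(f"update of {key} is not allowed")
        all_validated = all_validated and validated

    # Apply the updates to a copy so the current jobspec is untouched.
    proposed = copy.deepcopy(jobspec)
    for key, value in updates.items():
        set_dotted(proposed, key, value)

    # Validation is skipped only if every key set the validated flag.
    if not all_validated:
        error = validate(proposed)
        if error is not None:
            raise ValueError(error)
    return proposed
```

The point of working on a copy is that a failed validation leaves the job's jobspec exactly as it was.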

Since the first key update we want to allow is attributes.system.duration, an update-duration builtin plugin is also added to the job manager. This plugin allows users to update duration of pending jobs with validation (so that limits are checked), and without validation by default for instance owner (allowing the instance owner to increase the duration beyond configured limits). A plugin config parameter allows this instance owner capability to be disabled (mostly for testing). Having the duration update be a plugin instead of builtin to the update service also allows updates to be disabled completely by removing the plugin.

Finally, a new flux-update(1) command is added for sending update requests to the update service. The command takes a jobid and one or more key=value pairs and constructs an update request. Keys that do not begin with standard top-level keys of jobspec are assumed to be under attributes.system., except for name which is an alias for attributes.system.job.name (yes, the flux-update command is designed for general updates at this point even if the update service does not support them).
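As a rough illustration of the key expansion rule just described, here is a minimal Python sketch. The names expand_key, TOP_LEVEL_PREFIXES, and ALIASES are hypothetical and are not taken from flux-update.py; the prefix list follows the one given in the flux-update commit message in this PR.

```python
# Prefixes treated as "already fully qualified" (per the commit message).
TOP_LEVEL_PREFIXES = ("attributes.", "resources.", "tasks.")

# Short aliases that do not follow the default attributes.system. rule.
ALIASES = {"name": "attributes.system.job.name"}


def expand_key(key: str) -> str:
    """Expand a command-line key into a full jobspec key."""
    if key in ALIASES:
        return ALIASES[key]
    if key.startswith(TOP_LEVEL_PREFIXES):
        return key
    return "attributes.system." + key
```

With this sketch, duration expands to attributes.system.duration, while a key such as attributes.user.foo is passed through unchanged.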

Each key can have special value handling by adding a specially named method in the internal JobspecUpdates class. Currently, duration has special handling of the value to allow values of the form [+-]FSD. If a duration adjustment is requested via + or -, the utility downloads the jobspec and eventlog for the job and gets the current duration (with updates applied) and adjusts. Thus the following all should work:

flux update JOBID duration=1h -- set duration to 1 hour
flux update JOBID duration=+10m -- add 10 minutes to current duration
flux update JOBID duration=-10m -- subtract 10 minutes from current duration
flux update JOBID duration=inf -- set duration to unlimited
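The [+-]FSD handling behind these examples could look something like the sketch below. It is a simplified Python illustration under stated assumptions: parse_fsd is a hand-rolled parser that only accepts a number with an optional ms, s, m, h, or d suffix (plus inf/infinity), new_duration is a hypothetical name, and rejecting a non-positive result is a choice made for the sketch rather than confirmed flux-update behavior.

```python
import math
import re

# Seconds per unit for the suffixes handled by this simplified parser.
_UNITS = {"ms": 0.001, "s": 1.0, "m": 60.0, "h": 3600.0, "d": 86400.0}


def parse_fsd(text: str) -> float:
    """Parse a simplified duration string into seconds."""
    if text in ("inf", "infinity"):
        return math.inf
    match = re.fullmatch(r"(\d+(?:\.\d+)?)(ms|s|m|h|d)?", text)
    if not match:
        raise ValueError(f"invalid duration: {text!r}")
    number, unit = match.groups()
    return float(number) * _UNITS[unit or "s"]


def new_duration(current: float, arg: str) -> float:
    """Return the requested duration given the job's current duration."""
    if arg[:1] in ("+", "-"):
        delta = parse_fsd(arg[1:])
        result = current + delta if arg[0] == "+" else current - delta
        if result <= 0:
            # Assumption for this sketch: refuse to go to zero or below.
            raise ValueError("duration adjustment would not be positive")
        return result
    # No sign: the argument is an absolute duration.
    return parse_fsd(arg)
```

The current argument is what the utility would obtain by fetching the jobspec and replaying any jobspec-update events from the eventlog.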

Since the job.validate callback is made before applying updates, any duration update request that exceeds limits is rejected with an appropriate error message, e.g.:

$ flux update ƒGT7B136w duration=+1h
flux-update: ERROR: requested duration exceeds policy limit of 1h

After an update request succeeds, the jobspec-update event is posted to the eventlog. This event now bumps jobs in SCHED back to PRIORITY. This should cancel any outstanding alloc requests to the scheduler, and when the alloc request is made again, the scheduler should get the new duration as part of the redacted jobspec.

grondo · 2023-08-24T17:20:48Z

Rebased now that #5408 is merged which will allow easier testing and experimentation

chu11 · 2023-08-24T17:26:23Z

excellent timing, I had started working on #4697 and #4698 and blanked on the fact that a flux kvs eventlog append doesn't go through the journal, and thus job-list never sees the update in some tests. Can give this an immediate whirl.

src/modules/job-manager/jobtap.c

src/modules/job-manager/update.c

src/cmd/flux-update.py

grondo · 2023-08-24T18:19:40Z

src/modules/job-manager/update.c

+
+    if (!(update = calloc (1, sizeof (*update))))
+        return NULL;
+    update->ctx = ctx;


I marked this as a false positive, but then thought maybe there's some obvious problem I'm missing here?

chu11

just starting to play with things, two small nits i noticed

chu11 · 2023-08-24T18:42:53Z

src/modules/job-manager/event.c

@@ -675,6 +675,13 @@ int event_job_update (struct job *job, json_t *event)
    else if (streq (name, "jobspec-update")) {
        if (event_handle_jobspec_update (job, context) < 0)
            goto inval;
+        /*  Transition a job in SCHED state back to PRIORITY to trigger


commit message typo "PRIOITY"

chu11 · 2023-08-24T18:43:48Z

src/modules/job-manager/plugins/update-duration.c

+                               flux_plugin_arg_t *args,
+                               void *arg)
+{
+    flux_jobid_t id;


i don't think id is used in this function for anything

Oh, yeah, good catch. Whatever was using the jobid got eventually moved out of that callback, so no longer necessary.

chu11 · 2023-08-24T18:57:11Z

woo hoo, initial tests seem to work

                                                                                                                                            
test_expect_success 'support jobspec updates of project and bank' '
        flux jobtap load --remove=all ${PLUGINPATH}/project-bank-validate.so
'
<snip>
# initially put job on hold, jobspec-updates don't matter after the job is running
test_expect_success 'flux job list outputs project and bank if one set' '
        jobid=`flux submit --urgency=hold /bin/true | flux job id` &&
        echo $jobid > jobprojectbank2.id &&
        flux update $jobid project=foo &&
        flux update $jobid bank=bar &&
        flux job urgency $jobid default &&
        wait_jobid_state $jobid inactive &&
        flux job list -s inactive | grep $jobid | jq -e ".project == \"foo\"" &&
        flux job list -s inactive | grep $jobid | jq -e ".bank == \"bar\""
'

chu11 · 2023-08-24T19:30:04Z

seeing:

Aug 24 19:21:46.697069 job-manager.err[0]: queue_started: job foHZJWpX invalid queue: val
Aug 24 19:21:46.697141 job-manager.err[0]: queue_started: job foHZJWpX invalid queue: val
Aug 24 19:21:46.697202 job-manager.err[0]: queue_started: job foHZJWpX invalid queue: val
Aug 24 19:21:46.697271 job-manager.err[0]: queue_started: job foHZJWpX invalid queue: val
Aug 24 19:21:46.697330 job-manager.err[0]: queue_started: job foHZJWpX invalid queue: val
Aug 24 19:21:46.697388 job-manager.err[0]: queue_started: job foHZJWpX invalid queue: val
Aug 24 19:21:46.697444 job-manager.err[0]: queue_started: job foHZJWpX invalid queue: val

spin when I run t2260-job-list.t.

I haven't figured out cause yet, but I'm thinking something in your branch conflicts with my "jobspec-update-job-list" jobtap plugin. Perhaps the jobspec update data wasn't validated before, but needs to be validated now, and somehow ends up in some error loop?

note: I change the job queue in jobspec-update-job-list

chu11 · 2023-08-24T19:42:27Z

On nothing more than a hunch I reverted the "job-manager: move jobs from SCHED->PRIORITY on jobspec-update" commit and the spinning error no longer happens. I'm guessing there's some corner case in there regarding changes to queues. The error message saying the queue is "val" is perplexing though.

chu11 · 2023-08-24T19:49:37Z

ok, well this is part of the issue

    if (job->jobspec_redacted) {
        /* unit tests assume empty jobspec legal, so all fields
         * optional
         */
        if (json_unpack (job->jobspec_redacted,
                         "{s?{s?{s?s}}}",
                         "attributes",
                           "system",
                             "queue", &job->queue) < 0) {
            errno = EINVAL;
            return -1;
        }
    }

if the value has changed in the redacted jobspec, then queue is now pointing to whatever junk its old pointer is pointing to. So that is the reason for that mystery.

grondo · 2023-08-24T19:54:50Z

if the value has changed in the redacted jobspec, then queue is now pointing to whatever junk its old pointer is pointing to. So that is the reason for that mystery.

Good catch, but updating the queue is not supported yet, maybe we should have changed something else incidental in those tests (maybe job name?). Though I guess it wouldn't hurt to fix this issue in this PR since it seems to somehow introduce a problem...

grondo · 2023-08-24T20:02:11Z

Well I guess an explicit call to update the queue name any time the jobspec_redacted is updated might work, but blech. Also is bumping a job back to PRIORITY enough to ensure moving from a stopped to started queue works and vice versa? (I guess the job.update.attributes.system.queue plugin callback would deny updating to a queue that is currently stopped... somehow). Not too pretty though... 😕

grondo · 2023-08-24T20:04:03Z

Maybe a PR to change the job-list test attribute to something other than queue would be best for now? And open an issue to capture what we learned about changing the queue of a pending job?

grondo · 2023-08-24T20:23:09Z

Simple solution for now? I can add that here if it works:

diff --git a/src/modules/job-manager/event.c b/src/modules/job-manager/event.c
index 7f8c70d76..17dc8744b 100644
--- a/src/modules/job-manager/event.c
+++ b/src/modules/job-manager/event.c
@@ -555,7 +555,8 @@ static int event_handle_jobspec_update (struct job *job, json_t *context)
     if (!job->jobspec_redacted
         || job->state == FLUX_JOB_STATE_RUN
         || job->state == FLUX_JOB_STATE_CLEANUP
-        || jobspec_apply_updates (job->jobspec_redacted, context) < 0)
+        || jobspec_apply_updates (job->jobspec_redacted, context) < 0
+        || job_update_queue (job) < 0)
         return -1;
     return 0;
 }
diff --git a/src/modules/job-manager/job.c b/src/modules/job-manager/job.c
index cbc94101c..737963611 100644
--- a/src/modules/job-manager/job.c
+++ b/src/modules/job-manager/job.c
@@ -172,6 +172,11 @@ static int jobspec_redacted_parse_queue (struct job *job)
     return 0;
 }
 
+int job_update_queue (struct job *job)
+{
+    return jobspec_redacted_parse_queue (job);
+}
+
 struct job *job_create_from_eventlog (flux_jobid_t id,
                                       const char *eventlog,
                                       const char *jobspec,
diff --git a/src/modules/job-manager/job.h b/src/modules/job-manager/job.h
index 4fcbccf94..88f760c25 100644
--- a/src/modules/job-manager/job.h
+++ b/src/modules/job-manager/job.h
@@ -127,6 +127,10 @@ int job_event_peek (struct job *job, int *flagsp, json_t **entryp);
 bool job_event_is_queued (struct job *job, const char *name);
 const char *job_event_queue_print (struct job *job, char *buf, int size);
 
+/*  Update queue in case of jobspec update
+ */
+int job_update_queue (struct job *job);
+
 /*  Validate updates as valid RFC 21 jobspec-update event context:
  */
 bool validate_jobspec_updates (json_t *updates);

chu11 · 2023-08-24T20:44:10Z

hmmmm, my spiritually similar hack still hangs the test, so perhaps there's another issue somewhere.

-int jobspec_apply_updates (json_t *jobspec, json_t *updates)
+int jobspec_apply_updates (json_t *jobspec, json_t *updates, struct job *job)
 {
     const char *path;
     json_t *val;
@@ -612,6 +612,7 @@ int jobspec_apply_updates (json_t *jobspec, json_t *updates)
         if (jpath_set (jobspec, path, val) < 0)
             return -1;
     }
+    jobspec_redacted_parse_queue (job);
     return 0;
 }

The change to queue in the job-list tests was specific to test that queue stats were updated correctly in job-list. Obviously that PR went in before specific updates were allowed/not allowed.

could simply remove that part of the test temporarily?

grondo · 2023-08-24T20:47:05Z

BTW, @cmoussa1 should probably comment here if this scheme will work for flux-accounting. The idea would be that the accounting priority plugin could add handlers for job.update.attributes.system.bank and job.update.attributes.system.project to permit updates of these attributes. These callbacks should check if the new bank and/or project are valid before allowing the update.

It occurs to me that one new caveat here is that the job.validate callback should now be idempotent, i.e. it should only validate the job, not make any updates or changes to internal information about the job. This isn't the case currently, so there may be some job.validate callbacks that need to be split up to do some work in job.new instead.

In the case of a bank or project update, the job will be kicked back to PRIORITY after an update, so the bank and project for a job need to get updated at that callback somehow, before the priority plugin does its calculation of the new priority. I think this also might be a change in how the priority works now, so we may need to coordinate changes in the accounting plugin before "enabling" updates to those attributes.

grondo · 2023-08-24T20:52:15Z

hmmmm, my spiritually similar hack still hangs the test, so perhaps there's another issue somewhere.

Yeah, I didn't test mine and it crashed the broker. Kind of like encapsulating the update in a single function as you've done so let me try that. It would be good to figure this out now in case the approach here is untenable for a queue update...

grondo · 2023-08-24T21:18:14Z

Hm, maybe the problem is that the job-list test jobspec-update plugin is updating the jobspec on every callback. Since the jobspec-update event kicks the job back to PRIORITY, the job will call the callback again, then get sent back to PRIORITY again, and so on...

grondo · 2023-08-24T21:24:00Z

This helps:

diff --git a/t/job-manager/plugins/jobspec-update-job-list.c b/t/job-manager/plugins/jobspec-update-job-list.c
index 43440fe9b..11bcb52c5 100644
--- a/t/job-manager/plugins/jobspec-update-job-list.c
+++ b/t/job-manager/plugins/jobspec-update-job-list.c
@@ -25,6 +25,9 @@ static int validate_cb (flux_plugin_t *p,
                         flux_plugin_arg_t *args,
                         void *data)
 {
+    static bool updated = false;
+    if (updated)
+        return 0;
     if (flux_jobtap_jobspec_update_pack (p,
                                          "{s:f}",
                                          "attributes.system.duration",
@@ -35,6 +38,7 @@ static int validate_cb (flux_plugin_t *p,
                                      "update failure");
         return -1;
     }
+    updated = true;
     return 0;
 }
 
@@ -43,6 +47,9 @@ static int depend_cb (flux_plugin_t *p,
                       flux_plugin_arg_t *args,
                       void *data)
 {
+    static bool updated = false;
+    if (updated)
+        return 0;
     if (flux_jobtap_jobspec_update_pack (p,
                                          "{s:s}",
                                          "attributes.system.job.name",
@@ -53,6 +60,7 @@ static int depend_cb (flux_plugin_t *p,
                                      "update failure");
         return -1;
     }
+    updated = true;
     return 0;
 }
 
@@ -61,6 +69,9 @@ static int sched_cb (flux_plugin_t *p,
                      flux_plugin_arg_t *args,
                      void *data)
 {
+    static bool updated = false;
+    if (updated)
+        return 0;
     if (flux_jobtap_jobspec_update_pack (p,
                                          "{s:s}",
                                          "attributes.system.queue",
@@ -71,6 +82,7 @@ static int sched_cb (flux_plugin_t *p,
                                      "update failure");
         return -1;
     }
+    updated = true;
     return 0;
 }

Might only be necessary for the job.state.sched callback though.

chu11 · 2023-08-24T21:33:41Z

@grondo ahh good catch, that's sorta obvious now that the sched->priority commit is in there. Applied the fix (only in the sched callback) and the hang is gone.

chu11 · 2023-08-24T21:35:24Z

Does this fix point to a potential need to not call the jobtap callback again after a jobspec update? Wouldn't be the most obvious thing to do to a casual jobtap plugin programmer.

grondo · 2023-08-24T21:35:28Z

I'll just incorporate that into this PR (before the SCHED->PRIORITY commit)

grondo · 2023-08-24T21:38:36Z

Does this fix point to a potential need to not call the jobtap callback again after a jobspec update? Wouldn't be the most obvious thing to do to a casual jobtap plugin programmer.

I think the jobtap plugin stack must be called again when the job goes back to SCHED or any other state or else the job may not be properly processed by plugins (and there's no way to "know" what the plugins are doing to avoid some plugins but not all, even if that were possible)

Edit: I'm doubtful this will be a problem for real plugins, since the use case would be to check the current jobspec and only issue an update if necessary. On the second time around, the update should no longer be necessary.

Another idea would be to check if the update wouldn't change jobspec and reject or ignore it if so.
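To make the loop discussed above concrete, here is a small Python simulation. It is only a toy model under stated assumptions: the jobspec is a flat dict of dotted keys, run_until_sched stands in for the job manager's PRIORITY/SCHED handling, and the queue name "updatequeue" and both callbacks are hypothetical.

```python
def run_until_sched(jobspec, sched_callback, max_transitions=10):
    """Count PRIORITY->SCHED transitions until the job settles in SCHED."""
    transitions = 0
    while transitions < max_transitions:
        transitions += 1                  # job moves PRIORITY -> SCHED
        update = sched_callback(jobspec)  # job.state.sched callback runs
        if not update:
            return transitions            # no jobspec-update: stay in SCHED
        jobspec.update(update)            # jobspec-update sends job to PRIORITY
    raise RuntimeError("job keeps bouncing between PRIORITY and SCHED")


def always_update(jobspec):
    """Like the original test plugin: post an update on every callback."""
    return {"attributes.system.queue": "updatequeue"}


def update_if_needed(jobspec):
    """Check the current jobspec and only update when the value differs."""
    if jobspec.get("attributes.system.queue") == "updatequeue":
        return None
    return {"attributes.system.queue": "updatequeue"}
```

With update_if_needed the job reaches SCHED twice and then stays there, while always_update never settles. Ignoring updates that would not change the jobspec, as suggested above, would have the same effect from the job manager side.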

cmoussa1 · 2023-08-24T21:48:42Z

BTW, @cmoussa1 should probably comment here if this scheme will work for flux-accounting. The idea would be that the accounting priority plugin could add handlers for job.update.attributes.system.bank and job.update.attributes.system.project to permit updates of these attributes. These callbacks should check if the new bank and/or project are valid before allowing the update.

This sounds reasonable to me at the moment! It would make sense for the priority plugin to support this kind of functionality; users can belong to multiple banks and projects, so they should be able to switch between them if need be.

It occurs to me that one new caveat here is that the job.validate callback should now be idempotent, i.e. it should only validate the job, not make any updates or changes to internal information about the job. This isn't the case currently, so there may be some job.validate callbacks that need to be split up to do some work in job.new instead.

I think this makes sense so far. I could definitely be wrong but I think as it stands right now, the priority plugin does not make any updates or changes to internal information about the job (that would, of course, change with flux-framework/flux-accounting#294), right? Maybe I am misunderstanding your point here, sorry! Are you suggesting that this kind of functionality, where the priority plugin posts a jobspec-update for the default project name (or, eventually the default bank name) be moved to the callback for job.new?

In the case of a bank or project update, the job will be kicked back to PRIORITY after an update, so the bank and project for a job need to get updated at that callback somehow, before the priority plugin does its calculation of the new priority. I think this also might be a change in how the priority works now, so we may need to coordinate changes in the accounting plugin before "enabling" updates to those attributes.

This is good to know. If this is the case, where a job goes back to PRIORITY after an update, then we would need to add a flux_jobtap_jobspec_update_pack () call(s) to this callback to update the main eventlog for any accounting attributes that could get updated, yes? If so, then I think these calls could just be placed before the priority for the job gets calculated, i.e. before the callback calls priority_calculation ()? Just my immediate thoughts without any actual testing or anything. :P

grondo · 2023-08-24T22:18:28Z

Are you suggesting that this kind of functionality, where the priority plugin posts a jobspec-update for the default project name (or, eventually the default bank name) be moved to the callback for job.new?

I was thinking more about a plugin's internal state for a job -- the job.validate callback should not change this state based on the validation arguments, since job.validate could now be called with proposed jobspec, and if some other validation fails, the updates might never be applied.

where a job goes back to PRIORITY after an update, then we would need to add a flux_jobtap_jobspec_update_pack () call(s) to this callback to update the main eventlog for any accounting attributes that could get updated, yes? If so, then I think these calls could just be placed before the priority for the job gets calculated, i.e. before the callback calls priority_calculation ()

jobspec-update events only need to be emitted once. They would be emitted from the job-manager.update service after the updates are allowed and validated. The result would be that the attributes would be updated in the internal copy of jobspec in the job-manager before the job is sent back to the PRIORITY state.

Once the job reaches the PRIORITY state, then the job.state.priority callback is called. So the accounting plugin job.state.priority callback should check if the bank or project has been updated, and if so update any internal bank or project information for the job. (I can't remember if the accounting plugin has any of this state at the moment)
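The split of responsibilities described here might be sketched as below. This is a Python illustration only: the real accounting plugin is not written this way, and AccountingState, update_allowed, and on_priority are hypothetical names. The idea is that the job.update callback only answers "is this bank valid?", while per-job state is refreshed when the job reaches PRIORITY.

```python
class AccountingState:
    """Toy model of a priority plugin's per-job bank bookkeeping."""

    def __init__(self, valid_banks):
        self.valid_banks = set(valid_banks)
        self.job_bank = {}  # jobid -> bank used for the last priority calc

    def update_allowed(self, new_bank):
        """Stand-in for a job.update.attributes.system.bank callback.

        Only checks validity; it must not touch per-job state, since the
        update may still be rejected by a later validation step.
        """
        return new_bank in self.valid_banks

    def on_priority(self, jobid, jobspec):
        """Stand-in for job.state.priority: resync state if bank changed.

        Returns True when the stored bank had to be refreshed.
        """
        bank = jobspec["attributes"]["system"].get("bank")
        changed = self.job_bank.get(jobid) != bank
        if changed:
            self.job_bank[jobid] = bank
        return changed
```

Keeping update_allowed free of side effects is what makes it safe for the update service to call it on proposed values that may never be applied.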

cmoussa1 · 2023-08-24T22:49:44Z

I was thinking more about a plugin's internal state for a job -- the job.validate callback should not change this state based on the validation arguments, since job.validate could now be called with proposed jobspec, and if some other validation fails, the updates might never be applied.

Ah, so job.validate should not create that internal bank_info struct that is associated with each job that contains things like active job counts, fairshare value, what queues that user/bank has access to, etc.? And instead, that should be pushed off to a later state, i.e. job.new? Sorry if I am misunderstanding you.

jobspec-update events only need to be emitted once. They would be emitted from the job-manager.update service after the updates are allowed and validated. The result would be that the attributes would be updated in the internal copy of jobspec in the job-manager before the job is sent back to the PRIORITY state.

Once the job reaches the PRIORITY state, then the job.state.priority callback is called. So the accounting plugin job.state.priority callback should check if the bank or project has been updated, and if so update any internal bank or project information for the job. (I can't remember if the accounting plugin has any of this state at the moment)

Ohhhh, okay. So priority_cb () would need to add some logic to see if the bank name or project name has been updated and update that internal bank_info struct if so.

Yeah, I think that callback still unpacks arguments for the job, but doesn't check to see if it needs to be updated or anything. I believe the only case in which it updates the bank_info struct is if the job was held in PRIORITY state while the plugin was waiting for any user/bank information to be loaded from the flux-accounting DB. Perhaps this check could be generalized to handle the proposed case?

garlick · 2023-08-28T23:21:10Z

Nice!

chu11

second pass, overall LGTM, just small things

chu11 · 2023-08-29T18:44:32Z

src/modules/job-manager/job.c

+    if (!job->jobspec_redacted
+        || !(jobspec = json_deep_copy (job->jobspec_redacted))) {
+        errno = ENOMEM;
+        return NULL;
+    }


If !job->jobspec_redacted is true, the error probably isn't ENOMEM, although I'm not 100% sure what it should be.

Maybe EAGAIN, though I don't really think this error can happen, I'll fix.

chu11 · 2023-08-29T20:16:33Z

src/modules/job-manager/plugins/update-duration.c

+    flux_plugin_conf_unpack (p,
+                             "{s:i}",
+                             "owner-allow-any",
+                             &owner_allow_any);


parse bool instead of int? ehh no biggie, but just a thought.

chu11 · 2023-08-29T20:29:46Z

etc/completions/flux.pre

+        -n, --dry-run \
+    "
+    if [[ $cur != -* ]]; then
+        #  Attempt to substibute a pending jobid


substibute - > substitute

as an aside, another one to submit to typo checker :-)

chu11 · 2023-08-29T20:42:15Z

doc/man1/flux-update.rst

+``name``
+  ``attributes.system.job.name``


should this name chunk be removed? seems left over.

garlick · 2023-08-30T17:31:20Z

I've played with this a bit on my test cluster that has limits set up and it seems to work as advertised!

Problem: Jobspec updates will need to be manipulated and applied in multiple modules within the job manager, but currently the functionality to validate and apply jobspec updates is within static functions in event.c and jobtap.c. Locate some jobspec update functions centrally in job.c so they may easily be accessed from other job manager modules.

Problem: The code to apply jobspec updates from the jobspec-update event in the job manager duplicates the job_apply_jobspec_updates() function exported from job-manager/job.c. Use job_apply_jobspec_updates() to apply jobspec updates instead of the duplicated code.

Problem: The job update service will require assistance of the jobtap plugin stack to validate requested updates. Add a couple of jobtap support functions for this purpose:
 - jobtap_job_update(): Call job.update.KEY callback to allow a single update for KEY.
 - jobtap_validate_updates(): Apply updates to jobspec and call the job.validate stack on the modified jobspec.

Problem: The Jobspec setattr() method always prepends 'attributes.' to the key argument, but this can be inconvenient when the key already contains the 'attributes.' prefix, since that prefix must then be removed before calling jobspec.setattr(). Only prepend 'attributes.' to the key argument of setattr() if it doesn't already contain that prefix.

Problem: There exists a Jobspec setattr() method which sets an attribute based on "dotted key" notation, but no equivalent getattr() method to get dotted keys. Add a Jobspec.getattr() method.
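The dotted-key behavior of the two commits above can be illustrated with a minimal stand-in class. This is a sketch, not the Flux Python bindings: MiniJobspec and _path are hypothetical names, and having getattr() apply the same optional 'attributes.' prefix rule as setattr() is an assumption.

```python
class MiniJobspec:
    """Toy jobspec wrapper with dotted-key attribute access."""

    def __init__(self):
        self.jobspec = {"attributes": {}}

    @staticmethod
    def _path(key):
        # Only prepend 'attributes.' when the caller did not include it.
        if not key.startswith("attributes."):
            key = "attributes." + key
        return key.split(".")

    def setattr(self, key, val):
        """Set a dotted key, creating intermediate objects as needed."""
        *parents, leaf = self._path(key)
        node = self.jobspec
        for name in parents:
            node = node.setdefault(name, {})
        node[leaf] = val

    def getattr(self, key):
        """Get a dotted key; raises KeyError if any component is missing."""
        node = self.jobspec
        for name in self._path(key):
            node = node[name]
        return node
```

Both spellings then address the same value: setattr("system.duration", 60) and setattr("attributes.system.duration", 60) are equivalent in this sketch.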

Problem: The limit-duration and limit-job-size plugins do not validate jobs unless they are in the NEW state, ostensibly because job.validate may be called after a plugin reload or job manager restart. However, it is no longer the case that job.validate is called in these situations, and it may be necessary to call job.validate for jobs beyond the NEW state when processing job updates. Drop the checks for FLUX_JOB_STATE_NEW in the limit-duration and limit-job-size plugins.

Problem: There is no service in the job manager for requesting the update of jobspec or other job parameters. Add a new update service to the job manager. Job updates can now be requested via a job-manager.update RPC, the payload of which includes the target jobid and an "updates" object which follows the jobspec-update specification in RFC 21. Updates for a key are only allowed if a plugin callback exists for the jobtap topic string "job.update.KEY", and the callback returns success. If multiple keys are updated in the same request, they all must be allowed or none will be applied. Once updates have been allowed, then the proposed modified jobspec is sent through the job.validate plugin call stack. If the new jobspec fails to be successfully validated, then the updates are rejected and an error is returned to the requestor. Individual plugins may request that the job.validate be skipped for a given key by setting a 'validated' flag in the plugin OUT arguments. However, the job.validate call will still be made if multiple keys are being updated and not all of them set a validated flag. Once the update is allowed and validated, then a jobspec-update event is posted for the job and an empty success response is issued.

Problem: Once a job is submitted the duration cannot be updated. Add an update-duration plugin that adds a job.update.attributes.system.duration callback so that jobspec duration updates are supported for pending jobs. By default, users can update the duration of their own jobs up to the currently configured limit, and instance owners can update duration to any value. The ability of the instance owner to bypass limits can be disabled by reloading the plugin with the config parameter owner-allow-any=0.

Problem: There is no command line interface to request job updates. Add the flux-update(1) command, which takes a jobid and one or more KEY=VALUE pairs on the command line, and sends an update request to the job manager. Special handling for specific keys is supported for a more convenient user interface. Currently, any key which doesn't start with `attributes.`, `resources.` or `tasks.` is assumed to be prefixed with `attributes.system.`, so `duration=10m` is translated to `attributes.system.duration=10m` for example. Key values may also get special handling through existence of an `update_{keystr}` method in the JobspecUpdates class, where `keystr` is the key with dots replaced by underscores. For now, an `update_attributes_system_duration()` function is provided which accepts 'duration' values of the form +/-FSD or FSD. When adjusting duration, the current jobspec is fetched with any updates applied to get the most up-to-date duration.
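The `update_{keystr}` lookup described in this commit message amounts to a name-based dispatch, roughly as in the sketch below. This is illustrative Python, not the JobspecUpdates class itself: UpdateHandlers and convert are hypothetical names, and the duration handler is a placeholder that accepts plain seconds only.

```python
class UpdateHandlers:
    """Toy per-key value handling via specially named methods."""

    def update_attributes_system_duration(self, value: str) -> float:
        # Placeholder handler: the real one is described as also
        # supporting FSD and +/-FSD values.
        return float(value)

    def convert(self, key: str, value: str):
        """Run the handler for key if one exists, else pass value through."""
        method_name = "update_" + key.replace(".", "_")
        handler = getattr(self, method_name, None)
        if handler is None:
            return value
        return handler(value)
```

Adding special handling for another key then only requires defining one more method with the matching name.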

Problem: There are no tab completions for the flux-update(1) command. Add a completion handler for flux-update(1) to etc/completions/flux.pre.

Problem: There is no way for a jobtap plugin to get notified of a jobspec update after the jobspec updates have been applied. Jobs only transition back to PRIORITY state from SCHED, so the job.state.priority callback will not always be sufficient, and subscribing directly to the jobspec-update event would require the plugin to manually apply updates, and may not capture other ways a jobspec or job might be updated in the future. Introduce a 'job.update' callback topic which is called after any jobspec update has been applied. If the job is transitioning back to the PRIORITY state, this callback will be called before the job.state.priority topic so that plugins may adjust internal state that would normally be established prior to the first call to job.state.priority.

Problem: There are no tests of the job update support in flux. Add a new test, t2290-job-update.t, and helper jobtap plugin, job-manager/plugins/update-test.c, and add some basic testing of the job update support using `flux update`.

Problem: The flux-update(1) command is not documented. Add a short manual page for flux-update(1). Update spelling dictionary as necessary.

Problem: It would be useful to disable updates for individual jobs, but there is currently no way to do this. Add an 'immutable' flag to the job manager job structure. Support adding this flag via the `set-flags` event.

Problem: When the instance owner updates a guest job in order to bypass validation (e.g. to update duration of a job beyond current limits), a future job update of a different attribute may fail because the job will be revalidated. This causes a confusing error that is unrelated to the user's request. When a job update bypasses validation, the update request is made by the instance owner, and the job user is not the instance owner, mark the job as immutable to prevent future updates by the job owner. This not only results in a less confusing error, "job is immutable due to previous instance owner update", but also avoids the need to track which attribute updates have bypassed validation in past updates, which could be complex and could introduce unintended consequences.
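
The three conditions for marking a job immutable can be expressed compactly. The following is a hypothetical Python model of the rule described above, not the job manager's C code: the `Job` class and function names are illustrative, and treating the immutable flag as blocking only non-owner requests is an assumption drawn from the phrase "prevent future updates by the job owner".

```python
from dataclasses import dataclass


@dataclass
class Job:
    userid: int
    immutable: bool = False


def check_mutable(job, requester_userid, owner_userid):
    """Reject updates from non-owners once a job has been marked immutable."""
    if job.immutable and requester_userid != owner_userid:
        raise PermissionError(
            "job is immutable due to previous instance owner update"
        )


def mark_after_update(job, requester_userid, owner_userid, validation_bypassed):
    """Mark a guest job immutable after an owner update that skipped validation."""
    if (
        validation_bypassed
        and requester_userid == owner_userid
        and job.userid != owner_userid
    ):
        job.immutable = True
```

An update by the instance owner to one of its own jobs, or any update that went through validation, leaves the flag unset in this model.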

grondo · 2023-08-31T14:27:49Z

Thanks @chu11 and @garlick! I've fixed up the final comments by @chu11 and force pushed the result and will set MWP.

grondo · 2023-08-31T15:24:38Z

Hm, the el8 - ascii checks failed with:

 Traceback (most recent call last):
    File "/usr/src/t/scripts/sign-as.py", line 3, in <module>
      from flux.security import SecurityContext
    File "/usr/src/src/bindings/python/flux/security.py", line 11, in <module>
      from _flux._security import ffi, lib

I better check if any FLUX_SECURITY prereq is needed on some of these tests.

Problem: There are no tests of jobs which have been updated by the instance owner and therefore have the immutable flag set. Test update of guest jobs in t2290-job-update.t before and after an instance owner update. Ensure an immutable job cannot be updated by the user.

Problem: The flux-update(1) man page does not mention that jobs updated by the instance owner may become immutable. Add an explanation of how jobs updated by the instance owner can bypass validation, and why this makes the jobs immutable.

codecov · 2023-08-31T16:46:50Z

Codecov Report

Merging #5409 (c41d143) into master (ab1e50e) will increase coverage by 10.85%.
The diff coverage is 96.39%.

@@             Coverage Diff             @@
##           master    #5409       +/-   ##
===========================================
+ Coverage   83.55%   94.40%   +10.85%     
===========================================
  Files         470       89      -381     
  Lines       78669     8639    -70030     
===========================================
- Hits        65730     8156    -57574     
+ Misses      12939      483    -12456

| Files Changed | Coverage Δ |
|---|---|
| src/bindings/python/flux/job/Jobspec.py | `89.83% <84.61%> (-0.19%)` ⬇️ |
| src/cmd/flux-update.py | `97.95% <97.95%> (ø)` |

... and 383 files with indirect coverage changes

grondo force-pushed the job-update-service branch from 3e89a0b to ebfa2cc Compare August 24, 2023 17:19

github-advanced-security bot found potential problems Aug 24, 2023

src/modules/job-manager/jobtap.c Fixed

src/modules/job-manager/update.c Fixed

github-advanced-security bot found potential problems Aug 24, 2023

src/cmd/flux-update.py Dismissed

src/cmd/flux-update.py Dismissed

grondo force-pushed the job-update-service branch from ebfa2cc to 7d02d7a Compare August 24, 2023 17:59

github-advanced-security bot found potential problems Aug 24, 2023

chu11 reviewed Aug 24, 2023

grondo force-pushed the job-update-service branch from 7d02d7a to ba9b3a8 Compare August 24, 2023 19:21

grondo force-pushed the job-update-service branch from ba9b3a8 to f55977b Compare August 24, 2023 22:35

chu11 approved these changes Aug 29, 2023

cmoussa1 mentioned this pull request Aug 30, 2023

[experimental] plugin: add bank info lookup helper function flux-framework/flux-accounting#371

Closed

garlick added this to the flux-core v0.54.0 milestone Aug 30, 2023

grondo added 15 commits August 31, 2023 07:26

python: add Jobspec.getattr() method

d2b4d7b

Problem: There exists a Jobspec setattr() method which sets an attribute based on "dotted key" notation, but no equivalent getattr() method to get dotted keys. Add a Jobspec.getattr() method.
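
For readers unfamiliar with "dotted key" notation, the lookup amounts to walking a nested dictionary one component at a time. The helper below is a generic illustration with a hypothetical name, not the body of `Jobspec.getattr()` itself, which may handle prefixes or missing keys differently.

```python
def get_dotted(mapping, key):
    """Look up a period-delimited key such as 'a.b.c' in nested dicts.

    Raises KeyError if any component of the key is missing.
    """
    value = mapping
    for part in key.split("."):
        value = value[part]
    return value
```

Given `{"attributes": {"system": {"duration": 60}}}`, calling `get_dotted(spec, "attributes.system.duration")` walks three levels and returns the stored duration.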

completions: add bash completions for flux-update(1)

d9544f9

Problem: There are no tab completions for the flux-update(1) command. Add a completion handler for flux-update(1) to etc/completions/flux.pre.

testsuite: add job update tests

4fccb67

Problem: There are no tests of the job update support in flux. Add a new test, t2290-job-update.t, and helper jobtap plugin, job-manager/plugins/update-test.c, and add some basic testing of the job update support using `flux update`.

doc: add flux-update(1)

33892fd

Problem: The flux-update(1) command is not documented. Add a short manual page for flux-update(1). Update spelling dictionary as necessary.

job-manager: support immutable job flag

61f1e77

Problem: It would be useful to disable updates for individual jobs, but there is currently no way to do this. Add an 'immutable' flag to the job manager job structure. Support adding this flag via the `set-flags` event.

grondo force-pushed the job-update-service branch from c34f2cc to 4e89409 Compare August 31, 2023 14:26

grondo added the merge-when-passing label Aug 31, 2023

grondo added 2 commits August 31, 2023 09:00

grondo force-pushed the job-update-service branch from 4e89409 to c41d143 Compare August 31, 2023 16:00

mergify bot merged commit f951224 into flux-framework:master Aug 31, 2023
30 checks passed

grondo deleted the job-update-service branch August 31, 2023 17:14

		``name``
		``attributes.system.job.name``
