Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tree: Allow alternate Local Workers in Tree Mode #439

Merged
merged 1 commit into from
Jul 29, 2020

Conversation

mattaezell
Copy link
Contributor

Do not hard-coding Exec as the Tree LocalWorker; instead use the
Task's info local_workername or the Default local_workername.

Closes #435.

lib/ClusterShell/Task.py Outdated Show resolved Hide resolved
@mattaezell
Copy link
Contributor Author

@degremont @thiell Is this acceptable to merge as-is, or is there a better way you can suggest I implement this?

@thiell
Copy link
Collaborator

thiell commented Jul 17, 2020

Hi @mattaezell!

Thanks for your PR and patience! :)

I have been discussing this with @degremont today and we think there might be another way to fix this. Let me explain.

  1. I think we all agree that ExecWorker should NOT be hardcoded as the Tree's local worker. Thanks for spotting that. There is a module's function _local_workerclass() available in Defaults.py which should be used instead that uses the default value identified by local_workername (set to 'exec' by default, and overridable locally using the configuration file defaults.conf). That first change will allow users to change it through Defaults. Now, that doesn't resolve your direct problem.

  2. You noticed that Task's Defaults are not propagated. The idea is that different portions of a cluster could have different values in defaults.conf. So to resolve your problem and make the Tree mode more flexible, we should add a generic way to override defaults of the gateways using the Task's info dict (that is propagated).

In tree mode, the Task's info dict is filtered a bit and propagated from PropagationChannel.shell() (which is actually called from TreeWorker._execute_remote() initially from the root node):

292         # keep only valid task info pairs
293         info = dict((k, v) for k, v in self.task._info.items()
294                     if k not in DEFAULTS._task_info_pkeys_bl)

to be then received on the remote end by GatewayChannel.recv_ctl():

251                 taskinfo = data['taskinfo']
252                 self.logger.debug('assigning task infos (%s)', data['taskinfo'])
253 
254                 task = task_self()
255                 task._info.update(taskinfo)

The idea would be to allow propagating default "overrides" to the gateway. What we could do in GatewayChannel.recv_ctl() is to parse the keys of the (updated) Task's info so that we can override the gateway defaults.

Perhaps we can state that for task.info's keys that start with "tree_default:" (just an idea), the gateway would override its own task default dict from those. For instance, if a key "tree_default:local_workername" is found, we would call task.set_default("local_workername", value). It should be only a few lines of code to add to GatewayChannel.recv_ctl().

Any default values would be overridden that way from the root node, and there is no need to define a new, specific info key "local_workername". In your case, you would then use "tree_default:local_workername" instead, and with the change done in 1., the new default value of the gateway's "local_workername" would be used instead and thus solve your problem.

Let me know what you think and if I missed anything. If that makes sense to you, would you be willing to follow that path and change your PR?

@mattaezell
Copy link
Contributor Author

Let me know what you think and if I missed anything. If that makes sense to you, would you be willing to follow that path and change your PR?

It makes sense. I'll work on implementation and testing - hopefully I'll have an updated PR early next week.

Thanks.

Do not hard-code Exec as the Tree LocalWorker; instead use the
value set in Defaults.

Allow a task to override Defaults on gateway nodes by setting
an info item called 'tree_default:<key>".

Closes cea-hpc#435.
@mattaezell
Copy link
Contributor Author

There is a module's function _local_workerclass() available in Defaults.py which should be used instead that uses the default value identified by local_workername (set to 'exec' by default, and overridable locally using the configuration file defaults.conf).

That function expects defaults as an argument, which is supposed to be a Defaults object. Task has a _default dict but no Defaults object that I could find. Luckily local_worker and local_workername are set in Task.__init__() available from Task.default().

There is nothing that keeps local_worker and local_workername in sync (I confused myself with that one!) so I added two special cases in set_default() that will update local_worker and distant_worker if local_workername or distant_workername are set.

Please review again and let me know what you think. Thanks!!

@thiell thiell added this to the 1.8.4 milestone Jul 24, 2020
@thiell thiell self-requested a review July 24, 2020 17:59
Copy link
Collaborator

@thiell thiell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! LGTM

@thiell thiell merged commit 06ecc79 into cea-hpc:master Jul 29, 2020
@mattaezell mattaezell deleted the tree_localworker branch July 31, 2020 19:23
mattaezell added a commit to olcf/phoenix that referenced this pull request Feb 23, 2022
Requires ClusterShell be patched with:
cea-hpc/clustershell#439
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Advice to make a custom worker Tree-capable
3 participants