Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Intelligent Host Select. #4184

Merged
merged 25 commits into from
Jul 23, 2021
Merged

Intelligent Host Select. #4184

merged 25 commits into from
Jul 23, 2021

Conversation

wxtim
Copy link
Member

@wxtim wxtim commented Apr 20, 2021

These changes partially address #3827
@datamel should be credited somewhere since we pair programmed the Remote Init and Filinstall logic.

Summary of logic:

  • Added a new kwarg, badhosts: set to get_host_from_platform. This is then subtracted from the platforms list of hosts and the selection is carried out on the leftover hosts.
  • Added TaskRemoteMgr.badhosts = set(), allowing us to store a list of bad hosts.
  • Modified several cases of get_host_from_platform in remote.py to use badhosts if they can.
  • Used this infrastructure to allow retrying of remote tasks if they fail with a 255.

Caveats

  • I'm leaving platform from platform host as a follow up.
  • Current testing is static - i.e. there is no checking what happens if hostA dies, then comes back online then hostB dies. All testing is done with platforms where lists of hosts contain deliberately fictional hosts, where ssh will fail causing a retry.

WIP Checklist

  • Host within Platform
    • Create a persistent object (TaskRemoteMgr.badhosts) in which to store hosts which aren't accessible.
    • Created a mainloop plugin clearing TaskRemoteMgr.badhosts from time to time.
    • Remote init working.
    • ... and tested
    • Remote fileinstall working.
    • ... and tested
    • Job submission working.
    • ... and tested
    • Remote tidy working.
    • ... and tested
    • Job log retrieval working.
    • ... and tested
    • Polling Working
    • ... and tested
    • Kill working
    • ...and tested
  • Check existing tests work = Now Passing locally.

Standard Checklist

  • I have read CONTRIBUTING.md and added my name as a Code Contributor.
  • Contains logically grouped changes (else tidy your branch by rebase).
  • Does not contain off-topic changes (use other PRs for other changes).
  • Appropriate tests are included (unit and/or functional).
    • Test workflow included which may become the basis for future functional tests.
  • Appropriate change log entry included.
  • (master branch) I have opened a documentation PR at cylc/cylc-doc/pull/XXXX.
  • No dependency changes.

@wxtim wxtim added this to the cylc-8.0.0 milestone Apr 20, 2021
@wxtim wxtim self-assigned this Apr 20, 2021
@wxtim wxtim marked this pull request as draft April 20, 2021 14:48
cylc/flow/platforms.py Outdated Show resolved Hide resolved
cylc/flow/task_job_mgr.py Outdated Show resolved Hide resolved
cylc/flow/task_remote_mgr.py Outdated Show resolved Hide resolved
cylc/flow/task_remote_mgr.py Outdated Show resolved Hide resolved
cylc/flow/task_job_mgr.py Outdated Show resolved Hide resolved
@wxtim
Copy link
Member Author

wxtim commented Apr 27, 2021

Copied from a comment for @oliver-sanders

Intelligent Host Selection - Job Submission

The idea is Cylc tries to submit the job to one host, if that fails 255 it tries the next one, if it runs out of hosts it puts the job into a submit-failed state.

This means the job submission system has to remember the hosts it has previously tried. At the moment on Tim's branch this state is centralised which makes sense. This way Cylc doesn't try to use hosts which is knows are down. This is good for efficiency and also prevents flooding the system with SSH commands that are sitting around before they hit their timeout.

Problems:

We would have to wipe this global state at some point otherwise hosts which have come back will not be considered for any operation.
If there is a transient network problem all hosts could be added to the blacklist and the workflow would sit around doing nothing until the state is wiped.

Best solution I can come up with is:

Reset this state periodically (use cylc.flow.main_loop.periodical)
AND also reset the state whenever a submit-failed task is retriggered either manually or by a scheduled automatic retry.

@wxtim wxtim marked this pull request as ready for review May 24, 2021 09:00
@wxtim wxtim marked this pull request as draft May 24, 2021 15:48
cylc/flow/remote.py Outdated Show resolved Hide resolved
cylc/flow/task_proxy.py Outdated Show resolved Hide resolved
@wxtim wxtim marked this pull request as ready for review May 25, 2021 15:43
@wxtim wxtim requested a review from MetRonnie May 25, 2021 15:43
@wxtim
Copy link
Member Author

wxtim commented May 25, 2021

I've tagged multiple people in this review because I am aware that the half term hols are coming up. Feel free to poke about with this.
All tests are passing locally.

Copy link
Member

@MetRonnie MetRonnie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

11/31 files viewed

cylc/flow/cfgspec/globalcfg.py Outdated Show resolved Hide resolved
cylc/flow/cfgspec/globalcfg.py Outdated Show resolved Hide resolved
cylc/flow/platforms.py Outdated Show resolved Hide resolved
cylc/flow/platforms.py Outdated Show resolved Hide resolved
cylc/flow/platforms.py Show resolved Hide resolved
cylc/flow/task_job_mgr.py Outdated Show resolved Hide resolved
cylc/flow/task_job_mgr.py Show resolved Hide resolved
cylc/flow/task_proxy.py Outdated Show resolved Hide resolved
cylc/flow/task_remote_mgr.py Outdated Show resolved Hide resolved
cylc/flow/task_remote_mgr.py Outdated Show resolved Hide resolved
Copy link
Member

@MetRonnie MetRonnie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

21/32 files viewed

cylc/flow/workflow_files.py Outdated Show resolved Hide resolved
cylc/flow/workflow_files.py Outdated Show resolved Hide resolved
cylc/flow/remote.py Outdated Show resolved Hide resolved
Copy link
Member

@MetRonnie MetRonnie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

33/33 files viewed

tests/functional/lib/bash/test_header Show resolved Hide resolved
tests/unit/test_platforms_get_host.py Outdated Show resolved Hide resolved
tests/unit/test_platforms_get_host.py Outdated Show resolved Hide resolved
tests/unit/test_platforms_get_host.py Outdated Show resolved Hide resolved
cylc/flow/workflow_files.py Outdated Show resolved Hide resolved
* 'ihs.new-go' of github.com:wxtim/cylc: (29 commits)
  fixed a test
  scan: workflow_params
  fix -new-window → --new-window
  removed stray todos
  actions: revert dns hack
  Remove old "cylc search" command.
  response to review
  Update cylc/flow/cfgspec/workflow.py
  Update cylc/flow/cfgspec/globalcfg.py
  Update cylc/flow/cfgspec/globalcfg.py
  Update change log.
  Fix func test.
  Remove old "cylc edit" command.
  Change error type.
  startcp and starttask exclusive; extend unit test.
  Update cylc/flow/scheduler_cli.py
  Update cylc/flow/scheduler_cli.py
  Update cylc/flow/scheduler_cli.py
  Update cylc/flow/scheduler_cli.py
  Fix bug in runahead release with --start-task.'
  ...
@oliver-sanders
Copy link
Member

we might not want to lose the info from the logging of a flow run in a less verbose mode

I don't think this message provides us with any new information?

We should log:

  • When a new "bad host" is detected.
  • When the "bad hosts" set has been cleared.

No need to log the bad hosts with every single host-selection.

@wxtim
Copy link
Member Author

wxtim commented Jul 15, 2021

we might not want to lose the info from the logging of a flow run in a less verbose mode

I don't think this message provides us with any new information?

We should log:

* When a new "bad host" is detected.

* When the "bad hosts" set has been cleared.

No need to log the bad hosts with every single host-selection.

Sorry, I've fixed this now - I honestly thought that I had already...

Copy link
Member

@oliver-sanders oliver-sanders left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple of small comments on the rsync fallback:

cylc/flow/subprocpool.py Outdated Show resolved Hide resolved
Co-authored-by: Oliver Sanders <oliver.sanders@metoffice.gov.uk>
cylc/flow/cfgspec/globalcfg.py Outdated Show resolved Hide resolved
cylc/flow/platforms.py Outdated Show resolved Hide resolved
cylc/flow/scheduler.py Show resolved Hide resolved
@oliver-sanders
Copy link
Member

Still testing, no new issues.

wxtim and others added 2 commits July 15, 2021 13:43
Co-authored-by: Oliver Sanders <oliver.sanders@metoffice.gov.uk>
@wxtim
Copy link
Member Author

wxtim commented Jul 15, 2021

Processes touched

  • Task init
  • Task file install
  • Task Poll
  • Remote Tidy
  • Retrieve Job Logs

Copy link
Member

@oliver-sanders oliver-sanders left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still testing, made it most the way through ... 😓.

There is a potential bug in the rsync:fail => ssh logic if platform is not defined. I guess we could fallback to a plain ssh command although that makes it easy for the platform logic to be silently ignored. Here's a diff anyway:

diff --git a/cylc/flow/subprocpool.py b/cylc/flow/subprocpool.py
index 9c533dc3a..31e925f41 100644
--- a/cylc/flow/subprocpool.py
+++ b/cylc/flow/subprocpool.py
@@ -542,15 +542,21 @@ class SubProcPool:
         may cause different rsync failures depending on version, and some of
         the failures may be caused by other problems.
         """
-        if platform is not None:
-            ssh_cmd = platform['ssh command']
-
         rsync_255_fail = False
         if (
             ctx.cmd[0] == 'rsync'
             and ctx.ret_code not in [0, 255]
             and is_remote_host(ctx.host)
         ):
+            ssh_cmd = (
+                platform['ssh command']
+                if platform is not None
+                else 'ssh'
+            )
+            ssh_test_cmd = shlex.split(f'{ssh_cmd} {ctx.host} true')
+            LOG.info(
+                f'testing connectivity for {ctx.host} using {ssh_test_cmd}'
+            )
             ssh_test = run(
                 shlex.split(f'{ssh_cmd} {ctx.host} true'),
                 capture_output=True

cylc/flow/subprocpool.py Show resolved Hide resolved
@oliver-sanders
Copy link
Member

Save the one small issue above and the one small conflict, all good 🎉!

@wxtim wxtim closed this Jul 23, 2021
@wxtim wxtim reopened this Jul 23, 2021
- Oliver's suggested fallback for testing rsync fails
- Warning to docstring of construct_rsync_over_ssh_cmd to avoid breaking subprocpool.rsync_255_fail
@wxtim
Copy link
Member Author

wxtim commented Jul 23, 2021

@oliver-sanders I think that I've responded with fff1c7b.

Sorry about the close/reopen. I've going to restart my VM. It's very laggy. :(

@oliver-sanders
Copy link
Member

oliver-sanders commented Jul 23, 2021

Going in!

This was a tricky one, good work!

@oliver-sanders oliver-sanders merged commit 5408474 into cylc:master Jul 23, 2021
@wxtim wxtim deleted the ihs.new-go branch July 23, 2021 09:14
@oliver-sanders oliver-sanders modified the milestones: cylc-8.0.0, cylc-8.0b2 Jul 27, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants