
Compute cluster or site platform awareness #2199

Closed
hjoliver opened this issue Mar 9, 2017 · 11 comments
Labels: efficiency (for notable efficiency improvements)

Comments

@hjoliver (Member) commented Mar 9, 2017

Ref: https://groups.google.com/forum/#!topic/cylc/dFVNTeyPcrs

We could make Cylc aware of the concept of a group of HPC login nodes. If the original job submit node goes down, we could try the alternative node(s) in the event that a poll or kill command fails with host not found.

@hjoliver hjoliver added this to the some-day milestone Mar 9, 2017
@matthewrmshin (Contributor):

We need a concept similar to rose host-select in Cylc, but geared towards handling the login nodes of clusters. (rose host-select was designed as a poor person's load balancing system for a group of similar compute servers. We no longer have that requirement at our site, but we do still need to select cluster login nodes at random. Ideally this would be handled by better DNS routing at each site, but that is not always the case.)

I think we can do something like this:

  • In the global.rc, we'll have a [host-groups] section. Each entry will have something like host-group=host1, host2, .... Each host group can be assumed to share the same file system, batch system, etc.
  • In the suite.rc, we'll allow [runtime][TASK][remote]host=HOST-GROUP. A random host in the specified host group will be used for job submission, poll, kill, log retrieval, etc. (See the sketch after this list.)
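A minimal sketch of what this could look like, assuming the proposed (not yet existing) [host-groups] section; the group name spicy and its hosts are placeholders, and only the [runtime][TASK][remote] host item is existing syntax:

# global.rc (sketch - proposed new [host-groups] section)
[host-groups]
    spicy = host1, host2, host3

# suite.rc (sketch)
[runtime]
    [[model]]
        [[[remote]]]
            # "spicy" would be resolved to a random host from the group above
            # for job submission, poll, kill and log retrieval
            host = spicy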

@matthewrmshin matthewrmshin modified the milestones: later, some-day Mar 10, 2017
@matthewrmshin matthewrmshin self-assigned this Mar 10, 2017
@matthewrmshin (Contributor):

(Promoting the milestone and self-assigning, to avoid this being lost in the ether.)

@hjoliver (Member, author) commented Mar 11, 2017

@matthewrmshin - your proposal sounds good, but you haven't explicitly addressed what to do if the (randomly) chosen host goes down. Presumably (as I suggested above) we'd need a retry-via-other-host mechanism to handle poll and kill (etc.) failures due to the target host going offline? Of course this would only work for jobs submitted to a batch scheduler (background jobs running on a particular login node are just screwed if the node goes down).

@matthewrmshin (Contributor):

OK. We'll make sure to consider:

  • A host group only makes sense for a relevant batch scheduler. This must be configurable.
  • We will choose a random available host in the host group for job submit, poll and kill. If one host is unavailable, we'll pick the next one in the randomised list, until the list is exhausted.

@matthewrmshin matthewrmshin modified the milestones: soon, later Jun 22, 2017
@dvalters dvalters self-assigned this Jul 13, 2017
@dvalters dvalters removed their assignment Oct 18, 2017
@matthewrmshin matthewrmshin changed the title from "Cylc awareness of multiple login nodes." to "Compute cluster awareness" Oct 20, 2017
@matthewrmshin (Contributor) commented Oct 20, 2017

Change of title to allow a more general discussion of compute cluster support. (The suite host may be part of the cluster, so it is not limited to handling of login nodes.)

I can now see that global.rc should have a new [clusters] section that will mostly supersede the current [hosts] section.

# global.rc
[clusters]  # platforms?
    [[spicy]]
        login hosts = peppercorn, clove, cinnamon, fennel, star-anise
        batch system = slurm
        # and pretty much everything under a host subsection in the hosts section
    [[hedge-pea-sea]]
        login hosts = localhost
        batch system = pbs
        # and so on

With clusters, I think the following may also be relevant (a possible shape for these settings is sketched after this list):

  • Custom job management (e.g. submit, poll, kill) commands.
  • List of file systems that are shared with the suite host? And other clusters?
  • A URL for checking the status of the cluster?
  • Custom logic to invoke for collecting job accounting information when a job completes?
  • Batch scheduler directives that should be added to all jobs?
  • Number of jobs a user can submit to a cluster at a given time.
  • The ability to hold all tasks that target a cluster, e.g. when the cluster is scheduled for an outage. (See Hold (pause) by job host #2144.)
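A purely illustrative extension of the [clusters] sketch above, showing how some of these ideas might appear as cluster-level settings; none of the key names below exist, they are placeholders only:

# global.rc (sketch - all keys below are hypothetical placeholders)
[clusters]
    [[spicy]]
        login hosts = peppercorn, clove, cinnamon, fennel, star-anise
        batch system = slurm
        status url = https://hpc.example.com/spicy/status
        shared file systems = /home, /scratch
        max jobs per user = 100
        job accounting command = sacct --format=JobID,Elapsed,MaxRSS
        [[[directives]]]
            # directives to add to every job submitted to this cluster
            --account = myproject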

@matthewrmshin (Contributor) commented Jan 25, 2018

Somewhat related to #2144 and #2528. A recent unexpected outage meant that jobs were drained from the cluster while it remained down for an extended period of time. Suites were unable to poll or kill submitted/running tasks on the cluster. It would be nice if:

  • Users are able to reset all jobs submitted to a cluster in a single command.
  • Suites are able to detect this automatically (via a site setting?) and are then able to reset all affected tasks to a status such as submit-failed. (See also New task state 'killed'? #2394.)

@hjoliver (Member, author):

This issue has superseded #2144, absorbing the hold by host/cluster feature request.

@matthewrmshin (Contributor):

See also this discussion: https://groups.google.com/forum/?fromgroups=#!topic/cylc/KoFhCGurLTo - we should also consider the ability to configure cluster-specific environment variables, or even extra custom logic.
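For example (illustrative only - neither the [clusters] section above nor a cluster-level environment sub-section exists yet), cluster-specific environment variables might be configured like this:

# global.rc (sketch - hypothetical settings)
[clusters]
    [[spicy]]
        login hosts = peppercorn, clove, cinnamon, fennel, star-anise
        [[[environment]]]
            # would be exported for all jobs and job management
            # commands run on this cluster
            HTTP_PROXY = http://proxy.spicy.example:8080
            MODULEPATH = /opt/modules/spicy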

@wxtim (Member) commented Jul 31, 2019

Having started to consider this, I think there are some issues with describing all possible job hosts as "clusters". I have come to prefer the phrase "job platforms", which doesn't imply anything about whether we are running our jobs on a Raspberry Pi Zero, a desktop, or a Cray.

@matthewrmshin matthewrmshin modified the milestones: soon, cylc-8.0a2 Aug 28, 2019
@hjoliver hjoliver assigned wxtim and unassigned matthewrmshin Oct 8, 2019
@oliver-sanders oliver-sanders mentioned this issue Jan 28, 2020
@hjoliver hjoliver modified the milestones: cylc-8.0a2, cylc-8.0a3 Apr 30, 2020
@wxtim (Member) commented Jan 28, 2021

@hjoliver Can we close this issue? I think the only remaining related issue is #3827.

@hjoliver (Member, author):

Yes, good. Thanks for the reminder, @wxtim.

@hjoliver hjoliver modified the milestones: cylc-8.0a3, cylc-8.0b0 Feb 25, 2021