
platforms: platform and host selection methods and intelligent fallback #3827

Closed
oliver-sanders opened this issue Sep 18, 2020 · 10 comments · Fixed by #4329

oliver-sanders commented Sep 18, 2020

Follow on from #3686

  • Implement an interface for supporting different selection methods (e.g. random, ordered, rank by psutil metrics, etc).
  • Implement logic to request another host/platform if the first is not available (e.g. host down or network issues).

Selection Methods

This should involve the creation of interfaces (preferably implemented as Python entry-points to permit in-house extension) for:

  • Selection of a platform from a platform group.
  • Selection of a host from a platform.

We will need to allow these to be configured separately, e.g.:

[platforms]
    [[one]]
        hosts = a, b, c
        method = random
    [[two]]
        hosts = d, e, f
        method = cycle

[platform-groups]
    [[three]]
        platforms = one, two
        method = random
    [[four]]
        platforms = two, one
        method = cycle

A range of selection methods (e.g. random, ordered, rank by psutil metrics) would be desirable, but only the interface(s) need to be implemented to close this issue.
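As a purely illustrative sketch (the entry-point group name and the function signatures below are assumptions, not an agreed design), each selection method could be a plain callable that picks one item from a list, with in-house methods registered via entry points:

import random
from itertools import cycle

_cyclers = {}  # remembers the position per host/platform list for 'cycle'

def select_random(items):
    """The 'random' method: pick any item."""
    return random.choice(items)

def select_cycle(items):
    """The 'cycle' method: pick items in order, wrapping around."""
    key = tuple(items)
    if key not in _cyclers:
        _cyclers[key] = cycle(items)
    return next(_cyclers[key])

# In-house extensions could then be registered against a hypothetical
# entry-point group in setup.cfg, e.g.:
#     [options.entry_points]
#     cylc.host_select =
#         random = my_site_package.selectors:select_random
#         cycle = my_site_package.selectors:select_cycle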

Intelligent Fallback

In the event that a host is not available (e.g. for job submission), Cylc will need to pick an alternative host from the specified platform or platform group.

Here is a purely illustrative example to explain what is meant by this:

def host_from_platform_group(group):
    # yield every host from every platform in the group, in order
    for platform in group:
        for host in platform['hosts']:
            yield host

def submit_job(job):
    for host in host_from_platform_group(job['platform']):
        try:
            submit(job, host)
            break  # submission succeeded
        except (PlatformLookupError, HostSelectError, socket.gaierror, SomeTimeoutError):
            continue  # this host didn't work, try the next

This functionality will be required in a lot of different places (e.g. remote-init, job submission, job polling) so it would make sense to centralise it.
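Purely as an illustration of that centralisation, a single helper could own the retry loop so each caller only supplies the operation to attempt; the helper name and error handling below are hypothetical:

def call_with_fallback(operation, platform_group, errors):
    """Try operation(host) on each host in the group until one succeeds."""
    for host in host_from_platform_group(platform_group):
        try:
            return operation(host)
        except errors:
            continue  # this host is no good, try the next one
    raise RuntimeError('no hosts available in this platform group')

# e.g. for job submission:
#     call_with_fallback(
#         lambda host: submit(job, host),
#         job['platform'],
#         (PlatformLookupError, HostSelectError, socket.gaierror, SomeTimeoutError),
#     )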

Reloading The Global Config

Issue #3762 will see the global config reloaded at a set interval for the lifetime of the scheduler. Any selection logic should be robust to this: the list of platforms in a group and the hosts in a platform are volatile and may change as sys admins move workload around a system.
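One way to stay robust to a reload, sketched here with hypothetical accessor functions (get_platform_group, get_platform), is to re-resolve the platform and host lists from the current global config on every selection attempt rather than caching them at start-up:

def hosts_for_group(group_name):
    # Re-read the group and its platforms from the (possibly reloaded)
    # global config each time a selection is needed; `get_platform_group`
    # and `get_platform` are hypothetical accessors.
    group = get_platform_group(group_name)
    for platform_name in group['platforms']:
        platform = get_platform(platform_name)
        for host in platform['hosts']:
            yield host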

oliver-sanders commented Sep 22, 2020

@wxtim and I had a talk through the "intelligent fallback" part of this issue which has raised some questions...

The basic premise of this logic is that if a host goes down, rather than just failing Cylc can re-try the operation on another host.

So how do we know if a host is down? There are two options:

  1. Test the connection up-front.
    • This is what the current rose host-select logic does (it executes a short bash script on the remote host and sends some numbers back).
    • Testing connection to every host, every time we want to run a remote command is an overhead.
    • Even if SSH has been configured to reuse connections via "control persist" it is still an extra subprocess call.
    • These connection test processes could overwhelm the process pool, especially if there are connection issues.
  2. Perform the operation (e.g. job submission) and diagnose any failure.
    • SSH exits 255 in the event of comms issues, so could we forgo connection testing and just use this signal (see the sketch below)?
    • Saves an unnecessary connection/process.
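For reference, a minimal sketch of what using that signal might look like; the wrapper below is illustrative, not an existing Cylc function:

import subprocess

def run_over_ssh(host, cmd):
    """Run cmd (a list of strings) on host over SSH.

    OpenSSH itself exits with 255 when the connection fails, which is
    distinct from the exit code of the remote command.
    """
    proc = subprocess.run(
        ['ssh', '-oBatchMode=yes', host] + cmd,
        capture_output=True,
        text=True,
    )
    if proc.returncode == 255:
        # comms problem: let the caller fall back to another host
        return None
    return proc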

However, not all issues are comms based. For example, what if the platform is not accepting new jobs, say because the queue is full or closed? This seems like the sort of thing Cylc should be able to handle gracefully. In this case there is no point trying another host within the platform; however, it may be worth trying another platform within the group.

Should we:

  1. Only handle SSH failure.
    • Any other errors are too involved for Cylc to handle.
  2. Provide an interface to the batch_sys_handlers allowing them to diagnose failures (see the sketch after this list).
    • e.g. by checking return codes or scraping command output.
    • Note that some failures may imply a bad host whereas others may imply a bad platform.
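For example, a diagnosis hook on a batch system handler might look something like this; the method name, enum and return values are hypothetical, not an existing interface:

from enum import Enum

class SubmitFailure(Enum):
    BAD_HOST = 'bad host'          # try another host in the same platform
    BAD_PLATFORM = 'bad platform'  # try another platform in the group
    FATAL = 'fatal'                # give up: the task is submit-failed

class MyBatchSysHandler:
    def diagnose_submit_failure(self, ret_code, out, err):
        """Classify a failed submission from its return code and output."""
        if ret_code == 255:
            return SubmitFailure.BAD_HOST  # SSH/comms problem
        if 'queue is closed' in err or 'queue is full' in err:
            return SubmitFailure.BAD_PLATFORM
        return SubmitFailure.FATAL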

Anything I've missed out @wxtim?

@hjoliver

Hmm, complicated 😬

I have too many questions about this - might be a good one to discuss at next meeting?

wxtim commented Sep 23, 2020

I think that is a fair summary of what we discussed. I have just spent some time looking at the code that was bothering me - it's still not completely clear how job submission might work, but I think ultimately it still comes down to the question of how we distinguish a submit failure that we want to retry with different platform settings from one that we should allow to stand.

@dpmatthews

I vote for "Only handle SSH failure".
This already provides a massive improvement over cylc 7.
If we want to try to go further than this then I think this should be a future enhancement (not required for cylc 8).

oliver-sanders commented Sep 24, 2020

Ok, I vote (2,3), don't perform a trial connection, detect 255 error codes.

wxtim commented Oct 13, 2020

I've got a proof of concept branch where I've hacked the remote_init and host_from_platform methods to allow us to keep trying hosts until we get one we like. I haven't implemented any of the other procedures requiring it though, so the example suite will stall after remote init.

I had a discussion with @oliver-sanders (feel free to edit this post to make it more accurately reflect our discussion) yesterday, where I talked about the fact that centralizing the logic looks a little tricky because you need to store state information, but also to housekeep it. This discussion generated multiple approaches:

  • Re-write the task pool using asyncio
  • Convert the platform object from a dictionary to an instance of a custom data type capable of storing information about hosts and platforms tried and not used (see the sketch after this list). Alternatively, add the required fields ('bad_hosts', 'bad platforms') to the platform dictionary.
  • There was a third approach. I can't remember anything about it. (Duh)
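A minimal sketch of the custom data type idea from the second bullet above; the class and field names are illustrative only:

from dataclasses import dataclass, field
from typing import List, Set

@dataclass
class Platform:
    """Illustrative container for a platform plus its selection state."""
    name: str
    hosts: List[str]
    bad_hosts: Set[str] = field(default_factory=set)

    def good_hosts(self):
        """The hosts not (yet) marked as bad."""
        return [host for host in self.hosts if host not in self.bad_hosts]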

oliver-sanders commented Oct 13, 2020

We have to preserve the state of the selection process somewhere, either globally accessible (i.e. via a unique id) or locally scoped via some other mechanism.

The state is effectively composed of a list of platforms which have been tried and a list of hosts within each platform which have been tried. The easiest way to preserve the state is probably just to use a generator (as they hold their state within their scope until destroyed). Take for example this reference implementation:

def select_generator(platform_group):
    # copy the lists so the configuration itself is not mutated
    platforms = list(platform_group['platforms'])
    while platforms:
        platform = select(platforms, platform_group['method'])
        hosts = list(platform['hosts'])
        while hosts:
            host = select(hosts, platform['method'])
            yield host
            hosts.remove(host)
        platforms.remove(platform)

The question is how to hook that up to the call/callback framework into which it must fit. Simplified version:

def controller():
    proc_pool.call(
        call_remote_init,
        callback_remote_init
    )

def call_remote_init(*args):
    pass

def callback_remote_init(*args):
    pass

Using global state storage it would look something like:

def call_remote_init(id, *args):
    # note the store is some session/globally scoped object
    if id:
        # retrieve the state from the store
        gen = store[id]
    else:
        id = uuid()  # selection id
        # put the state into the store
        store[id] = select_generator(platform_group)
        gen = store[id]
    try:
        host = next(gen)
    except StopIteration:
        # no available hosts - remote-init failure
        pass
    # ...

def callback_remote_init(id, *args):
    if returned_a_255_error_code:
        # try again with the next host
        call_remote_init(id, *args)
        return
    else:
        # success - housekeep the store so it doesn't leak
        del store[id]
    # ...

This would do the job; however, the pattern would have to be reproduced for each call/callback pair (remote_init, job_submission) and kinda feels messy. The main drawback is that this state store must be housekept (the del store[id] bit) else we will have a memory leak.

If this code were all async we would not need the state store.

async def remote_init(*args):
    for host in select_generator(platform_group):
        result = await proc_pool.call(...)
        if returned_a_255_error_code:
            continue
        else:
            break
    else:
        # no hosts - remote-init failure
        pass

Nice and clean, but re-writing the subprocess pool is a bit much right now. So how do we bridge a call/callback pattern to an async pattern?

async def remote_init(*args):
    for host in select_generator(platform_group):
        # pass a future object into the call/callback
        future = asyncio.Future()
        call_remote_init(future, *args)
        await future
        if returned_a_255_error_code:
            continue
        else:
            break
    else:
        # no hosts - remote-init failure
        pass

def call_remote_init(future, *args):
    pass

def callback_remote_init(future, *args):
    # in the callback mark the future as done, this returns control to remote_init
    future.set_result(None)

I'm not sure how to approach this; there are pros and cons either way. The async approach might be nice; however, the async code doesn't currently reach down very far from the Scheduler, so it would involve adding a lot of async statements along the call stack just to be able to call remote_init in that way.

wxtim commented Apr 29, 2021

On 22/04 @oliver-sanders said:

Intelligent Host Selection - Job Submission
The idea is Cylc tries to submit the job to one host; if that fails with a 255 error it tries the next one; if it runs out of hosts it puts the job into the submit-failed state.
This means the job submission system has to remember the hosts it has previously tried. At the moment on Tim's branch this state is centralised, which makes sense. This way Cylc doesn't try to use hosts which it knows are down. This is good for efficiency and also prevents flooding the system with SSH commands that sit around until they hit their timeout.
Problems:
We would have to wipe this global state at some point otherwise hosts which have come back will not be considered for any operation.
If there is a transient network problem all hosts could be added to the blacklist and the workflow would sit around doing nothing until the state is wiped.
Best solution I can come up with is:
Reset this state periodically (use cylc.flow.main_loop.periodical)
AND also reset the state whenever a submit-failed task is retriggered either manually or by a scheduled automatic retry.
Thoughts?

But after a discussion between me and @dpmatthews this morning the following is proposed:

  • Reset the badhosts state periodically.
  • When get_hosts_from_platforms fails because set(hosts) - set(badhosts) is empty, remove platform['hosts'] from badhosts before raising an error. As a result, future submissions using this platform should start with a blank slate (see the sketch below).
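A minimal sketch of the proposed clearing behaviour; the function and exception names are stand-ins, not the final implementation:

def get_host_from_platform(platform, badhosts):
    usable = set(platform['hosts']) - badhosts
    if not usable:
        # every host on this platform has failed: clear them from badhosts
        # before raising, so a retry (auto or manual) starts with a clean slate
        badhosts -= set(platform['hosts'])
        raise NoHostsError('no hosts available on this platform')
    # pick one of the remaining hosts using the platform's selection method
    return select(sorted(usable), platform['method'])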

Scenarios Considered.

  • Platform has no bad hosts - this logic is irrelevant.
  • Platform has some bad hosts - remote task retried until a good host is found. Bad hosts persist until the end of the clearance interval.
  • Platform has no good hosts - task fails, removing the hosts of that task's platform from badhosts. Any retries (auto or manual) will start with a clean slate.

Can you see any issues @oliver-sanders ?

wxtim commented Apr 29, 2021

Additional question: if get_host_from_platform fails at job-submit, then this is clearly a job-submit failure. What happens if it fails at remote init or file install? Do we return a full-bodied error? Or try to handle it?

@hjoliver

Failed remote init or file install should log the full error, I should think, so the user can see what's gone wrong and either fix it or alert system admins. I don't think we could handle this kind of error automatically?
