
platforms: platform and host selection methods and intelligent fallback #3827

Closed
oliver-sanders opened this issue Sep 18, 2020 · 10 comments · Fixed by #4329

oliver-sanders commented Sep 18, 2020

Follow on from #3686

  • Implement an interface for supporting different selection methods (e.g. random, ordered, rank by psutil metrics, etc).
  • Implement logic to request another host/platform if the first is not available (e.g. host down or network issues).

Selection Methods

This should involve the creation of interfaces (preferably implemented as Python entry-points to permit in-house extension) for:

  • Selection of a platform from a platform group.
  • Selection of a host from a platform.

We will need to allow these to be configured separately, e.g.:

[platforms]
    [[one]]
        hosts = a, b, c
        method = random
    [[two]]
        hosts = d, e, f
        method = cycle

[platform-groups]
    [[three]]
        platforms = one, two
        method = random
    [[four]]
        platforms = two, one
        method = cycle

A range of selection methods (e.g. random, ordered, rank by psutil metrics) would be desirable, but only the interface(s) need to be implemented to close this issue.
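As a purely illustrative sketch (the entry-point group name and the function signatures below are assumptions, not an agreed design), each selection method could be a plain callable that picks one item from a list, with in-house methods registered via entry points:

import random
from itertools import cycle

_cyclers = {}  # remembers the position per host/platform list for 'cycle'

def select_random(items):
    """The 'random' method: pick any item."""
    return random.choice(items)

def select_cycle(items):
    """The 'cycle' method: pick items in order, wrapping around."""
    key = tuple(items)
    if key not in _cyclers:
        _cyclers[key] = cycle(items)
    return next(_cyclers[key])

# In-house extensions could then be registered against a hypothetical
# entry-point group in setup.cfg, e.g.:
#     [options.entry_points]
#     cylc.host_select =
#         random = my_site_package.selectors:select_random
#         cycle = my_site_package.selectors:select_cycle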

Intelligent Fallback

In the event that a host is not available (e.g. for job submission), Cylc will need to pick an alternative host from the specified platform or platform group.

Here is a purely illustrative example to explain what is meant by this:

def host_from_platform_group(group):
    # yield every host from every platform in the group, in order
    for platform in group:
        for host in platform['hosts']:
            yield host

def submit_job(job):
    for host in host_from_platform_group(job['platform']):
        try:
            submit(job, host)
            break  # submission succeeded
        except (PlatformLookupError, HostSelectError, socket.gaierror, SomeTimeoutError):
            continue  # this host didn't work, try the next

This functionality will be required in a lot of different places (e.g. remote-init, job submission, job polling) so it would make sense to centralise it.
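Purely as an illustration of that centralisation, a single helper could own the retry loop so each caller only supplies the operation to attempt; the helper name and error handling below are hypothetical:

def call_with_fallback(operation, platform_group, errors):
    """Try operation(host) on each host in the group until one succeeds."""
    for host in host_from_platform_group(platform_group):
        try:
            return operation(host)
        except errors:
            continue  # this host is no good, try the next one
    raise RuntimeError('no hosts available in this platform group')

# e.g. for job submission:
#     call_with_fallback(
#         lambda host: submit(job, host),
#         job['platform'],
#         (PlatformLookupError, HostSelectError, socket.gaierror, SomeTimeoutError),
#     )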

Reloading The Global Config

Issue #3762 will see the global config reloaded at a set interval for the lifetime of the scheduler. Any selection logic should be robust to this: the list of platforms in a group and the hosts in a platform are volatile and may change as sys admins move workload around a system.
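One way to stay robust to a reload, sketched here with hypothetical accessor functions (get_platform_group, get_platform), is to re-resolve the platform and host lists from the current global config on every selection attempt rather than caching them at start-up:

def hosts_for_group(group_name):
    # Re-read the group and its platforms from the (possibly reloaded)
    # global config each time a selection is needed; `get_platform_group`
    # and `get_platform` are hypothetical accessors.
    group = get_platform_group(group_name)
    for platform_name in group['platforms']:
        platform = get_platform(platform_name)
        for host in platform['hosts']:
            yield host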

oliver-sanders commented Sep 22, 2020

@wxtim and I had a talk through the "intelligent fallback" part of this issue which has raised some questions...

The basic premise of this logic is that if a host goes down, rather than just failing Cylc can re-try the operation on another host.

So how do we know if a host is down? There are two options:

  1. Test the connection up-front.
    • This is what the current rose host-select logic does (it executes a short bash script on the remote host and sends some numbers back).
    • Testing connection to every host, every time we want to run a remote command is an overhead.
    • Even if SSH has been configured to reuse connections via "control persist" it is still an extra subprocess call.
    • These connection test processes could overwhelm the process pool, especially if there are connection issues.
  2. Perform the operation (e.g. job submission) and diagnose any failure.
    • SSH exits 255 in the event of comms issues, so could we forgo connection testing and just use this signal (see the sketch below)?
    • Saves an unnecessary connection/process.
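For reference, a minimal sketch of what using that signal might look like; the wrapper below is illustrative, not an existing Cylc function:

import subprocess

def run_over_ssh(host, cmd):
    """Run cmd (a list of strings) on host over SSH.

    OpenSSH itself exits with 255 when the connection fails, which is
    distinct from the exit code of the remote command.
    """
    proc = subprocess.run(
        ['ssh', '-oBatchMode=yes', host] + cmd,
        capture_output=True,
        text=True,
    )
    if proc.returncode == 255:
        # comms problem: let the caller fall back to another host
        return None
    return proc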

However, not all issues are comms based. For example, what if the platform is not accepting new jobs, say because the queue is full or closed? This seems like the sort of thing Cylc should be able to handle gracefully. In this case there is no point trying another host within the platform; however, it may be worth trying another platform within the group.

Should we:

  1. Only handle SSH failure.
    • Any other errors are too involved for Cylc to handle.
  2. Provide an interface to the batch_sys_handlers allowing them to diagnose failures (see the sketch after this list).
    • e.g. by checking return codes or scraping command output.
    • Note that some failures may imply a bad host whereas others may imply a bad platform.
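For example, a diagnosis hook on a batch system handler might look something like this; the method name, enum and return values are hypothetical, not an existing interface:

from enum import Enum

class SubmitFailure(Enum):
    BAD_HOST = 'bad host'          # try another host in the same platform
    BAD_PLATFORM = 'bad platform'  # try another platform in the group
    FATAL = 'fatal'                # give up: the task is submit-failed

class MyBatchSysHandler:
    def diagnose_submit_failure(self, ret_code, out, err):
        """Classify a failed submission from its return code and output."""
        if ret_code == 255:
            return SubmitFailure.BAD_HOST  # SSH/comms problem
        if 'queue is closed' in err or 'queue is full' in err:
            return SubmitFailure.BAD_PLATFORM
        return SubmitFailure.FATAL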

Anything I've missed out @wxtim?

@hjoliver

Hmm, complicated 😬

I have too many questions about this - might be a good one to discuss at next meeting?

wxtim commented Sep 23, 2020

I think that is a fair summary of what we discussed. I have just spent some time looking at the code that was bothering me - it's still not completely clear how job submission might work, but I think ultimately it still comes down to the question of how we distinguish a submit failure that we want to retry with different platform settings from one that we should allow to stand.

@dpmatthews

I vote for "Only handle SSH failure".
This already provides a massive improvement over cylc 7.
If we want to try to go further than this then I think this should be a future enhancement (not required for cylc 8).

oliver-sanders commented Sep 24, 2020

Ok, I vote (2,3), don't perform a trial connection, detect 255 error codes.

wxtim commented Oct 13, 2020

I've got a proof of concept branch where I've hacked the remote_init and host_from_platform methods to allow us to keep trying hosts until we get one we like. I haven't implemented any of the other procedures requiring it though, so the example suite will stall after remote init.

I had a discussion with @oliver-sanders (feel free to edit this post to make it more accurately reflect our discussion) yesterday, where I talked about the fact that centralizing the logic looks a little tricky because you need to store state information, but also to housekeep it. This discussion generated multiple approaches:

  • Re-write the task pool using asyncio
  • Convert the platform object from a dictionary to an instance of a custom data type capable of storing information about hosts and platforms tried and not used (see the sketch after this list). Alternatively, add the required fields ('bad_hosts', 'bad platforms') to the platform dictionary.
  • There was a third approach. I can't remember anything about it. (Duh)
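A minimal sketch of the custom data type idea from the second bullet above; the class and field names are illustrative only:

from dataclasses import dataclass, field
from typing import List, Set

@dataclass
class Platform:
    """Illustrative container for a platform plus its selection state."""
    name: str
    hosts: List[str]
    bad_hosts: Set[str] = field(default_factory=set)

    def good_hosts(self):
        """The hosts not (yet) marked as bad."""
        return [host for host in self.hosts if host not in self.bad_hosts]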

oliver-sanders commented Oct 13, 2020

We have to preserve the state of the selection process somewhere, either globally accessible (i.e. via a unique id) or locally scoped via some other mechanism.

The state is effectively composed of a list of platforms which have been tried and a list of hosts within each platform which have been tried. The easiest way to preserve the state is probably just to use a generator (as they hold their state within their scope until destroyed). Take for example this reference implementation:

def select_generator(platform_group):
    # copy the lists so the configuration itself is not mutated
    platforms = list(platform_group['platforms'])
    while platforms:
        platform = select(platforms, platform_group['method'])
        hosts = list(platform['hosts'])
        while hosts:
            host = select(hosts, platform['method'])
            yield host
            hosts.remove(host)
        platforms.remove(platform)

The question is how to hook that up to the call/callback framework into which it must fit. Simplified version:

def controller():
    proc_pool.call(
        call_remote_init,
        callback_remote_init
    )

def call_remote_init(*args):
    pass

def callback_remote_init(*args):
    pass

Using global state storage it would look something like:

def call_remote_init(id, *args):
    # note the store is some session/globally scoped object
    if id:
        # retrieve the state from the store
        gen = store[id]
    else:
        id = uuid()  # selection id
        # put the state into the store
        store[id] = select_generator(platform_group)
        gen = store[id]
    try:
        host = next(gen)
    except StopIteration:
        # no available hosts - remote-init failure
        pass
    # ...

def callback_remote_init(id, *args):
    if returned_a_255_error_code:
        # try again with the next host
        call_remote_init(id, *args)
        return
    else:
        # success - housekeep the store so it doesn't leak
        del store[id]
    # ...

This would do the job; however, the pattern would have to be reproduced for each call/callback pair (remote_init, job_submission) and kinda feels messy. The main drawback is that this state store must be housekept (the del store[id] bit) else we will have a memory leak.

If this code were all async we would not need the state store.

async def remote_init(*args):
    for host in select_generator(platform_group):
        result = await proc_pool.call(...)
        if returned_a_255_error_code:
            continue
        else:
            break
    else:
        # no hosts - remote-init failure
        pass

Nice and clean, but re-writing the subprocess pool is a bit much right now. So how do we bridge a call/callback pattern to an async pattern?

async def remote_init(*args):
    for host in select_generator(platform_group):
        # pass a future object into the call/callback
        future = asyncio.Future()
        call_remote_init(future, *args)
        await future
        if returned_a_255_error_code:
            continue
        else:
            break
    else:
        # no hosts - remote-init failure
        pass

def call_remote_init(future, *args):
    pass

def callback_remote_init(future, *args):
    # in the callback mark the future as done, this returns control to remote_init
    future.set_result(None)

I'm not sure how to approach this; there are pros and cons either way. The async approach might be nice; however, the async code doesn't currently reach down very far from the Scheduler, so it would involve adding a lot of async statements along the call stack just to be able to call remote_init in that way.

wxtim commented Apr 29, 2021

On 22/04 @oliver-sanders said:

Intelligent Host Selection - Job Submission
The idea is Cylc tries to submit the job to one host; if that fails with a 255 error it tries the next one; if it runs out of hosts it puts the job into the submit-failed state.
This means the job submission system has to remember the hosts it has previously tried. At the moment on Tim's branch this state is centralised, which makes sense. This way Cylc doesn't try to use hosts which it knows are down. This is good for efficiency and also prevents flooding the system with SSH commands that sit around until they hit their timeout.
Problems:
We would have to wipe this global state at some point otherwise hosts which have come back will not be considered for any operation.
If there is a transient network problem all hosts could be added to the blacklist and the workflow would sit around doing nothing until the state is wiped.
Best solution I can come up with is:
Reset this state periodically (use cylc.flow.main_loop.periodical)
AND also reset the state whenever a submit-failed task is retriggered either manually or by a scheduled automatic retry.
Thoughts?

But after a discussion between me and @dpmatthews this morning the following is proposed:

  • Reset the badhosts state periodically.
  • When get_hosts_from_platforms fails because set(hosts) - set(badhosts) is empty, remove platform['hosts'] from badhosts before raising an error. As a result, future submissions using this platform should start with a blank slate (see the sketch below).
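A minimal sketch of the proposed clearing behaviour; the function and exception names are stand-ins, not the final implementation:

def get_host_from_platform(platform, badhosts):
    usable = set(platform['hosts']) - badhosts
    if not usable:
        # every host on this platform has failed: clear them from badhosts
        # before raising, so a retry (auto or manual) starts with a clean slate
        badhosts -= set(platform['hosts'])
        raise NoHostsError('no hosts available on this platform')
    # pick one of the remaining hosts using the platform's selection method
    return select(sorted(usable), platform['method'])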

Scenarios Considered.

  • Platform has no bad hosts - this logic is irrelevant.
  • Platform has some bad hosts - remote task retried until a good host is found. Bad hosts persist until the end of the clearance interval.
  • Platform has no good hosts - task fails, removing the hosts of that task's platform from badhosts. Any retries (auto or manual) will start with a clean slate.

Can you see any issues @oliver-sanders ?

wxtim commented Apr 29, 2021

Additional question: if get_host_from_platform fails at job-submit, then this is clearly a job-submit failure. What happens if it fails at remote init or file install? Do we return a full-bodied error? Or try to handle it?

@hjoliver

Failed remote init or file install should log the full error, I should think, so the user can see what's gone wrong and either fix it or alert system admins. I don't think we could handle this kind of error automatically?
