Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Platforms.groups #3793

Merged
merged 4 commits into from
Sep 18, 2020
Merged

Platforms.groups #3793

merged 4 commits into from
Sep 18, 2020

Conversation

wxtim
Copy link
Member

@wxtim wxtim commented Aug 28, 2020

Synopsis

Part addresses issue #3686

As per the platforms proposal it was agreed that platform aliases should be available. In their most basic form (here) these allow a task with a platform set by the user to be assigned by the site administrator to any of a set of platforms.

Work Carried Out

  • Added the [platform aliases][ALIAS]platforms config to the global.cylc config spec.
  • Used Python's random.choice to pick a platform from the list.
  • Created a functional test.

Work not carried out

Additional infrastructure for more sophisticated platform from alias selection.

Discussion Point

Can anyone suggest a better way to test this. My unit test is horrible, because I cannot see a way to test this functionality nicely.

@wxtim wxtim self-assigned this Aug 28, 2020
@wxtim wxtim marked this pull request as draft August 28, 2020 10:54
Copy link
Member

@oliver-sanders oliver-sanders left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks ok.

cylc/flow/platforms.py Outdated Show resolved Hide resolved
@wxtim wxtim added this to the cylc-8.0a3 milestone Aug 28, 2020
@wxtim wxtim marked this pull request as ready for review August 28, 2020 15:57
Copy link
Member

@hjoliver hjoliver left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks fine, but I wonder if it might be worth implementing a non-random selection method now to prevent test flakiness - which is likely to be painful even if it occurs infrequently.

@@ -469,7 +469,7 @@
Conf('hosts', VDR.V_STRING_LIST, ['localhost'])

# Platform Groups
with Conf('platform groups'):
with Conf('platform aliases'):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a vague feeling that we decided in the workshop to use "groups" rather than "aliases", perhaps because "groups" is more descriptive? - the purpose of this is to group a bunch of platforms together to be treated as one. However my notes don't mention that, and I suppose we should have picked this up in the proposal review stage if it mattered. "Aliases" isn't too bad anyway, so no biggie.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you're right, platform groups are listed in the "upgraded config" here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right then - I'll fix that.

@hjoliver
Copy link
Member

I wonder if it might be worth implementing a non-random selection method now to prevent test flakiness

One way of doing it would be to create a platform generator the first time an alias is used, then yield the next true platform name each time the alias is used. Then we could implement any selection method quite easily. Just for the hell of it, I tried "random" and "endless cycle" this way (cycling would enable a non-flaky test, and might be a useful real-world option too?) - it seems to work fine, here's the diff in case you want to consider it. I guess a similar method could be used for host selection within platforms too?

diff --git a/cylc/flow/platforms.py b/cylc/flow/platforms.py
index 666889fd5..73e991f42 100644
--- a/cylc/flow/platforms.py
+++ b/cylc/flow/platforms.py
@@ -19,11 +19,33 @@
 import random
 import re
 from copy import deepcopy
+from itertools import cycle
 
 from cylc.flow.exceptions import PlatformLookupError
 from cylc.flow.cfgspec.glbl_cfg import glbl_cfg
 
 
+platform_generators = {}
+
+
+def random_gen(stuff):
+    while True:
+        yield random.choice(stuff)
+
+
+def choose_platform(alias, platforms, method="random"):
+    method = "cycle"  # TEST ME
+    if alias not in platform_generators:
+        if method == "cycle":
+            gen = cycle(platforms)
+        elif method == "random":
+            gen = random_gen(platforms)
+        platform_generators[alias] = gen
+    else:
+        gen = platform_generators[alias]
+    return next(gen)
+
+
 def platform_from_name(platform_name=None, platforms=None):
     """
     Find out which job platform to use given a list of possible platforms and
@@ -55,9 +77,11 @@ def platform_from_name(platform_name=None, platforms=None):
 
     for platform_name_re in reversed(list(platform_aliases)):
         if re.fullmatch(platform_name_re, platform_name):
-            platform_name = random.choice(
-                platform_aliases[platform_name_re]['platforms']
-            )
+            # platform_name is an alias; replace with a real platform
+            platform_name = choose_platform(
+                platform_name,
+                platform_aliases[platform_name_re]['platforms'])
+            print('>>>>>>>>>>', platform_name)   # DEBUG
 
     # The list is reversed to allow user-set platforms (which are loaded
     # later than site set platforms) to be matched first and override site

@hjoliver
Copy link
Member

(This would require adding host selection method to [platform aliases] - but I guess we plan to do that anyway)

@wxtim
Copy link
Member Author

wxtim commented Sep 1, 2020

This seems like a reasonable template @hjoliver - and a good way around the flaky test. Thanks. I'll get on it.

Copy link
Member

@oliver-sanders oliver-sanders left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.


if platform_name is None:
platform_data = deepcopy(platforms['localhost'])
platform_data['name'] = 'localhost'
return platform_data

for platform_name_re in reversed(list(platform_aliases)):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Funny thought, what if someone defines both a platform and a platform group with the same name?

  • Should we validate platform groups to prevent this from occurring.
  • Otherwise the approach used here (go through platform groups before platforms) seems sensible to me.

Copy link
Member Author

@wxtim wxtim Sep 1, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are cases where this might be reasonable:

[platform aliases]
  # Site user prevents anyone specifying a login node 
  # (unless they do it explicitly by creating their own platform)
  [[hpc\d?]]
    platforms = hpc1, hpc2   #, hpc3 is out of service, but everyone's hpc3 tasks work anyway!
    ...

(although I can't off the top of my head imagine why hpc 1-3 are separate platforms (I suppose that they might have different bs command templates or something...))
...but equally allows wonderful brainaches...

[platform aliases]
  [[chocolate]]
    platforms = vanilla
  [[vanilla]]
    platforms = chocolate

On consideration I'd say leave it, but perhaps make sure we document in the section of the docs for site managers.

@wxtim
Copy link
Member Author

wxtim commented Sep 2, 2020

This would require adding host selection method to [platform aliases]

No, this would be platform selection method - host selection method would be added to the platform...

@oliver-sanders
Copy link
Member

oliver-sanders commented Sep 2, 2020

One way of doing it would be to create a platform generator the first time an alias is used

Nice simple approach, a cycle option would be great for testing.

There's just one barrier to overcome which is that these generators would be long lived? Which means that the list of hosts provided to the generator would be fossilised for the life of the workflow run. This gets in the way of automatically reloading the global config at regular intervals would would enable site admins to, say, remove a login node or drain a system.

 [platform groups]
     [[my_group]]
-        platforms = foo, bar, baz
+        platforms = bar, baz

Note we will encounter the same issues when it comes to selecting hosts from a platform:

[platforms]
    [[my_platform]]
-       hosts = foo, bar, baz
+       hosts = bar, baz

So I think we need to be passing the list of platforms to the selection method with each call. For a selection method like cycle which needs to maintain local state we will have to get creative, how about something like this:

def cycle_selection(_):
    index = 0
    def _cycle_selection(platforms):
        nonlocal index
        index += 1
        return platforms[index % len(platforms)]
    return _cycle_selection

METHODS = {
    'cycle': cycle_selection()
}

def choose_platform(alias, platforms, method="random"):
    meth = METHODS[method](platforms)

There is a future nice-to-have about allowing user-defined selection functions which would require this interface anyway function(list_of_platforms) -> platform e.g:

[platform group]
    [[my_group]]
        platforms = a, b, c
        selection = select_by_slurm_queue_length

@oliver-sanders
Copy link
Member

See the related #3800 for thoughts about integrating the existing Cylc host-selection logic into platforms.

@wxtim
Copy link
Member Author

wxtim commented Sep 3, 2020

I find myself uncomfortable doing too much jiggery-pokery for a platform selection method I don't think will ever actually be used. I'll have a quick play though.

@hjoliver
Copy link
Member

Nice simple approach, a cycle option would be great for testing.

There's just one barrier to overcome which is that these generators would be long lived? Which means that the list of hosts provided to the generator would be fossilised for the life of the workflow run

Hmm, I kinda thought the generator approach was more elegant. Presumably we could just "reset" a generator by replacement, at the appropriate time:

>>> from itertools import cycle
>>> lunch = ['cat', 'dog', 'sausage']
>>> gen = cycle(lunch)
>>> next(gen)
1
>>> next(gen)
2
>>> gen = cycle(lunch)  # RESET 
>>> next(gen)
1
>>> next(gen)
2

However, I haven't actually thought about the mechanics of how to do that in context, and @wxtim has implemented your approach, which works, - so no biggie 👍

@wxtim
Copy link
Member Author

wxtim commented Sep 15, 2020

I'm afraid that it all became a bit wierd and complex - @oliver-sanders original fix didn't remember which platform_group was which, and I had trouble getting modified logic working. Since in the current form the test is very, very unlikely (I can make it slower to fail correctly and less likely to give a false negative if you want?) to give a false negative I'd like to commend the PR in its original form to @oliver-sanders and @hjoliver.

No reason that we can't have a follow on request.

@hjoliver
Copy link
Member

Alright, fine with me 👍

Copy link
Member

@hjoliver hjoliver left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, tested as working. 👍

Maybe bang up a follow-up issue for better/additional section methods?

@oliver-sanders
Copy link
Member

oliver-sanders commented Sep 18, 2020

FYI: we realised that we are going to require a more advanced platform/host selection interface in order to be able to automatically handle network/system outages.

Here's a purely illustrative example to explain what I mean by this:

def host_from_platform_group(group):
    for platform in group:
        for host in platform['hosts']
            yield host

def submit_job(job):
    for host in host_from_platform_group(job['platform']):
        try:
            submit(job, host)
            break
        except (PlatformLookupError, HostSelectError, socket.gaierror, SomeTimeoutError):
            continue

@wxtim
Copy link
Member Author

wxtim commented Sep 18, 2020

FYI: we realised that we are going to require a more advanced platform/host selection interface in order to be able to automatically handle network/system outages.

Agreed, but can we get this mergéd?

@oliver-sanders oliver-sanders merged commit 1035c4e into cylc:master Sep 18, 2020
@wxtim wxtim changed the title Platforms.aliases Platforms.groups Sep 18, 2020
@hjoliver
Copy link
Member

FYI: we realised that we are going to require a more advanced platform/host selection interface in order to be able to automatically handle network/system outages.

Is there a follow-up issue up for this?

@wxtim
Copy link
Member Author

wxtim commented Sep 22, 2020

Is there a follow-up issue up for this?

#3827

@wxtim wxtim deleted the platforms.aliases branch September 22, 2020 06:39
@hjoliver hjoliver modified the milestones: cylc-8.0a3, cylc-8.0b0 Feb 25, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants