Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal cylc authorisation #125

Merged
merged 17 commits into from
Jun 17, 2021
Merged

Conversation

datamel
Copy link
Contributor

@datamel datamel commented Apr 26, 2021

Proposal for UI-Server authorisation.
Proof of concept PR : cylc/cylc-uiserver#204

@datamel datamel self-assigned this Apr 26, 2021
@hjoliver hjoliver self-requested a review April 27, 2021 05:40
@datamel
Copy link
Contributor Author

datamel commented Apr 27, 2021

Converting this to a draft as the config section in the proposal requires additional thought.

@datamel datamel marked this pull request as draft April 27, 2021 12:37
Copy link
Member

@hjoliver hjoliver left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks really good so far @datamel, nice and clear. A few things to discuss at the next meeting, if not resolved by then.

@datamel datamel marked this pull request as ready for review May 5, 2021 01:35
@datamel datamel requested a review from dpmatthews May 5, 2021 01:35
@hjoliver
Copy link
Member

hjoliver commented May 5, 2021

I took another pass through this, still looks great 👍

The only thing I'm wondering about is the current read/write/execute model, where execute permission (scheduler start, stop) implies write permission (task mutations). Arguably that should be the other way around, or perhaps separate with no inheritance of one from the other. If you can start my workflow it will run exactly what I configured it to run. But if you can execute task mutations on my running workflow you can change what actually runs (potentially drastically).

@dwsutherland
Copy link
Member

dwsutherland commented May 7, 2021

Yeah, need something like read/play/edit

Where play/operate would be stop/play/trigger ...etc, and write would be file-manipulation(viewing?)/trigger-edit/broadcast

UI Server back-end code should prevent users manipulating client-side code to bypass authorisation. For example, sending a GraphQL query for a different operation to that stated in the operation name parameter.

As mentioned elsewhere, we can achieve fine grained authorisation via middleware. For example:
https://docs.graphene-python.org/en/latest/execution/middleware/#example
where can write any function into this, and have all information needed here (even the cached authorisation config if so desired).. And it should be the first middleware that is executed (the last thing is always the resolver fetching/command-run). i.e.

        return self._create_handler(
            path,
            clazz,
            schema=schema,
            resolvers=self.resolvers,
            backend=CylcGraphQLBackend(),
            middleware=[AuthorisationMiddleware, IgnoreFieldMiddleware],
            **kwargs
        )

https://github.com/cylc/cylc-uiserver/blob/1c194fdf59983d009f9afe10a2e3091fce9fc746/cylc/uiserver/main.py#L364

The reason we should use Graphene's inbuilt middleware, is because request object methods (like get_graphql_params in your PoC) only give you the operation i.e. query, mutation, subscription. And you'd need to parse the graphql doc from there, because you can run something like this:

mutation {
  pause (workflows: ["fox/run1"]) {
    result
  }
  kill (workflows: ["fox/run1"], tasks: ["*"]) {
    result
  }
}

At the middle level you could allow pause and do nothing (or return unauthorised) on kill (or w/e combination).
However, I'm not entirely sure how the UI will get the information to grey-out functionality.. We may just have to use a UIS query or separate end-point, as GraphQL introspection probably won't send different content for different users (?).

I've tried a number of methods for fetching group membership for the authenticated browser users i.e. return the same groups as I see using groups on the Linux command line, which should be both local groups and remote via SSSD/LDAP.

I don't think we need to tether the grouping of our UIS authorisation to the system groupings (as the HPC/System admins may have their own grouping ideas), but allow for users to decide which other system users can do with their suites. At NIWA, I don't have write permission or are part of the same file system grouping as our operational role accounts (but have sudo access to assume their role). Obviously the config should be user write only (0644), and this should be checked when loading it in.

It will essentially allow you to give full permission of your user and it's capabilities (as with edit/execute permission, file/job edit and broadcast can be made to do anything you like), so to please HPC/systems/security folk, you'd need to have a max giveaway configurable at the site level (defaulted to read).

Not directly related to this proposal, although perhaps worth bearing in mind is access on the command line.

Once the CLI can go through the proxy => hub => UIS, we'll be covered 👍

Nice work!

@datamel
Copy link
Contributor Author

datamel commented May 7, 2021

Thanks so much for the feedback @dwsutherland. I had investigated middleware and moved away from it initially, due to other things needing authorisation (not just graphql queries/mutations). Thanks for your example, I shall have another look.
The POC is a little outdated (as the proposal has evolved), I will look at updating it and this proposal.

@datamel
Copy link
Contributor Author

datamel commented May 25, 2021

So far, we have the following options.
It would be good to get some final thoughts from you all. Perhaps I need to extend my PoC for all 3?

Option A: Decorating handler actions

Adapt my current authorization PoC so that it parses the query body to get all operations in the graphql. I had originally identified this as something we would need to do (was aware that we could not rely on operation name alone):

# TODO Should also check the query e.g. it starts with 'mutation pause'
# For security in case an operation name is sent different to the query
# Hence fetching below

I still feel strongly that the responsibility of authentication and authorization should sit with the handler layer and prevent unauthorised access as early as possible. We expose more code, the further down we check authorization, which is not great if we’re also being a bit hacky with said code to make it do auth elsewhere.

Option B: Middleware

As @dwsutherland rightly suggests, we could place the auth in graphene middleware. Although we do not have immediate access to the authenticated user at this point.
So, to do this we’d push additional properties onto the middleware object in the handlers. I have tested this (obviously just hard-coded to the second element but we can find it via object type).
For the subscription handler via the subscription server field e.g.

           self.subscription_server.middleware[1].current_user = self.current_user
           self.subscription_server.middleware[1].user_auth_config = self.user_auth_config

For the graphql handler via the middleware list:

    self.middleware[1].current_user = self.current_user

This worked fine, although we would still need to parse the graphql in the middleware to determine the operations.
This would only cover authorisation for graphql and not the static files, the main UI handler or the user profiler handler. It would also mean having 2 independent authorisation solutions, both needing maintenance etc.

Option C: Decorating the resolver functions

As per @oliver-sanders thoughts that we should avoid additional parsing of the query, which we need to do to discover operations in a query.
My instincts when I first picked this work up was to work with the resolvers themselves but I was told that only the UI server code should be touched. So I have not had a chance to prove this.

I’m happy to give it a go for another PoC but there’s a few points for discussion first:
• Do we really want to pass authentication that far down from the handler? Feels wrong to raise a web.HTTPError(403) from cylc flow code, which shouldn’t have ‘knowledge’ of the website side of things.
• Might be hacky getting user details that far (not proved it yet).
• Wouldn’t we be doing additional processing over every resolver being hit (until an unauthorized operation is reached)? - surely this would tip the argument in favour of option A which, granted, has drawback of additional parsing of the query but all three options have some additional processing debt.
• As Option B: This would only cover authorization for graphql and not the static files, the main UI handler or the user profiler handler. It would also mean having 2 independent authorization solutions, both needing maintenance etc.

Also worth noting, I have looked into whether we can parse the request at handler and then continue passing this through, ready parsed, but the graphene layer parses the request regardless.
https://github.com/graphql-python/graphene-tornado/blob/2800ff581fd2edd665d2cf996e1601e7e892cb53/graphene_tornado/tornado_graphql_handler.py#L171.
I could put a pr into graphene to skip when it receives an already parsed request but I suspect it would not be accepted as that is major code change which could have a knock on effect for other users. So, I feel if we go with this option we are stuck parsing twice.

On a related point, if a user requests multiple mutations in one request, and they are authorised for e.g. 2 out of 3, do we want…
a) Two to be allowed and the third to be blocked and logged
b) The entire request to be blocked and logged, i.e. all 3 mutations denied?
I'm thinking most likely a) but thought I'd double check.

@oliver-sanders
Copy link
Member

I can see the temptation to parse the GraphQL document up front but I'll chuck in a few comments to show how this could be vulnerable.

This approach depends a lot on our ability to replicate the internal workings of Graphene as closely as possible and to keep up with their internal implementation. If we lapse the approach might not be quite as iron-clad as we would like it to be.

In parsing the document upfront we are trying to guess what endpoints that the request would hit rather than waiting for the server to tell us what endpoints are being hit. If we guess wrong then we are in trouble, here are some conditions that could break our guesses:

  • We parse the document differently to the Graphene internals.
  • We don't interpret the results correctly.
  • The attacker finds a way to disguise a mutation or change its identifier in the request.
  • The Graphene internal logic changes in a future version (we don't have control over what version the user installs).
  • The GraphQL language evolves providing new representations of mutations.
  • A bug in graphene causes something to be called which wasn't supposed to be.

Then, without any protections further down the chain the entire authorisation layer could be sidestepped.

In a conventional HTTP(s) server (like the Cylc 7 suite server) the server routes each request based on the URL, any auth is performed at the endpoint. In a GraphQL server the routing is based on the GraphQL document, the endpoints are the resolvers.

I found an authorisation framework for GraphQL (2.8k⭐️) called graphql-shield, sadly it's Javascript (like all the GraphQL goodies) so we can't use it, however, worth noting they went down the middleware approach.

@hjoliver
Copy link
Member

I agree with @oliver-sanders on this.

@dwsutherland
Copy link
Member

dwsutherland commented Jun 9, 2021

I agree with @oliver-sanders too, we should not be parsing the GraphQL document/object ourselves.. For all the stated reasons.

However, we really need to be sure of what information we have and what we can do with GraphQL already, before ruling things out.. So just some clarifications on Option B:

Although we do not have immediate access to the authenticated user at this point.

Can't we just pass this in via info/context? You simply set fields on info.context.
It may even be simpler than that, because the whole request object is also available via info.context['request'] (or the like).

This worked fine, although we would still need to parse the graphql in the middleware to determine the operations.

No (sorry to be blunt). This information is all available (info.operation.operation, info.operation.name.value.. info.operation more generally), no parsing.
For example:

class IgnoreFieldMiddleware:
    """Set to null/None type undesired field values for stripping."""

    # Sometimes `next` is a Partial(coroutine) or Promise,
    # making inspection for know how to resolve it difficult.
    ASYNC_OPS = {'query', 'mutation'}

    def __init__(self):
        self.args_tree = {}
        self.tree_paths = set()
        self.field_sets = {}

    def resolve(self, next, root, info, **args):
        """Middleware resolver; handles field according to operation."""
        # GraphiQL introspection is 'query' but not async
        if getattr(info.operation.name, 'value', None) == 'IntrospectionQuery':
            return next(root, info, **args)

        if info.operation.operation in STRIP_OPS:
            path_string = f'{info.path}'
.
.
.

There's all the information we need here. i.e. info.field_name, info.path... etc

I think we need two levels:

  • Top level: Handlers .. etc.
  • GraphQL: Middleware

Also:

In a GraphQL server the routing is based on the GraphQL document, the endpoints are the resolvers.

Middleware is essentially a wrapper to all resolvers, including the default resolver (not defined in our code) which reads the data from root argument:
https://github.com/graphql-python/graphene/blob/master/graphene/types/resolver.py
So Middleware is like a decorator, without all the effort of decorating (and defining our own decorated default resolver)

Copy link
Member

@oliver-sanders oliver-sanders left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All looks good to me.

Just one little curveball that you may have already spotted, and which we were aware before we started on this. Commenting it here for the record and also as a reminder that we should add a note of some form to the documentation when we write it.

These site-configured limits can be hacked and overwritten by users.

There are at least two Cylc mechanisms by which the limits could be hacked from user-space:

  • Setting CYLC_SITE_CONF_PATH, writing your own site config and injecting it:
    • In the spawner args (user config).
    • In shell profile files (unless Cylc is deployed in a venv of some form with a wrapper script).
    • On the cli (i.e. starting the UIS manually).
  • Monkeypatching the UIS code from the config file.
    • Because the config file is in Python and you can change anything in Python.
    • We could switch to a JSON config file, but, Jupyter server supports Python config files and we can't turn that support off.

These sound like security bugs, but they are not. Preventing these mechanisms from being exploited would be pseudo security. There are plenty of non Cylc-specific mechanisms too:

  • Monkeypatch the UIS code from the PYTHONSTARTUP file.
    • This applies to ALL Python programs!
  • Wrap the Cylc UIS code to create your own uis command...
    • ...and get the spawner to run this in the user config.
    • ...and patch it by setting PYTHONPATH.
  • I'm sure many OS level approaches.

On the surface this might sound bad, but it isn't because:

  • The UIS runs in user space, users have control over this, there is nothing Cylc can do to stop them.
  • Users can always give away full access to their accounts by sharing credentials (short of biometric security even 2FA can't prevent this).
  • The only way to truly achieve this would involve running the entire UIS under a site-controlled privileged account.

So in the same way that "not giving away your user credentials" or "not using infrastructure for malicious purposes" are enforced as human policies "not hacking Cylc to fiddle the site-defined limits" is a human factor too.

docs/proposal-multi-user-approach.md Outdated Show resolved Hide resolved

We will need to consider the desired behaviour if a user appears twice with different defaults and limits set, this is probably most likely to occur when a user appears in either multiple groups or in a group and as a user.

Proposal: all permissions are additive, if user appears elsewhere in config, the permission level should be taken as the greatest possible. Note, default in site config does not contribute to permissions, unless user does not appear in ui server owner's config.
Copy link
Member

@hjoliver hjoliver Jun 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's sensible. If the higher level of permission appears in the config, then no one can rightfully complain about us granting it (even if lower levels are explicitly granted as well).


Some of the below are not currently available to users but including them here for consideration. This is also a fairly substantial list, which makes the case for read and admin pre-set access groups which would be easier for users and sites to configure, rather than defining a long list of mutations per user/group.

As a springboard for discussion, defaults could be assigned as follows:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks to me like you've nailed it, as-is.

@hjoliver
Copy link
Member

hjoliver commented Jun 9, 2021

Excellent diagrams 🎉

Copy link
Member

@hjoliver hjoliver left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice proposal 👍 Just a handful of minor comments and suggestions.

Update docs/proposal-multi-user-approach.md

Co-authored-by: Hilary James Oliver <hilary.j.oliver@gmail.com>

Update proposal with READ/CONTROL decision
@datamel datamel force-pushed the proposal-cylc-authorisation branch from 9d98014 to 9a675c0 Compare June 10, 2021 09:51
Copy link
Member

@oliver-sanders oliver-sanders left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Contributor

@dpmatthews dpmatthews left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Several minor changes needed now ALL is not the same as READ + CONTROL

docs/proposal-multi-user-approach.md Outdated Show resolved Hide resolved
docs/proposal-multi-user-approach.md Outdated Show resolved Hide resolved
docs/proposal-multi-user-approach.md Outdated Show resolved Hide resolved
docs/proposal-multi-user-approach.md Outdated Show resolved Hide resolved
docs/proposal-multi-user-approach.md Outdated Show resolved Hide resolved
@hjoliver
Copy link
Member

🎉

@hjoliver hjoliver merged commit 5a995cc into cylc:master Jun 17, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants