Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add caching for authorization checks #87

Merged
merged 8 commits into from
Jun 9, 2023
Merged

Conversation

krithika369
Copy link
Collaborator

@krithika369 krithika369 commented Jun 7, 2023

Motivation

This PR adds an in-memory cache for the authorization checks. The purpose of this cache is to reduce the load on the underlying authz API server, given that these mappings rarely change.

Implementation

api/pkg/authz

A simple map-based in-memory caching mechanism is introduced in api/pkg/authz/enforcer/enforcer.go, which stores the result (true/false) of each action:subject:resource combination that is checked for.

The cache itself is implemented in api/pkg/authz/enforcer/cache.go and includes some minor optimization on the cache key used (mapping action/subject/resource names to auto-generated IDs) so that the data stored in the cache is concise.
Eg: GET:firstname.lastname@domain.com:projects -> 1:1:2

Note: Without this optimization, in the example shown below for MLP, we will be needing ~250 MiB of memory.

If the cache is enabled on the Enforcer, it will be initialised and used as a read-aside cache. The key expiration and cache clean up interval are configurable.

This package is used in multiple CaraML components such as the MLP API (changes described below), Turing, Merlin, etc. If the in-memory cache is enabled, the onus is on the consumers of this package to:

  • Ensure that sufficient run-time memory is allocated for the app, depending on the expected usage.
  • Determine how the cache can be leveraged efficiently. Eg: If access is always granted to resource a and a:** together (and there are no DENY permissions involved), when the app must check for permissions on a:b:c, it can instead decide to check for the permission on a only, to reduce the number of unique cache keys created.

api/api

  • api/config/config.go - Add config options for using caching with the authz layer
  • api/middleware/authorization.go - Manipulate the resource names to only run checks against the first 2 levels. This is sufficient because most permissions are granted project-wide (with the exception of the /applications API, which doesn't have any sub-resources).
  • api/api/router.go - Enable authz caching
Memory implications
===================
Turning on the authz caching will have the following memory implications. With:
* 100 users => ~ 2 x 100 bytes ID data = 200
* 5 commonly used methods (GET, POST, PUT, PATCH, DELETE) => 1 x 5 bytes ID data = 5
* 100 projects (i.e., 100 `projects:{id}` resources) => 2 x 100 bytes ID data = 200
* 2 simple resources (`applications`, `projects`) => 2 x 2 bytes ID data = 4

We could be storing up to:
200 x 5 x (200 + 4) bytes = ~ 204 KiB of data in-memory.

Next steps

  • Add UI config and changes to warn users that permission changes may take up to X minutes to take effect, when authz caching is enabled.

Other changes

  • Upgrade to Go 1.20
  • Update API lint step in CI to use golanci-lint action directly

@krithika369 krithika369 marked this pull request as draft June 7, 2023 06:45
@krithika369 krithika369 marked this pull request as ready for review June 7, 2023 09:30
@krithika369 krithika369 requested review from romanwozniak, ariefrahmansyah and deadlycoconuts and removed request for romanwozniak June 7, 2023 09:30
@romanwozniak
Copy link
Contributor

Thank you @krithika369 for such a detailed PR description and well-documented code!

The purpose of this cache is to reduce the load on the underlying authz API server, given that these mappings rarely change

Did you consider some options for immediate cache expiration in the event of updates to the AuthZ policies? If I understood it correctly, with this change, it will take up to KeyExpirySeconds (worst-case scenario) for the change to be propagated. This might be confusing for the end-users.

@krithika369
Copy link
Collaborator Author

krithika369 commented Jun 9, 2023

Did you consider some options for immediate cache expiration in the event of updates to the AuthZ policies? If I understood it correctly, with this change, it will take up to KeyExpirySeconds (worst-case scenario) for the change to be propagated. This might be confusing for the end-users.

This is tricky. The reason being, the cache will be stored in-memory, within each component (Turing, Merlin, etc.) but the MLP API is the one that handles the Authz update. We could introduce an internal /invalidate-cache endpoint in all the components (and may be ensure that only the CaraML components are authorised to call this endpoint).

It might be better if this were communicated using an event-bus.

Both of these approaches may require some work. Since the only way that users can change the Authz permissions today is via the UI, I'm thinking that, for now, we can add a notice when the setting is edited (the Next Steps, will open another PR for this). Though, in doing this, there is another problem - the KeyExpirySeconds could theoretically be set to different values by different components. So I'm thinking of fixing the max possible value for KeyExpirySeconds to 10 minutes. This way, individual components can choose not to have the caching or set a lower value, but the message to the user ("up to 10 minutes") will be accurate.

LMK if you have other thoughts.

@romanwozniak
Copy link
Contributor

It might be better if this were communicated using an event-bus

Yes, that would be handy. Could also be a WebSocket connection to deliver system events between CaraML components or/and UI

So I'm thinking of fixing the max possible value for KeyExpirySeconds to 10 minutes. This way, individual components can choose not to have the caching or set a lower value, but the message to the user ("up to 10 minutes") will be accurate

I don't think we need too precise here. A simple disclaimer that "it could take several minutes for the changes to take effect" will do

Copy link
Contributor

@romanwozniak romanwozniak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

@krithika369 krithika369 merged commit 6ae81a8 into main Jun 9, 2023
@krithika369 krithika369 deleted the add_keto_cache branch June 9, 2023 08:35
@pradithya
Copy link
Member

Late to the party...
How about respecting the Cache-Control HTTP header. When the header value is no-cache (typically if users hard refresh from browser) we should invalidate the cache and use fresh result.

@krithika369
Copy link
Collaborator Author

How about respecting the Cache-Control HTTP header. When the header value is no-cache (typically if users hard refresh from browser) we should invalidate the cache and use fresh result.

The Authz cache is entirely server-side. It's not related to the contents loaded by the browser. Is this relevant?

@pradithya
Copy link
Member

I am aware of that part.
This header is sent by the browser to all external API call, including CaraML backend when hard refresh is performed by users. I am thinking we could leverage that to avoid having users to wait for cache expiry.

@krithika369
Copy link
Collaborator Author

I am aware of that part. This header is sent by the browser to all external API call, including CaraML backend when hard refresh is performed by users. I am thinking we could leverage that to avoid having users to wait for cache expiry.

I see... If we took this route, based on the architecture we have today, the user will have to perform a hard-refresh on the app where they want to see the permission changes. This can be somewhat natural to do for user accounts + UI workflows (eg: user navigates to Merlin, cannot see models and does a hard-refresh) but not so much when altering permissions for service accounts. Also, it can be confusing - if they did a hard refresh on the List Routers view and subsequently opened a router's settings, the permissions on the experiment engine (XP management) may not be updated still.

Besides, I'm a little reluctant to invalidate all the backed cache based on Cache-control settings driven by the client (this is bad if someone keeps doing a hard refresh for whatever reason; though I think it'll be nice to have a way to bypass cache for testing/debugging, for the devs). IMO it'll be better if we can automatically invalidate the cache for a project after its settings have been updated. But to do that for all components, we need some additional work (this discussion).

So, for now, I prefer we simply warn the users to wait a few minutes (UI changes here).

krithika369 added a commit to caraml-dev/merlin that referenced this pull request Jun 14, 2023
<!--  Thanks for sending a pull request!  Here are some tips for you:

1. Run unit tests and ensure that they are passing
2. If your change introduces any API changes, make sure to update the
e2e tests
3. Make sure documentation is updated for your PR!

-->

**What this PR does / why we need it**:
<!-- Explain here the context and why you're making the change. What is
the problem you're trying to solve. --->

This PR largely replicates the changes for the API in
caraml-dev/mlp#87, to have the option of caching
authz responses. As a part of this implementation, the API dependency on
MLP has been upgraded.

**Which issue(s) this PR fixes**:
<!--
*Automatically closes linked issue when PR is merged.
Usage: `Fixes #<issue number>`, or `Fixes (paste link of issue)`.
-->

None

**Does this PR introduce a user-facing change?**:
<!--
If no, just write "NONE" in the release-note block below.
If yes, a release note is required. Enter your extended release note in
the block below.
If the PR requires additional action from users switching to the new
release, include the string "action required".

For more information about release notes, see kubernetes' guide here:
http://git.k8s.io/community/contributors/guide/release-notes.md
-->

```release-note
Add caching options for authorization responses, to improve the performance of the control plane APIs
```

**Checklist**

- [ ] Added unit test, integration, and/or e2e tests
- [x] Tested locally
- [ ] Updated documentation
- [ ] Update Swagger spec if the PR introduce API changes
- [ ] Regenerated Golang and Python client if the PR introduce API
changes
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants