
feat: Support for multiple rbac ConfigMaps (Alpha) #17828

Open · meln5674 wants to merge 1 commit into master from feature/multiple-cm-rbac

Conversation

meln5674

Checklist:

  • Either (a) I've created an enhancement proposal and discussed it with the community, (b) this is a bug fix, or (c) this does not need to be in the release notes.
  • The title of the PR states what changed and the related issue number (used for the release note).
  • The title of the PR conforms to the Toolchain Guide.
  • I've included "Closes [ISSUE #]" or "Fixes [ISSUE #]" in the description to automatically close the associated issue.
  • I've updated both the CLI and UI to expose my feature, or I plan to submit a second PR with them.
  • Does this PR require documentation updates?
  • I've updated documentation as required by this PR.
  • I have signed off all my commits as required by DCO
  • I have written unit and/or e2e tests for my change. PRs without these are unlikely to be merged.
  • My build is green (troubleshooting builds).
  • My new feature complies with the feature status guidelines.
  • I have added a brief description of why this PR is necessary and/or what this PR solves.
  • Optional. My organization is added to USERS.md.
  • Optional. For bug fixes, I've indicated what older releases this fix should be cherry-picked into (this may or may not happen depending on risk/complexity).

Fixes #8324

This work is based on these abandoned PRs:

In large multi-tenant instances, the size of the policy.csv may exceed the limit for a single Kubernetes resource.

This patch adds the label argocd.argoproj.io/cm-type=policy-csv. When this label is placed on a configmap in the same namespace, that configmap's contents are included in the casbin policy according to the same rules as the existing argocd-rbac-cm configmap. The contents are sorted alphabetically by configmap name before concatenation to ensure predictable output.
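For illustration, an extra policy configmap might look something like the following. The name and policy contents are made up, and I'm assuming the data keys follow the same conventions as argocd-rbac-cm:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-rbac-tenant-a               # illustrative name; any name can be used
  namespace: argocd                         # same namespace as argocd-rbac-cm
  labels:
    argocd.argoproj.io/cm-type: policy-csv  # opts this ConfigMap into the combined policy
data:
  policy.csv: |
    p, role:tenant-a-admin, applications, *, tenant-a/*, allow
    g, tenant-a-admins, role:tenant-a-admin
```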

The admin CLI has been updated to also check for these additional configmaps and generate the final policy accordingly.

E2E tests have been added to ensure that these extra configmaps' policies are included when the configmaps are created, and that stale policies are correctly removed when the configmaps are changed or deleted.

NOTE: I could have sworn I made this PR as a draft a few days ago, but it was gone when I went to update it. Please excuse me if I've accidentally made a duplicate.

@meln5674 meln5674 requested review from a team as code owners April 13, 2024 19:31
@meln5674 meln5674 force-pushed the feature/multiple-cm-rbac branch 3 times, most recently from 4b35e53 to bab69f1 on April 15, 2024 16:43
@meln5674 meln5674 force-pushed the feature/multiple-cm-rbac branch 2 times, most recently from 8884828 to 1c610ef on November 2, 2024 01:43

codecov bot commented Nov 2, 2024

Codecov Report

Attention: Patch coverage is 66.66667% with 38 lines in your changes missing coverage. Please review.

Please upload report for BASE (master@d761c94). Learn more about missing BASE report.
Report is 1 commit behind head on master.

Files with missing lines | Patch % | Lines
util/rbac/rbac.go | 68.04% | 27 Missing and 4 partials ⚠️
cmd/argocd/commands/admin/settings_rbac.go | 58.82% | 5 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff            @@
##             master   #17828   +/-   ##
=========================================
  Coverage          ?   55.18%           
=========================================
  Files             ?      324           
  Lines             ?    55277           
  Branches          ?        0           
=========================================
  Hits              ?    30504           
  Misses            ?    22159           
  Partials          ?     2614           


Signed-off-by: Andrew Melnick <meln5674@kettering.edu>
@meln5674 meln5674 force-pushed the feature/multiple-cm-rbac branch from 1c610ef to 448a6ac on November 2, 2024 17:28
@andrii-korotkov-verkada (Contributor) left a comment

I think it'd be easier to manage and work with if we split config maps by roles or something like this.

An explicit allowlist of extra cm-s in one of the main cm-s sounds safer than just a label, which people can add to an arbitrary cm.

We may hit some race conditions when updating multiple files. Let's introduce some kind of version field and require a consistent value of it from all involved cm-s to actually combine the data.

if cm.Name == e.configmap {
	// This should never happen?
	// Assume this is user error and ignore it
	return nil
Contributor

Why do we ignore it instead of surfacing?

Author

Not sure, I wrote this over half a year ago, so my memory is hazy at best. I think the logic was that this CM should never be deleted, so if it is, we shouldn't act on it. Or are you just suggesting to return an error so that it gets logged?

Contributor

Yeah, I'm thinking of at least logging the error.

@meln5674 (Author) commented Nov 9, 2024

An explicit allowlist of extra cm-s in one of the main cm-s sounds safer than just a label, which people can add to an arbitrary cm.

I don't like this idea, for the same reason that I don't like limiting to a single cm: there is a limit. If people have edit access to the argocd namespace, they already have nearly admin-like permissions, so I don't see much benefit, security-wise.

Recall that the motivation here is very large multi-tenant setups, which should be using per-tenant application namespaces to begin with, and restricting access to the argocd namespace to only argocd admins. That said, I could see adding a flag to the server to enable/disable this feature entirely, as a compromise for environments that don't wish to use it for the reason you listed. I could also see adding a field to the main cm with a name pattern that new policy cm's must match, ignoring those that don't, similar to how the application namespaces flag works.

We may hit some race conditions when updating multiple files. Let's introduce some kind of version field and require a consistent value of it from all involved cm-s to actually combine the data.

I'm not sure I understand either the need or the implementation you're suggesting. Do you mean to say that if you have 1000 cm's, then an update to one requires updating the other 999 to bump the version field on them? That doesn't sit well with me, especially as I don't see what possible data races you could be referring to.

If two CM's are updated in sequence, then the observed RBAC will reflect the first update, and then the second update a few ms later. Unless you're referring to a data race in the controller code itself, which I believe executes serially; I can add a mutex around the policyCSVCache field if that would mitigate your concern.

Consider the case where you want to generate these cm's from multiple helm charts, or from separate scripts: a shared version field would require a substantial amount of additional complexity and coordination. Not only that, it would introduce an actual data race between these competing scripts/charts trying to update to a consistent version number.

@andrii-korotkov-verkada (Contributor)

Okay, I didn't consider the restriction to the argocd namespace. Your point is valid; it's probably okay.

Regarding race condition, I think we need to do something, even though I understand it's non-trivial.

Consider a case when some permissions are moved to a different cm due to its size for example. Then depending on the sync delays there can be a permission loss or duplicate permissions. If one of the refreshes of cm-s fails, the state can persist for longer, though probably not as much if we refresh periodically.
Another case is when some roles are defined in a different file than policy. If there's a desync, the overall state becomes invalid due to referencing unknown policy.
Also, it's not milliseconds of delay - config maps can take a minute or so to be updated by the kubelet.

@meln5674 (Author) commented Nov 9, 2024

Consider a case when some permissions are moved to a different cm due to its size for example. Then depending on the sync delays there can be a permission loss or duplicate permissions. If one of the refreshes of cm-s fails, the state can persist for longer, though probably not as much if we refresh periodically.

The Kubernetes client-go informer can do exactly that (see here), and it seems they are already set to do so every 10 minutes.
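For reference, here is a minimal sketch of how a client-go shared informer gets a periodic resync. This is illustrative only, not the exact code in util/rbac/rbac.go, and a real implementation would also filter by namespace and by the policy-csv label:

```go
package rbacexample

import (
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

func watchPolicyConfigMaps(client kubernetes.Interface, stopCh <-chan struct{}, rebuildPolicy func()) {
	// The second argument is the resync period: even if no watch event arrives,
	// every cached object is re-delivered to the handlers on this interval,
	// which re-converges the combined policy after a missed or failed update.
	factory := informers.NewSharedInformerFactory(client, 10*time.Minute)
	cmInformer := factory.Core().V1().ConfigMaps().Informer()
	cmInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc:    func(obj interface{}) { rebuildPolicy() },
		UpdateFunc: func(oldObj, newObj interface{}) { rebuildPolicy() },
		DeleteFunc: func(obj interface{}) { rebuildPolicy() },
	})
	factory.Start(stopCh)
}
```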

Another case is when some roles are defined in a different file than policy. If there's a desync, the overall state becomes invalid due to referencing unknown policy.

How does the system currently handle a policy line from the single cm that references a missing role? Or, by analogy, what happens if an Application and its AppProject "de-sync" such that the Application can no longer refresh or sync resources? Are there special protections in place for that situation that we can mirror here, or is the strategy there to rely on eventual consistency, as this does?

Also, it's not milliseconds of delay - config maps can take a minute or so to be updated by the kubelet.

I don't follow; we are not interacting with the kubelet in any way. The updated cm's are streamed directly from the kube apiserver over a SPDY connection, not loaded from a filesystem mounted by the kubelet, and those updates are sent as part of the kube apiserver http handlers. The only case where I could see a meaningful delay would be if argocd lost its connection to the kube apiserver, in which case there really isn't anything that could be done.

Also, if such a persistent de-sync does occur somehow, it can be resolved trivially by doing a rolling update of all argo-server pods, which would force a full refresh of all cm's.

@andrii-korotkov-verkada (Contributor)

How does the system currently handle a policy line from the single cm that references a missing role? Or, by analogy, what happens if an Application and its AppProject "de-sync" such that the Application can no longer refresh or sync resources? Are there special protections in place for that situation that we can mirror here, or is the strategy there to rely on eventual consistency, as this does?

I don't know.

The updated cm's are streamed directly from the kube apiserver over a SPDY connection, not loaded from a filesystem mounted by the kubelet, and those updates are sent as part of the kube apiserver http handlers.

Ah, okay, thanks for correcting me.

Here's another idea:

  • Keep all but rules granting permissions in the main config map.
  • Shard the permission-granting rules by the role name.
  • Keep the mapping of role name to the config map it lives in, in the main config map.

When adding a new role, a desync would only delay the new permissions from becoming accessible.
When modifying an existing role's permissions, there won't be a desync, since only one config map would be modified.
When deleting a role, a desync wouldn't affect things; we just need to gracefully handle absent config maps.
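Purely as a hypothetical illustration of that layout (the extra data key below is made up and not something Argo CD supports today): the main cm would keep the "g" lines plus an index, and each sharded cm would keep only that role's "p" lines:

```yaml
# main argocd-rbac-cm (fragment)
data:
  policy.csv: |
    g, tenant-a-admins, role:tenant-a-admin
  policy.roleConfigMaps: |                  # hypothetical key mapping role -> ConfigMap
    role:tenant-a-admin: argocd-rbac-tenant-a
---
# argocd-rbac-tenant-a (fragment): only the permission-granting lines for that role
data:
  policy.csv: |
    p, role:tenant-a-admin, applications, *, tenant-a/*, allow
```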

@andrii-korotkov-verkada (Contributor)

This idea can be made backwards-compatible and can potentially bypass the alpha and beta phases.

@meln5674 (Author) commented Nov 9, 2024

  • Keep all but rules granting permissions in the main config map.

This wouldn't resolve the issue this PR is designed to fix, which is that a single configmap has a finite number of rows it can store, which is reached in very large multi-tenant setups.

If these de-syncs are truly a serious concern (which, as of yet, I am unconvinced of), it could be substantially mitigated by using a "debouncing" system, where a policy cm update records the lines, but does not update the policy, and instead, sets a timer, say, for 5s. During this time, any other policy cm update does the same, and restarts the timer. Only when the timer expires does it actually update the policy. Thus, the policy would only update after a period of observed stability, meaning that any de-sync shorter than the jitter period would be impossible.
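A rough Go sketch of that debouncing idea (purely hypothetical, not part of this PR; names are made up):

```go
package rbacexample

import (
	"sync"
	"time"
)

// policyDebouncer collapses a burst of policy ConfigMap updates into a single
// rebuild after a quiet period, so multi-ConfigMap changes are applied together.
type policyDebouncer struct {
	mu      sync.Mutex
	timer   *time.Timer
	quiet   time.Duration // e.g. 5 * time.Second
	rebuild func()        // e.g. concatenate all cached policy CSVs and reload casbin
}

// notify is called on every policy ConfigMap add/update/delete event.
func (d *policyDebouncer) notify() {
	d.mu.Lock()
	defer d.mu.Unlock()
	if d.timer != nil {
		d.timer.Stop() // restart the quiet window on every new update
	}
	d.timer = time.AfterFunc(d.quiet, d.rebuild)
}
```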

@andrii-korotkov-verkada (Contributor)

This wouldn't resolve the issue this PR is designed to fix, which is that a single configmap has a finite number of rows it can store, which is reached in very large multi-tenant setups.

How many grant rules, roles, and user groups do you have, roughly? I thought the grant rules are the main limiting factor and my approach would split that. But I'm curious about more details on the bottlenecks.

@meln5674 (Author) commented Nov 9, 2024

How many grant rules, roles, and user groups do you have, roughly? I thought the grant rules are the main limiting factor and my approach would split that. But I'm curious about more details on the bottlenecks.

Because roles are not parameterizable, I need a set of roles per tenant, plus an assignment of each to the matching IDAM groups (multiple per tenant), plus a set of pre-populated service accounts. I would estimate that I have O(100) tenants, O(1,000) rules and groups, and O(100,000) lines, but that's a complete guess without logging into corporate and actually counting.
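To make the scale concrete, a single tenant's block of policy lines might look roughly like this (hypothetical role and group names):

```csv
p, role:tenant-a-admin, applications, *, tenant-a/*, allow
p, role:tenant-a-admin, applications, sync, tenant-a/*, allow
p, role:tenant-a-readonly, applications, get, tenant-a/*, allow
g, idam-tenant-a-admins, role:tenant-a-admin
g, idam-tenant-a-readers, role:tenant-a-readonly
```

Multiply a block like this by the number of roles, groups, and service accounts per tenant, and then by O(100) tenants, and the line count quickly reaches the O(100,000) range.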

Perhaps that means that the "correct" solution is to have templates that generate policy lines given a set of configurable inputs, but that seems like a much larger lift than just allowing multiple cm's, and my aim is to solve the immediate problem that I, and others in the linked issue, are having right now, in the simplest, least intrusive way possible.

Labels: None yet
Projects: None yet
Development: Successfully merging this pull request may close these issues: support for multiple rbac configmaps
2 participants