
feat: Support for multiple rbac ConfigMaps (Alpha) #17828

Open · meln5674 wants to merge 1 commit into master from feature/multiple-cm-rbac

Conversation

meln5674

Checklist:

  • Either (a) I've created an enhancement proposal and discussed it with the community, (b) this is a bug fix, or (c) this does not need to be in the release notes.
  • The title of the PR states what changed and the related issue number (used for the release note).
  • The title of the PR conforms to the Toolchain Guide.
  • I've included "Closes [ISSUE #]" or "Fixes [ISSUE #]" in the description to automatically close the associated issue.
  • I've updated both the CLI and UI to expose my feature, or I plan to submit a second PR with them.
  • Does this PR require documentation updates?
  • I've updated documentation as required by this PR.
  • I have signed off all my commits as required by DCO
  • I have written unit and/or e2e tests for my change. PRs without these are unlikely to be merged.
  • My build is green (troubleshooting builds).
  • My new feature complies with the feature status guidelines.
  • I have added a brief description of why this PR is necessary and/or what this PR solves.
  • Optional. My organization is added to USERS.md.
  • Optional. For bug fixes, I've indicated what older releases this fix should be cherry-picked into (this may or may not happen depending on risk/complexity).

Fixes #8324

This work is based on these abandoned PRs:

In large multi-tenant instances, the size of the policy.csv may exceed the limit for a single Kubernetes resource.

This patch adds the label argocd.argoproj.io/cm-type=policy-csv. When this label is placed on a configmap in the same namespace, that configmap's contents are included in the casbin policy according to the same rules as the existing argocd-rbac-cm configmap. The contents are sorted alphabetically by configmap name before concatenation to ensure predictable output.
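For illustration, an extra policy configmap might look something like the following. The name and policy contents are made up, and I'm assuming the data keys follow the same conventions as argocd-rbac-cm:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-rbac-tenant-a               # illustrative name; any name can be used
  namespace: argocd                         # same namespace as argocd-rbac-cm
  labels:
    argocd.argoproj.io/cm-type: policy-csv  # opts this ConfigMap into the combined policy
data:
  policy.csv: |
    p, role:tenant-a-admin, applications, *, tenant-a/*, allow
    g, tenant-a-admins, role:tenant-a-admin
```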

The admin CLI has been updated to also check for these additional configmaps and generate the final policy accordingly.

E2E tests have been added to ensure that these extra configmaps' policies are included when the configmaps are created, and that stale policies are correctly removed when the configmaps are changed or deleted.

NOTE: I could have sworn I made this PR as a draft a few days ago, but it was gone when I went to update it. Please excuse me if I've accidentally made a duplicate.

@meln5674 meln5674 requested review from a team as code owners April 13, 2024 19:31
@meln5674 meln5674 force-pushed the feature/multiple-cm-rbac branch 3 times, most recently from 4b35e53 to bab69f1 on April 15, 2024 16:43
@meln5674 meln5674 force-pushed the feature/multiple-cm-rbac branch 2 times, most recently from 8884828 to 1c610ef on November 2, 2024 01:43

codecov bot commented Nov 2, 2024

Codecov Report

Attention: Patch coverage is 66.66667% with 38 lines in your changes missing coverage. Please review.

Please upload report for BASE (master@d761c94). Learn more about missing BASE report.
Report is 1 commit behind head on master.

Files with missing lines | Patch % | Lines
util/rbac/rbac.go | 68.04% | 27 Missing and 4 partials ⚠️
cmd/argocd/commands/admin/settings_rbac.go | 58.82% | 5 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff            @@
##             master   #17828   +/-   ##
=========================================
  Coverage          ?   55.18%           
=========================================
  Files             ?      324           
  Lines             ?    55277           
  Branches          ?        0           
=========================================
  Hits              ?    30504           
  Misses            ?    22159           
  Partials          ?     2614           


Signed-off-by: Andrew Melnick <meln5674@kettering.edu>
@meln5674 meln5674 force-pushed the feature/multiple-cm-rbac branch from 1c610ef to 448a6ac on November 2, 2024 17:28
@andrii-korotkov-verkada (Contributor) left a comment

I think it'd be easier to manage and work with if we split config maps by roles or something like this.

An explicit allowlist of extra cm-s in one of the main cm-s sounds safer than just a label, which people can add to an arbitrary cm.

We may hit some race conditions when updating multiple files. Let's introduce some kind of version field and require a consistent value of it from all involved cm-s to actually combine the data.

if cm.Name == e.configmap {
	// This should never happen?
	// Assume this is user error and ignore it
	return nil
Contributor

Why do we ignore it instead of surfacing?

Author

Not sure, I wrote this over half a year ago, so my memory is hazy at best. I think the logic was that this CM should never be deleted, so if it is, we shouldn't act on it. Or are you just suggesting to return an error so that it gets logged?

Contributor

Yeah, I'm thinking of at least logging the error.

@meln5674 (Author) commented Nov 9, 2024

An explicit allowlist of extra cm-s in one of the main cm-s sounds safer than just a label, which people can add to an arbitrary cm.

I don't like this idea, for the same reason that I don't like limiting to a single cm: there is a limit. If people have edit access to the argocd namespace, they already have nearly admin-like permissions, so I don't see much benefit, security-wise.

Recall that the motivation here is very large multi-tenant setups, which should be using per-tenant application namespaces to begin with, and restricting access to the argocd namespace to only argocd admins. That said, I could see adding a flag to the server to enable/disable this feature entirely, as a compromise for environments that don't wish to use it for the reason you listed. I could also see adding a field to the main cm with a name pattern that new policy cm's must match, ignoring those that don't, similar to how the application namespaces flag works.

We may hit some race conditions when updating multiple files. Let's introduce some kind of version field and require a consistent value of it from all involved cm-s to actually combine the data.

I'm not sure I understand either the need or the implementation you're suggesting. Do you mean to say that if you have 1000 cm's, then an update to one requires updating the other 999 to bump the version field on them? That doesn't sit well with me, especially as I don't see what possible data races you could be referring to.

If two CM's are updated in sequence, then the observed RBAC will reflect the first update, and then the second update a few ms later. Unless you're referring to a data race in the controller code itself, which I believe executes serially; I can add a mutex around the policyCSVCache field if that would mitigate your concern.

Consider the case where you want to generate these cm's from multiple helm charts, or from separate scripts: a shared version field would require a substantial amount of additional complexity and coordination. Not only that, it would introduce an actual data race between these competing scripts/charts trying to update to a consistent version number.

@andrii-korotkov-verkada (Contributor)

Okay, I didn't consider the restriction to the argocd namespace. Your point is valid; it's probably okay.

Regarding race condition, I think we need to do something, even though I understand it's non-trivial.

Consider a case when some permissions are moved to a different cm due to its size for example. Then depending on the sync delays there can be a permission loss or duplicate permissions. If one of the refreshes of cm-s fails, the state can persist for longer, though probably not as much if we refresh periodically.
Another case is when some roles are defined in a different file than policy. If there's a desync, the overall state becomes invalid due to referencing unknown policy.
Also, it's not milliseconds of delay - config maps can take a minute or so to be updated by the kubelet.

@meln5674 (Author) commented Nov 9, 2024

Consider a case when some permissions are moved to a different cm due to its size for example. Then depending on the sync delays there can be a permission loss or duplicate permissions. If one of the refreshes of cm-s fails, the state can persist for longer, though probably not as much if we refresh periodically.

The Kubernetes client-go informer can do exactly that (see here), and it seems they are already set to do so every 10 minutes.
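For reference, here is a minimal sketch of how a client-go shared informer gets a periodic resync. This is illustrative only, not the exact code in util/rbac/rbac.go, and a real implementation would also filter by namespace and by the policy-csv label:

```go
package rbacexample

import (
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

func watchPolicyConfigMaps(client kubernetes.Interface, stopCh <-chan struct{}, rebuildPolicy func()) {
	// The second argument is the resync period: even if no watch event arrives,
	// every cached object is re-delivered to the handlers on this interval,
	// which re-converges the combined policy after a missed or failed update.
	factory := informers.NewSharedInformerFactory(client, 10*time.Minute)
	cmInformer := factory.Core().V1().ConfigMaps().Informer()
	cmInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc:    func(obj interface{}) { rebuildPolicy() },
		UpdateFunc: func(oldObj, newObj interface{}) { rebuildPolicy() },
		DeleteFunc: func(obj interface{}) { rebuildPolicy() },
	})
	factory.Start(stopCh)
}
```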

Another case is when some roles are defined in a different file than policy. If there's a desync, the overall state becomes invalid due to referencing unknown policy.

How does the system currently handle a policy line from the single cm that references a missing role? Or, by analogy, what happens if an Application and its AppProject "de-sync" such that the Application can no longer refresh or sync resources? Are there special protections in place for that situation that we can mirror here, or is the strategy there to rely on eventual consistency, as this does?

Also, it's not milliseconds of delay - config maps can take a minute or so to be updated by the kubelet.

I don't follow; we are not interacting with the kubelet in any way. The updated cm's are streamed directly from the kube apiserver over a SPDY connection, not loaded from a filesystem mounted by the kubelet, and those updates are sent as part of the kube apiserver http handlers. The only case where I could see a meaningful delay would be if argocd lost its connection to the kube apiserver, in which case there really isn't anything that could be done.

Also, if such a persistent de-sync does occur somehow, it can be resolved trivially by doing a rolling update of all argo-server pods, which would force a full refresh of all cm's.

@andrii-korotkov-verkada (Contributor)

How does the system currently handle a policy line from the single cm that references a missing role? Or, by analogy, what happens if an Application and its AppProject "de-sync" such that the Application can no longer refresh or sync resources? Are there special protections in place for that situation that we can mirror here, or is the strategy there to rely on eventual consistency, as this does?

I don't know.

The updated cm's are streamed directly from the kube apiserver over a SPDY connection, not loaded from a filesystem mounted by the kubelet, and those updates are sent as part of the kube apiserver http handlers.

Ah, okay, thanks for correcting me.

Here's another idea:

  • Keep all but rules granting permissions in the main config map.
  • Shard the permission-granting rules by the role name.
  • Keep the mapping of role name to the config map it lives in, in the main config map.

When adding a new role, a desync would only delay the new permissions from becoming accessible.
When modifying an existing role's permissions, there won't be a desync, since only one config map would be modified.
When deleting a role, a desync wouldn't affect things; we just need to gracefully handle absent config maps.
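Purely as a hypothetical illustration of that layout (the extra data key below is made up and not something Argo CD supports today): the main cm would keep the "g" lines plus an index, and each sharded cm would keep only that role's "p" lines:

```yaml
# main argocd-rbac-cm (fragment)
data:
  policy.csv: |
    g, tenant-a-admins, role:tenant-a-admin
  policy.roleConfigMaps: |                  # hypothetical key mapping role -> ConfigMap
    role:tenant-a-admin: argocd-rbac-tenant-a
---
# argocd-rbac-tenant-a (fragment): only the permission-granting lines for that role
data:
  policy.csv: |
    p, role:tenant-a-admin, applications, *, tenant-a/*, allow
```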

@andrii-korotkov-verkada (Contributor)

This idea can be made backwards-compatible and can potentially bypass the alpha and beta phases.

@meln5674 (Author) commented Nov 9, 2024

  • Keep all but rules granting permissions in the main config map.

This wouldn't resolve the issue this PR is designed to fix, which is that a single configmap has a finite number of rows it can store, which is reached in very large multi-tenant setups.

If these de-syncs are truly a serious concern (which, as of yet, I am unconvinced of), it could be substantially mitigated by using a "debouncing" system, where a policy cm update records the lines, but does not update the policy, and instead, sets a timer, say, for 5s. During this time, any other policy cm update does the same, and restarts the timer. Only when the timer expires does it actually update the policy. Thus, the policy would only update after a period of observed stability, meaning that any de-sync shorter than the jitter period would be impossible.
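A rough Go sketch of that debouncing idea (purely hypothetical, not part of this PR; names are made up):

```go
package rbacexample

import (
	"sync"
	"time"
)

// policyDebouncer collapses a burst of policy ConfigMap updates into a single
// rebuild after a quiet period, so multi-ConfigMap changes are applied together.
type policyDebouncer struct {
	mu      sync.Mutex
	timer   *time.Timer
	quiet   time.Duration // e.g. 5 * time.Second
	rebuild func()        // e.g. concatenate all cached policy CSVs and reload casbin
}

// notify is called on every policy ConfigMap add/update/delete event.
func (d *policyDebouncer) notify() {
	d.mu.Lock()
	defer d.mu.Unlock()
	if d.timer != nil {
		d.timer.Stop() // restart the quiet window on every new update
	}
	d.timer = time.AfterFunc(d.quiet, d.rebuild)
}
```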

@andrii-korotkov-verkada (Contributor)

This wouldn't resolve the issue this PR is designed to fix, which is that a single configmap has a finite number of rows it can store, which is reached in very large multi-tenant setups.

How many grant rules, roles, and user groups do you have, roughly? I thought the grant rules are the main limiting factor and my approach would split that. But I'm curious about more details on the bottlenecks.

@meln5674 (Author) commented Nov 9, 2024

How many grant rules, roles, and user groups do you have, roughly? I thought the grant rules are the main limiting factor and my approach would split that. But I'm curious about more details on the bottlenecks.

Because roles are not parameterizable, I need a set of roles per tenant, plus an assignment of each to the matching IDAM groups (multiple per tenant), plus a set of pre-populated service accounts. I would estimate that I have O(100) tenants, O(1,000) rules and groups, and O(100,000) lines, but that's a complete guess without logging into corporate and actually counting.
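To make the scale concrete, a single tenant's block of policy lines might look roughly like this (hypothetical role and group names):

```csv
p, role:tenant-a-admin, applications, *, tenant-a/*, allow
p, role:tenant-a-admin, applications, sync, tenant-a/*, allow
p, role:tenant-a-readonly, applications, get, tenant-a/*, allow
g, idam-tenant-a-admins, role:tenant-a-admin
g, idam-tenant-a-readers, role:tenant-a-readonly
```

Multiply a block like this by the number of roles, groups, and service accounts per tenant, and then by O(100) tenants, and the line count quickly reaches the O(100,000) range.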

Perhaps that means that the "correct" solution is to have templates that generate policy lines given a set of configurable inputs, but that seems like a much larger lift than just allowing multiple cm's, and my aim is to solve the immediate problem that I, and others in the linked issue, are having right now, in the simplest, least intrusive way possible.

Labels: None yet
Projects: None yet
Development: Successfully merging this pull request may close these issues: support for multiple rbac configmaps
2 participants