feat: Support for multiple rbac ConfigMaps (Alpha) #17828
Conversation
Codecov Report

Attention: Patch coverage is

Additional details and impacted files

@@ Coverage Diff @@
##           master   #17828   +/- ##
=========================================
  Coverage        ?   55.18%
=========================================
  Files           ?      324
  Lines           ?    55277
  Branches        ?        0
=========================================
  Hits            ?    30504
  Misses          ?    22159
  Partials        ?     2614

☔ View full report in Codecov by Sentry.
I think it'd be easier to manage and work with if we split the config maps by roles or something like that.
An explicit allowlist of extra cm-s in one of the main cm-s sounds safer than just a label, which people can add to an arbitrary cm.
We may hit some race conditions when updating multiple files. Let's introduce some kind of version field and require a consistent value of it from all involved cm-s before actually combining the data.
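To make that suggestion concrete, here is a minimal sketch of such a consistency check; the annotation name rbac.example/version and the function are hypothetical illustrations, not anything Argo CD defines today:

```go
package rbac

import (
	"strings"

	corev1 "k8s.io/api/core/v1"
)

// combinePolicies merges policy.csv data from several ConfigMaps, but only
// when every ConfigMap carries the same value of a (hypothetical) version
// annotation; otherwise it reports that the set is mid-update and should not
// be combined yet.
func combinePolicies(cms []*corev1.ConfigMap) (string, bool) {
	const versionAnnotation = "rbac.example/version" // hypothetical annotation name
	var version string
	parts := make([]string, 0, len(cms))
	for i, cm := range cms {
		v := cm.Annotations[versionAnnotation]
		if i == 0 {
			version = v
		} else if v != version {
			return "", false // versions disagree: keep the previous combined policy
		}
		parts = append(parts, cm.Data["policy.csv"])
	}
	return strings.Join(parts, "\n"), true
}
```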
if cm.Name == e.configmap {
    // This should never happen?
    // Assume this is user error and ignore it
    return nil
Why do we ignore it instead of surfacing it?
Not sure; I wrote this over half a year ago, so my memory is hazy at best. I think the logic was that this CM should never be deleted, so if it is, we shouldn't act on it. Or are you just suggesting to return an error so that it gets logged?
Yeah, I'm thinking of at least logging the error.
I don't like this idea, for the same reason that I don't like limiting to a single cm: there is a limit. If people have edit access to the argocd namespace, they already have nearly admin-like permissions, so I don't see much benefit, security-wise. Recall that the motivation here is very large multi-tenant setups, which should be using per-tenant application namespaces to begin with, and restricting access to the argocd namespace to argocd admins only.

That said, I could see adding a flag to the server to enable/disable this feature entirely, as a compromise for environments that don't wish to use it for the reason you listed. I could also see adding a field to the main cm with a name pattern to match new policy cm's against, ignoring those that don't match, similar to how the application namespaces flag works.
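If that name-pattern compromise were adopted, the filtering could look roughly like the following sketch; the function and the way the pattern is supplied (a server flag or a field in the main cm) are assumptions, not part of this PR:

```go
package rbac

import "path/filepath"

// matchesPolicyCMPattern is a sketch of the suggested compromise: only treat a
// ConfigMap as an extra policy source if its name matches an admin-configured
// glob pattern. Where the pattern comes from is left open here.
func matchesPolicyCMPattern(pattern, cmName string) bool {
	if pattern == "" {
		return false // feature effectively disabled when no pattern is configured
	}
	ok, err := filepath.Match(pattern, cmName)
	return err == nil && ok
}
```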
I'm not sure I understand the need for, or the implementation of, what you're suggesting. Do you mean that if you have 1000 cm's, an update to one requires updating the other 999 to bump the version field on them? That doesn't sit well with me, especially as I don't see what possible data races you could be referring to. If two CM's are updated in sequence, then the observed RBAC will reflect the first update, and then the second update a few ms later. Unless you're referring to a data race in the controller code itself, which I believe executes serially, but I can add a mutex around the

Consider if you want to generate these cm's from multiple helm charts, or from separate scripts: this would require a substantial amount of additional complexity and coordination. Not only that, it would introduce an actual data race between these competing scripts/charts trying to update to a consistent version number.
Okay, I didn't consider the restriction to the argocd namespace. Your point is valid; it's probably okay.

Regarding the race condition, I think we need to do something, even though I understand it's non-trivial. Consider a case where some permissions are moved to a different cm, for example because of its size. Then, depending on the sync delays, there can be a loss of permissions or duplicate permissions. If one of the cm refreshes fails, that state can persist for longer, though probably not for as long if we refresh periodically.
The Kubernetes client-go informer can do exactly that (see here), and it seems they are already set to do that every 10 minutes.
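For reference, a minimal client-go sketch of that resync behavior, assuming a standard SharedInformerFactory; this is not necessarily how Argo CD wires its informers:

```go
package main

import (
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

func watchPolicyConfigMaps(clientset kubernetes.Interface, stopCh <-chan struct{}) {
	// A non-zero resync period makes the informer periodically re-deliver the
	// cached objects to the handlers, re-converging any missed updates.
	factory := informers.NewSharedInformerFactoryWithOptions(
		clientset,
		10*time.Minute,
		informers.WithNamespace("argocd"),
	)
	cmInformer := factory.Core().V1().ConfigMaps().Informer()
	cmInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc:    func(obj interface{}) { _ = obj.(*corev1.ConfigMap) /* rebuild policy */ },
		UpdateFunc: func(_, obj interface{}) { _ = obj.(*corev1.ConfigMap) /* rebuild policy */ },
		DeleteFunc: func(obj interface{}) { /* rebuild policy */ },
	})
	factory.Start(stopCh)
	factory.WaitForCacheSync(stopCh)
}
```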
How does the system currently handle a policy line from the single cm that references a missing role? Or, by analogy, what happens if an Application and its AppProject "de-sync" such that the Application can no longer refresh or sync resources? Are there special protections in place for that situation that we could mirror here, or is the strategy there to rely on eventual consistency, as this does?
I don't follow; we are not interacting with the kubelet in any way. The updated cm's are streamed directly from the kube apiserver over a SPDY connection, not loaded from a filesystem mounted by the kubelet, and those updates are sent as part of the kube apiserver HTTP handlers. The only time I could see a meaningful delay would be if argocd lost its connection to the kube apiserver, in which case there really isn't anything that could be done. Also, if such a persistent de-sync does somehow occur, it can be resolved trivially by doing a rolling update of all argo-server pods, which would force a full refresh of all cm's.
I don't know.
Ah, okay, thanks for correcting me. Here's another idea:
When adding a new role, a desync would only cause the permissions to not be accessible earlier.
This idea can be made backwards-compatible and can potentially bypass the alpha and beta phases.
This wouldn't resolve the issue this PR is designed to fix, which is that a single configmap can only store a finite number of rows, a limit that is reached in very large multi-tenant setups.

If these de-syncs are truly a serious concern (which, as of yet, I am unconvinced of), they could be substantially mitigated by using a "debouncing" system: a policy cm update records the lines but does not update the policy, and instead sets a timer, say for 5s. During this time, any other policy cm update does the same and restarts the timer. Only when the timer expires is the policy actually updated. Thus, the policy would only update after a period of observed stability, meaning that any de-sync shorter than the jitter period would be impossible.
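A minimal sketch of that debouncing idea, assuming the quiet window is chosen by the caller; the type and names are illustrative and not part of this PR:

```go
package rbac

import (
	"sync"
	"time"
)

// policyDebouncer collects ConfigMap change notifications and only rebuilds
// the combined policy after a quiet period with no further updates.
type policyDebouncer struct {
	mu      sync.Mutex
	timer   *time.Timer
	quiet   time.Duration
	rebuild func() // called once the updates have settled
}

func newPolicyDebouncer(quiet time.Duration, rebuild func()) *policyDebouncer {
	return &policyDebouncer{quiet: quiet, rebuild: rebuild}
}

// notify records that some policy ConfigMap changed and (re)starts the timer,
// so rebuild only fires after `quiet` of observed stability.
func (d *policyDebouncer) notify() {
	d.mu.Lock()
	defer d.mu.Unlock()
	if d.timer != nil {
		d.timer.Stop()
	}
	d.timer = time.AfterFunc(d.quiet, d.rebuild)
}
```

Each informer event handler would then call notify() instead of rebuilding the policy immediately.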
How many grant rules, roles, and user groups do you have, roughly? I thought the grant rules are the main limiting factor, and my approach would split those. But I'm curious about more details on the bottlenecks.
Because roles are not parameterizable, I need a set of roles per tenant, plus an assignment of each to the matching IDAM groups (multiple per tenant), plus a set of pre-populated service accounts. I would estimate that I have O(100) tenants, O(1,000) rules and groups, and O(100,000) lines, but that's a complete guess without logging into corporate and actually counting.

Perhaps that means the "correct" solution is to have templates that generate policy lines given a set of configurable inputs, but that seems like a much larger lift than just allowing multiple cm's, and my aim is to solve the immediate problem that myself and others in the linked issue are having right now, in the simplest, least intrusive way possible.
Fixes #8324
This work is based on these abandoned PRs:
In large multi-tenant instances, the size of the policy.csv may exceed the limit for a single Kubernetes resource.
This patch adds the label argocd.argoproj.io/cm-type=policy-csv which, when placed on a configmap in the same namespace, causes it to be included in the casbin policy according to the same rules as the existing argocd-rbac-cm configmap, with outputs sorted alphabetically by configmap name before concatenation to ensure predictable output.

The admin CLI has been updated to also check for these additional configmaps and generate the final policy accordingly.
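For illustration, a minimal sketch of the lookup this description implies; the function name is hypothetical and the PR's actual implementation may differ:

```go
package rbac

import (
	"context"
	"sort"
	"strings"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// loadExtraPolicies lists ConfigMaps labeled as extra policy sources and
// concatenates their policy.csv data, sorted by ConfigMap name so the combined
// policy is deterministic.
func loadExtraPolicies(ctx context.Context, clientset kubernetes.Interface, namespace string) (string, error) {
	cms, err := clientset.CoreV1().ConfigMaps(namespace).List(ctx, metav1.ListOptions{
		LabelSelector: "argocd.argoproj.io/cm-type=policy-csv",
	})
	if err != nil {
		return "", err
	}
	sort.Slice(cms.Items, func(i, j int) bool {
		return cms.Items[i].Name < cms.Items[j].Name
	})
	parts := make([]string, 0, len(cms.Items))
	for _, cm := range cms.Items {
		parts = append(parts, cm.Data["policy.csv"]) // only the policy.csv key is shown here
	}
	return strings.Join(parts, "\n"), nil
}
```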
E2E tests have been added to ensure that these extra configmaps' policies are included when the configmaps are created, and correctly updated or removed when the configmaps are changed or deleted.
NOTE: I could have sworn I made this PR as a draft a few days ago, but it was gone when I went to update it. Please excuse me if I've accidentally made a duplicate.