feat: respecting rbac for resource exclusions/inclusions proposal #13479
Conversation
Codecov report: patch coverage has no change; project coverage changes by +0.07%.

| | master | #13479 | +/- |
|---|---:|---:|---:|
| Coverage | 49.56% | 49.64% | +0.07% |
| Files | 256 | 258 | +2 |
| Lines | 43920 | 44192 | +272 |
| Hits | 21770 | 21940 | +170 |
| Misses | 19987 | 20091 | +104 |
| Partials | 2163 | 2161 | -2 |
> ## Proposal
>
> The configuration for this will be present in the `argocd-cm`. We will add a new boolean field `resource.respectRBAC` to the ConfigMap, which can be set to `true` to enable this feature; by default the feature is disabled.
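For illustration, here is a minimal sketch of what the proposed flag could look like in `argocd-cm` at this stage of the proposal (ConfigMap data values are strings, hence the quoted boolean); the final option shape is refined later in the thread:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
  labels:
    app.kubernetes.io/part-of: argocd
data:
  # Proposed opt-in flag; the feature stays disabled when the key is absent.
  resource.respectRBAC: "true"
```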
Wouldn't this be safe to enable by default? If the controller doesn't have access to a resource, respecting RBAC will only help it avoid 403s.
We can keep it enabled by default, but that might suddenly change the behavior of the controller when a user upgrades, which I thought might not be welcomed by some users.
Makes sense! I think we'd eventually want to target enabling this by default everywhere. But agreed, best to move cautiously.
I would also vote for a cautious move here: make it a feature toggle, test it extensively, and then move forward to enable it by default.
Yeah, I think for now at least this would be disabled by default and opt-in for the user.
> The feature will also modify the `gitops-engine` pkg to add a `SelfSubjectAccessReview` request before adding any resource to the watch list, which will make sure that Argo CD only monitors resources that it has access to.
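For illustration only, a hedged Go sketch of the kind of check described above, issued with client-go before a resource is added to the watch list; the package and helper name (`canListResource`) are hypothetical and not part of the actual `gitops-engine` code. The discussion below later refines this idea.

```go
package sketch

import (
	"context"

	authorizationv1 "k8s.io/api/authorization/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// canListResource asks the API server, via SelfSubjectAccessReview, whether
// the controller's own service account is allowed to list the given resource.
// Only resources for which this returns true would be added to the watch list.
func canListResource(ctx context.Context, client kubernetes.Interface, group, resource, namespace string) (bool, error) {
	review := &authorizationv1.SelfSubjectAccessReview{
		Spec: authorizationv1.SelfSubjectAccessReviewSpec{
			ResourceAttributes: &authorizationv1.ResourceAttributes{
				Group:     group,
				Resource:  resource,
				Verb:      "list",
				Namespace: namespace, // empty string = cluster scope / all namespaces
			},
		},
	}
	resp, err := client.AuthorizationV1().SelfSubjectAccessReviews().Create(ctx, review, metav1.CreateOptions{})
	if err != nil {
		return false, err
	}
	return resp.Status.Allowed, nil
}
```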
I would suggest not performing an additional `SelfSubjectAccessReview` request, because it potentially doubles the number of requests required for building up the cache. On large clusters, this is already problematic today.
Instead, I'd like to propose evaluating the API response for a given resource during the list and/or when establishing the watch.
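To make this alternative concrete, a hedged sketch of interpreting the list response itself; `shouldMonitor` is a hypothetical helper name, not existing `gitops-engine` API:

```go
package sketch

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

// shouldMonitor performs the list call that the cache needs anyway and treats
// a Forbidden (403) or Unauthorized (401) response as "no RBAC access",
// excluding the resource instead of failing the whole cache sync.
func shouldMonitor(ctx context.Context, client dynamic.Interface, gvr schema.GroupVersionResource) (bool, error) {
	_, err := client.Resource(gvr).List(ctx, metav1.ListOptions{Limit: 1})
	if apierrors.IsForbidden(err) || apierrors.IsUnauthorized(err) {
		return false, nil // looks like an RBAC denial: skip this resource
	}
	if err != nil {
		return false, err // any other error remains a real failure
	}
	return true, nil
}
```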
Agreed, updated the proposal to match.
@jannfis I presented the proposal in today's meeting and there was a concern with depending on the API response: false positives might be encountered due to excessive load on the kube API server and also due to env-specific proxies. So I am adding back the `SelfSubjectAccessReview` implementation option alongside the API response approach, with their advantages and disadvantages as discussed in the call today.
Also added a third approach recommended by @alexmt where we combine 1 and 2. Tagging others for opinion @crenshaw-dev @leoluz @jannfis
> 3. Combine approaches 1 and 2: the controller will check the API response for the list call, and if it receives forbidden/unauthorized it will make the `SelfSubjectAccessReview` call. This approach is accurate and, at the same time, only makes extra API calls if the list calls fail in the first place.
>
> In all solutions, once the controller determines that it does not have access to a resource, it will stop monitoring it.
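A hedged sketch of what the combined approach could look like, building on the hypothetical `canListResource` helper from the earlier sketch; it only pays for the extra `SelfSubjectAccessReview` when the list call already looks like an RBAC denial. This is illustrative only, not the actual implementation:

```go
package sketch

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/kubernetes"
)

// shouldMonitorCombined first tries the list call; only if it fails with
// 403/401 does it confirm via SelfSubjectAccessReview, to rule out false
// positives from overloaded API servers or environment-specific proxies.
func shouldMonitorCombined(ctx context.Context, dyn dynamic.Interface, kube kubernetes.Interface, gvr schema.GroupVersionResource) (bool, error) {
	_, err := dyn.Resource(gvr).List(ctx, metav1.ListOptions{Limit: 1})
	if err == nil {
		return true, nil
	}
	if !apierrors.IsForbidden(err) && !apierrors.IsUnauthorized(err) {
		return false, err // unrelated failure, surface it
	}
	// Looks like an RBAC denial; confirm before dropping the resource.
	return canListResource(ctx, kube, gvr.Group, gvr.Resource, "")
}
```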
What if the user adds/removes access later?
A controller restart will be necessary
The controller automatically does a full "resync" every day, so it will auto-discover the RBAC change once a day.
Ah yes ^^ though if you want it instantly, a restart is needed.
@alexmt The watches will be re-established every 10 minutes as well, right? I think part of that is also rediscovering changes in available APIs. I guess with that, auto-discovery of RBAC changes would also happen every 10 minutes instead of only after cache expiry?
I think the controller only retries in-progress watches every 10 minutes. So it will notice if RBAC no longer allows accessing a resource, but won't notice if RBAC now allows new resources.
But if you add new CRDs to the cluster, Argo CD will pick those up without the need for a restart. Is that due to an informer on the CRD API?
As for the restart, I think a cluster-cache resync (which can be triggered via Argo CD's API) should be sufficient? Maybe we can add new functionality where Argo CD also re-tries (probes) the APIs it previously received a 401 for, at a configurable interval that defaults to the same duration as the watch timeout (10 minutes)?
Anyway, I think this is an implementation detail and could even be added later on. Just keeping this here as a potential idea.
> 1. Modify the `gitops-engine` pkg to make a `SelfSubjectAccessReview` request before adding any resource to the watch list. In this approach we make an extra API server call to check whether the controller has access to the resource; this increases the number of kube API calls made but is more accurate.
> 2. Modify the `gitops-engine` pkg to check for forbidden/unauthorized errors when listing resources. This is the more efficient approach, as the number of kube API calls made does not change, but there is a chance of false positives, since similar errors can be returned by the kube API server or environment-specific proxies in other situations.
I wonder what kind of false positives those would be?
And if there's a false positive, chances are that `SelfSubjectAccessReview` will return positive, but the API server will reply with a 403 when trying to establish the watch, then again resulting in a hard error?
> For the implementation there are 3 proposals:
>
> 1. Modify the `gitops-engine` pkg to make a `SelfSubjectAccessReview` request before adding any resource to the watch list. In this approach we make an extra API server call to check whether the controller has access to the resource; this increases the number of kube API calls made but is more accurate.
Re: the number of kube API calls. In clusters with a lot of APIs (e.g. OpenShift comes with 200 APIs installed by default), that's already at least a burst of 200 API calls in a very short time frame. The default QPS for the client-side rate limiter in the kube API client is 50 with a burst of 100 (2 * QPS).
With the additional `SelfSubjectAccessReview` calls, these numbers would increase to a burst of 400 calls.
Things get even worse when you consider namespace-scoped mode, because the watches would be established for each of the APIs per managed namespace. Considering the 200+ APIs example, this would result in at least 400 calls per managed namespace.
These are numbers that can easily break the API server.
> The default QPS for the client-side rate limiter in the kube API client is 50 with a burst of 100 (2 * QPS).

The default in kube has actually been increased to either 300 or 500. But Argo's default is currently set lower. We should set ours to the new default.
Yeah, agreed.
50/100 is a very conservative number. I know that in the OpenShift client, this has been changed for a while (I think to 350 for the burst at least), and if upstream K8s has adopted a similar increase by default, we should do it as well. People can still fine-tune it using `ARGOCD_K8S_CLIENT_QPS` and `ARGOCD_K8S_CLIENT_BURST` when they want more conservative rate limits, but I guess the better user experience is to have them increased by default.
@jessesuen it seems that the default in the kube client is 5 and 10 respectively: https://pkg.go.dev/k8s.io/client-go@v0.27.3/rest#pkg-constants (https://github.com/kubernetes/client-go/blob/ebad5fbb0d96ae12547fd19468316a604bce5ccf/rest/config.go#L43-L46)
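For reference, the environment variables mentioned above can be set on the application controller; a partial manifest sketch follows, where the values shown are placeholders rather than recommended defaults:

```yaml
# Partial StatefulSet spec for argocd-application-controller (illustrative only).
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: argocd-application-controller
spec:
  template:
    spec:
      containers:
        - name: argocd-application-controller
          env:
            - name: ARGOCD_K8S_CLIENT_QPS    # client-side QPS limit
              value: "50"
            - name: ARGOCD_K8S_CLIENT_BURST  # client-side burst limit
              value: "100"
```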
> 3. Combine approaches 1 and 2: the controller will check the API response for the list call, and if it receives forbidden/unauthorized it will make the `SelfSubjectAccessReview` call. This approach is accurate and, at the same time, only makes extra API calls if the list calls fail in the first place.
This strongly depends on the use-case of the user and what kind of permissions they are willing to give to a specific Argo CD instance. For example, if there's an instance that is only allowed to manage 10 out of 200 APIs, you would still have 190 additional calls to the SelfSubjectAccessReview.
Can we maybe make the use of `SelfSubjectAccessReview` configurable? Then people can have the trade-off they want: on clusters with a low number of APIs, they can configure the mechanism to the highest precision, while on clusters with a high number of APIs, they trade precision against speed.
Sure, I think we can have another config option that enables the stricter check. Will update the proposal with this.
Updated.
> It was decided that we will go with approach 3 from the above list, but we shall provide 2 boolean configuration options to users:
> - `resource.respectRBAC.strict`: performs both checks, i.e. whether the list call response is forbidden/unauthorized and, if it is, makes the `SelfSubjectAccessReview` call to confirm.
> - `resource.respectRBAC.normal`: only checks whether the list call response is forbidden/unauthorized and skips the `SelfSubjectAccessReview` call.
>
> NOTE: `strict` has higher priority, so irrespective of the value of `normal`, if `strict` is set to true, strict mode will be used.
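As a sketch, the two proposed keys might look like this in `argocd-cm` (illustrative only; the option shape is revisited in the comments below):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  # strict wins when both are set: list error check + SelfSubjectAccessReview confirmation.
  resource.respectRBAC.strict: "true"
  # normal alone: only the forbidden/unauthorized check on the list response.
  resource.respectRBAC.normal: "false"
```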
It's more of a nit, but could we maybe - instead of two boolean options - have one option that takes three values:

- `normal` (or maybe `lax`, because this is more the opposite of `strict`)
- `strict`
- `false` (or `disabled`, or `off`), being also the default if the field is not set.

Or as an alternative: have two boolean values, `resource.respectRBAC.enabled` and `resource.respectRBAC.strict`, but in order to turn on `strict`, `enabled` has to be set to true as well.

I believe that would be less confusing and more ergonomic for users configuring Argo CD. But as I said, it's a nit, and I'm interested in other people's opinions.
Oh, that's a good idea. I think having a single field with 3 values is better; I would probably make it just 2 values and use an empty "" or missing field as disabled. Wdyt?
LGTM
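A sketch of the single-field alternative agreed on here: one key whose value selects the mode, with an absent or empty value meaning the feature stays disabled (exact naming subject to the final implementation):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  # "strict" = list error check plus SelfSubjectAccessReview confirmation,
  # "normal" = list error check only; omit the key (or leave it empty) to disable.
  resource.respectRBAC: "strict"
```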
> - TBD
>
> reviewers:
> - TBD
Suggested change: replace `- TBD` with `- @jannfis`.
> - TBD
>
> approvers:
> - TBD
Suggested change: replace `- TBD` with `- @jannfis`.
LGTM!
This PR adds a proposal for a new feature that allows the Argo CD controller to respect RBAC while monitoring resources, in addition to the existing resource exclusions/inclusions.