Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: respecting rbac for resource exclusions/inclusions proposal #13479

Merged
merged 10 commits into from
Jul 3, 2023

Conversation

gdsoumya
Copy link
Member

@gdsoumya gdsoumya commented May 6, 2023

This PR adds the proposal for adding a new feature that allows argocd controller to respect rbac while monitoring for resources besides existing resource exclusions/inclusions.

Checklist:

  • Either (a) I've created an enhancement proposal and discussed it with the community, (b) this is a bug fix, or (c) this does not need to be in the release notes.
  • The title of the PR states what changed and the related issues number (used for the release note).
  • The title of the PR conforms to the Toolchain Guide
  • I've included "Closes [ISSUE #]" or "Fixes [ISSUE #]" in the description to automatically close the associated issue.
  • I've updated both the CLI and UI to expose my feature, or I plan to submit a second PR with them.
  • Does this PR require documentation updates?
  • I've updated documentation as required by this PR.
  • Optional. My organization is added to USERS.md.
  • I have signed off all my commits as required by DCO
  • I have written unit and/or e2e tests for my change. PRs without these are unlikely to be merged.
  • My build is green (troubleshooting builds).
  • My new feature complies with the feature status guidelines.
  • I have added a brief description of why this PR is necessary and/or what this PR solves.

Please see Contribution FAQs if you have questions about your pull-request.

Signed-off-by: Soumya Ghosh Dastidar <gdsoumya@gmail.com>
@codecov
Copy link

codecov bot commented May 6, 2023

Codecov Report

Patch coverage has no change and project coverage change: +0.07 🎉

Comparison is base (bbc51fb) 49.56% compared to head (1e7fbc8) 49.64%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #13479      +/-   ##
==========================================
+ Coverage   49.56%   49.64%   +0.07%     
==========================================
  Files         256      258       +2     
  Lines       43920    44192     +272     
==========================================
+ Hits        21770    21940     +170     
- Misses      19987    20091     +104     
+ Partials     2163     2161       -2     

see 27 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

Signed-off-by: Soumya Ghosh Dastidar <gdsoumya@gmail.com>
## Proposal

The configuration for this will be present in the `argocd-cm`, we will add new boolean field `resource.respectRBAC` in the
cm which can be set to `true` to enable this feature, by default the feature is disabled.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't this be safe to enable by default? If the controller doesn't have access to a resource, respecting RBAC will only help it avoid 403s.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can keep it enabled by default, but that might suddenly change the behavior of the controller when a user upgrades which I thought might not be welcomed by some users.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense! I think we'd want to everywhere target enabling by default. But agreed, best to move cautiously.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would also be voting for kind of a cautious move here. Make it a feature toggle, test it extensively, and then move forward to enable it by default.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea I think by default for now at least this would be disabled and opt in for the user.

cm which can be set to `true` to enable this feature, by default the feature is disabled.

The feature will also modify `gitops-engine` pkg to add a `SelfSubjectAccessReview` request before adding any resource to the watch list,
which will make sure that argocd only monitors resources that it has access to.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest not to perform an additional request to SelfSubjectAccessReview, because it potentially doubles the number of requests required for building up the cache. On large clusters, this is problematic already as of today.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead, I'd like to propose evaluating the API response for a given resource during list and/or establishing the watch.

Copy link
Member Author

@gdsoumya gdsoumya May 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, updateed proposal to match the same.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jannfis I presented the proposal in today's meeting and there was a concern with depending on the api response, it was about false positives that might be encountered due to excessive load on the kube api server and also due to env-specific proxies. So I am adding back the SelfSubjectAccessReview implementation option besides the api response approach with their advantages and disadvantages as discussed in the call today.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also added a third approach recommended by @alexmt where we combine 1 and 2. Tagging others for opinion @crenshaw-dev @leoluz @jannfis

gdsoumya and others added 5 commits May 10, 2023 20:00
Signed-off-by: Soumya Ghosh Dastidar <gdsoumya@gmail.com>
…ce access

Signed-off-by: Soumya Ghosh Dastidar <gdsoumya@gmail.com>
Signed-off-by: Soumya Ghosh Dastidar <gdsoumya@gmail.com>
3. Combine approaches 1 and 2, in this controller will check the api response for the list call, and if it receives forbidden/unauthorized it will make the `SelfSubjectAccessReview` call.
This approach is accurate and at the same time, only makes extra api calls if the list calls fail in the first place.

In all solutions, once controller determines that it does not have access to the resource it will stop monitoring it.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if the user adds/removes access later ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A controller restart will be necessary

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Controller automatically does full "resync" every day. So it will auto-discover RBAC change once a day

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes ^^ though if you want it instantly a restart is needed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alexmt The watches will be re-established every 10 minutes as well, right? I think part of that is also rediscovery changes in available APIs. I guess with that, auto-discovery of RBAC changes would also happen all 10 minutes instead of only after cache expiry?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think controller is retrying only in-progress watches every 10 minutes. So it will notice if RBAC no longer allows accessing resource but won't notice if RBAC is allowing new resources

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But if you add new CRDs to the cluster, Argo CD will pick those up without the need for a restart. Is that due to an informer on the CRD API?

As for the restart, I think a cluster-cache resync (which can be triggered via Argo CD's API) should be sufficient? Maybe we can add a new functionality that Argo CD will also re-try (probe) the APIs it received a 401 for previously? In a configurable interval, which defaults to the same duration as the watch timeout (10 minutes)?

Anyway, I think this is implementation detail and could even be added later on. Just keeping this here as a potential idea.

1. Modify `gitops-engine` pkg to make a `SelfSubjectAccessReview` request before adding any resource to the watch list, in this approach we are making an extra
api server call to check if controller has access to the resource, this does increase the no. of kubeapi calls made but is more accurate.
2. Modify `gitops-engine` pkg to check for forbidden/unauthorized errors when listing for resources, this is more efficient approach as the
no. of kubeapi calls made does not change, but there is a chance of false positives as similar errors can be returned from kubeapi server or env specific proxies in other situations
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder what kind of false positives that would be?

And, if there's a false-positive, chances are that SelfSubjectAccessReview will return positive, but the API server will reply with a 403 when trying to establish the watch, then again resulting in a hard error?

For the implementation there are 3 proposals :

1. Modify `gitops-engine` pkg to make a `SelfSubjectAccessReview` request before adding any resource to the watch list, in this approach we are making an extra
api server call to check if controller has access to the resource, this does increase the no. of kubeapi calls made but is more accurate.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re: The number of kubeapi calls. In clusters with a lot of APIs (e.g. OpenShift comes with 200 APIs installed by default), that's already at least a burst of 200 API calls in a very short time frame. The default QPS for the client-side rate limiter in the kube API client is 50 with a burst of 100 (2 * QPS).

With additional SelfSubjectAccessReview, this numbers would increase to a burst of 400 calls.

Things get even worse when you consider namespace-scoped mode. Because the watches would be established for each of the APIs per managed namespaces. Considering the 200+ APIs example, this would result in at least 400 calls per managed namespace.

This seems to be numbers that can easily break the API server.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default QPS for the client-side rate limiter in the kube API client is 50 with a burst of 100 (2 * QPS).

The default in kube has actually been increased to either 300 or 500. But Argo's default is currently set lower. We should set ours to the new default

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, agreed.

50/100 is a very conservative number. I know that in the OpenShift client, this has been changed for a while (I think to 350 for the burst at least), and if upstream K8s has adopted a similar increase by default, we should do it as well. People can still fine-tune it using ARGOCD_K8S_CLIENT_QPS and ARGOCD_K8S_CLIENT_BURST when they want more conservative rate limits, but I guess the better user experience is to have them increased by default.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines +47 to +48
3. Combine approaches 1 and 2, in this controller will check the api response for the list call, and if it receives forbidden/unauthorized it will make the `SelfSubjectAccessReview` call.
This approach is accurate and at the same time, only makes extra api calls if the list calls fail in the first place.
Copy link
Member

@jannfis jannfis Jun 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This strongly depends on the use-case of the user and what kind of permissions they are willing to give to a specific Argo CD instance. For example, if there's an instance that is only allowed to manage 10 out of 200 APIs, you would still have 190 additional calls to the SelfSubjectAccessReview.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we maybe make the use of SubjectSelfAccessReview configurable? Then people can have the trade-off they want - on clusters with a low amount of APIs, they can configure the mechanism to the highest precision while on clusters with high amount of APIs, they trade precision against speed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure I think we can have another config option that enables the stricter check. Will update the proposal with this

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

Signed-off-by: Soumya Ghosh Dastidar <gdsoumya@gmail.com>
Comment on lines 54 to 58
It was decided that we will go with approach 3 from the above list, but we shall provide 2 boolean configurations options to users :
- `resource.respectRBAC.strict` : This will perform both the checks i.e. whether the list call response is forbidden/unauthorized and if it is make the `SelfSubjectAccessReview` call to confirm.
- `resource.respectRBAC.normal` : This will only check whether the list call response is forbidden/unauthorized and skip `SelfSubjectAccessReview` call.

NOTE: `strict` has higher priority so irrespective of the status of `normal` if strict is set to true strict mode will be used.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's more of a nit, but could we maybe - instead of two boolean options - have one option that takes three values:

  • normal (or maybe lax, because this is more the opposite of strict),
  • strict
  • false (or disabled, or off), being also the default if the field is not set.

Or as an alternative: Have two boolean values, resource.respectRBAC.enabled and resource.respectRBAC.strict, but in order to turn on strict, enabled has to be set to true as well.

I believe that would be less confusing and more ergonomic to users configuring Argo CD. But as I said, it's a nit, and I'm interested in other people's opinion.

Copy link
Member Author

@gdsoumya gdsoumya Jun 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh that's a good idea I think having a single field with 3 values is better, I would probably make it just 2 and use empty "" or missing field as disabled. Wdyt?

Copy link
Member

@jannfis jannfis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

- TBD

reviewers:
- TBD
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- TBD
- @jannfis

- TBD

approvers:
- TBD
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- TBD
- @jannfis

Signed-off-by: Soumya Ghosh Dastidar <gdsoumya@gmail.com>
Copy link
Collaborator

@alexmt alexmt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@alexmt alexmt merged commit d7632df into argoproj:master Jul 3, 2023
yyzxw pushed a commit to yyzxw/argo-cd that referenced this pull request Aug 9, 2023
…goproj#13479)

* feat: respecting rbac for resource exclusions/inclusions proposal (argoproj#13479)

Signed-off-by: Soumya Ghosh Dastidar <gdsoumya@gmail.com>
@suzaku suzaku mentioned this pull request Aug 15, 2023
tesla59 pushed a commit to tesla59/argo-cd that referenced this pull request Dec 16, 2023
…goproj#13479)

* feat: respecting rbac for resource exclusions/inclusions proposal (argoproj#13479)

Signed-off-by: Soumya Ghosh Dastidar <gdsoumya@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants