Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(backend): Less insane/insecure rbac rules #1768

Closed
wants to merge 13 commits into from

Conversation

juliusvonkohout
Copy link
Member

What this PR does / why we need it:

The current RBAC rules are completely insane and allow katib-controller and katib-ui to destroy the whole kubeflow cluster. My changes greatly reduce the attack surface.

@coveralls
Copy link

coveralls commented Jan 9, 2022

Coverage Status

Coverage remained the same at 73.545% when pulling 91f7e7c on juliusvonkohout:patch-1 into e02eb6e on kubeflow:master.

@juliusvonkohout
Copy link
Member Author

juliusvonkohout commented Jan 10, 2022

Sadly I cannot see which example exactly failed in your build system. I will try to add some debugging and tesst all your examples from https://github.com/kubeflow/katib/tree/master/examples/v1beta1 manually

@tenzen-y
Copy link
Member

Hi, @juliusvonkohout.
You can check each test in this Argo Workflow Dashboard.

Copy link
Member

@andreyvelich andreyvelich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for taking this @juliusvonkohout!
I think this relates to: #1639.
We should check which permissions our controller requires, some of them might be not very clear from the first look.
I would suggest to discuss about it on our next AutoML community meeting.

@juliusvonkohout
Copy link
Member Author

i have found one very dangerous design decision that also crashes the test. For earlystopping you create a new serviceaccount role and rolebinding to manage trials.

@google-oss-prow google-oss-prow bot added size/M and removed size/S labels Jan 11, 2022
@juliusvonkohout
Copy link
Member Author

juliusvonkohout commented Jan 12, 2022

i have found a workaround, but it seems that your build system has problems. i rebased to master but there are image builds failing that are totally unrelated to my changes. You are pulling from the rate-limited docker.io...

INFO[0000] Resolved base name golang:alpine to build-env
INFO[0000] Using dockerignore file: /mnt/test-data-volume/kubeflow-katib-presubmit-e2e-v1beta1-1768-c
cd66d6-1264-7a74/src/github.com/kubeflow/katib/.dockerignore
INFO[0000] Retrieving image manifest golang:alpine
INFO[0000] Retrieving image golang:alpine
ERRO[0000] Error while retrieving image from cache: golang:alpine GET https://index.docker.io/v2/libr
ary/golang/manifests/alpine: TOOMANYREQUESTS: You have reached your pull rate limit. You may increase
 the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit
INFO[0000] Retrieving image manifest golang:alpine
INFO[0000] Retrieving image golang:alpine
error building image: GET https://index.docker.io/v2/library/golang/manifests/alpine: TOOMANYREQUESTS
: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading:
https://www.docker.com/increase-rate-limit

On my private cluster even the following works

# We only need the verb create and not delete because all things are owned transitively by the
# experiment, so everything is cascade-deleted on experiment deletion.
# Early stopping currently creates a serviceaccount, role and rolebinding to manage trials and trails/status.
# The roles and rolebinding creation is the most important potential security issue now.
# Via new roles, rolebindings and pods in kubernetes system namespaces the cluster 
# could become compromised.
rules:
  - verbs:
      - get
      - list
      - watch
      - create
      - patch
    apiGroups:
      - ''
    resources:
      - events
  - verbs:
      - get
      - list
      - watch
    apiGroups:
      - ''
    resources:
      - configmaps
      - persistentvolumes
      - namespaces
  - verbs:
      - get
      - list
      - watch
      - create
    apiGroups:
      - ''
      - apps
      - rbac.authorization.k8s.io
    resources:
      - services
      - persistentvolumeclaims
      - pods
      - pods/log
      - pods/status
      - deployments
      - serviceaccounts
      - roles
      - rolebindings
  - verbs:
      - '*'
    apiGroups:
      - batch
    resources:
      - jobs
      - cronjobs
  - verbs:
      - '*'
    apiGroups:
      - kubeflow.org
    resources:
      - experiments
      - experiments/status
      - experiments/finalizers
      - trials
      - trials/status
      - trials/finalizers
      - suggestions
      - suggestions/status
      - suggestions/finalizers
      - tfjobs
      - pytorchjobs
      - mpijobs
      - xgboostjobs

@juliusvonkohout
Copy link
Member Author

juliusvonkohout commented Jan 12, 2022

Now that the pipeline has finished successfully for ccd66d6 i will try the even more restrictive rules above.

@andreyvelich
Copy link
Member

@juliusvonkohout Thank you for debugging this.
As we discussed on our AutoML community meeting, before changing the ClusterRole, we should verify all Katib components to get list of necessary permissions.
Some of them might not be straightforward.

You can create a Google Doc or Comment in this issue: #1639 each permission that we need and why do we need them. If some of them are not clear, I can help you with that.
Does it sounds good ?

@juliusvonkohout
Copy link
Member Author

@andreyvelich @hougangliu @sperlingxx your build system is partially broken due to docker.io rate limit. Is there something to fix it?

@andreyvelich
Copy link
Member

@andreyvelich @hougangliu @sperlingxx your build system is partially broken due to docker.io rate limit. Is there something to fix it?

Since our AWS test infra is using free Docker hub account, we have these limits.
I am not sure how we can solve it.

@juliusvonkohout
Copy link
Member Author

@andreyvelich @hougangliu @sperlingxx your build system is partially broken due to docker.io rate limit. Is there something to fix it?

Since our AWS test infra is using free Docker hub account, we have these limits. I am not sure how we can solve it.

i can /retest every few hours. I strongly recommend to use proper OCI registries like quay.io, gcr.io etc. to build your OCI containers (docker is dead and deprecated). But i do not know how to rerun failed github actions.

@juliusvonkohout
Copy link
Member Author

/retest

1 similar comment
@juliusvonkohout
Copy link
Member Author

/retest

@aws-kf-ci-bot
Copy link
Contributor

@juliusvonkohout: The following test failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
kubeflow-katib-presubmit c74fa0e link /test kubeflow-katib-presubmit

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@stale
Copy link

stale bot commented Apr 19, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale
Copy link

stale bot commented Sep 21, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@juliusvonkohout
Copy link
Member Author

@andreyvelich i thought #1639 has priority/p1 . What is your plan to fix this?

@google-oss-prow
Copy link

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: juliusvonkohout
Once this PR has been reviewed and has the lgtm label, please assign hougangliu for approval by writing /assign @hougangliu in a comment. For more information see:The Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@stale stale bot removed the lifecycle/stale label Sep 21, 2022
@juliusvonkohout
Copy link
Member Author

Superseeded by #2091

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants