
clusterapi: refresh kubeconfig bearer tokens for management and workload kubeconfigs dynamically #5951

Conversation

@cnmcavoy (Contributor)

What type of PR is this?

/kind bug

What this PR does / why we need it:

Continuously update the bearer token provided by a kubeconfig file for the clusterapi provider.

Which issue(s) this PR fixes:

Fixes #4784

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot added labels on Jul 13, 2023: kind/bug (Categorizes issue or PR as related to a bug), cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA), size/M (Denotes a PR that changes 30-99 lines, ignoring generated files).
@cnmcavoy (Contributor, Author)

/retest

@k8s-ci-robot (Contributor)

@cnmcavoy: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/retest

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@cnmcavoy force-pushed the cmcavoy/clusterapi-refresh-kubeconfig-bearertokens branch from 911efe7 to 17e36cf on July 17, 2023
@cnmcavoy force-pushed the cmcavoy/clusterapi-refresh-kubeconfig-bearertokens branch from 17e36cf to 2e67057 on August 1, 2023
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: cnmcavoy
Once this PR has been reviewed and has the lgtm label, please assign bigdarkclown for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@elmiko (Contributor) left a comment

thanks for the patch @cnmcavoy, i am a little out of my depth with some of this token wrapping stuff. i wonder if we might bug @stlaz or @liggitt to give us a comment on this?

@elmiko (Contributor) commented Aug 30, 2023

i'm still keen to see this get included, @cnmcavoy have you done some local testing of the refresh?

@cnmcavoy (Contributor, Author) commented Sep 6, 2023

i'm still keen to see this get included, @cnmcavoy have you done some local testing of the refresh?

Indeed, we have been using a fork of the CA with this patch included for a month. Initially this PR only covered the kubeconfig in the clusterapi provider package, but that led us to discover there is a third kubeconfig in use for leader election. I updated the PR earlier to address that issue, and we haven't discovered anything new since.

@elmiko (Contributor) commented Sep 6, 2023

thanks for the update @cnmcavoy, i'm inclined to merge this based on your comments but i will bring up the PR at the CAPI meeting today just to see if we can get an extra review or two. i will plan to merge this on friday if there is no further activity.

@jackfrancis (Contributor)

/assign

@jackfrancis (Contributor)

@cnmcavoy code seems sane to me, is there an appropriate slot in the Cluster API provider documentation that we can use to describe this new feature? I'd be happy to build a cluster-autoscaler image from this branch and run through that test flow and report back (which would have the benefit of testing both the functionality and the documentation 😀).

@cnmcavoy force-pushed the cmcavoy/clusterapi-refresh-kubeconfig-bearertokens branch from 2e67057 to 148c69f on September 21, 2023
@cnmcavoy (Contributor, Author)

@jackfrancis sure! I added a note after the section describing the various configuration modes between management and workload clusters.

@cnmcavoy requested a review from elmiko on October 16, 2023
@enj (Member) commented Nov 1, 2023

Why is this not using something like HTTPWrappersForConfig which invokes NewBearerAuthWithRefreshRoundTripper to handle this flow? You should not be writing any custom code to handle automatic token refresh when client-go already does it for you.
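
For readers following along, here is a minimal sketch (not the PR's code; the helper name and token-file path are illustrative) of the client-go flow @enj is pointing at: when a transport.Config carries a BearerTokenFile, HTTPWrappersForConfig wraps the base transport in NewBearerAuthWithRefreshRoundTripper, which rereads the token from disk as the cached value ages out.

package sketch

import (
	"net/http"

	"k8s.io/client-go/transport"
)

// refreshingTransport is a hypothetical helper showing the built-in flow.
func refreshingTransport(tokenFile string, base http.RoundTripper) (http.RoundTripper, error) {
	cfg := &transport.Config{
		// With BearerTokenFile set, client-go wires up its own
		// refreshing bearer-auth round tripper; no custom code needed.
		BearerTokenFile: tokenFile,
	}
	return transport.HTTPWrappersForConfig(cfg, base)
}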

Comment on lines +143 to +147
kubeConfig, err := kube_client_cmd.BuildConfigFromFlags("", rt.kubeconfigPath)
if err != nil {
return nil, fmt.Errorf("cannot build kube cluster config: %w", err)
}
rt.bearerToken = kubeConfig.BearerToken
@enj (Member):

Loading the entire kubeconfig but only honoring the bearer token changing doesn't really make sense. The bearer token should be in a distinct file that is reloaded, the rest of the contents should remain static.

@cnmcavoy (Contributor, Author) commented Nov 1, 2023:

https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler/cloudprovider/clusterapi#autoscaler-running-anywhere-with-separate-kubeconfigs-for-management-and-workload-clusters

Cluster Autoscaler can be configured in a mode where it runs in cluster "1", the cluster and pods it needs to scale and manage are in cluster "3", and the cloud configuration and node groups to scale are in cluster "2". In this scenario there is no service account bearer token except for cluster "1", because these are separate clusters.

The file we are loading is created by the CAPA awscluster controller, carries only a bearer token for credentials, and is shaped like this:

apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: LS0t...
    server: https://XXXXX.us-east-2.eks.amazonaws.com
  name: cluster-name
contexts:
- context:
    cluster: cluster-3
    user: cluster-3-capi-admin
  name: cluster-3-capi-admin@cluster-3
current-context: cluster-3-capi-admin@cluster-3
kind: Config
preferences: {}
users:
- name: cluster-3-capi-admin
  user:
    token: k8s-aws-v1...

Are you suggesting that CAPA should encode a second secret value with just the token contents above, and have the Cluster Autoscaler reload that file, to simulate the behavior of a file-mounted service account bearer token? Or something else?

@enj (Member):

Right. If you change the YAML to:

apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: LS0t...
    server: https://XXXXX.us-east-2.eks.amazonaws.com
  name: cluster-name
contexts:
- context:
    cluster: cluster-3
    user: cluster-3-capi-admin
  name: cluster-3-capi-admin@cluster-3
current-context: cluster-3-capi-admin@cluster-3
kind: Config
preferences: {}
users:
- name: cluster-3-capi-admin
  user:
    tokenFile: location_of_file

And have the contents of location_of_file be:

k8s-aws-v1...

Then client-go will handle the contents of location_of_file changing for you.
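
To make the suggestion concrete, a hedged sketch of the consuming side (the kubeconfig path is illustrative): clientcmd surfaces a tokenFile user entry as rest.Config.BearerTokenFile, and clients built from that config reread the file on their own as the cached token ages out.

package main

import (
	"fmt"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// A `tokenFile:` user field lands in config.BearerTokenFile rather
	// than a static config.BearerToken.
	config, err := clientcmd.BuildConfigFromFlags("", "/etc/kubernetes/workload.kubeconfig")
	if err != nil {
		panic(err)
	}
	fmt.Println("token file:", config.BearerTokenFile)

	// Clients built from this config pick up rotated tokens without
	// any custom round tripper.
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}
	_ = client
}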

return rt.rt.RoundTrip(req)
}

var _ http.RoundTripper = &bearerAuthRoundTripper{}
@enj (Member):

You need to implement utilnet.RoundTripperWrapper
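
For context, a hedged sketch of what implementing that interface looks like; the struct name mirrors the PR's bearerAuthRoundTripper, but the fields shown here are assumptions:

package sketch

import (
	"net/http"

	utilnet "k8s.io/apimachinery/pkg/util/net"
)

type bearerAuthRoundTripper struct {
	bearerToken string
	rt          http.RoundTripper
}

func (r *bearerAuthRoundTripper) RoundTrip(req *http.Request) (*http.Response, error) {
	// Clone before mutating: a RoundTripper must not modify the caller's request.
	req = req.Clone(req.Context())
	req.Header.Set("Authorization", "Bearer "+r.bearerToken)
	return r.rt.RoundTrip(req)
}

// WrappedRoundTripper exposes the underlying transport, satisfying
// utilnet.RoundTripperWrapper so chain-walking utilities can unwrap it.
func (r *bearerAuthRoundTripper) WrappedRoundTripper() http.RoundTripper { return r.rt }

var _ utilnet.RoundTripperWrapper = &bearerAuthRoundTripper{}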

@k8s-ci-robot (Contributor)

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@RichardoC

Is this still needed given the Kubernetes client-go library has been updated to a version that should fix this?

@elmiko (Contributor) commented Jan 3, 2024

@RichardoC that link to the aws docs is only about eks clusters though, is there a link to an issue in client-go that might talk about this?

the capi provider doesn't necessarily use eks, so we would still want a refresh here (as i understand it).

@RichardoC

This functionality is upstream k8s behaviour, I just couldn't find the link earlier: https://kubernetes.io/docs/reference/access-authn-authz/service-accounts-admin/#auto-generated-legacy-serviceaccount-token-clean-up
My best guess is that client-go was fixed in kubernetes/client-go@66e83da#diff-a6baf0c4cd67228defd6ac03faa4c9922f5cf4dd2b27672509043b33e3d881f7R289 (specifically the NewBearerAuthWithRefreshRoundTripper codepath), so this shouldn't be needed?

@cnmcavoy (Contributor, Author) commented Jan 3, 2024

@RichardoC I think there is some confusion here: this change was about kubeconfigs with an embedded service account token expiring. That differs from the built-in client-go handling of service account tokens, where you are correct that client-go does automatically refresh. So this is still an active issue for CAPI users with EKS clusters.

However, I went to a SIG-APIMachinery meeting last year to discuss better approaches to handling this, and they suggested changes downstream, in how the CAPI project manages its kubeconfigs, to avoid this type of issue entirely. So I'm pursuing that over in kubernetes-sigs/cluster-api-provider-aws#4648

If all goes well, there will likely need to be some small autoscaler changes as a result, but only to its helm chart / docs. I'll open a new PR for that work.

@cnmcavoy closed this Jan 3, 2024
@elmiko (Contributor) commented Jan 3, 2024

@cnmcavoy thanks for the update and continuing to follow this issue. happy to hear about the status, and thanks for closing this.

Labels: area/cluster-autoscaler, cncf-cla: yes, kind/bug, needs-rebase, size/M
Development

Successfully merging this pull request may close these issues.

cluster-autoscaler caches workload cluster kubeconfig
6 participants