
aws cloud controller manager is unable to manage the nodes in cluster #916

Open
karty-s opened this issue May 16, 2024 · 6 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one.

Comments

@karty-s

karty-s commented May 16, 2024

What happened: We are running a Kubernetes 1.26 cluster built with kubeadm on AWS resources. We want to upgrade our clusters to 1.28 (1.26 -> 1.27 -> 1.28), and as per the upgrade notes we tried to move from the in-tree AWS cloud provider to the external AWS cloud provider.
As per the upgrade process, we deployed new 1.27 nodes along with the aws cloud controller manager in the cluster, after which we scaled down the 1.26 nodes.

What you expected to happen: The issue we face is that the 1.26 etcd and worker nodes that were scaled down get removed from the cluster as expected, but the 1.26 control plane nodes still show up in the cluster even after their EC2 instances have been removed, e.g.:

NAME                            STATUS                     ROLES                  AGE     VERSION
ip-.ec2.internal   Ready,SchedulingDisabled   control-plane,master   96m     v1.26.7
ip-.ec2.internal   Ready                      etcd                   11m     v1.27.13
ip-.ec2.internal   Ready                      etcd                   9m10s   v1.27.13
ip-.ec2.internal   Ready                      control-plane,master   5m59s   v1.27.13
ip-.ec2.internal   Ready,SchedulingDisabled   control-plane,master   95m     v1.26.7
ip-.ec2.internal    Ready                      node                   6m12s   v1.27.13
ip-.ec2.internal    Ready                      etcd                   14m     v1.27.13
ip-.ec2.internal    Ready                      control-plane,master   6m1s    v1.27.13
ip-.ec2.internal    Ready                      node                   6m9s    v1.27.13
ip-.ec2.internal    Ready                      node                   6m14s   v1.27.13
ip-.ec2.internal    Ready                      node                   6m15s   v1.27.13
ip-.ec2.internal    Ready,SchedulingDisabled   control-plane,master   96m     v1.26.7
ip-.ec2.internal    Ready                      node                   6m15s   v1.27.13
ip-.ec2.internal    Ready                      node                   6m15s   v1.27.13
ip-.ec2.internal    Ready                      control-plane,master   5m43s   v1.27.13

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:
We are seeing this error in the cloud controller manager pod logs:

I0516 08:13:24.811572       1 node_lifecycle_controller.go:164] deleting node since it is no longer present in cloud provider: ip-10-230-13-35.ec2.internal
I0516 08:13:24.812083       1 event.go:307] "Event occurred" object="ip-10-230-13-35.ec2.internal" fieldPath="" kind="Node" apiVersion="" type="Normal" reason="DeletingNode" message="Deleting node ip-10-230-13-35.ec2.internal because it does not exist in the cloud provider"

We have set the hostname according to the prerequisites, but we still see this.
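For reference, a quick way to double-check that prerequisite (just a sketch, assuming IMDS is reachable from the node; these commands are not part of our setup) is to compare the kubelet's node name with the private DNS name from instance metadata:

# EC2 private DNS name according to instance metadata (IMDSv1 call shown for brevity)
curl -s http://169.254.169.254/latest/meta-data/local-hostname
# should match the hostname the kubelet registers with
hostname
# and the node object should eventually carry a ProviderID set by the CCM
kubectl get node "$(hostname)" -o jsonpath='{.spec.providerID}{"\n"}'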

Environment: kubeadm

  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.7", GitCommit:"84e1fc493a47446df2e155e70fca768d2653a398", GitTreeState:"clean", BuildDate:"2023-07-19T12:23:27Z", GoVersion:"go1.20.6", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.7
  • Cloud provider or hardware configuration: aws

  • OS (e.g. from /etc/os-release):

NAME="Flatcar Container Linux by Kinvolk"
ID=flatcar
ID_LIKE=coreos
VERSION=3374.2.4
VERSION_ID=3374.2.4
BUILD_ID=2023-02-15-1824
SYSEXT_LEVEL=1.0
PRETTY_NAME="Flatcar Container Linux by Kinvolk 3374.2.4 (Oklo)"
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:

/kind bug

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels May 16, 2024
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If cloud-provider-aws contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@cartermckinnon
Contributor

This:

I0516 08:13:24.811572 1 node_lifecycle_controller.go:164] deleting node since it is no longer present in cloud provider: ip-10-230-13-35.ec2.internal

Isn't an error; it's expected behavior when a Node becomes NotReady and the corresponding EC2 instance is terminated (or doesn't exist). Are you sure the EC2 instances for your old 1.26 control plane nodes have been terminated? They wouldn't have a Ready status if the kubelet had stopped heartbeating.
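One way to confirm that (just a sketch, not something from the AWS docs; it assumes the AWS CLI is configured for the account and region the cluster runs in) is to look the node up by its private DNS name:

NODE=ip-10-230-13-35.ec2.internal    # one of the nodes from the log above
# does an instance with that private DNS name still exist, and in what state?
aws ec2 describe-instances \
  --filters "Name=private-dns-name,Values=${NODE}" \
  --query 'Reservations[].Instances[].[InstanceId,State.Name]' \
  --output text
# and what does the API server still report for the node?
kubectl describe node "${NODE}" | grep -E 'Ready|ProviderID'

If the instance really is gone, the node should drop out of Ready once the kubelet heartbeats stop, and the CCM will then delete it.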

@shnigam2

@cartermckinnon We followed the steps below on the existing 1.26 cluster to make it ready for the 1.27 upgrade (a rough command-level sketch of these steps follows at the end of this comment):

On the existing 1.26 version:
Added the tag kubernetes.io/cluster/cluster-name: owned to each node
Ran k edit cm kubeadm-config -n kube-system to set cloud-provider=external
Updated the existing master kube-controller-manager and kube-apiserver manifests to use cloud-provider=external
Got the aws cloud controller manager running

Now, when upgrading the cluster to 1.27, these are the issues we are facing:

  • ProviderID is not getting populated when a new node joins the cluster.
  • When the ASG deletes old nodes, especially old control-plane nodes, the node controller does not delete the terminated nodes.

Please let us know which step we are missing and what the correct method is to move to the out-of-tree aws CCM for the k8s upgrade to 1.27.
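The rough command-level sketch of the steps above (the instance ID and CLUSTER_NAME here are placeholders, not what we actually ran):

# tag each node's EC2 instance so the AWS provider can find it
aws ec2 create-tags \
  --resources i-0123456789abcdef0 \
  --tags "Key=kubernetes.io/cluster/${CLUSTER_NAME},Value=owned"
# switch the kubeadm ClusterConfiguration to the external provider
kubectl -n kube-system edit cm kubeadm-config        # set cloud-provider: external
# then update /etc/kubernetes/manifests/kube-controller-manager.yaml and
# kube-apiserver.yaml on each control plane node the same way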

@cartermckinnon
Contributor

cartermckinnon commented May 17, 2024

Are you passing --cloud-provider=external to kubelet as well?

CCM should fill in the provider ID if it's missing, but it's generally preferable to just pass it to kubelet to avoid extra API calls in CCM. The EKS AMI uses this helper script to set it: https://github.com/awslabs/amazon-eks-ami/blob/f5111dd100ebd94d9fbfbb1fe2f43b75fd1a6703/templates/al2/runtime/bin/provider-id
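A minimal sketch of doing that at the kubelet level (this is not the exact EKS script; it assumes IMDS is reachable and that you already have a place to put kubelet extra args):

# the AWS provider ID has the form aws:///<availability-zone>/<instance-id>
AZ=$(curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone)
INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
# hand kubelet both the external provider flag and the provider ID
KUBELET_EXTRA_ARGS="--cloud-provider=external --provider-id=aws:///${AZ}/${INSTANCE_ID}"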

@shnigam2

@cartermckinnon Let me share the 10-kubeadm.conf and kubeadm-config that we currently have on 1.26, where the in-tree support is in place:

10-kubeadm.conf
# Note: This dropin only works with kubeadm and kubelet v1.11+
[Service]
Environment="KUBELET_KUBECONFIG_ARGS=--bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf"
Environment="KUBELET_CONFIG_ARGS=--config=/var/lib/kubelet/config.yaml"
Environment="KUBELET_EXTRA_ARGS=--cloud-provider=aws --node-labels=node.kubernetes.io/role=${kind},instance-group=${group_name},${extra_labels} --register-with-taints=${taints} --cert-dir=/etc/kubernetes/pki --cgroup-driver=systemd"
# Environment="KUBELET_KUBEADM_ARGS=--feature-gates=RotateKubeletClientCertificate=true,RotateKubeletServerCertificate=true  --rotate-certificates"
# This is a file that "kubeadm init" and "kubeadm join" generates at runtime, populating the KUBELET_KUBEADM_ARGS variable dynamically
EnvironmentFile=-/var/lib/kubelet/kubeadm-flags.env
# This is a file that the user can use for overrides of the kubelet args as a last resort. Preferably, the user should use
# the .NodeRegistration.KubeletExtraArgs object in the configuration files instead. KUBELET_EXTRA_ARGS should be sourced from this file.
EnvironmentFile=-/etc/default/kubelet
ExecStart=
ExecStart=/opt/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_CONFIG_ARGS $KUBELET_KUBEADM_ARGS $KUBELET_EXTRA_ARGS
kubeadm-config
---
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
apiServer:
  certSANs:
  - "api-int.${cluster_fqdn}"
  - "api.${cluster_fqdn}"
  extraArgs:
    anonymous-auth: "true"
    audit-log-maxage: "7"
    audit-log-maxbackup: "50"
    audit-log-maxsize: "100"
    audit-log-path: /var/log/kube-apiserver-audit.log
    audit-policy-file: /etc/kubernetes/files/audit-log-policy.yaml
    authorization-mode: Node,RBAC
    cloud-provider: aws
    max-mutating-requests-inflight: "400"
    max-requests-inflight: "800"
    oidc-client-id: "${dex_oidc_client_id}"
    oidc-groups-claim: "${dex_oidc_groups_claim}"
    oidc-issuer-url: "${dex_oidc_issuer_url}"
    oidc-username-claim: "${dex_oidc_username_claim}"
    profiling: "false"
    request-timeout: 30m0s
    service-account-lookup: "true"
    tls-cipher-suites: TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_RSA_WITH_AES_256_GCM_SHA384,TLS_RSA_WITH_AES_128_GCM_SHA256
  extraVolumes:
  - hostPath: /etc/kubernetes/files
    mountPath: /etc/kubernetes/files
    name: cloud-config
    readOnly: true
  - hostPath: /var/log
    mountPath: /var/log
    name: var-log
    readOnly: false
  timeoutForControlPlane: 10m0s
certificatesDir: /etc/kubernetes/pki
clusterName: "${cluster_fqdn}"
controlPlaneEndpoint: "${api_endpoint}:${api_port}"
controllerManager:
  extraArgs:
    cluster-signing-cert-file: /etc/kubernetes/pki/ca.crt
    cluster-signing-key-file: /etc/kubernetes/pki/ca.key
    feature-gates: RotateKubeletServerCertificate=true
    profiling: "false"
    terminated-pod-gc-threshold: "12500"
    configure-cloud-routes: "false"
    cluster-name: "${cluster_fqdn}"
    attach-detach-reconcile-sync-period: "1m0s"
    cloud-provider: "aws"
{{- if contains "1.15" .Kubernetes.Version | not }}
    flex-volume-plugin-dir: "/var/lib/kubelet/volumeplugins/"
{{- end }}
dns:
  type: CoreDNS
etcd:
  ${etcd_type}:
    endpoints:
    ${endpoints}
    caFile: ${etcd_cafile}
    certFile: "/etc/kubernetes/pki/apiserver-etcd-client.crt"
    keyFile: "/etc/kubernetes/pki/apiserver-etcd-client.key"
imageRepository: registry.k8s.io
kubernetesVersion: "${k8s_version}"
networking:
  dnsDomain: cluster.local
  podSubnet: "${pod_subnet}"
  serviceSubnet: "100.64.0.0/13"

---
apiVersion: kubeadm.k8s.io/v1beta3
kind: InitConfiguration
bootstrapTokens:
- token: "${kubeadm_token}"
  description: "kubeadm bootstrap token"
  ttl: "43800h"
nodeRegistration:
  criSocket: "unix:///var/run/containerd/containerd.sock"
  kubeletExtraArgs:
    container-runtime: remote
    container-runtime-endpoint: unix:///run/containerd/containerd.sock
  ignorePreflightErrors:
  - IsPrivilegedUser
localAPIEndpoint:
  bindPort: 443
---
kind: KubeletConfiguration
apiVersion: kubelet.config.k8s.io/v1beta1
cgroupDriver: systemd

Now we are planning to opt for the out-of-tree aws cloud controller manager. Could you please guide us on what changes we need to make to migrate from in-tree to out-of-tree? Currently we have deployed the aws-cloud-controller-manager DaemonSet and it is running, but kube-controller-manager is also still running with the above configuration.
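What we think the deltas would be, as a rough sketch based on the general external cloud provider migration guidance (please correct us if any of this is wrong for the AWS CCM):

# kubelet drop-in (KUBELET_EXTRA_ARGS):
#   --cloud-provider=aws              ->  --cloud-provider=external
# kubeadm-config, controllerManager.extraArgs:
#   cloud-provider: "aws"             ->  cloud-provider: "external"
# kubeadm-config, apiServer.extraArgs:
#   remove the cloud-provider: aws entry
# aws-cloud-controller-manager keeps running as the DaemonSet we already deployed,
#   with --cloud-provider=aws and a --cluster-name matching clusterName above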

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 15, 2024