
protokube - gossip memory leak #13974

Closed
efekete opened this issue Jul 13, 2022 · 7 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.

Comments

@efekete

efekete commented Jul 13, 2022

/kind bug

1. What kops version are you running? The command kops version will display
this information.

1.23.2

2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.

1.23.7

3. What cloud provider are you using?
AWS

4. What commands did you run? What is the simplest way to reproduce this issue?
N/A

5. What happened after the commands executed?
N/A

6. What did you expect to happen?
N/A

7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.

apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: "2019"
  generation: 39
  name: <cluster-name>
spec:
  api:
    loadBalancer:
      class: Classic
      idleTimeoutSeconds: 3600
      type: Public
  authorization:
    rbac: {}
  certManager:
    enabled: true
  channel: stable
  cloudLabels:
    k8s.io/cluster-autoscaler/enabled: "true"
  cloudProvider: aws
  configBase: s3://<s3-bucket>
  containerRuntime: containerd
  dnsZone: <dnz-zone>
  etcdClusters:
    - etcdMembers:
        - instanceGroup: master-eu-west-1a
          name: a
        - instanceGroup: master-eu-west-1b
          name: b
        - instanceGroup: master-eu-west-1c
          name: c
      manager:
        env:
          - name: ETCD_LISTEN_METRICS_URLS
            value: http://0.0.0.0:8081
          - name: ETCD_METRICS
            value: basic
      name: main
      provider: Manager
    - etcdMembers:
        - instanceGroup: master-eu-west-1a
          name: a
        - instanceGroup: master-eu-west-1b
          name: b
        - instanceGroup: master-eu-west-1c
          name: c
      name: events
      provider: Manager
  iam:
    allowContainerRegistry: true
    legacy: false
    serviceAccountExternalPermissions:
      - aws:
          policyARNs:
            - arn:aws:iam::<policy>
        name: external-dns
        namespace: kube-system
      - aws:
          policyARNs:
            - arn:aws:iam::<policy>
        name: aws-cluster-autoscaler
        namespace: kube-system
      - aws:
          policyARNs:
            - arn:aws:iam::<policy>
        name: github-runners-controller-actions-runner-controller
        namespace: github-runner
  kubeAPIServer:
    eventTTL: 8h0m0s
  kubeDNS:
    provider: CoreDNS
  kubeProxy:
    metricsBindAddress: 0.0.0.0
  kubelet:
    anonymousAuth: false
    authenticationTokenWebhook: true
    authorizationMode: Webhook
  kubernetesApiAccess:
    - 0.0.0.0/0
  kubernetesVersion: 1.23.7
  masterInternalName: <internal-name>
  masterPublicName: <public-name>
  metricsServer:
    enabled: true
    insecure: true
  networkCIDR: 10.10.0.0/16
  networking:
    flannel:
      backend: vxlan
  nonMasqueradeCIDR: <cidr>
  serviceAccountIssuerDiscovery:
    discoveryStore: <s3-name>
    enableAWSOIDCProvider: true
  sshAccess:
    - 0.0.0.0/0
  subnets:
    - cidr: 10.10.32.0/19
      name: eu-west-1a
      type: Private
      zone: eu-west-1a
    - cidr: 10.10.64.0/19
      name: eu-west-1b
      type: Private
      zone: eu-west-1b
    - cidr: 10.10.96.0/19
      name: eu-west-1c
      type: Private
      zone: eu-west-1c
    - cidr: 10.10.0.0/22
      name: utility-eu-west-1a
      type: Utility
      zone: eu-west-1a
    - cidr: 10.10.4.0/22
      name: utility-eu-west-1b
      type: Utility
      zone: eu-west-1b
    - cidr: 10.10.8.0/22
      name: utility-eu-west-1c
      type: Utility
      zone: eu-west-1c
  topology:
    bastion:
      idleTimeoutSeconds: 1200
    dns:
      type: Public
    masters: private
    nodes: private

8. Please run the commands with most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or in a gist and provide the gist link here.

9. Anything else we need to know?

Kubernetes cluster with ~200 nodes; control-plane nodes had 16Gi of memory. GitLab runners started scaling up, after which the API server started restarting. I increased the control-plane node size to 32Gi, but the cluster couldn't pass validation, so I couldn't roll them normally.
I ran kops rolling-update for the 3 control-plane instance groups with the --cloudonly flag.
After they booted, the API servers were working fine except for etcd-events, which started working as expected once the events TTL expired.

A day later I realized a lot of nodes were crashing, in particular in the instance groups where nodes have 4Gi of memory. After investigation we realized protokube was using over 3Gi of memory, so those nodes were running out of memory.
Upgrading kops to 1.24 didn't help (I upgraded almost all nodes while looking for another solution in parallel; a few nodes were left that I didn't upgrade).
The issue stopped after I ran a DaemonSet that restarted all protokube processes (on all nodes) at the same time.

While debugging I noticed in the journalctl logs of protokube that there were a lot of nodes that didn't exist any more.
The unusual part of the logs was:

Jul 10 07:56:21 ip-10-10-116-37 protokube[5041]: I0710 07:56:21.013603    5041 glogger.go:30] ->[10.10.57.40:3999] attempting connection
Jul 10 07:56:21 ip-10-10-116-37 protokube[5041]: I0710 07:56:21.013613    5041 glogger.go:30] ->[10.10.75.64:3999] attempting connection
Jul 10 07:56:21 ip-10-10-116-37 protokube[5041]: I0710 07:56:21.076450    5041 glogger.go:30] ->[10.10.37.61:3999] attempting connection

And after that:

Jul 10 07:56:24 ip-10-10-116-37 protokube[5041]: I0710 07:56:24.683405    5041 glogger.go:30] Removed unreachable peer 0f197621448a17b0ebd24ce0e371c940(i-063d346b769228dbc)
Jul 10 07:56:24 ip-10-10-116-37 protokube[5041]: I0710 07:56:24.683408    5041 glogger.go:30] Removed unreachable peer eed8ce0418c803c85d709218bd835612(i-0f9739243ff457cd9)
Jul 10 07:56:24 ip-10-10-116-37 protokube[5041]: I0710 07:56:24.683411    5041 glogger.go:30] Removed unreachable peer 830fb8af57ae1fccbf4936f4e328dc3d

I copied just part of it, but there were hundreds of those lines at the same time. After being removed, the same nodes would appear again, and the process would go in a loop.

After a synchronous restart of all protokube processes, memory usage came back to normal (on most nodes I was looking at, it went from 30% back to 0.3%) and there were no more loops of removing the same nodes.

My guess is that a stale node list was propagating through the gossip protocol.
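
For anyone debugging something similar, a generic way to spot-check a node for this symptom (these are not the exact commands from this incident, just a sketch assuming protokube runs as a systemd unit, as the logs above suggest):

# resident memory of the protokube process, in KiB
ps -o rss,cmd -C protokube

# how often the gossip layer is churning peers (the removal loop described above)
journalctl -u protokube --since "1 hour ago" | grep -c "Removed unreachable peer"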

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Jul 13, 2022
@fbozic

fbozic commented Sep 19, 2022

Hi, we are facing the same issue. Is there any update on this? Or, if this is a known bug that cannot be fixed in the near future, could you maybe share how to migrate a gossip-based DNS cluster to a Route 53 DNS cluster?

@olemarkus
Member

This is largely missing documentation right now. But a user shared on Slack that switching from the default gossip implementation to memberlist provides significant benefits: #7436

It's unfortunate that that issue was closed without a conclusion, but I think we need to revisit it.

@olemarkus
Member

Regarding migrating from gossip to Route 53: this is simply not possible, since gossip is tied to the immutable cluster name (gossip clusters use the .k8s.local suffix).

But try switching to memberlist first.
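
For reference, a minimal sketch of what that switch looks like in the cluster spec. The field names follow the kops gossipConfig / dnsControllerGossipConfig API, but the listen addresses and seed below are assumptions based on the conventional gossip ports, so check the kops gossip docs for your version before running kops update cluster and a rolling update:

spec:
  gossipConfig:
    # switch protokube's gossip from the default mesh implementation
    protocol: memberlist
    listen: 0.0.0.0:3999
  dnsControllerGossipConfig:
    # dns-controller gossips on its own port and seeds from protokube
    protocol: memberlist
    listen: 0.0.0.0:3998
    seed: 127.0.0.1:3999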

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 18, 2022
@efekete
Author

efekete commented Dec 28, 2022

Switching to memberlist resolved the issue; after scaling up to 300 nodes there wasn't any noticeable increase in memory, CPU, or network bandwidth.

Thank you @olemarkus for your help.

#14898

@efekete efekete closed this as completed Dec 28, 2022
@kzap

kzap commented Aug 21, 2024

For anyone else encountering this, we were able to resolve the issue by restarting protokube across all nodes at the same time so that the mesh gossip state gets cleared out.

The DaemonSet you can use to do that looks something like this:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: restart-protokube
  namespace: kube-system
  labels:
    app: restart-protokube
spec:
  selector:
    matchLabels:
      app: restart-protokube
  template:
    metadata:
      labels:
        app: restart-protokube
    spec:
      containers:
        - name: restart-protokube
          image: alpine
          command:
            - nsenter
          args:
            - "-t"
            - "1"
            - "-m"
            - "-u"
            - "-i"
            - "-n"
            - "systemctl"
            - "restart"
            - "protokube"
          securityContext:
            privileged: true
      hostNetwork: true
      hostPID: true # This allows the pod to interact with the host's PID namespace
      hostIPC: true
      volumes:
        - name: cgroup
          hostPath:
            path: /sys/fs/cgroup
        - name: tmp
          emptyDir: {}
        - name: run
          emptyDir: {}
      restartPolicy: Always
      tolerations:
        - operator: "Exists"

Don't forget to delete the DaemonSet afterwards, or else it will keep running and restarting protokube.

We also explored switching the GossipSecret and then rotating 2 of the 3 masters, then the workers, then the last master, but that was a longer and more error-prone process. The protokube restart is quick and has no downtime.
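
For completeness, the assumed workflow around that DaemonSet (the filename here is hypothetical):

# run the restart once on every node, then remove the DaemonSet again
kubectl apply -f restart-protokube.yaml
kubectl -n kube-system get pods -l app=restart-protokube -o wide
kubectl -n kube-system delete daemonset restart-protokube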

@efekete
Author

efekete commented Aug 22, 2024

Although restarting protokube on the nodes helps resolve the issue, the long-term solution to keep it from happening again is to switch the gossip protocol. For us, this issue always appeared at around 200 nodes before we switched. Since switching we have doubled the number of nodes in the cluster and everything is running stably.

At the time of writing, switching to the memberlist gossip protocol was the best solution; it is worth exploring now whether kops has alternative solutions. I know there was some work in progress at the time; I think there was some info about it on the kOps Slack a few months ago.

I suggest checking the link in my earlier comment (#14898) for more info.
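
One generic way to confirm the switch took effect on a node after the rolling update (flag names can differ between kops versions, so this just prints protokube's command line instead of assuming a specific flag):

# show the gossip-related flags protokube was started with
ps -o cmd -C protokube | tr ' ' '\n' | grep -i gossip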
