
Add section for how to prevent DaemonSet thundering herd issues #645

Merged
merged 1 commit into aws:master from patch-1 on Feb 20, 2025

Conversation

@natherz97 (Contributor) commented Feb 13, 2025

Description of changes:
The EKS scalability team has observed resource exhaustion on clusters with large node counts due to thundering-herd issues, where a large number of pods from DaemonSets issue expensive list requests concurrently. DaemonSets are typically configured with a RollingUpdate strategy to ensure a gradual rollout of new pods; however, there are three cases that this does not currently cover:

  • New DaemonSet creation
  • Node scale-outs
  • Updating a DaemonSet template when there are NotReady pods

Testing:
Example DaemonSet object used in the following tests:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd-elasticsearch
  namespace: kube-system
  labels:
    k8s-app: fluentd-logging
spec:
  selector:
    matchLabels:
      name: fluentd-elasticsearch
  minReadySeconds: 60
  updateStrategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: fluentd-elasticsearch
    spec:
      containers:
      - name: fluentd-elasticsearch
        image: quay.io/fluentd_elasticsearch/fluentd:v2.5.2

1. Showing that there's a thundering herd issue on creation of a new DaemonSet:
% kubectl create -f /tmp/daemonset.yaml
daemonset.apps/fluentd-elasticsearch created

% kubectl get pods -n kube-system -l name=fluentd-elasticsearch
NAME                          READY   STATUS    RESTARTS   AGE
fluentd-elasticsearch-29lsh   1/1     Running   0          68s
fluentd-elasticsearch-2bscp   1/1     Running   0          68s
fluentd-elasticsearch-2twqs   1/1     Running   0          68s
...

2. Showing that there's a thundering herd issue when a node scale-out occurs with an existing DaemonSet:
# Cluster is scaled out from 100 -> 150 nodes
% kubectl get pods -n kube-system -l name=fluentd-elasticsearch | grep -v m
NAME                          READY   STATUS    RESTARTS   AGE
fluentd-elasticsearch-2zwsc   1/1     Running   0          41s
fluentd-elasticsearch-5594r   1/1     Running   0          43s
fluentd-elasticsearch-567z7   1/1     Running   0          43s
...

3. Showing that there's a thundering herd issue on an update to an existing DaemonSet when there are a large number of NotReady pods:
% kubectl get pods -n kube-system -l name=fluentd-elasticsearch
NAME                          READY   STATUS         RESTARTS   AGE
fluentd-elasticsearch-2cjdw   0/1     ErrImagePull   0          7s
fluentd-elasticsearch-2l8wv   0/1     ErrImagePull   0          6s
fluentd-elasticsearch-2qvn9   0/1     ErrImagePull   0          8s
...

# after fixing image URL
% kubectl edit daemonset fluentd-elasticsearch -n kube-system
% kubectl get pods -n kube-system -l name=fluentd-elasticsearch
NAME                          READY   STATUS    RESTARTS   AGE
fluentd-elasticsearch-2t2zf   1/1     Running   0          5s
fluentd-elasticsearch-2tj4b   1/1     Running   0          4s
fluentd-elasticsearch-424s2   1/1     Running   0          9s
...

4. To fix the thundering herd issue on new DaemonSet creation (see the sketch after this item's output):
% kubectl label nodes --all run-daemonset=false

node/ip-192-168-105-20.us-west-2.compute.internal labeled
node/ip-192-168-107-101.us-west-2.compute.internal labeled
node/ip-192-168-109-29.us-west-2.compute.internal labeled
...

# create same DaemonSet with the NodeAffinity setting provided
% kubectl create -f /tmp/daemonset.yaml
daemonset.apps/fluentd-elasticsearch created

% kubectl get pods -n kube-system -l name=fluentd-elasticsearch
No resources found in kube-system namespace.

# Now, run the bash script to remove the node labels so pods are created gradually
% /tmp/remove_labels.sh
Removing run-daemonset label from node ip-192-168-105-20.us-west-2.compute.internal
node/ip-192-168-105-20.us-west-2.compute.internal unlabeled
Removing run-daemonset label from node ip-192-168-107-101.us-west-2.compute.internal
node/ip-192-168-107-101.us-west-2.compute.internal unlabeled
Removing run-daemonset label from node ip-192-168-109-29.us-west-2.compute.internal
node/ip-192-168-109-29.us-west-2.compute.internal unlabeled
...

# pods are being created 5s apart
% kubectl get pods -n kube-system -l name=fluentd-elasticsearch
NAME                          READY   STATUS    RESTARTS   AGE
fluentd-elasticsearch-69lt5   1/1     Running   0          5s
fluentd-elasticsearch-gb9qb   1/1     Running   0          52s
fluentd-elasticsearch-gm5qz   1/1     Running   0          17s
fluentd-elasticsearch-gmvfl   1/1     Running   0          40s
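
The nodeAffinity setting and the /tmp/remove_labels.sh script referenced above are not included in the PR description. A minimal sketch of both follows, assuming the gating label key is run-daemonset and a 5-second delay between nodes; the label key, operator, and delay are illustrative, not necessarily the exact values used in the merged doc.

# Hypothetical nodeAffinity stanza added under spec.template.spec of the DaemonSet,
# keeping pods off any node still labeled run-daemonset=false:
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: run-daemonset
          operator: NotIn
          values:
          - "false"

#!/usr/bin/env bash
# Hypothetical sketch of /tmp/remove_labels.sh: remove the run-daemonset label from
# one node at a time so the DaemonSet controller creates pods gradually.
for node in $(kubectl get nodes -l run-daemonset=false -o name); do
  echo "Removing run-daemonset label from ${node#node/}"
  kubectl label "${node}" run-daemonset-   # trailing '-' removes the label
  sleep 5                                  # pace pod creation roughly 5s apart
done

Because the required node affinity excludes the labeled nodes, the controller creates no pods at first ("No resources found" above); pods appear only as each node's label is removed.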

5. To fix the thundering herd issue on node scale-outs with an existing DaemonSet (see the sketch after this item's output):
% kubectl label nodes --all run-daemonset=true
node/ip-192-168-103-141.us-west-2.compute.internal labeled
node/ip-192-168-103-150.us-west-2.compute.internal labeled
node/ip-192-168-105-20.us-west-2.compute.internal labeled

# Add the nodeAffinity setting to match any nodes with the run-daemonset=true label
% kubectl edit daemonset fluentd-elasticsearch -n kube-system

# Complete the node scale-out. Next, run the bash script to add the run-daemonset=true label to the new nodes
% /tmp/add_labels.sh
Adding run-daemonset=true label to node ip-192-168-102-154.us-west-2.compute.internal
node/ip-192-168-102-154.us-west-2.compute.internal labeled
Adding run-daemonset=true label to node ip-192-168-109-118.us-west-2.compute.internal
node/ip-192-168-109-118.us-west-2.compute.internal labeled
Adding run-daemonset=true label to node ip-192-168-110-239.us-west-2.compute.internal
...

%  kubectl get pods -n kube-system -l name=fluentd-elasticsearch
NAME                          READY   STATUS              RESTARTS   AGE
fluentd-elasticsearch-2qwhf   1/1     Running             0          76s
fluentd-elasticsearch-5pzsr   1/1     Running             0          2m20s
fluentd-elasticsearch-5s7jd   0/1     ContainerCreating   0          0s
...
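
The /tmp/add_labels.sh script referenced above is not shown in the PR. A minimal sketch follows, assuming the DaemonSet's nodeAffinity was edited to require run-daemonset=true (the same stanza as the earlier sketch, but with operator: In and values: ["true"]) and that newly scaled-out nodes start without the label; the selector and 5-second delay are illustrative.

#!/usr/bin/env bash
# Hypothetical sketch of /tmp/add_labels.sh: label the new (still unlabeled) nodes
# run-daemonset=true one at a time so DaemonSet pods roll out to them gradually.
for node in $(kubectl get nodes -l '!run-daemonset' -o name); do
  echo "Adding run-daemonset=true label to ${node#node/}"
  kubectl label "${node}" run-daemonset=true
  sleep 5   # pace pod creation on the scaled-out nodes
done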

6. To fix the thundering herd issue on DaemonSet updates when there are a large number of NotReady pods (see the sketch after this item's output):
% kubectl create -f /tmp/daemonset.yaml
daemonset.apps/fluentd-elasticsearch created

% kubectl get pods -n kube-system -l name=fluentd-elasticsearch
NAME                          READY   STATUS         RESTARTS   AGE
fluentd-elasticsearch-24qlz   0/1     ErrImagePull   0          10s
fluentd-elasticsearch-2hxlp   0/1     ErrImagePull   0          8s
fluentd-elasticsearch-2nrzh   0/1     ErrImagePull   0          10s
...

# Change the updateStrategy type to OnDelete and fix the image URL in the pod template
% kubectl edit daemonset fluentd-elasticsearch -n kube-system

# Now, run the bash script to delete the existing NotReady pods; new pods will be
# gradually created by the daemonset-controller
% /tmp/delete_pods.sh
Removing deleting pod fluentd-elasticsearch-24qlz
pod "fluentd-elasticsearch-24qlz" deleted
Removing deleting pod fluentd-elasticsearch-2hxlp
pod "fluentd-elasticsearch-2hxlp" deleted
Removing deleting pod fluentd-elasticsearch-2nrzh
pod "fluentd-elasticsearch-2nrzh" deleted
...

% kubectl get pods -n kube-system -l name=fluentd-elasticsearch
NAME                          READY   STATUS             RESTARTS   AGE
fluentd-elasticsearch-5mz9f   1/1     Running            0          2s
fluentd-elasticsearch-5x5f4   0/1     ImagePullBackOff   0          5m29s
fluentd-elasticsearch-675sj   0/1     ImagePullBackOff   0          5m31s
fluentd-elasticsearch-67gsp   0/1     ImagePullBackOff   0          5m29s
fluentd-elasticsearch-68zzq   1/1     Running            0          16s
...
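
The /tmp/delete_pods.sh script referenced above is not shown in the PR. A minimal sketch follows, assuming the updateStrategy has already been switched to OnDelete and that the NotReady pods are the ones whose phase is not Running; the field selector and 5-second delay are illustrative.

#!/usr/bin/env bash
# Hypothetical sketch of /tmp/delete_pods.sh: with OnDelete, the daemonset-controller
# replaces a pod only after it is deleted, so deleting the NotReady pods one at a
# time rolls the fixed template out gradually.
for pod in $(kubectl get pods -n kube-system -l name=fluentd-elasticsearch \
    --field-selector=status.phase!=Running -o name); do
  echo "Deleting ${pod#pod/}"
  kubectl delete -n kube-system "${pod}"
  sleep 5   # pace replacement pod creation
done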

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@natherz97 requested a review from a team as a code owner on February 13, 2025 20:14
@natherz97 force-pushed the patch-1 branch 10 times, most recently from e52d811 to 4358b97 on February 13, 2025 22:17
@natherz97 force-pushed the patch-1 branch 4 times, most recently from 3255bde to 1dda388 on February 18, 2025 00:11
@svennam92 (Collaborator) left a comment

LGTM thank you for the PR!

@svennam92 (Collaborator)

@natherz97 good to merge?

@natherz97 (Contributor, Author)

@svennam92 Waiting for a sign-off from @mengqiy, then we're good to merge.

@mengqiy (Contributor) left a comment

LGTM
Thanks!

@mengqiy (Contributor) commented Feb 19, 2025

@svennam92 This is ready to go

@svennam92 merged commit 261caa5 into aws:master on Feb 20, 2025
1 check passed