
Add section for how to prevent DaemonSet thundering herd issues #645

Merged
merged 1 commit into aws:master from patch-1 on Feb 20, 2025

Conversation

@natherz97 (Contributor) commented Feb 13, 2025

Description of changes:
The EKS scalability team has observed resource exhaustion on clusters with large node counts due to thundering-herd issues, where a large number of pods from DaemonSets issue expensive list requests concurrently. DaemonSets are typically configured with a RollingUpdate strategy to ensure a gradual rollout of new pods; however, there are three cases that this does not currently cover:

  • New DaemonSet creation
  • Node scale-outs
  • Updating a DaemonSet template when there are NotReady pods

Testing:
Example DaemonSet object used in the following tests:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd-elasticsearch
  namespace: kube-system
  labels:
    k8s-app: fluentd-logging
spec:
  selector:
    matchLabels:
      name: fluentd-elasticsearch
  minReadySeconds: 60
  updateStrategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: fluentd-elasticsearch
    spec:
      containers:
      - name: fluentd-elasticsearch
        image: quay.io/fluentd_elasticsearch/fluentd:v2.5.2

1. Showing that there's a thundering herd issue on creation of a new DaemonSet:
% kubectl create -f /tmp/daemonset.yaml
daemonset.apps/fluentd-elasticsearch created

% kubectl get pods -n kube-system -l name=fluentd-elasticsearch
NAME                          READY   STATUS    RESTARTS   AGE
fluentd-elasticsearch-29lsh   1/1     Running   0          68s
fluentd-elasticsearch-2bscp   1/1     Running   0          68s
fluentd-elasticsearch-2twqs   1/1     Running   0          68s
...

2. Showing that there's a thundering herd issue when a node scale-out occurs with an existing DaemonSet:
# Cluster is scaled out from 100 -> 150 nodes
% kubectl get pods -n kube-system -l name=fluentd-elasticsearch | grep -v m
NAME                          READY   STATUS    RESTARTS   AGE
fluentd-elasticsearch-2zwsc   1/1     Running   0          41s
fluentd-elasticsearch-5594r   1/1     Running   0          43s
fluentd-elasticsearch-567z7   1/1     Running   0          43s
...

3. Showing that there's a thundering herd issue on an update to an existing DaemonSet when there are a large number of NotReady pods:
% kubectl get pods -n kube-system -l name=fluentd-elasticsearch
NAME                          READY   STATUS         RESTARTS   AGE
fluentd-elasticsearch-2cjdw   0/1     ErrImagePull   0          7s
fluentd-elasticsearch-2l8wv   0/1     ErrImagePull   0          6s
fluentd-elasticsearch-2qvn9   0/1     ErrImagePull   0          8s
...

# after fixing image URL
% kubectl edit daemonset fluentd-elasticsearch -n kube-system
% kubectl get pods -n kube-system -l name=fluentd-elasticsearch
NAME                          READY   STATUS    RESTARTS   AGE
fluentd-elasticsearch-2t2zf   1/1     Running   0          5s
fluentd-elasticsearch-2tj4b   1/1     Running   0          4s
fluentd-elasticsearch-424s2   1/1     Running   0          9s
...

4. To fix the thundering herd issue on new DaemonSet creation (see the sketch after this item's output):
% kubectl label nodes --all run-daemonset=false

node/ip-192-168-105-20.us-west-2.compute.internal labeled
node/ip-192-168-107-101.us-west-2.compute.internal labeled
node/ip-192-168-109-29.us-west-2.compute.internal labeled
...

# create same DaemonSet with the NodeAffinity setting provided
% kubectl create -f /tmp/daemonset.yaml
daemonset.apps/fluentd-elasticsearch created

% kubectl get pods -n kube-system -l name=fluentd-elasticsearch
No resources found in kube-system namespace.

# Now, run the bash script to remove the node labels so pods are created gradually
% /tmp/remove_labels.sh
Removing run-daemonset label from node ip-192-168-105-20.us-west-2.compute.internal
node/ip-192-168-105-20.us-west-2.compute.internal unlabeled
Removing run-daemonset label from node ip-192-168-107-101.us-west-2.compute.internal
node/ip-192-168-107-101.us-west-2.compute.internal unlabeled
Removing run-daemonset label from node ip-192-168-109-29.us-west-2.compute.internal
node/ip-192-168-109-29.us-west-2.compute.internal unlabeled
...

# pods are being created 5s apart
% kubectl get pods -n kube-system -l name=fluentd-elasticsearch
NAME                          READY   STATUS    RESTARTS   AGE
fluentd-elasticsearch-69lt5   1/1     Running   0          5s
fluentd-elasticsearch-gb9qb   1/1     Running   0          52s
fluentd-elasticsearch-gm5qz   1/1     Running   0          17s
fluentd-elasticsearch-gmvfl   1/1     Running   0          40s
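
The nodeAffinity setting and the /tmp/remove_labels.sh script referenced above are not included in the PR description. A minimal sketch of both follows, assuming the gating label key is run-daemonset and a 5-second delay between nodes; the label key, operator, and delay are illustrative, not necessarily the exact values used in the merged doc.

# Hypothetical nodeAffinity stanza added under spec.template.spec of the DaemonSet,
# keeping pods off any node still labeled run-daemonset=false:
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: run-daemonset
          operator: NotIn
          values:
          - "false"

#!/usr/bin/env bash
# Hypothetical sketch of /tmp/remove_labels.sh: remove the run-daemonset label from
# one node at a time so the DaemonSet controller creates pods gradually.
for node in $(kubectl get nodes -l run-daemonset=false -o name); do
  echo "Removing run-daemonset label from ${node#node/}"
  kubectl label "${node}" run-daemonset-   # trailing '-' removes the label
  sleep 5                                  # pace pod creation roughly 5s apart
done

Because the required node affinity excludes the labeled nodes, the controller creates no pods at first ("No resources found" above); pods appear only as each node's label is removed.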

5. To fix the thundering herd issue on node scale-outs with an existing DaemonSet (see the sketch after this item's output):
% kubectl label nodes --all run-daemonset=true
node/ip-192-168-103-141.us-west-2.compute.internal labeled
node/ip-192-168-103-150.us-west-2.compute.internal labeled
node/ip-192-168-105-20.us-west-2.compute.internal labeled

# Add the nodeAffinity setting to match any nodes with the run-daemonset=true label
% kubectl edit daemonset fluentd-elasticsearch -n kube-system

# Complete the node scale-out. Next, run the bash script to add the run-daemonset=true label to the new nodes
% /tmp/add_labels.sh
Adding run-daemonset=true label to node ip-192-168-102-154.us-west-2.compute.internal
node/ip-192-168-102-154.us-west-2.compute.internal labeled
Adding run-daemonset=true label to node ip-192-168-109-118.us-west-2.compute.internal
node/ip-192-168-109-118.us-west-2.compute.internal labeled
Adding run-daemonset=true label to node ip-192-168-110-239.us-west-2.compute.internal
...

%  kubectl get pods -n kube-system -l name=fluentd-elasticsearch
NAME                          READY   STATUS              RESTARTS   AGE
fluentd-elasticsearch-2qwhf   1/1     Running             0          76s
fluentd-elasticsearch-5pzsr   1/1     Running             0          2m20s
fluentd-elasticsearch-5s7jd   0/1     ContainerCreating   0          0s
...
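
The /tmp/add_labels.sh script referenced above is not shown in the PR. A minimal sketch follows, assuming the DaemonSet's nodeAffinity was edited to require run-daemonset=true (the same stanza as the earlier sketch, but with operator: In and values: ["true"]) and that newly scaled-out nodes start without the label; the selector and 5-second delay are illustrative.

#!/usr/bin/env bash
# Hypothetical sketch of /tmp/add_labels.sh: label the new (still unlabeled) nodes
# run-daemonset=true one at a time so DaemonSet pods roll out to them gradually.
for node in $(kubectl get nodes -l '!run-daemonset' -o name); do
  echo "Adding run-daemonset=true label to ${node#node/}"
  kubectl label "${node}" run-daemonset=true
  sleep 5   # pace pod creation on the scaled-out nodes
done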

6. To fix the thundering herd issue on DaemonSet updates when there are a large number of NotReady pods (see the sketch after this item's output):
% kubectl create -f /tmp/daemonset.yaml
daemonset.apps/fluentd-elasticsearch created

% kubectl get pods -n kube-system -l name=fluentd-elasticsearch
NAME                          READY   STATUS         RESTARTS   AGE
fluentd-elasticsearch-24qlz   0/1     ErrImagePull   0          10s
fluentd-elasticsearch-2hxlp   0/1     ErrImagePull   0          8s
fluentd-elasticsearch-2nrzh   0/1     ErrImagePull   0          10s
...

# Change the updateStrategy type to OnDelete and fix the image URL in the pod template
% kubectl edit daemonset fluentd-elasticsearch -n kube-system

# Now, run the bash script to delete the existing NotReady pods; new pods will be
# gradually created by the daemonset-controller
% /tmp/delete_pods.sh
Removing deleting pod fluentd-elasticsearch-24qlz
pod "fluentd-elasticsearch-24qlz" deleted
Removing deleting pod fluentd-elasticsearch-2hxlp
pod "fluentd-elasticsearch-2hxlp" deleted
Removing deleting pod fluentd-elasticsearch-2nrzh
pod "fluentd-elasticsearch-2nrzh" deleted
...

% kubectl get pods -n kube-system -l name=fluentd-elasticsearch
NAME                          READY   STATUS             RESTARTS   AGE
fluentd-elasticsearch-5mz9f   1/1     Running            0          2s
fluentd-elasticsearch-5x5f4   0/1     ImagePullBackOff   0          5m29s
fluentd-elasticsearch-675sj   0/1     ImagePullBackOff   0          5m31s
fluentd-elasticsearch-67gsp   0/1     ImagePullBackOff   0          5m29s
fluentd-elasticsearch-68zzq   1/1     Running            0          16s
...
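
The /tmp/delete_pods.sh script referenced above is not shown in the PR. A minimal sketch follows, assuming the updateStrategy has already been switched to OnDelete and that the NotReady pods are the ones whose phase is not Running; the field selector and 5-second delay are illustrative.

#!/usr/bin/env bash
# Hypothetical sketch of /tmp/delete_pods.sh: with OnDelete, the daemonset-controller
# replaces a pod only after it is deleted, so deleting the NotReady pods one at a
# time rolls the fixed template out gradually.
for pod in $(kubectl get pods -n kube-system -l name=fluentd-elasticsearch \
    --field-selector=status.phase!=Running -o name); do
  echo "Deleting ${pod#pod/}"
  kubectl delete -n kube-system "${pod}"
  sleep 5   # pace replacement pod creation
done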

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@natherz97 requested a review from a team as a code owner on February 13, 2025 20:14
@natherz97 force-pushed the patch-1 branch 10 times, most recently from e52d811 to 4358b97 on February 13, 2025 22:17
@natherz97 force-pushed the patch-1 branch 4 times, most recently from 3255bde to 1dda388 on February 18, 2025 00:11
@svennam92 (Collaborator) left a comment

LGTM thank you for the PR!

@svennam92 (Collaborator)

@natherz97 good to merge?

@natherz97 (Contributor, Author)

@svennam92 Waiting for a sign-off from @mengqiy, then we're good to merge.

@mengqiy (Contributor) left a comment

LGTM
Thanks!

@mengqiy (Contributor) commented Feb 19, 2025

@svennam92 This is ready to go

@svennam92 merged commit 261caa5 into aws:master on Feb 20, 2025
1 check passed