k8s_drain runs into a timeout when evicting a pod which is part of a stateful set #792

Closed
OttaviaB opened this issue Nov 15, 2024 · 0 comments · Fixed by #793

@OttaviaB (Contributor)

SUMMARY

k8s_drain runs into a timeout when evicting a pod which is part of a stateful set.

After eviction, the StatefulSet controller recreates the pod under the same name on a different node. Because k8s_drain checks only the pod name, not the node name, it thinks that the original pod is still running.
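To illustrate the failure mode, here is a small hypothetical check (not part of the module; the pod name web-0 and namespace default are made up) showing why a lookup by pod name alone cannot distinguish the evicted pod from its StatefulSet replacement:

# Hypothetical illustration: a StatefulSet pod keeps its name across an
# eviction; only spec.node_name changes, so a name-only lookup still succeeds.
from kubernetes import client, config

config.load_kube_config()    # same kubeconfig the playbook points at
v1 = client.CoreV1Api()

pod = v1.read_namespaced_pod(name="web-0", namespace="default")   # made-up names
print(pod.metadata.name, "is currently scheduled on", pod.spec.node_name)
# Before and after the eviction this call returns a pod with the same name;
# only comparing pod.spec.node_name with the drained node would reveal that the
# original pod is gone and a replacement is running elsewhere.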

ISSUE TYPE
  • Bug Report
COMPONENT NAME

k8s_drain

ANSIBLE VERSION
ansible [core 2.15.12]
  config file = /etc/ansible/ansible.cfg
  configured module search path = ['~/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = ~/venvs/ansible/lib64/python3.9/site-packages/ansible
  ansible collection location = ~/.ansible/collections:/usr/share/ansible/collections
  executable location = ~/venvs/ansible/bin/ansible
  python version = 3.9.19 (main, Mar 22 2024, 21:01:47) [GCC] (~/venvs/ansible/bin/python3.9)
  jinja version = 3.1.4
  libyaml = True
COLLECTION VERSION
# ~/.ansible/collections/ansible_collections
Collection                    Version
----------------------------- -------
kubernetes.core               5.0.0
CONFIGURATION
CALLBACKS_ENABLED(~/.ansible.cfg) = ['ansible.posix.profile_tasks']
CONFIG_FILE() = ~/.ansible.cfg
PAGER(env: PAGER) = less
OS / ENVIRONMENT
  • Ansible controller: SLES 15-SP5
  • Remote hosts: SLES 15-SP5
  • RKE2 Version: v1.28.10+rke2r1
STEPS TO REPRODUCE
  • Deploy a stateful set
  • Drain the node where the stateful set's pod is running
---
- hosts: "{{ target }}"
  serial: 1
  gather_facts: false
  vars:
    kubeconfig_dir: "/home/ansible/.kubeconfigs"
  tasks:
    - name: Drain node
      kubernetes.core.k8s_drain:
        kubeconfig: "{{ kubeconfig_path }}"
        name: "{{ inventory_hostname }}"
        delete_options:
          ignore_daemonsets: true
          delete_emptydir_data: true
          wait_timeout: 100
          force: true
          wait_sleep: 1
      delegate_to: localhost
EXPECTED RESULTS

k8s_drain should return as soon as the evicted pods are gone from the drained node.

ACTUAL RESULTS

k8s_drain keeps polling until the timeout is reached, even though the evicted pods are long gone, and then returns a warning.


PLAY [Test] *******************************************************************

TASK [Drain node] *****************************************************************
Donnerstag 18 Juli 2024  14:40:13 +0200 (0:00:00.051)       0:00:00.051 ******* 
[WARNING]: cannot delete mirror Pods using API server: kube-system/kube-proxy-<node_name>.
[WARNING]: timeout reached while pods were still running.
changed: [<node_name> -> localhost]

PLAY RECAP ************************************************************************
<node_name>             : ok=1    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   

Donnerstag 18 Juli 2024  14:41:55 +0200 (0:01:41.902)       0:01:41.953 ******* 
=============================================================================== 
Drain node --------------------------------------------------------------- 101.90s

This happens because the function wait_for_pod_deletion in k8s_drain never checks on which node a pod is actually running:

            try:
                response = self._api_instance.read_namespaced_pod(
                    namespace=pod[0], name=pod[1]
                )
                if not response:
                    pod = None
                time.sleep(wait_sleep)

The condition if not response is never met because the replacement pod has the same name as the evicted one, so read_namespaced_pod keeps returning a pod object.
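For reference, a minimal sketch of a node-aware wait loop that avoids this; it only illustrates the idea behind the fix and is not the actual patch from PR #793 (the function name and parameters are assumptions):

import time

from kubernetes.client.exceptions import ApiException


def wait_for_pods_gone(api, pods, node_name, wait_timeout, wait_sleep):
    # api: kubernetes.client.CoreV1Api, pods: list of (namespace, name) tuples,
    # node_name: the node currently being drained.
    deadline = time.time() + wait_timeout
    remaining = list(pods)
    while remaining and time.time() < deadline:
        namespace, name = remaining[0]
        try:
            response = api.read_namespaced_pod(namespace=namespace, name=name)
            # A pod with the same name but scheduled on another node is the
            # StatefulSet replacement, not the pod that was evicted.
            if not response or response.spec.node_name != node_name:
                remaining.pop(0)
                continue
        except ApiException as exc:
            if exc.status == 404:       # the pod really is gone
                remaining.pop(0)
                continue
            raise
        time.sleep(wait_sleep)
    return remaining                    # anything left is still on the drained node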

patchback bot pushed a commit that referenced this issue Dec 10, 2024
SUMMARY
Fixes #792.
The function wait_for_pod_deletion in k8s_drain never checks on which node a pod is actually running:
            try:
                response = self._api_instance.read_namespaced_pod(
                    namespace=pod[0], name=pod[1]
                )
                if not response:
                    pod = None
                time.sleep(wait_sleep)
This means that if a pod is successfully evicted and restarted with the same name on a new node, k8s_drain does not notice and thinks that the original pod is still running. This is the case for pods which are part of a stateful set.

ISSUE TYPE

Bugfix Pull Request

COMPONENT NAME
k8s_drain

Reviewed-by: Mike Graves <mgraves@redhat.com>
(cherry picked from commit fca0dc0)
patchback bot pushed a commit that referenced this issue Dec 10, 2024
gravesm pushed a commit that referenced this issue Dec 11, 2024
gravesm pushed a commit that referenced this issue Dec 11, 2024
softwarefactory-project-zuul bot pushed a commit that referenced this issue Dec 11, 2024
…807)

This is a backport of PR #793 as merged into main (fca0dc0).
softwarefactory-project-zuul bot pushed a commit that referenced this issue Dec 11, 2024
…808)

This is a backport of PR #793 as merged into main (fca0dc0).
softwarefactory-project-zuul bot pushed a commit that referenced this issue Jan 20, 2025
SUMMARY
Version 3.3.0 of the kubernetes.core Ansible collection comes with several improvements and bugfixes.
ISSUE TYPE

New release pull request

Changelog
Minor Changes

k8s_drain - Improve error message for pod disruption budget when draining a node (#797).

Bugfixes

helm - Helm version checks did not support RC versions. They now accept any version tags. (#745).
helm_pull - Apply no_log=True to pass_credentials to silence false positive warning. (#796).
k8s_drain - Fix k8s_drain does not wait for single pod (#769).
k8s_drain - Fix k8s_drain runs into a timeout when evicting a pod which is part of a stateful set  (#792).
kubeconfig option should not appear in module invocation log (#782).
kustomize - kustomize plugin fails with deprecation warnings (#639).
waiter - Fix waiting for daemonset when desired number of pods is 0. (#756).

ADDITIONAL INFORMATION
Collection kubernetes.core version 3.3.0 is compatible with ansible-core>=2.14.0

Reviewed-by: Alina Buzachis
Reviewed-by: Yuriy Novostavskiy
Reviewed-by: Mike Graves <mgraves@redhat.com>
This was referenced Jan 20, 2025
softwarefactory-project-zuul bot pushed a commit that referenced this issue Jan 20, 2025
SUMMARY
This release comes with the new module helm_registry_auth, improvements to the error messages in the k8s_drain module, a new parameter insecure_registry for the helm_template module, and several bug fixes.
ISSUE TYPE

New release pull request

Changelog
Minor Changes

Bump version of ansible-lint to minimum 24.7.0 (#765).
Parameter insecure_registry added to helm_template as equivalent of insecure-skip-tls-verify (#805).
connection/kubectl.py - Added an example of using the kubectl connection plugin to the documentation (#741).
k8s_drain - Improve error message for pod disruption budget when draining a node (#797).

Bugfixes

helm - Helm version checks did not support RC versions. They now accept any version tags. (#745).
helm_pull - Apply no_log=True to pass_credentials to silence false positive warning. (#796).
k8s_drain - Fix k8s_drain does not wait for single pod (#769).
k8s_drain - Fix k8s_drain runs into a timeout when evicting a pod which is part of a stateful set  (#792).
kubeconfig option should not appear in module invocation log (#782).
kustomize - kustomize plugin fails with deprecation warnings (#639).
waiter - Fix waiting for daemonset when desired number of pods is 0. (#756).

New Modules

helm_registry_auth - Helm registry authentication module

ADDITIONAL INFORMATION
Collection kubernetes.core version 3.1.0 is compatible with ansible-core>=2.15.0

Reviewed-by: Mike Graves <mgraves@redhat.com>
yurnov added a commit to yurnov/kubernetes.core that referenced this issue Jan 20, 2025