k8s_drain runs into a timeout when evicting a pod which is part of a stateful set #792

Closed
OttaviaB opened this issue Nov 15, 2024 · 0 comments · Fixed by #793

@OttaviaB (Contributor)

SUMMARY

k8s_drain runs into a timeout when evicting a pod which is part of a stateful set.

After eviction, the StatefulSet controller recreates the pod under the same name on a different node. Because k8s_drain checks only the pod name, not the node name, it thinks that the original pod is still running.
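To illustrate the failure mode, here is a small hypothetical check (not part of the module; the pod name web-0 and namespace default are made up) showing why a lookup by pod name alone cannot distinguish the evicted pod from its StatefulSet replacement:

# Hypothetical illustration: a StatefulSet pod keeps its name across an
# eviction; only spec.node_name changes, so a name-only lookup still succeeds.
from kubernetes import client, config

config.load_kube_config()    # same kubeconfig the playbook points at
v1 = client.CoreV1Api()

pod = v1.read_namespaced_pod(name="web-0", namespace="default")   # made-up names
print(pod.metadata.name, "is currently scheduled on", pod.spec.node_name)
# Before and after the eviction this call returns a pod with the same name;
# only comparing pod.spec.node_name with the drained node would reveal that the
# original pod is gone and a replacement is running elsewhere.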

ISSUE TYPE
  • Bug Report
COMPONENT NAME

k8s_drain

ANSIBLE VERSION
ansible [core 2.15.12]
  config file = /etc/ansible/ansible.cfg
  configured module search path = ['~/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = ~/venvs/ansible/lib64/python3.9/site-packages/ansible
  ansible collection location = ~/.ansible/collections:/usr/share/ansible/collections
  executable location = ~/venvs/ansible/bin/ansible
  python version = 3.9.19 (main, Mar 22 2024, 21:01:47) [GCC] (~/venvs/ansible/bin/python3.9)
  jinja version = 3.1.4
  libyaml = True
COLLECTION VERSION
# ~/.ansible/collections/ansible_collections
Collection                    Version
----------------------------- -------
kubernetes.core               5.0.0
CONFIGURATION
CALLBACKS_ENABLED(~/.ansible.cfg) = ['ansible.posix.profile_tasks']
CONFIG_FILE() = ~/.ansible.cfg
PAGER(env: PAGER) = less
OS / ENVIRONMENT
  • Ansible controller: SLES 15-SP5
  • Remote hosts: SLES 15-SP5
  • RKE2 Version: v1.28.10+rke2r1
STEPS TO REPRODUCE
  • Deploy a stateful set
  • Drain the node where the stateful set's pod is running
---
- hosts: "{{ target }}"
  serial: 1
  gather_facts: false
  vars:
    kubeconfig_dir: "/home/ansible/.kubeconfigs"
  tasks:
    - name: Drain node
      kubernetes.core.k8s_drain:
        kubeconfig: "{{ kubeconfig_path }}"
        name: "{{ inventory_hostname }}"
        delete_options:
          ignore_daemonsets: true
          delete_emptydir_data: true
          wait_timeout: 100
          force: true
          wait_sleep: 1
      delegate_to: localhost
EXPECTED RESULTS

k8s_drain should return as soon as the evicted pods are gone from the drained node.

ACTUAL RESULTS

k8s_drain keeps polling until the timeout is reached, even though the evicted pods are long gone, and then returns a warning.


PLAY [Test] *******************************************************************

TASK [Drain node] *****************************************************************
Donnerstag 18 Juli 2024  14:40:13 +0200 (0:00:00.051)       0:00:00.051 ******* 
[WARNING]: cannot delete mirror Pods using API server: kube-system/kube-proxy-<node_name>.
[WARNING]: timeout reached while pods were still running.
changed: [<node_name> -> localhost]

PLAY RECAP ************************************************************************
<node_name>             : ok=1    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   

Donnerstag 18 Juli 2024  14:41:55 +0200 (0:01:41.902)       0:01:41.953 ******* 
=============================================================================== 
Drain node --------------------------------------------------------------- 101.90s

This happens because the function wait_for_pod_deletion in k8s_drain never checks on which node a pod is actually running:

            try:
                response = self._api_instance.read_namespaced_pod(
                    namespace=pod[0], name=pod[1]
                )
                if not response:
                    pod = None
                time.sleep(wait_sleep)

The condition if not response is never met because the replacement pod has the same name as the evicted one, so read_namespaced_pod keeps returning a pod object.
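For reference, a minimal sketch of a node-aware wait loop that avoids this; it only illustrates the idea behind the fix and is not the actual patch from PR #793 (the function name and parameters are assumptions):

import time

from kubernetes.client.exceptions import ApiException


def wait_for_pods_gone(api, pods, node_name, wait_timeout, wait_sleep):
    # api: kubernetes.client.CoreV1Api, pods: list of (namespace, name) tuples,
    # node_name: the node currently being drained.
    deadline = time.time() + wait_timeout
    remaining = list(pods)
    while remaining and time.time() < deadline:
        namespace, name = remaining[0]
        try:
            response = api.read_namespaced_pod(namespace=namespace, name=name)
            # A pod with the same name but scheduled on another node is the
            # StatefulSet replacement, not the pod that was evicted.
            if not response or response.spec.node_name != node_name:
                remaining.pop(0)
                continue
        except ApiException as exc:
            if exc.status == 404:       # the pod really is gone
                remaining.pop(0)
                continue
            raise
        time.sleep(wait_sleep)
    return remaining                    # anything left is still on the drained node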

patchback bot pushed a commit that referenced this issue Dec 10, 2024
SUMMARY
Fixes #792.
The function wait_for_pod_deletion in k8s_drain never checks on which node a pod is actually running:
            try:
                response = self._api_instance.read_namespaced_pod(
                    namespace=pod[0], name=pod[1]
                )
                if not response:
                    pod = None
                time.sleep(wait_sleep)
This means that if a pod is successfully evicted and restarted with the same name on a new node, k8s_drain does not notice and thinks that the original pod is still running. This is the case for pods which are part of a stateful set.

ISSUE TYPE

Bugfix Pull Request

COMPONENT NAME
k8s_drain

Reviewed-by: Mike Graves <mgraves@redhat.com>
(cherry picked from commit fca0dc0)
patchback bot pushed a commit that referenced this issue Dec 10, 2024
gravesm pushed a commit that referenced this issue Dec 11, 2024
gravesm pushed a commit that referenced this issue Dec 11, 2024
softwarefactory-project-zuul bot pushed a commit that referenced this issue Dec 11, 2024
…807)

This is a backport of PR #793 as merged into main (fca0dc0).
softwarefactory-project-zuul bot pushed a commit that referenced this issue Dec 11, 2024
…808)

This is a backport of PR #793 as merged into main (fca0dc0).
softwarefactory-project-zuul bot pushed a commit that referenced this issue Jan 20, 2025
SUMMARY
Version 3.3.0 of the kubernetes.core Ansible collection comes with several improvements and bugfixes.
ISSUE TYPE

New release pull request

Changelog
Minor Changes

k8s_drain - Improve error message for pod disruption budget when draining a node (#797).

Bugfixes

helm - Helm version checks did not support RC versions. They now accept any version tags. (#745).
helm_pull - Apply no_log=True to pass_credentials to silence false positive warning. (#796).
k8s_drain - Fix k8s_drain does not wait for single pod (#769).
k8s_drain - Fix k8s_drain runs into a timeout when evicting a pod which is part of a stateful set  (#792).
kubeconfig option should not appear in module invocation log (#782).
kustomize - kustomize plugin fails with deprecation warnings (#639).
waiter - Fix waiting for daemonset when desired number of pods is 0. (#756).

ADDITIONAL INFORMATION
Collection kubernetes.core version 3.3.0 is compatible with ansible-core>=2.14.0

Reviewed-by: Alina Buzachis
Reviewed-by: Yuriy Novostavskiy
Reviewed-by: Mike Graves <mgraves@redhat.com>
This was referenced Jan 20, 2025
softwarefactory-project-zuul bot pushed a commit that referenced this issue Jan 20, 2025
SUMMARY
This release comes with the new module helm_registry_auth, improvements to the error messages in the k8s_drain module, a new parameter insecure_registry for the helm_template module, and several bug fixes.
ISSUE TYPE

New release pull request

Changelog
Minor Changes

Bump version of ansible-lint to minimum 24.7.0 (#765).
Parameter insecure_registry added to helm_template as equivalent of insecure-skip-tls-verify (#805).
connection/kubectl.py - Added an example of using the kubectl connection plugin to the documentation (#741).
k8s_drain - Improve error message for pod disruption budget when draining a node (#797).

Bugfixes

helm - Helm version checks did not support RC versions. They now accept any version tags. (#745).
helm_pull - Apply no_log=True to pass_credentials to silence false positive warning. (#796).
k8s_drain - Fix k8s_drain does not wait for single pod (#769).
k8s_drain - Fix k8s_drain runs into a timeout when evicting a pod which is part of a stateful set  (#792).
kubeconfig option should not appear in module invocation log (#782).
kustomize - kustomize plugin fails with deprecation warnings (#639).
waiter - Fix waiting for daemonset when desired number of pods is 0. (#756).

New Modules

helm_registry_auth - Helm registry authentication module

ADDITIONAL INFORMATION
Collection kubernetes.core version 3.1.0 is compatible with ansible-core>=2.15.0

Reviewed-by: Mike Graves <mgraves@redhat.com>
yurnov added a commit to yurnov/kubernetes.core that referenced this issue Jan 20, 2025