
etcd-events "appears" unhealthy after 2.18 port change #8557

Closed
mac-chaffee opened this issue Feb 17, 2022 · 5 comments
Closed

etcd-events "appears" unhealthy after 2.18 port change #8557

mac-chaffee opened this issue Feb 17, 2022 · 5 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@mac-chaffee
Contributor

When changing the port number for etcd-events during an upgrade to release-2.18, the upgrade fails because the task that checks whether etcd is healthy queries the new port number before the change has actually taken effect:

TASK [etcd : Configure | Wait for etcd-events cluster to be healthy] *************************************************************************************************************
FAILED - RETRYING: Configure | Wait for etcd-events cluster to be healthy (4 retries left).
FAILED - RETRYING: Configure | Wait for etcd-events cluster to be healthy (3 retries left).
FAILED - RETRYING: Configure | Wait for etcd-events cluster to be healthy (2 retries left).
FAILED - RETRYING: Configure | Wait for etcd-events cluster to be healthy (1 retries left).
fatal: [k8s-node15]: FAILED! => {"attempts": 4, "changed": false, "cmd": "set -o pipefail && /usr/local/bin/etcdctl endpoint --cluster status && /usr/local/bin/etcdctl endpoint --cluster health 2>&1 | grep -v 'Error: unhealthy cluster' >/dev/null", "delta": "0:00:05.016800", "end": "2022-02-17 17:09:45.483587", "msg": "non-zero return code", "rc": 1, "start": "2022-02-17 17:09:40.466787", "stderr": "{\"level\":\"warn\",\"ts\":\"2022-02-17T17:09:45.482-0500\",\"logger\":\"etcd-client\",\"caller\":\"v3/retry_interceptor.go:62\",\"msg\":\"retrying of unary invoker failed\",\"target\":\"etcd-endpoints://0xc0001e8fc0/#initially=[https://172.25.13.115:2383;https://172.25.13.116:2383;https://172.25.13.117:2383]\",\"attempt\":0,\"error\":\"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \\\"transport: Error while dialing dial tcp 172.25.13.116:2383: connect: connection refused\\\"\"}\nError: failed to fetch endpoints from etcd cluster member list: context deadline exceeded", "stderr_lines": ["{\"level\":\"warn\",\"ts\":\"2022-02-17T17:09:45.482-0500\",\"logger\":\"etcd-client\",\"caller\":\"v3/retry_interceptor.go:62\",\"msg\":\"retrying of unary invoker failed\",\"target\":\"etcd-endpoints://0xc0001e8fc0/#initially=[https://172.25.13.115:2383;https://172.25.13.116:2383;https://172.25.13.117:2383]\",\"attempt\":0,\"error\":\"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \\\"transport: Error while dialing dial tcp 172.25.13.116:2383: connect: connection refused\\\"\"}", "Error: failed to fetch endpoints from etcd cluster member list: context deadline exceeded"], "stdout": "", "stdout_lines": []}

Related to #8232

Kubespray version (commit) (git rev-parse --short HEAD): release-2.18
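
A quick way to confirm the mismatch is to compare the port the health-check task is dialing (2383 in the log above) with the port the etcd-events member is still actually listening on. This is only a diagnostic sketch; the env-file path is an assumption based on Kubespray's usual layout and may differ in your setup:

# On one of the etcd nodes: list the ports etcd processes are actually bound to
ss -tlnp | grep etcd

# Check which client URLs etcd-events is still configured with
# (path assumed from Kubespray's etcd-events environment file; adjust if yours differs)
grep LISTEN_CLIENT_URLS /etc/etcd-events.env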

@mac-chaffee mac-chaffee added the kind/bug Categorizes issue or PR as related to a bug. label Feb 17, 2022
@mac-chaffee
Contributor Author

As a temporary fix, I commented out everything after the "Wait for etcd-events cluster to be healthy" task:

- name: Configure | Wait for etcd-events cluster to be healthy
  shell: "set -o pipefail && {{ bin_dir }}/etcdctl endpoint --cluster status && {{ bin_dir }}/etcdctl endpoint --cluster health 2>&1 | grep -v 'Error: unhealthy cluster' >/dev/null"
  args:
    executable: /bin/bash
  register: etcd_events_cluster_is_healthy
  until: etcd_events_cluster_is_healthy.rc == 0
  retries: "{{ etcd_retries }}"
  delay: "{{ retry_stagger | random + 3 }}"
  changed_when: false
  check_mode: no
  run_once: yes
  when:
    - is_etcd_master
    - etcd_events_cluster_setup
  tags:
    - facts
  environment:
    ETCDCTL_API: 3
    ETCDCTL_CERT: "{{ etcd_cert_dir }}/admin-{{ inventory_hostname }}.pem"
    ETCDCTL_KEY: "{{ etcd_cert_dir }}/admin-{{ inventory_hostname }}-key.pem"
    ETCDCTL_CACERT: "{{ etcd_cert_dir }}/ca.pem"
    ETCDCTL_ENDPOINTS: "{{ etcd_events_access_addresses }}"

- name: Configure | Check if member is in etcd cluster
  shell: "{{ bin_dir }}/etcdctl member list | grep -q {{ etcd_access_address }}"
  register: etcd_member_in_cluster
  ignore_errors: true  # noqa ignore-errors
  changed_when: false
  check_mode: no
  when: is_etcd_master and etcd_cluster_setup
  tags:
    - facts
  environment:
    ETCDCTL_API: 3
    ETCDCTL_CERT: "{{ etcd_cert_dir }}/admin-{{ inventory_hostname }}.pem"
    ETCDCTL_KEY: "{{ etcd_cert_dir }}/admin-{{ inventory_hostname }}-key.pem"
    ETCDCTL_CACERT: "{{ etcd_cert_dir }}/ca.pem"
    ETCDCTL_ENDPOINTS: "{{ etcd_access_addresses }}"

- name: Configure | Check if member is in etcd-events cluster
  shell: "{{ bin_dir }}/etcdctl member list | grep -q {{ etcd_access_address }}"
  register: etcd_events_member_in_cluster
  ignore_errors: true  # noqa ignore-errors
  changed_when: false
  check_mode: no
  when: is_etcd_master and etcd_events_cluster_setup
  tags:
    - facts
  environment:
    ETCDCTL_API: 3
    ETCDCTL_CERT: "{{ etcd_cert_dir }}/admin-{{ inventory_hostname }}.pem"
    ETCDCTL_KEY: "{{ etcd_cert_dir }}/admin-{{ inventory_hostname }}-key.pem"
    ETCDCTL_CACERT: "{{ etcd_cert_dir }}/ca.pem"
    ETCDCTL_ENDPOINTS: "{{ etcd_events_access_addresses }}"

- name: Configure | Join member(s) to etcd cluster one at a time
  include_tasks: join_etcd_member.yml
  with_items: "{{ groups['etcd'] }}"
  when: inventory_hostname == item and etcd_cluster_setup and etcd_member_in_cluster.rc != 0 and etcd_cluster_is_healthy.rc == 0

- name: Configure | Join member(s) to etcd-events cluster one at a time
  include_tasks: join_etcd-events_member.yml
  with_items: "{{ groups['etcd'] }}"
  when: inventory_hostname == item and etcd_events_cluster_setup and etcd_events_member_in_cluster.rc != 0 and etcd_events_cluster_is_healthy.rc == 0
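
Before skipping these tasks it may be worth confirming by hand that the etcd-events cluster really is healthy on the port it is still serving. A rough sketch, reusing the same certs the task exports; the cert dir shown is the Kubespray default, the hostname substitution stands in for inventory_hostname, and OLD_PORT is a placeholder for the pre-change client port:

# Run on one of the etcd nodes; set OLD_PORT to the client port etcd-events
# was using before the change (see its environment file or unit).
export ETCDCTL_API=3
export ETCDCTL_CACERT=/etc/ssl/etcd/ssl/ca.pem
export ETCDCTL_CERT=/etc/ssl/etcd/ssl/admin-$(hostname).pem
export ETCDCTL_KEY=/etc/ssl/etcd/ssl/admin-$(hostname)-key.pem
export ETCDCTL_ENDPOINTS=https://172.25.13.115:$OLD_PORT,https://172.25.13.116:$OLD_PORT,https://172.25.13.117:$OLD_PORT

# Same checks the playbook task runs, just pointed at the old port
/usr/local/bin/etcdctl endpoint --cluster status
/usr/local/bin/etcdctl endpoint --cluster health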

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 18, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 17, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
