During the initial installation of a cluster using RKE2 version 1.27.1+rke2r1 with kube-vip and Cilium, and with kube-proxy disabled, the first node is stuck in the NotReady state, preventing the cluster installation from completing.
The workaround I found:
- Connect to the first server over SSH
- Manually add the $rke2_api_ip address to the network interface: ip a a 192.0.2.20 dev ens224
- Restart the rke2 service: systemctl restart rke2-server.service
I am not sure yet why this happens; it is possibly due to kube-proxy being disabled.
Until I find the root cause, so I can identify the appropriate conditions for a proper patch, I have applied a temporary workaround by integrating it into the pre_tasks of my playbook:
---
- hosts: kubernetes_masters
  gather_facts: true
  remote_user: ubuntu
  become: true
  pre_tasks:
    # https://docs.cilium.io/en/v1.13/operations/system_requirements/#systemd-based-distributions
    - name: Do not manage foreign routes
      ansible.builtin.blockinfile:
        path: /etc/systemd/networkd.conf
        insertafter: "^\\[Network\\]"
        block: |
          ManageForeignRoutes=no
          ManageForeignRoutingPolicyRules=no
      register: networkd_patch

    - name: Force systemd to reread configs
      ansible.builtin.systemd:
        daemon_reload: true
      when: networkd_patch.changed

    # https://github.com/lablabs/ansible-role-rke2/issues/157
    - name: Check if {{ rke2_api_ip }} is pingable
      ansible.builtin.shell: "ping -c 1 {{ rke2_api_ip }}"
      register: ping_result
      ignore_errors: yes

    - name: Add the {{ rke2_api_ip }} address to the first node if no ICMP reply
      ansible.builtin.shell: "ip addr add {{ rke2_api_ip }}/32 dev {{ rke2_interface }}"
      when:
        - ping_result.failed
        - inventory_hostname == groups[rke2_servers_group_name].0

  roles:
    - ansible-role-rke2
It's a chicken-or-the-egg scenario: Cilium without kube-proxy needs to talk to the kube-apiserver, which is load-balanced by kube-vip, which in turn needs a working CNI to reach the kube-apiserver's internal Kubernetes service. As a workaround I set up Cilium with:
k8sServiceHost: <first RKE2 server's external IP>
and once the cluster is up and running you can change it back to:
k8sServiceHost: <kube-vip IP>
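For illustration, a minimal sketch of that two-step workaround expressed as Cilium Helm values (Cilium 1.13-style options; the IP addresses below are placeholders, not values taken from this issue):

# Phase 1 (bootstrap): point Cilium directly at the first server's kube-apiserver,
# since the kube-vip VIP is not reachable until a CNI is running.
kubeProxyReplacement: strict
k8sServiceHost: 192.0.2.21   # placeholder: first RKE2 server's external IP
k8sServicePort: 6443

# Phase 2 (once the cluster is up): switch back to the kube-vip address and redeploy.
# k8sServiceHost: 192.0.2.20  # placeholder: kube-vip VIP (rke2_api_ip)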
Summary
During the initial installation of a cluster using RKE2 version 1.27.1+rke2r1 with kube-vip and Cilium, and with kube-proxy disabled, the first node is stuck in the NotReady state, preventing the cluster installation from completing.
The workaround I found:
ip a a 192.0.2.20 dev ens224
systemctl restart rke2-server.service
I am not sure yet why this happens; it is possibly due to kube-proxy being disabled.
Issue Type
Bug Report
Ansible Version
Ansible 2.14.8
Steps to Reproduce
Deploy RKE2 with the following variables:
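The variable values themselves are not reproduced in this report. As an illustration only, a minimal set for this scenario could look roughly like the following; the variable names are assumed from the role's documentation and the IPs are placeholders:

rke2_version: v1.27.1+rke2r1
rke2_ha_mode: true
rke2_ha_mode_kubevip: true      # assumed: enables the kube-vip API load balancer
rke2_api_ip: 192.0.2.20         # kube-vip VIP
rke2_interface: ens224
rke2_cni: cilium
rke2_disable_kube_proxy: true   # assumed name for setting disable-kube-proxy in the RKE2 config
rke2_custom_manifests:
  - rke2-cilium-proxy.yaml      # the HelmChartConfig shown below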
Here is the content of rke2-cilium-proxy.yaml:
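The content of that file was not captured here either; a typical HelmChartConfig for running the bundled rke2-cilium chart with kube-proxy replacement, sketched under that assumption with placeholder values, would be along these lines:

apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: rke2-cilium
  namespace: kube-system
spec:
  valuesContent: |-
    kubeProxyReplacement: strict   # Cilium 1.13-style value
    k8sServiceHost: 192.0.2.20     # placeholder: kube-vip VIP (rke2_api_ip)
    k8sServicePort: 6443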
Expected Results
The first server should eventually reach the Ready state so that the installation of the cluster succeeds.
Actual Results