Networking issue produces CrashLoopBackOff #3214

Closed · TannerGabriel opened this issue Apr 17, 2021 · 8 comments

@TannerGabriel commented Apr 17, 2021

Environmental Info:
K3s Version: v1.20.5+k3s1

Node(s) CPU architecture, OS, and Version:

  • 4 Nodes (two masters, two workers)
  • All nodes are Proxmox VMs running Ubuntu Server
  • Linux kube-master-01 5.8.0-50-generic #56-Ubuntu SMP Mon Apr 12 17:18:36 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration:

  • HA deployment using 2 masters and 2 workers
  • Nginx load balancer in front of the masters
  • MySQL Database

Describe the bug:

I have set up K3S on four Proxmox VMs and am experiencing a networking issue similar to #24 that is producing a CrashLoopBackOff for multiple pods, including:

  • helm-install-traefik
  • kubernetes-dashboard

It seems that networking to the cluster-internal ranges (the 10.42.0.0/16 pod CIDR and the 10.43.0.0/16 service CIDR) is not working. Take the following error message for example:

panic: Get https://10.43.0.1:443/api/v1/namespaces/kubernetes-dashboard/secrets/kubernetes-dashboard-csrf: dial tcp 10.43.0.1:443: i/o timeout
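
For example, the symptom can be reproduced from inside the cluster by making a request to the service IP from a throwaway pod (a sketch; the curlimages/curl image and the 5-second timeout are arbitrary choices, not part of my original setup):

# Run a temporary pod and curl the in-cluster API service IP
kubectl run curl-test --rm -it --restart=Never --image=curlimages/curl -- \
  curl -k -m 5 https://10.43.0.1:443/version
# On a healthy cluster this prints the API server version JSON; here it times out instead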

Steps To Reproduce:

  • Set up an Nginx load balancer in front of the two master nodes:
events {}

stream {
  upstream k3s_servers {
    server IP-Address1:6443;
    server IP-Address2:6443;
  }

  server {
    listen 6443;
    proxy_pass k3s_servers;
  }
}
  • Set up the MySQL database using Docker Compose (see the sketch after these steps)
  • Install K3s:

Masters:

Installing K3s:

export K3S_DATASTORE_ENDPOINT='mysql://user:password@tcp(LOAD_BALANCER_IP:3306)/DATABASE_NAME'
curl -sfL https://get.k3s.io | sh -s - server --node-taint CriticalAddonsOnly=true:NoExecute --tls-san LOAD_BALANCER_IP

Getting Token:

sudo cat /var/lib/rancher/k3s/server/node-token

Worker:

curl -sfL https://get.k3s.io | K3S_URL=https://LOAD_BALANCER_IP:6443 K3S_TOKEN=TOKEN sh -
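
For completeness, the MySQL datastore mentioned in the steps above was deployed with Docker Compose; an equivalent single docker run command looks roughly like this (a sketch with placeholder credentials, not the exact compose file used):

docker run -d --name k3s-datastore \
  -e MYSQL_ROOT_PASSWORD=CHANGE_ME \
  -e MYSQL_DATABASE=DATABASE_NAME \
  -e MYSQL_USER=user \
  -e MYSQL_PASSWORD=password \
  -p 3306:3306 \
  mysql:5.7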

Expected behavior:

All pods and deployments should start without going into a CrashLoopBackOff.

Actual behavior:

Deployments experience networking issues on 10.42.0.0/16 addresses and therefore go into a CrashLoopBackOff.

Additional context / logs:

Executing systemctl status k3s on the master node:

k3s.service - Lightweight Kubernetes
     Loaded: loaded (/etc/systemd/system/k3s.service; enabled; vendor preset: enabled)
     Active: active (running) since Sat 2021-04-17 15:11:56 UTC; 3h 42min ago


Apr 17 18:54:04 kube-master-02 k3s[41792]: E0417 18:54:04.891299   41792 status.go:71] apiserver received an error that is not an metav1.Status: &fmt.wrapError{msg:"error trying to reach service: context canceled", err:(*errors.errorString)(0xc0001121e0)}
Apr 17 18:54:04 kube-master-02 k3s[41792]: E0417 18:54:04.891979   41792 status.go:71] apiserver received an error that is not an metav1.Status: &fmt.wrapError{msg:"error trying to reach service: context canceled", err:(*errors.errorString)(0xc0001121e0)}
Apr 17 18:54:06 kube-master-02 k3s[41792]: E0417 18:54:06.343671   41792 available_controller.go:508] v1beta1.metrics.k8s.io failed with: failing or missing response from https://10.43.66.183:443/apis/metrics.k8s.io/v1beta1: Get "https://10.43.66.183:443/apis/metrics.k8s.io/v1beta1": net/http: request canceled while waiting for connection (Client.Timeout excee>
Apr 17 18:54:07 kube-master-02 k3s[41792]: E0417 18:54:07.993695   41792 status.go:71] apiserver received an error that is not an metav1.Status: &fmt.wrapError{msg:"error trying to reach service: context canceled", err:(*errors.errorString)(0xc0001121e0)}
Apr 17 18:54:07 kube-master-02 k3s[41792]: E0417 18:54:07.996675   41792 status.go:71] apiserver received an error that is not an metav1.Status: &fmt.wrapError{msg:"error trying to reach service: context canceled", err:(*errors.errorString)(0xc0001121e0)}
Apr 17 18:54:08 kube-master-02 k3s[41792]: E0417 18:54:08.005642   41792 status.go:71] apiserver received an error that is not an metav1.Status: &fmt.wrapError{msg:"error trying to reach service: context canceled", err:(*errors.errorString)(0xc0001121e0)}
Apr 17 18:54:08 kube-master-02 k3s[41792]: E0417 18:54:08.037385   41792 status.go:71] apiserver received an error that is not an metav1.Status: &fmt.wrapError{msg:"error trying to reach service: context canceled", err:(*errors.errorString)(0xc0001121e0)}
Apr 17 18:54:11 kube-master-02 k3s[41792]: E0417 18:54:11.351801   41792 available_controller.go:508] v1beta1.metrics.k8s.io failed with: failing or missing response from https://10.43.66.183:443/apis/metrics.k8s.io/v1beta1: Get "https://10.43.66.183:443/apis/metrics.k8s.io/v1beta1": context deadline exceeded

Executing sudo systemctl status k3s-agent on the worker node:

● k3s-agent.service - Lightweight Kubernetes
     Loaded: loaded (/etc/systemd/system/k3s-agent.service; enabled; vendor preset: enabled)
     Active: active (running) since Sat 2021-04-17 15:03:14 UTC; 3h 52min ago

Apr 17 18:51:40 kube-worker-01 k3s[51900]: I0417 18:51:40.335412   51900 scope.go:111] [topologymanager] RemoveContainer - Container ID: 4e6588b753491c8d8010b643ce87d0a7107e673102008dd975b4bbc58eb8a138
Apr 17 18:51:40 kube-worker-01 k3s[51900]: E0417 18:51:40.335717   51900 pod_workers.go:191] Error syncing pod d93d36cf-7026-4539-9864-362617a97d0b ("helm-install-traefik-cg94z_kube-system(d93d36cf-7026-4539-9864-362617a97d0b)"), skipping: failed to "StartContainer" for "helm" with CrashLoopBackOff: "back-off 5m0s restarting failed container=helm pod=helm-inst>
Apr 17 18:51:55 kube-worker-01 k3s[51900]: I0417 18:51:55.335375   51900 scope.go:111] [topologymanager] RemoveContainer - Container ID: 4e6588b753491c8d8010b643ce87d0a7107e673102008dd975b4bbc58eb8a138
Apr 17 18:51:55 kube-worker-01 k3s[51900]: E0417 18:51:55.336290   51900 pod_workers.go:191] Error syncing pod d93d36cf-7026-4539-9864-362617a97d0b ("helm-install-traefik-cg94z_kube-system(d93d36cf-7026-4539-9864-362617a97d0b)"), skipping: failed to "StartContainer" for "helm" with CrashLoopBackOff: "back-off 5m0s restarting failed container=helm pod=helm-inst>
Apr 17 18:52:07 kube-worker-01 k3s[51900]: I0417 18:52:07.335382   51900 scope.go:111] [topologymanager] RemoveContainer - Container ID: 4e6588b753491c8d8010b643ce87d0a7107e673102008dd975b4bbc58eb8a138
Apr 17 18:55:19 kube-worker-01 k3s[51900]: I0417 18:55:19.171453   51900 scope.go:111] [topologymanager] RemoveContainer - Container ID: 4e6588b753491c8d8010b643ce87d0a7107e673102008dd975b4bbc58eb8a138
Apr 17 18:55:19 kube-worker-01 k3s[51900]: I0417 18:55:19.172197   51900 scope.go:111] [topologymanager] RemoveContainer - Container ID: 02877492c86e56f6c8806b4a89b6feb798f9a42795ffc85437d9bb9d5d3c37b7
Apr 17 18:55:19 kube-worker-01 k3s[51900]: E0417 18:55:19.172435   51900 pod_workers.go:191] Error syncing pod d93d36cf-7026-4539-9864-362617a97d0b ("helm-install-traefik-cg94z_kube-system(d93d36cf-7026-4539-9864-362617a97d0b)"), skipping: failed to "StartContainer" for "helm" with CrashLoopBackOff: "back-off 5m0s restarting failed container=helm pod=helm-inst>
Apr 17 18:55:33 kube-worker-01 k3s[51900]: I0417 18:55:33.335224   51900 scope.go:111] [topologymanager] RemoveContainer - Container ID: 02877492c86e56f6c8806b4a89b6feb798f9a42795ffc85437d9bb9d5d3c37b7
Apr 17 18:55:33 kube-worker-01 k3s[51900]: E0417 18:55:33.335533   51900 pod_workers.go:191] Error syncing pod d93d36cf-7026-4539-9864-362617a97d0b ("helm-install-traefik-cg94z_kube-system(d93d36cf-7026-4539-9864-362617a97d0b)"), skipping: failed to "StartContainer" for "helm" with CrashLoopBackOff: "back-off 5m0s restarting failed container=helm pod=helm-inst

Solutions I have tried

I have already looked at issue #24 and deleted all iptables rules, but it did not fix my specific problem.
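
For reference, the iptables cleanup was roughly the following (a sketch; the exact steps suggested in #24 may differ, and K3s recreates its own rules after a service restart):

# Flush all rules and delete custom chains, then let K3s rebuild what it needs
sudo iptables -F
sudo iptables -t nat -F
sudo iptables -t mangle -F
sudo iptables -X
sudo systemctl restart k3s      # k3s-agent on the worker nodes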

@brandond (Contributor) commented Apr 19, 2021

What address range are you using for these VMs? Do the VMs have more than one interface? Do you see the same result if you stop and disable firewalld before installing K3s?
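
For reference, those checks can be run roughly as follows (a sketch; firewalld is often not installed at all on Ubuntu Server):

ip -br addr show                          # list interfaces and their addresses
systemctl status firewalld                # see whether firewalld is installed and running
sudo systemctl disable --now firewalld    # stop and disable it if it is active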

@TannerGabriel (Author)

The VMs are running in the 192.168.88.0/24 IP address range. Regarding the interfaces, I only see one network interface in the Proxmox UI. Still, here is the output of the ip link show command:

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: ens18: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
    link/ether b6:25:9e:07:7a:3e brd ff:ff:ff:ff:ff:ff
    altname enp0s18
3: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default
    link/ether 02:42:81:a7:ee:5e brd ff:ff:ff:ff:ff:ff
4: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN mode DEFAULT group default
    link/ether ea:a7:cb:12:be:9f brd ff:ff:ff:ff:ff:ff
5: cni0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether e2:b5:6b:86:52:3f brd ff:ff:ff:ff:ff:ff
6: veth2cbd105c@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP mode DEFAULT group default
    link/ether ce:f0:20:e9:59:16 brd ff:ff:ff:ff:ff:ff link-netns cni-d49e98ea-f152-280b-09ca-a20d35356a28
7: veth592fd492@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP mode DEFAULT group default
    link/ether 52:80:d5:fb:53:5e brd ff:ff:ff:ff:ff:ff link-netns cni-2c547ec2-5b9c-3af6-afd5-193021f8743d
8: veth3787916b@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP mode DEFAULT group default
    link/ether 66:57:42:72:d0:b1 brd ff:ff:ff:ff:ff:ff link-netns cni-7ec7f577-2421-2ef2-41fe-c58651b6fcf8
9: veth334e49ad@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP mode DEFAULT group default
    link/ether 76:7a:df:55:a6:13 brd ff:ff:ff:ff:ff:ff link-netns cni-d41ccf11-1f33-f70d-80a8-b7a556bab1eb

Also, firewalld does not seem to be installed by default on Ubuntu Server, and I have not installed it manually, so it was never enabled. I also checked other firewalls such as ufw, as well as iptables, but they were all disabled or not installed.

@brandond (Contributor)

Can you attach the full k3s and k3s-agent service logs, as well as the output of k3s check-config on both nodes?
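
For reference, those can be collected roughly like this (a sketch; the output file names are arbitrary):

# On a master node
sudo journalctl -u k3s --no-pager > k3s.log
sudo k3s check-config > k3s-check-config.txt

# On a worker node
sudo journalctl -u k3s-agent --no-pager > k3s-agent.log
sudo k3s check-config > k3s-agent-check-config.txt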

@TannerGabriel (Author)

I have uploaded the logs as attached files since they are pretty long.

@kriansa commented Apr 26, 2021

I'm having this issue under these conditions:

  • I'm using multiple interfaces (3)
  • I'm installing with the --bind-address option (curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="--bind-address 192.168.10.11" sh -)

If I install it without that option, it works.

I'm using Debian without a firewall.

@kriansa commented Apr 26, 2021

After a few hours of debugging, I solved my issue with:

curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="--bind-address 192.168.10.11 --advertise-address 192.168.10.11" sh -

That flag was not needed in v1.20.1. Maybe a regression?
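
One way to verify which address the API server is advertising (and therefore what in-cluster clients such as the dashboard will try to reach through 10.43.0.1) is to inspect the default kubernetes endpoint; a sketch, assuming kubectl access on a server node:

kubectl get endpoints kubernetes -n default -o wide
# The listed address should be the node IP you expect (e.g. 192.168.10.11), not an unreachable interface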

@TannerGabriel (Author)

I tried adding --advertise-address to the install command for the K3s masters, but I'm still receiving the same error. Maybe this is because I'm using K3s version v1.20.5+k3s1.

@TannerGabriel (Author)

I got this working using a newer K3s version and will therefore close the issue.

This issue was closed.