Networking issue produces CrashLoopBackOff #3214

Closed · TannerGabriel opened this issue Apr 17, 2021 · 8 comments

@TannerGabriel commented Apr 17, 2021

Environmental Info:
K3s Version: v1.20.5+k3s1

Node(s) CPU architecture, OS, and Version:

  • 4 Nodes (two masters, two workers)
  • All nodes are Proxmox VMs running Ubuntu Server
  • Linux kube-master-01 5.8.0-50-generic #56-Ubuntu SMP Mon Apr 12 17:18:36 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration:

  • HA deployment using 2 masters and 2 workers
  • Nginx load balancer in front of the masters
  • MySQL Database

Describe the bug:

I have set up K3S on four Proxmox VMs and am experiencing a networking issue similar to #24 that is producing a CrashLoopBackOff for multiple pods, including:

  • helm-install-traefik
  • kubernetes-dashboard

It seems that networking to the cluster-internal ranges (the 10.42.0.0/16 pod CIDR and the 10.43.0.0/16 service CIDR) is not working. Take the following error message for example:

panic: Get https://10.43.0.1:443/api/v1/namespaces/kubernetes-dashboard/secrets/kubernetes-dashboard-csrf: dial tcp 10.43.0.1:443: i/o timeout
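
For example, the symptom can be reproduced from inside the cluster by making a request to the service IP from a throwaway pod (a sketch; the curlimages/curl image and the 5-second timeout are arbitrary choices, not part of my original setup):

# Run a temporary pod and curl the in-cluster API service IP
kubectl run curl-test --rm -it --restart=Never --image=curlimages/curl -- \
  curl -k -m 5 https://10.43.0.1:443/version
# On a healthy cluster this prints the API server version JSON; here it times out instead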

Steps To Reproduce:

  • Set up an Nginx load balancer in front of the two master nodes:
events {}

stream {
  upstream k3s_servers {
    server IP-Address1:6443;
    server IP-Address2:6443;
  }

  server {
    listen 6443;
    proxy_pass k3s_servers;
  }
}
  • Set up the MySQL database using Docker Compose (see the sketch after these steps)
  • Install K3s:

Masters:

Installing K3s:

export K3S_DATASTORE_ENDPOINT='mysql://user:password@tcp(LOAD_BALANCER_IP:3306)/DATABASE_NAME'
curl -sfL https://get.k3s.io | sh -s - server --node-taint CriticalAddonsOnly=true:NoExecute --tls-san LOAD_BALANCER_IP

Getting Token:

sudo cat /var/lib/rancher/k3s/server/node-token

Worker:

curl -sfL https://get.k3s.io | K3S_URL=https://LOAD_BALANCER_IP:6443 K3S_TOKEN=TOKEN sh -
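
For completeness, the MySQL datastore mentioned in the steps above was deployed with Docker Compose; an equivalent single docker run command looks roughly like this (a sketch with placeholder credentials, not the exact compose file used):

docker run -d --name k3s-datastore \
  -e MYSQL_ROOT_PASSWORD=CHANGE_ME \
  -e MYSQL_DATABASE=DATABASE_NAME \
  -e MYSQL_USER=user \
  -e MYSQL_PASSWORD=password \
  -p 3306:3306 \
  mysql:5.7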

Expected behavior:

All pods and deployments should start without going into a CrashLoopBackOff.

Actual behavior:

Deployments experience networking issues on 10.42.0.0/16 addresses and therefore go into a CrashLoopBackOff.

Additional context / logs:

Executing systemctl status k3s on the master node:

k3s.service - Lightweight Kubernetes
     Loaded: loaded (/etc/systemd/system/k3s.service; enabled; vendor preset: enabled)
     Active: active (running) since Sat 2021-04-17 15:11:56 UTC; 3h 42min ago


Apr 17 18:54:04 kube-master-02 k3s[41792]: E0417 18:54:04.891299   41792 status.go:71] apiserver received an error that is not an metav1.Status: &fmt.wrapError{msg:"error trying to reach service: context canceled", err:(*errors.errorString)(0xc0001121e0)}
Apr 17 18:54:04 kube-master-02 k3s[41792]: E0417 18:54:04.891979   41792 status.go:71] apiserver received an error that is not an metav1.Status: &fmt.wrapError{msg:"error trying to reach service: context canceled", err:(*errors.errorString)(0xc0001121e0)}
Apr 17 18:54:06 kube-master-02 k3s[41792]: E0417 18:54:06.343671   41792 available_controller.go:508] v1beta1.metrics.k8s.io failed with: failing or missing response from https://10.43.66.183:443/apis/metrics.k8s.io/v1beta1: Get "https://10.43.66.183:443/apis/metrics.k8s.io/v1beta1": net/http: request canceled while waiting for connection (Client.Timeout excee>
Apr 17 18:54:07 kube-master-02 k3s[41792]: E0417 18:54:07.993695   41792 status.go:71] apiserver received an error that is not an metav1.Status: &fmt.wrapError{msg:"error trying to reach service: context canceled", err:(*errors.errorString)(0xc0001121e0)}
Apr 17 18:54:07 kube-master-02 k3s[41792]: E0417 18:54:07.996675   41792 status.go:71] apiserver received an error that is not an metav1.Status: &fmt.wrapError{msg:"error trying to reach service: context canceled", err:(*errors.errorString)(0xc0001121e0)}
Apr 17 18:54:08 kube-master-02 k3s[41792]: E0417 18:54:08.005642   41792 status.go:71] apiserver received an error that is not an metav1.Status: &fmt.wrapError{msg:"error trying to reach service: context canceled", err:(*errors.errorString)(0xc0001121e0)}
Apr 17 18:54:08 kube-master-02 k3s[41792]: E0417 18:54:08.037385   41792 status.go:71] apiserver received an error that is not an metav1.Status: &fmt.wrapError{msg:"error trying to reach service: context canceled", err:(*errors.errorString)(0xc0001121e0)}
Apr 17 18:54:11 kube-master-02 k3s[41792]: E0417 18:54:11.351801   41792 available_controller.go:508] v1beta1.metrics.k8s.io failed with: failing or missing response from https://10.43.66.183:443/apis/metrics.k8s.io/v1beta1: Get "https://10.43.66.183:443/apis/metrics.k8s.io/v1beta1": context deadline exceeded

Executing sudo systemctl status k3s-agent on the worker node:

● k3s-agent.service - Lightweight Kubernetes
     Loaded: loaded (/etc/systemd/system/k3s-agent.service; enabled; vendor preset: enabled)
     Active: active (running) since Sat 2021-04-17 15:03:14 UTC; 3h 52min ago

Apr 17 18:51:40 kube-worker-01 k3s[51900]: I0417 18:51:40.335412   51900 scope.go:111] [topologymanager] RemoveContainer - Container ID: 4e6588b753491c8d8010b643ce87d0a7107e673102008dd975b4bbc58eb8a138
Apr 17 18:51:40 kube-worker-01 k3s[51900]: E0417 18:51:40.335717   51900 pod_workers.go:191] Error syncing pod d93d36cf-7026-4539-9864-362617a97d0b ("helm-install-traefik-cg94z_kube-system(d93d36cf-7026-4539-9864-362617a97d0b)"), skipping: failed to "StartContainer" for "helm" with CrashLoopBackOff: "back-off 5m0s restarting failed container=helm pod=helm-inst>
Apr 17 18:51:55 kube-worker-01 k3s[51900]: I0417 18:51:55.335375   51900 scope.go:111] [topologymanager] RemoveContainer - Container ID: 4e6588b753491c8d8010b643ce87d0a7107e673102008dd975b4bbc58eb8a138
Apr 17 18:51:55 kube-worker-01 k3s[51900]: E0417 18:51:55.336290   51900 pod_workers.go:191] Error syncing pod d93d36cf-7026-4539-9864-362617a97d0b ("helm-install-traefik-cg94z_kube-system(d93d36cf-7026-4539-9864-362617a97d0b)"), skipping: failed to "StartContainer" for "helm" with CrashLoopBackOff: "back-off 5m0s restarting failed container=helm pod=helm-inst>
Apr 17 18:52:07 kube-worker-01 k3s[51900]: I0417 18:52:07.335382   51900 scope.go:111] [topologymanager] RemoveContainer - Container ID: 4e6588b753491c8d8010b643ce87d0a7107e673102008dd975b4bbc58eb8a138
Apr 17 18:55:19 kube-worker-01 k3s[51900]: I0417 18:55:19.171453   51900 scope.go:111] [topologymanager] RemoveContainer - Container ID: 4e6588b753491c8d8010b643ce87d0a7107e673102008dd975b4bbc58eb8a138
Apr 17 18:55:19 kube-worker-01 k3s[51900]: I0417 18:55:19.172197   51900 scope.go:111] [topologymanager] RemoveContainer - Container ID: 02877492c86e56f6c8806b4a89b6feb798f9a42795ffc85437d9bb9d5d3c37b7
Apr 17 18:55:19 kube-worker-01 k3s[51900]: E0417 18:55:19.172435   51900 pod_workers.go:191] Error syncing pod d93d36cf-7026-4539-9864-362617a97d0b ("helm-install-traefik-cg94z_kube-system(d93d36cf-7026-4539-9864-362617a97d0b)"), skipping: failed to "StartContainer" for "helm" with CrashLoopBackOff: "back-off 5m0s restarting failed container=helm pod=helm-inst>
Apr 17 18:55:33 kube-worker-01 k3s[51900]: I0417 18:55:33.335224   51900 scope.go:111] [topologymanager] RemoveContainer - Container ID: 02877492c86e56f6c8806b4a89b6feb798f9a42795ffc85437d9bb9d5d3c37b7
Apr 17 18:55:33 kube-worker-01 k3s[51900]: E0417 18:55:33.335533   51900 pod_workers.go:191] Error syncing pod d93d36cf-7026-4539-9864-362617a97d0b ("helm-install-traefik-cg94z_kube-system(d93d36cf-7026-4539-9864-362617a97d0b)"), skipping: failed to "StartContainer" for "helm" with CrashLoopBackOff: "back-off 5m0s restarting failed container=helm pod=helm-inst

Solutions I have tried

I have already looked at issue #24 and deleted all iptables rules, but it did not fix my specific problem.
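
For reference, the iptables cleanup was roughly the following (a sketch; the exact steps suggested in #24 may differ, and K3s recreates its own rules after a service restart):

# Flush all rules and delete custom chains, then let K3s rebuild what it needs
sudo iptables -F
sudo iptables -t nat -F
sudo iptables -t mangle -F
sudo iptables -X
sudo systemctl restart k3s      # k3s-agent on the worker nodes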

@brandond (Contributor) commented Apr 19, 2021

What address range are you using for these VMs? Do the VMs have more than one interface? Do you see the same result if you stop and disable firewalld before installing K3s?
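
For reference, those checks can be run roughly as follows (a sketch; firewalld is often not installed at all on Ubuntu Server):

ip -br addr show                          # list interfaces and their addresses
systemctl status firewalld                # see whether firewalld is installed and running
sudo systemctl disable --now firewalld    # stop and disable it if it is active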

@TannerGabriel (Author)

The VMs are running in the 192.168.88.0/24 IP address range. Regarding the interfaces, I only see one network interface in the Proxmox UI. Still, here is the output of the ip link show command:

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: ens18: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
    link/ether b6:25:9e:07:7a:3e brd ff:ff:ff:ff:ff:ff
    altname enp0s18
3: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default
    link/ether 02:42:81:a7:ee:5e brd ff:ff:ff:ff:ff:ff
4: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN mode DEFAULT group default
    link/ether ea:a7:cb:12:be:9f brd ff:ff:ff:ff:ff:ff
5: cni0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether e2:b5:6b:86:52:3f brd ff:ff:ff:ff:ff:ff
6: veth2cbd105c@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP mode DEFAULT group default
    link/ether ce:f0:20:e9:59:16 brd ff:ff:ff:ff:ff:ff link-netns cni-d49e98ea-f152-280b-09ca-a20d35356a28
7: veth592fd492@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP mode DEFAULT group default
    link/ether 52:80:d5:fb:53:5e brd ff:ff:ff:ff:ff:ff link-netns cni-2c547ec2-5b9c-3af6-afd5-193021f8743d
8: veth3787916b@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP mode DEFAULT group default
    link/ether 66:57:42:72:d0:b1 brd ff:ff:ff:ff:ff:ff link-netns cni-7ec7f577-2421-2ef2-41fe-c58651b6fcf8
9: veth334e49ad@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP mode DEFAULT group default
    link/ether 76:7a:df:55:a6:13 brd ff:ff:ff:ff:ff:ff link-netns cni-d41ccf11-1f33-f70d-80a8-b7a556bab1eb

Also, firewalld does not seem to be installed by default on Ubuntu Server, and I have not installed it manually, so it was never enabled. I also checked other firewalls such as ufw, as well as iptables, but they were all disabled or not installed.

@brandond (Contributor)

Can you attach the full k3s and k3s-agent service logs, as well as the output of k3s check-config on both nodes?
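
For reference, those can be collected roughly like this (a sketch; the output file names are arbitrary):

# On a master node
sudo journalctl -u k3s --no-pager > k3s.log
sudo k3s check-config > k3s-check-config.txt

# On a worker node
sudo journalctl -u k3s-agent --no-pager > k3s-agent.log
sudo k3s check-config > k3s-agent-check-config.txt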

@TannerGabriel (Author)

I have uploaded the logs as attached files since they are pretty long.

@kriansa commented Apr 26, 2021

I'm having this issue under these conditions:

  • I'm using multiple interfaces (3)
  • I'm installing with the --bind-address option (curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="--bind-address 192.168.10.11" sh -)

If I install it without that option, it works.

I'm using Debian without a firewall.

@kriansa commented Apr 26, 2021

After a few hours of debugging, I solved my issue with:

curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="--bind-address 192.168.10.11 --advertise-address 192.168.10.11" sh -

That flag was not needed in v1.20.1. Maybe a regression?
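
One way to verify which address the API server is advertising (and therefore what in-cluster clients such as the dashboard will try to reach through 10.43.0.1) is to inspect the default kubernetes endpoint; a sketch, assuming kubectl access on a server node:

kubectl get endpoints kubernetes -n default -o wide
# The listed address should be the node IP you expect (e.g. 192.168.10.11), not an unreachable interface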

@TannerGabriel (Author)

I tried adding --advertise-address to the install command for the K3s masters, but I'm still receiving the same error. Maybe this is because I'm using K3s version v1.20.5+k3s1.

@TannerGabriel (Author)

I got this working using a newer K3s version and will therefore close the issue.

This issue was closed.