Using k3s when WiFi goes down/unavailable for sometime #5048

Closed
Shaked opened this issue Jan 31, 2022 · 6 comments

Shaked commented Jan 31, 2022

Environmental Info:
K3s Version:

k3s version v1.20.0+k3s2 (2ea6b163)
go version go1.15.5

Node(s) CPU architecture, OS, and Version:

Linux xaviershaked1 4.9.201-tegra #1 SMP PREEMPT Fri Feb 19 08:42:04 PST 2021 aarch64 aarch64 aarch64 GNU/Linux

Cluster Configuration:
single node on a Jetson Xavier:

/usr/local/bin/k3s \
    server \
        '--write-kubeconfig-mode' \
        '644' \
        '--data-dir' \
        '/xavier_ssd/var/lib/rancher/k3s' \
        '--kubelet-arg' \
        'cpu-manager-policy=static' \
        '--kubelet-arg' \
        'kube-reserved=cpu=1' \

Describe the bug:

Running k3s requires a default route. k3s is known as a great solution for Kubernetes on IoT edge devices, but IoT devices might lose connectivity for a period of time, for example because of a weak or missing WiFi signal. If the k3s service is started (sudo service k3s start) while the network is down, it crashes with the following error:

unable to find suitable network address.error='no default routes found in "/proc/net/route" or "/proc/net/ipv6_route"'. Try to set the AdvertiseAddress directly or provide a valid BindAddress to fix this

There are sort of related bugs: #1144, #14840, #1103. However:

  1. They are not solved
  2. These bugs are specific to air-gapped setups. In my case, I need to make sure k3s comes up even if WiFi/network is not available, and that it can recover when the network is back.

Currently I'm using this hack to work around the problem:

  • When WiFi is down, run:
ip link add dummy0 type dummy
ip link set dummy0 up
ip addr add 10.100.102.0/24 dev dummy0
ip route add default via 10.100.102.1 dev dummy0 metric 1000
# make sure default route is there
ip route | grep default
  • When WiFi is back:
ip link set dummy0 down
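
A minimal sketch of how the workaround above could be made safe to run repeatedly, for example at boot: it only adds the dummy interface when the IPv4 routing table has no default route. The interface name, addresses and metric mirror the commands above; ROUTE_FILE, DRY_RUN and the function names are hypothetical and only there for illustration. It checks /proc/net/route only, not the IPv6 table named in the error message.

```shell
#!/bin/sh
# Sketch only: values mirror the workaround above; ROUTE_FILE and DRY_RUN are
# hypothetical knobs added so the logic can be tried without touching the network.

ROUTE_FILE="${ROUTE_FILE:-/proc/net/route}"
DUMMY_IF="${DUMMY_IF:-dummy0}"
DUMMY_ADDR="${DUMMY_ADDR:-10.100.102.0/24}"
DUMMY_GW="${DUMMY_GW:-10.100.102.1}"
DRY_RUN="${DRY_RUN:-0}"

# Succeed if the table (in /proc/net/route format) has an IPv4 default route,
# i.e. a row whose Destination ($2) and Mask ($8) columns are both 00000000.
has_default_route() {
    awk 'NR > 1 && $2 == "00000000" && $8 == "00000000" { found = 1 }
         END { exit !found }' "$1"
}

# Run a command, or only print it when DRY_RUN=1.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Add the dummy interface and a high-metric default route, but only when no
# default route exists yet.
ensure_fallback_route() {
    if has_default_route "$ROUTE_FILE"; then
        return 0
    fi
    run ip link add "$DUMMY_IF" type dummy &&
    run ip link set "$DUMMY_IF" up &&
    run ip addr add "$DUMMY_ADDR" dev "$DUMMY_IF" &&
    run ip route add default via "$DUMMY_GW" dev "$DUMMY_IF" metric 1000
}
```

Calling ensure_fallback_route with DRY_RUN=1 prints the ip commands it would run; without it, the commands need root. Whether k3s then starts cleanly still depends on the behaviour described in this issue.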

Notes:

Steps To Reproduce:

  • Installed K3s:
/usr/local/bin/k3s \
    server \
        '--write-kubeconfig-mode' \
        '644' \
        '--data-dir' \
        '/xavier_ssd/var/lib/rancher/k3s' \
        '--kubelet-arg' \
        'cpu-manager-policy=static' \
        '--kubelet-arg' \
        'kube-reserved=cpu=1' \
  • Stop WiFi
  • Check the default route: ip route | grep default (the output should be empty)
  • Try starting k3s: sudo service k3s restart

Expected behavior:

k3s should work when network/WiFi is not available. IoT devices experience connectivity issues as part of normal use, and k3s should be able to handle this.

Actual behavior:
k3s requires a default route which doesn't exist when the network is down.

Member

brandond commented Feb 2, 2022

This is not K3s specific; it's a limitation of Kubernetes. See the existing discussion at #1144

brandond closed this as completed Feb 2, 2022
Author

Shaked commented Feb 2, 2022

> This is not K3s specific; it's a limitation of Kubernetes. See the existing discussion at #1144

@brandond #1144 is a bit different. In this case I need a way to recover once the WiFi connection is available.

I do understand that it’s a k8s limitation but k3s describes itself as a good solution for IoT so I assumed that there would be a way to fix this, even if it’s a patchy way as suggested in #1144 - which didn’t work in my case.

Member

brandond commented Feb 2, 2022

Hmm. Is the issue that you are left without a default route for a while... or that you are experiencing some secondary issues caused by changes to the network such as receiving a new address on your host from DHCP? In my experience the default route is only necessary on startup. Anything that happens after the node is up is probably more related to addresses changing, DNS timing out, etc.

Author

Shaked commented Feb 3, 2022

This issue occurs when WiFi is down while the IoT device is booting, i.e. when k3s is started automatically after boot. At that point there’s no default route until WiFi is available again.

As for the second part that you mentioned, I opened a different bug about it, rancher/rancher#34601, but for some reason it was closed and I haven’t solved it.

Member

brandond commented Feb 3, 2022

Ah OK. So the first part is definitely the issue I linked to. You might be able to fuss around a bit and configure a dummy interface with a higher-cost default route on it, but the best way to do that is probably distro-specific.

The second issue is going to be related as well, and not something Rancher can fix. Kubernetes is mostly designed for datacenter use. We do a fair bit of tweaking to make it work on edge nodes, but there are some bits of it that just aren't really intended to handle dynamic addresses and major network topology changes; a lot of the time the answer is just going to be that things need to be restarted.
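
For the dummy-interface suggestion above, one possible way to hook it into startup on a systemd-based distro is a drop-in for the k3s unit. This is a sketch under assumptions, not something confirmed in this thread: it assumes k3s runs as a systemd unit named k3s.service, and the drop-in file name and the helper script path are hypothetical.

```shell
#!/bin/sh
# Sketch only: assumes a systemd-managed k3s.service. The drop-in file name
# (10-fallback-route.conf) and the helper script path are hypothetical.

# Write a drop-in that runs a helper script before k3s starts.
#   $1 = drop-in directory, $2 = absolute path of the helper script
write_k3s_dropin() {
    dropin_dir="$1"
    helper="$2"
    mkdir -p "$dropin_dir" || return 1
    cat > "$dropin_dir/10-fallback-route.conf" <<EOF
[Service]
ExecStartPre=$helper
EOF
}
```

Usage would look like write_k3s_dropin /etc/systemd/system/k3s.service.d /usr/local/bin/ensure-fallback-route.sh followed by systemctl daemon-reload, where the helper is whatever script sets up the dummy interface and its higher-cost default route.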

Author

Shaked commented Feb 3, 2022

> Ah OK. So the first part is definitely the issue I linked to. You might be able to fuss around a bit and configure a dummy interface with a higher-cost default route on it, but the best way to do that is probably distro-specific.

OK my current hacky solution is:

ip link add dummy1 type dummy
ip link set dummy1 up
ip -c address add 10.100.103.1/16 dev dummy1
ip route add default via 10.100.103.1 dev dummy1 metric 1000

WiFi has a static IP address, 10.10.102.17, and its default route is via 10.10.102.1.

These are the steps I did:

  1. sudo service k3s stop
  2. Stop WiFi
  3. Create dummy1 as described above
  4. sudo service k3s start
  5. kubectl logs -f -n kube-system coredns...

At this point I see this error:

Error from server: Get "https://xaviershaked1:10250/containerLogs/kube-system/coredns-854c77959c-fgdmv/coredns?follow=true": dial tcp 127.0.0.1:10250: connect: connection refused

  6. Put back WiFi
  7. sudo systemctl restart networking

At this point everything works again.

What I don't understand is:

  • I'd expect everything to work in step 5 even though there's no WiFi. Why doesn't it? This is critical.
  • I'd expect everything to work in step 6 once WiFi is back, but I must restart networking as in step 7. Is there any way to avoid this? This is probably related to the second issue and less critical, although it would be great to solve it automatically somehow.
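
For the recovery side, a small sketch of how one might detect that WiFi is back, by checking which interface holds the lowest-metric default route. The function name is hypothetical, and wlan0 in the comments is an assumed WiFi interface name that does not appear in this thread. It only reports the preferred interface; it does not remove the need for the restart in step 7.

```shell
#!/bin/sh
# Sketch only: preferred_default_dev is a hypothetical helper; wlan0 below is
# an assumed WiFi interface name.

# Read "ip route show default" output on stdin and print the device of the
# default route with the lowest metric (a missing metric counts as 0).
preferred_default_dev() {
    awk '$1 == "default" {
        dev = ""; metric = 0
        for (i = 2; i < NF; i++) {
            if ($i == "dev") dev = $(i + 1)
            if ($i == "metric") metric = $(i + 1) + 0
        }
        if (best == "" || metric < best_metric) { best = dev; best_metric = metric }
    }
    END { if (best != "") print best }'
}

# Possible use (needs root for the second command):
#   if [ "$(ip route show default | preferred_default_dev)" != "dummy1" ]; then
#       ip link set dummy1 down
#   fi
```

With the metric 1000 used for dummy1 above, any real default route with a lower metric would be reported instead of the dummy interface.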
