eBPF tracking causes panic on Ubuntu kernel 4.4.0-119 #3131
Comments
It would also be interesting to have the last known-good version, so one can look at the list of commits that found their way in. |
Summary of the stack:
So the BPF program tries to read an eBPF map because of an open() syscall in the
The message "BUG: unable to handle kernel paging request at 0000000063222428" might be an invalid pointer in the
Maybe Given the timestamps, it seems to happen very soon after the machine boots, so probably during the initialization of Scope. |
Yes. We've had a couple of machines in a k8s cluster in reboot loops since updating to the |
I see one patch on bpf maps in that new kernel (found from here): Maybe it's a kernel patch that was badly backported to the Ubuntu 4.4 kernel? |
[ 263.736006] Modules linked in: xt_nat xt_recent ipt_REJECT nf_reject_ipv4 xt_mark binfmt_misc xt_comment ebtable_nat ebtables xt_REDIRECT nf_nat_redirect xt_tcpudp iptable_security ipt_MASQUERADE nf_nat_masquerade_ipv4 xfrm_user xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter ip_tables xt_conntrack x_tables nf_nat nf_conntrack br_netfilter bridge stp llc overlay input_leds i2c_piix4 hv_balloon 8250_fintek joydev serio_raw mac_hid ib_iser rdma_cm iw_cm ib_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear hid_generic hv_netvsc crct10dif_pclmul crc32_pclmul ghash_clmulni_intel hid_hyperv hid hv_storvsc hv_utils ptp pps_core hyperv_keyboard scsi_transport_fc aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd hyperv_fb psmouse pata_acpi floppy hv_vmbus fjes
I experienced this with my k8s cluster: every time I restarted etcd, the SSH connection would hang for a while. Sometimes it hangs even when I don't restart etcd. |
To apply the As this change will affect all nodes, you can make this change from another working node. |
@leth thank you. One thing I am not clear on is where this Weave Scope comes from. I didn't create it myself, and I don't think k8s ever created it previously. Does Weave Scope come with this kernel only? And does it only apply to k8s? |
Weave Scope is not part of the kernel; it uses the We have only seen the crash I posted under kernel Your report says "scope Not tainted". I'm not a kernel debugging expert, but this might mean that Weave Scope is not at fault? |
I tried it with just Scope, no Kubernetes or other stuff:
After:
|
The Ubuntu Xenial update to kernel 4.4.0-119.143 from 4.4.0-116.140 included a regression in the eBPF code. A basic `bpf_map_lookup_elem` call, as found in the tcptracer-bpf library used by Scope, leads to a kernel panic. As a result, Scope and the system crash during startup when the tcptracer is initialized. The Scope bug report can be found here: weaveworks#3131 To avoid crashes and fall back gracefully to procfs (as Scope already does for systems not supporting eBPF), update `isKernelSupported()` and explicitly check for Ubuntu kernel versions with the problem. Once the bug is fixed and an update is published, the `abiNumber` check in `isKernelSupported()` can and should be updated with an upper limit. The Ubuntu bug report can be found here: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1763454
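The check described in that commit message can be sketched roughly as follows. This is a minimal illustration, not Scope's actual `isKernelSupported()`: the helper name `hasKnownEBPFRegression` is hypothetical, the release string is assumed to have the `4.4.0-<abi>-<flavour>` shape seen in this report, and the sketch does not verify that the host is actually running Ubuntu.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// hasKnownEBPFRegression is a hypothetical helper (not Scope's real code).
// It reports whether a kernel release string such as "4.4.0-119-generic"
// is a 4.4.0 kernel whose ABI number is 119 or later, i.e. at or after
// the first release reported as crashing in this thread.
func hasKnownEBPFRegression(release string) (bool, error) {
	parts := strings.SplitN(release, "-", 3)
	if len(parts) < 2 || parts[0] != "4.4.0" {
		// Not a 4.4.0-<abi> release, so this check does not apply.
		return false, nil
	}
	abiNumber, err := strconv.Atoi(parts[1])
	if err != nil {
		return false, fmt.Errorf("parsing ABI number in %q: %w", release, err)
	}
	// Lower bound only: no upper limit is known until a fix is published.
	return abiNumber >= 119, nil
}

func main() {
	for _, release := range []string{"4.4.0-116-generic", "4.4.0-119-generic"} {
		affected, err := hasKnownEBPFRegression(release)
		fmt.Println(release, affected, err)
	}
}
```

A caller would skip eBPF connection tracking and use procfs scanning whenever the helper returns true.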
Pull request for a workaround (fall back to procfs on affected kernels): #3141 Ubuntu bug report: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1763454 |
If you are getting bitten by this, you can block this specific kernel version using apt preferences until Ubuntu releases a new version. That way you can apt-get update without worrying about accidentally installing it.
|
Update: Ubuntu has a fix in |
I wonder if this issue is causing my problems. I went from a Debian K8s cluster in AWS to an Ubuntu K8s cluster in AWS. Fresh cluster. Newest Ubuntu Image. Applied the WeaveScope Helm Chart - and BAM! All my hosts went out of service in the ELB (Instance status checks in AWS were failing too). When they came back in service - and I was able to do a |
Upgrading to 1.9.0 fixed my issue. |
With c75700f we added code to detect Ubuntu Xenial kernels with a regression in the eBPF subsystem, in order to fall back gracefully to procfs scanning on such systems (and not crash the host system by running eBPF code). With the latest kernel update for Ubuntu Xenial, the bug was fixed: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1763454 Therefore we can update the added check with an upper limit and make sure that eBPF connection tracking is disabled only on kernels within the range that has the bug. xref: weaveworks#3131
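One way to express such a bounded check is a half-open range of ABI numbers. The sketch below is only an illustration: the `abiRange` type is hypothetical, and the upper bound is a deliberately unrealistic placeholder, because the first fixed ABI number is not stated in this thread.

```go
package main

import "fmt"

// abiRange is a hypothetical type describing a half-open interval
// [firstBad, firstFixed) of Ubuntu kernel ABI numbers.
type abiRange struct {
	firstBad   int // first ABI number known to crash (119 per this report)
	firstFixed int // first ABI number carrying the fix (placeholder below)
}

// contains reports whether abi falls inside the affected interval.
func (r abiRange) contains(abi int) bool {
	return abi >= r.firstBad && abi < r.firstFixed
}

func main() {
	// placeholderFirstFixed is NOT the real fixed version; take the
	// actual value from the Ubuntu bug report before using this.
	const placeholderFirstFixed = 1000
	affected := abiRange{firstBad: 119, firstFixed: placeholderFirstFixed}

	for _, abi := range []int{116, 119, 999, 1000} {
		fmt.Printf("ABI %d: disable eBPF tracking = %v\n", abi, affected.contains(abi))
	}
}
```

Keeping the interval half-open means the first fixed release can be used directly as the bound, without subtracting one.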
Full kernel version:
4.4.0-119-generic #143-Ubuntu SMP
We were previously running
4.4.0-116-generic #140-Ubuntu SMP
with no problems.
Workaround:
Disable eBPF connection tracking with
--probe.ebpf.connections=false
Panic details
panic.txt