Calico node crash when create failsafe port #2904

Closed
wingerted opened this issue Oct 3, 2019 · 12 comments
@wingerted

Current Behavior

kubectl get pods -n kube-system
NAME                                       READY   STATUS    RESTARTS   AGE
calico-kube-controllers-7f68dfc8c6-qzb4s   1/1     Running   0          38m
calico-node-4nvhv                          1/1     Running   0          16m
calico-node-dl8ml                          1/1     Running   0          16m
calico-node-h9mrl                          1/1     Running   0          15m
calico-node-khx9d                          1/1     Running   0          15m
calico-node-msvqd                          0/1     Running   0          15m
2019-10-03 04:34:38.163 [WARNING][16449] int_dataplane.go 728: failed to set XDP failsafe ports, disabling XDP: failed to create map (calico_failsafe_ports_v1): exit status 255
Error: map create failed: Operation not permitted

2019-10-03 04:34:38.296 [WARNING][16449] int_dataplane.go 781: failed to wipe the XDP state error=failed to load BPF program (/tmp/felix-bpf-824808941): stat /sys/fs/bpf/calico/xdp/prefilter_v1_calico_tmp_A: no such file or directory
libbpf: Error in bpf_object__probe_name():Operation not permitted(1). Couldn't load basic 'r0 = 0' BPF program.
libbpf: failed to load object '/tmp/felix-bpf-824808941'
Error: failed to load object file
 try=0
2019-10-03 04:34:38.356 [WARNING][16449] int_dataplane.go 781: failed to wipe the XDP state error=failed to load BPF program (/tmp/felix-bpf-424485992): stat /sys/fs/bpf/calico/xdp/prefilter_v1_calico_tmp_A: no such file or directory
libbpf: Error in bpf_object__probe_name():Operation not permitted(1). Couldn't load basic 'r0 = 0' BPF program.
libbpf: failed to load object '/tmp/felix-bpf-424485992'
Error: failed to load object file
 try=1
2019-10-03 04:34:38.420 [WARNING][16449] int_dataplane.go 781: failed to wipe the XDP state error=failed to load BPF program (/tmp/felix-bpf-403440295): stat /sys/fs/bpf/calico/xdp/prefilter_v1_calico_tmp_A: no such file or directory
libbpf: Error in bpf_object__probe_name():Operation not permitted(1). Couldn't load basic 'r0 = 0' BPF program.
libbpf: failed to load object '/tmp/felix-bpf-403440295'
Error: failed to load object file
 try=2
2019-10-03 04:34:38.472 [WARNING][16449] int_dataplane.go 781: failed to wipe the XDP state error=failed to load BPF program (/tmp/felix-bpf-154340314): stat /sys/fs/bpf/calico/xdp/prefilter_v1_calico_tmp_A: no such file or directory
libbpf: Error in bpf_object__probe_name():Operation not permitted(1). Couldn't load basic 'r0 = 0' BPF program.
libbpf: failed to load object '/tmp/felix-bpf-154340314'
Error: failed to load object file
 try=3
2019-10-03 04:34:38.529 [WARNING][16449] int_dataplane.go 781: failed to wipe the XDP state error=failed to load BPF program (/tmp/felix-bpf-230288753): stat /sys/fs/bpf/calico/xdp/prefilter_v1_calico_tmp_A: no such file or directory
libbpf: Error in bpf_object__probe_name():Operation not permitted(1). Couldn't load basic 'r0 = 0' BPF program.
libbpf: failed to load object '/tmp/felix-bpf-230288753'
Error: failed to load object file
 try=4
2019-10-03 04:34:38.612 [WARNING][16449] int_dataplane.go 781: failed to wipe the XDP state error=failed to load BPF program (/tmp/felix-bpf-864290844): stat /sys/fs/bpf/calico/xdp/prefilter_v1_calico_tmp_A: no such file or directory
libbpf: Error in bpf_object__probe_name():Operation not permitted(1). Couldn't load basic 'r0 = 0' BPF program.
libbpf: failed to load object '/tmp/felix-bpf-864290844'
Error: failed to load object file
 try=5
2019-10-03 04:34:38.672 [WARNING][16449] int_dataplane.go 781: failed to wipe the XDP state error=failed to load BPF program (/tmp/felix-bpf-075479755): stat /sys/fs/bpf/calico/xdp/prefilter_v1_calico_tmp_A: no such file or directory
libbpf: Error in bpf_object__probe_name():Operation not permitted(1). Couldn't load basic 'r0 = 0' BPF program.
libbpf: failed to load object '/tmp/felix-bpf-075479755'
Error: failed to load object file
 try=6
2019-10-03 04:34:38.732 [WARNING][16449] int_dataplane.go 781: failed to wipe the XDP state error=failed to load BPF program (/tmp/felix-bpf-109286318): stat /sys/fs/bpf/calico/xdp/prefilter_v1_calico_tmp_A: no such file or directory
libbpf: Error in bpf_object__probe_name():Operation not permitted(1). Couldn't load basic 'r0 = 0' BPF program.
libbpf: failed to load object '/tmp/felix-bpf-109286318'
Error: failed to load object file
 try=7
2019-10-03 04:34:38.788 [WARNING][16449] int_dataplane.go 781: failed to wipe the XDP state error=failed to load BPF program (/tmp/felix-bpf-777518389): stat /sys/fs/bpf/calico/xdp/prefilter_v1_calico_tmp_A: no such file or directory
libbpf: Error in bpf_object__probe_name():Operation not permitted(1). Couldn't load basic 'r0 = 0' BPF program.
libbpf: failed to load object '/tmp/felix-bpf-777518389'
Error: failed to load object file
 try=8
2019-10-03 04:34:38.836 [WARNING][16449] int_dataplane.go 781: failed to wipe the XDP state error=failed to load BPF program (/tmp/felix-bpf-444518672): stat /sys/fs/bpf/calico/xdp/prefilter_v1_calico_tmp_A: no such file or directory
libbpf: Error in bpf_object__probe_name():Operation not permitted(1). Couldn't load basic 'r0 = 0' BPF program.
libbpf: failed to load object '/tmp/felix-bpf-444518672'
Error: failed to load object file
 try=9
2019-10-03 04:34:38.836 [PANIC][16449] int_dataplane.go 784: Failed to wipe the XDP state after 10 tries
panic: (*logrus.Entry) (0x19b44a0,0xc0004dc230)

goroutine 1 [running]:
github.com/sirupsen/logrus.Entry.log(0xc00011c050, 0xc000588db0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x7f2d00000000, ...)
	/go/pkg/mod/github.com/projectcalico/logrus@v0.0.0-20180627202928-fc9bbf2f57995271c5cd6911ede7a2ebc5ea7c6f/entry.go:112 +0x2d2
github.com/sirupsen/logrus.(*Entry).Panic(0xc0004dc0a0, 0xc000468290, 0x1, 0x1)
	/go/pkg/mod/github.com/projectcalico/logrus@v0.0.0-20180627202928-fc9bbf2f57995271c5cd6911ede7a2ebc5ea7c6f/entry.go:182 +0x103
github.com/sirupsen/logrus.(*Entry).Panicf(0xc0004dc0a0, 0x1a336bb, 0x2b, 0xc000468340, 0x1, 0x1)
	/go/pkg/mod/github.com/projectcalico/logrus@v0.0.0-20180627202928-fc9bbf2f57995271c5cd6911ede7a2ebc5ea7c6f/entry.go:230 +0xd4
github.com/sirupsen/logrus.(*Logger).Panicf(0xc00011c050, 0x1a336bb, 0x2b, 0xc000468340, 0x1, 0x1)
	/go/pkg/mod/github.com/projectcalico/logrus@v0.0.0-20180627202928-fc9bbf2f57995271c5cd6911ede7a2ebc5ea7c6f/logger.go:173 +0x86
github.com/sirupsen/logrus.Panicf(...)
	/go/pkg/mod/github.com/projectcalico/logrus@v0.0.0-20180627202928-fc9bbf2f57995271c5cd6911ede7a2ebc5ea7c6f/exported.go:145
github.com/projectcalico/felix/dataplane/linux.(*InternalDataplane).shutdownXDPCompletely(0xc00040a900)
	/go/pkg/mod/github.com/projectcalico/felix@v0.0.0-20190910213021-a2d8a80b2ace/dataplane/linux/int_dataplane.go:784 +0x2cd
github.com/projectcalico/felix/dataplane/linux.(*InternalDataplane).doStaticDataplaneConfig(0xc00040a900)
	/go/pkg/mod/github.com/projectcalico/felix@v0.0.0-20190910213021-a2d8a80b2ace/dataplane/linux/int_dataplane.go:729 +0xbaa
github.com/projectcalico/felix/dataplane/linux.(*InternalDataplane).Start(0xc00040a900)
	/go/pkg/mod/github.com/projectcalico/felix@v0.0.0-20190910213021-a2d8a80b2ace/dataplane/linux/int_dataplane.go:592 +0x2f
github.com/projectcalico/felix/dataplane.StartDataplaneDriver(0xc0004c7900, 0xc00001e9c0, 0xc000681620, 0x1, 0xc0004697d8, 0x0)
	/go/pkg/mod/github.com/projectcalico/felix@v0.0.0-20190910213021-a2d8a80b2ace/dataplane/driver.go:186 +0xf09
github.com/projectcalico/felix/daemon.Run(0x1a05d48, 0x15, 0x1cbbc38, 0x6, 0x1d106e0, 0x28, 0x1ce87a0, 0x18)
	/go/pkg/mod/github.com/projectcalico/felix@v0.0.0-20190910213021-a2d8a80b2ace/daemon/daemon.go:305 +0x1759
main.main()
	/go/src/github.com/projectcalico/node/cmd/calico-node/main.go:100 +0x405

Context

I set up Kubernetes 1.15.3 on a freshly installed cluster using Kubespray.
All nodes run Ubuntu 18.04.3 with kernel 5.0.0-29.
The Calico version is 3.7.3; I also tried 3.9.1 and got the same error.

I searched the code and found that the failing command is
bpftool map create /sys/fs/bpf/calico/calico_failsafe_ports_v1 type hash key 4 value 1 entries 65535 name calico_failsafe_ports_v1 flags 1
It fails on only one node, and it still failed after I reinstalled Docker.

Under what conditions does this return "Operation not permitted"?

Your Environment

  • Calico version:3.7.3/3.9.1
  • Kubernetes version: 1.15.3
  • Operating System and version: Ubuntu 18.04.3 x64 kernel 5.0.0-29
@lwr20
Member

lwr20 commented Oct 3, 2019

2019-10-03 04:34:38.836 [PANIC][16449] int_dataplane.go 784: Failed to wipe the XDP state after 10 tries
panic: (*logrus.Entry) (0x19b44a0,0xc0004dc230)

is similar to #2901

@fasaxc
Member

fasaxc commented Oct 3, 2019

Does kubespray run calico as non-root? Maybe we need a new permission. As a workaround you should be able to disable the XDP feature.

@wingerted
Author

@fasaxc it runs as root.
And only one node gets this error, so I want to know why 😰

@fasaxc
Member

fasaxc commented Oct 3, 2019

Anything obviously different about the bad node? Is it configured as the master? Is it running any particular services?

Some things that might help to find out what's special about that node:

sudo sysctl -a | grep bpf

and

kubectl exec -n kube-system <calico-node-pod-name> mount | grep bpf

I'm wondering if that node had trouble mounting the BPF file system or if the BPF sysctls are disabling BPF calls.
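
The mount check can also be done programmatically. The following is a minimal sketch, assuming the standard /proc/mounts layout (device, mount point, filesystem type, options); the function name bpfMountPoints is made up for illustration and is not part of Calico or Felix.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// bpfMountPoints scans text in /proc/mounts format and returns the mount
// points whose filesystem type is "bpf". The name is hypothetical.
func bpfMountPoints(mounts string) []string {
	var points []string
	sc := bufio.NewScanner(strings.NewReader(mounts))
	for sc.Scan() {
		// Each line is: device mountpoint fstype options dump pass
		fields := strings.Fields(sc.Text())
		if len(fields) >= 3 && fields[2] == "bpf" {
			points = append(points, fields[1])
		}
	}
	return points
}

func main() {
	data, err := os.ReadFile("/proc/mounts")
	if err != nil {
		fmt.Println("could not read /proc/mounts:", err)
		return
	}
	fmt.Println("bpf mount points:", bpfMountPoints(string(data)))
}
```

An empty result would suggest the BPF file system is not mounted in that mount namespace. A non-empty result only rules out the mount as the cause; it says nothing about whether bpf() calls are permitted.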

@wingerted
Author

sudo sysctl -a | grep bpf
kernel.unprivileged_bpf_disabled = 0
net.core.bpf_jit_enable = 1
net.core.bpf_jit_harden = 0
net.core.bpf_jit_kallsyms = 0
net.core.bpf_jit_limit = 264241152
kubectl exec -n kube-system calico-node-msvqd mount | grep bpf
/sys/fs/bpf on /sys/fs/bpf type bpf (rw,relatime)

It all seems fine...

The bad node is not configured as the master

@fasaxc
Member

fasaxc commented Oct 10, 2019

Are you running any policing/enforcing apps on there? For example selinux or something that monitors syscalls?

@fasaxc
Member

fasaxc commented Oct 10, 2019

For anyone seeing this, a workaround should be to disable the XDP feature by setting the FELIX_XDPENABLED=false env var in the calico manifest. We're not sure what's causing the permissions errors on some nodes but not others; it would be good to connect on the Calico Users Slack to investigate.

@jayunit100
Contributor

jayunit100 commented Oct 19, 2019

So, after reading more about this:

  • XDP is a performance optimization
  • It's not working on one of the nodes in the cluster, maybe because of SELinux rules, the NIC, or something else.

But I think this issue is still actionable... because there is still the fact that:

  • shutdownXDPCompletely fails in a panic and
  • the 10 retries all happen within about half a second, only tens of milliseconds apart, judging by the log timestamps.

... After looking more at the code, (1) there is no backoff between the retries for the XDP shutdown and (2) the tryResync semantics aren't very clear. I think I have a fix for these and will file a PR shortly. That should make this logic more deterministic in failure scenarios and less spammy. I also wonder whether we should consider (3) not panicking and just moving on, since we know that XDP isn't working to begin with.

@caseydavenport
Member

@fasaxc @jayunit100 can we close this now that projectcalico/felix#2165 has gone in?

@mcmcghee

mcmcghee commented Mar 6, 2020

For anyone else who comes across this: my issue was a combination of using Ubuntu, kernel 5.3, and having Secure Boot enabled. Some newer kernels enable lockdown mode, which breaks BPF. You can read more at this comment and this bug report: Disabling bpf() syscall on kernel lockdown break apps when secure boot is on
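
One way to check a node for this is sketched below. It assumes the kernel exposes its lockdown state at /sys/kernel/security/lockdown with the active mode in square brackets (for example "none [integrity] confidentiality"); distro kernels that carried their own lockdown patches may not provide this file. activeLockdownMode is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// activeLockdownMode extracts the bracketed entry from the contents of the
// lockdown file, e.g. "none [integrity] confidentiality" yields "integrity".
// The second return value is false if no bracketed entry is found.
func activeLockdownMode(contents string) (string, bool) {
	for _, f := range strings.Fields(contents) {
		if strings.HasPrefix(f, "[") && strings.HasSuffix(f, "]") {
			return strings.Trim(f, "[]"), true
		}
	}
	return "", false
}

func main() {
	// Assumed path; absent on kernels without this interface.
	data, err := os.ReadFile("/sys/kernel/security/lockdown")
	if err != nil {
		fmt.Println("lockdown state not available:", err)
		return
	}
	if mode, ok := activeLockdownMode(string(data)); ok {
		fmt.Println("active lockdown mode:", mode)
	} else {
		fmt.Println("could not parse lockdown state:", strings.TrimSpace(string(data)))
	}
}
```

A mode other than "none" on the failing node, and "none" on the healthy ones, would be consistent with the explanation above.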

@cityofships

Thanks for the comment @mcmcghee, I'm hitting exactly this.

@wingerted
Author

That's it! @mcmcghee Thank you!
