This repository has been archived by the owner on Jun 28, 2023. It is now read-only.

Bootstrap cluster failing to initialize on newer kernels #891

Closed
stmcginnis opened this issue Jun 29, 2021 · 6 comments
Labels
kind/bug A bug in an existing capability triage/needs-triage Needs triage by TCE maintainers

Comments

@stmcginnis
Contributor

Bug Report

Hit this issue on Ubuntu 21.04, kernel version 5.11.0-22-generic.

During deployment of the bootstrap cluster, the process hangs and eventually times out.

Looking at the pods during this time, two of them (kindnet and kube-proxy) are in CrashLoopBackOff:

$ kubectl --kubeconfig=/home/smcginnis/.kube-tkg/tmp/config_P6PAe67a get pods,deployments -A
NAMESPACE            NAME                                                                      READY   STATUS             RESTARTS   AGE
cert-manager         pod/cert-manager-7b7f88644d-8g6hr                                         0/1     Pending            0          59s
cert-manager         pod/cert-manager-cainjector-7cd8d6b475-c4ghb                              0/1     Pending            0          59s
cert-manager         pod/cert-manager-webhook-5f9b95bd4-q8bnh                                  0/1     Pending            0          59s
kube-system          pod/coredns-68d49685bd-pdcz5                                              0/1     Pending            0          2m50s
kube-system          pod/coredns-68d49685bd-v6bc5                                              0/1     Pending            0          2m50s
kube-system          pod/etcd-tkg-kind-c3dofvlugk4dikad0gb0-control-plane                      1/1     Running            0          3m2s
kube-system          pod/kindnet-w6prv                                                         0/1     CrashLoopBackOff   4          2m50s
kube-system          pod/kube-apiserver-tkg-kind-c3dofvlugk4dikad0gb0-control-plane            1/1     Running            0          3m2s
kube-system          pod/kube-controller-manager-tkg-kind-c3dofvlugk4dikad0gb0-control-plane   1/1     Running            0          3m2s
kube-system          pod/kube-proxy-h6ltl                                                      0/1     CrashLoopBackOff   4          2m50s
kube-system          pod/kube-scheduler-tkg-kind-c3dofvlugk4dikad0gb0-control-plane            1/1     Running            0          3m2s
local-path-storage   pod/local-path-provisioner-78776bfc44-r7qm5                               0/1     Pending            0          2m50s

NAMESPACE            NAME                                      READY   UP-TO-DATE   AVAILABLE   AGE
cert-manager         deployment.apps/cert-manager              0/1     1            0           59s
cert-manager         deployment.apps/cert-manager-cainjector   0/1     1            0           59s
cert-manager         deployment.apps/cert-manager-webhook      0/1     1            0           59s
kube-system          deployment.apps/coredns                   0/2     2            0           3m3s
local-path-storage   deployment.apps/local-path-provisioner    0/1     1            0           3m2s
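
To narrow a listing like the one above down to just the crash-looping pods, a small filter can help. This is only a sketch: `crashlooping_pods` is a hypothetical helper name, and it assumes the default column layout of `kubectl get pods -A --no-headers` (NAMESPACE, NAME, READY, STATUS, ...).

```shell
# Hypothetical helper: read `kubectl get pods -A --no-headers` output on stdin
# and print namespace/name for every pod whose STATUS column is CrashLoopBackOff.
crashlooping_pods() {
  awk '$4 == "CrashLoopBackOff" { print $1 "/" $2 }'
}
```

It could be used as `kubectl --kubeconfig=<bootstrap kubeconfig> get pods -A --no-headers | crashlooping_pods`, where the kubeconfig path is whatever temporary file the bootstrap run created.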

Looking at the kube-proxy logs, we see there is a permission denied error:

$ kubectl --kubeconfig=/home/smcginnis/.kube-tkg/tmp/config_P6PAe67a logs -n kube-system pod/kube-proxy-h6ltl
I0629 20:57:46.292878       1 node.go:172] Successfully retrieved node IP: 172.17.0.2
I0629 20:57:46.292936       1 server_others.go:142] kube-proxy node IP is an IPv4 address (172.17.0.2), assume IPv4 operation
W0629 20:57:46.302727       1 server_others.go:578] Unknown proxy mode "", assuming iptables proxy
I0629 20:57:46.302900       1 server_others.go:185] Using iptables Proxier.
I0629 20:57:46.303133       1 server.go:650] Version: v1.20.4+vmware.1
I0629 20:57:46.303647       1 conntrack.go:100] Set sysctl 'net/netfilter/nf_conntrack_max' to 131072
F0629 20:57:46.303668       1 server.go:495] open /proc/sys/net/netfilter/nf_conntrack_max: permission denied
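
The line that matters in the log above is the one starting with `F`, which is the fatal severity in the klog line format. A minimal sketch for pulling out only those lines, assuming the standard klog prefix (severity letter followed by the four-digit month and day); `fatal_lines` is a hypothetical helper name:

```shell
# Hypothetical helper: print only fatal (F-prefixed) klog lines from stdin.
# `|| true` keeps the exit status at 0 when there are no fatal lines.
fatal_lines() {
  grep -E '^F[0-9]{4} ' || true
}
```

For example, piping the `kubectl ... logs -n kube-system pod/kube-proxy-h6ltl` output through `fatal_lines` should leave only the `permission denied` line.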

It looks like this has been addressed in the latest kind by kubernetes-sigs/kind#2241.

Interestingly, I also hit this on openSUSE Leap running an older kernel, 5.3.18-59.5-default. In that case I was able to use one of the suggested workarounds:

sudo sysctl net/netfilter/nf_conntrack_max=524288

Then restart the failing pods or rerun the tanzu standalone-cluster create -i docker foo command. That got past the initial failure, but then hung on an issue in capd-controller that is still being investigated.
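
As a rough way to pick the value for that workaround on a given host, the sketch below estimates what kube-proxy would try to write and compares it with the host's current setting. It rests on assumptions not stated in this issue: that kube-proxy runs with its default `conntrack.maxPerCore` (32768) and `conntrack.min` (131072), that it sees the same CPU count as `nproc` on the host, and that it skips the write when the host value already matches its target. The function names are hypothetical.

```shell
#!/usr/bin/env bash
# Assumed kube-proxy defaults (conntrack.maxPerCore and conntrack.min);
# adjust these if the cluster overrides them.
MAX_PER_CORE=32768
CONNTRACK_MIN=131072

# Print the nf_conntrack_max value kube-proxy would target for a CPU count.
expected_conntrack_max() {
  local cores=$1
  local scaled=$(( cores * MAX_PER_CORE ))
  if (( scaled > CONNTRACK_MIN )); then
    echo "$scaled"
  else
    echo "$CONNTRACK_MIN"
  fi
}

# Compare the host's current value with the expected target and print a hint.
# The sysctl path can be overridden with a second argument (useful for testing).
check_conntrack_max() {
  local cores=$1
  local path=${2:-/proc/sys/net/netfilter/nf_conntrack_max}
  local want current
  want=$(expected_conntrack_max "$cores")
  if [[ ! -r $path ]]; then
    echo "cannot read $path (is the nf_conntrack module loaded?)"
    return 0
  fi
  current=$(<"$path")
  if (( current == want )); then
    echo "ok: nf_conntrack_max is already $want"
  else
    echo "suggest: sudo sysctl net/netfilter/nf_conntrack_max=$want (currently $current)"
  fi
}

# Read-only check against the local host; nothing is modified.
check_conntrack_max "$(nproc)"
```

Under these assumptions a 16-CPU host yields 524288, the value used in the workaround above, and a host with four or fewer CPUs yields 131072, the value in the kube-proxy log. Whether that matches the machines in this report is not confirmed here.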

Expected Behavior

Bootstrap cluster deployment should not hang.

We may need to move up to the latest kind image to get this though.

Steps to Reproduce the Bug

Attempt to deploy on Ubuntu 21.04.

Environment Details

  • Build version (tanzu version): v0.3.0 (TCE version v0.5.0)
  • Operating System (client): Ubuntu 21.04
stmcginnis added the kind/bug and triage/needs-triage labels Jun 29, 2021
@stmcginnis
Contributor Author

Interestingly, also hit this with openSUSE Leap...
In that case I was able to use one of the suggested workarounds ...
That got past the initial failure, but then hung on an issue in capd-controller that is still being investigated.

I restarted, cleaned up the pods and the ~/.tanzu directory, then ran tanzu standalone-cluster create -i docker testing, and it deployed successfully on openSUSE Leap 15.3. So openSUSE at least has the workaround mentioned above. Ubuntu 21.04 is blocked.

@randomvariable
Contributor

Bumping core in go.mod should resolve this.

@randomvariable
Contributor

The capd issue should also be resolved if you're pulling in the providers from the core repo.

@jpmcb
Contributor

jpmcb commented Jul 6, 2021

Seeing the same thing on my Pop!_OS system (which is Ubuntu-based):

OS: Pop!_OS 21.04 x86_64
Kernel: 5.11.0-7620-generic

And getting the same errors in kube-proxy.

@randomvariable
Contributor

randomvariable commented Jul 7, 2021

Once TCE is synced with the providers from the framework repo and pulls in a newer version in its go.mod, it should work.

@jpmcb
Contributor

jpmcb commented Jul 14, 2021

#918 is merged, and I've confirmed off this branch that we are no longer seeing the kube-proxy permission issue.

@jpmcb jpmcb closed this as completed Jul 14, 2021