
[BUG] failing on linux kernel >= 5.12.2 (and >= 5.11.19, ...) #607

Closed
kpango opened this issue May 16, 2021 · 19 comments
Labels: bug (Something isn't working), k3s (This is likely an issue with k3s not k3d itself)


kpango commented May 16, 2021

What did you do

  • How was the cluster created?

    • k3d cluster create "vald-cluster" -p "8081:80@loadbalancer" --agents 5
  • What did you do afterwards?

    • k3d commands?
    • docker commands?
    • OS operations (e.g. shutdown/reboot)?

What did you expect to happen

I expected k3d to work with the latest kernels, just as it did with Linux kernel 5.11.

Instead, when multiple agents were specified, cluster creation stopped before the agents were started, and the Docker logs showed the k3d server container restarting at a high frequency.

In the container log it looks like kube-proxy fails to start because it cannot set nf_conntrack_max.

I looked into this a bit and found that kind and minikube hit the same problem; it appears to be fixed in the following issues and PRs.

For kind:
kubernetes-sigs/kind#2240
kubernetes-sigs/kind#2241

For minikube:
kubernetes/minikube#11419
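
For context, a rough sketch of the underlying kernel behaviour (as far as I understand it; 524288 is the value kube-proxy tries to write in the log below): as of kernel 5.12.2 / 5.11.19 the nf_conntrack_max sysctl appears to be read-only outside the init network namespace, so the write kube-proxy attempts from inside the node container is rejected.

# current value in the host (init) network namespace
cat /proc/sys/net/netfilter/nf_conntrack_max
# the equivalent write attempted from a container's network namespace;
# on affected kernels this should fail with a permission error
# (on older kernels it succeeds and, notably, changes the host-wide value)
docker run --rm --privileged busybox sh -c 'echo 524288 > /proc/sys/net/netfilter/nf_conntrack_max'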

Screenshots or terminal output

k3d command cli log

INFO[0000] Prep: Network
INFO[0000] Created network 'k3d-vald-cluster' (dd17969b82f64e60516154a7a8492e91daf0e1eb9f4ed8a70d5d6d57601b3692)
INFO[0000] Created volume 'k3d-vald-cluster-images'
INFO[0001] Creating node 'k3d-vald-cluster-server-0'
INFO[0001] Creating node 'k3d-vald-cluster-agent-0'
INFO[0001] Creating node 'k3d-vald-cluster-agent-1'
INFO[0001] Creating node 'k3d-vald-cluster-agent-2'
INFO[0001] Creating node 'k3d-vald-cluster-agent-3'
INFO[0001] Creating node 'k3d-vald-cluster-agent-4'
INFO[0001] Creating LoadBalancer 'k3d-vald-cluster-serverlb'
INFO[0001] Starting cluster 'vald-cluster'
INFO[0001] Starting servers...
INFO[0001] Starting Node 'k3d-vald-cluster-server-0'
INFO[0005] Starting agents...
INFO[0005] Starting Node 'k3d-vald-cluster-agent-0'  <----- creating process stops here!!!

docker ps

CONTAINER ID   IMAGE                      COMMAND                  CREATED         STATUS                          PORTS     NAMES
4ad2f4712ae1   rancher/k3s:v1.20.6-k3s1   "/bin/k3s agent"         6 minutes ago   Up 3 minutes                              k3d-vald-cluster-agent-0
f9e815595dcf   rancher/k3s:v1.20.6-k3s1   "/bin/k3s server --t…"   6 minutes ago   Restarting (1) 42 seconds ago             k3d-vald-cluster-server-0

docker logs --details -t f9e815595dcf

2021-05-16T05:05:29.445125257Z  time="2021-05-16T05:05:29.444942848Z" level=info msg="Starting k3s v1.20.6+k3s1 (8d043282)"
2021-05-16T05:05:29.445182399Z  time="2021-05-16T05:05:29.445150819Z" level=info msg="Cluster bootstrap already complete"
2021-05-16T05:05:29.455391273Z  time="2021-05-16T05:05:29.455257589Z" level=info msg="Configuring sqlite3 database connection pooling: maxIdleConns=2, maxOpenConns=0, connMaxLifetime=0s"
2021-05-16T05:05:29.455407602Z  time="2021-05-16T05:05:29.455285849Z" level=info msg="Configuring database table schema and indexes, this may take a moment..."
2021-05-16T05:05:29.455409948Z  time="2021-05-16T05:05:29.455311476Z" level=info msg="certificate CN=k3s,O=k3s signed by CN=k3s-server-ca@1621141073: notBefore=2021-05-16 04:57:53 +0000 UTC notAfter=2022-05-16 05:05:29 +0000 UTC"
2021-05-16T05:05:29.455444959Z  time="2021-05-16T05:05:29.455362850Z" level=info msg="Database tables and indexes are up to date"
2021-05-16T05:05:29.456363097Z  time="2021-05-16T05:05:29.456238613Z" level=info msg="Kine listening on unix://kine.sock"
2021-05-16T05:05:29.456432105Z  time="2021-05-16T05:05:29.456335058Z" level=info msg="Running kube-apiserver --advertise-port=6443 --allow-privileged=true --anonymous-auth=false --api-audiences=https://kubernetes.default.svc.cluster.local,k3s --authorization-mode=Node,RBAC --bind-address=127.0.0.1 --cert-dir=/var/lib/rancher/k3s/server/tls/temporary-certs --client-ca-file=/var/lib/rancher/k3s/server/tls/client-ca.crt --enable-admission-plugins=NodeRestriction --etcd-servers=unix://kine.sock --feature-gates=ServiceAccountIssuerDiscovery=false --insecure-port=0 --kubelet-certificate-authority=/var/lib/rancher/k3s/server/tls/server-ca.crt --kubelet-client-certificate=/var/lib/rancher/k3s/server/tls/client-kube-apiserver.crt --kubelet-client-key=/var/lib/rancher/k3s/server/tls/client-kube-apiserver.key --profiling=false --proxy-client-cert-file=/var/lib/rancher/k3s/server/tls/client-auth-proxy.crt --proxy-client-key-file=/var/lib/rancher/k3s/server/tls/client-auth-proxy.key --requestheader-allowed-names=system:auth-proxy --requestheader-client-ca-file=/var/lib/rancher/k3s/server/tls/request-header-ca.crt --requestheader-extra-headers-prefix=X-Remote-Extra- --requestheader-group-headers=X-Remote-Group --requestheader-username-headers=X-Remote-User --secure-port=6444 --service-account-issuer=https://kubernetes.default.svc.cluster.local --service-account-key-file=/var/lib/rancher/k3s/server/tls/service.key --service-account-signing-key-file=/var/lib/rancher/k3s/server/tls/service.key --service-cluster-ip-range=10.43.0.0/16 --service-node-port-range=30000-32767 --storage-backend=etcd3 --tls-cert-file=/var/lib/rancher/k3s/server/tls/serving-kube-apiserver.crt --tls-private-key-file=/var/lib/rancher/k3s/server/tls/serving-kube-apiserver.key"
2021-05-16T05:05:29.456995072Z  Flag --insecure-port has been deprecated, This flag has no effect now and will be removed in v1.24.
2021-05-16T05:05:29.457269375Z  I0516 05:05:29.457197       7 server.go:659] external host was not specified, using 192.168.9.3
2021-05-16T05:05:29.457451380Z  I0516 05:05:29.457389       7 server.go:196] Version: v1.20.6+k3s1
2021-05-16T05:05:29.460661489Z  I0516 05:05:29.460623       7 shared_informer.go:240] Waiting for caches to sync for node_authorizer
2021-05-16T05:05:29.461489325Z  I0516 05:05:29.461466       7 plugins.go:158] Loaded 12 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,LimitRanger,ServiceAccount,NodeRestriction,TaintNodesByCondition,Priority,DefaultTolerationSeconds,DefaultStorageClass,StorageObjectInUseProtection,RuntimeClass,DefaultIngressClass,MutatingAdmissionWebhook.
2021-05-16T05:05:29.461498436Z  I0516 05:05:29.461474       7 plugins.go:161] Loaded 10 validating admission controller(s) successfully in the following order: LimitRanger,ServiceAccount,Priority,PersistentVolumeClaimResize,RuntimeClass,CertificateApproval,CertificateSigning,CertificateSubjectRestriction,ValidatingAdmissionWebhook,ResourceQuota.
2021-05-16T05:05:29.462113819Z  I0516 05:05:29.462055       7 plugins.go:158] Loaded 12 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,LimitRanger,ServiceAccount,NodeRestriction,TaintNodesByCondition,Priority,DefaultTolerationSeconds,DefaultStorageClass,StorageObjectInUseProtection,RuntimeClass,DefaultIngressClass,MutatingAdmissionWebhook.
2021-05-16T05:05:29.462123361Z  I0516 05:05:29.462061       7 plugins.go:161] Loaded 10 validating admission controller(s) successfully in the following order: LimitRanger,ServiceAccount,Priority,PersistentVolumeClaimResize,RuntimeClass,CertificateApproval,CertificateSigning,CertificateSubjectRestriction,ValidatingAdmissionWebhook,ResourceQuota.
2021-05-16T05:05:29.474074978Z  I0516 05:05:29.473963       7 instance.go:289] Using reconciler: lease
2021-05-16T05:05:29.492167498Z  I0516 05:05:29.492094       7 rest.go:131] the default service ipfamily for this cluster is: IPv4
2021-05-16T05:05:29.650038941Z  W0516 05:05:29.649900       7 genericapiserver.go:425] Skipping API batch/v2alpha1 because it has no resources.
2021-05-16T05:05:29.655280702Z  W0516 05:05:29.655185       7 genericapiserver.go:425] Skipping API discovery.k8s.io/v1alpha1 because it has no resources.
2021-05-16T05:05:29.661628727Z  W0516 05:05:29.661533       7 genericapiserver.go:425] Skipping API node.k8s.io/v1alpha1 because it has no resources.
2021-05-16T05:05:29.666259481Z  W0516 05:05:29.666153       7 genericapiserver.go:425] Skipping API rbac.authorization.k8s.io/v1alpha1 because it has no resources.
2021-05-16T05:05:29.668215081Z  W0516 05:05:29.668089       7 genericapiserver.go:425] Skipping API scheduling.k8s.io/v1alpha1 because it has no resources.
2021-05-16T05:05:29.671398493Z  W0516 05:05:29.671305       7 genericapiserver.go:425] Skipping API storage.k8s.io/v1alpha1 because it has no resources.
2021-05-16T05:05:29.672868265Z  W0516 05:05:29.672771       7 genericapiserver.go:425] Skipping API flowcontrol.apiserver.k8s.io/v1alpha1 because it has no resources.
2021-05-16T05:05:29.676425119Z  W0516 05:05:29.676254       7 genericapiserver.go:425] Skipping API apps/v1beta2 because it has no resources.
2021-05-16T05:05:29.676434571Z  W0516 05:05:29.676262       7 genericapiserver.go:425] Skipping API apps/v1beta1 because it has no resources.
2021-05-16T05:05:29.681968471Z  I0516 05:05:29.681798       7 plugins.go:158] Loaded 12 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,LimitRanger,ServiceAccount,NodeRestriction,TaintNodesByCondition,Priority,DefaultTolerationSeconds,DefaultStorageClass,StorageObjectInUseProtection,RuntimeClass,DefaultIngressClass,MutatingAdmissionWebhook.
2021-05-16T05:05:29.681978716Z  I0516 05:05:29.681805       7 plugins.go:161] Loaded 10 validating admission controller(s) successfully in the following order: LimitRanger,ServiceAccount,Priority,PersistentVolumeClaimResize,RuntimeClass,CertificateApproval,CertificateSigning,CertificateSubjectRestriction,ValidatingAdmissionWebhook,ResourceQuota.
2021-05-16T05:05:29.689958434Z  time="2021-05-16T05:05:29.689772088Z" level=info msg="Running kube-scheduler --address=127.0.0.1 --bind-address=127.0.0.1 --kubeconfig=/var/lib/rancher/k3s/server/cred/scheduler.kubeconfig --leader-elect=false --port=10251 --profiling=false --secure-port=0"
2021-05-16T05:05:29.690001012Z  time="2021-05-16T05:05:29.689866940Z" level=info msg="Waiting for API server to become available"
2021-05-16T05:05:29.690364949Z  time="2021-05-16T05:05:29.690274194Z" level=info msg="Running kube-controller-manager --address=127.0.0.1 --allocate-node-cidrs=true --bind-address=127.0.0.1 --cluster-cidr=10.42.0.0/16 --cluster-signing-cert-file=/var/lib/rancher/k3s/server/tls/client-ca.crt --cluster-signing-key-file=/var/lib/rancher/k3s/server/tls/client-ca.key --configure-cloud-routes=false --controllers=*,-service,-route,-cloud-node-lifecycle --kubeconfig=/var/lib/rancher/k3s/server/cred/controller.kubeconfig --leader-elect=false --port=10252 --profiling=false --root-ca-file=/var/lib/rancher/k3s/server/tls/server-ca.crt --secure-port=0 --service-account-private-key-file=/var/lib/rancher/k3s/server/tls/service.key --use-service-account-credentials=true"
2021-05-16T05:05:29.691579980Z  time="2021-05-16T05:05:29.691499972Z" level=info msg="Node token is available at /var/lib/rancher/k3s/server/token"
2021-05-16T05:05:29.691588379Z  time="2021-05-16T05:05:29.691532877Z" level=info msg="To join node to cluster: k3s agent -s https://192.168.9.3:6443 -t ${NODE_TOKEN}"
2021-05-16T05:05:29.692480504Z  time="2021-05-16T05:05:29.692335848Z" level=info msg="Wrote kubeconfig /output/kubeconfig.yaml"
2021-05-16T05:05:29.692511505Z  time="2021-05-16T05:05:29.692424415Z" level=info msg="Run: k3s kubectl"
2021-05-16T05:05:29.692662540Z  time="2021-05-16T05:05:29.692560855Z" level=info msg="Module overlay was already loaded"
2021-05-16T05:05:29.692668059Z  time="2021-05-16T05:05:29.692584060Z" level=info msg="Module nf_conntrack was already loaded"
2021-05-16T05:05:29.692691286Z  time="2021-05-16T05:05:29.692590031Z" level=info msg="Module br_netfilter was already loaded"
2021-05-16T05:05:29.693054521Z  time="2021-05-16T05:05:29.692989749Z" level=warning msg="Failed to start iptable_nat module"
2021-05-16T05:05:29.703958256Z  time="2021-05-16T05:05:29.703817885Z" level=info msg="Cluster-Http-Server 2021/05/16 05:05:29 http: TLS handshake error from 127.0.0.1:49440: remote error: tls: bad certificate"
2021-05-16T05:05:29.706308528Z  time="2021-05-16T05:05:29.706185602Z" level=info msg="Cluster-Http-Server 2021/05/16 05:05:29 http: TLS handshake error from 127.0.0.1:49446: remote error: tls: bad certificate"
2021-05-16T05:05:29.717045241Z  time="2021-05-16T05:05:29.716906780Z" level=info msg="certificate CN=k3d-vald-cluster-server-0 signed by CN=k3s-server-ca@1621141073: notBefore=2021-05-16 04:57:53 +0000 UTC notAfter=2022-05-16 05:05:29 +0000 UTC"
2021-05-16T05:05:29.718456054Z  time="2021-05-16T05:05:29.718327116Z" level=info msg="certificate CN=system:node:k3d-vald-cluster-server-0,O=system:nodes signed by CN=k3s-client-ca@1621141073: notBefore=2021-05-16 04:57:53 +0000 UTC notAfter=2022-05-16 05:05:29 +0000 UTC"
2021-05-16T05:05:29.743269834Z  time="2021-05-16T05:05:29.743103566Z" level=info msg="Logging containerd to /var/lib/rancher/k3s/agent/containerd/containerd.log"
2021-05-16T05:05:29.743306120Z  time="2021-05-16T05:05:29.743211177Z" level=info msg="Running containerd -c /var/lib/rancher/k3s/agent/etc/containerd/config.toml -a /run/k3s/containerd/containerd.sock --state /run/k3s/containerd --root /var/lib/rancher/k3s/agent/containerd"
2021-05-16T05:05:30.745224049Z  time="2021-05-16T05:05:30.745113942Z" level=info msg="Containerd is now running"
2021-05-16T05:05:30.749774235Z  time="2021-05-16T05:05:30.749667049Z" level=info msg="Connecting to proxy" url="wss://127.0.0.1:6443/v1-k3s/connect"
2021-05-16T05:05:30.750983191Z  time="2021-05-16T05:05:30.750914236Z" level=info msg="Handling backend connection request [k3d-vald-cluster-server-0]"
2021-05-16T05:05:30.751362185Z  time="2021-05-16T05:05:30.751272040Z" level=info msg="Running kubelet --address=0.0.0.0 --anonymous-auth=false --authentication-token-webhook=true --authorization-mode=Webhook --cgroup-driver=cgroupfs --client-ca-file=/var/lib/rancher/k3s/agent/client-ca.crt --cloud-provider=external --cluster-dns=10.43.0.10 --cluster-domain=cluster.local --cni-bin-dir=/bin --cni-conf-dir=/var/lib/rancher/k3s/agent/etc/cni/net.d --container-runtime-endpoint=unix:///run/k3s/containerd/containerd.sock --container-runtime=remote --containerd=/run/k3s/containerd/containerd.sock --eviction-hard=imagefs.available<5%,nodefs.available<5% --eviction-minimum-reclaim=imagefs.available=10%,nodefs.available=10% --fail-swap-on=false --healthz-bind-address=127.0.0.1 --hostname-override=k3d-vald-cluster-server-0 --kubeconfig=/var/lib/rancher/k3s/agent/kubelet.kubeconfig --node-labels= --pod-manifest-path=/var/lib/rancher/k3s/agent/pod-manifests --read-only-port=0 --resolv-conf=/tmp/k3s-resolv.conf --serialize-image-pulls=false --tls-cert-file=/var/lib/rancher/k3s/agent/serving-kubelet.crt --tls-private-key-file=/var/lib/rancher/k3s/agent/serving-kubelet.key"
2021-05-16T05:05:30.751655452Z  time="2021-05-16T05:05:30.751584847Z" level=info msg="Running kube-proxy --cluster-cidr=10.42.0.0/16 --healthz-bind-address=127.0.0.1 --hostname-override=k3d-vald-cluster-server-0 --kubeconfig=/var/lib/rancher/k3s/agent/kubeproxy.kubeconfig --proxy-mode=iptables"
2021-05-16T05:05:30.751837779Z  Flag --cloud-provider has been deprecated, will be removed in 1.23, in favor of removing cloud provider code from Kubelet.
2021-05-16T05:05:30.751843191Z  Flag --containerd has been deprecated, This is a cadvisor flag that was mistakenly registered with the Kubelet. Due to legacy concerns, it will follow the standard CLI deprecation timeline before being removed.
2021-05-16T05:05:30.751929721Z  W0516 05:05:30.751853       7 server.go:226] WARNING: all flags other than --config, --write-config-to, and --cleanup are deprecated. Please begin using a config file ASAP.
2021-05-16T05:05:30.752339535Z  I0516 05:05:30.752269       7 server.go:412] Version: v1.20.6+k3s1
2021-05-16T05:05:30.752347133Z  W0516 05:05:30.752286       7 proxier.go:651] Failed to read file /lib/modules/5.12.3-zen1-1-zen/modules.builtin with error open /lib/modules/5.12.3-zen1-1-zen/modules.builtin: no such file or directory. You can ignore this message when kube-proxy is running inside container without mounting /lib/modules
2021-05-16T05:05:30.752714692Z  W0516 05:05:30.752674       7 proxier.go:661] Failed to load kernel module ip_vs with modprobe. You can ignore this message when kube-proxy is running inside container without mounting /lib/modules
2021-05-16T05:05:30.752975127Z  W0516 05:05:30.752944       7 proxier.go:661] Failed to load kernel module ip_vs_rr with modprobe. You can ignore this message when kube-proxy is running inside container without mounting /lib/modules
2021-05-16T05:05:30.753328453Z  W0516 05:05:30.753198       7 proxier.go:661] Failed to load kernel module ip_vs_wrr with modprobe. You can ignore this message when kube-proxy is running inside container without mounting /lib/modules
2021-05-16T05:05:30.753597919Z  W0516 05:05:30.753517       7 proxier.go:661] Failed to load kernel module ip_vs_sh with modprobe. You can ignore this message when kube-proxy is running inside container without mounting /lib/modules
2021-05-16T05:05:30.753896991Z  W0516 05:05:30.753828       7 proxier.go:661] Failed to load kernel module nf_conntrack with modprobe. You can ignore this message when kube-proxy is running inside container without mounting /lib/modules
2021-05-16T05:05:30.756976987Z  time="2021-05-16T05:05:30.756939289Z" level=info msg="Waiting for node k3d-vald-cluster-server-0 CIDR not assigned yet"
2021-05-16T05:05:30.757863082Z  E0516 05:05:30.757804       7 node.go:161] Failed to retrieve node info: nodes "k3d-vald-cluster-server-0" is forbidden: User "system:kube-proxy" cannot get resource "nodes" in API group "" at the cluster scope
2021-05-16T05:05:30.766593282Z  I0516 05:05:30.766529       7 dynamic_cafile_content.go:167] Starting client-ca-bundle::/var/lib/rancher/k3s/agent/client-ca.crt
2021-05-16T05:05:30.766603912Z  W0516 05:05:30.766573       7 manager.go:159] Cannot detect current cgroup on cgroup v2
2021-05-16T05:05:30.826258097Z  I0516 05:05:30.826155       7 dynamic_cafile_content.go:167] Starting request-header::/var/lib/rancher/k3s/server/tls/request-header-ca.crt
2021-05-16T05:05:30.826278544Z  I0516 05:05:30.826163       7 dynamic_cafile_content.go:167] Starting client-ca-bundle::/var/lib/rancher/k3s/server/tls/client-ca.crt
2021-05-16T05:05:30.826402499Z  I0516 05:05:30.826374       7 dynamic_serving_content.go:130] Starting serving-cert::/var/lib/rancher/k3s/server/tls/serving-kube-apiserver.crt::/var/lib/rancher/k3s/server/tls/serving-kube-apiserver.key
2021-05-16T05:05:30.826668880Z  I0516 05:05:30.826616       7 secure_serving.go:197] Serving securely on 127.0.0.1:6444
2021-05-16T05:05:30.826678699Z  I0516 05:05:30.826636       7 autoregister_controller.go:141] Starting autoregister controller
2021-05-16T05:05:30.826681363Z  I0516 05:05:30.826640       7 cache.go:32] Waiting for caches to sync for autoregister controller
2021-05-16T05:05:30.826683407Z  I0516 05:05:30.826649       7 tlsconfig.go:240] Starting DynamicServingCertificateController
2021-05-16T05:05:30.826759615Z  I0516 05:05:30.826736       7 available_controller.go:475] Starting AvailableConditionController
2021-05-16T05:05:30.826764180Z  I0516 05:05:30.826742       7 cache.go:32] Waiting for caches to sync for AvailableConditionController controller
2021-05-16T05:05:30.826770368Z  I0516 05:05:30.826754       7 controller.go:83] Starting OpenAPI AggregationController
2021-05-16T05:05:30.826776183Z  I0516 05:05:30.826761       7 apf_controller.go:261] Starting API Priority and Fairness config controller
2021-05-16T05:05:30.826937152Z  I0516 05:05:30.826838       7 dynamic_serving_content.go:130] Starting aggregator-proxy-cert::/var/lib/rancher/k3s/server/tls/client-auth-proxy.crt::/var/lib/rancher/k3s/server/tls/client-auth-proxy.key
2021-05-16T05:05:30.826942590Z  I0516 05:05:30.826867       7 apiservice_controller.go:97] Starting APIServiceRegistrationController
2021-05-16T05:05:30.826945390Z  I0516 05:05:30.826872       7 cache.go:32] Waiting for caches to sync for APIServiceRegistrationController controller
2021-05-16T05:05:30.826970033Z  I0516 05:05:30.826937       7 crdregistration_controller.go:111] Starting crd-autoregister controller
2021-05-16T05:05:30.826978302Z  I0516 05:05:30.826945       7 shared_informer.go:240] Waiting for caches to sync for crd-autoregister
2021-05-16T05:05:30.827001376Z  I0516 05:05:30.826966       7 customresource_discovery_controller.go:209] Starting DiscoveryController
2021-05-16T05:05:30.827061951Z  I0516 05:05:30.827035       7 controller.go:86] Starting OpenAPI controller
2021-05-16T05:05:30.827105498Z  I0516 05:05:30.827068       7 cluster_authentication_trust_controller.go:440] Starting cluster_authentication_trust_controller controller
2021-05-16T05:05:30.827117880Z  I0516 05:05:30.827084       7 shared_informer.go:240] Waiting for caches to sync for cluster_authentication_trust_controller
2021-05-16T05:05:30.827144380Z  I0516 05:05:30.827123       7 dynamic_cafile_content.go:167] Starting client-ca-bundle::/var/lib/rancher/k3s/server/tls/client-ca.crt
2021-05-16T05:05:30.827176146Z  I0516 05:05:30.827143       7 dynamic_cafile_content.go:167] Starting request-header::/var/lib/rancher/k3s/server/tls/request-header-ca.crt
2021-05-16T05:05:30.827208563Z  I0516 05:05:30.827073       7 naming_controller.go:291] Starting NamingConditionController
2021-05-16T05:05:30.827256586Z  I0516 05:05:30.827233       7 establishing_controller.go:76] Starting EstablishingController
2021-05-16T05:05:30.827308442Z  I0516 05:05:30.827283       7 nonstructuralschema_controller.go:192] Starting NonStructuralSchemaConditionController
2021-05-16T05:05:30.827353842Z  I0516 05:05:30.827330       7 apiapproval_controller.go:186] Starting KubernetesAPIApprovalPolicyConformantConditionController
2021-05-16T05:05:30.827395085Z  I0516 05:05:30.827374       7 crd_finalizer.go:266] Starting CRDFinalizer
2021-05-16T05:05:30.831179268Z  time="2021-05-16T05:05:30.831125056Z" level=info msg="Running cloud-controller-manager with args --profiling=false"
2021-05-16T05:05:30.833689279Z  I0516 05:05:30.833640       7 controllermanager.go:141] Version: v1.20.6+k3s1
2021-05-16T05:05:30.840625877Z  E0516 05:05:30.840565       7 controller.go:156] Unable to remove old endpoints from kubernetes service: no master IPs were listed in storage, refusing to erase all endpoints for the kubernetes service
2021-05-16T05:05:30.861097073Z  I0516 05:05:30.860948       7 shared_informer.go:247] Caches are synced for node_authorizer
2021-05-16T05:05:30.927147030Z  I0516 05:05:30.926809       7 cache.go:39] Caches are synced for autoregister controller
2021-05-16T05:05:30.927212193Z  I0516 05:05:30.926844       7 apf_controller.go:266] Running API Priority and Fairness config worker
2021-05-16T05:05:30.927229303Z  I0516 05:05:30.926892       7 cache.go:39] Caches are synced for AvailableConditionController controller
2021-05-16T05:05:30.927240386Z  I0516 05:05:30.926923       7 cache.go:39] Caches are synced for APIServiceRegistrationController controller
2021-05-16T05:05:30.927394395Z  I0516 05:05:30.927149       7 shared_informer.go:247] Caches are synced for crd-autoregister
2021-05-16T05:05:30.927418697Z  I0516 05:05:30.927158       7 shared_informer.go:247] Caches are synced for cluster_authentication_trust_controller
2021-05-16T05:05:31.779966985Z  I0516 05:05:31.779715       7 node.go:172] Successfully retrieved node IP: 192.168.9.2
2021-05-16T05:05:31.780031732Z  I0516 05:05:31.779791       7 server_others.go:143] kube-proxy node IP is an IPv4 address (192.168.9.2), assume IPv4 operation
2021-05-16T05:05:31.781659505Z  I0516 05:05:31.781468       7 server_others.go:186] Using iptables Proxier.
2021-05-16T05:05:31.782141843Z  I0516 05:05:31.781934       7 server.go:650] Version: v1.20.6+k3s1
2021-05-16T05:05:31.783060048Z  I0516 05:05:31.782851       7 conntrack.go:103] Set sysctl 'net/netfilter/nf_conntrack_max' to 524288
2021-05-16T05:05:31.783130013Z  F0516 05:05:31.782902       7 server.go:495] open /proc/sys/net/netfilter/nf_conntrack_max: permission denied

Which OS & Architecture

OS: Arch Linux x86_64
Host: 20QTCTO1WW ThinkPad P1 Gen 2
Kernel: 5.12.3-zen1-1-zen
Shell: zsh 5.8
CPU: Intel i9-9880H (16) @ 4.800GHz
GPU: Intel UHD Graphics 630
GPU: NVIDIA Quadro T2000 Mobile / Max-Q
Memory: 8385MiB / 63983MiB

Which version of k3d

k3d version

k3d version v4.4.3
k3s version v1.20.6-k3s1 (default)

Which version of docker

docker version

Client: Docker Engine - Community
 Version:           20.10.0-rc2
 API version:       1.41
 Go version:        go1.13.15
 Git commit:        dca98e3
 Built:             Wed Dec  2 22:15:25 2020
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server:
 Engine:
  Version:          20.10.6
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.16.3
  Git commit:       8728dd246c
  Built:            Mon Apr 12 14:10:25 2021
  OS/Arch:          linux/amd64
  Experimental:     true
 containerd:
  Version:          v1.5.0
  GitCommit:        8c906ff108ac28da23f69cc7b74f8e7a470d1df0.m
 runc:
  Version:          1.0.0-rc94
  GitCommit:        2c7861bc5e1b3e756392236553ec14a78a09f8bf
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

docker info

Client:
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Build with BuildKit (Docker Inc., v0.5.1)
  compose: Docker Compose (Docker Inc., 2.0.0-beta.1)

Server:
 Containers: 8
  Running: 2
  Paused: 0
  Stopped: 6
 Images: 34
 Server Version: 20.10.6
 Storage Driver: overlay2
  Backing Filesystem: xfs
  Supports d_type: true
  Native Overlay Diff: false
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: runsc runu io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 8c906ff108ac28da23f69cc7b74f8e7a470d1df0.m
 runc version: 2c7861bc5e1b3e756392236553ec14a78a09f8bf
 init version: de40ad0
 Security Options:
  seccomp
   Profile: default
  cgroupns
 Kernel Version: 5.12.3-zen1-1-zen
 Operating System: Arch Linux
 OSType: linux
 Architecture: x86_64
 CPUs: 16
 Total Memory: 62.48GiB
 Name: archpango
 ID: VRVW:GWHR:MEGI:BZLM:CPYT:RLPQ:OGWK:Q5KP:UM4E:4ERY:CSDK:K5YQ
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: true
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: true
 Default Address Pools:
   Base: 192.168.0.0/16, Size: 24
kpango added the bug (Something isn't working) label on May 16, 2021
kpango changed the title from "[BUG] multi node k3d fails on Kernel 5.12.3" to "[BUG] multi node k3d fails on linux kernel 5.12.3" on May 16, 2021

yann2192 commented May 17, 2021

Same error on kernel 5.12.4

Quick workaround: kubernetes-sigs/kind#2240 (comment)

@iwilltry42 (Member)

Hi @kpango, thanks for opening this issue and especially for already doing some investigation 👍
A similar issue was already mentioned in #604 (closed in favor of this one).
Let's see what we can do here 👍

iwilltry42 added the k3s (This is likely an issue with k3s not k3d itself) label on May 17, 2021

iwilltry42 commented May 17, 2021

I'm currently setting up a test system, but I guess that, given the workaround mentioned in the kind issue, this could work here:

k3d cluster create --k3s-server-arg "--kube-proxy-arg=conntrack-max-per-core=0" --k3s-agent-arg "--kube-proxy-arg=conntrack-max-per-core=0"

References: https://rancher.com/docs/k3s/latest/en/installation/install-options/agent-config/ & https://kubernetes.io/docs/reference/command-line-tools-reference/kube-proxy/ & kubernetes-sigs/kind#2241
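
Adapted to the cluster from the original report, that would be something like (untested, simply combining the create command from the top of this issue with the two flags above):

k3d cluster create "vald-cluster" -p "8081:80@loadbalancer" --agents 5 \
  --k3s-server-arg "--kube-proxy-arg=conntrack-max-per-core=0" \
  --k3s-agent-arg "--kube-proxy-arg=conntrack-max-per-core=0"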

iwilltry42 changed the title from "[BUG] multi node k3d fails on linux kernel 5.12.3" to "[BUG] failing on linux kernel >= 5.12.2" on May 17, 2021

iwilltry42 commented May 17, 2021

Just finished my test setup:

  • Command: k3d cluster create --k3s-server-arg "--kube-proxy-arg=conntrack-max-per-core=0" --k3s-agent-arg "--kube-proxy-arg=conntrack-max-per-core=0" --image rancher/k3s:v1.20.6-k3s1 (before: export K3D_FIX_CGROUPV2=true as the system is on cgroupv2)
  • OS: openSUSE Tumbleweed
  • Kernel: 5.12.3-1-default
  • Docker: 20.10.6-ce
  • k3d: v4.4.3
  • k3s: v1.20.6-k3s1

This works just fine 👍

@nemonik can you try this please?

@FedericoAntoniazzi

Hi, I just faced this issue and the solution proposed by @iwilltry42 is working perfectly!

  • OS: Arch Linux
  • Kernel: 5.12.4
  • Docker: 20.10.6
  • k3d: v4.4.3
  • k3s: v1.20.6-k3s1

@yann2192

@iwilltry42 thanks for the workaround, it works well on Arch.

  • OS: Arch Linux
  • Kernel: 5.12.4
  • Docker: 20.10.6
  • k3d: v4.4.3
  • k3s: v1.20.6-k3s1


kpango commented May 18, 2021

@iwilltry42 it works with the following command:

k3d cluster create --servers 1 --k3s-server-arg "--kube-proxy-arg=conntrack-max-per-core=0" --k3s-agent-arg "--kube-proxy-arg=conntrack-max-per-core=0"

but if we use agents, the server container is still restarting frequently:

k3d cluster create --servers 1 --agents 5 --k3s-server-arg "--kube-proxy-arg=conntrack-max-per-core=0" --k3s-agent-arg "--kube-proxy-arg=conntrack-max-per-core=0"

k3d output

$ k3d cluster create --servers 1 --agents 5 --k3s-server-arg "--kube-proxy-arg=conntrack-max-per-core=0" --k3s-agent-arg "--kube-proxy-arg=conntrack-max-per-core=0"
INFO[0000] Prep: Network
INFO[0000] Re-using existing network 'k3d-k3s-default' (b8b69b90908fd823a253793aba7d4b707f0d6c8de1ead16733f332822e03f697)
INFO[0000] Created volume 'k3d-k3s-default-images'
INFO[0001] Creating node 'k3d-k3s-default-server-0'
INFO[0001] Creating node 'k3d-k3s-default-agent-0'
INFO[0001] Creating node 'k3d-k3s-default-agent-1'
INFO[0001] Creating node 'k3d-k3s-default-agent-2'
INFO[0001] Creating node 'k3d-k3s-default-agent-3'
INFO[0001] Creating node 'k3d-k3s-default-agent-4'
INFO[0001] Creating LoadBalancer 'k3d-k3s-default-serverlb'
INFO[0001] Starting cluster 'k3s-default'
INFO[0001] Starting servers...
INFO[0001] Starting Node 'k3d-k3s-default-server-0'
INFO[0006] Starting agents...
INFO[0006] Starting Node 'k3d-k3s-default-agent-0'
WARN[0389] Node 'k3d-k3s-default-agent-0' is restarting for more than a minute now. Possibly it will recover soon (e.g. when it's waiting to join). Consider using a creation timeout to avoid waiting forever in a Restart Loop.

docker ps

CONTAINER ID   IMAGE                      COMMAND                  CREATED          STATUS                            PORTS     NAMES
8f913ad43059   rancher/k3s:v1.20.6-k3s1   "/bin/k3s agent --ku…"   10 minutes ago   Up 2 minutes                                k3d-k3s-default-agent-0
b9af47763c3a   rancher/k3s:v1.20.6-k3s1   "/bin/k3s server --k…"   10 minutes ago   Restarting (255) 20 seconds ago             k3d-k3s-default-server-0

docker's error log

docker logs --details -t b9af47763c3a &> error.log

error.log

OS: Arch Linux
Kernel: 5.12.4
Docker: 20.10.6
k3d: v4.4.3
k3s: v1.20.6-k3s1 (default)

@iwilltry42 (Member)

Closing this, as it will be fixed upstream (in k3s), thanks to @brandond 🙏

rancherio-gh-m pushed a commit that referenced this issue May 20, 2021
Author: iwilltry42 <iwilltry42@gmail.com>
Date:   Thu May 20 07:57:22 2021 +0200

    docs: add faq entry on 'nf_conntrack_max: permission denied' issue (#607)
iwilltry42 changed the title from "[BUG] failing on linux kernel >= 5.12.2" to "[BUG] failing on linux kernel >= 5.12.2 (and >= 5.11.19, ...)" on May 20, 2021
@iwilltry42 (Member)

I've added a new FAQ entry on this issue: https://k3d.io/faq/faq/#nodes-fail-to-start-or-get-stuck-in-notready-state-with-log-nf_conntrack_max-permission-denied

Also, thanks to #612 we quickly confirmed that other kernel lines are affected as well, e.g. 5.11 as of 5.11.19.


nemonik commented Jun 12, 2021

Just finished my test setup:

  • Command: k3d cluster create --k3s-server-arg "--kube-proxy-arg=conntrack-max-per-core=0" --k3s-agent-arg "--kube-proxy-arg=conntrack-max-per-core=0" --image rancher/k3s:v1.20.6-k3s1 (before: export K3D_FIX_CGROUPV2=true as the system is on cgroupv2)
  • OS: openSUSE Tumbleweed
  • Kernel: 5.12.3-1-default
  • Docker: 20.10.6-ce
  • k3d: v4.4.3
  • k3s: v1.20.6-k3s1

This works just fine

@nemonik can you try this please?

So, I finally got to the point where I could press forward with this... Creating a cluster straight away with k3d on Arch running 5.12.10-arch1-1 still doesn't work with k3d installed from the AUR.

⋊> ⨯ k3d --version   
k3d version v4.4.4
k3s version v1.20.6-k3s1 (default)

So I did as you asked, either modifying the command to use the latest k3s Docker image or just stripping out the image param, like so:

k3d cluster create ${k3d_cluster_name}  --k3s-server-arg "--kube-proxy-arg=conntrack-max-per-core=0" \
    --k3s-agent-arg "--kube-proxy-arg=conntrack-max-per-core=0" --image rancher/k3s:v1.21.1-k3s1 \
    --api-port 127.0.0.1:6443 -p 80:80@loadbalancer -p 443:443@loadbalancer \
    -p 2022:2022@loadbalancer --k3s-server-arg "--no-deploy=traefik" --registry-use ${registry_name}:${registry_port} \
    --servers ${k3d_server_count} --agents ${k3d_agent_count} 

or

k3d cluster create ${k3d_cluster_name}  --k3s-server-arg "--kube-proxy-arg=conntrack-max-per-core=0" \
    --k3s-agent-arg "--kube-proxy-arg=conntrack-max-per-core=0" \
    --api-port 127.0.0.1:6443 -p 80:80@loadbalancer -p 443:443@loadbalancer \
    -p 2022:2022@loadbalancer --k3s-server-arg "--no-deploy=traefik" --registry-use ${registry_name}:${registry_port} \
    --servers ${k3d_server_count} --agents ${k3d_agent_count} 

And I got it working as you advised...

@iwilltry42 Thank you for the fix. Sorry it took so long to try it.

But is this still the advised way (as per the FAQ), or has it been superseded over the past twenty-some days?

@iwilltry42 (Member)

Thanks for the reply @nemonik 👍

Creating a cluster straight away with k3d on Arch running 5.12.10-arch1-1 still doesn't work using k3d installed out of AUR.

k3d v4.4.4 uses k3s v1.20.6 by default (that's hardcoded at build time), so that won't work.

So, I did as you asked modifying for the latest k3s docker image or by just stripping out the image param like so ...

When setting the --image flag to one of the newer k3s versions (which include the k3s-side fix referenced earlier in this thread), you don't need the kube-proxy-arg flags anymore 👍

Also, v4.4.5 is just being released (there were issues with the release system delaying it), which will use one of the newer k3s versions by default (so neither the --image flag nor the kube-proxy flags are required).
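
For example, a sketch using one of the newer images, e.g. rancher/k3s:v1.21.1-k3s1 as used by @nemonik above (the cluster name is just a placeholder), should be enough on affected kernels:

k3d cluster create test-cluster --agents 5 --image rancher/k3s:v1.21.1-k3s1

With k3d v4.4.5+ the --image flag should no longer be needed either.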


nemonik commented Jun 13, 2021

When setting the --image flag with one of the newer k3s versions (that include the fix referenced from k3s earlier in this thread), you don't need the kube-proxy-arg

Do you still need the conntrack-max-per-core=0 args to be passed to the server and agent or do they go too?

I will watch for k3d v4.4.5 to drop in AUR

Thanks for the reply.

@iwilltry42 (Member)

Do you still need the conntrack-max-per-core=0 args to be passed to the server and agent or do they go too?

Not with the new versions of k3s (v1.21.1-k3s1 is the new default in k3d v4.4.5 and includes the fix).
I'll update the FAQ accordingly.


nemonik commented Jun 13, 2021

Thank you very much, @iwilltry42


nemonik commented Jun 14, 2021

k3d v4.4.5 hit AUR last night...

@Jean-Baptiste-Lasselle

Note that I experienced the same issue with lower versions of the Linux kernel.

My stack versions:

$ ./versions.sh 
Client: Docker Engine - Community
 Version:           20.10.8
 API version:       1.41
 Go version:        go1.16.6
 Git commit:        3967b7d
 Built:             Fri Jul 30 19:54:22 2021
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.8
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.16.6
  Git commit:       75249d8
  Built:            Fri Jul 30 19:52:31 2021
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.4.9
  GitCommit:        e25210fe30a0a703442421b0f60afac609f950a3
 runc:
  Version:          1.0.1
  GitCommit:        v1.0.1-0-g4144b63
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0
k3d version v4.4.1
k3s version v1.20.5-k3s1 (default)
go version go1.17 linux/amd64
hugo v0.83.1+extended linux/amd64 BuildDate=unknown
flux version 0.17.0

The error message that I get for all k3s server nodes:

k3s server open /proc/sys/net/netfilter/nf_conntrack_max: permission denied

And I found:

jbl@fluxcd:~$ uname -r
5.10.0-8-amd64
jbl@fluxcd:~$ cat /proc/version
Linux version 5.10.0-8-amd64 (debian-kernel@lists.debian.org) (gcc-10 (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #1 SMP Debian 5.10.46-4 (2021-08-03)

@brandond

Yes, some Linux distributions backported the kernel change into earlier kernel versions; in mainline Linux it only appeared as of 5.12.2 and 5.11.19. You'd have to check with the Debian folks to see which releases they backported it to. Either way, there's a fix available: just use a newer k3s.
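
For a quick empirical check (a rough, untested sketch: it repeats the write kube-proxy performs, from a throwaway container network namespace, writing the current value back so nothing is actually changed):

docker run --rm --privileged busybox sh -c 'sysctl -w net.netfilter.nf_conntrack_max=$(sysctl -n net.netfilter.nf_conntrack_max)'

If that write is denied, the running kernel carries the change and you need a k3s version with the fix (or the conntrack-max-per-core=0 workaround from earlier in this thread).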

@Jean-Baptiste-Lasselle

@brandond Oh, thanks very much for the information :)

@Jean-Baptiste-Lasselle

I confirm that upgrading to k3d 4.4.5 solved the issue for me, thanks all :)
