Node not ready: container runtime is down #9984

Closed
tjwallace opened this issue Dec 17, 2024 · 22 comments · Fixed by siderolabs/pkgs#1128
Comments

@tjwallace

Bug Report

Description

After upgrading to Talos 1.9.0 some of my nodes are never ready.

Logs

$ talosctl logs -n mystery containerd
mystery: {"level":"info","msg":"starting containerd","revision":"88aa2f531d6c2922003cc7929e51daf1c14caa0a","time":"2024-12-17T21:59:02.922801527Z","version":"v2.0.1"}
mystery: {"id":"io.containerd.image-verifier.v1.bindir","level":"info","msg":"loading plugin","time":"2024-12-17T21:59:02.938434957Z","type":"io.containerd.image-verifier.v1"}
mystery: {"id":"io.containerd.warning.v1.deprecations","level":"info","msg":"loading plugin","time":"2024-12-17T21:59:02.938616550Z","type":"io.containerd.warning.v1"}
mystery: {"id":"io.containerd.content.v1.content","level":"info","msg":"loading plugin","time":"2024-12-17T21:59:02.938669420Z","type":"io.containerd.content.v1"}
mystery: {"id":"io.containerd.snapshotter.v1.native","level":"info","msg":"loading plugin","time":"2024-12-17T21:59:02.938760243Z","type":"io.containerd.snapshotter.v1"}
mystery: {"id":"io.containerd.snapshotter.v1.overlayfs","level":"info","msg":"loading plugin","time":"2024-12-17T21:59:02.938877570Z","type":"io.containerd.snapshotter.v1"}
mystery: {"id":"io.containerd.event.v1.exchange","level":"info","msg":"loading plugin","time":"2024-12-17T21:59:02.939509134Z","type":"io.containerd.event.v1"}
mystery: {"id":"io.containerd.monitor.task.v1.cgroups","level":"info","msg":"loading plugin","time":"2024-12-17T21:59:02.940531592Z","type":"io.containerd.monitor.task.v1"}
mystery: {"id":"io.containerd.metadata.v1.bolt","level":"info","msg":"loading plugin","time":"2024-12-17T21:59:02.941341137Z","type":"io.containerd.metadata.v1"}
mystery: {"level":"info","msg":"metadata content store policy set","policy":"shared","time":"2024-12-17T21:59:02.941474037Z"}
mystery: {"id":"io.containerd.gc.v1.scheduler","level":"info","msg":"loading plugin","time":"2024-12-17T21:59:02.944835061Z","type":"io.containerd.gc.v1"}
mystery: {"id":"io.containerd.differ.v1.walking","level":"info","msg":"loading plugin","time":"2024-12-17T21:59:02.944938855Z","type":"io.containerd.differ.v1"}
mystery: {"id":"io.containerd.lease.v1.manager","level":"info","msg":"loading plugin","time":"2024-12-17T21:59:02.945028058Z","type":"io.containerd.lease.v1"}
mystery: {"id":"io.containerd.service.v1.containers-service","level":"info","msg":"loading plugin","time":"2024-12-17T21:59:02.945493666Z","type":"io.containerd.service.v1"}
mystery: {"id":"io.containerd.service.v1.content-service","level":"info","msg":"loading plugin","time":"2024-12-17T21:59:02.945545162Z","type":"io.containerd.service.v1"}
mystery: {"id":"io.containerd.service.v1.diff-service","level":"info","msg":"loading plugin","time":"2024-12-17T21:59:02.945572516Z","type":"io.containerd.service.v1"}
mystery: {"id":"io.containerd.service.v1.images-service","level":"info","msg":"loading plugin","time":"2024-12-17T21:59:02.945599569Z","type":"io.containerd.service.v1"}
mystery: {"id":"io.containerd.service.v1.introspection-service","level":"info","msg":"loading plugin","time":"2024-12-17T21:59:02.945629359Z","type":"io.containerd.service.v1"}
mystery: {"id":"io.containerd.service.v1.namespaces-service","level":"info","msg":"loading plugin","time":"2024-12-17T21:59:02.945706696Z","type":"io.containerd.service.v1"}
mystery: {"id":"io.containerd.service.v1.snapshots-service","level":"info","msg":"loading plugin","time":"2024-12-17T21:59:02.945762612Z","type":"io.containerd.service.v1"}
mystery: {"id":"io.containerd.shim.v1.manager","level":"info","msg":"loading plugin","time":"2024-12-17T21:59:02.945789586Z","type":"io.containerd.shim.v1"}
mystery: {"id":"io.containerd.runtime.v2.task","level":"info","msg":"loading plugin","time":"2024-12-17T21:59:02.945817723Z","type":"io.containerd.runtime.v2"}
mystery: {"id":"io.containerd.service.v1.tasks-service","level":"info","msg":"loading plugin","time":"2024-12-17T21:59:02.945986952Z","type":"io.containerd.service.v1"}
mystery: {"id":"io.containerd.grpc.v1.containers","level":"info","msg":"loading plugin","time":"2024-12-17T21:59:02.946076297Z","type":"io.containerd.grpc.v1"}
mystery: {"id":"io.containerd.grpc.v1.content","level":"info","msg":"loading plugin","time":"2024-12-17T21:59:02.946124777Z","type":"io.containerd.grpc.v1"}
mystery: {"id":"io.containerd.grpc.v1.diff","level":"info","msg":"loading plugin","time":"2024-12-17T21:59:02.946164550Z","type":"io.containerd.grpc.v1"}
mystery: {"id":"io.containerd.grpc.v1.events","level":"info","msg":"loading plugin","time":"2024-12-17T21:59:02.946240470Z","type":"io.containerd.grpc.v1"}
mystery: {"id":"io.containerd.grpc.v1.images","level":"info","msg":"loading plugin","time":"2024-12-17T21:59:02.946266250Z","type":"io.containerd.grpc.v1"}
mystery: {"id":"io.containerd.grpc.v1.introspection","level":"info","msg":"loading plugin","time":"2024-12-17T21:59:02.946355207Z","type":"io.containerd.grpc.v1"}
mystery: {"id":"io.containerd.grpc.v1.leases","level":"info","msg":"loading plugin","time":"2024-12-17T21:59:02.946421830Z","type":"io.containerd.grpc.v1"}
mystery: {"id":"io.containerd.grpc.v1.namespaces","level":"info","msg":"loading plugin","time":"2024-12-17T21:59:02.946448167Z","type":"io.containerd.grpc.v1"}
mystery: {"id":"io.containerd.grpc.v1.snapshots","level":"info","msg":"loading plugin","time":"2024-12-17T21:59:02.946479657Z","type":"io.containerd.grpc.v1"}
mystery: {"id":"io.containerd.streaming.v1.manager","level":"info","msg":"loading plugin","time":"2024-12-17T21:59:02.946577011Z","type":"io.containerd.streaming.v1"}
mystery: {"id":"io.containerd.grpc.v1.streaming","level":"info","msg":"loading plugin","time":"2024-12-17T21:59:02.946649773Z","type":"io.containerd.grpc.v1"}
mystery: {"id":"io.containerd.grpc.v1.tasks","level":"info","msg":"loading plugin","time":"2024-12-17T21:59:02.946692184Z","type":"io.containerd.grpc.v1"}
mystery: {"id":"io.containerd.transfer.v1.local","level":"info","msg":"loading plugin","time":"2024-12-17T21:59:02.946803084Z","type":"io.containerd.transfer.v1"}
mystery: {"id":"io.containerd.grpc.v1.transfer","level":"info","msg":"loading plugin","time":"2024-12-17T21:59:02.946880594Z","type":"io.containerd.grpc.v1"}
mystery: {"id":"io.containerd.grpc.v1.version","level":"info","msg":"loading plugin","time":"2024-12-17T21:59:02.946918504Z","type":"io.containerd.grpc.v1"}
mystery: {"id":"io.containerd.ttrpc.v1.otelttrpc","level":"info","msg":"loading plugin","time":"2024-12-17T21:59:02.946943361Z","type":"io.containerd.ttrpc.v1"}
mystery: {"id":"io.containerd.grpc.v1.healthcheck","level":"info","msg":"loading plugin","time":"2024-12-17T21:59:02.946972981Z","type":"io.containerd.grpc.v1"}
mystery: {"id":"io.containerd.podsandbox.controller.v1.podsandbox","level":"info","msg":"loading plugin","time":"2024-12-17T21:59:02.947000311Z","type":"io.containerd.podsandbox.controller.v1"}
mystery: {"error":"unable to init client for podsandbox: failed to get \"io.containerd.sandbox.store.v1\" plugin: no plugins registered for io.containerd.sandbox.store.v1: plugin: not found","id":"io.containerd.podsandbox.controller.v1.podsandbox","level":"warning","msg":"failed to load plugin","time":"2024-12-17T21:59:02.947230681Z","type":"io.containerd.podsandbox.controller.v1"}
mystery: {"address":"/system/run/containerd/containerd.sock.ttrpc","level":"info","msg":"serving...","time":"2024-12-17T21:59:02.948070189Z"}
mystery: {"address":"/system/run/containerd/containerd.sock","level":"info","msg":"serving...","time":"2024-12-17T21:59:02.948277202Z"}
mystery: {"level":"info","msg":"containerd successfully booted in 0.028911s","time":"2024-12-17T21:59:02.948326423Z"}
mystery: {"address":"unix:///run/containerd/s/814eeaf75962b52d1228c72642b2a1e34f6ad18299621a2bc0a0d887e86f9db6","level":"info","msg":"connecting to shim apid","namespace":"system","protocol":"ttrpc","time":"2024-12-17T21:59:14.675543337Z","version":3}
$ kubectl describe node ...
...
Conditions:
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                 ------  -----------------                 ------------------                ------                       -------
  NetworkUnavailable   False   Fri, 12 Jul 2024 13:30:08 -0700   Fri, 12 Jul 2024 13:30:08 -0700   CiliumIsUp                   Cilium is running on this node
  MemoryPressure       False   Tue, 17 Dec 2024 15:16:36 -0800   Tue, 17 Dec 2024 15:08:06 -0800   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         False   Tue, 17 Dec 2024 15:16:36 -0800   Tue, 17 Dec 2024 15:08:06 -0800   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure          False   Tue, 17 Dec 2024 15:16:36 -0800   Tue, 17 Dec 2024 15:08:06 -0800   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                False   Tue, 17 Dec 2024 15:16:36 -0800   Tue, 17 Dec 2024 15:08:57 -0800   KubeletNotReady              container runtime is down

Environment

  • Talos version: (talosctl version output below)
$ talosctl version
Client:
	Tag:         v1.9.0
	SHA:         3cb25ceb
	Built:
	Go version:  go1.23.4
	OS/Arch:     darwin/arm64
Server:
	NODE:        192.168.10.10
	Tag:         v1.9.0
	SHA:         3cb25ceb
	Built:
	Go version:  go1.23.4
	OS/Arch:     linux/amd64
	Enabled:     RBAC
	NODE:        192.168.10.12
	Tag:         v1.9.0
	SHA:         3cb25ceb
	Built:
	Go version:  go1.23.4
	OS/Arch:     linux/amd64
	Enabled:     RBAC
	NODE:        192.168.10.11
	Tag:         v1.9.0
	SHA:         3cb25ceb
	Built:
	Go version:  go1.23.4
	OS/Arch:     linux/amd64
	Enabled:     RBAC
	NODE:        192.168.10.20
	Tag:         v1.9.0
	SHA:         3cb25ceb
	Built:
	Go version:  go1.23.4
	OS/Arch:     linux/amd64
	Enabled:     RBAC
  • Kubernetes version: v1.32.0
@tjwallace
Author

I was having the same problem as in #9980; I applied the suggested fix of using the Talos discovery service, but I still have the same problem.

@buroa

buroa commented Dec 18, 2024

I am seeing this as well. I did not see it on the v1.9.0 betas.

@smira
Member

smira commented Dec 18, 2024

Please supply a talosctl support bundle.

You should be looking into talosctl logs cri.

If you have any containerd/CRI config customizations, make sure they were updated for the containerd 2.0 configuration format, but that should have failed on Talos 1.8 as well.
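
For reference, a minimal sketch of the commands being asked for here (node names are placeholders):

$ talosctl support -n <problematic-node>   # produces a support.zip bundle
$ talosctl logs -n <problematic-node> cri  # logs of the CRI-flavored containerd service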

@jalim

jalim commented Dec 18, 2024

I seem to be experiencing the same issue; at the moment it only affects one of 4 machines, all installed on similar bare-metal hardware. The node seems to initially come online and report Ready, only to report NotReady shortly thereafter. I have attached a support zip in case it helps.

support.zip

@buroa

buroa commented Dec 18, 2024

support.zip

Here is mine as well. It just happened after a reboot.

@smira
Member

smira commented Dec 18, 2024

Not sure what's going on there, but in both support files it happens around the time the cephfs plugin is initialized.

@buroa

buroa commented Dec 18, 2024

@smira Not sure, but I doubt rook-ceph blew this up. Something changed between the beta and v1.9.0. I have the git history, and nothing changed except upgrading to v1.9.0. That's when the problem occurred.

@smira
Member

smira commented Dec 18, 2024

You can see it yourself in the logs.

We don't have any failures whatsoever in any of our tests, including Ceph. If there's a reproducer, I'm happy to verify.

@buroa

buroa commented Dec 18, 2024

I read the same logs as you. It just so happens that cephfs is the last thing that comes up. Most likely a fluke, because again, nothing changed. Maybe there is a problem with upgrading from containerd 2.0.0 to 2.0.1.

@smira
Member

smira commented Dec 18, 2024

None of our tests showed any issues, and I have read the Kubernetes source code. You might try increasing the kubelet's log verbosity with -v 9 to see what exactly goes wrong and why it moves into this state. It might be a Kubernetes bug as well.

In your log there's no clear reason why the kubelet considers the CRI to be unhealthy.
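
For anyone following along, one way to raise kubelet verbosity on Talos is through machine.kubelet.extraArgs in the machine config; a rough sketch, assuming that field and a placeholder node address:

$ talosctl -n <node> edit machineconfig
# then, in the editor, add under machine.kubelet:
#   extraArgs:
#     v: "9"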

@buroa

buroa commented Dec 18, 2024

I changed the kubelet to -v=9; here is the support.zip.

support.zip

@caycehouse

caycehouse commented Dec 18, 2024

I’m facing the same issue. On Talos 1.8.4, CRI worked fine, but after upgrading to 1.9.0, 2 of my 3 nodes go ‘Ready’ for 30-60 seconds before switching to ‘Not Ready.’
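
A quick way to watch that flapping and pull the Ready condition's message from the API server (node name is a placeholder):

$ kubectl get nodes -w
$ kubectl get node <node-name> -o jsonpath='{.status.conditions[?(@.type=="Ready")].message}'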

@smira
Member

smira commented Dec 18, 2024

Which Kubernetes version is everyone having issues using?

@buroa

buroa commented Dec 18, 2024

@smira I was experiencing the issue on both v1.31.4 and v1.32.0. I was editing the machine config to go back and forth while debugging.

@smira
Member

smira commented Dec 18, 2024

Also, is everybody using Ceph?

@tjwallace
Author

I have already reverted my cluster to 1.8.4, but here is a support bundle from before I reverted:
support.zip

@buroa

buroa commented Dec 18, 2024

I can confirm this is due to containerd v2.0.1 and having multus-cni installed. I have a PR out on multus to hopefully fix it: k8snetworkplumbingwg/multus-cni#1371.

@jonasled

I don't use multus on my cluster, but I have some nodes with the cephfs CSI driver and some nodes without. All nodes with the driver had the problem, while the control planes without cephfs worked flawlessly. On my cluster I switched all nodes back to 1.8.3, as it was the last version I had running successfully. (I never tried 1.8.4.)

@buroa

buroa commented Dec 20, 2024

I filed this for containerd: containerd/containerd#11186

@ekarlso
Contributor

ekarlso commented Dec 23, 2024

I have Cilium 1.16.5 and Talos 1.9.0.

I get this happening without Multus when going from K8s 1.30.3 -> 1.31.4.

@ekarlso
Contributor

ekarlso commented Dec 23, 2024

10.0.0.120: {"ts":1734987711195.6775,"caller":"rest/warnings.go:70","msg":"unknown field \"status.runtimeHandlers[1].features.userNamespaces\"","v":0}                                                                                                                                                                        
10.0.0.120: {"ts":1734987712058.0708,"caller":"kubelet/kubelet.go:2902","msg":"Container runtime network not ready","networkReady":"NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized"}                                                                        
10.0.0.120: {"ts":1734987714647.7485,"caller":"cache/reflector.go:561","msg":"k8s.io/client-go/informers/factory.go:160: failed to list *v1.Service: \"spec.clusterIP\" is not a known field selector: only \"metadata.name\", \"metadata.namespace\"","v":0}                                                                 
10.0.0.120: {"ts":1734987714647.7979,"logger":"UnhandledError","caller":"cache/reflector.go:158","msg":"Unhandled Error","err":"k8s.io/client-go/informers/factory.go:160: Failed to watch *v1.Service: failed to list *v1.Service: \"spec.clusterIP\" is not a known field selector: only \"metadata.name\", \"metadata.names
pace\""}                                                                                                                                                                                                                                                                                                                      
10.0.0.120: {"ts":1734987714745.3572,"logger":"UnhandledError","caller":"kuberuntime/kuberuntime_manager.go:1274","msg":"Unhandled Error","err":"init container &Container{Name:config,Image:quay.io/cilium/cilium:v1.16.5@sha256:758ca0793f5995bb938a2fa219dcce63dc0b3fa7fc4ce5cc851125281fb7361d,Command:[cilium-dbg build-c
onfig],Args:[],WorkingDir:,Ports:[]ContainerPort{},Env:[]EnvVar{EnvVar{Name:K8S_NODE_NAME,Value:,ValueFrom:&EnvVarSource{FieldRef:&ObjectFieldSelector{APIVersion:v1,FieldPath:spec.nodeName,},ResourceFieldRef:nil,ConfigMapKeyRef:nil,SecretKeyRef:nil,},},EnvVar{Name:CILIUM_K8S_NAMESPACE,Value:,ValueFrom:&EnvVarSource{F
ieldRef:&ObjectFieldSelector{APIVersion:v1,FieldPath:metadata.namespace,},ResourceFieldRef:nil,ConfigMapKeyRef:nil,SecretKeyRef:nil,},},EnvVar{Name:KUBERNETES_SERVICE_HOST,Value:127.0.0.1,ValueFrom:nil,},EnvVar{Name:KUBERNETES_SERVICE_PORT,Value:7445,ValueFrom:nil,},},Resources:ResourceRequirements{Limits:ResourceLis
t{},Requests:ResourceList{},Claims:[]ResourceClaim{},},VolumeMounts:[]VolumeMount{VolumeMount{Name:tmp,ReadOnly:false,MountPath:/tmp,SubPath:,MountPropagation:nil,SubPathExpr:,RecursiveReadOnly:nil,},VolumeMount{Name:kube-api-access-gs84d,ReadOnly:true,MountPath:/var/run/secrets/kubernetes.io/serviceaccount,SubPath:,
MountPropagation:nil,SubPathExpr:,RecursiveReadOnly:nil,},},LivenessProbe:nil,ReadinessProbe:nil,Lifecycle:nil,TerminationMessagePath:/dev/termination-log,ImagePullPolicy:IfNotPresent,SecurityContext:nil,Stdin:false,StdinOnce:false,TTY:false,EnvFrom:[]EnvFromSource{},TerminationMessagePolicy:FallbackToLogsOnError,Vol
umeDevices:[]VolumeDevice{},StartupProbe:nil,ResizePolicy:[]ContainerResizePolicy{},RestartPolicy:nil,} start failed in pod cilium-qfdl6_kube-system(216e7cce-1d3f-4f1b-a004-c260136b4043): CreateContainerConfigError: services have not yet been read at least once, cannot construct envvars"}                             
10.0.0.120: {"ts":1734987714745.553,"logger":"UnhandledError","caller":"kuberuntime/kuberuntime_manager.go:1274","msg":"Unhandled Error","err":"container &Container{Name:cilium-envoy,Image:quay.io/cilium/cilium-envoy:v1.30.8-1733837904-eaae5aca0fb988583e5617170a65ac5aa51c0aa8@sha256:709c08ade3d17d52da4ca2af33f431360e
c26268d288d9a6cd1d98acc9a1dced,Command:[/usr/bin/cilium-envoy-starter],Args:[-- -c /var/run/cilium/envoy/bootstrap-config.json --base-id 0 --log-level info --log-format [%Y-%m-%d %T.%e][%t][%l][%n] [%g:%#] %v],WorkingDir:,Ports:[]ContainerPort{ContainerPort{Name:envoy-metrics,HostPort:9964,ContainerPort:9964,Protocol
:TCP,HostIP:,},},Env:[]EnvVar{EnvVar{Name:K8S_NODE_NAME,Value:,ValueFrom:&EnvVarSource{FieldRef:&ObjectFieldSelector{APIVersion:v1,FieldPath:spec.nodeName,},ResourceFieldRef:nil,ConfigMapKeyRef:nil,SecretKeyRef:nil,},},EnvVar{Name:CILIUM_K8S_NAMESPACE,Value:,ValueFrom:&EnvVarSource{FieldRef:&ObjectFieldSelector{APIVe
rsion:v1,FieldPath:metadata.namespace,},ResourceFieldRef:nil,ConfigMapKeyRef:nil,SecretKeyRef:nil,},},EnvVar{Name:KUBERNETES_SERVICE_HOST,Value:127.0.0.1,ValueFrom:nil,},EnvVar{Name:KUBERNETES_SERVICE_PORT,Value:7445,ValueFrom:nil,},},Resources:ResourceRequirements{Limits:ResourceList{},Requests:ResourceList{},Claims
:[]ResourceClaim{},},VolumeMounts:[]VolumeMount{VolumeMount{Name:envoy-sockets,ReadOnly:false,MountPath:/var/run/cilium/envoy/sockets,SubPath:,MountPropagation:nil,SubPathExpr:,RecursiveReadOnly:nil,},VolumeMount{Name:envoy-artifacts,ReadOnly:true,MountPath:/var/run/cilium/envoy/artifacts,SubPath:,MountPropagation:ni
l,SubPathExpr:,RecursiveReadOnly:nil,},VolumeMount{Name:envoy-config,ReadOnly:true,MountPath:/var/run/cilium/envoy/,SubPath:,MountPropagation:nil,SubPathExpr:,RecursiveReadOnly:nil,},VolumeMount{Name:bpf-maps,ReadOnly:false,MountPath:/sys/fs/bpf,SubPath:,MountPropagation:*HostToContainer,SubPathExpr:,RecursiveReadOnl
y:nil,},VolumeMount{Name:kube-api-access-mn472,ReadOnly:true,MountPath:/var/run/secrets/kubernetes.io/serviceaccount,SubPath:,MountPropagation:nil,SubPathExpr:,RecursiveReadOnly:nil,},},LivenessProbe:&Probe{ProbeHandler:ProbeHandler{Exec:nil,HTTPGet:&HTTPGetAction{Path:/healthz,Port:{0 9878 },Host:127.0.0.1,Scheme:HT
TP,HTTPHeaders:[]HTTPHeader{},},TCPSocket:nil,GRPC:nil,},InitialDelaySeconds:0,TimeoutSeconds:5,PeriodSeconds:30,SuccessThreshold:1,FailureThreshold:10,TerminationGracePeriodSeconds:nil,},ReadinessProbe:&Probe{ProbeHandler:ProbeHandler{Exec:nil,HTTPGet:&HTTPGetAction{Path:/healthz,Port:{0 9878 },Host:127.0.0.1,Scheme
:HTTP,HTTPHeaders:[]HTTPHeader{},},TCPSocket:nil,GRPC:nil,},InitialDelaySeconds:0,TimeoutSeconds:5,PeriodSeconds:30,SuccessThreshold:1,FailureThreshold:3,TerminationGracePeriodSeconds:nil,},Lifecycle:nil,TerminationMessagePath:/dev/termination-log,ImagePullPolicy:IfNotPresent,SecurityContext:&SecurityContext{Capabili
ties:&Capabilities{Add:[NET_ADMIN SYS_ADMIN],Drop:[ALL],},Privileged:nil,SELinuxOptions:&SELinuxOptions{User:,Role:,Type:spc_t,Level:s0,},RunAsUser:nil,RunAsNonRoot:nil,ReadOnlyRootFilesystem:nil,AllowPrivilegeEscalation:nil,RunAsGroup:nil,ProcMount:nil,WindowsOptions:nil,SeccompProfile:nil,AppArmorProfile:nil,},Stdi
n:false,StdinOnce:false,TTY:false,EnvFrom:[]EnvFromSource{},TerminationMessagePolicy:FallbackToLogsOnError,VolumeDevices:[]VolumeDevice{},StartupProbe:&Probe{ProbeHandler:ProbeHandler{Exec:nil,HTTPGet:&HTTPGetAction{Path:/healthz,Port:{0 9878 },Host:127.0.0.1,Scheme:HTTP,HTTPHeaders:[]HTTPHeader{},},TCPSocket:nil,GRP
C:nil,},InitialDelaySeconds:5,TimeoutSeconds:1,PeriodSeconds:2,SuccessThreshold:1,FailureThreshold:105,TerminationGracePeriodSeconds:nil,},ResizePolicy:[]ContainerResizePolicy{},RestartPolicy:nil,} start failed in pod cilium-envoy-nx7tr_kube-system(def0aa56-ecf9-4c4e-82cf-fc85710d29fa): CreateContainerConfigError: se
rvices have not yet been read at least once, cannot construct envvars"}                                                                                                                                                                                                                                                       
10.0.0.120: {"ts":1734987714746.8972,"caller":"kubelet/pod_workers.go:1301","msg":"Error syncing pod, skipping","pod":{"name":"cilium-envoy-nx7tr","namespace":"kube-system"},"podUID":"def0aa56-ecf9-4c4e-82cf-fc85710d29fa","err":"failed to \"StartContainer\" for \"cilium-envoy\" with CreateContainerConfigError: \"serv
ices have not yet been read at least once, cannot construct envvars\"","errCauses":[{"error":"failed to \"StartContainer\" for \"cilium-envoy\" with CreateContainerConfigError: \"services have not yet been read at least once, cannot construct envvars\""}]}                                                              
10.0.0.120: {"ts":1734987714746.9038,"caller":"kubelet/pod_workers.go:1301","msg":"Error syncing pod, skipping","pod":{"name":"cilium-qfdl6","namespace":"kube-system"},"podUID":"216e7cce-1d3f-4f1b-a004-c260136b4043","err":"failed to \"StartContainer\" for \"config\" with CreateContainerConfigError: \"services have no
t yet been read at least once, cannot construct envvars\"","errCauses":[{"error":"failed to \"StartContainer\" for \"config\" with CreateContainerConfigError: \"services have not yet been read at least once, cannot construct envvars\""}]}                                                                                
10.0.0.120: {"ts":1734987717060.077,"caller":"kubelet/kubelet.go:2902","msg":"Container runtime network not ready","networkReady":"NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized"}      ```

@smira
Member

smira commented Dec 24, 2024

I think @buroa found the root cause (most probably): containerd/go-cni#123 (comment)

We'll patch containerd for v1.9.1.

This bug is hard to reproduce.
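
Once a release carrying the patched containerd is out, picking it up would presumably look something like this (installer image tag assumed, node address is a placeholder):

$ talosctl upgrade -n <node> --image ghcr.io/siderolabs/installer:v1.9.1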

smira added a commit to smira/pkgs that referenced this issue Dec 26, 2024
Fixes siderolabs/talos#9984

Patch with containerd/go-cni#126

See also:

* containerd/go-cni#125
* containerd/containerd#11186
* containerd/go-cni#123

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
(cherry picked from commit 0b00e86)