Linkerd-destination crashing #8235
Comments
|
I've restarted it a few times, and it's not just one container that's crashing: sometimes it's destination, other times it's sp-validator, and at times it's mostly the policy container. None of them look like they're erroring out:
SP-validator:
However, I did notice that the linkerd-proxy logs contain entries like this:
I checked the secrets, and the TLS ones for linkerd are there. |
Oh and for completeness, the liveness and readiness probes:
|
OK, so the controller's probes are timing out. Does this really only happen on the destination controller and no other pods? You could try running the destination controller's proxy with debug logging... you'll have to reinstall linkerd with the proxy log level set--or you can This could be some interaction with calico? But if that was the case I'd expect errors to other pods. Perhaps there's a node-level issue? The conntrack table filling up? Are other injected components running on the node successfully? |
I took some time and looked into it. It doesn't look like anything related to Calico or the node. Everything else on that node appears to be running with no problems, and the conntrack table isn't even close to filling up (178 entries). The linkerd-proxy logs don't show much except that connections are getting closed by the client (?). Should linkerd play well with ipvs kube-proxy? I know I was also facing this issue, which was due to ipvs and how it adds (external) IPs, but I'm not entirely convinced it would be a similar issue here. |
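For anyone who wants to repeat the conntrack check on their own node, here is a minimal sketch. It assumes a Linux node with the nf_conntrack module loaded, so that the counters are exposed under /proc/sys/net/netfilter; the function names and the 80% warning threshold are illustrative choices, not anything Linkerd or Kubernetes prescribes.

```python
from pathlib import Path

# Location of the conntrack counters on Linux when nf_conntrack is loaded.
CONNTRACK_DIR = Path("/proc/sys/net/netfilter")


def read_conntrack(base: Path = CONNTRACK_DIR) -> tuple[int, int]:
    """Return (current entry count, table maximum) read from `base`."""
    count = int((base / "nf_conntrack_count").read_text().strip())
    maximum = int((base / "nf_conntrack_max").read_text().strip())
    return count, maximum


def conntrack_usage(count: int, maximum: int) -> float:
    """Return the fraction of the conntrack table currently in use."""
    if maximum <= 0:
        raise ValueError("conntrack maximum must be positive")
    return count / maximum


if __name__ == "__main__":
    count, maximum = read_conntrack()
    usage = conntrack_usage(count, maximum)
    print(f"conntrack entries: {count}/{maximum} ({usage:.2%})")
    # Hypothetical threshold: warn when the table is more than 80% full.
    if usage > 0.8:
        print("warning: conntrack table is close to full")
```

Run it directly on the node (or in a privileged debug pod that shares the host's network namespace). A count in the low hundreds, as reported above, would be far below any typical maximum.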
@Anthony-Bible Hm, I'm not sure. I don't think we've tested this configuration explicitly. It's possible the IPVS kube proxy isn't working well with Linkerd's proxy-init iptables rules? I think your assessment is correct. Looking at the logs we see:
That is, the admin server gets the probe request and (presumably) when it tries to write the response, the connection has already been closed. Very weird! |
So, an update: this started happening at work with our cluster on GKE, so it's not specific to a provider. I also tried switching to kube-proxy iptables mode, and that didn't help. |
Same here on a fresh installation on k3s. I'm getting:
|
Had the same issue during a Linkerd workshop. Using a Civo cluster with 3 nodes |
We're seeing the same issue on our EKS clusters. Not enough to cause the pods to shut down, but we do get a fairly steady level of noise coming from probe failures
|
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions. |
Just tried installing it again with 2.12.4 and still getting the same problem:
|
@olix0r looks like you guys have pprof. Would getting a goroutine dump help here? Maybe something is hanging, so it can't mark itself live. |
@Anthony-Bible Thanks for trying a recent version. You may also want to try with an edge release, as there are some CNI/proxy-init related changes that have not yet made it to stable. I'm skeptical that the controller pprof is going to have anything helpful -- there aren't any indications from the controller logs that the controller isn't healthy. I reviewed the logs you had shared and I'm unable to spot anything that looks like the timeouts being reported by kubelet. To resolve this we might need:
|
Possibly related to #9798 |
Also tagging #9602 as related. |
@Anthony-Bible @mkrupczak3 @VladoPortos @gavinclarkeuk any chance you could provide the more detailed info requested by @olix0r above? Besides that, it seems probe timeouts have been a persistent issue in k8s for a long time for folks in some scenarios. kubernetes/kubernetes#89898 was just recently closed, and that might provide a resolution in an upcoming k8s release. In the meantime, it appears that simply setting a higher probe timeout, say 5 seconds (the default is 1), resolves the issue. So if you're still experiencing this, you can try editing the linkerd-destination Deployment and explicitly setting Please let us know if this helps, so we can consider adding that timeout in future releases. |
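As a rough illustration of the workaround above, the sketch below builds a strategic merge patch that raises the probe timeout on selected containers. It assumes the relevant Kubernetes probe field is timeoutSeconds (whose default is 1 second) and that the containers are named destination, sp-validator, and policy, as mentioned earlier in this thread; the helper name build_probe_timeout_patch is hypothetical. Only list containers that already define both a liveness and a readiness probe, since the patch carries no probe handler of its own.

```python
import json


def build_probe_timeout_patch(container_names, timeout_seconds=5):
    """Build a strategic merge patch raising probe timeouts for a Deployment.

    Containers are matched by name, so only the timeoutSeconds field of each
    listed container's existing liveness and readiness probes is changed.
    """
    if timeout_seconds < 1:
        raise ValueError("timeoutSeconds must be at least 1")
    containers = [
        {
            "name": name,
            "livenessProbe": {"timeoutSeconds": timeout_seconds},
            "readinessProbe": {"timeoutSeconds": timeout_seconds},
        }
        for name in container_names
    ]
    return {"spec": {"template": {"spec": {"containers": containers}}}}


if __name__ == "__main__":
    # Container names are assumptions taken from this thread; verify them
    # against your own linkerd-destination Deployment before applying.
    patch = build_probe_timeout_patch(["destination", "sp-validator", "policy"])
    print(json.dumps(patch, indent=2))
```

Assuming Linkerd is installed in the linkerd namespace, the output could be saved to patch.json and applied with kubectl -n linkerd patch deployment linkerd-destination --patch-file patch.json. A later upgrade through helm or linkerd upgrade would likely overwrite a manual change like this.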
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions. |
What is the issue?
Linkerd-destination is crash looping on a bare-metal cluster. When I tried turning on debug logging, there didn't appear to be much information (attached below). No other pods in linkerd or other namespaces are crashing, so it seems isolated to linkerd-destination.
How can it be reproduced?
I created a bare-metal cluster on Hetzner through Kubespray, using both terraform and ansible-playbook. I then get the error with both a helm install and using linkerd install | kubectl apply -f
my all.yaml: https://gist.github.com/Anthony-Bible/3e60e1d53d740d181569c82812e6d271
Logs, error output, etc
https://gist.github.com/Anthony-Bible/92d2fc4565081c0469d2f5302c3eb048
output of linkerd check -o short
When the pod is ready, everything returns all green.
Environment
Possible solution
No response
Additional context
No response
Would you like to work on fixing this bug?
No response