Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Linkerd-destination crashing #8235

Closed
Anthony-Bible opened this issue Apr 10, 2022 · 18 comments
Closed

Linkerd-destination crashing #8235

Anthony-Bible opened this issue Apr 10, 2022 · 18 comments
Assignees
Labels
bug priority/P1 Planned for Release wontfix

Comments

@Anthony-Bible
Copy link

Anthony-Bible commented Apr 10, 2022

What is the issue?

Linkerd-destination is crash looping on a bare-metal cluster, when I tried turning on debug logging there doesn't appear to be much information (attached below). No other pods in linkerd or other namespaces are crashing so it seems isolated with linkerd-destination

How can it be reproduced?

I created a bare metal cluster on Hetzner through Kubespray both terraform and ansible-playbook. And then get the error with both a helm install and using linkerd install | kubectl apply -f

my all.yaml: https://gist.github.com/Anthony-Bible/3e60e1d53d740d181569c82812e6d271

Logs, error output, etc

https://gist.github.com/Anthony-Bible/92d2fc4565081c0469d2f5302c3eb048

output of linkerd check -o short

Linkerd core checks
===================

linkerd-existence
-----------------
- pod/linkerd-destination-789cdbc9fd-bstdq container sp-validator is not ready 
Status check results are √
\ pod/linkerd-destination-789cdbc9fd-bstdq container sp-validator is not ready

When the pod is ready, everything returns all green.

Environment

  • Kubernetes: 1.21.11
  • ubuntu 20.04
  • linkerd: 2.11.1

Possible solution

No response

Additional context

No response

Would you like to work on fixing this bug?

No response

@olix0r
Copy link
Member

olix0r commented Apr 11, 2022

kubectl describe on the pod or deployment should help make it clearer which container is crashing and potentially why. There's nothing that looks like a crash in the destination controller's log. You might check the policy container.

@Anthony-Bible
Copy link
Author

Anthony-Bible commented Apr 12, 2022

kubectl describe on the pod or deployment should help make it clearer which container is crashing and potentially why. There's nothing that looks like a crash in the destination controller's log. You might check the policy container.

I've restarted it a few times and it's not one container that's crashing, sometimes it's destination, other times it's sp-validator and even times where its the majority of policy container crashing. None of them look like they're erroring out:

○ kubectl logs -p -n linkerd linkerd-destination-c8c548fc4-6pnfx policy
2022-04-12T01:31:18.489524Z  INFO serve{addr=0.0.0.0:9990}: linkerd_policy_controller::admin: HTTP admin server listening addr=0.0.0.0:9990
2022-04-12T01:31:18.489550Z  INFO grpc{addr=0.0.0.0:8090 cluster_networks=[10.0.0.0/8, 100.64.0.0/10, 172.16.0.0/12, 192.168.0.0/16]}: linkerd_policy_controller: gRPC server listening addr=0.0.0.0:8090
2022-04-12T01:31:18.489637Z  INFO linkerd_policy_controller: Admission controller server listening addr=0.0.0.0:9443
2022-04-12T01:31:18.506189Z DEBUG linkerd_policy_controller_k8s_index: Ready
2022-04-12T01:37:18.383084Z DEBUG linkerd_policy_controller: Received SIGTERM
2022-04-12T01:37:18.383108Z  INFO linkerd_policy_controller: Shutting down

SP-validator:

○ kubectl logs -p -n linkerd linkerd-destination-c8c548fc4-6pnfx sp-validator 
time="2022-04-12T01:30:08Z" level=info msg="running version stable-2.11.1"
I0412 01:30:08.524393       1 merged_client_builder.go:121] Using in-cluster configuration
time="2022-04-12T01:30:08Z" level=info msg="waiting for caches to sync"
I0412 01:30:08.527949       1 shared_informer.go:270] caches populated
time="2022-04-12T01:30:08Z" level=info msg="caches synced"
time="2022-04-12T01:30:08Z" level=info msg="starting admin server on :9997"
time="2022-04-12T01:30:08Z" level=info msg="listening at :8443"
time="2022-04-12T01:41:48Z" level=info msg="shutting down webhook server"

However I did notice in the linkerd-proxy logs I get logs like this

INFO ThreadId(01) inbound: linkerd_app_core::serve: Connection closed error=TLS detection timed out
# And sometimes this:
linkerd-destination-c8c548fc4-6pnfx default-worker-1 linkerd-proxy [   598.954535s]  WARN ThreadId(01) inbound: linkerd_app_core::serve: Server failed to accept connection error=Transport endpoint is not connected (os error 107)

I checked the secrets and the tls ones for linked are there.

@Anthony-Bible
Copy link
Author

Oh and for completeness, the liveness and readiness probes:

  Warning  Unhealthy  15m                    kubelet            Readiness probe failed: Get "http://10.2.96.54:9996/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
  Warning  Unhealthy  15m                    kubelet            Liveness probe failed: Get "http://10.2.96.54:9997/ping": dial tcp 10.2.96.54:9997: i/o timeout (Client.Timeout exceeded while awaiting headers)
  Warning  Unhealthy  15m                    kubelet            Readiness probe failed: Get "http://10.2.96.54:9990/ready": dial tcp 10.2.96.54:9990: i/o timeout (Client.Timeout exceeded while awaiting headers)
  Warning  Unhealthy  14m (x4 over 15m)      kubelet            Readiness probe failed: Get "http://10.2.96.54:9990/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
  Warning  Unhealthy  10m (x7 over 14m)      kubelet            Liveness probe failed: Get "http://10.2.96.54:9997/ping": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
  Warning  Unhealthy  5m24s (x2 over 5m34s)  kubelet            Liveness probe failed: Get "http://10.2.96.54:9996/ping": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
  Warning  Unhealthy  14s (x31 over 15m)     kubelet            Liveness probe failed: Get "http://10.2.96.54:9990/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

@olix0r
Copy link
Member

olix0r commented Apr 12, 2022

OK, so the controller's probes are timing out. Does this really only happen on the destination controller and no other pods?

You could try running the destination controller's proxy with debug logging... you'll have to reinstall linkerd with the proxy log level set--or you can kubectl edit -n linkerd deploy linkerd-destinationand set the LINKERD2_PROXY_LOG--to linkerd=debug,info. I'd expect more obvious errors in the logs, though, if the proxy was failing in some obvious way, though... I guess the debug logs can at least help us know whether the connections are making it into/out of the proxy.

This could be some interaction with calico? But if that was the case I'd expect errors to other pods.

Perhaps there's a node-level issue? The conntrack table filling up? Are other injected components running on the node successfully?

@adleong adleong removed the bug label Apr 12, 2022
@Anthony-Bible
Copy link
Author

Anthony-Bible commented Apr 17, 2022

I took some time and looked into it. It doesn't look like anything related to Calico or the node. Everything appears to be running on that node with no problems. And the Conntrack table isn't even close to filling up (178 entries), along with everything on the same node appears to be running. The linkerd-proxy logs don't show much except connections are getting closed from the client (?) .

Should linkerd play well with ipvs kube-proxy? I know I was also facing this issue which was due to ipvs and how it adds (external) IPs, but I'm not entirely convinced it would be a similar issue here.

@olix0r
Copy link
Member

olix0r commented Apr 19, 2022

@Anthony-Bible Hm, I'm not sure. I don't think we've tested this configuration explicitly. It's possible the IPVS kube proxy isn't working well with Linkerd's proxy-init iptables rules?

I think your assessment is correct. Looking at the logs we see:

linkerd-destination-dc7b5f7b4-gctr7 default-worker-1 linkerd-proxy [     2.005111s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{client.addr=5.161.97.33:60002}: linkerd_tls::server: Peeked bytes from TCP stream sz=107
linkerd-destination-dc7b5f7b4-gctr7 default-worker-1 linkerd-proxy [     2.005186s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{client.addr=5.161.97.33:60002}: linkerd_detect: DetectResult protocol=Some(Http1) elapsed=9.097µs
linkerd-destination-dc7b5f7b4-gctr7 default-worker-1 linkerd-proxy [     2.005224s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{client.addr=5.161.97.33:60002}: linkerd_proxy_http::server: Creating HTTP service version=Http1
linkerd-destination-dc7b5f7b4-gctr7 default-worker-1 linkerd-proxy [     2.005239s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{client.addr=5.161.97.33:60002}: linkerd_proxy_http::server: Handling as HTTP version=Http1
linkerd-destination-dc7b5f7b4-gctr7 default-worker-1 linkerd-proxy [     2.005276s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{client.addr=5.161.97.33:60002}: linkerd_app_inbound::policy::authorize::http: Request authorized permit=Permit { dst: OrigDstAddr(0.0.0.0:4191), protocol: Detect { timeout: 10s }, labels: AuthzLabels { server: ServerLabel("default:all-unauthenticated"), authz: "default:all-unauthenticated" } } tls=None(NoClientHello) client=5.161.97.33:60002
linkerd-destination-dc7b5f7b4-gctr7 default-worker-1 linkerd-proxy [     2.005352s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{client.addr=5.161.97.33:60002}: linkerd_proxy_http::server: The client is shutting down the connection res=Ok(())
linkerd-destination-dc7b5f7b4-gctr7 default-worker-1 linkerd-proxy [     2.005370s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{client.addr=5.161.97.33:60002}: linkerd_app_core::serve: Connection closed

That is, the admin server gets the probe request and (presumably) when it tries to write the response, the connection has already been closed. Very weird!

@Anthony-Bible
Copy link
Author

So an update, I had this start happening at work with our cluster on GKE so it's not specific to a provider. Also I did try switching to kube-proxy iptables mode and that didn't help.

@VladoPortos
Copy link

Same here on fresh installation on k3s, I'm getting:

Readiness probe failed: Get "http://10.42.2.122:9990/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

@mkrupczak3
Copy link

Had the same issue during a Linkerd workshop. Using a Civo cluster with 3 nodes

@gavinclarkeuk
Copy link

We're seeing the same issue on our EKS clusters. Not enough to cause the pods to shut down, but we do get a fairly steady level of noise coming from probe failures

Liveness probe failed: Get "http://10.26.110.84:9996/ping": read tcp 10.26.108.70:58692->10.26.110.84:9996: read: connection reset by peer
Readiness probe failed: Get "http://10.26.110.84:9990/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Liveness probe failed: Get "http://10.26.110.84:9996/ping": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

@stale
Copy link

stale bot commented Feb 20, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Feb 20, 2023
@Anthony-Bible
Copy link
Author

Just tried installing it again with 2.12.4 and still getting the same problem:

  Warning  Unhealthy  79s (x2 over 89s)  kubelet            Liveness probe failed: Get "http://10.2.93.61:9990/live": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
  Warning  Unhealthy  79s                kubelet            Liveness probe failed: Get "http://10.2.93.61:9996/ping": dial tcp 10.2.93.61:9996: i/o timeout (Client.Timeout exceeded while awaiting headers)
  Warning  Unhealthy  79s                kubelet            Readiness probe failed: Get "http://10.2.93.61:9990/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
  Warning  Unhealthy  69s (x3 over 89s)  kubelet            Readiness probe failed: Get "http://10.2.93.61:9996/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

@stale stale bot removed the wontfix label Feb 21, 2023
@Anthony-Bible
Copy link
Author

@olix0r looks like you guys have pprof, would getting a goroutine output help here, maybe something is hanging so it can't mark itself live?

@olix0r
Copy link
Member

olix0r commented Feb 21, 2023

@Anthony-Bible Thanks for trying a recent version. You may also want to try with an edge release, as there are some CNI/proxy-init related changes that have not yet made it to stable.

I'm skeptical that the controller pprof is going to have anything helpful -- there aren't any indicates from the controller logs that the controller isn't healthy.

I reviewed the logs you had shared and I'm unable to spot anything that looks like the timeouts being reported by kubelet.

To resolve this we might need:

  • ideally, steps to repro this locally in k3d (which we use for integration testing);
    • otherwise, specific configuration/steps to reproduce this in a cloud environment;
  • proxy logs (with log level linkerd=debug,warn) from a pod that is failing probe requests.
    • the output of linkerd diagnostics proxy-metrics -n linkerd <pod> on the same pod
  • TCP dump output from a pod -- the linkerd debug container may help with this.

@jeremychase
Copy link
Contributor

Possibly related to #9798

@alpeb alpeb self-assigned this Mar 6, 2023
@alpeb
Copy link
Member

alpeb commented Mar 6, 2023

Also tagging #9602 as related.

@alpeb
Copy link
Member

alpeb commented Mar 8, 2023

@Anthony-Bible @mkrupczak3 @VladoPortos @gavinclarkeuk any chance to provide more detailed info as pointed out by @olix0r above?

Besides that, it seems probe timeouts have been a persistent issue for a long time in k8s for folks under some scenarios. kubernetes/kubernetes#89898 was just recently closed, and that might provide a resolution in an upcoming k8s release.

In the meantime, it appears that by simply setting a higher probe timeout, say 5 seconds (the default is 1), resolves the issue. So if you're still experiencing this, you can try editing the linkerd-destination Deployment and explicitly setting timeoutSeconds: 5 for all the containers probes (also don't forget to use linkerd 2.12.4).

Please let us know if this helps, so we can consider adding that timeout in future releases.

@stale
Copy link

stale bot commented Sep 27, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Sep 27, 2023
@stale stale bot closed this as completed Oct 11, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 11, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug priority/P1 Planned for Release wontfix
Projects
None yet
Development

No branches or pull requests

9 participants