Linkerd 2.11.x Control Plane Components Failing #8496
The fact that the identity controller is in a crash loop probably points to a CNI/proxy-init related issue. We are aware of a likely bug in the Azure CNI. Are you able to test the reproduction described in Azure/AKS#2750? I'd start by trying to understand why the identity controller isn't healthy; nothing else will start without it.
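A minimal sketch of that first step, assuming a default install in the `linkerd` namespace (container and label names below assume a standard Linkerd 2.11 install):

```sh
# Is linkerd-identity the only component crash-looping?
kubectl -n linkerd get pods

# Logs from both containers of the identity pod; the proxy log usually
# carries the underlying DNS/connectivity error
kubectl -n linkerd logs deploy/linkerd-identity -c identity
kubectl -n linkerd logs deploy/linkerd-identity -c linkerd-proxy

# Pod events surface failed postStart hooks and probe failures
kubectl -n linkerd describe pod -l linkerd.io/control-plane-component=identity
```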
Hi @olix0r, after removing the lifecycle snippet, things work fine. We aren't using Azure CNI though; we're working with kubenet.
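For context, the "lifecycle snippet" is the postStart hook in the 2.11 charts that runs linkerd-await. One way to see what was actually rendered onto a pod (the deployment name here is just an example) is:

```sh
# Show the lifecycle/postStart hook rendered onto the proxy container, if any
kubectl -n linkerd get deploy linkerd-identity -o yaml | grep -A 8 'lifecycle:'
```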
Well, even in the container creating stage, the curl command works just fine, so this doesn't point to the same issue here.
Also, to note: linkerd-proxy-injector and linkerd-destination initially stay in a CrashLoopBackOff with the following error:
It's when I remove the lifecycle snippet for linkerd-await that we reach the DNS-related issue. linkerd-identity, on the other hand, keeps failing with a
Hi @ayushiaks. Is it possible that you're somehow using the stable-2.11.2 Helm charts with the stable-2.10.2 Docker images? I notice in your proxy logs that the proxy seems to be version stable-2.10.2. This might explain why the postStart hook in the chart references a
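A quick way to check whether the chart and the images have drifted apart (the jsonpath below is just a sketch, and the release name/namespace may differ in your setup):

```sh
# Image tags the control-plane deployments are actually running
kubectl -n linkerd get deploy \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.template.spec.containers[*].image}{"\n"}{end}'

# Chart version Helm thinks is installed
helm list -A | grep linkerd
```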
@adleong thanks for pointing that out! Fixing that got my linkerd-identity pods up and running, but the destination and proxy-injector are still failing with:
In this situation, I would look at the pod with IP
Does the

It might be helpful for you to back up and explain how you tried to upgrade your cluster. For instance, how did you end up with the wrong container images? It sounds like something has gone very wrong, but it's hard for us to diagnose this without a lot more context about how you manage Linkerd in this cluster. If you're really stuck and need hands-on help, you may also want to consider commercial support.
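For readers following along, the standard Helm upgrade path looks roughly like the following; the release name `linkerd2` and the use of `--reuse-values` are assumptions about this particular setup:

```sh
# Refresh the chart repo and upgrade with the chart version pinned, so the
# templates and the images they reference stay in sync
helm repo update
helm upgrade linkerd2 linkerd/linkerd2 --version 2.11.2 --reuse-values
```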
I found this doc on debugging DNS issues.
The lookup fails in the clusters where I've tried the linkerd upgrade (explaining why downgrading isn't helping), but works fine in other clusters. Edit: False alarm, the linkerd namespace wasn't up during the lookup.
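A lookup along these lines can be run from a throwaway pod to double-check in-cluster DNS; the service name below is only an example, substitute the record from your proxy logs:

```sh
# One-off DNS check from inside the cluster
kubectl run dns-test --rm -it --restart=Never --image=busybox:1.36 -- \
  nslookup linkerd-dst-headless.linkerd.svc.cluster.local
```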
@ayushiaks are you in the Linkerd Slack? It would be great to connect a little more synchronously to see what we can do.
It's possible this is related to #8296. I've pushed a proxy image (which will be included in this week's edge release) that can be used for testing:
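The image reference itself appeared in the original comment; the usual way to point a workload at a test proxy image is with the proxy-image/proxy-version annotations (the namespace, deployment, image, and tag below are placeholders):

```sh
# Override the proxy image for a single workload, then restart it
kubectl -n my-namespace annotate deploy/my-deploy \
  config.linkerd.io/proxy-image=REGISTRY/PROXY_IMAGE \
  config.linkerd.io/proxy-version=TEST_TAG \
  --overwrite
kubectl -n my-namespace rollout restart deploy/my-deploy
```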
Not yet, thanks, will join it!
I have a weird observation here. The record that it's not able to find is an FQDN, and nslookup for it succeeds as well. I replaced

Although the DNS errors go away after that, the pods are still stuck, with readiness and liveness probes failing and no warn/err logs. For now, we have downgraded to 2.11.0 and things are working fine.
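For reference, a downgrade like that can usually be done by rolling the Helm release back rather than re-installing; the release name is assumed here and the revision is a placeholder:

```sh
# Find the last known-good revision and roll back to it
helm history linkerd2
helm rollback linkerd2 <REVISION>
```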
We were hitting the same in OpenShift 4.10, and we had to downgrade to 2.10.2 to resolve the issue. Just to mention, in our case we also hit the

In our case, due to the
@ayushiaks I don't think we have enough information to make any progress on this. My suspicion is that this has something to do with your cluster networking setup, but that's a guess. In order for us to help you, we really need to be able to replicate the problem, or we need a clear enough description of what changes in Linkerd are required to fix it. But at this point we don't have enough information to proceed. I'll note that Buoyant is running Linkerd on AKS on Kubernetes v1.21.2 without any problems, but I don't have any sense of how that configuration differs from your environment.
I think this points to something changing in your cluster's environment.
Hi @olix0r, we're at Linkerd 2.11.0 now, which is working fine; there were only some transient issues with it. (2.11.1 and 2.11.2 are still causing the above issues.) Anyway, we switched to Azure CNI as the network plugin for our AKS clusters, and now one of our linkerd-destination pods is suddenly in CrashLoopBackOff, failing with:
I tried the repro for issue AKS#2750, but that is not the case for us. And it's only failing in one of our clusters; any idea what might be causing this, or how I can dig deeper here? EDIT: We started facing random errors each time, in all sorts of pods, related to network connectivity timeouts. It looks like something is wrong with the CNI + Linkerd combination.
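To dig deeper on a crash-looping destination pod, a sketch assuming the standard container names of a 2.11 install:

```sh
# The previous container logs usually carry the crash reason
kubectl -n linkerd logs deploy/linkerd-destination -c destination --previous
kubectl -n linkerd logs deploy/linkerd-destination -c linkerd-proxy --previous

# Events show probe failures, OOM kills, and postStart hook errors
kubectl -n linkerd describe pod -l linkerd.io/control-plane-component=destination
```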
This error indicates that the
If I were in your shoes, I would try to identify how these clusters differ.
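One rough way to compare a working cluster against the failing one (filenames here are arbitrary):

```sh
# Capture the same diagnostics from each cluster, then diff the results
linkerd check -o short > cluster-a.txt
kubectl version --short >> cluster-a.txt
kubectl -n kube-system get pods -o wide >> cluster-a.txt
# ...repeat against the other kube context as cluster-b.txt, then:
diff cluster-a.txt cluster-b.txt
```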
This isn't really actionable for us.
We have many users who use CNI & Linkerd successfully (in Azure, even), so it's more likely a problem with your specific cluster configuration. I would suggest trying the latest Linkerd stable release, 2.11.4, which includes fixes for some DNS-related discovery problems, but I can't be confident that this will help you since I really don't understand the nature of the problems you're encountering. I'm sorry that you're having trouble running Linkerd on these clusters, but as an open-source project we can only really fix problems that we can reproduce, or that come with detailed enough problem descriptions to identify specific bugs in Linkerd. We don't really have the bandwidth to help debug your environment. As I've mentioned previously, there are support vendors and training workshops that may be able to help you with this.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
What is the issue?
After upgrading the Linkerd Helm charts from stable-2.10.2 to stable-2.11.2, all Linkerd control-plane components are failing.
We're using AKS with kubenet, with Kubernetes version 1.21.7.
How can it be reproduced?
Upgrade from Helm chart 2.10.2 to 2.11.x.
Logs, error output, etc
Logs from linkerd-proxy-injector's and linkerd-destination's linkerd-proxy containers:
Output of `linkerd check -o short`:
Environment
Possible solution
No response
Additional context
No response
Would you like to work on fixing this bug?
No response