Linkerd 2.11.x Control Plane Components Failing #8496

Closed

ayushiaks opened this issue May 16, 2022 · 19 comments

@ayushiaks

What is the issue?

After upgrading the Linkerd Helm chart from stable-2.10.2 to 2.11.2, all Linkerd control-plane components are failing.
We're using AKS with kubenet, on Kubernetes 1.21.7.

[screenshot: failing control-plane pods]

How can it be reproduced?

Upgrade from Helm chart 2.10.2 to 2.11.x.
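
For context, an upgrade like the one described here would typically be driven by something along these lines (a sketch only; the release name linkerd2 and the use of --reuse-values are assumptions, not taken from this report):

# Sketch of the Helm-based upgrade being described (names/flags are assumptions).
helm repo update
helm upgrade linkerd2 linkerd/linkerd2 --version 2.11.2 --reuse-values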

Logs, error output, etc

Logs from the linkerd-proxy container of linkerd-proxy-injector and linkerd-destination:

[0.005486s]  INFO ThreadId(01) linkerd2_proxy: Identity verified via linkerd-identity-headless.linkerd.svc.cluster.local:8080 (linkerd-identity.linkerd.serviceaccount.identity.linkerd.cluster.local)
[0.005494s]  INFO ThreadId(01) linkerd2_proxy: Destinations resolved via linkerd-dst-headless.linkerd.svc.cluster.local:8086 (linkerd-destination.linkerd.serviceaccount.identity.linkerd.cluster.local)
[0.008909s]  WARN ThreadId(01) daemon:identity: linkerd_app_core::control: Failed to resolve control-plane component error=no record found for name: linkerd-identity-headless.linkerd.svc.cluster.local. type: SRV class: IN
[0.012103s]  WARN ThreadId(01) daemon:identity: linkerd_app_core::control: Failed to resolve control-plane component error=no record found for name: linkerd-identity-headless.linkerd.svc.cluster.local. type: SRV class: IN
[0.028235s]  INFO ThreadId(01) linkerd_proxy::signal: received SIGTERM, starting shutdown

output of linkerd check -o short

kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API

kubernetes-version
------------------
√ is running the minimum Kubernetes API version
√ is running the minimum kubectl version

linkerd-existence
-----------------
√ 'linkerd-config' config map exists
√ heartbeat ServiceAccount exist
√ control plane replica sets are ready
√ no unschedulable pods
| linkerd-destination-74cbd87444-lfckk status is CrashLoopBackOff

Environment

  • Kubernetes version - 1.21.7
  • AKS Cluster
  • Host OS - linux
  • Linkerd version - helm chart 2.11.2

Possible solution

No response

Additional context

No response

Would you like to work on fixing this bug?

No response

@ayushiaks ayushiaks added the bug label May 16, 2022
@ayushiaks ayushiaks changed the title Linkerd 2.11.x control plane components failing Linkerd 2.11.x Control Plane Components Failing May 16, 2022
@olix0r
Member

olix0r commented May 16, 2022

The fact that the identity controller is in a crash loop probably points to this being a CNI/proxy-init related issue.

We are aware of a likely bug in the Azure CNI. Are you able to test the reproduction described in Azure/AKS#2750?

I'd start by trying to understand why the identity controller isn't healthy; nothing else will start without it.
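
A few commands that can help with that (a sketch; it assumes the control plane is in the linkerd namespace and uses the standard linkerd.io/control-plane-component label, so adjust if your install differs):

# Inspect the identity controller's pods, events, and container logs.
kubectl -n linkerd get pods -l linkerd.io/control-plane-component=identity -o wide
kubectl -n linkerd describe pods -l linkerd.io/control-plane-component=identity
kubectl -n linkerd logs -l linkerd.io/control-plane-component=identity --all-containers --prefix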

@olix0r olix0r added support env/aks Microsoft AKS and removed bug labels May 16, 2022
@ayushiaks
Author

ayushiaks commented May 17, 2022

Hi @olix0r,
I tried the repro described, and the pod stays in the ContainerCreating stage.
I haven't looked into the logs since ephemeral containers aren't enabled on our clusters right now; I'm looking into enabling that.

After removing the lifecycle snippet, things work fine.

We aren't using Azure CNI though, we're working with kubenet.

@ayushiaks
Author

Well, even in the ContainerCreating stage, the curl command works just fine:

100   165  100   165    0     0   7168      0 --:--:-- --:--:-- --:--:--  7500
2022-05-17T19:21:09.19905971Z stderr F   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
2022-05-17T19:21:09.199148719Z stderr F                                  Dload  Upload   Total   Spent    Left  Speed
2022-05-17T19:21:09.227217806Z stdout F }{
2022-05-17T19:21:09.227242408Z stdout F   "kind": "Status",
2022-05-17T19:21:09.22725491Z stdout F   "apiVersion": "v1",
2022-05-17T19:21:09.227265611Z stdout F   "metadata": {
2022-05-17T19:21:09.227276312Z stdout F
2022-05-17T19:21:09.227286713Z stdout F   },
2022-05-17T19:21:09.227296814Z stdout F   "status": "Failure",
2022-05-17T19:21:09.227306715Z stdout F   "message": "Unauthorized",
2022-05-17T19:21:09.227320416Z stdout F   "reason": "Unauthorized",
2022-05-17T19:21:09.227330617Z stdout F   "code": 401
100   165  100   165    0     0   5811      0 --:--:-- --:--:-- --:--:--  5892

So this doesn't point to the same issue.

@ayushiaks
Author

Also, to put it out there: linkerd-proxy-injector and linkerd-destination initially stay in CrashLoopBackOff with the following error:

PostStartHookError: rpc error: code = Unknown desc = failed to exec in container: failed to start exec "9221355bbc14fb825dc85ec6023a75607ae9f9570af5729c4c3c1676f610533a": OCI runtime exec failed: exec failed: container_linux.go:380: starting container process caused: exec: "/usr/lib/linkerd/linkerd-await": stat /usr/lib/linkerd/linkerd-await: no such file or directory: unknown

It's only when I remove the lifecycle snippet for linkerd-await that we reach the DNS-related issue.
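
For reference, the lifecycle snippet being removed is the proxy's postStart hook, which looks roughly like this (a sketch reconstructed from the error above; the exact arguments are an assumption):

# postStart hook on the linkerd-proxy container (sketch; the --timeout flag is illustrative).
lifecycle:
  postStart:
    exec:
      command:
        - /usr/lib/linkerd/linkerd-await
        - --timeout=2m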

linkerd-identity, on the other hand, keeps failing with Readiness probe failed: HTTP probe failed with statuscode: 503.
The logs don't point to anything either:

 linkerd-proxy time="2022-05-17T19:24:50Z" level=info msg="running version stable-2.10.2"
 linkerd-proxy [0.004734s]  INFO ThreadId(01) linkerd2_proxy::rt: Using single-threaded proxy runtime
 linkerd-proxy [0.005672s]  INFO ThreadId(01) linkerd2_proxy: Admin interface on 0.0.0.0:4191
 linkerd-proxy [0.005710s]  INFO ThreadId(01) linkerd2_proxy: Inbound interface on 0.0.0.0:4143
 linkerd-proxy [0.005717s]  INFO ThreadId(01) linkerd2_proxy: Outbound interface on 127.0.0.1:4140
 linkerd-proxy [0.005724s]  INFO ThreadId(01) linkerd2_proxy: Tap DISABLED
 linkerd-proxy [0.005730s]  INFO ThreadId(01) linkerd2_proxy: Local identity is linkerd-identity.linkerd.serviceaccount.identity.linkerd.cluster.local
 linkerd-proxy [0.005738s]  INFO ThreadId(01) linkerd2_proxy: Identity verified via localhost:8080
 linkerd-proxy [0.005745s]  INFO ThreadId(01) linkerd2_proxy: Destinations resolved via linkerd-dst-headless.linkerd.svc.cluster.local:8086 (linkerd-destination.linkerd.serviceaccount.identity.linkerd.cluster.local)

@adleong
Member

adleong commented May 17, 2022

Hi @ayushiaks. Is it possible that you're somehow using the stable-2.11.2 Helm charts with the stable-2.10.2 Docker images? I notice in your proxy logs that the proxy seems to be version stable-2.10.2. This might explain why the post start hook in the chart references a linkerd-await binary that doesn't exist in the container.
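
One quick way to confirm which images the control plane is actually running (a sketch, assuming the linkerd namespace):

# Print each pod in the linkerd namespace along with its container images.
kubectl -n linkerd get pods \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{range .spec.containers[*]}{.image}{" "}{end}{"\n"}{end}'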

@ayushiaks
Author

ayushiaks commented May 18, 2022

@adleong thanks for pointing that out! Fixing that got my linkerd-identity pods up and running, but the destination and proxy-injector are still failing with:

linkerd-proxy [ 118.692279s] WARN ThreadId(02) identity:controller{addr=linkerd-identity-headless.linkerd.svc.cluster.local:8080}:endpoint{addr=10.244.3.137:8080}: linkerd_reconnect: Failed to connect error=received corrupt message

policy:watch{port=4191}:controller{addr=linkerd-policy.linkerd.svc.cluster.local:8090}: linkerd_app_core::control: Failed to resolve control-plane component error=no record found for name: linkerd-policy.linkerd.svc.cluster.local. type: SRV class: IN

@ayushiaks
Author

@adleong @olix0r any suggestions here? We're stuck on an upgrade

@olix0r
Member

olix0r commented May 19, 2022

linkerd-proxy [ 118.692279s] WARN ThreadId(02) identity:controller{addr=linkerd-identity-headless.linkerd.svc.cluster.local:8080}:endpoint{addr=10.244.3.137:8080}: linkerd_reconnect: Failed to connect error=received corrupt message

In this situation, I would look at the pod with IP 10.244.3.137 to see what its state is. corrupt message might indicate that these connections are not being terminated by the proxy, which would happen if the proxy isn't properly initialized (via iptables). It's hard for us to know what's going on based on only that log message, though.

policy:watch{port=4191}:controller{addr=linkerd-policy.linkerd.svc.cluster.local:8090}: linkerd_app_core::control: Failed to resolve control-plane component error=no record found for name: linkerd-policy.linkerd.svc.cluster.local. type: SRV class: IN

Does the linkerd-policy service exist? Does it have endpoints? This service maps to the linkerd-destination controller; so it may be expected to see this error during startup if there are no pods in that service. But, again, we'd need more information about the state of the cluster.
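
A couple of quick checks for both points above (a sketch; the linkerd namespace is assumed):

# Which pod owns the endpoint IP from the log line?
kubectl get pods -A -o wide | grep 10.244.3.137

# Does linkerd-policy exist, and does it have endpoints?
kubectl -n linkerd get svc linkerd-policy
kubectl -n linkerd get endpoints linkerd-policy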


It might be helpful for you to back up and explain how you tried to upgrade your cluster. For instance, how did you end up with the wrong container images? It sounds like something has gone very wrong, but it's hard for us to diagnose this without a lot more context about how you manage Linkerd in this cluster.

If you're really stuck and need hands-on help, you may also want to consider commercial support.

@ayushiaks
Author

ayushiaks commented May 20, 2022

Hey, we recently added changes to our Linkerd values file to pick up controller images from Microsoft's internal container registry, and I missed updating those while updating the Helm chart version.

This is what our config looks like:

linkerd2:
  # -- MCR image for the linkerd controller
  controllerImage: mcr.microsoft.com/oss/linkerd/controller
  controllerImageVersion: "2.10.2"
  imagePullPolicy: __IMAGE_PULL_POLICY__

There's nothing else that we've changed with Linkerd, which we have been using for a long time.

Something is going really wrong - true.
Even when downgrading to older versions and removing the MCR images, we end up with the same error :/
Failed to resolve control-plane component error=no record found for name: linkerd-policy.linkerd.svc.cluster.local. type: SRV class: IN

The corrupt message error was temporary, so I couldn't debug it further.
No, the linkerd-policy service does not exist within the linkerd namespace. Where is this coming from? Is it the same as linkerd-destination?

Right now, this is the cluster's state after the identity pods are up. All other containers are throwing the same error:
[screenshot: cluster pod state after the identity pods are up]

Will look into commercial support if there's no clarity here, thanks!

@ayushiaks
Author

ayushiaks commented May 20, 2022

I found this doc on debugging DNS issues.

kubectl exec -i -t dnsutils -- nslookup linkerd-identity-headless.linkerd.svc.cluster.local
Server:         10.0.0.10
Address:        10.0.0.10#53

** server can't find linkerd-identity-headless.linkerd.svc.cluster.local: NXDOMAIN

command terminated with exit code 1

The lookup fails in the clusters where I've tried the Linkerd upgrade (explaining why downgrading isn't helping), but works fine in other clusters.

Edit: False alarm, the linkerd namespace wasn't up during the lookup.
Now that the lookup is also working fine, I'm running out of ideas for debugging this :(

kubectl exec -i -t dnsutils -- nslookup linkerd-identity-headless.linkerd.svc.cluster.local
Server:         10.0.0.10
Address:        10.0.0.10#53

Name:   linkerd-identity-headless.linkerd.svc.cluster.local
Address: 10.244.4.11
Name:   linkerd-identity-headless.linkerd.svc.cluster.local
Address: 10.244.5.9
Name:   linkerd-identity-headless.linkerd.svc.cluster.local
Address: 10.244.7.17

@JasonMorgan
Contributor

@ayushiaks are you in the Linkerd slack? It would be great to connect a little more synchronously to see what we can do.

@olix0r
Member

olix0r commented May 24, 2022

It's possible this is related to #8296. I've pushed a proxy image (which will be included in this week's edge release) that can be used for testing: ghcr.io/olix0r/l2-proxy:main.c7b9c6565.

@ayushiaks
Author

@ayushiaks are you in the Linkerd slack? It would be great to connect a little more synchronously to see what we can do.

Not yet; thanks, I will join it!

It's possible this is related to #8296. I've pushed a proxy image (which will be included in this week's edge release) that can be used for testing: ghcr.io/olix0r/l2-proxy:main.c7b9c6565.

I have a weird observation here.

The record that it is not able to find is an FQDN, and nslookup for it succeeds as well.
I tried removing the trailing '.' from the record name, and the errors went away.

I replaced linkerd-identity-headless.linkerd.svc.cluster.local. with linkerd-identity-headless.linkerd.svc.cluster.local

Although, after the DNS errors go away, the pods are still stuck, with readiness and liveness probes failing and no warn/error logs.

For now, we have downgraded to 2.11.0 and things are working fine.
Even for that, we had to create a new cluster, as the other ones went into a bad state and downgrading back to 2.10.2 also didn't help us.

@tchellomello

tchellomello commented May 31, 2022

We were hitting the same issue on OpenShift 4.10, and we had to downgrade to 2.10.2 to resolve it.

Just to mention, in our case we also hit linkerd-identity-headless.linkerd.svc.cluster.local. not being resolvable by DNS. However, since this is a headless service entry, DNS will only be able to resolve it if the pods selected by the Service are up.

In our case, because the pod was in CrashLoopBackOff, the SRV entry was not resolved, which seems to be expected.
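
That lines up with how headless Services behave in general: DNS records are only published for ready endpoints, so comparing the endpoints list against the lookup is a quick sanity check (sketch; reuses the dnsutils pod from the earlier comment):

# An empty endpoints list would explain the missing A/SRV records.
kubectl -n linkerd get endpoints linkerd-identity-headless
kubectl exec -it dnsutils -- nslookup -type=SRV linkerd-identity-headless.linkerd.svc.cluster.local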

@ayushiaks
Author

@olix0r @adleong any updates here? 2.11.0 also started failing for us now :|

@olix0r
Member

olix0r commented Jun 22, 2022

@ayushiaks I don't think we have enough information to make any progress on this. My suspicion is that this has something to do with your cluster networking setup, but that's a guess. In order for us to help you, we really need to be able to replicate the problem, or we need a clear enough description of what changes in Linkerd are required to fix it. But at this point we don't have enough information to proceed.

I'll note that Buoyant is running Linkerd on AKS on Kubernetes v1.21.2 without any problems, but I don't have any sense of how that configuration differs from your environment.

2.11.0 also started failing for us now

I think this points to something changing in your cluster's environment.

@ayushiaks
Author

ayushiaks commented Jul 16, 2022

Hi @olix0r, we're on Linkerd 2.11.0 now, which is working fine; there were some transient issues with it. (2.11.1 and 2.11.2 are still causing the above issues.)
To repro this, we just bump the version and we see things failing, that's all.

Anyway, we switched to Azure CNI as the network plugin for our AKS clusters, and now one of our linkerd-destination pods is suddenly in CrashLoopBackOff, failing with:

WARN ThreadId(01) policy:watch{port=8090}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=Connection refused (os error 111)

I tried the repro for issue AKS#2750, but that is not the case for us.

And it's only failing in one of our clusters. Any idea what might be causing this, or how I can dig deeper here?

EDIT: We started facing random errors each time, in all sorts of pods, related to network connectivity timeouts.
As soon as we removed Linkerd from our AKS cluster, everything has been working fine.

Looks like something is wrong with CNI+Linkerd combination.
Anything we are missing here?

@olix0r
Member

olix0r commented Jul 21, 2022

WARN ThreadId(01) policy:watch{port=8090}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=Connection refused (os error 111)

This error indicates that the policy container is not running; the proxy can't connect to the controller on localhost. This may happen initially while the proxy waits for the policy controller to start, or it may indicate that the policy controller is failing to start. I'd look at the kubectl describe output on the destination pod and/or the logs from the policy controller.
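
Concretely, that could look something like this (a sketch; the label selector and the policy container name follow the standard layout, so adjust if your install differs):

# Inspect the destination pod's events and the policy container's logs.
kubectl -n linkerd describe pods -l linkerd.io/control-plane-component=destination
kubectl -n linkerd logs deploy/linkerd-destination -c policy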

And it's only failing in one of our clusters. Any idea what might be causing this, or how I can dig deeper here?

If I were in your shoes, I would try to identify how these clusters differ.

EDIT: We started facing random errors each time, in all sorts of pods, related to network connectivity timeouts.

This isn't really actionable for us.

Looks like something is wrong with CNI+Linkerd combination.
Anything we are missing here?

We have many users that use CNI & Linkerd successfully (in Azure, even), so it's more likely a problem with your specific cluster configuration.

I would suggest trying the latest Linkerd stable release, 2.11.4, which includes fixes for some DNS-related discovery problems, but I can't be confident that this will help you since I really don't understand the nature of the problems you're encountering.

I'm sorry that you're having trouble running Linkerd on these clusters, but as an open source project we can only really fix problems that we can reproduce; or we need very detailed problem descriptions that identify specific bugs in Linkerd. We don't really have bandwidth to help debug your environment. As I've mentioned previously, there are support vendors and training workshops that may be able to help you with this.

@stale

stale bot commented Oct 19, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Oct 19, 2022
@stale stale bot closed this as completed Nov 4, 2022
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 5, 2022