Linkerd 2.11.x Control Plane Components Failing #8496

Closed

ayushiaks opened this issue May 16, 2022 · 19 comments

@ayushiaks

What is the issue?

After upgrading the Linkerd Helm chart from stable-2.10.2 to 2.11.2, all Linkerd control-plane components are failing.
We're using AKS with kubenet, on Kubernetes 1.21.7.

[screenshot: failing control-plane pods]

How can it be reproduced?

Upgrade from Helm chart 2.10.2 to 2.11.x.
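
For context, an upgrade like the one described here would typically be driven by something along these lines (a sketch only; the release name linkerd2 and the use of --reuse-values are assumptions, not taken from this report):

# Sketch of the Helm-based upgrade being described (names/flags are assumptions).
helm repo update
helm upgrade linkerd2 linkerd/linkerd2 --version 2.11.2 --reuse-values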

Logs, error output, etc

Logs from the linkerd-proxy container of linkerd-proxy-injector and linkerd-destination:

[0.005486s]  INFO ThreadId(01) linkerd2_proxy: Identity verified via linkerd-identity-headless.linkerd.svc.cluster.local:8080 (linkerd-identity.linkerd.serviceaccount.identity.linkerd.cluster.local)
[0.005494s]  INFO ThreadId(01) linkerd2_proxy: Destinations resolved via linkerd-dst-headless.linkerd.svc.cluster.local:8086 (linkerd-destination.linkerd.serviceaccount.identity.linkerd.cluster.local)
[0.008909s]  WARN ThreadId(01) daemon:identity: linkerd_app_core::control: Failed to resolve control-plane component error=no record found for name: linkerd-identity-headless.linkerd.svc.cluster.local. type: SRV class: IN
[0.012103s]  WARN ThreadId(01) daemon:identity: linkerd_app_core::control: Failed to resolve control-plane component error=no record found for name: linkerd-identity-headless.linkerd.svc.cluster.local. type: SRV class: IN
[0.028235s]  INFO ThreadId(01) linkerd_proxy::signal: received SIGTERM, starting shutdown

output of linkerd check -o short

kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API

kubernetes-version
------------------
√ is running the minimum Kubernetes API version
√ is running the minimum kubectl version

linkerd-existence
-----------------
√ 'linkerd-config' config map exists
√ heartbeat ServiceAccount exist
√ control plane replica sets are ready
√ no unschedulable pods
| linkerd-destination-74cbd87444-lfckk status is CrashLoopBackOff

Environment

  • Kubernetes version - 1.21.7
  • AKS Cluster
  • Host OS - linux
  • Linkerd version - helm chart 2.11.2

Possible solution

No response

Additional context

No response

Would you like to work on fixing this bug?

No response

@ayushiaks ayushiaks added the bug label May 16, 2022
@ayushiaks ayushiaks changed the title Linkerd 2.11.x control plane components failing Linkerd 2.11.x Control Plane Components Failing May 16, 2022
@olix0r
Member

olix0r commented May 16, 2022

The fact that the identity controller is in a crash loop probably points to this being a CNI/proxy-init related issue.

We are aware of a likely bug in the Azure CNI. Are you able to test the reproduction described in Azure/AKS#2750?

I'd start by trying to understand why the identity controller isn't healthy; nothing else will start without it.
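
A few commands that can help with that (a sketch; it assumes the control plane is in the linkerd namespace and uses the standard linkerd.io/control-plane-component label, so adjust if your install differs):

# Inspect the identity controller's pods, events, and container logs.
kubectl -n linkerd get pods -l linkerd.io/control-plane-component=identity -o wide
kubectl -n linkerd describe pods -l linkerd.io/control-plane-component=identity
kubectl -n linkerd logs -l linkerd.io/control-plane-component=identity --all-containers --prefix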

@olix0r olix0r added support env/aks Microsoft AKS and removed bug labels May 16, 2022
@ayushiaks
Author

ayushiaks commented May 17, 2022

Hi @olix0r,
I tried the repro described, and the pod stays in the ContainerCreating stage.
I haven't looked into the logs since ephemeral containers aren't enabled on our clusters right now; I'm looking into enabling that.

After removing the lifecycle snippet, things work fine.

We aren't using Azure CNI though, we're working with kubenet.

@ayushiaks
Author

Well, even in the ContainerCreating stage, the curl command works just fine:

100   165  100   165    0     0   7168      0 --:--:-- --:--:-- --:--:--  7500
2022-05-17T19:21:09.19905971Z stderr F   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
2022-05-17T19:21:09.199148719Z stderr F                                  Dload  Upload   Total   Spent    Left  Speed
2022-05-17T19:21:09.227217806Z stdout F }{
2022-05-17T19:21:09.227242408Z stdout F   "kind": "Status",
2022-05-17T19:21:09.22725491Z stdout F   "apiVersion": "v1",
2022-05-17T19:21:09.227265611Z stdout F   "metadata": {
2022-05-17T19:21:09.227276312Z stdout F
2022-05-17T19:21:09.227286713Z stdout F   },
2022-05-17T19:21:09.227296814Z stdout F   "status": "Failure",
2022-05-17T19:21:09.227306715Z stdout F   "message": "Unauthorized",
2022-05-17T19:21:09.227320416Z stdout F   "reason": "Unauthorized",
2022-05-17T19:21:09.227330617Z stdout F   "code": 401
100   165  100   165    0     0   5811      0 --:--:-- --:--:-- --:--:--  5892

So this doesn't point to the same issue.

@ayushiaks
Author

Also, to put it out there: linkerd-proxy-injector and linkerd-destination initially stay in CrashLoopBackOff with the following error:

PostStartHookError: rpc error: code = Unknown desc = failed to exec in container: failed to start exec "9221355bbc14fb825dc85ec6023a75607ae9f9570af5729c4c3c1676f610533a": OCI runtime exec failed: exec failed: container_linux.go:380: starting container process caused: exec: "/usr/lib/linkerd/linkerd-await": stat /usr/lib/linkerd/linkerd-await: no such file or directory: unknown

It's only when I remove the lifecycle snippet for linkerd-await that we reach the DNS-related issue.
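
For reference, the lifecycle snippet being removed is the proxy's postStart hook, which looks roughly like this (a sketch reconstructed from the error above; the exact arguments are an assumption):

# postStart hook on the linkerd-proxy container (sketch; the --timeout flag is illustrative).
lifecycle:
  postStart:
    exec:
      command:
        - /usr/lib/linkerd/linkerd-await
        - --timeout=2m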

linkerd-identity, on the other hand, keeps failing with Readiness probe failed: HTTP probe failed with statuscode: 503.
The logs don't point to anything either:

 linkerd-proxy time="2022-05-17T19:24:50Z" level=info msg="running version stable-2.10.2"
 linkerd-proxy [0.004734s]  INFO ThreadId(01) linkerd2_proxy::rt: Using single-threaded proxy runtime
 linkerd-proxy [0.005672s]  INFO ThreadId(01) linkerd2_proxy: Admin interface on 0.0.0.0:4191
 linkerd-proxy [0.005710s]  INFO ThreadId(01) linkerd2_proxy: Inbound interface on 0.0.0.0:4143
 linkerd-proxy [0.005717s]  INFO ThreadId(01) linkerd2_proxy: Outbound interface on 127.0.0.1:4140
 linkerd-proxy [0.005724s]  INFO ThreadId(01) linkerd2_proxy: Tap DISABLED
 linkerd-proxy [0.005730s]  INFO ThreadId(01) linkerd2_proxy: Local identity is linkerd-identity.linkerd.serviceaccount.identity.linkerd.cluster.local
 linkerd-proxy [0.005738s]  INFO ThreadId(01) linkerd2_proxy: Identity verified via localhost:8080
 linkerd-proxy [0.005745s]  INFO ThreadId(01) linkerd2_proxy: Destinations resolved via linkerd-dst-headless.linkerd.svc.cluster.local:8086 (linkerd-destination.linkerd.serviceaccount.identity.linkerd.cluster.local)

@adleong
Member

adleong commented May 17, 2022

Hi @ayushiaks. Is it possible that you're somehow using the stable-2.11.2 Helm charts with the stable-2.10.2 Docker images? I notice in your proxy logs that the proxy seems to be version stable-2.10.2. This might explain why the post start hook in the chart references a linkerd-await binary that doesn't exist in the container.
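
One quick way to confirm which images the control plane is actually running (a sketch, assuming the linkerd namespace):

# Print each pod in the linkerd namespace along with its container images.
kubectl -n linkerd get pods \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{range .spec.containers[*]}{.image}{" "}{end}{"\n"}{end}'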

@ayushiaks
Author

ayushiaks commented May 18, 2022

@adleong thanks for pointing that out! Fixing that got my linkerd-identity pods up and running, but the destination and proxy-injector are still failing with:

linkerd-proxy [ 118.692279s] WARN ThreadId(02) identity:controller{addr=linkerd-identity-headless.linkerd.svc.cluster.local:8080}:endpoint{addr=10.244.3.137:8080}: linkerd_reconnect: Failed to connect error=received corrupt message

policy:watch{port=4191}:controller{addr=linkerd-policy.linkerd.svc.cluster.local:8090}: linkerd_app_core::control: Failed to resolve control-plane component error=no record found for name: linkerd-policy.linkerd.svc.cluster.local. type: SRV class: IN

@ayushiaks
Author

@adleong @olix0r any suggestions here? We're stuck on an upgrade

@olix0r
Member

olix0r commented May 19, 2022

linkerd-proxy [ 118.692279s] WARN ThreadId(02) identity:controller{addr=linkerd-identity-headless.linkerd.svc.cluster.local:8080}:endpoint{addr=10.244.3.137:8080}: linkerd_reconnect: Failed to connect error=received corrupt message

In this situation, I would look at the pod with IP 10.244.3.137 to see what its state is. corrupt message might indicate that these connections are not being terminated by the proxy, which would happen if the proxy isn't properly initialized (via iptables). It's hard for us to know what's going on based on only that log message, though.

policy:watch{port=4191}:controller{addr=linkerd-policy.linkerd.svc.cluster.local:8090}: linkerd_app_core::control: Failed to resolve control-plane component error=no record found for name: linkerd-policy.linkerd.svc.cluster.local. type: SRV class: IN

Does the linkerd-policy service exist? Does it have endpoints? This service maps to the linkerd-destination controller; so it may be expected to see this error during startup if there are no pods in that service. But, again, we'd need more information about the state of the cluster.
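
A couple of quick checks for both points above (a sketch; the linkerd namespace is assumed):

# Which pod owns the endpoint IP from the log line?
kubectl get pods -A -o wide | grep 10.244.3.137

# Does linkerd-policy exist, and does it have endpoints?
kubectl -n linkerd get svc linkerd-policy
kubectl -n linkerd get endpoints linkerd-policy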


It might be helpful for you to back up and explain how you tried to upgrade your cluster. For instance, how did you end up with the wrong container images? It sounds like something has gone very wrong, but it's hard for us to diagnose this without a lot more context about how you manage Linkerd in this cluster.

If you're really stuck and need hands-on help, you may also want to consider commercial support.

@ayushiaks
Author

ayushiaks commented May 20, 2022

Hey, we recently added changes to our Linkerd values file to pick up controller images from Microsoft's internal container registry, and I missed updating those while updating the Helm chart version.

This is what our config looks like:

linkerd2:
  # -- MCR image for the linkerd controller
  controllerImage: mcr.microsoft.com/oss/linkerd/controller
  controllerImageVersion: "2.10.2"
  imagePullPolicy: __IMAGE_PULL_POLICY__

There's nothing else that we've changed with Linkerd, which we have been using for a long time.

Something is going really wrong - true.
Even when downgrading to older versions and removing the MCR images, we end up with the same error :/
Failed to resolve control-plane component error=no record found for name: linkerd-policy.linkerd.svc.cluster.local. type: SRV class: IN

The corrupt message error was temporary, so I couldn't debug it further.
No, the linkerd-policy service does not exist within the linkerd namespace. Where is this coming from? Is it the same as linkerd-destination?

Right now, this is the cluster's state after the identity pods are up. All other containers are throwing the same error:
[screenshot: cluster pod state after the identity pods are up]

Will look into commercial support if there's no clarity here, thanks!

@ayushiaks
Author

ayushiaks commented May 20, 2022

I found this doc on debugging DNS issues.

kubectl exec -i -t dnsutils -- nslookup linkerd-identity-headless.linkerd.svc.cluster.local
Server:         10.0.0.10
Address:        10.0.0.10#53

** server can't find linkerd-identity-headless.linkerd.svc.cluster.local: NXDOMAIN

command terminated with exit code 1

The lookup fails in the clusters where I've tried the Linkerd upgrade (explaining why downgrading isn't helping), but works fine in other clusters.

Edit: False alarm, the linkerd namespace wasn't up during the lookup.
Now that the lookup is also working fine, I'm running out of ideas for debugging this :(

kubectl exec -i -t dnsutils -- nslookup linkerd-identity-headless.linkerd.svc.cluster.local
Server:         10.0.0.10
Address:        10.0.0.10#53

Name:   linkerd-identity-headless.linkerd.svc.cluster.local
Address: 10.244.4.11
Name:   linkerd-identity-headless.linkerd.svc.cluster.local
Address: 10.244.5.9
Name:   linkerd-identity-headless.linkerd.svc.cluster.local
Address: 10.244.7.17

@JasonMorgan
Contributor

@ayushiaks are you in the Linkerd slack? It would be great to connect a little more synchronously to see what we can do.

@olix0r
Member

olix0r commented May 24, 2022

It's possible this is related to #8296. I've pushed a proxy image (which will be included in this week's edge release) that can be used for testing: ghcr.io/olix0r/l2-proxy:main.c7b9c6565.

@ayushiaks
Author

@ayushiaks are you in the Linkerd slack? It would be great to connect a little more synchronously to see what we can do.

Not yet; thanks, I will join it!

It's possible this is related to #8296. I've pushed a proxy image (which will be included in this week's edge release) that can be used for testing: ghcr.io/olix0r/l2-proxy:main.c7b9c6565.

I have a weird observation here.

The record that it is not able to find is an FQDN, and nslookup for it succeeds as well.
I tried removing the trailing '.' from the record name, and the errors went away.

I replaced linkerd-identity-headless.linkerd.svc.cluster.local. with linkerd-identity-headless.linkerd.svc.cluster.local

Although, after the DNS errors go away, the pods are still stuck, with readiness and liveness probes failing and no warn/error logs.

For now, we have downgraded to 2.11.0 and things are working fine.
Even for that, we had to create a new cluster, as the other ones went into a bad state and downgrading back to 2.10.2 also didn't help us.

@tchellomello

tchellomello commented May 31, 2022

We were hitting the same issue on OpenShift 4.10, and we had to downgrade to 2.10.2 to resolve it.

Just to mention, in our case we also hit linkerd-identity-headless.linkerd.svc.cluster.local. not being resolvable by DNS. However, since this is a headless service entry, DNS will only be able to resolve it if the pods selected by the Service are up.

In our case, because the pod was in CrashLoopBackOff, the SRV entry was not resolved, which seems to be expected.
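
That lines up with how headless Services behave in general: DNS records are only published for ready endpoints, so comparing the endpoints list against the lookup is a quick sanity check (sketch; reuses the dnsutils pod from the earlier comment):

# An empty endpoints list would explain the missing A/SRV records.
kubectl -n linkerd get endpoints linkerd-identity-headless
kubectl exec -it dnsutils -- nslookup -type=SRV linkerd-identity-headless.linkerd.svc.cluster.local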

@ayushiaks
Author

@olix0r @adleong any updates here? 2.11.0 also started failing for us now :|

@olix0r
Member

olix0r commented Jun 22, 2022

@ayushiaks I don't think we have enough information to make any progress on this. My suspicion is that this has something to do with your cluster networking setup, but that's a guess. In order for us to help you, we really need to be able to replicate the problem, or we need a clear enough description of what changes in Linkerd are required to fix it. But at this point we don't have enough information to proceed.

I'll note that Buoyant is running Linkerd on AKS on Kubernetes v1.21.2 without any problems, but I don't have any sense of how that configuration differs from your environment.

2.11.0 also started failing for us now

I think this points to something changing in your cluster's environment.

@ayushiaks
Author

ayushiaks commented Jul 16, 2022

Hi @olix0r, we're on Linkerd 2.11.0 now, which is working fine; there were some transient issues with it. (2.11.1 and 2.11.2 are still causing the above issues.)
To repro this, we just bump the version and we see things failing, that's all.

Anyway, we switched to Azure CNI as the network plugin for our AKS clusters, and now one of our linkerd-destination pods is suddenly in CrashLoopBackOff, failing with:

WARN ThreadId(01) policy:watch{port=8090}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=Connection refused (os error 111)

I tried the repro for issue AKS#2750, but that is not the case for us.

And it's only failing in one of our clusters. Any idea what might be causing this, or how I can dig deeper here?

EDIT: We started facing random errors each time, in all sorts of pods, related to network connectivity timeouts.
As soon as we removed Linkerd from our AKS cluster, everything has been working fine.

Looks like something is wrong with CNI+Linkerd combination.
Anything we are missing here?

@olix0r
Member

olix0r commented Jul 21, 2022

WARN ThreadId(01) policy:watch{port=8090}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=Connection refused (os error 111)

This error indicates that the policy container is not running; the proxy can't connect to the controller on localhost. This may happen initially while the proxy waits for the policy controller to start, or it may indicate that the policy controller is failing to start. I'd look at the kubectl describe output on the destination pod and/or the logs from the policy controller.
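
Concretely, that could look something like this (a sketch; the label selector and the policy container name follow the standard layout, so adjust if your install differs):

# Inspect the destination pod's events and the policy container's logs.
kubectl -n linkerd describe pods -l linkerd.io/control-plane-component=destination
kubectl -n linkerd logs deploy/linkerd-destination -c policy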

And it's only failing in one of our clusters. Any idea what might be causing this, or how I can dig deeper here?

If I were in your shoes, I would try to identify how these clusters differ.

EDIT: We started facing random errors each time, in all sorts of pods, related to network connectivity timeouts.

This isn't really actionable for us.

Looks like something is wrong with CNI+Linkerd combination.
Anything we are missing here?

We have many users that use CNI & Linkerd successfully (in Azure, even), so it's more likely a problem with your specific cluster configuration.

I would suggest trying the latest Linkerd stable release, 2.11.4, which includes fixes for some DNS-related discovery problems, but I can't be confident that this will help you since I really don't understand the nature of the problems you're encountering.

I'm sorry that you're having trouble running Linkerd on these clusters, but as an open source project we can only really fix problems that we can reproduce; or we need very detailed problem descriptions that identify specific bugs in Linkerd. We don't really have bandwidth to help debug your environment. As I've mentioned previously, there are support vendors and training workshops that may be able to help you with this.

@stale

stale bot commented Oct 19, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Oct 19, 2022
@stale stale bot closed this as completed Nov 4, 2022
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 5, 2022