Linkerd 2.11.1 controller pods are not running with linkerd-cni option in AKS with Calico #7493

ranjith-vatakkeel · 2021-12-17T14:59:51Z

What is the issue?

A fresh installation of the Linkerd 2.11.1 control plane together with the CNI plugin leaves pods stuck in the ContainerCreating state.

$ k get po -n linkerd
NAME                                      READY   STATUS              RESTARTS   AGE
linkerd-destination-96cd85b9d-7skc7       0/4     ContainerCreating   0          15h
linkerd-identity-cbd8f4795-zxd7c          2/2     Running             0          15h
linkerd-proxy-injector-5ff5bd76d6-pqmh9   0/2     ContainerCreating   0          15h

How can it be reproduced?

Steps:

Install Azure AKS 1.21.2 with Azure CNI and Calico.
Install linkerd-cni version 2.11.1: `linkerd install-cni | kubectl apply -f -`
Install Linkerd 2.11.1: `linkerd install --linkerd-cni-enabled | kubectl apply -f -`

You will notice that both the destination and proxy-injector pods are stuck in the ContainerCreating state.

Logs, error output, etc

Since the pods are not starting, there are no logs or errors from them. The identity pod was complaining about the reachability of other pods.

$ k get po -n linkerd
NAME                                      READY   STATUS              RESTARTS   AGE
linkerd-destination-69479855b8-s2pmk      0/4     ContainerCreating   0          4m52s
linkerd-identity-cbd8f4795-bq9ql          2/2     Running             0          4m53s
linkerd-proxy-injector-68967c4549-jk6sq   0/2     ContainerCreating   0          4m52s

$ k logs linkerd-identity-cbd8f4795-bq9ql -n linkerd linkerd-proxy
[     0.000871s]  INFO ThreadId(01) linkerd2_proxy::rt: Using single-threaded proxy runtime
[     0.001644s]  INFO ThreadId(01) linkerd2_proxy: Admin interface on 0.0.0.0:4191
[     0.003173s]  INFO ThreadId(01) linkerd2_proxy: Inbound interface on 0.0.0.0:4143
[     0.003333s]  INFO ThreadId(01) linkerd2_proxy: Outbound interface on 127.0.0.1:4140
[     0.003427s]  INFO ThreadId(01) linkerd2_proxy: Tap DISABLED
[     0.003490s]  INFO ThreadId(01) linkerd2_proxy: Local identity is linkerd-identity.linkerd.serviceaccount.identity.linkerd.cluster.local
[     0.003568s]  INFO ThreadId(01) linkerd2_proxy: Identity verified via localhost:8080
[     0.003621s]  INFO ThreadId(01) linkerd2_proxy: Destinations resolved via linkerd-dst-headless.linkerd.svc.cluster.local:8086 (linkerd-destination.linkerd.serviceaccount.identity.linkerd.cluster.local)
[     0.005596s]  WARN ThreadId(02) identity:controller{addr=localhost:8080}:endpoint{addr=127.0.0.1:8080}: linkerd_reconnect: Failed to connect error=Connection refused (os error 111)
[     0.110413s]  WARN ThreadId(02) identity:controller{addr=localhost:8080}:endpoint{addr=127.0.0.1:8080}: linkerd_reconnect: Failed to connect error=Connection refused (os error 111)
[     0.318137s]  WARN ThreadId(02) identity:controller{addr=localhost:8080}:endpoint{addr=127.0.0.1:8080}: linkerd_reconnect: Failed to connect error=Connection refused (os error 111)
[     0.746489s]  WARN ThreadId(02) identity:controller{addr=localhost:8080}:endpoint{addr=127.0.0.1:8080}: linkerd_reconnect: Failed to connect error=Connection refused (os error 111)
[     1.270472s]  INFO ThreadId(02) daemon:identity: linkerd_app: Certified identity: linkerd-identity.linkerd.serviceaccount.identity.linkerd.cluster.local
[     5.005493s]  WARN ThreadId(01) policy:watch{port=4191}:controller{addr=linkerd-policy.linkerd.svc.cluster.local:8090}: linkerd_app_core::control: Failed to resolve control-plane component error=no record found for name: linkerd-policy.linkerd.svc.cluster.local. type: SRV class: IN

output of `linkerd check -o short`

$ linkerd check -o short
Linkerd core checks

linkerd-existence

\ No running pods for "linkerd-destination"

Environment

Kubernetes version: 1.21.2
Azure AKS with azure CNI and calico
Linux
Linkerd version: 2.11.1 with linkerd-cni plugin

Possible solution

Solution 1:
Reprovision the AKS cluster with the same config but without Calico. All features then work as expected.
Solution 2:
Reinstall Linkerd without the linkerd-cni plugin; this works like a charm.
Solution 3:
Remove the lifecycle section from the destination and proxy-injector pod specs. The pods then start and everything seems to work, but I don't know whether this is the right solution for a PROD environment.

lifecycle:
  postStart:
    exec:
      command:
      - /usr/lib/linkerd/linkerd-await
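
For anyone trying Solution 3, the hook can be removed with a JSON Patch instead of editing manifests by hand. This is a minimal sketch, not an official procedure: the helper names are hypothetical, and the container index is an assumption that has to be checked against the deployment first (the hook sits on the linkerd-proxy container). Reapplying the output of `linkerd install` would put the hook back.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical; verify the container index first.

# Print the container names of a deployment in the linkerd namespace, one per
# line, so the position of linkerd-proxy can be read off (first line is index 0).
list_containers() {
  kubectl get deployment "$1" -n linkerd \
    -o jsonpath='{range .spec.template.spec.containers[*]}{.name}{"\n"}{end}'
}

# Print a JSON Patch that removes the lifecycle block from the container at
# the given index of the pod template.
lifecycle_remove_patch() {
  printf '[{"op":"remove","path":"/spec/template/spec/containers/%s/lifecycle"}]\n' "$1"
}

# Apply the patch to one deployment, e.g.: remove_await_hook linkerd-destination 0
remove_await_hook() {
  kubectl patch deployment "$1" -n linkerd --type=json \
    -p "$(lifecycle_remove_patch "$2")"
}
```

A possible sequence is `list_containers linkerd-destination` followed by `remove_await_hook linkerd-destination <index>`, then the same for linkerd-proxy-injector. A JSON Patch `remove` on a path that does not exist should be rejected, which gives some protection against picking the wrong index.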

Additional context

It looks like the AKS Calico feature causes problems for Linkerd when it is deployed with linkerd-cni.
The older Linkerd version 2.10.2 worked fine with Calico and linkerd-cni, so this seems to be a bug in the new version.

Would you like to work on fixing this bug?

no


alpeb · 2021-12-21T19:20:05Z

@ranjith-vatakkeel could you please give this another shot with our latest edge release? We've recently upgraded our CNI library dependencies, and that might help.

ranjith-vatakkeel · 2021-12-27T10:09:50Z

@alpeb Thanks for the reply. I will check and update you.

olix0r · 2022-01-04T17:26:47Z

@ranjith-vatakkeel is your cluster configured with a custom cluster domain (i.e. not cluster.local)?

alpeb · 2022-01-19T21:24:51Z

I was able to reproduce the issue using the latest edge, which does indeed only appear with the particular combination of Azure CNI + Calico. Unfortunately, I couldn't retrieve enough information to pinpoint the source of the problem. For the time being, the recommendation under this scenario remains, after installing linkerd-cni, to install linkerd using the flag --set proxy.await=false

ranjith-vatakkeel · 2022-01-23T14:55:38Z

@alpeb I tried setting proxy.await=false but it didn't help. Could you please confirm?
env: Azure CNI + Calico + linkerd-cni
linkerd-cni: 2.11.1
AKS: v1.22.4
Not working: `linkerd install --set proxy.await=false --linkerd-cni-enabled | kubectl apply -f -`
Working: `linkerd install --set proxy.await=false | kubectl apply -f -`
@olix0r no, we are using cluster.local.

alpeb · 2022-01-24T14:17:39Z

Indeed, I was under the wrong impression that the proxy.await setting would help here, but it only affects non-control-plane pods. You still need to manually remove the lifecycle snippet, I'm afraid. OTOH, since we were able to reproduce the problem, we're still working on a diagnosis and a possible solution. Will keep you posted.

ranjith-vatakkeel · 2022-01-25T10:04:44Z

@alpeb Thanks, we will wait for that. Just checking: is it fine to remove the lifecycle snippet in a prod environment?

alpeb · 2022-01-25T14:26:38Z

@ranjith-vatakkeel that hook blocks until the proxy is fully ready before starting the pod's main container. If you remove it, the main container might start before the proxy is ready, and its inbound and outbound connections will fail, at least until the proxy becomes ready. So whether that's fine depends on whether your main containers can tolerate that.

alpeb · 2022-01-25T15:24:18Z

It turns out the issue isn't related to Linkerd's CNI, and is more likely a glitch in the Azure CNI + Calico combo. I've opened Azure/AKS#2750 to track it down.

CCOLLOT · 2022-03-01T12:42:06Z

Hey, I have the same issue, but on an EKS + Linkerd + Linkerd CNI + AWS CNI + Calico (for network policies) setup.

After installing linkerd-cni, the destination and injector deployments won't start and are stuck in the await state (due to the lifecycle spec).

Looks like it is not only an AKS problem.

alpeb · 2022-03-01T14:15:12Z

Thanks for the report @CCOLLOT. Are you able to reproduce the issue with a minimal example such as the one referred to in Azure/AKS/issues/2750?

CCOLLOT · 2022-03-01T16:56:12Z

Here is what I get:

Using pod.yml:

apiVersion: v1
kind: Pod
metadata:
  name: curl
spec:
  containers:
  - image: curlimages/curl
    name: curl
    command: [ "sh", "-c", "--" ]
    args: [ "while true; do curl -k https://10.0.0.1; done;" ]
    lifecycle:
      postStart:
        exec:
          command: [ "sh", "-c", "--", "while true; do sleep 30; done;" ]

The container is stuck in ContainerCreating state

Output logs:

{"log":"curl: (28) Failed to connect to 10.0.0.1 port 443 after 131054 ms: Operation timed out\n","stream":"stderr","time":"2022-03-01T16:42:24.026329358Z"}
{"log":"  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current\n","stream":"stderr","time":"2022-03-01T16:42:24.029975622Z"}
{"log":"                                 Dload  Upload   Total   Spent    Left  Speed\n","stream":"stderr","time":"2022-03-01T16:42:24.030003263Z"}

Using otherpod.yml:

apiVersion: v1
kind: Pod
metadata:
  name: othercurl
spec:
  containers:
  - image: curlimages/curl
    name: curl
    command: [ "sh", "-c", "--" ]
    args: [ "while true; do curl -k https://10.0.0.1; done;" ]

The container starts normally.

Output logs:

{"log":"  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current\n","stream":"stderr","time":"2022-03-01T16:45:05.870744736Z"}
{"log":"                                 Dload  Upload   Total   Spent    Left  Speed\n","stream":"stderr","time":"2022-03-01T16:45:05.870774556Z"}
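
In the excerpts above, the difference between the two runs shows up as a curl timeout (curl error 28) in the first log and none in the second. Below is a small sketch for counting such timeouts in a saved log; the function name and the log file name are hypothetical, and how the logs are collected is left open because the thread does not say.

```shell
#!/usr/bin/env bash
# Sketch only: count curl timeout errors (curl error 28) in log text on stdin.
# The function name is hypothetical.
count_curl_timeouts() {
  # grep -c exits non-zero when nothing matches; "|| true" keeps a zero count
  # from being treated as a failure.
  grep -c 'curl: (28)' || true
}
```

For example, `count_curl_timeouts < pod.log` prints the number of matching lines in `pod.log`.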

Fodoj · 2022-05-23T12:23:28Z

A similar issue is happening with just AWS VPC CNI + Linkerd CNI on AWS EKS (K8s 1.21, Linkerd 2.11.2 stable, VPC CNI 1.10.1).

olix0r · 2022-05-24T16:18:41Z

It's possible this is related to #8296. I've pushed a proxy image (which will be included in this week's edge release) that can be used for testing: ghcr.io/olix0r/l2-proxy:main.c7b9c6565.

stale · 2022-08-31T01:17:21Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

ranjith-vatakkeel added the bug label Dec 17, 2021

ranjith-vatakkeel changed the title ~~Linkerd 2.11.1 controller pods are not running with linkerd-cni option in AKS with Azure CNI + Calico~~ Linkerd 2.11.1 controller pods are not running with linkerd-cni option in AKS with Calico Dec 17, 2021

alpeb mentioned this issue Jan 18, 2022

Linkerd installation failing on AKS with linkerd-destination and linkerd-proxy-injector pods ending up in CrashLoopBackOff #7633

Closed

alpeb self-assigned this Jan 18, 2022

olix0r added the env/aks Microsoft AKS label Jan 25, 2022

brkane mentioned this issue Feb 4, 2022

Linkerd 2.11.1 controller pods are not running with linkerd-cni option in self-hosted k8s on AWS EC2 + Calico eBPF #7786

Closed

adleong added the area/cni label Mar 1, 2022

stale bot added the wontfix label Aug 31, 2022

stale bot closed this as completed Sep 14, 2022

github-actions bot locked as resolved and limited conversation to collaborators Oct 15, 2022
