Linkerd 2.11.1 controller pods are not running with linkerd-cni option in AKS with Calico #7493

Closed
ranjith-vatakkeel opened this issue Dec 17, 2021 · 15 comments

@ranjith-vatakkeel

ranjith-vatakkeel commented Dec 17, 2021

What is the issue?

A fresh installation of the Linkerd 2.11.1 control plane together with the linkerd-cni plugin leaves control plane pods stuck in the ContainerCreating state.

$ k get po -n linkerd
NAME                                      READY   STATUS              RESTARTS   AGE
linkerd-destination-96cd85b9d-7skc7       0/4     ContainerCreating   0          15h
linkerd-identity-cbd8f4795-zxd7c          2/2     Running             0          15h
linkerd-proxy-injector-5ff5bd76d6-pqmh9   0/2     ContainerCreating   0          15h

How can it be reproduced?

Steps:

  1. Create an Azure AKS 1.21.2 cluster with Azure CNI and Calico
  2. Install linkerd-cni version 2.11.1: linkerd install-cni | kubectl apply -f -
  3. Install Linkerd 2.11.1: linkerd install --linkerd-cni-enabled | kubectl apply -f -

You will notice that both the destination and proxy-injector pods are stuck in the ContainerCreating state.

Logs, error output, etc

Because the pods never start, there are no logs or errors from them. The identity pod's proxy was complaining about the reachability of other pods.

$ k get po -n linkerd
NAME                                      READY   STATUS              RESTARTS   AGE
linkerd-destination-69479855b8-s2pmk      0/4     ContainerCreating   0          4m52s
linkerd-identity-cbd8f4795-bq9ql          2/2     Running             0          4m53s
linkerd-proxy-injector-68967c4549-jk6sq   0/2     ContainerCreating   0          4m52s

$ k logs linkerd-identity-cbd8f4795-bq9ql -n linkerd linkerd-proxy
[     0.000871s]  INFO ThreadId(01) linkerd2_proxy::rt: Using single-threaded proxy runtime
[     0.001644s]  INFO ThreadId(01) linkerd2_proxy: Admin interface on 0.0.0.0:4191
[     0.003173s]  INFO ThreadId(01) linkerd2_proxy: Inbound interface on 0.0.0.0:4143
[     0.003333s]  INFO ThreadId(01) linkerd2_proxy: Outbound interface on 127.0.0.1:4140
[     0.003427s]  INFO ThreadId(01) linkerd2_proxy: Tap DISABLED
[     0.003490s]  INFO ThreadId(01) linkerd2_proxy: Local identity is linkerd-identity.linkerd.serviceaccount.identity.linkerd.cluster.local
[     0.003568s]  INFO ThreadId(01) linkerd2_proxy: Identity verified via localhost:8080
[     0.003621s]  INFO ThreadId(01) linkerd2_proxy: Destinations resolved via linkerd-dst-headless.linkerd.svc.cluster.local:8086 (linkerd-destination.linkerd.serviceaccount.identity.linkerd.cluster.local)
[     0.005596s]  WARN ThreadId(02) identity:controller{addr=localhost:8080}:endpoint{addr=127.0.0.1:8080}: linkerd_reconnect: Failed to connect error=Connection refused (os error 111)
[     0.110413s]  WARN ThreadId(02) identity:controller{addr=localhost:8080}:endpoint{addr=127.0.0.1:8080}: linkerd_reconnect: Failed to connect error=Connection refused (os error 111)
[     0.318137s]  WARN ThreadId(02) identity:controller{addr=localhost:8080}:endpoint{addr=127.0.0.1:8080}: linkerd_reconnect: Failed to connect error=Connection refused (os error 111)
[     0.746489s]  WARN ThreadId(02) identity:controller{addr=localhost:8080}:endpoint{addr=127.0.0.1:8080}: linkerd_reconnect: Failed to connect error=Connection refused (os error 111)
[     1.270472s]  INFO ThreadId(02) daemon:identity: linkerd_app: Certified identity: linkerd-identity.linkerd.serviceaccount.identity.linkerd.cluster.local
[     5.005493s]  WARN ThreadId(01) policy:watch{port=4191}:controller{addr=linkerd-policy.linkerd.svc.cluster.local:8090}: linkerd_app_core::control: Failed to resolve control-plane component error=no record found for name: linkerd-policy.linkerd.svc.cluster.local. type: SRV class: IN
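
Since the stuck pods have no logs of their own, pod events are a reasonable next place to look. The following is a minimal sketch, assuming the default kubectl get po column layout shown above (NAME, READY, STATUS, ...). The helper name stuck_pods is hypothetical and not part of any Linkerd or Kubernetes tooling.

```shell
#!/bin/sh
# stuck_pods: read `kubectl get po` output on stdin and print the names of
# pods whose STATUS column (third field) is ContainerCreating.
# Assumption: default column order NAME READY STATUS RESTARTS AGE.
stuck_pods() {
  awk 'NR > 1 && $3 == "ContainerCreating" { print $1 }'
}

# Hypothetical usage (requires cluster access): show the events of each
# stuck pod, which may explain why the container never starts.
#   kubectl get po -n linkerd | stuck_pods | while read -r pod; do
#     kubectl describe pod "$pod" -n linkerd
#   done
```

The filter is kept separate from the kubectl calls so it can be checked against saved output before being pointed at a live cluster.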

Output of linkerd check -o short

$ linkerd check -o short
Linkerd core checks

linkerd-existence

\ No running pods for "linkerd-destination"

Environment

  • Kubernetes version: 1.21.2
  • Azure AKS with azure CNI and calico
  • Linux
  • Linkerd version: 2.11.1 with linkerd-cni plugin

Possible solution

Solution 1:
Reprovision the AKS cluster with the same config but without Calico. All features then work as expected.
Solution 2:
Reinstall Linkerd without the linkerd-cni plugin. This works like a charm.
Solution 3:
Remove the lifecycle section from the destination and proxy-injector pod specs. The pods then start and everything seems to work, but I don't know whether this is the right solution for a PROD environment.

lifecycle:
          postStart:
            exec:
              command:
              - /usr/lib/linkerd/linkerd-await

Additional context

It looks like the AKS Calico feature causes problems for Linkerd when it is deployed with linkerd-cni.
The older Linkerd version 2.10.2 worked fine with Calico and linkerd-cni, so this seems to be a bug in the new version.

Would you like to work on fixing this bug?

no

@ranjith-vatakkeel ranjith-vatakkeel changed the title Linkerd 2.11.1 controller pods are not running with linkerd-cni option in AKS with Azure CNI + Calico Linkerd 2.11.1 controller pods are not running with linkerd-cni option in AKS with Calico Dec 17, 2021
@alpeb
Member

alpeb commented Dec 21, 2021

@ranjith-vatakkeel could you please give this another shot with our latest edge release? We've recently upgraded our CNI library dependencies, and that might help.

@ranjith-vatakkeel
Author

@alpeb Thanks for the reply. I will check and update you.

@olix0r
Member

olix0r commented Jan 4, 2022

@ranjith-vatakkeel is your cluster configured with a custom cluster domain (i.e. not cluster.local)?

@alpeb
Member

alpeb commented Jan 19, 2022

I was able to reproduce the issue using the latest edge release. It does indeed appear only with the particular combination of Azure CNI + Calico. Unfortunately, I couldn't gather enough information to pinpoint the source of the problem. For the time being, the recommendation for this scenario remains: after installing linkerd-cni, install linkerd with the flag --set proxy.await=false.

@ranjith-vatakkeel
Author

ranjith-vatakkeel commented Jan 23, 2022

@alpeb I tried setting proxy.await=false, but it didn't help. Could you please confirm?
env: Azure CNI + Calico + linkerd-cni
linkerd-cni: 2.11.1
AKS: v1.22.4
Not working: linkerd install --set proxy.await=false --linkerd-cni-enabled | kubectl apply -f -
Working: linkerd install --set proxy.await=false | kubectl apply -f -
@olix0r no, we are using cluster.local.

@alpeb
Member

alpeb commented Jan 24, 2022

Indeed, I was under the wrong impression that the proxy.await setting would help here, but it only affects non-control-plane pods. You still need to manually remove the lifecycle snippet, I'm afraid. OTOH, since we were able to reproduce the problem, we're still working on a diagnosis and a possible solution. Will keep you posted.

@ranjith-vatakkeel
Author

@alpeb Thanks, we will wait for that. Just checking: is it fine to remove the lifecycle snippet in a prod environment?

@alpeb
Member

alpeb commented Jan 25, 2022

@ranjith-vatakkeel that hook blocks until the proxy is fully ready before the pod's main container is started. If you remove it, the main container might start before the proxy is ready, and its inbound and outbound connections will fail, at least until the proxy becomes ready. So whether that's fine depends on whether your main containers can tolerate that.
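
For main containers that cannot tolerate that window, one option is to have the entrypoint retry until the proxy answers. The following is only a sketch under assumptions: wait_until is a hypothetical helper, my-app is a placeholder for the real command, and the probe in the usage comment assumes that curl exists in the image and that the proxy admin port 4191 (seen in the logs earlier in this issue) serves a /ready endpoint.

```shell
#!/bin/sh
# wait_until MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Run COMMAND repeatedly until it succeeds or MAX_ATTEMPTS is reached.
# Returns 0 on the first success, 1 if every attempt failed.
wait_until() {
  max_attempts="$1"
  delay="$2"
  shift 2
  attempt=1
  while :; do
    # Stop as soon as the probe command succeeds.
    "$@" && return 0
    # Give up once the attempt budget is spent (no sleep after the last try).
    [ "$attempt" -ge "$max_attempts" ] && return 1
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# Hypothetical usage in a container entrypoint (assumptions as stated above):
#   wait_until 30 1 curl -sf -o /dev/null http://localhost:4191/ready \
#     && exec my-app
```

Passing the probe as a command keeps the loop independent of any particular readiness check, so a different probe can be swapped in if the /ready assumption does not hold for a given setup.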

@alpeb
Member

alpeb commented Jan 25, 2022

It turns out the issue isn't related to linkerd's CNI, and is more likely a glitch in the Azure CNI + Calico combo. I've opened Azure/AKS#2750 to track it down.

@CCOLLOT

CCOLLOT commented Mar 1, 2022

Hey, I have the same issue, but on an EKS + Linkerd + Linkerd CNI + AWS CNI + Calico (for network policies) setup.

After installing Linkerd-cni the destination and injector deployments won't start and are stuck in the await state (due to the lifecycle spec).

Looks like it is not only an AKS problem.

@alpeb
Member

alpeb commented Mar 1, 2022

Thanks for the report @CCOLLOT. Are you able to reproduce the issue with a minimal example such as the one referred to in Azure/AKS/issues/2750?

@CCOLLOT

CCOLLOT commented Mar 1, 2022

Here is what I get:

Using pod.yml:

apiVersion: v1
kind: Pod
metadata:
  name: curl
spec:
  containers:
  - image: curlimages/curl
    name: curl
    command: [ "sh", "-c", "--" ]
    args: [ "while true; do curl -k https://10.0.0.1; done;" ]
    lifecycle:
      postStart:
        exec:
          command: [ "sh", "-c", "--", "while true; do sleep 30; done;" ]

The container is stuck in ContainerCreating state

  • Output logs:
    {"log":"curl: (28) Failed to connect to 10.0.0.1 port 443 after 131054 ms: Operation timed out\n","stream":"stderr","time":"2022-03-01T16:42:24.026329358Z"}
    {"log":"  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current\n","stream":"stderr","time":"2022-03-01T16:42:24.029975622Z"}
    {"log":"                                 Dload  Upload   Total   Spent    Left  Speed\n","stream":"stderr","time":"2022-03-01T16:42:24.030003263Z"}

Using otherpod.yml:

apiVersion: v1
kind: Pod
metadata:
  name: othercurl
spec:
  containers:
  - image: curlimages/curl
    name: curl
    command: [ "sh", "-c", "--" ]
    args: [ "while true; do curl -k https://10.0.0.1; done;" ]

The container starts normally.

  • Output logs:
    {"log":"  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current\n","stream":"stderr","time":"2022-03-01T16:45:05.870744736Z"}
    {"log":"                                 Dload  Upload   Total   Spent    Left  Speed\n","stream":"stderr","time":"2022-03-01T16:45:05.870774556Z"}

@Fodoj

Fodoj commented May 23, 2022

A similar issue is happening with just AWS VPC CNI + Linkerd CNI on AWS EKS (K8s 1.21, Linkerd 2.11.2 stable, VPC CNI 1.10.1).

@olix0r
Member

olix0r commented May 24, 2022

It's possible this is related to #8296. I've pushed a proxy image (which will be included in this week's edge release) that can be used for testing: ghcr.io/olix0r/l2-proxy:main.c7b9c6565.

@stale

stale bot commented Aug 31, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Aug 31, 2022
@stale stale bot closed this as completed Sep 14, 2022
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 15, 2022