Network inaccessible when pod starting, when using Azure CNI + Calico #2750

Closed

alpeb opened this issue Jan 25, 2022 · 48 comments


alpeb commented Jan 25, 2022

What happened:

Linkerd's control-plane pods have a sidecar proxy that starts before the main container. The proxy manifest has a postStart hook that blocks the creation of the main container until the proxy is ready. Only when using the combination of Azure CNI + Calico does the proxy container appear to have no network, which prevents it from becoming ready and keeps the pod startup from completing (the pod remains in the ContainerCreating status).
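
For context, the pattern involved looks roughly like the sketch below. This is an illustrative stand-in rather than Linkerd's actual injected manifest; the image names, the readiness port and path, and the availability of a shell and wget in the proxy image are all assumptions:

apiVersion: v1
kind: Pod
metadata:
  name: example
spec:
  containers:
  # sidecar proxy, listed first so the kubelet starts it first
  - name: proxy
    image: example.com/proxy:latest      # hypothetical image
    lifecycle:
      postStart:
        exec:
          # the kubelet runs this hook after starting the container and does
          # not create the next container in the list until it returns, so
          # the hook holds the rest of the pod until the proxy reports ready
          command: [ "sh", "-c", "until wget -q -O /dev/null http://127.0.0.1:4191/ready; do sleep 1; done" ]
  # main container, created only after the hook above has returned
  - name: app
    image: example.com/app:latest        # hypothetical image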

What you expected to happen:

The network should be available as soon as the pod starts.

How to reproduce it:

  • Create a single-node AKS instance with Azure CNI (using its default settings) and Calico for Network Policy (although we deploy no NetworkPolicy resources for this repro); a minimal Azure CLI sketch is included after the manifest below.
  • Deploy a pod that attempts to hit the network, with a postStart lifecycle hook that blocks pod creation indefinitely:
apiVersion: v1
kind: Pod
metadata:
  name: curl
spec:
  containers:
  - image: curlimages/curl
    name: curl
    command: [ "sh", "-c", "--" ]
    args: [ "while true; do curl -k https://10.0.0.1; done;" ]
    lifecycle:
      postStart:
        exec:
          command: [ "sh", "-c", "--", "while true; do sleep 30; done;" ]
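
For reference, a cluster matching the first step above can be created roughly as follows; the resource group, cluster name, and region are placeholders:

az aks create \
  --resource-group <resource-group> \
  --name <cluster-name> \
  --location <region> \
  --node-count 1 \
  --network-plugin azure \
  --network-policy calico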

The pod remains in the ContainerCreating status as expected, but the curl command times out. This status prevents us from checking the pod's logs through kubectl logs, but we can read them directly on the node through kubectl debug:
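
One way to reach those logs is to open a debug shell on the node; a sketch, assuming a placeholder node name and containerd's usual per-pod log directory under /var/log/pods:

$ kubectl debug node/<node-name> -it --image=ubuntu
# chroot /host       # the node's root filesystem is mounted at /host in the debug pod
# cd /var/log/pods   # the kubelet keeps one <namespace>_<pod-name>_<uid> directory per pod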

# cat default_curl_e03159c5-8660-474c-81f3-fa7b139fbbd7/curl/0.log
2022-01-25T14:55:03.076566072Z stderr F   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
2022-01-25T14:55:03.076603573Z stderr F                                  Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:02:10 --:--:--     0
2022-01-25T14:57:13.30494582Z stderr F curl: (28) Failed to connect to 10.0.0.1 port 443 after 130228 ms: Operation timed out
2022-01-25T14:57:13.30920836Z stderr F   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
2022-01-25T14:57:13.309407062Z stderr F                                  Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:02:11 --:--:--     0
2022-01-25T14:59:24.376433615Z stderr F curl: (28) Failed to connect to 10.0.0.1 port 443 after 131067 ms: Operation timed out
2022-01-25T14:59:24.380944273Z stderr F   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
2022-01-25T14:59:24.380961573Z stderr F                                  Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:02:11 --:--:--     0
2022-01-25T15:01:35.448517926Z stderr F curl: (28) Failed to connect to 10.0.0.1 port 443 after 131067 ms: Operation timed out
...

If we remove the lifecycle snippet, then curl works as expected:

$ k logs curl
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   165  100   165    0     0    150      0  0:00:01  0:00:01 --:--:--   150
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {

  },
  "status": "Failure",
  "message": "Unauthorized",
  "reason": "Unauthorized",
  "code": 401
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
}
...

Anything else we need to know?:

If we try with Kubenet+Calico, or Azure CNI without Calico, there is no issue.
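
To confirm which combination a given cluster is running, the network profile can be inspected with the Azure CLI (resource group and cluster name are placeholders):

az aks show \
  --resource-group <resource-group> \
  --name <cluster-name> \
  --query "networkProfile.{plugin:networkPlugin, policy:networkPolicy}" \
  --output table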

Environment:

  • Kubernetes version: 1.21.7
  • Size of cluster: 1 node
@ghost ghost added the triage label Jan 25, 2022

ghost commented Jan 25, 2022

Hi alpeb, AKS bot here 👋
Thank you for posting on the AKS Repo, I'll do my best to get a kind human from the AKS team to assist you.

I might be just a bot, but I'm told my suggestions are normally quite good, as such:

  1. If this case is urgent, please open a Support Request so that our 24/7 support team may help you faster.
  2. Please abide by the AKS repo Guidelines and Code of Conduct.
  3. If you're having an issue, could it be described on the AKS Troubleshooting guides or AKS Diagnostics?
  4. Make sure you're subscribed to the AKS Release Notes to keep up to date with all that's new on AKS.
  5. Make sure there isn't a duplicate of this issue already reported. If there is, feel free to close this one and '+1' the existing issue.
  6. If you have a question, do take a look at our AKS FAQ. We place the most common ones there!


ghost commented Jan 27, 2022

Triage required from @Azure/aks-pm

@ghost ghost removed the triage label Jan 27, 2022

ghost commented Jan 27, 2022

@aanandr, @phealy would you be able to assist?

Issue Details

Author: alpeb
Assignees: -
Labels: networking/azcni, triage, action-required
Milestone: -

@ghost ghost removed the action-required label Jan 27, 2022
@ghost ghost added the action-required label Feb 22, 2022

dguerin commented Feb 23, 2022

We have the same problem in our new AKS setup and are very interested in a solution to this ASAP. We have disabled CNI in the meantime and worked around it.

@rnemeth90

We have the same problem in our new AKS setup and are very interested in a solution to this ASAP. We have disabled CNI in the meantime and worked around it.

Are your node pools running Windows or Linux? Docker or containerd?


rootik commented Feb 23, 2022

@rnemeth90 we are running AKS with containerd and Ubuntu.


ghost commented Feb 28, 2022

Action required from @Azure/aks-pm

@ghost ghost added the Needs Attention 👋 label Feb 28, 2022

ghost commented Mar 15, 2022

Issue needing attention of @Azure/aks-leads

2 similar comments

ghost commented Apr 30, 2022

Issue needing attention of @Azure/aks-leads


ghost commented May 15, 2022

Issue needing attention of @Azure/aks-leads


ghost commented May 30, 2022

Issue needing attention of @Azure/aks-leads

@carlospcastro

I also have a similar issue, but in a production cluster. It would be great to get a fix, or even a good workaround that we can also apply in a production cluster. I can't recreate the cluster to switch from CNI to Kubenet.
Thanks in advance.


ghost commented Jun 23, 2022

Issue needing attention of @Azure/aks-leads

4 similar comments

ghost commented Dec 22, 2022

Issue needing attention of @Azure/aks-leads

@ross-worth

Same issue


ghost commented Jan 11, 2023

Issue needing attention of @Azure/aks-leads

13 similar comments

@lieberlois

Is there still no fix for this?

@microsoft-github-policy-service microsoft-github-policy-service bot added the stale label Feb 2, 2024
Contributor

This issue has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs within 15 days of this comment.

Contributor

Issue needing attention of @Azure/aks-leads

@davesmits

Just spent 4 weeks with my team to figure out that we are hitting this. Disappointed.

Also, the lack of confirmation is not giving a great feeling about this.

@microsoft-github-policy-service microsoft-github-policy-service bot added the stale label May 7, 2024
Contributor

This issue has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs within 15 days of this comment.

Contributor

This issue will now be closed because it hasn't had any activity for 7 days after being marked stale. alpeb, feel free to comment again in the next 7 days to reopen, or open a new issue after that time if you still have a question/issue or suggestion.
