Windows containers behind a Kubernetes loadbalancer become unreachable #78

Closed
sebsoto opened this issue Dec 11, 2020 · 8 comments
Labels: bug, Networking, Windows on Kubernetes

sebsoto commented Dec 11, 2020

There is a regression when using Windows worker nodes with newer Windows kernel versions on an OpenShift 4.6.8 cluster with multiple Windows worker nodes. The cluster is configured with hybrid OVN networking.

This issue is present at least in Windows Server 2019 OS Builds 17763.1579 and 17763.1637.
This issue was not present in Windows Server 2019 OS Build 17763.1457.

The issue is that HTTP requests made through a load balancer, backed by a webserver deployment with 3 pods, do not always reach the webservers. This occurs only when the pods are running on separate Windows nodes. We are seeing this on both Azure and AWS. The logs in this issue are from an Azure cluster with Windows Server 2019 OS Build 17763.1637 worker nodes.
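
Because the failure only shows up when the pods land on separate Windows nodes, it can help to confirm the placement before testing. The following is a minimal sketch, assuming the default column layout of `oc get po -o wide` (node name in the seventh column); `count_nodes` is a hypothetical helper name, not something from the cluster tooling.

```shell
# Sketch only: count_nodes is a hypothetical helper.
# Reads `oc get po -o wide` output on stdin and prints how many distinct
# nodes host the listed pods. Assumes NODE is the 7th whitespace-separated
# column and that the first line is the header.
count_nodes() {
  awk 'NR > 1 && NF >= 7 { print $7 }' | sort -u | wc -l | tr -d ' '
}
```

It could be used as `oc get po -l app=win-webserver -o wide | count_nodes`; a result greater than 1 means the replicas are spread across nodes.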

Here is a deployment yaml which can be used to exercise this issue.

apiVersion: v1
kind: Service
metadata:
  name: win-webserver
  labels:
    app: win-webserver
spec:
  ports:
    # the port that this service should serve on
  - port: 80
    targetPort: 80
  selector:
    app: win-webserver
  type: LoadBalancer
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: win-webserver
  name: win-webserver
spec:
  selector:
    matchLabels:
      app: win-webserver
  replicas: 1
  template:
    metadata:
      labels:
        app: win-webserver
      name: win-webserver
    spec:
      tolerations:
      - key: "os"
        value: "Windows"
        effect: "NoSchedule"
      containers:
      - name: windowswebserver
        env:
        - name: NODENAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        image: mcr.microsoft.com/powershell:lts-nanoserver-1809
        imagePullPolicy: IfNotPresent
        command:
        - pwsh.exe
        - -command
        - $listener = New-Object System.Net.HttpListener; $listener.Prefixes.Add('http://*:80/'); $listener.Start();Write-Host('Listening at http://*:80/'); while ($listener.IsListening) { $context = $listener.GetContext(); $response = $context.Response; $content= "<html><body><H1>Windows Container Web Server, node:"+ $Env:NODENAME + "</H1></body></html>"; $buffer = [System.Text.Encoding]::UTF8.GetBytes($content); $response.ContentLength64 = $buffer.Length; $response.OutputStream.Write($buffer, 0, $buffer.Length); $response.Close(); };
        securityContext:
          windowsOptions:
            runAsUserName: "ContainerAdministrator"
      nodeSelector:
        beta.kubernetes.io/os: windows

After applying the YAML above to the cluster, I repeatedly curled the external IP of the load balancer using this script:

#!/bin/bash -i

for (( c=1; ; c++ ))
do
   echo " - attempt $c - infinite loops [ hit CTRL+C to stop]"
   date
   curl "$1"
done
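
In the failing run logged in this issue, each timed-out attempt blocks for roughly two minutes before curl gives up. The following is a minimal sketch of a bounded variant that caps each attempt and tallies which node answered. The helper names `probe_once` and `tally_nodes` and the timeout values are assumptions for illustration, not part of the original reproduction.

```shell
#!/bin/bash
# Sketch only: probe_once and tally_nodes are hypothetical helper names.

# Fetch the URL once, giving up quickly instead of waiting for the
# OS-level connect timeout. Prints the response body, or the single
# word TIMEOUT if curl fails for any reason.
probe_once() {
  curl --silent --connect-timeout 5 --max-time 10 "$1" || echo "TIMEOUT"
}

# Read responses on stdin (one per line) and print "<name> <count>" for
# each node name found after the "node:" marker that the test web server
# emits. Lines consisting of TIMEOUT are counted under that name.
tally_nodes() {
  sed -n -e 's/.*node:\([A-Za-z0-9-]*\).*/\1/p' -e 's/^TIMEOUT$/TIMEOUT/p' \
    | sort | uniq -c | awk '{print $2, $1}'
}
```

A run of 50 attempts could then look like `for i in $(seq 1 50); do probe_once 20.37.134.116; echo; done | tally_nodes`. The extra `echo` is there because the test web server does not end its response with a newline.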

With one replica, everything worked fine. When the deployment was scaled to three replicas, the webservers became unreachable after a few successes. Scaling back down to one replica fixed the issue.

[sebsoto@localhost openshift]$ oc get svc
NAME            TYPE           CLUSTER-IP      EXTERNAL-IP                            PORT(S)        AGE
win-webserver   LoadBalancer   172.30.93.181   20.37.134.116                          80:30640/TCP   35m

[sebsoto@localhost openshift]$ oc get po
NAME                             READY   STATUS    RESTARTS   AGE
win-webserver-64f5c875df-c5fnn   1/1     Running   0          6m16s

[sebsoto@localhost openshift]$ bash curl.sh 20.37.134.116
 - attempt 1 - infinite loops [ hit CTRL+C to stop]
Fri 11 Dec 2020 01:52:40 PM EST
<html><body><H1>Windows Container Web Server, node:winworker-m6ktt</H1></body></html> - attempt 2 - infinite loops [ hit CTRL+C to stop]
Fri 11 Dec 2020 01:52:40 PM EST
<html><body><H1>Windows Container Web Server, node:winworker-m6ktt</H1></body></html> - attempt 3 - infinite loops [ hit CTRL+C to stop]
Fri 11 Dec 2020 01:52:40 PM EST
<html><body><H1>Windows Container Web Server, node:winworker-m6ktt</H1></body></html> - attempt 4 - infinite loops [ hit CTRL+C to stop]
Fri 11 Dec 2020 01:52:40 PM EST
<html><body><H1>Windows Container Web Server, node:winworker-m6ktt</H1></body></html> - attempt 5 - infinite loops [ hit CTRL+C to stop]
Fri 11 Dec 2020 01:52:40 PM EST
<html><body><H1>Windows Container Web Server, node:winworker-m6ktt</H1></body></html> - attempt 6 - infinite loops [ hit CTRL+C to stop]
Fri 11 Dec 2020 01:52:40 PM EST
<html><body><H1>Windows Container Web Server, node:winworker-m6ktt</H1></body></html> - attempt 7 - infinite loops [ hit CTRL+C to stop]
Fri 11 Dec 2020 01:52:40 PM EST
<html><body><H1>Windows Container Web Server, node:winworker-m6ktt</H1></body></html> - attempt 8 - infinite loops [ hit CTRL+C to stop]
Fri 11 Dec 2020 01:52:40 PM EST
<html><body><H1>Windows Container Web Server, node:winworker-m6ktt</H1></body></html> - attempt 9 - infinite loops [ hit CTRL+C to stop]
Fri 11 Dec 2020 01:52:41 PM EST
<html><body><H1>Windows Container Web Server, node:winworker-m6ktt</H1></body></html> - attempt 10 - infinite loops [ hit CTRL+C to stop]
Fri 11 Dec 2020 01:52:41 PM EST
<html><body><H1>Windows Container Web Server, node:winworker-m6ktt</H1></body></html> - attempt 11 - infinite loops [ hit CTRL+C to stop]
Fri 11 Dec 2020 01:52:41 PM EST
<html><body><H1>Windows Container Web Server, node:winworker-m6ktt</H1></body></html> - attempt 12 - infinite loops [ hit CTRL+C to stop]
Fri 11 Dec 2020 01:52:41 PM EST
<html><body><H1>Windows Container Web Server, node:winworker-m6ktt</H1></body></html> - attempt 13 - infinite loops [ hit CTRL+C to stop]
Fri 11 Dec 2020 01:52:41 PM EST
<html><body><H1>Windows Container Web Server, node:winworker-m6ktt</H1></body></html> - attempt 14 - infinite loops [ hit CTRL+C to stop]
Fri 11 Dec 2020 01:52:41 PM EST
<html><body><H1>Windows Container Web Server, node:winworker-m6ktt</H1></body></html> - attempt 15 - infinite loops [ hit CTRL+C to stop]
Fri 11 Dec 2020 01:52:41 PM EST
<html><body><H1>Windows Container Web Server, node:winworker-m6ktt</H1></body></html> - attempt 16 - infinite loops [ hit CTRL+C to stop]
Fri 11 Dec 2020 01:52:41 PM EST
<html><body><H1>Windows Container Web Server, node:winworker-m6ktt</H1></body></html> - attempt 17 - infinite loops [ hit CTRL+C to stop]
Fri 11 Dec 2020 01:52:42 PM EST
^C

[sebsoto@localhost openshift]$ oc scale deploy/win-webserver --replicas 3
deployment.apps/win-webserver scaled

[sebsoto@localhost openshift]$ oc get po
NAME                             READY   STATUS    RESTARTS   AGE
win-webserver-64f5c875df-bvgq2   1/1     Running   0          8s
win-webserver-64f5c875df-c5fnn   1/1     Running   0          6m57s
win-webserver-64f5c875df-rnbhz   1/1     Running   0          8s

[sebsoto@localhost openshift]$ bash curl.sh 20.37.134.116
 - attempt 1 - infinite loops [ hit CTRL+C to stop]
Fri 11 Dec 2020 01:53:20 PM EST
<html><body><H1>Windows Container Web Server, node:winworker-2wgm6</H1></body></html> - attempt 2 - infinite loops [ hit CTRL+C to stop]
Fri 11 Dec 2020 01:53:24 PM EST
<html><body><H1>Windows Container Web Server, node:winworker-mzfw8</H1></body></html> - attempt 3 - infinite loops [ hit CTRL+C to stop]
Fri 11 Dec 2020 01:53:24 PM EST
<html><body><H1>Windows Container Web Server, node:winworker-m6ktt</H1></body></html> - attempt 4 - infinite loops [ hit CTRL+C to stop]
Fri 11 Dec 2020 01:53:24 PM EST
<html><body><H1>Windows Container Web Server, node:winworker-2wgm6</H1></body></html> - attempt 5 - infinite loops [ hit CTRL+C to stop]
Fri 11 Dec 2020 01:53:31 PM EST
<html><body><H1>Windows Container Web Server, node:winworker-2wgm6</H1></body></html> - attempt 6 - infinite loops [ hit CTRL+C to stop]
Fri 11 Dec 2020 01:53:38 PM EST
curl: (28) Failed to connect to 20.37.134.116 port 80: Connection timed out
 - attempt 7 - infinite loops [ hit CTRL+C to stop]
Fri 11 Dec 2020 01:55:48 PM EST
curl: (28) Failed to connect to 20.37.134.116 port 80: Connection timed out
 - attempt 8 - infinite loops [ hit CTRL+C to stop]
Fri 11 Dec 2020 01:57:59 PM EST
curl: (28) Failed to connect to 20.37.134.116 port 80: Connection timed out
 - attempt 9 - infinite loops [ hit CTRL+C to stop]
Fri 11 Dec 2020 02:00:10 PM EST
^C

[sebsoto@localhost openshift]$ oc scale deploy/win-webserver --replicas 1
deployment.apps/win-webserver scaled

[sebsoto@localhost openshift]$ oc get po 
NAME                             READY   STATUS    RESTARTS   AGE
win-webserver-64f5c875df-c5fnn   1/1     Running   0          14m

[sebsoto@localhost openshift]$ bash curl.sh 20.37.134.116
 - attempt 1 - infinite loops [ hit CTRL+C to stop]
Fri 11 Dec 2020 02:00:33 PM EST
<html><body><H1>Windows Container Web Server, node:winworker-m6ktt</H1></body></html> - attempt 2 - infinite loops [ hit CTRL+C to stop]
Fri 11 Dec 2020 02:00:33 PM EST
<html><body><H1>Windows Container Web Server, node:winworker-m6ktt</H1></body></html> - attempt 3 - infinite loops [ hit CTRL+C to stop]
Fri 11 Dec 2020 02:00:34 PM EST
<html><body><H1>Windows Container Web Server, node:winworker-m6ktt</H1></body></html> - attempt 4 - infinite loops [ hit CTRL+C to stop]
Fri 11 Dec 2020 02:00:34 PM EST
<html><body><H1>Windows Container Web Server, node:winworker-m6ktt</H1></body></html> - attempt 5 - infinite loops [ hit CTRL+C to stop]
Fri 11 Dec 2020 02:00:34 PM EST
<html><body><H1>Windows Container Web Server, node:winworker-m6ktt</H1></body></html> - attempt 6 - infinite loops [ hit CTRL+C to stop]
Fri 11 Dec 2020 02:00:34 PM EST
<html><body><H1>Windows Container Web Server, node:winworker-m6ktt</H1></body></html> - attempt 7 - infinite loops [ hit CTRL+C to stop]
Fri 11 Dec 2020 02:00:34 PM EST
<html><body><H1>Windows Container Web Server, node:winworker-m6ktt</H1></body></html> - attempt 8 - infinite loops [ hit CTRL+C to stop]
Fri 11 Dec 2020 02:00:34 PM EST
<html><body><H1>Windows Container Web Server, node:winworker-m6ktt</H1></body></html> - attempt 9 - infinite loops [ hit CTRL+C to stop]
Fri 11 Dec 2020 02:00:34 PM EST
<html><body><H1>Windows Container Web Server, node:winworker-m6ktt</H1></body></html> - attempt 10 - infinite loops [ hit CTRL+C to stop]
Fri 11 Dec 2020 02:00:34 PM EST
<html><body><H1>Windows Container Web Server, node:winworker-m6ktt</H1></body></html> - attempt 11 - infinite loops [ hit CTRL+C to stop]
Fri 11 Dec 2020 02:00:35 PM EST
<html><body><H1>Windows Container Web Server, node:winworker-m6ktt</H1></body></html> - attempt 12 - infinite loops [ hit CTRL+C to stop]
Fri 11 Dec 2020 02:00:35 PM EST
<html><body><H1>Windows Container Web Server, node:winworker-m6ktt</H1></body></html> - attempt 13 - infinite loops [ hit CTRL+C to stop]
Fri 11 Dec 2020 02:00:35 PM EST

ghost added the triage label Dec 11, 2020
vrapolinario added the bug, Networking, and Windows on Kubernetes labels and removed the triage label Dec 11, 2020

@jsturtevant

Using AKS Engine with Kubernetes v1.19.3, I was not able to reproduce this:

Windows PowerShell
Copyright (C) Microsoft Corporation. All rights reserved.

PS C:\Users\azureuser> cmd /c ver

Microsoft Windows [Version 10.0.17763.1637] 
PS C:\Users\azureuser>  
kgp -o wide                                                          
NAME                             READY   STATUS    RESTARTS   AGE     IP             NODE              NOMINATED NODE   READINESS GATES
win-webserver-7b8599fb4c-66cxp   1/1     Running   0          2m27s   10.240.0.152   2278k8s00000002   <none>           <none>
win-webserver-7b8599fb4c-676sl   1/1     Running   0          2m27s   10.240.0.117   2278k8s00000001   <none>           <none>
win-webserver-7b8599fb4c-hrncz   1/1     Running   0          3m56s   10.240.0.41    2278k8s00000000   <none>           <none>
k get svc                                             
NAME            TYPE           CLUSTER-IP     EXTERNAL-IP     PORT(S)        AGE
kubernetes      ClusterIP      10.0.0.1       <none>          443/TCP        35m
win-webserver   LoadBalancer   10.0.156.121   20.190.1.45     80:31048/TCP   5m24s
#!/bin/bash -i                               

for (( c=1; ; c++ ))
do
   echo " - attempt $c - infinite loops [ hit CTRL+C to stop]"
   date
   curl 20.190.1.45
done

 - attempt 1 - infinite loops [ hit CTRL+C to stop]
Mon 21 Dec 2020 11:58:39 AM PST
<html><body><H1>Windows Container Web Server, node:2278k8s00000001</H1></body></html> - attempt 2 - infinite loops [ hit CTRL+C to stop]
Mon 21 Dec 2020 11:58:39 AM PST
<html><body><H1>Windows Container Web Server, node:2278k8s00000002</H1></body></html> - attempt 3 - infinite loops [ hit CTRL+C to stop]
Mon 21 Dec 2020 11:58:39 AM PST
<html><body><H1>Windows Container Web Server, node:2278k8s00000000</H1></body></html> - attempt 4 - infinite loops [ hit CTRL+C to stop]
Mon 21 Dec 2020 11:58:39 AM PST
<html><body><H1>Windows Container Web Server, node:2278k8s00000001</H1></body></html> - attempt 5 - infinite loops [ hit CTRL+C to stop]
Mon 21 Dec 2020 11:58:39 AM PST
<html><body><H1>Windows Container Web Server, node:2278k8s00000001</H1></body></html> - attempt 6 - infinite loops [ hit CTRL+C to stop]
Mon 21 Dec 2020 11:58:39 AM PST
<html><body><H1>Windows Container Web Server, node:2278k8s00000001</H1></body></html> - attempt 7 - infinite loops [ hit CTRL+C to stop]
Mon 21 Dec 2020 11:58:40 AM PST

.....

<html><body><H1>Windows Container Web Server, node:2278k8s00000000</H1></body></html> - attempt 357 - infinite loops [ hit CTRL+C to stop]
Mon 21 Dec 2020 11:58:57 AM PST
<html><body><H1>Windows Container Web Server, node:2278k8s00000001</H1></body></html> - attempt 358 - infinite loops [ hit CTRL+C to stop]
Mon 21 Dec 2020 11:58:57 AM PST
<html><body><H1>Windows Container Web Server, node:2278k8s00000002</H1></body></html> - attempt 359 - infinite loops [ hit CTRL+C to stop]
Mon 21 Dec 2020 11:58:57 AM PST
<html><body><H1>Windows Container Web Server, node:2278k8s00000001</H1></body></html> - attempt 360 - infinite loops [ hit CTRL+C to stop]
Mon 21 Dec 2020 11:58:57 AM PST
<html><body><H1>Windows Container Web Server, node:2278k8s00000000</H1></body></html> - attempt 361 - infinite loops [ hit CTRL+C to stop]
Mon 21 Dec 2020 11:58:57 AM PST
<html><body><H1>Windows Container Web Server, node:2278k8s00000002</H1></body></html> - attempt 362 - infinite loops [ hit CTRL+C to stop]
Mon 21 Dec 2020 11:58:57 AM PST
<html><body><H1>Windows Container Web Server, node:2278k8s00000001</H1></body></html> - attempt 363 - infinite loops [ hit CTRL+C to stop]
Mon 21 Dec 2020 11:58:57 AM PST

@jsturtevant

There was a change that we detected in aks-engine: Azure/aks-engine#3956 (comment) that caused service calls to fail.

The fix was to re-order the calls: Azure/aks-engine#3956 (comment) done in Azure/aks-engine#4002

I'm not sure whether this is related, but it seems fairly suspect.

ghost commented Jan 21, 2021

This issue has been open for 30 days with no updates.
@mkostersitz, @immuzz, please provide an update or close this issue.

@mkostersitz

This is still under active investigation.

ghost commented Feb 21, 2021

This issue has been open for 30 days with no updates.
@mkostersitz, @immuzz, please provide an update or close this issue.

sebsoto commented Mar 18, 2021

This is still occurring on OCP 4.6 (Kubernetes 1.19). On OCP 4.7 (Kubernetes 1.20), we are not able to reproduce this issue because of a new issue, #103.

sebsoto commented Mar 30, 2021

This was fixed by kubernetes/kubernetes#96499

sebsoto closed this as completed Mar 30, 2021

filipe-paredes commented Apr 30, 2021

Until people move to Kubernetes 1.20 (assuming the problem is fixed in that version), you can work around this by bypassing the Azure Load Balancer for your HTTP calls.
To do so, update the "coredns-custom" ConfigMap so that CoreDNS rewrites DNS queries for the external hostname to the target ingress controller's service.

Example
Azure Load Balancer IP: 10.0.16.10
Ingress Controller Namespace: default
Ingress Controller External IP: 10.0.16.10 (the Azure Load Balancer one of course)
Ingress Controller Cluster IP: 10.0.16.20
Ingress Controller Service Name: my-ingress-controller
Ingress Controller Service DNS record: my-ingress-controller.default.svc.cluster.local resolves to 10.0.16.20
DNS: my.app.org resolves to 10.0.16.10 (the Azure Load Balancer)

Calls from any container will translate the name "my.app.org" to 10.0.16.10 and the request will be routed through the Azure Load Balancer and back inside the AKS cluster.
But if we resolve the hostname to the Ingress Controller Cluster IP instead, the request will go directly through the Ingress Controller and this intermittent problem will not happen.

To do so, you just need to add the following into the "coredns-custom" configmap:

windowsfix.server: |
  my.app.org:53 {
    errors
    cache 30
    rewrite name my.app.org my-ingress-controller.default.svc.cluster.local
    forward .  /etc/resolv.conf
  }
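
The fragment above hard-codes one hostname and one service. The following is a minimal sketch of a helper that prints the same server block for other values. `render_rewrite` and its three arguments are hypothetical, and the output is only the data entry shown above, not a complete ConfigMap.

```shell
# Sketch only: render_rewrite is a hypothetical helper.
# Usage: render_rewrite <external-hostname> <service-name> <namespace>
# Prints a "coredns-custom" data entry that rewrites the external
# hostname to the in-cluster DNS name of the given service.
render_rewrite() {
  local host="$1" svc="$2" ns="$3"
  cat <<EOF
windowsfix.server: |
  ${host}:53 {
    errors
    cache 30
    rewrite name ${host} ${svc}.${ns}.svc.cluster.local
    forward . /etc/resolv.conf
  }
EOF
}
```

Calling `render_rewrite my.app.org my-ingress-controller default` reproduces the entry from the example. After the ConfigMap is applied, resolving `my.app.org` from inside a pod should return the Cluster IP (10.0.16.20 in the example) rather than the load balancer IP.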

jrvaldes added a commit to jrvaldes/windows-machine-config-operator that referenced this issue Jul 30, 2021
Removed 2019-Datacenter-with-Containers SKU fixed version

Removed verbiage regarding the issue with Windows containers behind
a Kubernetes load balancer becoming unreachable, since no longer applicable.
See microsoft/Windows-Containers#78

Added sections for sample machineSet parameters and object
jrvaldes added a commit to jrvaldes/windows-machine-config-operator that referenced this issue Jul 30, 2021
Added sections for sample machineSet parameters and object.

Re-arranged parameters and added command to get the latest
compatible image for a given region.

Removed verbiage regarding the issue Windows containers behind a Kubernetes
load balancer becoming unreachable issue, since no longer applicable.
See microsoft/Windows-Containers#78
jrvaldes added a commit to jrvaldes/windows-machine-config-operator that referenced this issue Aug 4, 2021
Removed 2019-Datacenter-with-Containers SKU fixed version.

Removed verbiage regarding the issue with Windows containers behind
a Kubernetes load balancer becoming unreachable, since no longer applicable.
See microsoft/Windows-Containers#78

Added sections for sample machineSet parameters and object.
jrvaldes added a commit to jrvaldes/windows-machine-config-operator that referenced this issue Aug 4, 2021
Added sections for sample machineSet parameters and object.

Re-arranged parameters and added command to get the latest
compatible image for a given region.

Removed verbiage regarding the issue Windows containers behind a Kubernetes
load balancer becoming unreachable issue, since no longer applicable.
See microsoft/Windows-Containers#78