Existing AKS Cluster Suddenly not Resolving DNS #1320

Closed
GuyPaddock opened this issue Nov 18, 2019 · 54 comments

@GuyPaddock

GuyPaddock commented Nov 18, 2019

What happened:
We've had an AKS cluster deployed since February 2019 that had been stable until tonight. As of 11:30 PM ET on 2019-11-17, it seems as though all DNS requests -- both for hosts inside the cluster (e.g. redis.myapp-dev) and for hosts outside the cluster (e.g. myapp.mysql.database.azure.com) -- have stopped resolving.

If I SSH into a node in the cluster, DNS queries to outside hostnames like google.com will resolve.

Here's what I've tried so far:

  • I've tried deleting the coredns pod so that it would respawn, but that did not resolve the issue.
  • I've tried following all the steps in the Debugging DNS Resolution article of the Kubernetes docs, and here's what I can see:
    • nslookup kubernetes.default comes back with:
      nslookup: can't resolve '(null)': Name does not resolve
      
      
      nslookup: can't resolve 'kubernetes.default': Try again
      
    • /etc/resolv.conf looks correct:
      nameserver 10.0.0.10
      search myapp-dev.svc.cluster.local svc.cluster.local cluster.local reddog.microsoft.com
      options ndots:5
      
    • CoreDNS is running:
      $ kubectl get pods --namespace=kube-system -l k8s-app=kube-dns
      NAME                       READY   STATUS    RESTARTS   AGE
      coredns-866fc6b6c8-8t8md   1/1     Running   0          54m
      
    • There are two warnings in the CoreDNS log, but no errors:
      for p in $(kubectl get pods --namespace=kube-system -l k8s-app=kube-dns -o name); do kubectl logs --namespace=kube-system $p; done
      [WARNING] No files matching import glob pattern: custom/*.override
      [WARNING] No files matching import glob pattern: custom/*.server
      .:53
      2019-11-18T05:55:44.413Z [INFO] CoreDNS-1.2.6
      2019-11-18T05:55:44.413Z [INFO] linux/amd64, go1.11.2, 756749c
      CoreDNS-1.2.6
      linux/amd64, go1.11.2, 756749c
       [INFO] plugin/reload: Running configuration MD5 = d8c69602fc5a3428908dc8f34f9aae58
      
    • The CoreDNS service is up:
      $ kubectl get svc --namespace=kube-system
      NAME                                        TYPE           CLUSTER-IP     EXTERNAL-IP    PORT(S)                      AGE
      kube-dns                                    ClusterIP      10.0.0.10      <none>         53/UDP,53/TCP                293d
      
    • DNS endpoints are exposed:
      $ kubectl get ep kube-dns --namespace=kube-system
      NAME       ENDPOINTS                     AGE
      kube-dns   10.1.1.125:53,10.1.1.125:53   293d
      

Because of the outage, the AKS dashboard for this cluster is also down.

What you expected to happen:

  • DNS requests for kubernetes.default should resolve inside pods.
  • Cluster-local DNS requests (e.g. redis.myapp-dev) should resolve inside pods.
  • Services hosted on Azure outside the cluster (e.g. myapp.mysql.database.azure.com) should resolve inside pods.

How to reproduce it (as minimally and precisely as possible):

  • kubectl exec -it name-of-a-pod-in-cluster -- sh
  • nslookup kubernetes.default
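
A disposable test pod also reproduces it without having to pick an existing pod (a minimal sketch; pod names are arbitrary, and 168.63.129.16 is Azure's virtual DNS IP, used here only to compare a lookup that bypasses CoreDNS):

# Lookup through CoreDNS from a throwaway pod
kubectl run -it --rm dns-test --image=busybox:1.28 --restart=Never -- nslookup kubernetes.default

# Same lookup against Azure DNS directly, bypassing CoreDNS, for comparison
kubectl run -it --rm dns-test-ext --image=busybox:1.28 --restart=Never -- nslookup google.com 168.63.129.16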

Anything else we need to know?:

  • All the nodes in this cluster were upgraded to Kubernetes 1.13.12 several days ago.
  • The cluster is running Kured to ensure that nodes restart periodically for security updates.
  • All Ubuntu updates were installed on the nodes in the cluster around 5 PM ET today:
    Unpacking bsdutils (1:2.27.1-6ubuntu3.9) over (1:2.27.1-6ubuntu3.8) ...
    Unpacking util-linux (2.27.1-6ubuntu3.9) over (2.27.1-6ubuntu3.8) ...
    Unpacking mount (2.27.1-6ubuntu3.9) over (2.27.1-6ubuntu3.8) ...
    Unpacking uuid-runtime (2.27.1-6ubuntu3.9) over (2.27.1-6ubuntu3.8) ...
    Unpacking grub-pc (2.02~beta2-36ubuntu3.23) over (2.02~beta2-36ubuntu3.22) ...
    Unpacking grub-pc-bin (2.02~beta2-36ubuntu3.23) over (2.02~beta2-36ubuntu3.22) ...
    Unpacking grub2-common (2.02~beta2-36ubuntu3.23) over (2.02~beta2-36ubuntu3.22) ...
    Unpacking grub-common (2.02~beta2-36ubuntu3.23) over (2.02~beta2-36ubuntu3.22) ...
    Unpacking libuuid1:amd64 (2.27.1-6ubuntu3.9) over (2.27.1-6ubuntu3.8) ...
    Unpacking libblkid1:amd64 (2.27.1-6ubuntu3.9) over (2.27.1-6ubuntu3.8) ...
    Unpacking libfdisk1:amd64 (2.27.1-6ubuntu3.9) over (2.27.1-6ubuntu3.8) ...
    Unpacking libmount1:amd64 (2.27.1-6ubuntu3.9) over (2.27.1-6ubuntu3.8) ...
    Unpacking libsmartcols1:amd64 (2.27.1-6ubuntu3.9) over (2.27.1-6ubuntu3.8) ...
    Unpacking initramfs-tools (0.122ubuntu8.16) over (0.122ubuntu8.15) ...
    Unpacking initramfs-tools-core (0.122ubuntu8.16) over (0.122ubuntu8.15) ...
    Unpacking initramfs-tools-bin (0.122ubuntu8.16) over (0.122ubuntu8.15) ...
    Unpacking moby-cli (3.0.8) over (3.0.7) ...
    Unpacking moby-engine (3.0.8) over (3.0.7) ...
    Unpacking unattended-upgrades (1.1ubuntu1.18.04.7~16.04.4) over (1.1ubuntu1.18.04.7~16.04.3) ...  
    
  • The nodes were cordoned and rebooted one-by-one after updates. Everything on the cluster was healthy afterwards.
  • We have deployed the same application about 15-20 times today to the cluster, and all deployments were without issue except the last one, which failed to come up because the application could not resolve DNS for its database. That was the first unhealthy application we noticed on the cluster; afterwards, the remaining applications on the cluster seemed to encounter the same issue and become unhealthy. The application we deployed is a PHP web application that lives in its own namespace on the cluster and should not affect DNS at all.

Environment:

  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.7", GitCommit:"8fca2ec50a6133511b771a11559e24191b1aa2b4", GitTreeState:"clean", BuildDate:"2019-09-18T14:47:22Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"windows/amd64"}
Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.12", GitCommit:"a8b52209ee172232b6db7a6e0ce2adc77458829f", GitTreeState:"clean", BuildDate:"2019-10-15T12:04:30Z", GoVersion:"go1.11.13", Compiler:"gc", Platform:"linux/amd64"}
  • Size of cluster (how many worker nodes are in the cluster?): 3
  • General description of workloads in the cluster: Perl, PHP, and Go web applications and micro-services
@GuyPaddock
Author

GuyPaddock commented Nov 18, 2019

Azure support had me take the following actions:

Action 1:
Please scale the number of coredns pods to at least two and delete all existing coredns pods so that k8s can deploy new ones for you.

kubectl scale --replicas=2 deployment/coredns --namespace=kube-system

Action 2:
If coredns-autoscaler was not knowingly disabled, then kindly re-enable it using the command below:

kubectl scale deployment --replicas=1 coredns-autoscaler --namespace=kube-system

Not sure how this would restore DNS to a functional state, but I tried it. It did not have the desired effect. DNS inside the pods still wasn't working.

As a last resort, I drained, rebooted, and un-cordoned each node in our three-node cluster one-by-one. After the last node came up and was un-cordoned, DNS seems to have come back up with it!
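
For reference, this is roughly the sequence I used per node (names are placeholders; az vm restart applies to availability-set node pools, while VMSS-based pools would use az vmss restart instead):

# Repeat for each node, one at a time
kubectl drain aks-nodepool1-12345678-0 --ignore-daemonsets --delete-local-data

# Restart the underlying VM in the cluster's MC_ resource group
az vm restart --resource-group MC_myResourceGroup_myAKSCluster_eastus --name aks-nodepool1-12345678-0

# Allow workloads to schedule back onto the node
kubectl uncordon aks-nodepool1-12345678-0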

Although I am happy that a reboot of the nodes has mitigated this issue, I am a bit uneasy about the situation because I don't understand what caused it. Hence, I don't know if/when it will happen again...

@GuyPaddock
Author

I did come across a post of another user on another provider having similar issues:
https://www.digitalocean.com/community/questions/kubernetes-coredns-not-working-for-external-address

They, too, fixed the issue by rebooting nodes but for them it was only a temporary fix. I am hoping that the issue does not become recurring.

@AlenversFr

AlenversFr commented Nov 19, 2019

Hi,
We experienced exactly the same DNS problem in October 2019 on a production cluster with 5 nodes.
The Kubernetes version was 1.11.x, which was still supported at the time.

Our reboot strategy was not perfect, because all 5 nodes had to reboot the same night.
In the morning, the client called us saying "my pods cannot connect to the database".
{Edited: removed the part about some recent nodes in the cluster still being OK; that was not the case}

We opened a case with Microsoft support. After 7 hours of investigation with exactly the same kind of tests, we chose to perform a rolling update on the nodes in order to get 5 brand-new ones.

The problem was solved after regenerating the nodes.
None of the actions on resolv.conf or kube-dns made any difference.

It's a very tricky situation, because all the pods in kube-system seem to be fully operational with no errors. We're still working on an alerting rule to detect this.

At this point, we think the system upgrade applied after rebooting a node can cause a low-level failure and an incompatibility with kubelet or whatever cloud template is used.
Support always tells us to keep AKS up to date, meaning systematic rolling updates to each minor version. See #1303; that feature would be key to achieving this.

@jnoller
Contributor

jnoller commented Nov 19, 2019

Please see this comment thread: #1326 (comment). Given that the managed pods are fully functional, I suspect the latency spikes are due to IOPS quota throttling on the OS disk. I am working on full guidance for this.

@mleneveut

We just experienced the same kind of issue.

Our Datadog monitoring was not working any more, because the metrics-server was not able to find the API server.

metrics-server (AKS native pod) : 
Error: Get https://xxx.hcp.northeurope.azmk8s.io:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication: dial tcp: lookup xxx.hcp.northeurope.azmk8s.io on 100.66.204.10:53: read udp 100.66.200.117:50378->100.66.204.10:53: read: connection refused

panic: Get https://xxx.hcp.northeurope.azmk8s.io:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication: dial tcp: lookup xxx.hcp.northeurope.azmk8s.io on 100.66.204.10:53: read udp 100.66.200.117:50378->100.66.204.10:53: read: connection refused

goroutine 1 [running]:
main.main()
        /go/src/github.com/kubernetes-incubator/metrics-server/cmd/metrics-server/metrics-server.go:39 +0x13b
datadog-kube-state-metrics :
F1119 08:24:04.538515       1 main.go:148] Failed to create client: error while trying to communicate with apiserver: Get https://xxx.hcp.northeurope.azmk8s.io:443/version?timeout=32s: dial tcp: lookup xxx.hcp.northeurope.azmk8s.io on 100.66.204.10:53: read udp 100.66.200.151:54701->100.66.204.10:53: read: connection refused

We upgraded to 1.15.5 (North Europe) on 8th November, and it was working until recently (16th November, judging by the gap in the Datadog data). We started to see the problem on Monday 18th November. I think there were Azure issues on Monday and Tuesday; I'm not sure if that is linked.

What we noticed is that all the kube-system services had no endpoints:

kubectl get ep -n kube-system
NAME                      ENDPOINTS   AGE
kube-controller-manager   <none>      189d
kube-dns                              189d
kube-scheduler            <none>      189d
kubernetes-dashboard                  189d
metrics-server                        189d
tiller-deploy                         189d

The coredns pods were Running, with no error logs:

kubectl get pods -n kube-system --selector k8s-app=kube-dns
NAME                       READY   STATUS    RESTARTS   AGE
coredns-596b664bb4-bsg7d   1/1     Running   0          11d
coredns-596b664bb4-w2dcl   1/1     Running   0          11d

But looking at the deployments, there was a problem:

kubectl get deploy -n kube-system
NAME                   READY   UP-TO-DATE   AVAILABLE   AGE
coredns                0/2     2            0           77d
coredns-autoscaler     0/1     1            0           77d
kubernetes-dashboard   0/1     1            0           189d
metrics-server         0/1     1            0           189d
tiller-deploy          0/1     1            0           189d
tunnelfront            0/1     1            0           189d

After deleting all the kube-system pods, they came back alive and the problem was solved (took us 2 days to resolve this...)

We were not being IOPS throttled: [IOPS metrics screenshot attached]

@jnoller
Contributor

jnoller commented Nov 20, 2019

@mleneveut please see this comment from this morning: #1326 (comment)

You need to look at the IO queue depth at the VM worker level.

@mleneveut

@jnoller Our max disk queue depth on this cluster is 1.65 for the last 7 days.


@AlenversFr

hello,
I cannot confirm it 100%, but I don't think we had IOPS throttling in our case.

@guitmz

guitmz commented Dec 4, 2019

Is there a way we can run coredns in AKS as a DaemonSet to mitigate this (or at least try)?

@param3sh

Hi, I am also facing the same issue with version 1.14.8 and with the default version 1.13.12. Is there any solution for this?
Can someone tell me how to reboot the nodes? Do we need to SSH to them and reboot, and if so, how do we SSH to the nodes?

Thanks in Advance.
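
For reference, the nodes can be rebooted without SSH by draining them and restarting their VMs from the Azure CLI (see the drain/restart/uncordon sketch earlier in this thread). If a shell on a node is still needed, a jump pod is the usual route on AKS; a minimal sketch, assuming the cluster was created with your SSH public key and using placeholder names/IPs:

# Find the internal IP of the node you want to reach
kubectl get nodes -o wide

# Start a throwaway pod inside the cluster network and open a shell
kubectl run -it --rm aks-ssh --image=debian --restart=Never -- bash

# Inside the pod: install an SSH client, copy your private key in (e.g. via kubectl cp), then connect
apt-get update && apt-get install -y openssh-client
ssh azureuser@10.240.0.4    # placeholder internal IP; azureuser is the default AKS node admin user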

@bhicks329

I'm seeing this as well at the moment. Random problems with DNS suddenly not resolving inside and outside the cluster.

@guitmz

guitmz commented Dec 12, 2019

By the way, this issue seems to only be present if you are using azure-cni. I've recreated my cluster without it (using only kubenet) and it's working well now (got this hint from #667).

@param3sh

Hi @guitmz, I have been using kubenet only, but I still have the issue.

@TomsonOne

Same problem here... and also using kubenet...

nslookup on hostnames returns an error with Kubernetes 1.14.8.

@AlenversFr

Same here, we use kubenet.

@guitmz

guitmz commented Dec 13, 2019

Interesting. No idea why it's working for me; I will take a closer look.

@brudnyhenry

brudnyhenry commented Dec 16, 2019

Hi,
In my case, the situation is now that DNS resolution works only in pods scheduled on the same node where the CoreDNS pods are running (I have two nodes in the AKS cluster).
As you can see, both coredns pods are running on the same node:

kgp -n kube-system  -o wide                  
NAME                                    READY   STATUS    RESTARTS   AGE     IP             NODE                         NOMINATED NODE   READINESS GATES
coredns-7fc597cc45-8gdnc                1/1     Running   0          106s    10.244.2.28    aks-computingpl-37691802-2   <none>           <none>
coredns-7fc597cc45-zw4cd                1/1     Running   0          104s    10.244.2.29    aks-computingpl-37691802-2   <none>           <none>

Shouldn't the CoreDNS deployment be distributed across all nodes?
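
(For reference, spreading replicas across nodes is normally done with pod anti-affinity on the Deployment. A minimal sketch of such a patch is below; it is illustrative only, since the coredns deployment is managed by the AKS addon manager and manual edits may be reverted.)

# Illustrative only: AKS manages this deployment and may revert the change
kubectl -n kube-system patch deployment coredns --type=merge -p '
spec:
  template:
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              topologyKey: kubernetes.io/hostname
              labelSelector:
                matchLabels:
                  k8s-app: kube-dns'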

@guitmz

guitmz commented Dec 16, 2019

@brudnyhenry That would be a better implementation IMO, but there is no way to do it from our side.

Since I changed my cluster to kubenet I had not been seeing this issue anymore, but an hour ago it suddenly started happening again.

The funny thing is that even the coredns-autoscaler pod is having DNS issues itself, so I'm not even sure autoscaling of CoreDNS is working at all:

I1206 13:19:55.864375 1 autoscaler_server.go:133] ConfigMap not found: Get https://my-cluster.hcp.westeurope.azmk8s.io:443/api/v1/namespaces/kube-system/configmaps/coredns-autoscaler: read tcp 10.244.0.5:58664->52.136.224.228:443: read: connection timed out, will create one with default params

AKS does not seem to be production ready by any means.

@jnoller
Contributor

jnoller commented Dec 16, 2019

The issue is still Azure Disk throttling of your OS disks. AKS defaults to 100 GB OS disks for all clusters; you can increase this via ARM, but the maximum is 2 TB. This means that docker IO, logging, metrics, monitoring, the kernel, etc. all share the single OS disk IO path.

As Azure uses network-attached storage for OS disks, these have a hard quota on both bandwidth and file operations per second (IOPS). Small-file write patterns such as docker's will exhaust the quota of the OS disk, at which point the storage system throttles at the IO and cache level, leading to high VM latency, networking failures, DNS failures, etc.

This is a common IaaS sizing/mismatch issue, and we are working as we speak on fleet-wide mitigation and analysis of this issue to provide full guidance.

Until then, you can test offloading the Docker/container IO from your OS disk using this utility: https://github.com/juan-lee/knode

# Install knode with defaults
curl -L https://github.com/juan-lee/knode/releases/download/v0.1.2/knode-default.yaml | kubectl apply -f -
kubectl rollout status daemonset -n knode-system knode-daemon

# Update knode to move /var/lib/docker to /mnt/docker
curl -L https://github.com/juan-lee/knode/releases/download/v0.1.2/knode-tmpdir.yaml | kubectl apply -f -
kubectl rollout status daemonset -n knode-system knode-daemon

This moves the docker IO to the VM's temp disk - this means that the docker data dir becomes ephemeral and its size changes (see temp disk GiB here: https://docs.microsoft.com/en-us/azure/virtual-machines/windows/sizes-storage). Deployment and rebooting of the nodes can take some time.

You will need to fix any tools like Splunk, etc. to point to the new path. This is not a 100% fix - high logging levels, security tools, etc. can also trip the IOPS throttle. However, moving the docker IO has consistently shown better runtime stability in the systems I'm working with.

Additionally, we will be exposing the webhook auth flag in January. Until then, you can run the following on your worker nodes (it will not persist across scale/upgrade operations) to enable the webhook:

sed -i 's/--authorization-mode=Webhook/--authorization-mode=Webhook --authentication-token-webhook=true/g' /etc/default/kubelet

This will allow you to install the prometheus operator: https://github.com/helm/charts/tree/master/stable/prometheus-operator

The operator includes rich Grafana dashboards that will help you see the issue clearly; you need to look at the CoreDNS report, the USE (cluster and node) metrics, and node/host-level metrics for IO saturation on the Linux device (sda, etc.).
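
For reference, a minimal install sketch for that chart with Helm 3 (chart and repo names as published at the time; the release name and namespace are arbitrary):

helm repo add stable https://kubernetes-charts.storage.googleapis.com
helm repo update
kubectl create namespace monitoring
helm install prom-op stable/prometheus-operator --namespace monitoring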

We are working with the Azure monitoring and storage teams to expose these limits/events on AKS worker nodes and other AKS changes to make this clear and easily manageable by customers.

Every disk IO saturation event directly correlates to latency across all customer worker nodes:

[Screenshots attached: API Server, Kubelet, and USE (cluster/node) Grafana dashboards]

@brudnyhenry

Hi,
A small result from testing. When the pod is not running on the same node as the CoreDNS pods, DNS resolution fails:

kubectl run -it --rm aks-ssh --image=busybox:1.28 --overrides='{"apiVersion":"apps/v1","spec":{"template":{"spec":{"nodeSelector":{"beta.kubernetes.io/os":"linux"}}}}}'
kubectl run --generator=deployment/apps.v1 is DEPRECATED and will be removed in a future version. Use kubectl run --generator=run-pod/v1 or kubectl create instead.
If you don't see a command prompt, try pressing enter.
/ # cat /etc/resolv.conf 
nameserver 10.0.0.10
search default.svc.cluster.local svc.cluster.local cluster.local slrbnnk0hehedap5f4hd422u0b.ax.internal.cloudapp.net
options ndots:5
/ # nslookup kubernetes.default
Server:    10.0.0.10
Address 1: 10.0.0.10
nslookup: can't resolve 'kubernetes.default'


kgp --all-namespaces -o wide | egrep 'ssh|dns'

default         aks-ssh-7d8b6fd4c5-kcb75                                          1/1     Running     0          7m13s   10.244.1.25    aks-computingpl-37691802-2   <none>           <none>
kube-system     coredns-7fc597cc45-d7nmb                                          1/1     Running     2          37h     10.244.0.13    aks-computingpl-37691802-1   <none>           <none>
kube-system     coredns-7fc597cc45-wqckk                                          1/1     Running     2          37h     10.244.0.9     aks-computingpl-37691802-1   <none>           <none>
kube-system     coredns-autoscaler-7ccc76bfbd-wjrzz                               1/1     Running     2          3d20h   10.244.0.213   aks-computingpl-37691802-1   <none>           <none>

But when the pod is running on the same node as CoreDNS, everything works fine:

cat /etc/resolv.conf 
nameserver 10.0.0.10
search default.svc.cluster.local svc.cluster.local cluster.local slrbnnk0hehedap5f4hd422u0b.ax.internal.cloudapp.net
options ndots:5
/ # nslookup kubernetes.default
Server:    10.0.0.10
Address 1: 10.0.0.10 kube-dns.kube-system.svc.cluster.local

Name:      kubernetes.default
Address 1: 10.0.0.1 kubernetes.default.svc.cluster.local

kgp --all-namespaces -o wide | egrep 'ssh|dns'

default         aks-ssh-1-78885744d9-8ks6f                                        1/1     Running     0          16s     10.244.0.25    aks-computingpl-37691802-1   <none>           <none>
kube-system     coredns-7fc597cc45-d7nmb                                          1/1     Running     2          37h     10.244.0.13    aks-computingpl-37691802-1   <none>           <none>
kube-system     coredns-7fc597cc45-wqckk                                          1/1     Running     2          37h     10.244.0.9     aks-computingpl-37691802-1   <none>           <none>
kube-system     coredns-autoscaler-7ccc76bfbd-wjrzz                               1/1     Running     2          3d20h   10.244.0.213   aks-computingpl-37691802-1   <none>           <none>

We are using basic Networking and we don't use VMSS.

@mihsoft

mihsoft commented Jan 8, 2020

We are experiencing this same issue when using the Kured daemon. When the node scheduled for reboot is cordoned and drained, eviction of the CoreDNS pod runs into the PodDisruptionBudget; this then causes multiple CoreDNS pods to be scheduled on the same node, hitting the replica count. As a result, when the other node reboots, CoreDNS does not get scheduled onto it, and the SchedulingDisabled flag is not removed.

Manually deleting the second coreDNS pod and rebooting the affected node resolves the issue.
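
A rough sketch of that manual check and fix (pod and node names are placeholders):

# Confirm both replicas landed on one node and check the PodDisruptionBudget
kubectl get pods -n kube-system -l k8s-app=kube-dns -o wide
kubectl get pdb -n kube-system

# Delete the extra replica so it can be rescheduled, then make the rebooted node schedulable again
kubectl delete pod coredns-xxxxxxxxxx-yyyyy -n kube-system
kubectl uncordon aks-nodepool1-12345678-1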

Obviously this is a significant issue as it essentially rules out Kured as a viable option on AKS, for the moment at least.

@jnoller is it worth updating the documentation for AKS regarding the issue (apologies if it has already been updated) as this can be reproduced reliably for new and existing clusters alike even when the clusters themselves are all but empty?

@jnoller
Contributor

jnoller commented Jan 8, 2020

The documentation will not be changed; we will work to fix the scheduling issue.

@jnoller
Contributor

jnoller commented Jan 8, 2020

Please also see this issue for intermittent NodeNotReady, DNS latency, and other crashes related to system load: #1373

@mihsoft

mihsoft commented Jan 9, 2020

Thanks @jnoller - I took a look through the guidance; our disk queue length peaked at 10 (during creation of the node itself) but has stayed <1 until now. Reads/writes are minimal. I wasn't sure if you wanted me to log this feedback in #1373.

In our scenario we created a new cluster as follows:
Location: EastUS
Scale: 2 Nodes - D2s_v3
Scale sets: Off
Virtual nodes: Off
Advanced networking: linked to existing VNET, using kubenet for the network plugin
Kubernetes - 1.14.8

We then installed the kured daemonset and set a period of 1m to bring about the issue quickly.

To force the scenario, we ran sudo touch /var/run/reboot-required on the node.
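
For anyone reproducing this: the check interval is a kured container argument, and the reboot is triggered by the sentinel file kured watches (a sketch; the exact manifest layout depends on the kured version deployed):

# Excerpt of the kured DaemonSet container args:
#   - /usr/bin/kured
#   - --period=1m          # check for the reboot sentinel every minute instead of the default hour
#
# On the node, create the default sentinel file to trigger a reboot:
sudo touch /var/run/reboot-required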

During the cordon and drain observed from the kured pod we get a few warnings but the reboot takes place - time="2020-01-09T13:36:44Z" level=warning msg="WARNING: Deleting pods with local storage: coredns- Ignoring DaemonSet-managed pods: kube-proxy-466sg, kured-7qj54" cmd=/usr/bin/kubectl std=err

Upon reboot - both coredns pods were running on the same node

Because the node that was rebooted has not been uncordoned, kured now fails and is unable to resolve DNS, and CoreDNS is now running both pods on a single node.

Here's the queue length: [screenshot attached]

After manually uncordoning the node, pods on the rebooted node still fail to resolve DNS.

To resolve we did the following: (As mentioned previously by others in this issue)

  1. Manually run uncordon on the node
  2. Delete the coredns pod
  3. Wait for the coredns pod to come online
  4. Wait for the kured pod to resume

Update 10/01/2020:

At present this issue is affecting all 3 of our clusters, all on 1.14.8. In order to keep them up to date we have to manually update them, drain them, uncordon them, etc., which is proving quite time-consuming. All 3 work normally if we manually kill off the coredns pods and ensure that one instance runs on each node.

We have now created several other clusters in various configurations to hopefully provide further diagnostic information to Microsoft.

Update 10/01/2020: 16:58pm

Created a new AKS cluster, with VMSS, 1 nodepool, 2 nodes, DS2_v2, East US, advanced networking - kubenet - existing subnet in VNET
Deployed the kured daemonset - immediately received DNS issues in line with the previous failures - time="2020-01-10T16:57:58Z" level=fatal msg="Error testing lock: Get https://sacha-eus--sacha-purefitnes-31fa67-95f3ca9e.hcp.eastus.azmk8s.io:443/apis/extensions/v1beta1/namespaces/kube-system/daemonsets/kured: dial tcp: i/o timeout"


Here's the az cli used (some data removed for security)

az aks create `
    --resource-group *** `
    --name *** `
    --node-count 2 `
    --network-plugin kubenet `
    --service-cidr 10.3.0.0/16 `
    --dns-service-ip 10.3.0.10 `
    --pod-cidr 10.52.0.0/16 `
    --docker-bridge-address 172.17.0.1/16 `
    --vnet-subnet-id *** `
    --service-principal *** `
    --client-secret *** `
    --ssh-key-value ***

Update 10/01/2020: 17:19pm

Created a new AKS cluster, with VMSS, 1 nodepool, 2 nodes, DS2_v2, East US, advanced networking - kubenet

Ran the kured daemonset script; both nodes drained and rebooted successfully. No DNS issues.

So, at present, I've not noticed the issue on clusters with Azure CNI, or on those using kubenet but not integrated into an existing VNET subnet.

Update 10/01/2020: 17:30pm

Been trying to see if any of the existing clusters can be repaired, tried the following:

  1. Power off all nodes
  2. Attempt reboot

At present, once the issue presents itself, draining any node within the cluster and rebooting it causes the issue to reoccur.

Errors from kube-proxy: trouble saving endpoints for kube-dns

I0110 17:25:30.315041       1 endpoints.go:277] Setting endpoints for "kube-system/kube-dns:dns" to [10.129.2.16:53 10.129.3.19:53]
I0110 17:25:30.315065       1 endpoints.go:277] Setting endpoints for "kube-system/kube-dns:dns-tcp" to [10.129.2.16:53 10.129.3.19:53]
I0110 17:25:30.315086       1 endpoints.go:277] Setting endpoints for "kube-system/kube-dns:dns" to [10.129.2.16:53 10.129.3.19:53]
I0110 17:25:30.315099       1 endpoints.go:277] Setting endpoints for "kube-system/kube-dns:dns-tcp" to [10.129.2.16:53 10.129.3.19:53]

I0110 17:37:05.225987       1 proxier.go:701] Syncing iptables rules
I0110 17:37:05.255950       1 healthcheck.go:235] Not saving endpoints for unknown healthcheck "kube-system/kube-dns"
I0110 17:37:05.255987       1 bounded_frequency_runner.go:221] sync-runner: ran, next possible in 0s, periodic in 30s

@mihsoft

mihsoft commented Jan 11, 2020

Update: 11/01/2020
Looks like the same issue as #592 - unfortunately, that issue never got resolved.

Checking the previously created clusters to see if they too are now failing after overnight updates and reboots.

@erewok

erewok commented Apr 21, 2020

We are seeing this issue as well and are somewhat disappointed that there isn't an obvious fix for it. I believe we are probably suffering from the IOPS throttling/performance problem that @jnoller describes here and also in #1373, but I don't see any clear recommendations on how to fix it. That's unfortunate for such a serious problem that has been known for such a lengthy period of time.

@guitmz

guitmz commented May 12, 2020

@jnoller is Azure doing anything to enable us to mitigate this issue? Kubernetes released https://kubernetes.io/docs/tasks/administer-cluster/nodelocaldns/ and other people are working on similar tools (with coredns support: https://github.com/contentful-labs/coredns-nodecache), but we cannot make use of these tools because AKS will remove them from the cluster (as stated in #1435).

Right now we are out of options, since we have already tested all the workarounds and ruled out disk performance as the cause of the issue. IPv6 is disabled in our cluster. The only thing that helps (but doesn't solve it 100%) is single-request-reopen, but we cannot use it because we rely on some Alpine images (musl's resolver ignores that option).

EDIT: Azure support mentioned they have a workaround to achieve similar results (a DNS cache on the nodes) in AKS; we will test it and report back.

@jnoller
Contributor

jnoller commented May 12, 2020

@guitmz I would revisit the IO issue leading to the DNS outages; as I mentioned, I can recreate these failures regardless of coredns and kernel versions. While racy conntrack, SNAT, etc. all appear in this failure stack, it's usually caused by terminal OS disk IO latency.

@andyzhangx and @juan-lee can comment on engineering mitigations.

@guitmz

guitmz commented May 12, 2020

@jnoller Interesting. Can you share more details about replicating this issue? We are having a hard time with Azure support, as they don't understand that the IO issue and the kernel bug are separate things, and this could help move the situation forward. Thanks.

@erewok

erewok commented May 12, 2020

It turned out that in our case, the failure was due to a misconfiguration of the static IP used for egress in an ARM template. AKS kept automatically creating an unwanted static IP to use for egress, and it turns out we had been manually resolving that by removing it and replacing it with our preferred static IP.

Everything worked fine, but when we later scaled our clusters with this configuration, DNS broke (due to something with the expected egress static IP that had been set up when provisioning our cluster vs. the one we had supplanted it with). Disk saturation and DNS failures were both symptomatic of our issue, but we think that's because there was a lot of thrashing as various kube-system pods started failing. We isolated our own problem, we were able to replicate it, and we have fixed it. I believe that our situation is probably uncommon (but I don't really know 🤷). We will be keeping an eye on disk saturation in the future, though, now that we have all the USE metrics in our Prometheus dashboards.

@srihas619

@jnoller can you help in recreating the failures? Is there any gist that helps in reproducing this reliably? Thank you.

@jnoller
Contributor

jnoller commented May 13, 2020

@srihas619 I mention this in issue #1373, but any sufficiently complex helm chart - such as istio, the Prometheus operator, telegraf, etc. - triggers this failure; sometimes it's at a lower or intermittent level and does not cross into a clear failure. You will need to confirm all metrics against the physical logs on the worker VMs.

@srihas619

@jnoller Thank you. I have watched your video explaining this scenario in the case of an istio installation. However, I am unable to replicate it reliably to test on fresh clusters.

@jnoller
Contributor

jnoller commented May 13, 2020

@srihas619 The video lacks some of the details: https://github.com/jnoller/kubernaughty/blob/master/docs/part-4-how-you-kill-a-container-runtime.md - You will not see 100% reproduction due to the nature of scale / load exhaustion. Additionally, Kubernetes will attempt to restart and heal containers orphaned and lost due to this.

This means that under high IO latency the cgroup and the container itself are lost at the VM level, but not always (because of system load). When they are lost, the write cache that is enabled on all of the host's disks is usually also lost/flushed, losing all in-flight writes.

I would SSH into the VMs and start watching the disk IO latency using bpf / bcc tools and the docker / kubelet logs. You may be 'seeing it' - but unless the IO load lands on the nodes running the coreDNS pods, you won't see the connection drops (unless CNI is trapped in iowait).
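
A hedged sketch of doing that with the bcc tools on an Ubuntu 18.04 worker VM (Ubuntu packages the tools with a -bpfcc suffix; run as root on the node):

# Install the BCC tooling on the worker VM
sudo apt-get update && sudo apt-get install -y bpfcc-tools

# Block-device IO latency histogram, refreshed every 5 seconds
sudo biolatency-bpfcc 5

# Per-request block IO with latency, useful for spotting dockerd/kubelet stalls
sudo biosnoop-bpfcc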

@almegren

We also have this issue. I can recreate it in a brand new AKS cluster with version 1.16.7. I slowly start deploying services, and when a second node is needed, DNS lookups fail on that new node. They still work in pods running on the first node. Restarting coredns with
kubectl rollout restart deployment/coredns -n kube-system
can either fix the issue temporarily or break DNS functionality for pods on both nodes.

@srihas619

@jnoller Thanks for kubernaughty; it helped me follow the whole story in order.
We have a relatively big pod (1024m CPU and 2048 MiB memory) in our AKS cluster (with Standard_D4s_v3 nodes) which, when run with 10 parallel threads to test our services, experiences timeouts. However, when I run it with 1 or up to 4 threads, there have been no timeouts so far. IPv6 is disabled in our cluster, and this particular pod has the single-request-reopen dnsConfig applied. Currently we have a node-local DNS cache running as a DaemonSet (per MS support's suggestion).
We tried to offload /var/lib/docker to /mnt/docker by using knode. Unfortunately it didn't help us, so we had to revert it.
This makes me think that we can increase the test threads in our pod to reproduce the timeouts and trace things with BPF. I am waiting for your part 5 to dive deep.
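
For context, the dnsConfig referred to above is set per pod and looks roughly like this (a sketch with placeholder names; note that single-request-reopen is a glibc resolver option, which is why it does not help musl-based Alpine images):

# Apply a test pod that adds the resolver option (names are placeholders)
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: dns-options-test
spec:
  containers:
  - name: app
    image: busybox:1.28
    command: ["sleep", "3600"]
  dnsConfig:
    options:
    - name: single-request-reopen
EOF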

@juan-lee
Contributor

juan-lee commented May 14, 2020

Can you check to see if your node-local-dns configuration includes the log plugin? Also, are there resource limits set on your node-local-dns DaemonSet? If so, you should remove them; they were inadvertently included and can cause excessive logging and premature OOMKills of the node-local-dns pods.
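
A sketch of those two checks, assuming both the ConfigMap and the DaemonSet are named node-local-dns in kube-system (names may differ depending on how it was deployed):

# Look for a `log` directive in the Corefile used by node-local-dns
kubectl -n kube-system get configmap node-local-dns -o yaml | grep -n ' log'

# Check whether resource limits are set on the DaemonSet containers
kubectl -n kube-system get daemonset node-local-dns -o jsonpath='{.spec.template.spec.containers[*].resources}{"\n"}'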

@jeffwilcox

Our team hit all the fun here... end result: our service principal had expired. JFYI in case anyone else hits this. As soon as we updated the SP, everything was happy again. I'm so excited to move to MSI at some point...
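
For anyone else who lands here, a hedged sketch of checking and rotating the cluster service principal (resource group, cluster name, and secret handling are placeholders):

# Find the cluster's service principal and inspect its credential expiry
SP_ID=$(az aks show -g myResourceGroup -n myAKSCluster --query servicePrincipalProfile.clientId -o tsv)
az ad sp credential list --id "$SP_ID" -o table

# Reset the credential and push the new secret to the cluster
NEW_SECRET=$(az ad sp credential reset --name "$SP_ID" --query password -o tsv)
az aks update-credentials -g myResourceGroup -n myAKSCluster --reset-service-principal --service-principal "$SP_ID" --client-secret "$NEW_SECRET"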

@srihas619

@juan-lee I have checked them; we don't have the log plugin included, and there are no resource limits on the DaemonSet.

@jeffwilcox could you give some more info on the service principal in this context? It would be helpful. I'm wondering how this would impact DNS resolution, though.

@juan-lee
Contributor

juan-lee commented Jun 3, 2020

Indeed @jeffwilcox, an expired service principal will cause DNS to fail. I don't yet understand where the dependency between coredns and the service principal is. I will report back when I figure it out.

@GuyPaddock
Author

GuyPaddock commented Jun 11, 2020

Is it possible that the expired principal causes log spamming, which causes excessive IO, which causes IOPS throttling, which brings down the node, etc.? (and then the cat that killed the rat, that ate the malt, that lay in the house that Jack built, etc.)

@fabiodoaraujo

I got the same issue: pods stopped resolving Kubernetes service names.
During troubleshooting, name resolution started working again, and we got these events in the kube-system namespace:

$ k -n kube-system get events
LAST SEEN   TYPE      REASON                 OBJECT                                    MESSAGE
17m         Normal    TaintManagerEviction   pod/coredns-698c77c5d7-5rn2v              Cancelling deletion of Pod kube-system/coredns-698c77c5d7-5rn2v
7m52s       Normal    TaintManagerEviction   pod/coredns-698c77c5d7-5rn2v              Cancelling deletion of Pod kube-system/coredns-698c77c5d7-5rn2v
16m         Warning   Unhealthy              pod/coredns-autoscaler-5468748cfb-htlqm   Liveness probe failed: HTTP probe failed with statuscode: 500
16m         Normal    LeaderElection         endpoints/kube-controller-manager         kube-controller-manager-649dd7dc8d-57xvd_b9b0ce5c-e19f-4fed-af91-1f50305f1535 became leader
16m         Normal    LeaderElection         endpoints/kube-scheduler                  kube-scheduler-9c78b49df-cbr4z_5467a7ff-2f6e-4923-b639-fdf30a6633c0 became leader

@ghost ghost added the action-required label Jul 22, 2020
@ghost

ghost commented Jul 27, 2020

Action required from @Azure/aks-pm

@palma21
Member

palma21 commented Aug 6, 2020

This issue is no longer accurate, since it touched on:

  1. Manual tweaks to the RT, which are not supported. Today AKS automatically uses an existing RT for kubenet: https://docs.microsoft.com/en-us/azure/aks/configure-kubenet#bring-your-own-subnet-and-route-table-with-kubenet
  2. IO contention, which is explained in a separate issue.
  3. Additional misc problems that are unrelated or unique to one cluster.

As such, I'm closing this, and I'll ask that if you're still having issues you create a specific issue and describe your problem/symptom. Also feel free to paste any support ticket numbers if you have already opened them so we may sync with the internal support teams.

@palma21 palma21 closed this as completed Aug 6, 2020
@palma21 palma21 removed Needs Attention 👋 Issues needs attention/assignee/owner action-required labels Aug 6, 2020
@c4m4

c4m4 commented Aug 18, 2020

I am having this issue using the default OS disk size. I would like to know if there is any fix other than creating the cluster with a bigger disk.
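
Not an official answer, but the OS disk size can only be set when a node pool is created, so the usual options are to create the cluster (or a replacement node pool) with a larger OS disk, or to move the docker IO off the OS disk as sketched earlier in the thread. A minimal sketch with placeholder names and sizes:

# New cluster with a larger OS disk per node
az aks create -g myResourceGroup -n myAKSCluster --node-count 3 --node-osdisk-size 256

# Or add a new node pool with a larger OS disk, then cordon/drain and remove the old pool
az aks nodepool add -g myResourceGroup --cluster-name myAKSCluster --name biggerpool --node-count 3 --node-osdisk-size 256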

@ghost ghost locked as resolved and limited conversation to collaborators Sep 17, 2020