Existing AKS Cluster Suddenly not Resolving DNS #1320
Comments
Azure support had me take the following actions:
Not sure how this would restore DNS to a functional state, but I tried it. It did not have the desired effect; DNS inside the pods still wasn't working. As a last resort, I drained, rebooted, and uncordoned each node in our three-node cluster one by one. After the last node came up and was uncordoned, DNS seems to have come back up with it! Although I am happy that a reboot of the nodes has mitigated this issue, I am a bit uneasy about the situation because I don't understand what caused it. Hence, I don't know if/when it will happen again...
I did come across a post from another user on another provider having similar issues: they, too, fixed the issue by rebooting nodes, but for them it was only a temporary fix. I am hoping that the issue does not become recurring.
Hi, our reboot strategy was not perfect because the 5 nodes had to reboot the same night. We've opened a case with Microsoft support. After 7 hours of investigation with exactly the same kinds of tests, we chose to perform a rolling update on the nodes in order to get 5 brand-new ones. The problem was solved after regenerating the nodes. It's a very tricky situation because all the pods in kube-system seem to be fully operational, with no errors. We're still working on an alerting rule to check for this. At this point, what we think is that the system upgrade after rebooting a node can cause a low-level failure and an incompatibility with kubelet or whatever cloud template is used.
Please see this comment thread: #1326 (comment) - I suspect, given that the managed pods are fully functional, the latency spikes are due to IOPS quota throttling on the OS disk. I am working on full guidance for this.
We just experienced the same kind of issue. Our Datadog monitoring was no longer working, because the metrics-server was not able to find the API server.
We upgraded to 1.15.5 (North Europe) on 8th November, and it was working until recently (16th November, judging by the lack of data in Datadog). We started to see the problem on Monday 18th November. I think there were Azure issues on Monday and Tuesday; I am not sure if that is linked. What we noticed is that all the kube-system services had no endpoints:
The coredns pods were Running, with no error logs:
But looking at the deployments, there was a problem:
After deleting all the kube-system pods, they came back alive and the problem was solved (it took us 2 days to resolve this...)
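For reference, a minimal sketch of the checks and the last-resort fix described above, assuming the default AKS labels and resource names (the actual outputs from this comment are not shown):

```sh
# Check whether the kube-system services have endpoints
kubectl -n kube-system get endpoints

# Check CoreDNS pod status and logs (k8s-app=kube-dns is the default label)
kubectl -n kube-system get pods -l k8s-app=kube-dns
kubectl -n kube-system logs -l k8s-app=kube-dns

# Check the deployments for unavailable replicas
kubectl -n kube-system get deployments

# Last resort used above: delete all kube-system pods so their
# controllers recreate them
kubectl -n kube-system delete pods --all
```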
@mleneveut please see this comment from this morning: #1326 (comment). You need to look at the IO queue depth at the VM worker level.
@jnoller Our max disk queue depth on this cluster is 1.65 for the last 7 days.
Hello, is there a way we can run CoreDNS in AKS as a DaemonSet to mitigate this (or at least to try)?
Hi, I am also facing the same issue with version 1.14.8 and with the default version 1.13.12. Is there any solution for this? Thanks in advance.
I'm seeing this as well at the moment. Random problems with DNS suddenly not resolving inside and outside the cluster.
By the way, this issue seems to only be present if you are using azure-cni. I've recreated my cluster without it (using only kubenet) and it's working well now (got this hint from #667).
Hi @guitmz, I have been using kubenet only, but I still have the issue.
Same problem here... also using kubenet... nslookup on hostnames returns an error with Kubernetes 1.14.8.
Same here, and we use kubenet.
Interesting. No idea why it's working for me; I will take a closer look.
Hi, shouldn't the CoreDNS deployment be distributed across all nodes?
@brudnyhenry That would be a better implementation IMO, but there is no way to do it from our side. Since I changed my cluster to kubenet I was not seeing this issue anymore, but about an hour ago it suddenly started happening again. The funny thing is that even the coredns-autoscaler pod is having DNS issues itself, so I'm not even sure autoscaling of CoreDNS is working at all:
AKS does not seem to be production ready by any means.
The issue is still Azure Disk throttling of your OS disks. AKS defaults to 100 GB OS disks for all clusters; you can increase this via ARM, however the maximum is 2 TB. This means that Docker IO, logging, metrics, monitoring, the kernel, etc. all share the single OS disk IO path. As Azure uses network-attached storage for OS disks, these have both a hard quota on bandwidth and on file operations per second (IOPS). Small-file write patterns such as Docker's will exhaust the quota of the OS disk, at which point the storage system throttles at the IO and cache level, leading to high VM latency, networking failures, DNS failures, etc. This is a common IaaS sizing/mismatch issue, and we are working as we speak on fleet-wide mitigation and analysis of this issue to provide full guidance. Until then, you can test offloading the Docker/container IO from your OS disk using this utility: https://github.com/juan-lee/knode
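As an illustration of the disk-size lever mentioned above, a larger OS disk (and therefore a higher managed-disk IOPS/bandwidth quota) can be requested at cluster creation; a minimal az CLI sketch with placeholder names:

```sh
# Provision nodes with a 512 GB OS disk instead of the 100 GB default;
# larger Azure managed disks come with higher IOPS/bandwidth quotas.
az aks create \
  --resource-group myResourceGroup \
  --name myAKSCluster \
  --node-count 3 \
  --node-osdisk-size 512
```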
This moves the Docker IO to the VM's temp disk - this means that the Docker data dir becomes ephemeral and its size changes (see temp disk GiB here: https://docs.microsoft.com/en-us/azure/virtual-machines/windows/sizes-storage). Deployment and rebooting of the nodes can take some time. You will need to fix any tools like Splunk, etc. to point to the new path. This is not 100% - high logging levels, security tools, etc. can also trip the IOPS throttle. However, moving Docker IO has consistently shown better runtime stability in the systems I'm working with. Additionally, we will be exposing the webhook auth flag in January. Until then, you can run the following on your worker nodes (it will not persist across scale/upgrade) to enable the webhook:
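The exact command was not captured above. As an assumption, enabling kubelet webhook token authentication on an AKS Ubuntu worker node looks roughly like this sketch (the file location and line format are assumptions about the AKS node image; edit by hand if the format differs):

```sh
# Assumed sketch: add the webhook auth flag to the KUBELET_FLAGS line in
# /etc/default/kubelet, then restart kubelet. As noted above, this does
# not persist across node scale/upgrade operations.
sudo sed -i 's|^KUBELET_FLAGS=|KUBELET_FLAGS=--authentication-token-webhook=true |' /etc/default/kubelet
sudo systemctl restart kubelet
```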
This will allow you to install the Prometheus operator: https://github.com/helm/charts/tree/master/stable/prometheus-operator. The operator includes rich Grafana dashboards that will help you see the issue clearly; you need to look at the CoreDNS report, the USE (cluster and node) metrics, and node-host-level metrics for IO saturation on the Linux device sda/etc. We are working with the Azure monitoring and storage teams to expose these limits/events on AKS worker nodes, and on other AKS changes to make this clear and easily manageable by customers. Every disk IO saturation event directly correlates to latency across all customer worker nodes:
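For a quick spot-check without the full Prometheus stack, per-device IO saturation can also be observed directly on a worker node; a minimal sketch, assuming SSH access and that sda is the OS disk:

```sh
# Install sysstat if needed, then watch extended IO stats for the OS disk
# every 5 seconds. Sustained high await (latency) with %util near 100
# suggests the disk is hitting its Azure IOPS/bandwidth quota.
sudo apt-get install -y sysstat
iostat -x sda 5
```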
Hi,
But when a pod is running on the same node as CoreDNS, everything works fine.
We are using basic networking and we don't use VMSS.
We are experiencing this same issue when using the Kured daemon. When the node scheduled for reboot is cordoned and drained, the CoreDNS container fails due to the PodDisruptionBudget; this then causes multiple CoreDNS pods to be scheduled on the same node, hitting the replica count. Thus, when the other node reboots, CoreDNS does not get scheduled on it, and the SchedulingDisabled flag is not removed. Manually deleting the second CoreDNS pod and rebooting the affected node resolves the issue. Obviously this is a significant issue, as it essentially rules out Kured as a viable option on AKS, for the moment at least. @jnoller, is it worth updating the documentation for AKS regarding the issue (apologies if it has already been updated)? This can be reproduced reliably for new and existing clusters alike, even when the clusters themselves are all but empty.
The documentation will not be changed; we will work to fix the scheduling issue.
Please also see this issue for intermittent NodeNotReady, DNS latency, and other crashes related to system load: #1373
Thanks @jnoller - I took a look through the guidance. Our disk queue length peaked at 10 (during creation of the node itself) but stayed <1 until now; reads/writes are minimal. I wasn't sure if you wanted me to log this feedback in #1373.

In our scenario we created a new cluster as follows:

We then installed the kured daemonset and set a period of 1m to bring about the issue quickly. To force the scenario, we ran sudo touch /var/run/reboot-required on the node. During the cordon and drain observed from the kured pod we get a few warnings, but the reboot takes place. Upon reboot, both coredns pods were running on the same node. Because the node that was rebooted had not been uncordoned, kured now fails and is unable to resolve DNS; coreDNS is now running both pods on a single node. After manually running the uncordon on the node, pods on the rebooted node fail to resolve DNS. To resolve, we did the following (as mentioned previously by others in this issue):
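The exact commands were elided above; based on the fix described earlier in the thread (uncordon the node and delete the duplicate CoreDNS pod), a sketch with placeholder names:

```sh
# Make the rebooted node schedulable again (placeholder node name)
kubectl uncordon aks-nodepool1-12345678-0

# Delete one of the doubled-up CoreDNS pods so the scheduler spreads
# the replicas across nodes again (placeholder pod name)
kubectl -n kube-system delete pod coredns-xxxxxxxxxx-yyyyy
```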
Update 10/01/2020: At present this issue is affecting all 3 of our clusters, all on 1.14.8. In order to keep them up to date we have to manually update them, drain them, uncordon them, etc., which is proving quite time-consuming. All 3 work normally if we manually kill off the coredns pod and ensure that one instance runs on each node. We have now created several other clusters in various configurations to hopefully provide further diagnostic information to Microsoft.

Update 10/01/2020, 16:58: Created a new AKS cluster with VMSS, 1 nodepool, 2 nodes, DS2_v2, East US, advanced networking - kubenet - existing subnet in a VNET. Here's the az cli used (some data removed for security):
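Since the original command was redacted, here is a sketch of what an equivalent az CLI invocation might look like for the configuration described (all names, IDs, and the exact flag set are assumptions):

```sh
az aks create \
  --resource-group myResourceGroup \
  --name myAKSCluster \
  --location eastus \
  --node-count 2 \
  --node-vm-size Standard_DS2_v2 \
  --vm-set-type VirtualMachineScaleSets \
  --network-plugin kubenet \
  --vnet-subnet-id "/subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.Network/virtualNetworks/<vnet>/subnets/<subnet>"
```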
Update 10/01/2020, 17:19: Created a new AKS cluster with VMSS, 1 nodepool, 2 nodes, DS2_v2, East US, advanced networking - kubenet. Ran the kured daemon script; both nodes drained and rebooted successfully. No DNS issues. So, at present, I've not noticed the issue on clusters with Azure CNI, or on those using kubenet but not integrated into an existing VNET subnet.

Update 10/01/2020, 17:30: Been trying to see if any of the existing clusters can be repaired; tried the following:
At present, once the issue presents itself, draining any node within the cluster and rebooting it causes the issue to reoccur. Errors from kube-proxy: trouble saving endpoints for kube-dns:
Update 11/01/2020: Checking the previously created clusters to see if they too are now failing after overnight updates and reboots.
We are seeing this issue as well, and are somewhat disappointed there isn't any obvious fix for it. I believe we are probably suffering from the IOPS throttling/performance problem that @jnoller describes here and also in #1373, but I don't see any clear recommendations on how to fix it. That's unfortunate for such a serious problem that has been known for such a lengthy period of time.
@jnoller is Azure doing anything to enable us to mitigate this issue? Kubernetes released this: https://kubernetes.io/docs/tasks/administer-cluster/nodelocaldns/ and other people are working on similar tools (with CoreDNS support: https://github.com/contentful-labs/coredns-nodecache), but we cannot make use of these tools because AKS will remove them from the cluster (as stated in #1435). Right now we are out of options, since we have already tested all the workarounds and have ruled out disk performance as the cause of the issue. IPv6 is disabled in our cluster. The only thing that helps (but doesn't solve it 100%) is
EDIT: Azure support mentioned they have a workaround to achieve similar results (a DNS cache on the nodes) in AKS; we will test it and report back.
@guitmz I would revisit the IO issue leading to the DNS outages; as I mentioned, I can recreate these failures regardless of coredns and kernel versions. While racy conntrack, SNAT, etc. all appear in this failure stack, it's usually caused by terminal OS disk IO latency. @andyzhangx and @juan-lee can comment on engineering mitigations.
@jnoller interesting. Can you share more details about replicating this issue? We are having a hard time with Azure support, as they don't understand that the IO issue and the kernel bug are separate things, and this could help move the situation forward. Thanks.
It turned out that in our case, we had a failure due to a misconfiguration of the static IP used for egress in an ARM template. AKS kept automatically creating an unwanted static IP to use for egress, and it turns out we were manually resolving that by removing it and replacing it with our preferred static IP. Everything worked fine, but when we later scaled our clusters with this configuration, DNS broke (due to something with the expected egress static IP that had been set up when provisioning our cluster vs. the one we had supplanted it with). Disk saturation and DNS failures were both symptomatic of our issue, but we think that's because there was lots of thrashing as various kube-system pods started failing. We isolated our own problem, we were able to replicate it, and we have fixed it. I believe that our situation is probably uncommon (but I don't really know 🤷). We will be keeping an eye on disk saturation in the future, though, now that we have all the USE metrics in our Prometheus dashboards.
@jnoller can you help in recreating the failures? Is there a gist that helps reproduce this reliably? Thank you.
@srihas619 I mention this in issue #1373, but any sufficiently complex helm chart - such as istio, the Prometheus operator, telegraf, etc. - triggers this failure; sometimes it is at a lower or intermittent level and does not cross into a clear failure. You will need to confirm all metrics against the physical logs on the worker VMs.
@jnoller Thank you. I have watched your video explaining this scenario in the case of an istio installation. However, I am unable to replicate it reliably to test on fresh clusters.
@srihas619 The video leaves out some of the details: https://github.com/jnoller/kubernaughty/blob/master/docs/part-4-how-you-kill-a-container-runtime.md - You will not see 100% reproduction due to the nature of scale/load exhaustion. Additionally, Kubernetes will attempt to restart and heal containers orphaned and lost due to this. This means that under high IO latency the cgroup and the container itself are lost at the VM level, but not always (because of system load). When it is lost, the write cache that is enabled on all of the host's disks is usually also lost/flushed, losing all in-flight writes. I would SSH into the VMs and start watching the disk IO latency using bpf/bcc tools and the docker/kubelet logs. You may be 'seeing it' - but unless the IO load lands on the nodes running the coreDNS pods, you won't see the connection drops (unless CNI is trapped in IOWait).
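As an example of the bpf/bcc approach, a minimal sketch for an Ubuntu worker node (package and tool names are the Ubuntu ones; assumes SSH access to the node):

```sh
# Install the BPF Compiler Collection tools
sudo apt-get update && sudo apt-get install -y bpfcc-tools

# Print per-disk block IO latency histograms every 10 seconds; a long
# latency tail on the OS disk is a sign of IOPS/bandwidth throttling
sudo biolatency-bpfcc -D 10
```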
We also have this issue. I can recreate it in a brand-new AKS cluster with version 1.16.7. I slowly start deploying services, and when a second node is needed, DNS lookup fails on that new node. It still works for pods running on the first node. Restarting coredns with
@jnoller Thanks for the kubernaughty docs; they helped me follow the plot linearly.
Can you check to see if your node-local-dns configuration includes the log plugin? Also, are there limits set on your node-local-dns daemonset? If so, you should remove them. They were inadvertently included, and can cause excessive logging and premature OOMKills of the node-local-dns pods.
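A quick way to check both, assuming the upstream resource names (a node-local-dns daemonset and configmap in kube-system; this is a sketch, not from the original comment):

```sh
# Look for a 'log' line in the Corefile used by node-local-dns
kubectl -n kube-system get configmap node-local-dns -o yaml | grep -n log

# Show any resource limits set on the daemonset's containers
kubectl -n kube-system get daemonset node-local-dns \
  -o jsonpath='{.spec.template.spec.containers[*].resources}'
```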
Our team hit all the fun here... end result: our service principal had expired. JFYI in case anyone else hits this. As soon as we updated the SP, everything was happy again. I'm so excited to move to MSI at some point...
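For anyone hitting the same thing: rotating an expired service principal's credentials on an existing cluster can be done with the az CLI; a sketch with placeholder names (flag spellings vary across CLI versions):

```sh
# Fetch the cluster's service principal ID, reset its secret,
# then push the new credential to the cluster
SP_ID=$(az aks show -g myResourceGroup -n myAKSCluster \
  --query servicePrincipalProfile.clientId -o tsv)
SP_SECRET=$(az ad sp credential reset --name "$SP_ID" --query password -o tsv)
az aks update-credentials -g myResourceGroup -n myAKSCluster \
  --reset-service-principal \
  --service-principal "$SP_ID" \
  --client-secret "$SP_SECRET"
```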
@juan-lee I have checked them; we don't have the log plugin included and there are no resource limits on the DaemonSet. @jeffwilcox could you give some more info on the service principal in this context? It would be helpful. I am wondering how this would impact DNS resolution, though.
Indeed @jeffwilcox, an expired service principal will cause DNS to fail. I don't understand where the dependency between coredns and the service principal is. I will report back when I figure it out.
Is it possible that the expired principal causes log spamming, which causes excessive IO, which causes IOPS throttling, which brings down the node, etc.? (And then the cat that killed the rat, that ate the malt, that lay in the house that Jack built, etc.)
I got the same issue. Pods stopped resolving Kubernetes service names.
Action required from @Azure/aks-pm
This issue is no longer accurate since it touched
As such, I'm closing this, and I'll ask that if you're still having issues, you create a specific issue and describe your problem/symptom. Also feel free to paste any support ticket numbers if you have already opened them, so we can sync with the internal support teams.
I am having this issue using the default OS disk size. I would like to know if there is any fix other than creating the cluster with a bigger disk?
What happened:
We've had an AKS cluster deployed since February 2019 that had been stable until tonight. As of 11:30 PM ET on 2019-11-17, it seems as though all DNS requests -- both for hosts inside the cluster (e.g. redis.myapp-dev) and hosts outside the cluster (e.g. myapp.mysql.database.azure.com) -- have stopped being resolved. If I SSH into a node in the cluster, DNS queries to outside hostnames like google.com will resolve.
Here's what I've tried so far:
- Deleted the coredns pod so that it would respawn, but that did not resolve the issue.
- nslookup kubernetes.default comes back with:
- /etc/resolv.conf looks correct:

Because of the outage, the AKS dashboard for this cluster is also down.
What you expected to happen:
- kubernetes.default should resolve inside pods.
- Hosts inside the cluster (e.g. redis.myapp-dev) should resolve inside pods.
- Hosts outside the cluster (e.g. myapp.mysql.database.azure.com) should resolve inside pods.

How to reproduce it (as minimally and precisely as possible):
```sh
kubectl exec -it name-of-a-pod-in-cluster
nslookup kubernetes.default
```
Anything else we need to know?:
Environment:
- Kubernetes version (use kubectl version):