Existing AKS Cluster Suddenly not Resolving DNS #1320
Comments
Azure support had me take the following actions:
Not sure how this would restore DNS to a functional state, but I tried it. It did not have the desired effect; DNS inside the pods still wasn't working. As a last resort, I drained, rebooted, and uncordoned each node in our three-node cluster one by one. After the last node came up and was uncordoned, DNS seems to have come back up with it! Although I am happy that a reboot of the nodes has mitigated this issue, I am a bit uneasy about the situation because I don't understand what caused it. Hence, I don't know if/when it will happen again...
I did come across a post from another user on another provider having similar issues: they, too, fixed the issue by rebooting nodes, but for them it was only a temporary fix. I am hoping that the issue does not become recurring.
Hi, our reboot strategy was not perfect because the 5 nodes had to reboot the same night. We've opened a case with Microsoft support. After 7 hours of investigation with exactly the same kinds of tests, we chose to perform a rolling update on the nodes in order to get 5 brand-new ones. The problem was solved after regenerating the nodes. It's a very tricky situation because all the pods in kube-system seem to be fully operational, with no errors. We're still working on an alerting rule to check for this. At this point, what we think is that the system upgrade after rebooting a node can cause a low-level failure and an incompatibility with kubelet or whatever cloud template is used.
Please see this comment thread: #1326 (comment) - I suspect, given that the managed pods are fully functional, the latency spikes are due to IOPS quota throttling on the OS disk. I am working on full guidance for this.
We just experienced the same kind of issue. Our Datadog monitoring was no longer working, because the metrics-server was not able to find the API server.
We upgraded to 1.15.5 (North Europe) on 8th November, and it was working until recently (16th November, judging by the lack of data in Datadog). We started to see the problem on Monday 18th November. I think there were Azure issues on Monday and Tuesday; I am not sure if that is linked. What we noticed is that all the kube-system services had no endpoints:
The coredns pods were Running, with no error logs:
But looking at the deployments, there was a problem:
After deleting all the kube-system pods, they came back alive and the problem was solved (it took us 2 days to resolve this...)
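For reference, a minimal sketch of the checks and the last-resort fix described above, assuming the default AKS labels and resource names (the actual outputs from this comment are not shown):

```sh
# Check whether the kube-system services have endpoints
kubectl -n kube-system get endpoints

# Check CoreDNS pod status and logs (k8s-app=kube-dns is the default label)
kubectl -n kube-system get pods -l k8s-app=kube-dns
kubectl -n kube-system logs -l k8s-app=kube-dns

# Check the deployments for unavailable replicas
kubectl -n kube-system get deployments

# Last resort used above: delete all kube-system pods so their
# controllers recreate them
kubectl -n kube-system delete pods --all
```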
@mleneveut please see this comment from this morning: #1326 (comment). You need to look at the IO queue depth at the VM worker level.
@jnoller Our max disk queue depth on this cluster is 1.65 for the last 7 days.
Hello, is there a way we can run CoreDNS in AKS as a DaemonSet to mitigate this (or at least to try)?
Hi, I am also facing the same issue with version 1.14.8 and with the default version 1.13.12. Is there any solution for this? Thanks in advance.
I'm seeing this as well at the moment. Random problems with DNS suddenly not resolving inside and outside the cluster.
By the way, this issue seems to only be present if you are using azure-cni. I've recreated my cluster without it (using only kubenet) and it's working well now (got this hint from #667).
Hi @guitmz, I have been using kubenet only, but I still have the issue.
Same problem here... also using kubenet... nslookup on hostnames returns an error with Kubernetes 1.14.8.
Same here, and we use kubenet.
Interesting. No idea why it's working for me; I will take a closer look.
Hi, shouldn't the CoreDNS deployment be distributed across all nodes?
@brudnyhenry That would be a better implementation IMO, but there is no way to do it from our side. Since I changed my cluster to kubenet I was not seeing this issue anymore, but about an hour ago it suddenly started happening again. The funny thing is that even the coredns-autoscaler pod is having DNS issues itself, so I'm not even sure autoscaling of CoreDNS is working at all:
AKS does not seem to be production ready by any means.
The issue is still Azure Disk throttling of your OS disks. AKS defaults to 100 GB OS disks for all clusters; you can increase this via ARM, however the maximum is 2 TB. This means that Docker IO, logging, metrics, monitoring, the kernel, etc. all share the single OS disk IO path. As Azure uses network-attached storage for OS disks, these have both a hard quota on bandwidth and on file operations per second (IOPS). Small-file write patterns such as Docker's will exhaust the quota of the OS disk, at which point the storage system throttles at the IO and cache level, leading to high VM latency, networking failures, DNS failures, etc. This is a common IaaS sizing/mismatch issue, and we are working as we speak on fleet-wide mitigation and analysis of this issue to provide full guidance. Until then, you can test offloading the Docker/container IO from your OS disk using this utility: https://github.com/juan-lee/knode
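As an illustration of the disk-size lever mentioned above, a larger OS disk (and therefore a higher managed-disk IOPS/bandwidth quota) can be requested at cluster creation; a minimal az CLI sketch with placeholder names:

```sh
# Provision nodes with a 512 GB OS disk instead of the 100 GB default;
# larger Azure managed disks come with higher IOPS/bandwidth quotas.
az aks create \
  --resource-group myResourceGroup \
  --name myAKSCluster \
  --node-count 3 \
  --node-osdisk-size 512
```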
This moves the Docker IO to the VM's temp disk - this means that the Docker data dir becomes ephemeral and its size changes (see temp disk GiB here: https://docs.microsoft.com/en-us/azure/virtual-machines/windows/sizes-storage). Deployment and rebooting of the nodes can take some time. You will need to fix any tools like Splunk, etc. to point to the new path. This is not 100% - high logging levels, security tools, etc. can also trip the IOPS throttle. However, moving Docker IO has consistently shown better runtime stability in the systems I'm working with. Additionally, we will be exposing the webhook auth flag in January. Until then, you can run the following on your worker nodes (it will not persist across scale/upgrade) to enable the webhook:
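The exact command was not captured above. As an assumption, enabling kubelet webhook token authentication on an AKS Ubuntu worker node looks roughly like this sketch (the file location and line format are assumptions about the AKS node image; edit by hand if the format differs):

```sh
# Assumed sketch: add the webhook auth flag to the KUBELET_FLAGS line in
# /etc/default/kubelet, then restart kubelet. As noted above, this does
# not persist across node scale/upgrade operations.
sudo sed -i 's|^KUBELET_FLAGS=|KUBELET_FLAGS=--authentication-token-webhook=true |' /etc/default/kubelet
sudo systemctl restart kubelet
```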
This will allow you to install the Prometheus operator: https://github.com/helm/charts/tree/master/stable/prometheus-operator. The operator includes rich Grafana dashboards that will help you see the issue clearly; you need to look at the CoreDNS report, the USE (cluster and node) metrics, and node-host-level metrics for IO saturation on the Linux device sda/etc. We are working with the Azure monitoring and storage teams to expose these limits/events on AKS worker nodes, and on other AKS changes to make this clear and easily manageable by customers. Every disk IO saturation event directly correlates to latency across all customer worker nodes:
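For a quick spot-check without the full Prometheus stack, per-device IO saturation can also be observed directly on a worker node; a minimal sketch, assuming SSH access and that sda is the OS disk:

```sh
# Install sysstat if needed, then watch extended IO stats for the OS disk
# every 5 seconds. Sustained high await (latency) with %util near 100
# suggests the disk is hitting its Azure IOPS/bandwidth quota.
sudo apt-get install -y sysstat
iostat -x sda 5
```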
Hi,
But when a pod is running on the same node as CoreDNS, everything works fine.
We are using basic networking and we don't use VMSS.
We are experiencing this same issue when using the Kured daemon. When the node scheduled for reboot is cordoned and drained, the CoreDNS container fails due to the PodDisruptionBudget; this then causes multiple CoreDNS pods to be scheduled on the same node, hitting the replica count. Thus, when the other node reboots, CoreDNS does not get scheduled on it, and the SchedulingDisabled flag is not removed. Manually deleting the second CoreDNS pod and rebooting the affected node resolves the issue. Obviously this is a significant issue, as it essentially rules out Kured as a viable option on AKS, for the moment at least. @jnoller, is it worth updating the documentation for AKS regarding the issue (apologies if it has already been updated)? This can be reproduced reliably for new and existing clusters alike, even when the clusters themselves are all but empty.
The documentation will not be changed; we will work to fix the scheduling issue.
Please also see this issue for intermittent NodeNotReady, DNS latency, and other crashes related to system load: #1373
Thanks @jnoller - I took a look through the guidance. Our disk queue length peaked at 10 (during creation of the node itself) but stayed <1 until now; reads/writes are minimal. I wasn't sure if you wanted me to log this feedback in #1373.

In our scenario we created a new cluster as follows:

We then installed the kured daemonset and set a period of 1m to bring about the issue quickly. To force the scenario, we ran sudo touch /var/run/reboot-required on the node. During the cordon and drain observed from the kured pod we get a few warnings, but the reboot takes place. Upon reboot, both coredns pods were running on the same node. Because the node that was rebooted had not been uncordoned, kured now fails and is unable to resolve DNS; coreDNS is now running both pods on a single node. After manually running the uncordon on the node, pods on the rebooted node fail to resolve DNS. To resolve, we did the following (as mentioned previously by others in this issue):
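The exact commands were elided above; based on the fix described earlier in the thread (uncordon the node and delete the duplicate CoreDNS pod), a sketch with placeholder names:

```sh
# Make the rebooted node schedulable again (placeholder node name)
kubectl uncordon aks-nodepool1-12345678-0

# Delete one of the doubled-up CoreDNS pods so the scheduler spreads
# the replicas across nodes again (placeholder pod name)
kubectl -n kube-system delete pod coredns-xxxxxxxxxx-yyyyy
```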
Update 10/01/2020: At present this issue is affecting all 3 of our clusters, all on 1.14.8. In order to keep them up to date we have to manually update them, drain them, uncordon them, etc., which is proving quite time-consuming. All 3 work normally if we manually kill off the coredns pod and ensure that one instance runs on each node. We have now created several other clusters in various configurations to hopefully provide further diagnostic information to Microsoft.

Update 10/01/2020, 16:58: Created a new AKS cluster with VMSS, 1 nodepool, 2 nodes, DS2_v2, East US, advanced networking - kubenet - existing subnet in a VNET. Here's the az cli used (some data removed for security):
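Since the original command was redacted, here is a sketch of what an equivalent az CLI invocation might look like for the configuration described (all names, IDs, and the exact flag set are assumptions):

```sh
az aks create \
  --resource-group myResourceGroup \
  --name myAKSCluster \
  --location eastus \
  --node-count 2 \
  --node-vm-size Standard_DS2_v2 \
  --vm-set-type VirtualMachineScaleSets \
  --network-plugin kubenet \
  --vnet-subnet-id "/subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.Network/virtualNetworks/<vnet>/subnets/<subnet>"
```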
Update 10/01/2020, 17:19: Created a new AKS cluster with VMSS, 1 nodepool, 2 nodes, DS2_v2, East US, advanced networking - kubenet. Ran the kured daemon script; both nodes drained and rebooted successfully. No DNS issues. So, at present, I've not noticed the issue on clusters with Azure CNI, or on those using kubenet but not integrated into an existing VNET subnet.

Update 10/01/2020, 17:30: Been trying to see if any of the existing clusters can be repaired; tried the following:
At present, once the issue presents itself, draining any node within the cluster and rebooting it causes the issue to reoccur. Errors from kube-proxy: trouble saving endpoints for kube-dns:
Update 11/01/2020: Checking the previously created clusters to see if they too are now failing after overnight updates and reboots.
We are seeing this issue as well, and are somewhat disappointed there isn't any obvious fix for it. I believe we are probably suffering from the IOPS throttling/performance problem that @jnoller describes here and also in #1373, but I don't see any clear recommendations on how to fix it. That's unfortunate for such a serious problem that has been known for such a lengthy period of time.
@jnoller is Azure doing anything to enable us to mitigate this issue? Kubernetes released this: https://kubernetes.io/docs/tasks/administer-cluster/nodelocaldns/ and other people are working on similar tools (with CoreDNS support: https://github.com/contentful-labs/coredns-nodecache), but we cannot make use of these tools because AKS will remove them from the cluster (as stated in #1435). Right now we are out of options, since we have already tested all the workarounds and have ruled out disk performance as the cause of the issue. IPv6 is disabled in our cluster. The only thing that helps (but doesn't solve it 100%) is
EDIT: Azure support mentioned they have a workaround to achieve similar results (a DNS cache on the nodes) in AKS; we will test it and report back.
@guitmz I would revisit the IO issue leading to the DNS outages; as I mentioned, I can recreate these failures regardless of coredns and kernel versions. While racy conntrack, SNAT, etc. all appear in this failure stack, it's usually caused by terminal OS disk IO latency. @andyzhangx and @juan-lee can comment on engineering mitigations.
@jnoller interesting. Can you share more details about replicating this issue? We are having a hard time with Azure support, as they don't understand that the IO issue and the kernel bug are separate things, and this could help move the situation forward. Thanks.
It turned out that in our case, we had a failure due to a misconfiguration of the static IP used for egress in an ARM template. AKS kept automatically creating an unwanted static IP to use for egress, and it turns out we were manually resolving that by removing it and replacing it with our preferred static IP. Everything worked fine, but when we later scaled our clusters with this configuration, DNS broke (due to something with the expected egress static IP that had been set up when provisioning our cluster vs. the one we had supplanted it with). Disk saturation and DNS failures were both symptomatic of our issue, but we think that's because there was lots of thrashing as various kube-system pods started failing. We isolated our own problem, we were able to replicate it, and we have fixed it. I believe that our situation is probably uncommon (but I don't really know 🤷). We will be keeping an eye on disk saturation in the future, though, now that we have all the USE metrics in our Prometheus dashboards.
@jnoller can you help in recreating the failures? Is there a gist that helps reproduce this reliably? Thank you.
@srihas619 I mention this in issue #1373, but any sufficiently complex helm chart - such as istio, the Prometheus operator, telegraf, etc. - triggers this failure; sometimes it is at a lower or intermittent level and does not cross into a clear failure. You will need to confirm all metrics against the physical logs on the worker VMs.
@jnoller Thank you. I have watched your video explaining this scenario in the case of an istio installation. However, I am unable to replicate it reliably to test on fresh clusters.
@srihas619 The video leaves out some of the details: https://github.com/jnoller/kubernaughty/blob/master/docs/part-4-how-you-kill-a-container-runtime.md - You will not see 100% reproduction due to the nature of scale/load exhaustion. Additionally, Kubernetes will attempt to restart and heal containers orphaned and lost due to this. This means that under high IO latency the cgroup and the container itself are lost at the VM level, but not always (because of system load). When it is lost, the write cache that is enabled on all of the host's disks is usually also lost/flushed, losing all in-flight writes. I would SSH into the VMs and start watching the disk IO latency using bpf/bcc tools and the docker/kubelet logs. You may be 'seeing it' - but unless the IO load lands on the nodes running the coreDNS pods, you won't see the connection drops (unless CNI is trapped in IOWait).
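As an example of the bpf/bcc approach, a minimal sketch for an Ubuntu worker node (package and tool names are the Ubuntu ones; assumes SSH access to the node):

```sh
# Install the BPF Compiler Collection tools
sudo apt-get update && sudo apt-get install -y bpfcc-tools

# Print per-disk block IO latency histograms every 10 seconds; a long
# latency tail on the OS disk is a sign of IOPS/bandwidth throttling
sudo biolatency-bpfcc -D 10
```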
We also have this issue. I can recreate it in a brand-new AKS cluster with version 1.16.7. I slowly start deploying services, and when a second node is needed, DNS lookup fails on that new node. It still works for pods running on the first node. Restarting coredns with
@jnoller Thanks for the kubernaughty docs; they helped me follow the plot linearly.
Can you check to see if your node-local-dns configuration includes the log plugin? Also, are there limits set on your node-local-dns daemonset? If so, you should remove them. They were inadvertently included, and can cause excessive logging and premature OOMKills of the node-local-dns pods.
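A quick way to check both, assuming the upstream resource names (a node-local-dns daemonset and configmap in kube-system; this is a sketch, not from the original comment):

```sh
# Look for a 'log' line in the Corefile used by node-local-dns
kubectl -n kube-system get configmap node-local-dns -o yaml | grep -n log

# Show any resource limits set on the daemonset's containers
kubectl -n kube-system get daemonset node-local-dns \
  -o jsonpath='{.spec.template.spec.containers[*].resources}'
```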
Our team hit all the fun here... end result: our service principal had expired. JFYI in case anyone else hits this. As soon as we updated the SP, everything was happy again. I'm so excited to move to MSI at some point...
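For anyone hitting the same thing: rotating an expired service principal's credentials on an existing cluster can be done with the az CLI; a sketch with placeholder names (flag spellings vary across CLI versions):

```sh
# Fetch the cluster's service principal ID, reset its secret,
# then push the new credential to the cluster
SP_ID=$(az aks show -g myResourceGroup -n myAKSCluster \
  --query servicePrincipalProfile.clientId -o tsv)
SP_SECRET=$(az ad sp credential reset --name "$SP_ID" --query password -o tsv)
az aks update-credentials -g myResourceGroup -n myAKSCluster \
  --reset-service-principal \
  --service-principal "$SP_ID" \
  --client-secret "$SP_SECRET"
```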
@juan-lee I have checked them; we don't have the log plugin included and there are no resource limits on the DaemonSet. @jeffwilcox could you give some more info on the service principal in this context? It would be helpful. I am wondering how this would impact DNS resolution, though.
Indeed @jeffwilcox, an expired service principal will cause DNS to fail. I don't understand where the dependency between coredns and the service principal is. I will report back when I figure it out.
Is it possible that the expired principal causes log spamming, which causes excessive IO, which causes IOPS throttling, which brings down the node, etc.? (And then the cat that killed the rat, that ate the malt, that lay in the house that Jack built, etc.)
I got the same issue. Pods stopped resolving Kubernetes service names.
Action required from @Azure/aks-pm
This issue is no longer accurate since it touched
As such, I'm closing this, and I'll ask that if you're still having issues, you create a specific issue and describe your problem/symptom. Also feel free to paste any support ticket numbers if you have already opened them, so we can sync with the internal support teams.
I am having this issue using the default OS disk size. I would like to know if there is any fix other than creating the cluster with a bigger disk?
What happened:
We've had an AKS cluster deployed since February 2019 that had been stable until tonight. As of 11:30 PM ET on 2019-11-17, it seems as though all DNS requests -- both for hosts inside the cluster (e.g. redis.myapp-dev) and hosts outside the cluster (e.g. myapp.mysql.database.azure.com) -- have stopped being resolved. If I SSH into a node in the cluster, DNS queries to outside hostnames like google.com will resolve.
Here's what I've tried so far:
- Deleted the coredns pod so that it would respawn, but that did not resolve the issue.
- nslookup kubernetes.default comes back with:
- /etc/resolv.conf looks correct:

Because of the outage, the AKS dashboard for this cluster is also down.
What you expected to happen:
- kubernetes.default should resolve inside pods.
- Hosts inside the cluster (e.g. redis.myapp-dev) should resolve inside pods.
- Hosts outside the cluster (e.g. myapp.mysql.database.azure.com) should resolve inside pods.

How to reproduce it (as minimally and precisely as possible):
```sh
kubectl exec -it name-of-a-pod-in-cluster
nslookup kubernetes.default
```
Anything else we need to know?:
Environment:
- Kubernetes version (use kubectl version):