Reason why node is marked as NotReady #863
Comments
Can you look up these logs on the agents that are marked NotReady and see if there are any details in there that might be helpful? |
This just happened. A node is marked as NotReady by my master almost at random. However, I was also unable to SSH into my node while it was marked as NotReady. Could a very high-CPU process be running on the node at that time? I'm using only a 2-core machine. I've had a look at the logs on the node around this time, and there was nothing useful there. Can you tell me how to proceed from here?
|
I am seeing exactly this behavior as well, and this morning we had it on a small production cluster consisting of 3 DS12_v2 agent machines, where two machines were taken out (marked NotReady). I still have the logs for the machines; where should I look, and what could I look for? |
It may be related to #906. |
I can confirm this issue. It happens twice a day on a single machine, and my disk read graphs are through the roof. Still trying to figure out what process is doing this. |
@DonMartin76 We had a similar issue yesterday. @JackQuincy Is there a way to externally monitor the reasons why a node is marked as NotReady? |
@colemickens @brendandburns Is there a way to do what @bogdangrigg is requesting? |
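A minimal sketch of how the reason can usually be read off the API server (the node name is a placeholder, and the event field selector assumes a reasonably recent kubectl):
# The Ready condition carries a Reason/Message, e.g. "Kubelet stopped posting node status."
kubectl describe node <node-name>
# Node lifecycle transitions (NodeNotReady/NodeReady) also show up as cluster events.
kubectl get events --field-selector involvedObject.kind=Node --sort-by=.lastTimestamp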
To add a case to this problem: I'm using ACS with K8s 1.7.5, and since 9 P.M. all my nodes have been randomly going down; the kubelet stops posting node status on one node after another. Surprisingly, it happened on my three different clusters at the same time (see my Grafana monitoring status for the 3 clusters). Is this linked to a possible Azure API connection problem, i.e. is the kubelet failing to post status because the Azure API is unreachable for a while? Since the problem started at the same time on 3 different clusters, it must be linked to something external to Kubernetes. My clusters are not OOMed or CPU-overloaded. |
We have also been facing similar issues since this morning/yesterday evening (20:30 GMT) on our production cluster. Fortunately, not all nodes go down at the same time (this time), but rather one after the other, so the application is still stable. Which logs would make sense to look into to find a possible cause for this behavior? Update: Checked the logs I could find. Update 2: I sifted through all available logs. |
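In case it helps others, a rough sketch of node-side places worth checking (paths and unit names are assumptions; on these clusters the kubelet runs inside Docker, so its output typically lands in syslog):
# Kubelet output on ACS agents usually ends up in syslog.
grep -i "kubelet_node_status\|NodeNotReady" /var/log/syslog
# If the kubelet runs as a systemd unit instead, check its journal.
journalctl -u kubelet --since "1 hour ago"
# Kernel log can reveal OOM kills or disk pressure around the same time.
grep -i "oom\|killed process" /var/log/kern.log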
This behavior has been continuing for the entire day now; we will reprovision the cluster with acs-engine 0.8.0 and Kubernetes 1.7.7 and see whether that helps. |
@DonMartin76 I really don't want to use the "reprovision" solution, since I deployed all my clusters just a week ago and have production-critical applications running right now... Let me know how your tests go, Don, that would be great. :D BTW, all my clusters are still flaky... production, staging, etc. Edit: I used acs-engine v0.7.0 with Kubernetes 1.7.5 to deploy my clusters. |
@theobolo Will let you know how it works out. We reprovision production every week with zero downtime deployments anyway, so it is not a huge issue to do it one time out of the ordinary for us. Which region are you in? We are in North Europe. |
Oh really, every week! Did you automate the whole migration process? Such as migrating the data disks provisioned on Azure, DNS rules, ingress? I have a lot of critical applications (Kafka, Mongo, etc.). I'm able to migrate with zero downtime too, but it takes me half a day to do it on my three clusters. The only point where I'm struggling is migrating my managed disks; I opened an issue to ask whether it's possible to mount a specific managed disk on a pod or a Deployment: #1484 |
We halt some processing for the time of the provisioning, which does not affect the customer; and customer data is kept on Cosmos, which we can just use from both the "old" and "new" cluster. We clone data disks to a backup resource group first, then clone them back into the new resource group, in short. Fully scripted. We don't use mounts of managed disks to Pods, but implement an NFS server which has them mounted to take the state out of the cluster completely (but we reprovision that thing as well). Works for us, but YMMV. Nodes are still flaky on the current production cluster (the 0.5.0 acs-engine and k8s 1.7.2 one) FWIW, and I have encountered some breaking changes going from 0.5.0 to 0.8.0 (Master nodes are tainted instead of cordoned, breaking our DaemonSets). Will get back with results as soon as the new cluster with 1.7.7 has taken over and has been running for a while. |
Update: After an update to acs-engine 0.8.0 and Kubernetes 1.7.7, the problem still occurs.
Similar error messages cycle through for the other nodes as well, and there seems to be no apparent reason. The nodes are still reachable via SSH; I have no clue where to dig here. More logs from the kube-controller-manager: |
Update: The problem ceased Saturday morning (Oct 21st) at around 02:30 GMT and hasn't occurred since. This strongly suggests it was a problem in some underlying Azure infrastructure, but it is still very worrying. |
Hey @DonMartin76, my clusters have been back to a normal state since 5 A.M. this morning, with no more NotReady/Ready switches (see the Grafana kubelet metrics). That should confirm it was an Azure problem... I'll take a look at my kube-controller-manager logs to confirm your thoughts. BTW, thanks for describing your migration process; I should look into adding an NFS layer for my apps. |
Same here, problems just ceased Saturday morning at around 0430 GMT. I opened a support ticket to try to find the root cause on this one; I will return with more information if they find something useful. |
@DonMartin76 Same here, opened an Azure issue to find what was wrong ... 👍 |
@theobolo Would you mind sharing the Azure region(s) your cluster(s) run on? Also North Europe, or somewhere else? |
@DonMartin76 Also North Europe... |
@DonMartin76 Not sure if you can see it: https://app.azure.com/h/RHDP-V40/377c3f but the bug is probably related to this... |
@theobolo Thanks for the link, and yes, I can see it. Let's see whether my support issue renders the same result, they are currently investigating. It's interesting though that the same issue was reported on a non-acs-engine provisioned cluster as well (CoreOS with k8s 1.6.4), but also in the North Europe region. Happy they are really looking into it. |
Hello @theobolo, I currently work on the SR DonMartin76 opened with us. Would it be possible for you to share your SR number with me, so I can link both SRs together and use your information as well to get more details about this behaviour? Thanks in advance. |
I also went back to the logs from our previous incident where we lost 3 out of 4 nodes. k8s-agent-B3E1E1AA-1 was the only node to survive the incident; the other entry is from our master node:
Dec 7 22:26:49 k8s-agent-B3E1E1AA-1 docker[23202]: E1207 22:26:49.770011 23237 kubelet_node_status.go:302] Error updating node status, will retry: failed to get node address from cloud provider: instance not found
Dec 7 22:26:49 k8s-master-B3E1E1AA-0 docker[1806]: E1207 22:26:49.770756 1874 kubelet_node_status.go:302] Error updating node status, will retry: failed to get node address from cloud provider: instance not found
|
Deleted on the Azure side.
Can you send me the subscription/resource id of a VM that corresponds to k8s-master-B3E1E1AA-0?
I'd like to cross-check with the ARM logs.
Thanks
--brendan
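For reference, a sketch of pulling that resource id with the Azure CLI (the resource group name is a placeholder):
# Prints the full ARM resource id of the master VM.
az vm show -g <resource-group> -n k8s-master-B3E1E1AA-0 --query id -o tsv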
|
Ah, I took a deeper look at the code, I think I see the bug...
|
(it would still be useful to get the resource id...)
|
@brendandburns email sent. |
@brendandburns - were you able to find anything in the logs relating to our last outage? Is it possible to go back to the previous incident and confirm it was the same issue? I would like to add this detail to our incident reports. |
@brendanburns So, in short, this may be due to an unresponsive Azure API, or to API calls to Azure that do not reply in the expected way? Is it likely that this happens often enough to cause these kinds of outages? We had periods of 24h+ where this happened on and off. It hasn't happened for a while now, the periods are usually shorter than an hour, and the scheduler usually doesn't start to reschedule things just yet. |
I don't know exactly what errors would cause the SDK to not return an autorest.DetailedError, but at the very least I would suspect that something in the transport layer (e.g. an HTTP timeout) would return a different error.
I'll follow up with the sdk team to better understand the conditions that would trigger this and follow up on this thread.
But nonetheless, we're pushing to get this cherry picked ASAP so that people can update to a fixed binary.
--brendan
|
Happy to hear this. We have been chasing the cause of these things for over a month now (MS support in Munich and I), and until now we haven't been able to really pin it down. We had narrowed it down to something between Kubernetes and the Azure API, but couldn't reproduce it, and thus it's difficult to get a real diagnosis. REALLY appreciate you pitching in on this, Brendan. |
I'd like to understand this better, if possible. The bugfix makes sense (sorry, that'd be my bug), but I don't yet understand how it manifests itself as "Node NotReady" (though I'd agree it seems too coincidental not to be related). It looks like the errors above are coming from the kubelet failing to get its own node status to self-report IPs. I would have assumed that only KCM would actually handle "instance not found" and consider it deletion of a node. If the kubelet were handling "instance not found" and treating it as a deletion, then I would've expected the Node object to disappear entirely, and to see errors from the kubelet trying to PUT its node status for a node object that no longer exists. Additionally, if this is enough to cause things to get "stuck", I'm not sure I understand how things are recovering without the kubelet being restarted such that it tries to do an initial POST to create its own node object. |
What I believe is happening is this: in order to post status, the kubelet needs its IP address. The lookup of the node fails for some transient reason, the kubelet sees "not found", and the entire node status update to the master is aborted. Do this enough times in a row, and the node transitions into the NotReady state. If the transient error eventually fixes itself and the node starts posting status again, the node goes back to the Ready state. If, on the other hand, the issue goes on long enough, the NodeController on the master is going to delete the Node, which will require a kubelet restart to recover from (though that's a bug, IMHO). Switching to IMDS will also fix this in general, since the kubelet won't need an API call to get its IP address.
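A rough way to watch this mechanism from the outside, assuming upstream default timings (node-monitor-grace-period of roughly 40s before the node flips, pod eviction after roughly 5m); the node name is a placeholder:
# If lastHeartbeatTime stops advancing for longer than the grace period,
# the Ready condition flips to Unknown and the node shows as NotReady.
kubectl get node <node-name> -o jsonpath='{.status.conditions[?(@.type=="Ready")]}'
# On the node itself, the matching symptom is the kubelet aborting its status update:
#   Error updating node status, will retry: failed to get node address from cloud provider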
|
I think we're suffering from this issue as well; roughly every two weeks it causes production cluster disruption, recovering on its own after a while, together with some network-interface ready/not-ready errors in the logs. I could provide further details if required... Thanks |
@snebel The only workaround I'm aware of to get more reliable IP detection, until the merge is available in the k8s version we ship with ACS or AKS, is to enable the MetadataService. |
Hi @malachma, I don't see that flag set in the config file on my nodes. Querying the instance metadata service directly seems to return good results, or am I missing something? |
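For what it's worth, a sketch of the kind of check being described: querying IMDS directly from the node (169.254.169.254 is the fixed metadata address; the api-version shown is an assumption):
# Requires the Metadata header; returns compute/network metadata for this VM.
curl -s -H "Metadata: true" "http://169.254.169.254/metadata/instance?api-version=2017-08-01" | python -m json.tool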
@snebel29 It is interesting that you don't see the flag as mentioned. Adding this flag to the config file and setting it to true should help in this case. By default the metadata service is not used, taking 1.8 as a reference. What is your k8s version, by the way? |
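For illustration only, a sketch of that workaround under the assumptions that the cloud provider config lives at /etc/kubernetes/azure.json and that the kubelet runs as a systemd unit (neither is guaranteed on every acs-engine cluster):
# Add the flag to the cloud provider config (make a backup first).
sudo cp /etc/kubernetes/azure.json /etc/kubernetes/azure.json.bak
sudo jq '. + {"useInstanceMetadata": true}' /etc/kubernetes/azure.json.bak | sudo tee /etc/kubernetes/azure.json > /dev/null
# Restart the kubelet so it re-reads the config (unit name is an assumption).
sudo systemctl restart kubelet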
Hi,
My k8s version is
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.1", GitCommit:"1dc5c66f5dd61da08412a74221ecc79208c2165b", GitTreeState:"clean", BuildDate:"2017-07-14T02:00:46Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"6", GitVersion:"v1.6.6", GitCommit:"7fa1c1756d8bc963f1a389f4a6937dc71f08ada2", GitTreeState:"clean", BuildDate:"2017-06-16T18:21:54Z", GoVersion:"go1.7.6", Compiler:"gc", Platform:"linux/amd64"}
and the acs-engine version used to create the cluster was commit 24a478e ("update kubernetes doc (#860)", Wed Jul 5 2017).
Thanks |
I don't think 1.6.6 supports IMDS... Any chance you can upgrade?
--brendan
|
Hi @brendanburns, Thanks in advance |
Hi @brendanburns, Thanks! |
I hit this same problem on version 1.7.9:
When will the fix be cherry-picked into 1.7.x versions? |
Still facing the same issue with 1.7.7. Can you suggest how to upgrade on Azure Container Service? |
I'm closing this old issue since acs-engine is deprecated in favor of aks-engine. |
Is this a request for help?:
Yes
Orchestrator and version (e.g. Kubernetes, DC/OS, Swarm)
Kubernetes
What happened:
Node is marked as NotReady and a cascading effect takes place on my k8s cluster
What you expected to happen:
How to reproduce it (as minimally and precisely as possible):
Not sure
Anything else we need to know:
In /var/log/kern.log of my k8s master server, I can see that one node has been marked as NotReady.
Is there a reason logged somewhere as to why this was done? This is a recurring issue every few days.
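The kernel log generally won't carry the reason; a minimal sketch of where it does tend to show up (the node name is a placeholder, and the controller-manager label selector is an assumption that may differ per cluster):
# The controller manager is what flips the node to NotReady; its logs name the reason.
kubectl -n kube-system logs -l component=kube-controller-manager --tail=500 | grep -i notready
# The node object itself keeps the reason on its Ready condition.
kubectl get node <node-name> -o jsonpath='{.status.conditions[?(@.type=="Ready")].reason}'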