Reason why node is marked as NotReady #863
Comments
Can you look up these logs on the agents that are marked NotReady and see if there are any details in there that might be helpful? |
This just happened. A node is marked as NotReady by my master almost at random. However, I was also unable to SSH into my node while it was marked as NotReady. Could a very high-CPU process be running on the node at that time? I'm using only a 2-core machine. I've had a look at the logs on the node around this time, and there was nothing useful there. Can you tell me how to proceed from here?
|
I am seeing exactly this behavior as well, and this morning we had it on a small production cluster consisting of 3 DS12_v2 agent machines, where two machines were taken out (marked NotReady). I still have the logs for the machines; where should I look, and what could I look for? |
It may be related to #906. |
I can confirm this issue. It happens twice a day on a single machine, and my disk read graphs are through the roof. Still trying to figure out what process is doing this. |
@DonMartin76 We had a similar issue yesterday. @JackQuincy Is there a way to externally monitor the reasons why a node is marked as NotReady? |
@colemickens @brendandburns Is there a way to do what @bogdangrigg is requesting? |
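A minimal sketch of how the reason can usually be read off the API server (the node name is a placeholder, and the event field selector assumes a reasonably recent kubectl):
# The Ready condition carries a Reason/Message, e.g. "Kubelet stopped posting node status."
kubectl describe node <node-name>
# Node lifecycle transitions (NodeNotReady/NodeReady) also show up as cluster events.
kubectl get events --field-selector involvedObject.kind=Node --sort-by=.lastTimestamp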
To add a case to this problem: I'm using ACS with K8s 1.7.5, and since 9 P.M. all my nodes have been randomly going down; the kubelet stops posting node status on one node after another. Surprisingly, it happened on my three different clusters at the same time (see my Grafana monitoring status for the 3 clusters). Is this linked to a possible Azure API connection problem, i.e. is the kubelet failing to post status because the Azure API is unreachable for a while? Since the problem started at the same time on 3 different clusters, it must be linked to something external to Kubernetes. My clusters are not OOMed or CPU-overloaded. |
We have also been facing similar issues since this morning/yesterday evening (20:30 GMT) on our production cluster. Fortunately, not all nodes go down at the same time (this time), but rather one after the other, so the application is still stable. Which logs would make sense to look into to find a possible cause for this behavior? Update: Checked the logs I could find. Update 2: I sifted through all available logs. |
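In case it helps others, a rough sketch of node-side places worth checking (paths and unit names are assumptions; on these clusters the kubelet runs inside Docker, so its output typically lands in syslog):
# Kubelet output on ACS agents usually ends up in syslog.
grep -i "kubelet_node_status\|NodeNotReady" /var/log/syslog
# If the kubelet runs as a systemd unit instead, check its journal.
journalctl -u kubelet --since "1 hour ago"
# Kernel log can reveal OOM kills or disk pressure around the same time.
grep -i "oom\|killed process" /var/log/kern.log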
This behavior has been continuing for the entire day now; we will reprovision the cluster with acs-engine 0.8.0 and Kubernetes 1.7.7 and see whether that helps. |
@DonMartin76 I really don't want to use the "reprovision" solution, since I deployed all my clusters just a week ago and have production-critical applications running right now... Let me know how your tests go, Don, that would be great. :D BTW, all my clusters are still flaky... production, staging, etc. Edit: I used acs-engine v0.7.0 with Kubernetes 1.7.5 to deploy my clusters. |
@theobolo Will let you know how it works out. We reprovision production every week with zero downtime deployments anyway, so it is not a huge issue to do it one time out of the ordinary for us. Which region are you in? We are in North Europe. |
Oh really, every week! Did you automate the whole migration process? Such as migrating the data disks provisioned on Azure, DNS rules, ingress? I have a lot of critical applications (Kafka, Mongo, etc.). I'm able to migrate with zero downtime too, but it takes me half a day to do it on my three clusters. The only point where I'm struggling is migrating my managed disks; I opened an issue to ask whether it's possible to mount a specific managed disk on a pod or a Deployment: #1484 |
We halt some processing for the time of the provisioning, which does not affect the customer; and customer data is kept on Cosmos, which we can just use from both the "old" and "new" cluster. We clone data disks to a backup resource group first, then clone them back into the new resource group, in short. Fully scripted. We don't use mounts of managed disks to Pods, but implement an NFS server which has them mounted to take the state out of the cluster completely (but we reprovision that thing as well). Works for us, but YMMV. Nodes are still flaky on the current production cluster (the 0.5.0 acs-engine and k8s 1.7.2 one) FWIW, and I have encountered some breaking changes going from 0.5.0 to 0.8.0 (Master nodes are tainted instead of cordoned, breaking our DaemonSets). Will get back with results as soon as the new cluster with 1.7.7 has taken over and has been running for a while. |
Update: After an update to acs-engine 0.8.0 and Kubernetes 1.7.7, the problem still occurs.
Similar error messages cycle through for the other nodes as well, and there seems to be no apparent reason. The nodes are still reachable via SSH; I have no clue where to dig here. More logs from the kube-controller-manager: |
Update: The problem ceased Saturday morning (Oct 21st) at around 02:30 GMT and hasn't occurred since. This strongly suggests it was a problem in some underlying Azure infrastructure, but it is still very worrying. |
Hey @DonMartin76, my clusters have been back to a normal state since 5 A.M. this morning, with no more NotReady/Ready switches (see the Grafana kubelet metrics). That should confirm it was an Azure problem... I'll take a look at my kube-controller-manager logs to confirm your thoughts. BTW, thanks for describing your migration process; I should look into adding an NFS layer for my apps. |
Same here, problems just ceased Saturday morning at around 0430 GMT. I opened a support ticket to try to find the root cause on this one; I will return with more information if they find something useful. |
@DonMartin76 Same here, opened an Azure issue to find what was wrong ... 👍 |
@theobolo Would you mind sharing the Azure region(s) your cluster(s) run on? Also North Europe, or somewhere else? |
@DonMartin76 Also North Europe... |
@DonMartin76 Not sure if you can see it: https://app.azure.com/h/RHDP-V40/377c3f but the bug is probably related to this... |
@theobolo Thanks for the link, and yes, I can see it. Let's see whether my support issue renders the same result, they are currently investigating. It's interesting though that the same issue was reported on a non-acs-engine provisioned cluster as well (CoreOS with k8s 1.6.4), but also in the North Europe region. Happy they are really looking into it. |
Hello @theobolo, I currently work on the SR DonMartin76 opened with us. Would it be possible for you to share your SR number with me, so I can link both SRs together and use your information as well to get more details about this behaviour? Thanks in advance. |
I also went back to the logs from our previous incident where we lost 3 out of 4 nodes. k8s-agent-B3E1E1AA-1 was the only node to survive the incident; the other entry is from our master node:
Dec 7 22:26:49 k8s-agent-B3E1E1AA-1 docker[23202]: E1207 22:26:49.770011 23237 kubelet_node_status.go:302] Error updating node status, will retry: failed to get node address from cloud provider: instance not found
Dec 7 22:26:49 k8s-master-B3E1E1AA-0 docker[1806]: E1207 22:26:49.770756 1874 kubelet_node_status.go:302] Error updating node status, will retry: failed to get node address from cloud provider: instance not found
|
Deleted on the Azure side.
Can you send me the subscription/resource id of a VM that corresponds to k8s-master-B3E1E1AA-0?
I'd like to cross-check with the ARM logs.
Thanks
--brendan
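For reference, a sketch of pulling that resource id with the Azure CLI (the resource group name is a placeholder):
# Prints the full ARM resource id of the master VM.
az vm show -g <resource-group> -n k8s-master-B3E1E1AA-0 --query id -o tsv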
|
Ah, I took a deeper look at the code, I think I see the bug...
|
(it would still be useful to get the resource id...)
|
@brendandburns email sent. |
@brendandburns - were you able to find anything in the logs relating to our last outage? Is it possible to go back to the previous incident and confirm it was the same issue? I would like to add this detail to our incident reports. |
@brendanburns So, in short, this may be due to an unresponsive Azure API, or to API calls to Azure that do not reply in the expected way? Is it likely that this happens often enough to cause these kinds of outages? We had periods of 24h+ where this happened on and off. It hasn't happened for a while now, the periods are usually shorter than an hour, and the scheduler usually doesn't start to reschedule things just yet. |
I don't know exactly what errors would cause the SDK to not return an autorest.DetailedError, but at the very least I would suspect that something in the transport layer (e.g. an HTTP timeout) would return a different error.
I'll follow up with the sdk team to better understand the conditions that would trigger this and follow up on this thread.
But nonetheless, we're pushing to get this cherry picked ASAP so that people can update to a fixed binary.
--brendan
|
Happy to hear this. We have been chasing the cause of these things for over a month now (MS support in Munich and I), and until now we haven't been able to really pin it down. We had narrowed it down to something between Kubernetes and the Azure API, but couldn't reproduce it, and thus it's difficult to get a real diagnosis. REALLY appreciate you pitching in on this, Brendan. |
I'd like to understand this better, if possible. The bugfix makes sense (sorry, that'd be my bug), but I don't yet understand how it manifests itself as "Node NotReady" (though I'd agree it seems too coincidental not to be related). It looks like the errors above are coming from the kubelet failing to get its own node status to self-report IPs. I would have assumed that only KCM would actually handle "instance not found" and consider it deletion of a node. If the kubelet were handling "instance not found" and treating it as a deletion, then I would've expected the Node object to disappear entirely, and to see errors from the kubelet trying to PUT its node status for a node object that no longer exists. Additionally, if this is enough to cause things to get "stuck", I'm not sure I understand how things are recovering without the kubelet being restarted such that it tries to do an initial POST to create its own node object. |
What I believe is happening is this: in order to post status, the kubelet needs its IP address. The lookup of the node fails for some transient reason, the kubelet sees "not found", and the entire node status update to the master is aborted. Do this enough times in a row, and the node transitions into the NotReady state. If the transient error eventually fixes itself and the node starts posting status again, the node goes back to the Ready state. If, on the other hand, the issue goes on long enough, the NodeController on the master is going to delete the Node, which will require a kubelet restart to recover from (though that's a bug, IMHO). Switching to IMDS will also fix this in general, since the kubelet won't need an API call to get its IP address.
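A rough way to watch this mechanism from the outside, assuming upstream default timings (node-monitor-grace-period of roughly 40s before the node flips, pod eviction after roughly 5m); the node name is a placeholder:
# If lastHeartbeatTime stops advancing for longer than the grace period,
# the Ready condition flips to Unknown and the node shows as NotReady.
kubectl get node <node-name> -o jsonpath='{.status.conditions[?(@.type=="Ready")]}'
# On the node itself, the matching symptom is the kubelet aborting its status update:
#   Error updating node status, will retry: failed to get node address from cloud provider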
|
I think we're suffering from this issue as well; roughly every two weeks it causes production cluster disruption, recovering on its own after a while, together with some network-interface ready/not-ready errors in the logs. I could provide further details if required... Thanks |
@snebel The only workaround I'm aware of to get more reliable IP detection, until the merge is available in the k8s version we ship with ACS or AKS, is to enable the MetadataService. |
Hi @malachma, I don't see that flag set in the config file on my nodes. Querying the instance metadata service directly seems to return good results, or am I missing something? |
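For what it's worth, a sketch of the kind of check being described: querying IMDS directly from the node (169.254.169.254 is the fixed metadata address; the api-version shown is an assumption):
# Requires the Metadata header; returns compute/network metadata for this VM.
curl -s -H "Metadata: true" "http://169.254.169.254/metadata/instance?api-version=2017-08-01" | python -m json.tool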
@snebel29 It is interesting that you don't see the flag as mentioned. Adding this flag to the config file and setting it to true should help in this case. By default the metadata service is not used, taking 1.8 as a reference. What is your k8s version, by the way? |
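For illustration only, a sketch of that workaround under the assumptions that the cloud provider config lives at /etc/kubernetes/azure.json and that the kubelet runs as a systemd unit (neither is guaranteed on every acs-engine cluster):
# Add the flag to the cloud provider config (make a backup first).
sudo cp /etc/kubernetes/azure.json /etc/kubernetes/azure.json.bak
sudo jq '. + {"useInstanceMetadata": true}' /etc/kubernetes/azure.json.bak | sudo tee /etc/kubernetes/azure.json > /dev/null
# Restart the kubelet so it re-reads the config (unit name is an assumption).
sudo systemctl restart kubelet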
Hi,
My k8s version is
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.1", GitCommit:"1dc5c66f5dd61da08412a74221ecc79208c2165b", GitTreeState:"clean", BuildDate:"2017-07-14T02:00:46Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"6", GitVersion:"v1.6.6", GitCommit:"7fa1c1756d8bc963f1a389f4a6937dc71f08ada2", GitTreeState:"clean", BuildDate:"2017-06-16T18:21:54Z", GoVersion:"go1.7.6", Compiler:"gc", Platform:"linux/amd64"}
and the acs-engine version used to create the cluster was commit 24a478e ("update kubernetes doc (#860)", Wed Jul 5 2017).
Thanks |
I don't think 1.6.6 supports IMDS... Any chance you can upgrade?
--brendan
|
Hi @brendanburns, Thanks in advance |
Hi @brendanburns, Thanks! |
I hit this same problem on version 1.7.9:
When will the fix be cherry-picked into 1.7.x versions? |
Still facing the same issue with 1.7.7. Can you suggest how to upgrade on Azure Container Service? |
I'm closing this old issue since acs-engine is deprecated in favor of aks-engine. |
Is this a request for help?:
Yes
Orchestrator and version (e.g. Kubernetes, DC/OS, Swarm)
Kubernetes
What happened:
Node is marked as NotReady and a cascading effect takes place on my k8s cluster
What you expected to happen:
How to reproduce it (as minimally and precisely as possible):
Not sure
Anything else we need to know:
In /var/log/kern.log of my k8s master server, I can see that one node has been marked as NotReady.
Is there a reason logged somewhere as to why this was done? This is a recurring issue every few days.
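The kernel log generally won't carry the reason; a minimal sketch of where it does tend to show up (the node name is a placeholder, and the controller-manager label selector is an assumption that may differ per cluster):
# The controller manager is what flips the node to NotReady; its logs name the reason.
kubectl -n kube-system logs -l component=kube-controller-manager --tail=500 | grep -i notready
# The node object itself keeps the reason on its Ready condition.
kubectl get node <node-name> -o jsonpath='{.status.conditions[?(@.type=="Ready")].reason}'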