This repository has been archived by the owner on Jan 11, 2023. It is now read-only.

Reason why node is marked as NotReady #863

Closed
rushabhnagda11 opened this issue Jun 24, 2017 · 62 comments

Comments

@rushabhnagda11

Is this a request for help?:
Yes

Orchestrator and version (e.g. Kubernetes, DC/OS, Swarm)
Kubernetes

What happened:
A node is marked as NotReady and a cascading effect takes place on my k8s cluster

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):
Not sure

Anything else we need to know:
In /var/log/kern.log of my k8s master server, I can see that one node has been marked as NotReady.
Is there a reason logged somewhere as to why this was done? This is a recurring issue every few days.

@JackQuincy
Contributor

Can you look up these logs on the agents that are marked NotReady and see if there are any details in there that might be helpful?
cat /var/log/azure/cluster-provision.log
sudo journalctl -u kubelet
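
The node object itself also usually records why it went NotReady, in the Ready condition's reason and message (for example "Kubelet stopped posting node status"). A rough way to pull that out, with <node-name> as a placeholder for the affected agent:

kubectl describe node <node-name>    # check the Conditions section (Reason / Message for the Ready condition)
kubectl get node <node-name> -o jsonpath='{.status.conditions[?(@.type=="Ready")].reason}: {.status.conditions[?(@.type=="Ready")].message}{"\n"}'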

@rushabhnagda11
Author

rushabhnagda11 commented Jun 28, 2017

This just happened. A node was marked as NotReady by my master, seemingly at random. I was also unable to ssh into the node while it was marked NotReady. Could a very high-CPU process have been running on the node at the time? I'm using only a 2-core machine.

I've had a look at the logs on the node around this time, and there was nothing useful there. Can you tell me how to proceed from here?

Jun 28 12:18:37 k8s-master-18420427-0 dockerd[22509]: I0628 12:18:37.775805 1 controller_utils.go:285] Recording status change NodeNotReady event message for node k8s-agentpool1-18420427-2

Jun 28 12:18:37 k8s-master-18420427-0 dockerd[22509]: I0628 12:18:37.775935 1 controller_utils.go:203] Update ready status of pods on node [k8s-agentpool1-18420427-2]

Jun 28 12:18:37 k8s-master-18420427-0 dockerd[22509]: I0628 12:18:37.776570 1 event.go:217] Event(api.ObjectReference{Kind:"Node", Namespace:"", Name:"k8s-agentpool1-18420427-2", UID:"6b8de3e3-04d8-11e7-9280-000d3af03aaf", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'NodeNotReady' Node k8s-agentpool1-18420427-2 status is now: NodeNotReady

@DonMartin76
Contributor

I am seeing exactly this behavior as well. This morning we had it on a small production cluster of 3 DS12_v2 agent machines: two machines were taken out with NotReady status almost simultaneously, which took down the third in the wake of that, due to too many containers being run on it at once (THAT was our fault; the other two, we don't think so).

Looking at journalctl -u kubelet, I see that at one point the kubelet starts to take down containers, but I can't understand why:

Jul 04 07:37:40 k8s-agentpool1-22420894-1 docker[1528]: I0704 07:37:40.506695    1772 server.go:740] GET /metrics: (83.688577ms) 200 [[Prometheus/1.7.1] 10.240.255.5:54372]
Jul 04 07:37:42 k8s-agentpool1-22420894-1 docker[1528]: I0704 07:37:42.423130    1772 kubelet.go:1797] SyncLoop (DELETE, "api"): "aurora-mobile-green-1188259835-v9lrd_default(79f859fc-
Jul 04 07:37:42 k8s-agentpool1-22420894-1 docker[1528]: I0704 07:37:42.423706    1772 docker_manager.go:1559] Killing container "0c2db2806017d231e7f2df23d477531d08fdae727904df6935f6780
Jul 04 07:37:42 k8s-agentpool1-22420894-1 docker[1528]: I0704 07:37:42.487155    1772 kubelet.go:1797] SyncLoop (DELETE, "api"): "auth-redis-3305003549-cdz0p_default(8e25d914-6070-11e7
Jul 04 07:37:42 k8s-agentpool1-22420894-1 docker[1528]: I0704 07:37:42.487681    1772 docker_manager.go:1559] Killing container "83d5e69f08d1dc7502d4f4cbee29e566924c816a3482312234c9d5c
Jul 04 07:37:42 k8s-agentpool1-22420894-1 docker[1528]: I0704 07:37:42.547828    1772 kubelet.go:1797] SyncLoop (DELETE, "api"): "auth-server-1912670786-4p2gx_default(55062242-608a-11e
(this goes on for quite a while)

It also DELETEs e.g. kube-dns and heapster, for no reason I can see.

I still have the logs for the machines, where should I look, what could I look for?

@DonMartin76
Contributor

It may be related to #906, i.e. the kubelet thinking it has to evict pods for not responding. How could I verify that? We increased the size of our NFS server to a DS3, and since then it hasn't occurred.

@rushabhnagda11
Author

I can confirm this issue. It happens twice a day on a single machine, and my disk read graphs are through the roof. Still trying to figure out what process is doing this.

@bogdangrigg

@DonMartin76 We've had a similar issue yesterday.
For us it seemed to be related to an Azure networking issue (Intermittent connection failures in VPN Gateway)

@JackQuincy Is there a way to externally monitor the reasons why a node is marked as NotReady?
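
(As a stopgap, a crude external poll of each node's Ready condition can be scripted against the API server; this is only a sketch, and the reason strings vary by Kubernetes version:)

while true; do
  date -u
  kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\t"}{.status.conditions[?(@.type=="Ready")].reason}{"\n"}{end}'
  sleep 30
done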

@JackQuincy
Contributor

@colemickens @brendandburns Is there a way like @bogdangrigg is requesting?

@theobolo

To add a case to this problem: I'm using ACS with K8s 1.7.5, and since 9 P.M. all my nodes have been going down randomly; the kubelet stops posting node status on one node after another.

Surprisingly, it happened on my three different clusters at the same time. Take a look at my Grafana monitoring status on the 3 different clusters:

(Grafana monitoring screenshots for clusters 1, 2, and 3)

Is this linked to a possible Azure API connection problem? Is the kubelet failing to post status because the Azure API is unreachable for a while?

Since the problem started at the same time on 3 different clusters it must be linked to something external to Kubernetes.

My clusters are not OOMing or CPU-overloaded:

(Grafana screenshot of cluster resource usage)

See : kubernetes/kubernetes#43516

@DonMartin76
Contributor

DonMartin76 commented Oct 20, 2017

We are also facing similar issues since this morning/yesterday evening (GMT 20:30) on our production cluster. Fortunately, not all nodes go down at the same time (this time), but rather one after the other, so that the application is still stable.

(Screenshot from 2017-10-20, 08:47:26)

Which logs would make sense to look into to find a possible cause for this behavior? kubectl get events does not render anything useful (just states the obvious).

Update: Checked journalctl -u kubelet, and there is not a single occurrence of NotReady in it. FWIW. In /var/log/azure/cluster-provision.log the last entry is from when we provisioned the cluster.

Update 2: I sifted through all available journalctl units, and couldn't find anything suspicious in any log which would correlate with the time of NotReady.

@DonMartin76
Contributor

This behavior is continuing for the entire day right now; we will reprovision the cluster with acs-engine 0.8.0 this afternoon and see whether that helps. The cluster showing this behavior was created with acs-engine v0.5.0 for Kubernetes 1.7.2 three days ago.

@theobolo

theobolo commented Oct 20, 2017

@DonMartin76 I really don't want to use the "reprovision" solution, since I deployed all my clusters just one week ago and I have full production and critical applications running right now... Let me know about your tests, Don; that would be great. :D

BTW all my clusters are still flaky ... Production / Staging / etc ...

Edit: I used acs-engine v0.7.0 with Kubernetes 1.7.5 to deploy my clusters.

@DonMartin76
Contributor

@theobolo Will let you know how it works out. We reprovision production every week with zero downtime deployments anyway, so it is not a huge issue to do it one time out of the ordinary for us.

Which region are you in? We are in North Europe.

@theobolo

Oh really, every week! Did you automate the whole migration process? Like migrating the data disks provisioned on Azure, DNS rules, ingress? Because I have a lot of critical applications (Kafka, Mongo, etc.). I'm able to migrate with zero downtime too, but it takes me half a day to do it across my three clusters.

The only point where I'm struggling is migrating my managed disks; I opened an issue to ask whether it's possible to mount a specific Managed Disk on a pod or a Deployment: #1484

@DonMartin76
Contributor

DonMartin76 commented Oct 20, 2017

We halt some processing for the time of the provisioning, which does not affect the customer; and customer data is kept on Cosmos, which we can just use from both the "old" and "new" cluster.

We clone data disks to a backup resource group first, then clone them back into the new resource group, in short. Fully scripted. We don't use mounts of managed disks to Pods, but implement an NFS server which has them mounted to take the state out of the cluster completely (but we reprovision that thing as well).

Works for us, but YMMV.
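
(Very roughly, and with placeholder names, the disk part of such a script boils down to snapshotting each managed disk into the backup resource group and re-creating a disk from the snapshot in the new resource group; this is only a sketch of the idea, not our actual script:)

az snapshot create --resource-group backup-rg --name mydisk-snap \
  --source /subscriptions/<sub-id>/resourceGroups/old-rg/providers/Microsoft.Compute/disks/mydisk
az disk create --resource-group new-rg --name mydisk \
  --source /subscriptions/<sub-id>/resourceGroups/backup-rg/providers/Microsoft.Compute/snapshots/mydisk-snap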

Nodes are still flaky on the current production cluster (the 0.5.0 acs-engine and k8s 1.7.2 one) FWIW, and I have encountered some breaking changes going from 0.5.0 to 0.8.0 (Master nodes are tainted instead of cordoned, breaking our DaemonSets). Will get back with results as soon as the new cluster with 1.7.7 has taken over and has been running for a while.

@DonMartin76
Contributor

DonMartin76 commented Oct 20, 2017

Update: After an update to acs-engine 0.8.0, and Kubernetes 1.7.7, the problem still occurs.

2017-10-20 17:26:23 +0200 CEST   2017-10-20 16:02:53 +0200 CEST   10        k8s-agent-33434527-0                   Node                                                   Normal    NodeNotReady              controllermanager                 Node k8s-agent-33434527-0 status is now: NodeNotReady

Similar error messages cycle through for the other nodes as well, and there seem to be no apparent reasons. The nodes are available via ssh when remoting into them via the master as a jump host, and do not seem to lose the container workload either, except if the nodes aren't reachable for more than 5 minutes - which triggers an eviction by the master and they are rescheduled somewhere else.

I have no clue where to dig here.

More logs from kubectl get events --watch:

2017-10-20 17:34:14 +0200 CEST   2017-10-20 15:45:28 +0200 CEST   9         k8s-master-33434527-0   Node                Normal    NodeNotReady   controllermanager   Node k8s-master-33434527-0 status is now: NodeNotReady
2017-10-20 17:34:23 +0200 CEST   2017-10-20 16:08:30 +0200 CEST   10        k8s-monitor-33434527-1   Node                Normal    NodeNotReady   controllermanager   Node k8s-monitor-33434527-1 status is now: NodeNotReady
2017-10-20 17:34:39 +0200 CEST   2017-10-20 15:41:12 +0200 CEST   30        k8s-master-33434527-0   Node                Normal    NodeHasSufficientDisk   kubelet, k8s-master-33434527-0   Node k8s-master-33434527-0 status is now: NodeHasSufficientDisk
2017-10-20 17:34:39 +0200 CEST   2017-10-20 15:41:12 +0200 CEST   30        k8s-master-33434527-0   Node                Normal    NodeHasNoDiskPressure   kubelet, k8s-master-33434527-0   Node k8s-master-33434527-0 status is now: NodeHasNoDiskPressure
2017-10-20 17:34:39 +0200 CEST   2017-10-20 15:41:12 +0200 CEST   30        k8s-master-33434527-0   Node                Normal    NodeHasSufficientMemory   kubelet, k8s-master-33434527-0   Node k8s-master-33434527-0 status is now: NodeHasSufficientMemory
2017-10-20 17:34:39 +0200 CEST   2017-10-20 15:42:05 +0200 CEST   10        k8s-master-33434527-0   Node                Normal    NodeReady   kubelet, k8s-master-33434527-0   Node k8s-master-33434527-0 status is now: NodeReady
2017-10-20 17:34:47 +0200 CEST   2017-10-20 15:40:04 +0200 CEST   23        k8s-monitor-33434527-1   Node                Normal    NodeHasSufficientDisk   kubelet, k8s-monitor-33434527-1   Node k8s-monitor-33434527-1 status is now: NodeHasSufficientDisk
2017-10-20 17:34:47 +0200 CEST   2017-10-20 15:40:04 +0200 CEST   23        k8s-monitor-33434527-1   Node                Normal    NodeHasSufficientMemory   kubelet, k8s-monitor-33434527-1   Node k8s-monitor-33434527-1 status is now: NodeHasSufficientMemory
2017-10-20 17:34:47 +0200 CEST   2017-10-20 15:40:04 +0200 CEST   23        k8s-monitor-33434527-1   Node                Normal    NodeHasNoDiskPressure   kubelet, k8s-monitor-33434527-1   Node k8s-monitor-33434527-1 status is now: NodeHasNoDiskPressure
2017-10-20 17:34:47 +0200 CEST   2017-10-20 15:42:02 +0200 CEST   11        k8s-monitor-33434527-1   Node                Normal    NodeReady   kubelet, k8s-monitor-33434527-1   Node k8s-monitor-33434527-1 status is now: NodeReady

And you can see by the COUNT column that this has happened many times since the cluster was created, which was just one to two hours ago - it goes to NodeNotReady and back to NodeReady again, repeatedly.

Update: The problem ceased Saturday morning (Oct 21st) at around 02:30 GMT, and hasn't occurred since. This strongly suggests it was a problem in some underlying Azure infrastructure, but it still is very worrying.
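
(To tally the flapping, something along these lines works; it is only a sketch, and it assumes your kubectl's jsonpath supports filter expressions. The first command lists the transitions in time order, the second gives a rough per-node count of recorded NodeNotReady events:)

kubectl get events --sort-by=.lastTimestamp | grep -E 'NodeNotReady|NodeReady'
kubectl get events -o jsonpath='{.items[?(@.reason=="NodeNotReady")].involvedObject.name}' | tr ' ' '\n' | sort | uniq -c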

@DonMartin76
Contributor

I think this may be the same as this issue: #220 - @theobolo Can you check your kube-controller-manager container logs (in the kube-system namespace) for FullDisruption messages? I have tons of those in my logs.
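
(Roughly how to pull them, with the pod name as a placeholder:)

kubectl -n kube-system get pods | grep controller-manager
kubectl -n kube-system logs <kube-controller-manager-pod> | grep -i FullDisruption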

@theobolo

Hey @DonMartin76, my clusters have been back to a normal state since 5 A.M. this morning; no more NotReady > Ready switches.

Look at the Grafana kubelet metrics:

(Grafana kubelet metrics screenshots for PRODUCTION and STAGING)

That should confirm it was an Azure problem... I'll take a look at my kube-controller-manager logs to confirm your thoughts.

BTW, thanks for describing your migration process; I should look into adding an NFS layer for my apps.

@DonMartin76
Contributor

DonMartin76 commented Oct 23, 2017

Same here, problems just ceased Saturday morning at around 0430 GMT. I opened a support ticket to try to find the root cause on this one; I will return with more information if they find something useful.

@theobolo

@DonMartin76 Same here, opened an Azure issue to find what was wrong ... 👍

@DonMartin76
Contributor

@theobolo Would you mind sharing the Azure region(s) your cluster(s) run on? Also North Europe, or somewhere else?

@theobolo

@DonMartin76 Also North Europe...

@theobolo

@DonMartin76 Not sure if you can see it: https://app.azure.com/h/RHDP-V40/377c3f but the bug is probably related to this...

@DonMartin76
Contributor

@theobolo Thanks for the link, and yes, I can see it. Let's see whether my support issue renders the same result, they are currently investigating. It's interesting though that the same issue was reported on a non-acs-engine provisioned cluster as well (CoreOS with k8s 1.6.4), but also in the North Europe region. Happy they are really looking into it.

@malachma
Member

Hello @theobolo, I currently work on the SR DonMartin76 opened with us. Could you share your SR number with me so I can link both SRs together and use your information as well to get more details about this behaviour?

Thanks in advance

@philipedwards

I also went back to the logs from our previous incident where we lost 3 out of 4 nodes.

k8s-agent-B3E1E1AA-1 was the only node to survive the incident; the other entry is from our master node:

Dec 7 22:26:49 k8s-agent-B3E1E1AA-1 docker[23202]: E1207 22:26:49.770011 23237 kubelet_node_status.go:302] Error updating node status, will retry: failed to get node address from cloud provider: instance not found
Dec 7 22:26:49 k8s-master-B3E1E1AA-0 docker[1806]: E1207 22:26:49.770756 1874 kubelet_node_status.go:302] Error updating node status, will retry: failed to get node address from cloud provider: instance not found

@brendandburns
Member

brendandburns commented Dec 20, 2017 via email

@brendandburns
Member

brendandburns commented Dec 20, 2017 via email

@brendandburns
Member

brendandburns commented Dec 20, 2017 via email

@philipedwards

@brendandburns email sent.

@brendandburns
Member

brendandburns commented Dec 20, 2017 via email

@philipedwards

philipedwards commented Dec 21, 2017

@brendandburns - were you able to find anything in the logs relating to our last outage? Is it possible to go back to the previous incident and confirm it was the same issue?

I would like to add this detail to our incident reports.

@DonMartin76
Contributor

@brendanburns So, in short, this may be due to an unresponsive Azure API, or API calls to Azure that do not reply in the way that is expected? Is it likely that this happens often enough to cause these kinds of outages?

We had periods of 24h+ where this happened on and off. Now this hasn't happened for a while, and the periods are usually shorter than an hour, and the scheduler usually doesn't start to reschedule things just yet.

@brendandburns
Member

brendandburns commented Dec 21, 2017 via email

@DonMartin76
Contributor

Happy to hear this. We have been chasing the cause of these things for over a month now (MS support in Munich and I), and until now we haven't been able to really pin it down. We had narrowed it down to something between Kubernetes and the Azure API, but couldn't reproduce it, and thus it's difficult to get a real diagnosis. REALLY appreciate you pitching in on this, Brendan.

@colemickens
Contributor

I'd like to understand this better, if possible. The bugfix makes sense (sorry, that'd be my bug), but I don't yet understand how it manifests itself as "Node NotReady" (though I'd agree it seems too coincidental to not be related).

It looks like the errors above are coming from the kubelet failing to get its own node status to self-report IPs. I would have assumed that only the KCM would actually handle "instance not found" and consider it deletion of a node.

If the kubelet were handling instance not found and treating it as a deletion, then I would've expected to see the Node object disappear entirely, and would have expected to see errors for the kubelet trying to PUT its node status for a node object that no longer exists.

Additionally, if this is enough to cause things to get "stuck", I'm not sure I understand how things are recovering without the kubelet being restarted such that it tries to do an initial POST to create its own node object.

@brendanburns
Contributor

brendanburns commented Dec 21, 2017 via email

@snebel29

snebel29 commented Jan 2, 2018

I think we're suffering from this issue as well, approximately every two weeks, causing production cluster disruption. It recovers on its own after a while, together with some network interface ready/not-ready errors in the logs. I can provide further details if required...

Thanks

@malachma
Member

malachma commented Jan 2, 2018

@snebel29 The only workaround I'm aware of to get more reliable IP detection, until the merge is available in the k8s version we ship with ACS or AKS, is to enable the metadata service.
To do this, edit /etc/kubernetes/azure.json
and change the flag from: "useInstanceMetadata": false
to: "useInstanceMetadata": true
Then restart the kubelet. Perform this modification on all of your agents.
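
A rough way to apply this on an agent (sketch only; it assumes the flag is present in the file with exactly this spacing, and that the kubelet runs as the systemd unit named kubelet, matching the journalctl -u kubelet commands earlier in this thread):

sudo sed -i 's/"useInstanceMetadata": false/"useInstanceMetadata": true/' /etc/kubernetes/azure.json
sudo systemctl restart kubelet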

@snebel29

snebel29 commented Jan 2, 2018

Hi @malachma,
First of all, thanks for the hint, it's really appreciated!

I don't see that flag set in the azure.json file (it just doesn't exist). I guess the default is true? Since this command

curl -H Metadata:true "http://169.254.169.254/metadata/instance?api-version=2017-04-02"

seems to return good instance metadata results. Or am I missing something?
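
(A quick way to check whether the flag is present in the file at all:)

grep -n useInstanceMetadata /etc/kubernetes/azure.json || echo "useInstanceMetadata not set in azure.json"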

@malachma
Member

malachma commented Jan 2, 2018

@snebel29 It is interesting that you don't see the flag as mentioned. Adding this flag to the config file and setting it to true should help in this case. By default the metadata service is not used, taking 1.8 as the reference. What is your k8s version, by the way?

@snebel29

snebel29 commented Jan 2, 2018

Hi,
My k8s version is

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.1", GitCommit:"1dc5c66f5dd61da08412a74221ecc79208c2165b", GitTreeState:"clean", BuildDate:"2017-07-14T02:00:46Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"6", GitVersion:"v1.6.6", GitCommit:"7fa1c1756d8bc963f1a389f4a6937dc71f08ada2", GitTreeState:"clean", BuildDate:"2017-06-16T18:21:54Z", GoVersion:"go1.7.6", Compiler:"gc", Platform:"linux/amd64"}

and the acs-engine version used to create the cluster was

commit 24a478e20260850ef610b4297b077b3eb21638e9
Author: Adam Greene <adam.greene@gmail.com>
Date:   Wed Jul 5 16:48:46 2017 -0700

    update kubernetes doc (#860)

Thanks

@brendanburns
Contributor

brendanburns commented Jan 3, 2018 via email

@snebel29

snebel29 commented Jan 4, 2018

Hi @brendanburns ,
While it's not fun, if that will solve or work around the issue I'm willing to... Is there a reasonable level of certainty with that workaround? Is there any further information on the fix?

Thanks in advance

@brendandburns
Member

brendandburns commented Jan 4, 2018 via email

@snebel29

snebel29 commented Jan 4, 2018

Hi @brendanburns,
I really appreciate that information; it will allow us to plan according to our internal roadmap of changes. I think it will probably be worth it for us to move to the latest stable release, and we might try the workaround in the meantime if that upgrade gets delayed internally.

Thanks!

@jamoham

jamoham commented Jan 30, 2018

I hit this same problem on version 1.7.9:

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.5", GitCommit:"17d7182a7ccbb167074be7a87f0a68bd00d58d97", GitTreeState:"clean", BuildDate:"2017-08-31T09:14:02Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.9", GitCommit:"19fe91923d584c30bd6db5c5a21e9f0d5f742de8", GitTreeState:"clean", BuildDate:"2017-10-19T16:55:06Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}


$ kubectl get nodes
NAME                        STATUS     AGE       VERSION
k8s-agentpool1-14934812-0   Ready      6d        v1.7.9
k8s-agentpool1-14934812-1   NotReady   6d        v1.7.9
k8s-agentpool1-14934812-2   Ready      6d        v1.7.9
k8s-master-14934812-0       Ready      6d        v1.7.9

When will the fix be cherry-picked into 1.7.x versions?

@nealabh

nealabh commented Feb 12, 2018

Still facing the same issue with 1.7.7. Can you suggest how to upgrade on Azure Container Service?

@brendandburns
Member

brendandburns commented Feb 12, 2018 via email

@mboersma
Member

I'm closing this old issue since acs-engine is deprecated in favor of aks-engine.
