This repository has been archived by the owner on Sep 30, 2020. It is now read-only.

Controller node not being properly tainted #199

Merged: 2 commits merged into master on Jan 11, 2017

Conversation

artushin
Contributor

@artushin artushin commented Jan 4, 2017

/opt/bin/taint-and-uncordon had a syntax issue in its kubectl taint command. It was failing with "error: at least one taint update is required", and the controller node was still getting pods scheduled to it. This PR fixes the syntax as per http://kubernetes.io/docs/user-guide/kubectl/kubectl_taint/
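
For reference, the corrected invocation looks roughly like the following (a sketch based on the linked kubectl docs and the script later in this thread; the --server address is the controller's local insecure API port that the script already uses):

    # apply the master taint so that ordinary pods are not scheduled onto the controller
    kubectl --server=http://127.0.0.1:8080 taint node "$(hostname)" \
      "node.alpha.kubernetes.io/role=master:NoSchedule"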

@codecov-io

codecov-io commented Jan 4, 2017

Current coverage is 68.87% (diff: 100%)

Merging #199 into master will decrease coverage by 0.12%

@@             master       #199   diff @@
==========================================
  Files             4          4          
  Lines          1132       1134     +2   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
  Hits            781        781          
- Misses          262        263     +1   
- Partials         89         90     +1   

Powered by Codecov. Last update 2b29fc3...fb34129

@mumoshu
Contributor

mumoshu commented Jan 6, 2017

@artushin Thanks for your help!
I'm not sure how a mistake like this slipped into the v0.9.2-rc.2 release, but your fix does seem to be correct.
Please give me a day or two to test before/after merging your fix by running our e2e tests.

@mumoshu mumoshu added this to the v0.9.3-rc.3 milestone Jan 6, 2017
@mumoshu
Contributor

mumoshu commented Jan 6, 2017

Update: I've confirmed that @artushin's fix does work as expected:

$ kubectl describe node ip-10-0-0-193.ap-northeast-1.compute.internal
Name:  			ip-10-0-0-193.ap-northeast-1.compute.internal
Labels:			beta.kubernetes.io/arch=amd64
       			beta.kubernetes.io/instance-type=t2.medium
       			beta.kubernetes.io/os=linux
       			failure-domain.beta.kubernetes.io/region=ap-northeast-1
       			failure-domain.beta.kubernetes.io/zone=ap-northeast-1a
       			kube-aws.coreos.com/autoscalinggroup=kubeawstest1-AutoScaleController-1MXSGQDT5DGKF
       			kube-aws.coreos.com/launchconfiguration=kubeawstest1-LaunchConfigurationController-1EGZPO0LU1JJO
       			kubernetes.io/hostname=ip-10-0-0-193.ap-northeast-1.compute.internal
Taints:			node.alpha.kubernetes.io/role=master:NoSchedule
*snip*

@artushin
Contributor Author

artushin commented Jan 6, 2017

Taint-and-cordon still fails:

taint-and-uncordon[25536]: [14820.578848] hyperkube[5]: error: Node 'ip-10-0-0-224.us-west-2.compute.internal' already has a taint with key (node.alpha.kubernetes.io/role) and effect (NoSchedule), and --overwrite is false
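
A minimal sketch of an idempotent form that avoids this failure on re-runs (assuming the same key and effect as in the script; --overwrite lets the command succeed even when the taint is already present):

    # safe to re-run: --overwrite updates the existing taint instead of erroring
    kubectl --server=http://127.0.0.1:8080 taint node --overwrite "$(hostname)" \
      "node.alpha.kubernetes.io/role=master:NoSchedule"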

With that fixed, I'm seeing a failure with

taint-and-uncordon[26251]: rm: unable to remove pod "1696c88c-99f4-4810-bb77-d406315f1452": remove /var/lib/rkt/pods/exited-garbage/1696c88c-99f4-4810-bb77-d406315f1452/stage1/rootfs: device or resource busy

I haven't used rkt, so that'll require a bit of investigation. Also, my controller eventually becomes CPU bound. Maybe because of this issue and a ton of un-garbage-collected rkt images? I dunno.

@artushin
Contributor Author

artushin commented Jan 6, 2017

sudo rkt run
...
 --uuid-file-save=/var/run/coreos/taint-and-uncordon.uuid \
...
sudo rkt rm --uuid-file=/var/run/coreos/taint-and-uncordon.uuid

seems broken. Can you just rely on regular gc?
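
For reference, rkt's own garbage collection can be invoked directly if the per-run rm is dropped (a sketch, assuming the default rkt data directory; Container Linux also ships a periodic rkt-gc unit, if I remember right):

    # reap exited pods in one sweep instead of rm'ing a specific uuid after every run
    sudo rkt gc --grace-period=0s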

Getting rid of that seems to have fixed my CPU throttle issue too:
[screenshots: controller CPU usage before and after removing the rkt rm step, 2017-01-06]

@reiinakano
Contributor

reiinakano commented Jan 8, 2017

Any updates on this? Is the snippet

sudo rkt run
...
--uuid-file-save=/var/run/coreos/taint-and-uncordon.uuid
...
sudo rkt rm --uuid-file=/var/run/coreos/taint-and-uncordon.uuid

safe to remove without affecting other parts?

I'm having problems with the CPU usage (exactly as artushin experienced). It turns out that if I reboot the instance, the CPU usage resets; however, weird things happen after rebooting, e.g. pods not being able to see each other, multiple restarts of the DNS and apiserver containers, etc.

Update:
Went ahead and removed it, then rebuilt my cluster. Running smoothly now.

@redbaron
Contributor

redbaron commented Jan 8, 2017

@reiinakano, @artushin these commands delete the container which rkt created to run the command; do you have any idea how it might be related to high CPU time?

@artushin
Contributor Author

artushin commented Jan 8, 2017

@redbaron When I was seeing the issue, rkt list had a ton of errors like

list: Unable to read pod fb7842e7-ab0a-49fb-845a-2399b36a8151 manifest:
  error reading pod manifest

so I assume those failed taint-and-uncordon containers were being kept around, possibly increasing CPU usage from failing GC? Someone with more rkt knowledge should look into the failure case of running the taint-and-uncordon job. In the meantime, if the script passes, the CPU usage issue shouldn't occur.

I don't understand why simply getting that script to run without error relieved so much CPU though, because unlike @reiinakano, I didn't rebuild the cluster; I just edited the script in /opt.
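
One rough way to check whether these failed runs are actually piling up (assuming the default rkt data directory):

    # list pods and count the ones stuck in the exited-garbage state
    sudo rkt list --full
    sudo ls /var/lib/rkt/pods/exited-garbage | wc -l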

@whereisaaron
Contributor

Even after editing the '/opt/bin' script so that the taint works, the task still just fails over and over. As @artushin noted, the sudo rkt rm --uuid-file=/var/run/coreos/taint-and-uncordon.uuid always seems to fail with rm: unable to remove pod.

Jan 08 21:15:57 ip-1.1.1.1.ap-southeast-2.compute.internal taint-and-uncordon[13188]: rm: unable to remove pod "53f4b000-7782-4b60-81b5-2997$
Jan 08 21:15:57 ip-1.1.1.1.ap-southeast-2.compute.internal systemd[1]: kube-node-taint-and-uncordon.service: Main process exited, code=exite$
Jan 08 21:15:57 ip-1.1.1.1.ap-southeast-2.compute.internal systemd[1]: kube-node-taint-and-uncordon.service: Unit entered failed state.
Jan 08 21:15:57 ip-1.1.1.1.ap-southeast-2.compute.internal systemd[1]: kube-node-taint-and-uncordon.service: Failed with result 'exit-code'. 

These failed but not-cleaned-up rkt pods seem to mount up and grow a docker clean-up task that runs about every 60s. I don't know how rkt works either, but it may be this clean-up task that gradually grows and increases the load as the number of failed containers mounts.
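
To watch that clean-up task, the recurring errors can be followed in the docker journal (this just observes the symptom shown in the logs further down this thread):

    # the "No such container" errors repeat roughly every 60 seconds
    journalctl -u docker.service -f | grep "No such container"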

You can see the slow but inevitable death-spiral on this controller I just made. The taint-and-uncordon script is patched to always work, but the rkt task fails each time, the list of pods in the 60s clean-up task gets longer and longer, and the CPU load climbs higher and higher.

[screenshot: controller CPU load climbing steadily while the rkt task keeps failing]

@whereisaaron
Contributor

Hi, I remade the same cluster as above, this time fully removing the taint-and-uncordon task from the cloud-init file. Comparing the CPU profiles, it is back to normal and comparable with my other clusters, settling at 15% soon after launch. When the taint-and-uncordon task was still there, the CPU was never below 60% and slowly climbed to 100% over time.

[screenshot: controller CPU settling around 15% after launch with the taint-and-uncordon task removed]

@whereisaaron
Contributor

Although the controller is stable with taint-and-uncordon removed, that problem with the periodic clean-up task persists; it just doesn't get any worse. Every time it is the same eight containers mentioned in the errors every 60 seconds, plus the error opening '/run/docker/libcontainerd/docker-containerd.pid'.

I guess there is a deeper problem with the rkt container tasks that we just don't notice until one goes awry like taint-and-uncordon does. We might be able to reduce the controller CPU load further by solving this problem.

Jan 08 23:51:33 ip-1-1-1-1.region.compute.internal kubelet-wrapper[1537]: E0108 23:51:33.217869    1537 container_manager_linux.go:625] error opening pid file /run/docker/libcontainerd/docker-containerd.pid: open /run/docker/libcontainerd/docker-containerd.pid: no such file or directory
Jan 08 23:51:33 ip-1-1-1-1.region.compute.internal dockerd[1421]: time="2017-01-08T23:51:33.233938247Z" level=error msg="Handler for GET /containers/3da5a08fd693749466ccfd1ed1b52164619bf86c4a2a39f1c0caafdf6c450ccd/json returned error: No such container: 3da5a08fd693749466ccfd1ed1b52164619bf86c4a2a39f1c0caafdf6c450ccd"
Jan 08 23:51:33 ip-1-1-1-1.region.compute.internal dockerd[1421]: time="2017-01-08T23:51:33.267776846Z" level=error msg="Handler for GET /containers/73ddc461d2b1f0746853940eadb7951b939d0ae4e0a5257a41e17bbf48646293/json returned error: No such container: 73ddc461d2b1f0746853940eadb7951b939d0ae4e0a5257a41e17bbf48646293"
Jan 08 23:51:33 ip-1-1-1-1.region.compute.internal dockerd[1421]: time="2017-01-08T23:51:33.273305898Z" level=error msg="Handler for GET /containers/1bc2af7f3283cd951ad93ed8918761bac247cef4d6b366fa0e071addd3583003/json returned error: No such container: 1bc2af7f3283cd951ad93ed8918761bac247cef4d6b366fa0e071addd3583003"
Jan 08 23:51:33 ip-1-1-1-1.region.compute.internal dockerd[1421]: time="2017-01-08T23:51:33.282636512Z" level=error msg="Handler for GET /containers/f68bdb07afd94c87b2c55060894bef4b4e30a9a9b5e56c4e068de74ac34d7e8e/json returned error: No such container: f68bdb07afd94c87b2c55060894bef4b4e30a9a9b5e56c4e068de74ac34d7e8e"
Jan 08 23:51:33 ip-1-1-1-1.region.compute.internal dockerd[1421]: time="2017-01-08T23:51:33.283499144Z" level=error msg="Handler for GET /containers/e18bc18d22fc7cf5722ed3bf39ade10e59b78809ef5786c8fc71d77fd0869132/json returned error: No such container: e18bc18d22fc7cf5722ed3bf39ade10e59b78809ef5786c8fc71d77fd0869132"
Jan 08 23:51:33 ip-1-1-1-1.region.compute.internal dockerd[1421]: time="2017-01-08T23:51:33.284049871Z" level=error msg="Handler for GET /containers/e071cceebc9c4a2813fd3c61be60e4388b6d98bbee77092fd1a3439fb64af2ff/json returned error: No such container: e071cceebc9c4a2813fd3c61be60e4388b6d98bbee77092fd1a3439fb64af2ff"
Jan 08 23:51:33 ip-1-1-1-1.region.compute.internal dockerd[1421]: time="2017-01-08T23:51:33.284594161Z" level=error msg="Handler for GET /containers/3755d84cb13cc852423bb29635476a7a4f3d9053d254c376432add0af64b61d0/json returned error: No such container: 3755d84cb13cc852423bb29635476a7a4f3d9053d254c376432add0af64b61d0"
Jan 08 23:51:33 ip-1-1-1-1.region.compute.internal dockerd[1421]: time="2017-01-08T23:51:33.285088451Z" level=error msg="Handler for GET /containers/a30eb84ec2e8f61118b9a4715b11d640db1d1dc91d86e4a871f4657697ee2e1d/json returned error: No such container: a30eb84ec2e8f61118b9a4715b11d640db1d1dc91d86e4a871f4657697ee2e1d"

@redbaron
Contributor

redbaron commented Jan 9, 2017

Do these container IDs have anything in common? Are they from the same failed/exited container?

Update: also, the previous errors you reported were from rkt list; these ones are from dockerd.

@whereisaaron
Contributor

@redbaron I don't know what those container IDs are from; there was no obvious mention of them in journalctl since boot. The 60s clean-up errors were always dockerd errors, and that list grew while the rkt task was repeatedly failing. I have no clue how/if they are related. It may be a red herring; I don't actually know that the growing clean-up errors are behind the growing CPU load. I just know that taint-and-uncordon failing continuously is bad news for the stability of my controller nodes :-)

@mumoshu
Contributor

mumoshu commented Jan 10, 2017

@redbaron @artushin @whereisaaron @reiinakano I'd been seeing the rkt rm errors while testing my cluster w/ the fix, but I had no luck figuring out the implications/effects of those. Thanks for all the investigation you've done 🙇

AFAIK the rkt rm issue is caused by rkt/rkt#3181 and workarounds I'm aware of are:

Edit: The first option doesn't work in this case.

cleaning pod resources.
Jan 10 00:56:56 ip-10-0-0-113.ap-northeast-1.compute.internal taint-and-uncordon[21244]: rm: unable to remove pod "f2b04f46-6b60-4967-956c-59167d377079": remove /var/lib/rkt/pods/exited-garbage/f2b04f46-6b60-4967-956c-59167d377079/stage1/rootfs: device or resource busy

@mumoshu
Contributor

mumoshu commented Jan 10, 2017

I'd rather avoid rkt here so that we can completely solve this issue, at least until the underlying bug is fixed in rkt v1.22.0 via rkt/rkt#3486. For now, I'd suggest changing /opt/bin/taint-and-uncordon to something like:

      #!/bin/bash -e

      hostname=$(hostname)

      docker run --rm --net=host \
        -v /etc/kubernetes:/etc/kubernetes \
        -v /etc/resolv.conf:/etc/resolv.conf \
        {{.HyperkubeImageRepo}}:{{.K8sVer}} /bin/bash \
          -vxc \
          'echo tainting this node; \
           hostname="'${hostname}'"; \
           kubectl="/kubectl --server=http://127.0.0.1:8080"; \
           taint="$kubectl taint node $hostname"; \
           $taint "node.alpha.kubernetes.io/role=master:NoSchedule"; \
           echo done. ;\
           echo uncordoning this node; \
           $kubectl uncordon $hostname;\
           echo done.'

@whereisaaron
Contributor

whereisaaron commented Jan 10, 2017

Agree, best to avoid rkt until that gnarly issue is fixed.

  • Suggest '-vxec' for better error handling and to avoid uncordoning an untainted node by mistake if the taint command throws an error (as happened with the '=' typo).
  • Suggest '--overwrite' just in case of a re-run due to the uncordon throwing an error the first time or some other error (as happened with the rkt task).
  • Suggest a bunch of pedantic and unnecessary quoting and message changes for consistency 😸
      #!/bin/bash -e

      hostname=$(hostname)

      docker run --rm --net=host \
        -v /etc/kubernetes:/etc/kubernetes \
        -v /etc/resolv.conf:/etc/resolv.conf \
        {{.HyperkubeImageRepo}}:{{.K8sVer}} /bin/bash \
          -vxec \
          'echo "tainting this node."; \
           hostname="'${hostname}'"; \
           kubectl="/kubectl --server=http://127.0.0.1:8080"; \
           taint="$kubectl taint node --overwrite"; \
           $taint "$hostname" "node.alpha.kubernetes.io/role=master:NoSchedule"; \
           echo "done."; \
           echo "uncordoning this node."; \
           $kubectl uncordon "$hostname"; \
           echo "done."'

@mumoshu
Contributor

mumoshu commented Jan 10, 2017

@whereisaaron I agree on every point 👍

@mumoshu
Contributor

mumoshu commented Jan 10, 2017

Hi @artushin, would you mind updating your PR to use docker instead of rkt as @whereisaaron suggested in #199 (comment) if you're also ok with it?

using script supplied in @whereisaaron's comment in PR 199.
@artushin
Contributor Author

lgtm. @whereisaaron, switching to exactly what you provided.

whereisaaron added a commit to whereisaaron/kube-aws that referenced this pull request Jan 10, 2017
rkt has a gnarly bug (rkt/rkt#3181) that won't be fixed in a hurry (rkt/rkt#3486). It leads to continuous task failures that eventually totally wreck worker nodes (kubernetes-retired#244). In the meantime we can use docker just as easily for this simple task. This workaround was discussed in kubernetes-retired#199.
@mumoshu mumoshu merged commit 18c5c03 into kubernetes-retired:master Jan 11, 2017
@mumoshu
Contributor

mumoshu commented Jan 11, 2017

LGTM. Thanks to all for your contribution! 🙇 @redbaron @artushin @whereisaaron @reiinakano

@reiinakano
Contributor

Quick question: could I pull in this new merge and do kube-aws update without having to rebuild my cluster? Sorry, I don't really understand the internals of kube-aws.

@mumoshu
Contributor

mumoshu commented Jan 11, 2017

@reiinakano Yes, it is designed to work in this case. However, please be aware that it isn't strictly tested with every combination of changes. Would you mind taking a look at the "full update" section (which is your case) of the relevant kube-aws doc for more info on what kinds of updates kube-aws is intended to support and how?
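
Roughly, the flow would be something like the following; treat the exact sub-commands as assumptions, since they depend on your kube-aws version, and see the "full update" section mentioned above for the authoritative procedure:

    # from the directory containing cluster.yaml, with a kube-aws build that includes this fix
    kube-aws render     # assumption: regenerates the stack template/userdata with the fixed script
    kube-aws validate   # sanity-check the rendered assets
    kube-aws update     # apply the change as a CloudFormation stack update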

@mumoshu
Contributor

mumoshu commented Jan 11, 2017

Also note that if you are going to update worker nodes and you have not yet enabled experimental.nodeDrainer, set correct grace periods for your pods, and provisioned enough replicas, you may encounter some downtime.

kube-aws replaces your nodes one by one, so you have to teach kube-aws, your Kubernetes cluster, and your pods how to tolerate a single-node replacement.
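
For reference, what the node drainer does before each node is terminated is roughly equivalent to running the following against that node (an illustration, not the exact implementation; <node-name> is a placeholder):

    # cordon the node and evict its pods, respecting their termination grace periods
    kubectl drain <node-name> --ignore-daemonsets --force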

@reiinakano
Contributor

I see. Thanks @mumoshu !

kylehodgetts pushed a commit to HotelsDotCom/kube-aws that referenced this pull request Mar 27, 2018
rkt has a gnarly bug (rkt/rkt#3181) that won't be fixed in a hurry (rkt/rkt#3486). It leads to continuous task failures that eventually totally wreck worker nodes (kubernetes-retired#244). In the meantime we can use docker just as easily for this simple task. This workaround was discussed in kubernetes-retired#199.
kylehodgetts pushed a commit to HotelsDotCom/kube-aws that referenced this pull request Mar 27, 2018
Controller node not being properly tainted