Controller node not being properly tainted #199
Conversation
Current coverage is 68.87% (diff: 100%)

@@            master    #199    diff @@
========================================
  Files            4       4
  Lines         1132    1134      +2
  Methods          0       0
  Messages         0       0
  Branches         0       0
========================================
  Hits           781     781
- Misses         262     263      +1
- Partials        89      90      +1
@artushin Thanks for your help!
Update: I've confirmed that @artushin's fix does work as expected:
Taint-and-uncordon still fails:
With that fixed, I'm seeing a failure with
Haven't used rkt, so that'll require a bit of investigation. Also my controller is becoming CPU bound eventually. Maybe because of this issue and a ton of un-garbage-collected rkt images? I dunno.
Any updates on this? Is the snippet
safe to remove without affecting other parts? I'm having problems with the CPU usage (exactly as artushin experienced). Turns out that if I reboot the instance, the CPU usage resets; however, weird things happen after rebooting, e.g. pods not being able to see each other, multiple restarts of DNS and apiserver containers, etc. Update:
@reiinakano, @artushin these commands delete container which
@redbaron When I was seeing the issue,
so I assume those failed taint-and-uncordon containers were being kept around, possibly increasing CPU usage from failing GC? Someone with more rkt knowledge should look into the failure case of running the taint-and-uncordon job. In the meantime, if the script passes, the CPU usage issue shouldn't occur. Don't understand why simply getting that script to run without error relieved so much CPU though, because unlike @reiinakano, I didn't rebuild the cluster; I just edited the script in
Even after editing the '/opt/bin' script so that the taint works, the task still just fails over and over. As @artushin noted, the
These failed but not-cleaned-up rkt pods seem to mount up and grow a docker clean-up task that runs about every 60s. I don't know how rkt works either, but it may be this clean-up task that gradually grows and increases the load as the number of failed containers mounts. You can see the slow but inevitable death-spiral on this controller I just made. The taint-and-uncordon is patched to always work, but the rkt task fails each time, the list of pods in the 60s clean-up task gets longer and longer, and the CPU load climbs higher and higher.
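For anyone hitting the same pile-up, a minimal sketch of how the exited rkt pods could be inspected and cleared manually on an affected controller (standard rkt CLI only; the zero grace period below is just to force immediate collection and is not a kube-aws default):

#!/bin/bash
# List all rkt pods, including exited/failed ones, to see how many have accumulated.
rkt list --full

# Force garbage collection of exited pods right away instead of waiting
# for the default grace period.
rkt gc --grace-period=0s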
Hi, I remade the same cluster as above, this time fully removing the
Although the controller is stable with it removed, I guess there is a deeper problem with the rkt container tasks that we just don't notice until one goes awry like
Do these container IDs have anything in common? Are they from the same failed/exited container? Update: also, the previous errors you reported were from rkt list; these ones are from dockerd.
@redbaron I don't know what those container IDs are from; there was no obvious mention of them in journalctl since boot. The 60s clean-up errors were always dockerd errors, and that list grew while the rkt task was repeatedly failing. I have no clue how/if they are related. It may be a red herring; I don't actually know that the growing clean-up errors are behind the growing CPU load. I just know that
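In case it helps the investigation, a rough sketch of how one of those IDs could be cross-checked against both runtimes (the ID below is a placeholder; these are plain docker/journalctl/rkt commands, nothing kube-aws specific):

#!/bin/bash
ID=<container-id-from-the-cleanup-errors>

# Does dockerd still know about it (including exited/dead containers)?
docker ps -a --no-trunc | grep "$ID"

# Did it ever show up in the docker daemon's own logs?
journalctl -u docker.service --no-pager | grep "$ID"

# Does rkt own a pod with a matching UUID?
rkt list --full | grep "$ID"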
@redbaron @artushin @whereisaaron @reiinakano I'd been seeing the AFAIK the
Edit: The first option doesn't work in this case.
I'd rather avoid rkt here so that we can completely work around this issue until it is fixed in rkt v1.22.0 via rkt/rkt#3486. Therefore, for now, I'd suggest changing the script to:

#!/bin/bash -e
hostname=$(hostname)
docker run --rm --net=host \
  -v /etc/kubernetes:/etc/kubernetes \
  -v /etc/resolv.conf:/etc/resolv.conf \
  {{.HyperkubeImageRepo}}:{{.K8sVer}} /bin/bash \
  -vxc \
  'echo tainting this node; \
   hostname="'${hostname}'"; \
   kubectl="/kubectl --server=http://127.0.0.1:8080"; \
   taint="$kubectl taint node $hostname"; \
   $taint "node.alpha.kubernetes.io/role=master:NoSchedule"; \
   echo done. ; \
   echo uncordoning this node; \
   $kubectl uncordon $hostname; \
   echo done.'
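If you swap the unit over to this docker-based script, a quick way to confirm it did what it should, assuming kubectl is pointed at the cluster and <controller-node-name> is replaced with the controller's actual node name (illustrative commands only):

# The controller should now carry the NoSchedule taint...
kubectl describe node <controller-node-name> | grep -i taints

# ...and should no longer be marked SchedulingDisabled after the uncordon.
kubectl get node <controller-node-name>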
Agree, best to avoid rkt until that gnarly issue is fixed.
@whereisaaron I agree on every point 👍
Hi @artushin, would you mind updating your PR to use docker instead of rkt as @whereisaaron suggested in #199 (comment), if you're also ok with it?
using the script supplied in @whereisaaron's comment in PR #199.
lgtm. @whereisaaron, switching to exactly what you provided.
rkt has a gnarly bug (rkt/rkt#3181) that won't be fixed in a hurry (rkt/rkt#3486). It leads to continuous task failures that eventually totally wreck worker nodes (kubernetes-retired#244). In the meantime we can use docker just as easily for this simple task. This workaround was discussed in kubernetes-retired#199.
LGTM. Thanks to all for your contributions! 🙇 @redbaron @artushin @whereisaaron @reiinakano
Quick question, but could I pull in this new merge and do
@reiinakano Yes, it is designed to work in this case. However, please be aware that it isn't strictly tested with every combination of changes. Would you mind taking a look at the "full update" (which is your case) section of the relevant kube-aws doc for more info regarding what kind of updates kube-aws is intended to support, and how?
Also note that if you are going to update worker nodes and you had not yet enabled it, kube-aws does replace your nodes one-by-one, hence you have to teach kube-aws, your Kubernetes cluster, and your pods how to tolerate single-node replacement.
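For reference, a minimal sketch of the "full update" flow being referred to, assuming a typical kube-aws setup of that era (exact command names and flags depend on your kube-aws version and where the cluster assets are stored; bucket and prefix below are placeholders):

# Re-render the stack/userdata after editing cluster.yaml or the templates
# (older releases use plain `kube-aws render` without a subcommand).
kube-aws render stack

# Sanity-check and then apply the CloudFormation update that replaces nodes one by one.
kube-aws validate --s3-uri s3://<your-bucket>/<your-prefix>
kube-aws update --s3-uri s3://<your-bucket>/<your-prefix>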
I see. Thanks @mumoshu!
rkt has a gnarly bug (rkt/rkt#3181) that won't be fixed in a hurry (rkt/rkt#3486). It leads to continuous task failures that eventually totally wreck worker nodes (kubernetes-retired#244). In the meantime we can use docker just as easily for this simple task. This workaround was discussed in kubernetes-retired#199.
Controller node not being properly tainted
/opt/bin/taint-and-uncordon had a syntax issue with the kubectl taint command. It was failing with
error: at least one taint update is required
and the controller node was still getting pods scheduled to it. This PR fixes the syntax as per http://kubernetes.io/docs/user-guide/kubectl/kubectl_taint/
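For anyone landing here from the same error: as far as I can tell, kubectl prints "at least one taint update is required" when it cannot parse a valid taint spec from the arguments. A small sketch of the documented key=value:Effect form (the node name is a placeholder; the API server address matches the script above):

# Taint the controller so new pods are not scheduled onto it unless they tolerate the taint.
kubectl --server=http://127.0.0.1:8080 taint node <controller-node-name> \
  "node.alpha.kubernetes.io/role=master:NoSchedule"

# The same taint can be removed again with a trailing minus on key:Effect.
kubectl --server=http://127.0.0.1:8080 taint node <controller-node-name> \
  "node.alpha.kubernetes.io/role:NoSchedule-"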