Weave not working correctly leads to containers stuck in ContainerCreating #3384
The symptom looks very much like #2797, which was fixed before 2.3.0.
It's unfortunate that I lost the logs of the weave container, but it should be possible to replicate it (just start up and shut down nodes, which should be easy to script). I will give it a shot early next week.
To be clear, it only runs cleanup on new pods starting, so if you know for sure you didn't start any new pods, that would explain why no cleanup ran.
Weave is deployed as a DaemonSet, so when the nodes are scaled up again a new pod was running; that should have triggered the cleanup, shouldn't it?
Yes, if a new pod started we need the logs to see why it didn't clean up.
The only option I have is to try to reproduce the issue with the same configuration. If I manage to get something I will update the issue with the details.
@Raffo I just learnt that ☝️ from #3372 (comment). Key words: "same IP", which means with AWS autoscaling groups this will never happen. Could this be the problem? I've pretty much subscribed to every issue on weave related to this problem and I've been using weave since around 1.8.x (see thread from #2797 (comment)). And we survive by cleaning up manually. It's causing serious production issues for us (we scale quite a lot!) and I'm losing confidence (actually I've kinda given up today). Unfortunately we committed to weave early, and changing CNI isn't trivial but is possible (I asked on …). If you find a solution, please do share! 😅
@itskingori please open a new issue with logs so we can diagnose what is causing you issues. I have commented on "same IP", which seems to be a misunderstanding. Any Weave Net pod starting up should run the reclaim process, so just post the logs from that pod and it should give a clue.
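For anyone hitting this, a minimal way to grab those logs is sketched below; it assumes the standard weave-net DaemonSet from the official manifests (label `name=weave-net`, container named `weave`, in the `kube-system` namespace).

```bash
# Dump the logs of the 'weave' container from every weave-net pod.
# Assumes the label and container name used by the official weave-net manifests.
for pod in $(kubectl get pods -n kube-system -l name=weave-net -o jsonpath='{.items[*].metadata.name}'); do
  echo "=== $pod ==="
  kubectl logs -n kube-system "$pod" -c weave
done
```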
@itskingori I will try as much as I can to reproduce it. Changing CNI provider is a possible alternative for me as well, but I'd love to stick with weave as the support has been amazing from the folks at weave and they deserve help in fixing that. Let's see what we can do 😄 |
@Raffo ...
There's someone who's made a comment saying "A few people have tried doing this and not had success. We recommend just creating a new cluster", so I'd advise against it now. 😅
Yes they have. I would love to contribute but I'm out of my depth here ... plus networking is not my strong suit, and the fact that this is hitting our production cluster is really making me sweat! @bboreham I could, but there are already issues that cover anything I have to say. I left a comment here because it's new and similar to #3372, which I'm tracking. There's also #3310, which is different and already filed. I've allocated some time to investigate this issue this week because of the severity and will add anything I find to the aforementioned issues.
@itskingori as @bboreham says, we really...
I ask for a separate issue for each report because it keeps the conversation focused. Similar-looking issues can be very different. Yes, there are people saying similar things, but of all the thousands and thousands of instances running, nobody has posted a single log file of the …
@Raffo can you explain what you meant by this? You posted evidence of running out of available IPs, but nothing I can see as a "clash".
@bboreham you are right, maybe they didn't clash, I will update the text.
We got bit by this problem today. We had it in the distant past, but we thought it was resolved. We are using an ASG, so we don't have new nodes with the same IP. We rotate through our cluster every night, and delete each node one at a time. So all of the weave pods are restarted, and all of the nodes are restarted. It took a while to produce this, definitely not just one night. I suspect we might be rotating nodes often enough that they come online with IPs that used to exist previously, but are now another MAC, same IP. I suspect this because I occasionally learn from ssh that the host key is wrong, which means that this IP has been used before.
Below is a copy of IPAM status before we used rmpeer to clean things up. There is some interesting info to be seen there. For one, it is evident that our IP range is small enough that we see the same IP address for a host multiple times with a different MAC, even within the below logs, for example these two:
Full log (which represents probably 3 weeks of running and recycling nodes)
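For readers who want to capture the same output on their own cluster, a sketch of how to pull the IPAM status is below; it assumes the standard weave-net DaemonSet and the usual image layout where the `weave` script lives at `/home/weave/weave`.

```bash
# Print IPAM status from one weave-net pod.
# Assumes the official weave-net DaemonSet (label name=weave-net, container 'weave').
WEAVE_POD=$(kubectl get pods -n kube-system -l name=weave-net -o jsonpath='{.items[0].metadata.name}')
kubectl exec -n kube-system "$WEAVE_POD" -c weave -- /home/weave/weave --local status ipam
```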
@bboreham given that we cycle through our cluster frequently, and ran into this issue, we would like to implement a workaround until the issue is resolved. My understanding from reading most of the issues is that part of the difficulty in applying a fix is that it is hard to know when a node has gone permanently and when it is temporary. In our case, we control the process of terminating nodes, so we know when it is. Our script drains and then terminates nodes in a predictable order. If we could do something at the time we terminate a node, what would the 'right thing' to do be? Would it work to simply run …? Or would it be better to randomly select a node and run …? One of our annoying details is that we can no longer run kubectl exec on the weave pods; they are using hostNetwork true, so we can't exec on them remotely.
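Not an answer to the question above, just a sketch of where such a termination-time hook could sit in a rotation script like the one described; the node list variable, the AWS instance lookup, and the placeholder cleanup step are all assumptions, not a recommendation from the maintainers.

```bash
# Sketch of a nightly rotation loop with a placeholder for a weave cleanup hook.
# NODES_TO_ROTATE and the instance lookup are assumptions for illustration only.
for node in $NODES_TO_ROTATE; do
  kubectl drain "$node" --ignore-daemonsets --delete-local-data

  # <-- a weave cleanup step would go here, once it is clear which command
  #     (e.g. some form of rmpeer/reset) is safe to run at this point

  instance_id=$(aws ec2 describe-instances \
    --filters "Name=private-dns-name,Values=$node" \
    --query 'Reservations[0].Instances[0].InstanceId' --output text)
  aws ec2 terminate-instances --instance-ids "$instance_id"
done
```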
Thanks for posting your update regarding this issue! In the meantime I have spun up a test that should reproduce a similar problem (I'm just spinning nodes up and down in a loop; let's see if it leads to the same issues in a reproducible way).
@Raffo sounds good. Let's suppose that we can duplicate it... what is the fix then? It seems that even when we know it happens, the fix isn't clear. How do you know how long to wait for a node before you give up on it? For the record, in our case the suggestion I think I read above, one week, would work. It took us three weeks to get this to happen. Could we go with a configurable "how long to wait" value, after which unreachable nodes are removed, with a default value of one week?
(Correction: it's a peer name, not a MAC). The IP addr re-use should definitely cause problems. The IPAM reclaimer identifies nodes by their (host)name (https://github.com/weaveworks/weave/blob/v2.4.0/prog/kube-utils/main.go#L114), so a dead node cannot be rmpeer'd if there is a running node with the same IP addr. We should address this. To better understand what happens in your case, I need full logs (not ipam status) of the …
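For context, the manual cleanup being discussed (rmpeer) can be run from any live Weave Net pod; a sketch is below, assuming the standard DaemonSet layout and that `$PEER_NAME` is the name of a peer shown as unreachable in `status ipam` and known to be permanently gone.

```bash
# Manually reclaim the IP ranges owned by a dead peer.
# Only do this for peers that are known to be gone for good.
WEAVE_POD=$(kubectl get pods -n kube-system -l name=weave-net -o jsonpath='{.items[0].metadata.name}')
kubectl exec -n kube-system "$WEAVE_POD" -c weave -- /home/weave/weave --local rmpeer "$PEER_NAME"
```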
Well, in only one night of running I managed to replicate the issue.
As requested by @bboreham and @brb, this is the log of a weave container (v2.3, as stated in this issue) starting after the problem started happening:
From my understanding, and I apologize if I am wrong, any protocol that deals with peer discovery will have to assume at some point that the peers are gone, and it's totally normal in a cloud scenario that IP addresses will be reused, so this problem will happen for sure; it is only a matter of time. I guess a timeout on nodes "being gone" could already address this, but to be on the safe side we have to deal with the conflicts in general. WDYT? /cc @brb
Thanks for trying to replicate, but which issue do you refer to? The log looks healthy. Also, the multiple …
@brb as I wrote, I have exactly the problem in the main issue where many nodes are unreachable even if the log looks healthy:
None of those are in the peer list, so they won't get cleaned up now.
Do you have logs from earlier? Presumably something went wrong but it isn't in that log.
Hi @brb. The logs below are from our development cluster, which has an artificially accelerated maintenance schedule (nightly) to try to make these kinds of things happen more frequently (it's working :) ). Overnight, a script rolls through all nodes, and drains and terminates them one by one. Yesterday, we cleared out all of the unreachable nodes. The output below is from a healthy cluster, after 1 'round' of upgrades. In practice it takes about a week or two before we get a broken cluster due to the accumulation of unreachable peers.

My (unconfirmed) suspicion is that we have a broken cluster as soon as an IP address gets re-used. It would seem reasonable to me to automatically remove old peers when a new peer is discovered having the same host name but a different MAC. It's possible that would fix our case, despite all of the unreachable peers.
Yes, please open a new issue for each separate case; this helps to keep the threads of conversation clear. If in doubt as to whether it is a separate case, open a new issue.
@bboreham I have created this issue and re-posted the relevant stuff for our issue. Please let me know if you need more information to make progress. From a technical viewpoint, I believe it is nearly certain that my new issue is in fact the exact same as this one (which is why I commented on it). They both have the same exact root cause and scenario: AWS nodes terminating and then coming back as part of an ASG.
Related: #3394
Sorry, I had missed #3394.
However this one:
has a different peer with the same hostname in the peer list:
which triggers the problem described there.
OK, so basically on AWS the IP address does matter because it's part of the host name. To me it seems clear that when we see a new peer with the same host name, we should remove the old one.
There's a bug in the way … The removal code should look more like …
Nice, so we have figured out most of it, right? :-)
I wrote a unit test (at #3400) that lets me create and delete hundreds of nodes. This gives me some confidence we understand the main failure mode.
@bboreham Nice! I've been away for some time, but am I right that upgrading to the newly released 2.4.1 should fix the issue? If so I will spin up a cluster and replicate my steps to reproduce the issue.
We upgraded to release 2.5.0 and are still having this issue.
@bmihaescu please open a new issue with logs so we can diagnose.
Hi, we have a very high churn of nodes (a ~75-node cluster churning through about 1,000 nodes a day) and after about 2 weeks of running weave 2.5.0 on a kops-deployed 1.10 cluster we got this issue happening again. We can't share useful logs due to the huge timeframe and the node churn. If you have any idea about how we could share relevant information, please let me know.

We basically took @Raffo's commands, put them in a script and have this script run every 3 hours. This solved the issue and we have had no more incidents since December. The relevant part of the script, if anyone needs it:

```bash
#!/bin/bash
NODES=$(kubectl get nodes -o template --template='{{range.items}}{{range.status.addresses}}{{if eq .type "InternalIP"}}{{.address}}{{end}}{{end}} {{end}}')

echo Starting NODES cleanup ...
for node in $NODES
do
  #echo $node
  ssh -t -o ConnectTimeout=10 -o StrictHostKeyChecking=no admin@$node "sudo rm /var/lib/weave/weave-netdata.db"
done

echo Starting WEAVE PODS cleanup ...
for weave_pod in $(kubectl get pods -n kube-system | awk '{print $1}' | grep weave)
do
  kubectl delete pod -n kube-system $weave_pod;
done
```
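The comment above mentions running this script every 3 hours; one way to do that is a crontab entry like the following (a sketch, with a hypothetical script path).

```bash
# Hypothetical crontab entry: run the cleanup script every 3 hours and keep a log.
# The path /usr/local/bin/weave-cleanup.sh is an assumption for illustration.
0 */3 * * * /usr/local/bin/weave-cleanup.sh >> /var/log/weave-cleanup.log 2>&1
```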
Logs from one … Please open a new issue.
It might be as silly as enabling the weave-net ports 6783/tcp, 6783/udp, 6784/udp on the master node(s) in your firewall.
This actually worked!
This is in the FAQ - please could you say where you looked, so we can add it to the docs there.
Sorry, I can't say for sure right now as it was long ago. I had a cluster running for several months with open ports for Weave: 6763/tcp, 6763/udp, 6764/udp, 6781/tcp, 6782/tcp. Now I needed to recreate the cluster and opening just those ports didn't work. So additionally opening the aforementioned ports solved the issue in my case.
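For reference, a sketch of opening the ports mentioned above on a master node, assuming ufw is the host firewall in use (adjust for iptables or cloud security groups as appropriate):

```bash
# Open the weave control and data-plane ports; assumes ufw as the host firewall.
sudo ufw allow 6783/tcp   # weave control plane
sudo ufw allow 6783/udp   # weave data plane (sleeve)
sudo ufw allow 6784/udp   # weave data plane (fastdp)
```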
What you expected to happen?
Weave should not have memory of previously removed nodes, as this can cause an exhaustion of IPs.
What happened?
Some containers of the cluster were in status `ContainerCreating` and could never transition to a `Running` status. We could see by describing one of the pods that they were reporting `Failed create pod sandbox`. Here is a list of similar issues which could still be unrelated:

In our cluster we scale down the nodes every night to save money, by changing the size of the Autoscaling Group in AWS (it's a kops cluster).
We saw the following in weave containers:
This is NOT significant. It's fine that the nodes say that they "cannot connect to ourself"; we see this error in the `status connections` command of the weave CLI even on a working cluster.

What is more interesting is the output of `status ipam`:

This seems to be telling us that most of the cluster is unreachable, which is making the CNI not work: containers can't start because they can't get an IP address.
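A rough way to quantify "most of the cluster is unreachable" is to count the unreachable entries in that output; a sketch, assuming the standard weave-net DaemonSet and image layout:

```bash
# Count how many peers 'status ipam' marks as unreachable.
WEAVE_POD=$(kubectl get pods -n kube-system -l name=weave-net -o jsonpath='{.items[0].metadata.name}')
kubectl exec -n kube-system "$WEAVE_POD" -c weave -- /home/weave/weave --local status ipam | grep -c unreachable
```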
We verified that this was the case by reading the kubelet logs:
In the logs above you can see `has no IP address. Need to start a new one`.

We believe that this is due to the fact that we continuously shut down the nodes of our cluster at night by simply scaling the ASG to 0 and back to the original size in the morning. It looks like kops/weave do not do any automatic cleanup, probably because they don't have a chance.
From the `weave` documentation, it seems that we have to do something when the nodes exit, as mentioned in the official documentation. We still have to find a proper way to remove nodes from the Kubernetes cluster.

We did the reset by doing the following:
- Delete `/var/lib/weave/weave-netdata.db` on every node. There is no need for a backup of that file.
- Delete all the weave pods: `for i in $(kubectl get pods -n kube-system | awk '{print $1}' | grep weave); do kubectl delete pod -n kube-system $i; done`
This brought us back to a healthy state, which we could verify by running the `status ipam` weave command again:

How to reproduce it?
Not sure, probably deleting lots of nodes from the cluster in a continuous way.
Anything else we need to know?
Versions:
Logs:
I don't have other logs to paste for the moment.