Etcd size sometimes starts growing and grows until "mvcc: database space exceeded" #8009
FYI, restoring from all those snapshots locally and running:
always returns files whose size is on the order of at most tens of MB.
@wojtek-t We hit a similar issue with the DB file size growing. What we did was run defrag periodically.
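For reference, a minimal sketch of what a periodic defrag can look like with the Go client (clientv3); the endpoints and the interval are placeholders, and since defrag blocks the member while it runs, it is best done one endpoint at a time during a quiet window:

```go
// Hypothetical periodic defrag loop using the etcd Go client
// (github.com/coreos/etcd/clientv3); endpoints and interval are placeholders.
package main

import (
	"context"
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	endpoints := []string{"https://etcd-0:2379", "https://etcd-1:2379", "https://etcd-2:2379"}
	cli, err := clientv3.New(clientv3.Config{Endpoints: endpoints, DialTimeout: 5 * time.Second})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	for range time.Tick(6 * time.Hour) { // the interval is a guess; tune it for your cluster
		for _, ep := range endpoints {
			ctx, cancel := context.WithTimeout(context.Background(), time.Minute)
			// Defragment rewrites the boltdb file for that member, dropping free
			// pages and returning the space to the filesystem. It blocks reads and
			// writes on the member while it runs, so defrag one endpoint at a time.
			if _, err := cli.Defragment(ctx, ep); err != nil {
				log.Printf("defrag %s failed: %v", ep, err)
			}
			cancel()
		}
	}
}
```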
@benbjohnson can you take a look at the boltdb data? We double-checked the etcd txns, and we do not think we can have a long-running transaction that would cause the db size to grow continuously.
Do you know why the snapshot interval changed here? It was ~1hr per snapshot, but it is only 30min between the last two.
@xiang90 Can you add logging to print around https://github.com/coreos/etcd/blob/master/mvcc/backend/backend.go#L173 and https://github.com/coreos/etcd/blob/master/mvcc/backend/backend.go#L439-L443? I'm not sure how much it'll help for me to look at the DB. Also, I am pretty slammed right now and don't have the bandwidth. My suspicion is that it's the snapshot getting hung or somehow the Tx is getting dropped. There are some spots where a [...]
Good question. Maybe the underlying VM was restarted or something? But that's just guessing, I don't really know.
@wojtek-t As @benbjohnson mentioned, if there is a snapshot pending for a long time, we will have trouble, so I want you to check whether there is any failed or pending snapshot. If there is not, we can probably rule out the long-running txn theory. We also added quite a few warning lines around long-running read-only txns in etcd master. If you can reproduce the issue, I would suggest backporting the debugging patch and giving it a try.
@xiang90 - are there any logs that could confirm it in 3.0.17? We don't have a reproduction for it (and backporting is hard on our side).
sorry. not really. :(
What we plan to do on our side is to set up a few long-running, stressed k8s clusters against etcd3, taking snapshots at a 10-minute interval with some level of controlled failure injection. Not sure if we can get a reproduction though. But for 3.0 we are in a bad situation here: no reproduction, not enough logging. We have some guesses about what happened, but they are not super convincing. The only way to move forward is probably to get more debug logging and reproduce it. @armstrongli if you guys can reproduce it, it would be super helpful.
@xiang90 - do you know why between 1.56GB and 3.8GB the number of:
dropped by an order of magnitude, but the size of the DB more than doubled?
Ironically, so far those problems are happening on small clusters (fewer than 10 nodes). And there isn't much churn on them (at least on some of them). So we don't even need scale for this...
Hmm... Is it because most GKE users run small clusters? Or is the rate of affected clusters indeed higher for small (low-activity) clusters?
Could be both. What I wanted to say is just that it may potentially not be any easier to reproduce at scale than in small clusters.
@xiang90 - I looked into the logs of our backup tasks. Unfortunately, it seems they were log-rotated in the meantime, but we have something. In particular, I'm seeing this:
Those times are UTC, so this happened during the hour when the DB size grew from 1.56GB to 2.18GB.
boltdb never reclaims pages. What you see there are pages in use; there is also a freelist that tracks all freed pages. For the 3.8GB one, I assume most of the pages have become free? One actually interesting point in your data is that the overflow pages are huge after the 15.6MB one and become normal again in the 3.8GB one. In the 1.56GB one, for example, overflow pages / logical pages ≈ 20, which means the average size per key is (4000 + 115179) * 4KB / 10978. Is that what you expect?
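For what it's worth, that estimate works out to roughly 43 KiB per key. A tiny, hypothetical check of the arithmetic, assuming the 4 KiB boltdb page size used in the comment above:

```go
// Back-of-the-envelope check of the average-value-size estimate above,
// using the page and key counts quoted for the 1.56GB snapshot and a
// 4 KiB boltdb page size (an assumption; boltdb uses the OS page size).
package main

import "fmt"

func main() {
	const (
		pageSize      = 4096   // assumed boltdb page size in bytes
		logicalPages  = 4000   // leaf/branch pages, as quoted above
		overflowPages = 115179 // overflow pages, as quoted above
		keys          = 10978  // number of keys
	)
	avg := (logicalPages + overflowPages) * pageSize / keys
	fmt.Printf("≈ %d bytes (~%d KiB) per key\n", avg, avg/1024) // ≈ 44467 bytes, ~43 KiB
}
```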
So we do have a failed snapshot?
It failed on the client side - I don't know what happened on the server side (whether it even reached the server or not).
This is also strange. For the 500MB one, the actual usage is 97%, which is 451590110 bytes. It is probably worth looking at the data through https://github.com/br0xen/boltbrowser to see what is in there. I am curious how 10978 k8s keys take up ~450MB.
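Not boltbrowser itself, but a rough sketch of the kind of per-bucket inspection being suggested, using the boltdb Go package against a restored snapshot (the file path is a placeholder; etcd keeps its MVCC key-value data in the "key" bucket):

```go
// Rough sketch: dump per-bucket stats from a restored etcd snapshot file
// with github.com/boltdb/bolt. "snapshot.db" is a placeholder path.
package main

import (
	"fmt"
	"log"

	"github.com/boltdb/bolt"
)

func main() {
	// Open the restored snapshot read-only so nothing gets modified.
	db, err := bolt.Open("snapshot.db", 0600, &bolt.Options{ReadOnly: true})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	err = db.View(func(tx *bolt.Tx) error {
		// Walk every bucket; the MVCC data lives under "key".
		return tx.ForEach(func(name []byte, b *bolt.Bucket) error {
			s := b.Stats()
			fmt.Printf("%-10s keys=%d leafPages=%d leafOverflow=%d leafInuse=%dB\n",
				name, s.KeyN, s.LeafPageN, s.LeafOverflowN, s.LeafInuse)
			return nil
		})
	})
	if err != nil {
		log.Fatal(err)
	}
}
```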
Do you have server-side logging from around the same time?
Correct. We will try that. It would be helpful if you guys can also do some testing; I believe Google probably has more computational resources than us :P
I do - what should I look for?
Any panics? Any warning- or error-level logging?
I don't see any panics or warnings. But it seems I missed something important initially: etcd was restarted multiple times in the meantime, including around that time. I'm trying to figure out why...
So I looked deeper into what was happening with this cluster, and it was very heavily overloaded. In particular, we were restarting different components every now and then due to them not responding to liveness probes (etcd) or not being able to refresh master leases. Regarding the large objects - those were huge Endpoints objects containing O(1000) failed pods. This is a k8s issue and I will file a separate one in the k8s repo for it, but it shouldn't cause etcd to exceed its database space, in my opinion.
So I started looking into other clusters in which we faced the same issue, and it seems that there are some long-running "etcdctl snapshot save" operations there. In particular, examples from logs from other clusters:
[The second line means the backup is done and we are starting to upload it to where it should be stored.] And then it started getting longer:
[It was still a 193MB backup - and there were a bunch of backups like this, 7 or so, each taking ~2m.] Then this one took 4 minutes:
And one more, which started before we exceeded the database space but finished after (apiserver logs below):
This one produced a backup of 1.29GB.
That said, I'm not seeing any logs about those long backups in the etcd log (or I don't know what to look for there). I also looked at the cluster for which I reported the issue (I managed to get all the previous backup logs) and the situation was pretty similar:
or
Then to 2 minutes+
then to 4 minutes+
or 3 minutes+ right before we hit the "mvcc: database space exceeded":
Though in the second case the machine was overloaded, which might also have impacted how long the backups were taking.
You mean [...]? A snapshot that takes a few minutes is not an unreasonably long-running one. I was worried that there might be a hanging or leaky one (one that never finished). Thanks for all the information. I will try to see what can happen when etcd is taking a snapshot while receiving some large value updates. We will keep you updated on this.
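A sketch of roughly that experiment with the Go client: stream a snapshot while a writer keeps putting large values, and time it. The endpoint, key, value size, and write rate are all made up for illustration:

```go
// Hypothetical experiment: take a snapshot while large values are being
// written, and time how long the snapshot stream takes.
package main

import (
	"context"
	"io"
	"io/ioutil"
	"log"
	"strings"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// Writer goroutine: keep updating a ~1 MiB value, roughly like the
	// oversized Endpoints objects mentioned earlier in the thread.
	go func() {
		big := strings.Repeat("x", 1<<20)
		for {
			if _, err := cli.Put(context.Background(), "/stress/big", big); err != nil {
				log.Printf("put failed: %v", err)
			}
			time.Sleep(100 * time.Millisecond)
		}
	}()

	// Take a snapshot while the writes are in flight and time it.
	start := time.Now()
	rc, err := cli.Snapshot(context.Background())
	if err != nil {
		log.Fatal(err)
	}
	n, err := io.Copy(ioutil.Discard, rc)
	rc.Close()
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("snapshot: %d bytes in %v", n, time.Since(start))
}
```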
Has this been released already?
@sebastianwoinar it will be in 3.3.
We experienced the same issue today in our kubernetes cluster. The DB size went from 80MB to 12GB.
We are experiencing this issue with 3.2.7 and 3.2.9. Can this issue be re-opened and investigated, please?
@loxo33 the patch is merged into master, not 3.2.x.
Probably a boltdb issue.
We might be hitting this as well with etcd 3.1. So what is the recommended action to get a working cluster again? Compacting, defragmenting, and disarming does not work, as the alarm gets re-raised immediately.
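For anyone landing here, a sketch of the usual recovery order with the Go client: compact to the current revision, defragment every member, then disarm the alarm. The endpoints are placeholders, and if the live keyspace really is larger than the quota, the NOSPACE alarm will come right back until the quota itself is raised:

```go
// Sketch of the compact -> defrag -> disarm recovery sequence.
package main

import (
	"context"
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	endpoints := []string{"http://etcd-0:2379", "http://etcd-1:2379", "http://etcd-2:2379"}
	cli, err := clientv3.New(clientv3.Config{Endpoints: endpoints, DialTimeout: 5 * time.Second})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()
	ctx := context.Background()

	// 1. Compact away all revisions older than the current one. Any read
	//    returns the current revision in its response header.
	resp, err := cli.Get(ctx, "health-check-key")
	if err != nil {
		log.Fatal(err)
	}
	if _, err := cli.Compact(ctx, resp.Header.Revision, clientv3.WithCompactPhysical()); err != nil {
		log.Printf("compact: %v", err)
	}

	// 2. Defragment every member so the freed pages are returned to the OS.
	for _, ep := range endpoints {
		if _, err := cli.Defragment(ctx, ep); err != nil {
			log.Printf("defrag %s: %v", ep, err)
		}
	}

	// 3. Only then disarm; passing an empty AlarmMember clears all raised alarms.
	if _, err := cli.AlarmDisarm(ctx, &clientv3.AlarmMember{}); err != nil {
		log.Printf("disarm: %v", err)
	}
}
```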
Where are we with this? According to the comment above, it should have been in 3.3, but this is still unresolved, right?
Did you guys try running defrag periodically? We do, and it helps keep the size in check.
Is there any recommended action yet? We are running into this cluster issue as well.
@ricklagerweij For me it turned out to be a configuration issue. I had to set the [...]
@christianhuening Thanks! That makes sense, as 2 out of 3 nodes had a database of 2.1GB. I'm not familiar with etcd - how would I tweak this on an existing cluster? This page doesn't really give me many operational recommendations: https://github.com/coreos/etcd/blob/master/Documentation/op-guide/configuration.md
@ricklagerweij Well, assuming you have it running via systemd: you go to node 1, edit the systemd service file for etcd, and add [...]
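The exact setting is elided above; assuming it refers to etcd's backend quota (the `--quota-backend-bytes` flag, which defaults to 2 GiB and would explain the 2.1GB databases tripping the alarm), a quick hypothetical way to check each member's on-disk DB size against that default with the Go client:

```go
// Hypothetical check of each member's DB size against the 2 GiB default
// backend quota; endpoints are placeholders.
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	endpoints := []string{"http://etcd-0:2379", "http://etcd-1:2379", "http://etcd-2:2379"}
	cli, err := clientv3.New(clientv3.Config{Endpoints: endpoints, DialTimeout: 5 * time.Second})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	const defaultQuota = 2 * 1024 * 1024 * 1024 // etcd's default backend quota (2 GiB)
	for _, ep := range endpoints {
		st, err := cli.Status(context.Background(), ep)
		if err != nil {
			log.Printf("status %s: %v", ep, err)
			continue
		}
		fmt.Printf("%s: dbSize=%d bytes, overDefaultQuota=%v\n", ep, st.DbSize, st.DbSize > defaultQuota)
	}
}
```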
@christianhuening makes sense, thanks again.
@xiang90 I've encountered the same issue now. According to the discussion, I need to update my etcd version, right? Is there any workaround so I can get a quick fix?
mark
In which version is this issue fixed?
3.2.10 and later.
We have already observed a few cases where suddenly (after days of running a GKE cluster) the size of the database starts growing.
As an example, we have a cluster that was running without any issues for ~2 weeks (database size was ~16MB), and then its database started growing. The growth wasn't immediate - it took ~2 days before it reached the 4GB limit, and the growth happened in steps.
For reference, we have backups (snapshots, taken via etcdctl snapshot) that reflect the growth speed (each name contains the time when it was made).
We've checked that we were doing compaction very regularly (every 5m) the whole time, so it doesn't seem to be the same as #7944.
I'm attaching the interesting lines from the etcd logs in etcd-compaction.txt.
[Note: times in those logs are UTC, and times in the snapshot names are PST, so there is a 7-hour difference.]
To summarize, each compaction covered at most a few thousand transactions (so it's not that we did a lot during some 5m period), though there were some longer compactions, up to ~7s.
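For context, a periodic compaction like the one described can be approximated with the Go client as below; the 5-minute interval matches the one above, while everything else (endpoint, probe key) is a placeholder rather than the actual tooling used here:

```go
// Rough sketch of a 5-minute compaction loop with the etcd Go client.
package main

import (
	"context"
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	for range time.Tick(5 * time.Minute) {
		ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
		resp, err := cli.Get(ctx, "compaction-probe") // any read returns the current revision in its header
		if err == nil {
			// Compact everything older than the current revision. Note that
			// compaction only frees pages inside boltdb for reuse; it does not
			// shrink the file, which is why defrag is a separate step.
			_, err = cli.Compact(ctx, resp.Header.Revision)
		}
		if err != nil {
			log.Printf("compaction: %v", err)
		}
		cancel()
	}
}
```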
I started digging into individual snapshots and found something strange (I was using bolt).
What is extremely interesting to me is that both:
dropped by an order of magnitude in this 3.86GB snapshot, but the total size of the database didn't drop.
Unfortunately I can't provide any of those snapshots due to privacy reasons, but maybe you can suggest something we could investigate (or some commands whose results we could share) to help with debugging?
@xiang90 @hongchaodeng @mml @lavalamp