etcd3 cluster filled disk and didn't gc old revisions #7944
Comments
Note that in the output here, we had already manually compacted against one of the revisions, but not all of them, before this.
@Spindel Do you take external periodic backups? How often does this happen?
Is this a cluster migrated from etcd2, or is it a fresh cluster?
No external backups are taken on this system, and this is the first time it has happened. It's powering a small Kubernetes cluster (a handful of pods only). I'm no longer sure if it's migrated or not, honestly; it's been reset quite a few times while I've been working with k8s.
Good to know.
Can you keep monitoring the database size growth? In theory it should not grow even to 100 MB for your workload. Do you keep all your etcd logs? I am interested in the etcd server log of coreos03.kub.do.modio.se:2379 for the 48 hours before it filled up its db size quota.
I can keep monitoring it, yes.
@Spindel Thanks a lot! We will check the log and keep you posted.
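(A minimal sketch of one way to keep an eye on that growth, assuming etcdctl v3 and jq are available; the endpoint list here is illustrative, and dbSize is the same field that appears in the endpoint status JSON later in this issue.)
# Sample each member's reported backend size every five minutes.
while true; do
  ETCDCTL_API=3 etcdctl \
    --endpoints=coreos01.kub.do.modio.se:2379,coreos02.kub.do.modio.se:2379,coreos03.kub.do.modio.se:2379 \
    endpoint status -w json | jq -r '.[] | "\(.Endpoint) dbSize=\(.Status.dbSize)"'
  sleep 300
done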
@heyitsanthony I am not sure whether the snapshot send failure triggered the huge db size or the huge db size triggered the snapshot send failure. Probably we should check how etcd handles a snapshot send failure first.
@Spindel Do you know what happened to peer
@xiang90: Yeah, it hit OOM and the OOM killer started reaping random processes.
Is it possible that etcd takes a lot of RAM because of the large snapshot? What is your RAM limit?
On that node? I think it was 1 GiB, but there were some misbehaving auto-restarted pods on that node that ate RAM pretty aggressively.
Do you still have the log on this member too?
Here you go:
Here's the monitoring graph of available memory on the coreos02 node. As you can see, it was quite stable at 200 MiB free until it wasn't, and then took a big dive.
@Spindel Ah... Glad that you are monitoring RSS. Can you also provide the RSS usage on machine
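(In case it helps anyone gathering the same data: one way to sample etcd's memory usage on a node; the process and unit names below are assumptions about this particular setup.)
# Print the etcd process resident set size in KiB (procps ps; process name assumed to be "etcd").
ps -C etcd -o pid=,rss=,comm=
# Or, if etcd runs under systemd with memory accounting, its accounted usage in bytes:
systemctl show etcd-member --property=MemoryCurrent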
From another member
@heyitsanthony Interesting... Two nodes with large db sizes both tried to send a snapshot to
@Spindel Being greedy here... could I also get the logs from coreos01.kub.do.modio.se:2379?
And here are the logs for coreos01.kub.do.modio.se
Sadly, our monitoring doesn't capture proper per-process data (CPU, process counts, etc.) from inside the cluster, since it runs as a DaemonSet pod. A better way of running that would be nice. Anyhow, I'm hitting the sack, it's way too late in my timezone. Hope the info helps!
Strangely enough... the good member (coreos01.kub.do.modio.se:2379) also failed to send a few snapshots on the same day.
Note that before I did the status and such, I had manually compacted coreos01.kub.do.modio.se as it was also quite huge. I'm not sure it should be considered a "good" node in this case, just that I didn't keep all the logs from that part.
@Spindel Ah. Right. I looked into the log and found
I'd think it's very plausible. To reproduce, you could probably start a three-node cluster where one node has < 500 MiB RAM, and then just run various memory hogs on the low-RAM device.
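(A rough sketch of that reproduction idea; the unit name, addresses, and limits are made up for illustration, and stress-ng stands in for the "memory hogs".)
# Cap one member's memory so the OOM killer can plausibly hit it
# (requires a cgroup-enabled systemd; all values below are illustrative).
sudo systemd-run --unit=etcd-lowmem -p MemoryMax=500M \
  /usr/bin/etcd --name member3 --data-dir /var/lib/etcd-lowmem \
  --listen-peer-urls http://10.0.0.3:2380 --initial-advertise-peer-urls http://10.0.0.3:2380 \
  --listen-client-urls http://10.0.0.3:2379 --advertise-client-urls http://10.0.0.3:2379 \
  --initial-cluster member1=http://10.0.0.1:2380,member2=http://10.0.0.2:2380,member3=http://10.0.0.3:2380
# Then run a memory hog alongside it on the same host:
stress-ng --vm 2 --vm-bytes 400M --timeout 1h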
/subscribe
@Spindel Is the db size increasing again?
Since I've permanently removed the suspected problematic host from the cluster, the db doesn't appear to grow uncontrollably anymore; endpoint status gives the following:
That still appears to be more than a 100x bloat factor, but it's acceptable.
I discussed this issue with @heyitsanthony yesterday. We can see how snapshot issues contribute to the huge db size, but they should not bring it up to 1 GB+ in that short amount of time. So I looked into the logs again, and found the root cause for this case:
"MESSAGE" : "2017-05-14 16:07:28.319813 I | mvcc: finished scheduled compaction at 1815026 (took 1.573214ms)"
"MESSAGE" : "2017-05-14 16:12:28.454798 I | mvcc: finished scheduled compaction at 1815435 (took 912.595µs)"
"MESSAGE" : "2017-05-14 16:17:28.522506 I | mvcc: finished scheduled compaction at 1815827 (took 1.691996ms)"
"MESSAGE" : "2017-05-14 16:22:28.571924 I | mvcc: finished scheduled compaction at 1816216 (took 1.743819ms)"
// No compaction for 3 days with 30K+ entries
"MESSAGE" : "2017-05-17 11:36:45.675060 I | mvcc: finished scheduled compaction at 2169745 (took 20.935680091s)"
"MESSAGE" : "2017-05-17 11:41:24.829663 I | mvcc: finished scheduled compaction at 2170257 (took 33.872468ms)"
"MESSAGE" : "2017-05-17 11:46:24.824949 I | mvcc: finished scheduled compaction at 2170744 (took 2.118377ms)"
"MESSAGE" : "2017-05-17 11:51:24.838707 I | mvcc: finished scheduled compaction at 2171255 (took 1.376061ms)"
"MESSAGE" : "2017-05-17 11:56:24.856919 I | mvcc: finished scheduled compaction at 2171767 (took 2.451828ms)"
"MESSAGE" : "2017-05-17 12:01:24.896190 I | mvcc: finished scheduled compaction at 2172276 (took 1.796448ms)"
"MESSAGE" : "2017-05-17 12:06:24.967299 I | mvcc: finished scheduled compaction at 2172788 (took 1.330718ms)"
"MESSAGE" : "2017-05-17 12:11:25.009440 I | mvcc: finished scheduled compaction at 2173296 (took 2.343877ms)"
The compaction somehow got stuck for 3 days. The same thing happened on all the other members in the cluster. The Kubernetes master should compact etcd every 5 minutes or so. Can you check your API server log to see if anything happened to it?
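(For anyone checking the same thing: one way to look for that compaction activity in the logs, assuming both components run under systemd; the unit names are assumptions about this particular setup.)
# The apiserver is expected to trigger an etcd compaction roughly every 5 minutes,
# and etcd logs each one as "finished scheduled compaction"; a multi-day gap points at the apiserver side.
journalctl -u kube-apiserver --since "2017-05-14" | grep -i compact
journalctl -u etcd-member --since "2017-05-14" | grep "finished scheduled compaction"
# (Newer etcd releases can also compact on their own via --auto-compaction-retention, independent of the apiserver.)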
Sadly, the apiserver logs have been rotated since. ;-(
Ack, subscribing!
Bug reporting
Our Kubernetes cluster stopped taking commands due to etcd3 running out of space and alarming about NOSPACE.
Endpoint status, as JSON:
[{"Endpoint":"coreos01.kub.modio.se:2379","Status":{"header":{"cluster_id":1416696078359407448,"member_id":5715034692686022789,"revision":2190654,"raft_term":5829},"version":"3.0.10","dbSize":952512512,"leader":12583913577278567278,"raftIndex":5817555,"raftTerm":5829}},{"Endpoint":"coreos01.kub.do.modio.se:2379","Status":{"header":{"cluster_id":1416696078359407448,"member_id":9911998787147328850,"revision":2190654,"raft_term":5829},"version":"3.0.10","dbSize":356352,"leader":12583913577278567278,"raftIndex":5817555,"raftTerm":5829}},{"Endpoint":"coreos03.kub.do.modio.se:2379","Status":{"header":{"cluster_id":1416696078359407448,"member_id":12583913577278567278,"revision":2190654,"raft_term":5829},"version":"3.0.10","dbSize":3777458176,"leader":12583913577278567278,"raftIndex":5817555,"raftTerm":5829}}]
After that we did a compaction:
after compact, before defrag:
after compact, after defrag:
And after clearing the alarms, we could use the Kubernetes cluster again.
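(For reference, the recovery described above corresponds roughly to these etcdctl v3 commands; the revision and endpoint come from the status output above and will differ per cluster.)
# 1. Read the current revision from endpoint status (2190654 in the output above).
ETCDCTL_API=3 etcdctl --endpoints=coreos01.kub.do.modio.se:2379 endpoint status -w json
# 2. Compact away all key history older than that revision.
ETCDCTL_API=3 etcdctl --endpoints=coreos01.kub.do.modio.se:2379 compaction 2190654
# 3. Defragment each member so the freed pages are returned to the filesystem.
ETCDCTL_API=3 etcdctl --endpoints=coreos01.kub.do.modio.se:2379 defrag
# 4. Clear the NOSPACE alarm so writes are accepted again.
ETCDCTL_API=3 etcdctl --endpoints=coreos01.kub.do.modio.se:2379 alarm disarm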