
ETCD db size is extremely unbalanced among members #7116

Closed · armstrongli opened this issue Jan 9, 2017 · 47 comments

@armstrongli

I have a 4-member cluster. After running for days, one member's DB size has become extremely large compared to the others.

https://tess-master-cbtkd-8128.dev.*.*:4001, 87e6a178ab5a0f63, 3.0.15, 11 GB, true, 2868, 59480228
https://tess-master-temb2-9053.dev.*.*:4001, b3f0b5ddc047ed37, 3.0.15, 57 MB, false, 2868, 59480355
https://tess-master-13mny-8094.dev.*.*:4001, cc83177fcd721b4f, 3.0.15, 57 MB, false, 2868, 59480510
https://salt-master-7505.dev.*.*:4001, ea015efc8847afdf, 3.0.15, 57 MB, false, 2868, 59480579

Member 87e6a178ab5a0f63 has an 11 GB DB file, while the other members have only 57 MB.
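For reference, a listing like the one above can be produced with the v3 endpoint status command; this is a minimal sketch, and the endpoint URLs and cert path are placeholders rather than this cluster's real ones:

# List each member's DB size, leader flag, and raft term/index.
# Endpoints and CA path below are placeholders; add `-w table` for the
# boxed table format shown later in this thread.
ETCDCTL_API=3 etcdctl \
  --endpoints=https://etcd-1:4001,https://etcd-2:4001,https://etcd-3:4001,https://etcd-4:4001 \
  --cacert=/path/to/ca.crt \
  endpoint status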

There are a lot of log entries like these on member 87e6a178ab5a0f63:

...

2017-01-09 06:46:38.706113 W | etcdserver: avoid queries with large range/delete range!
2017-01-09 06:46:38.928569 W | etcdserver: apply entries took too long [117.086886ms for 1 entries]
2017-01-09 06:46:38.928628 W | etcdserver: avoid queries with large range/delete range!
2017-01-09 06:46:39.135635 W | etcdserver: apply entries took too long [101.388418ms for 1 entries]
2017-01-09 06:46:39.135953 W | etcdserver: avoid queries with large range/delete range!
2017-01-09 06:46:39.371954 W | etcdserver: apply entries took too long [134.492457ms for 1 entries]
2017-01-09 06:46:39.372024 W | etcdserver: avoid queries with large range/delete range!
2017-01-09 06:46:39.578320 W | etcdserver: apply entries took too long [105.313079ms for 1 entries]
2017-01-09 06:46:39.578539 W | etcdserver: avoid queries with large range/delete range!
2017-01-09 06:46:39.814008 W | etcdserver: apply entries took too long [135.0285ms for 1 entries]
2017-01-09 06:46:39.814085 W | etcdserver: avoid queries with large range/delete range!
2017-01-09 06:46:40.014153 W | etcdserver: apply entries took too long [110.19445ms for 1 entries]
2017-01-09 06:46:40.014689 W | etcdserver: avoid queries with large range/delete range!
2017-01-09 06:46:40.223789 W | etcdserver: apply entries took too long [74.681561ms for 1 entries]
2017-01-09 06:46:40.223833 W | etcdserver: avoid queries with large range/delete range!
2017-01-09 06:46:40.412689 W | etcdserver: apply entries took too long [17.895881ms for 1 entries]
2017-01-09 06:46:40.413091 W | etcdserver: avoid queries with large range/delete range!
...

All members share the same dataset, so the DB sizes should be almost the same across them, e.g. 57–65 MB.

Looking forward to a response :)

@armstrongli (Author)

After checking the cluster details, I noticed that etcd v3 needs a defragment operation because it uses boltdb.
After defragmenting the member, its data size decreased back to the normal range.
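For anyone hitting the same symptom, a minimal defragment sketch (the endpoint URL and cert path are placeholders; defrag blocks the member while it runs, so run it one member at a time):

# Rewrite the backend file of one member to reclaim free pages.
# Endpoint and CA path are placeholders.
ETCDCTL_API=3 etcdctl \
  --endpoints=https://etcd-1:4001 \
  --cacert=/path/to/ca.crt \
  defrag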

@xiang90 (Contributor) commented Jan 9, 2017

@armstrongli etcd should never get into this state. Can you easily reproduce it? Do you still have the 11 GB data around?

@armstrongli (Author)

@xiang90 I'm trying to reproduce it. It has happened twice, which is why I raised it here.
The member's DB size recovered to normal after the defragment, so the original state of the scenario is gone.

What I did on the cluster was upgrade etcd from 2.2.1 to 3.0.15 in one Kubernetes cluster.

@xiang90 (Contributor) commented Jan 10, 2017

@armstrongli The defrag just fixes the symptom. The sizes should not become unbalanced (a ~10 MB difference is possible) in the first place. It would be great if you could reproduce it.

Did the leader's db size become 11 GB right after the migration, or did it grow to 11 GB over time? If the latter, how long did it take to reach 11 GB? Hours? Days?

@armstrongli (Author)

@xiang90
It grew to 11 GB over time, and it took weeks to get there.

@xiang90 (Contributor) commented Jan 11, 2017

@armstrongli Is it still growing now after your defrag?

@armstrongli (Author)

@xiang90 Not yet. The cluster previously ran for weeks before it exceeded the database space quota; this time it has only been running for 2 days.
Please keep this issue open until I encounter the issue again.
I'll write down the operations I performed on the cluster to provide details that help you debug etcd.

@xiang90 (Contributor) commented Jan 11, 2017

@armstrongli 11 GB over x weeks, where x < 5, is roughly 2 GB/week, or about 300 MB/day. You should be able to notice it even after 2 days if it continues to grow, I guess.

@xiang90 (Contributor) commented Jan 17, 2017

@armstrongli Any updates on this?

@armstrongli (Author)

@xiang90 The data size hasn't changed much after running my cluster for 1 week. Closing this issue for now.
I'll provide more details if I encounter this issue another time.

@xiang90 (Contributor) commented Jan 18, 2017

@armstrongli OK. I guess it was something in your initial setup or migration. Let us know if you hit this again.

xiang90 closed this as completed Jan 18, 2017
xiang90 reopened this May 4, 2017
@xiang90 (Contributor) commented May 4, 2017

@heyitsanthony @gyuho @fanminshi

Other people have reported similar behavior. We should probably look into this problem.

@xiang90 (Contributor) commented May 4, 2017

@armstrongli Do you remember whether you took snapshots from that etcd server periodically?

@armstrongli (Author)

@xiang90 Sorry for the late response.
Yes, I did take snapshots periodically.
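(How exactly these backups were taken is not stated; for reference, a typical v3 periodic backup looks roughly like the sketch below, where the endpoint, CA path, and output path are placeholders.)

# Take a point-in-time backup of the keyspace from one member.
# Endpoint, CA path, and output location are placeholders.
ETCDCTL_API=3 etcdctl \
  --endpoints=https://etcd-1:4001 \
  --cacert=/path/to/ca.crt \
  snapshot save /backup/etcd-$(date +%s).db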

@xiang90 (Contributor) commented May 15, 2017

@armstrongli Have you seen it recently? Any chance you can reproduce it?

@armstrongli (Author)

No, I haven't found the root cause. We are using etcd for Kubernetes clusters in our company.
To avoid this kind of issue, we run a sidecar in the etcd pod that defragments periodically. The defragmentation does work.
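A minimal sketch of such a periodic-defrag sidecar (the loop interval, local endpoint, and retry behavior are assumptions, not necessarily the actual setup here):

#!/bin/sh
# Hypothetical sidecar loop: defragment the local member once a day.
# Endpoint, CA path, and interval are placeholders.
while true; do
  ETCDCTL_API=3 etcdctl \
    --endpoints=https://127.0.0.1:4001 \
    --cacert=/etc/ssl/kubernetes/ca.crt \
    defrag || echo "defrag failed; will retry on the next cycle"
  sleep 86400
done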

@xiang90 (Contributor) commented May 16, 2017

So you are able to somehow reproduce it?

@armstrongli (Author)

I redid what I had done before, but the unbalanced data didn't reproduce. :(

@armstrongli (Author)

The reproduction steps were a little different from what originally led to the unbalanced data: I used the backup data instead of the hot data (which contains all the tombstone/historical data).
That's the difference.
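That difference plausibly matters: restoring from a snapshot rebuilds a fresh, compacted backend, so the restored db file has none of the free pages that accumulate in a long-running member. A minimal restore sketch, with all names, URLs, and paths as placeholders:

# Hypothetical restore: rebuilds the backend from a snapshot file, producing
# a compact db without the original file's free/fragmented pages.
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd.db \
  --name etcd-1 \
  --initial-cluster etcd-1=https://etcd-1:2380 \
  --initial-advertise-peer-urls https://etcd-1:2380 \
  --data-dir /var/etcd/restored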

@xiang90 (Contributor) commented May 17, 2017

@armstrongli As long as you can reproduce it, it will give us a better chance to get it fixed.

@armstrongli (Author)

@xiang90 I noticed that one of our clusters hit the data imbalance again after running for months, and it caused a very weird issue on the etcd cluster.
It looks like the cluster got split:

/ # etcdctl --endpoints [$(echo $member_urls | sed -e "s/ /,/g")] --cacert=/etc/ssl/kubernetes/ca.crt endpoint status -w table;
+----------------------------------------------------------+------------------+---------+---------+-----------+-----------+------------+
|                         ENDPOINT                         |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+----------------------------------------------------------+------------------+---------+---------+-----------+-----------+------------+
| https://kubernetes-master-1-1046254.kube.11.****.io:4001 | 178b7f7c9122c3b5 | 3.0.15  | 385 MB  | false     |      1273 | 4261016391 |
| https://kubernetes-master-3-1680624.kube.11.****.io:4001 | 617f02413611a25b | 3.0.15  | 385 MB  | true      |      1273 | 4261016408 |
| https://kubernetes-master-2-1680526.kube.11.****.io:4001 | c17c9a8d17bc74ea | 3.0.15  | 900 MB  | false     |      1273 | 4261016407 |
+----------------------------------------------------------+------------------+---------+---------+-----------+-----------+------------+

The Kubernetes status shows that the data differs among the master nodes. Master-1 and master-3 see the same data, but master-2 is totally different from the others.

[fedora@kubernetes-master-2-1680526 ~]$ kubectl get no
NAME                                               STATUS                        AGE
kubernetes-master-1-1046254.kube.****.io   NotReady,SchedulingDisabled   1y
kubernetes-master-2-1680526.kube.****.io   Ready,SchedulingDisabled      122d
kubernetes-master-3-1680624.kube.****.io   NotReady,SchedulingDisabled   122d
tess-node-045fs-1046637.kube.****.io       NotReady                      1y
tess-node-0kpnt-1046629.kube.****.io       NotReady                      1y
tess-node-0v4wt-1065026.kube.****.io       NotReady                      358d
tess-node-0zxkw-1046657.kube.****.io       NotReady                      1y
tess-node-13hcc-1046328.kube.****.io       NotReady                      1y
tess-node-1406m-1046654.kube.****.io       NotReady                      1y
tess-node-1k8vs-1046666.kube.****.io       NotReady                      1y
tess-node-1udit-1046680.kube.****.io       NotReady                      1y
tess-node-1zzue-1046686.kube.****.io       NotReady                      1y

We didn't do anything to etcd; we just let it run for months.

@armstrongli (Author)

  1. Data is different under the Kubernetes data path (/registry/**).
  2. Existing data does not get synced.
  3. New data under other paths does get synced, e.g. when I create another key such as /test.

It looks like the raft implementation in etcd hit some corner case it couldn't handle.
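One way to confirm this kind of divergence (a sketch; the endpoints and CA path follow the examples in this thread, but the exact check is hypothetical) is to read the same prefix from each member separately and compare the results:

# Hypothetical divergence check: count keys under /registry/ on each member.
# Serializable reads (--consistency=s) are answered from the local member's
# own data rather than going through the leader.
for ep in https://master-1:4001 https://master-2:4001 https://master-3:4001; do
  echo "== $ep =="
  ETCDCTL_API=3 etcdctl --endpoints="$ep" \
    --cacert=/etc/ssl/kubernetes/ca.crt \
    get /registry/ --prefix --keys-only --consistency=s | wc -l
done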

@xiang90 (Contributor) commented May 19, 2017

Is it a fresh v3 cluster, or was it migrated from v2?

@armstrongli (Author) commented May 19, 2017

Migrated from v2

#7116 (comment)

@xiang90 (Contributor) commented May 19, 2017

@armstrongli I would imagine it was not in sync from the very beginning. Are you sure the members were synced?

@armstrongli (Author)

Yes, they were synced and had been running healthily for months. We only hit the issue today.

@xiang90 (Contributor) commented May 19, 2017

Can you provide the logs of all the members from around the time it happened?

@armstrongli (Author)


2017-05-18 11:58:15.746418 I | etcdserver: start to snapshot (applied: 4257496962, lastsnap: 4257486961)
2017-05-18 11:58:15.895415 I | etcdserver: saved snapshot at index 4257496962
2017-05-18 11:58:15.895701 I | etcdserver: compacted raft log at 4257491962
2017-05-18 11:58:20.648673 I | fileutil: purged file /var/etcd/data/member/snap/00000000000004f6-00000000fdc2decb.snap successfully
2017-05-18 11:58:20.651068 I | fileutil: purged file /var/etcd/data/member/snap/00000000000004f7-00000000fdc305dd.snap successfully
2017-05-18 11:58:20.653365 I | fileutil: purged file /var/etcd/data/member/snap/00000000000004f7-00000000fdc32dfe.snap successfully
2017-05-18 11:58:20.655779 I | fileutil: purged file /var/etcd/data/member/snap/00000000000004f7-00000000fdc35514.snap successfully
2017-05-18 11:58:20.658079 I | fileutil: purged file /var/etcd/data/member/snap/00000000000004f7-00000000fdc37c26.snap successfully
2017-05-18 11:58:26.867468 I | wal: segmented wal file /var/etcd/data/member/wal/00000000000040f5-00000000fdc441ad.wal is created
2017-05-18 11:58:50.088632 I | fileutil: purged file /var/etcd/data/member/wal/00000000000040f0-00000000fd9e645a.wal successfully
2017-05-18 12:02:58.528159 I | etcdserver: start to snapshot (applied: 4257506963, lastsnap: 4257496962)
2017-05-18 12:02:58.682533 I | etcdserver: saved snapshot at index 4257506963
2017-05-18 12:02:58.682834 I | etcdserver: compacted raft log at 4257501963
2017-05-18 12:03:20.664157 I | fileutil: purged file /var/etcd/data/member/snap/00000000000004f7-00000000fdc3a337.snap successfully
2017-05-18 12:07:37.935104 I | wal: segmented wal file /var/etcd/data/member/wal/00000000000040f6-00000000fdc48c7c.wal is created
2017-05-18 12:07:45.970030 I | etcdserver: start to snapshot (applied: 4257516964, lastsnap: 4257506963)
2017-05-18 12:07:46.293832 I | etcdserver: saved snapshot at index 4257516964
2017-05-18 12:07:46.294261 I | etcdserver: compacted raft log at 4257511964
2017-05-18 12:07:50.101677 I | fileutil: purged file /var/etcd/data/member/wal/00000000000040f1-00000000fd9eb256.wal successfully
2017-05-18 12:07:50.669514 I | fileutil: purged file /var/etcd/data/member/snap/00000000000004f7-00000000fdc3ca4b.snap successfully
2017-05-18 12:12:15.587835 I | etcdserver: start to snapshot (applied: 4257526966, lastsnap: 4257516964)
2017-05-18 12:12:16.099798 I | etcdserver: saved snapshot at index 4257526966
2017-05-18 12:12:16.100121 I | etcdserver: compacted raft log at 4257521966
2017-05-18 12:12:20.675058 I | fileutil: purged file /var/etcd/data/member/snap/00000000000004f7-00000000fdc3f15e.snap successfully
2017-05-18 12:16:42.161711 I | wal: segmented wal file /var/etcd/data/member/wal/00000000000040f7-00000000fdc4dab3.wal is created
2017-05-18 12:16:49.084965 I | etcdserver: start to snapshot (applied: 4257536967, lastsnap: 4257526966)
...


2017-05-18 14:30:21.103971 I | fileutil: purged file /var/etcd/data/member/snap/00000000000004f7-00000000fdd00609.snap successfully
2017-05-18 14:31:26.354837 I | raft: c17c9a8d17bc74ea is starting a new election at term 1271
2017-05-18 14:31:26.368144 I | raft: c17c9a8d17bc74ea became candidate at term 1272
2017-05-18 14:31:26.375851 I | raft: c17c9a8d17bc74ea received vote from c17c9a8d17bc74ea at term 1272
2017-05-18 14:31:26.380401 I | raft: c17c9a8d17bc74ea [logterm: 1271, index: 4258322857] sent vote request to 178b7f7c9122c3b5 at term 1272
2017-05-18 14:31:26.380417 I | raft: c17c9a8d17bc74ea [logterm: 1271, index: 4258322857] sent vote request to 617f02413611a25b at term 1272
2017-05-18 14:31:26.380429 I | raft: raft.node: c17c9a8d17bc74ea lost leader 178b7f7c9122c3b5 at term 1272
2017-05-18 14:31:26.446054 I | raft: c17c9a8d17bc74ea received vote from 617f02413611a25b at term 1272
2017-05-18 14:31:26.446116 I | raft: c17c9a8d17bc74ea [quorum:2] has received 2 votes and 0 vote rejections
2017-05-18 14:31:26.446881 I | raft: c17c9a8d17bc74ea became leader at term 1272
2017-05-18 14:31:26.446881 I | raft: raft.node: c17c9a8d17bc74ea elected leader c17c9a8d17bc74ea at term 1272
2017-05-18 14:31:26.475828 E | rafthttp: failed to write 178b7f7c9122c3b5 on stream MsgApp v2 (write tcp 10.28.14.164:2380->10.28.11.39:50686: write: broken pipe)
2017-05-18 14:31:26.475859 I | rafthttp: peer 178b7f7c9122c3b5 became inactive
2017-05-18 14:31:26.476846 W | rafthttp: lost the TCP streaming connection with peer 178b7f7c9122c3b5 (stream MsgApp v2 writer)
2017-05-18 14:31:27.645243 W | rafthttp: lost the TCP streaming connection with peer 178b7f7c9122c3b5 (stream Message reader)
2017-05-18 14:31:27.703764 I | rafthttp: peer 178b7f7c9122c3b5 became active
2017-05-18 14:31:27.703837 I | rafthttp: established a TCP streaming connection with peer 178b7f7c9122c3b5 (stream MsgApp v2 writer)
2017-05-18 14:31:27.746997 I | rafthttp: established a TCP streaming connection with peer 178b7f7c9122c3b5 (stream Message reader)
2017-05-18 14:31:29.580269 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 25.270189ms)
2017-05-18 14:31:29.580305 W | etcdserver: server is likely overloaded
2017-05-18 14:31:29.580316 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 25.341033ms)
2017-05-18 14:31:29.580322 W | etcdserver: server is likely overloaded
2017-05-18 14:31:45.943834 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 88.945841ms)
2017-05-18 14:31:45.943869 W | etcdserver: server is likely overloaded
2017-05-18 14:31:45.943880 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 89.033934ms)

.....



2017-05-19 02:11:00.268965 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 14.138642ms)
2017-05-19 02:11:00.268976 W | etcdserver: server is likely overloaded
2017-05-19 02:11:14.354939 W | raft: c17c9a8d17bc74ea stepped down to follower since quorum is not active
2017-05-19 02:11:14.355019 I | raft: c17c9a8d17bc74ea became follower at term 1272
2017-05-19 02:11:14.355030 I | raft: raft.node: c17c9a8d17bc74ea lost leader c17c9a8d17bc74ea at term 1272
2017-05-19 02:11:15.754867 I | raft: c17c9a8d17bc74ea is starting a new election at term 1272
2017-05-19 02:11:15.767292 I | raft: c17c9a8d17bc74ea became candidate at term 1273
2017-05-19 02:11:15.767318 I | raft: c17c9a8d17bc74ea received vote from c17c9a8d17bc74ea at term 1273
2017-05-19 02:11:15.767332 I | raft: c17c9a8d17bc74ea [logterm: 1272, index: 4260993216] sent vote request to 178b7f7c9122c3b5 at term 1273
2017-05-19 02:11:15.767343 I | raft: c17c9a8d17bc74ea [logterm: 1272, index: 4260993216] sent vote request to 617f02413611a25b at term 1273
2017-05-19 02:11:15.986834 I | raft: c17c9a8d17bc74ea [term: 1273] ignored a MsgAppResp message with lower term from 617f02413611a25b [term: 1272]
2017-05-19 02:11:15.997938 I | raft: c17c9a8d17bc74ea [term: 1273] ignored a MsgHeartbeatResp message with lower term from 617f02413611a25b [term: 1272]
2017-05-19 02:11:16.072122 I | raft: c17c9a8d17bc74ea [term: 1273] ignored a MsgAppResp message with lower term from 178b7f7c9122c3b5 [term: 1272]
2017-05-19 02:11:16.072170 I | raft: c17c9a8d17bc74ea [term: 1273] ignored a MsgHeartbeatResp message with lower term from 178b7f7c9122c3b5 [term: 1272]
2017-05-19 02:11:16.072186 I | raft: c17c9a8d17bc74ea [term: 1273] ignored a MsgHeartbeatResp message with lower term from 178b7f7c9122c3b5 [term: 1272]
...


2017-05-19 02:11:16.072571 I | raft: c17c9a8d17bc74ea [term: 1273] ignored a MsgHeartbeatResp message with lower term from 617f02413611a25b [term: 1272]
2017-05-19 02:11:16.072596 I | raft: c17c9a8d17bc74ea [term: 1273] ignored a MsgHeartbeatResp message with lower term from 178b7f7c9122c3b5 [term: 1272]
2017-05-19 02:11:16.072606 I | raft: c17c9a8d17bc74ea [term: 1273] ignored a MsgHeartbeatResp message with lower term from 617f02413611a25b [term: 1272]
2017-05-19 02:11:16.072619 I | raft: c17c9a8d17bc74ea [term: 1273] ignored a MsgHeartbeatResp message with lower term from 617f02413611a25b [term: 1272]
2017-05-19 02:11:16.072628 I | raft: c17c9a8d17bc74ea [term: 1273] ignored a MsgHeartbeatResp message with lower term from 617f02413611a25b [term: 1272]
2017-05-19 02:11:16.072639 I | raft: c17c9a8d17bc74ea [term: 1273] ignored a MsgAppResp message with lower term from 178b7f7c9122c3b5 [term: 1272]
2017-05-19 02:11:16.072653 I | raft: c17c9a8d17bc74ea [term: 1273] ignored a MsgAppResp message with lower term from 178b7f7c9122c3b5 [term: 1272]
2017-05-19 02:11:16.072663 I | raft: c17c9a8d17bc74ea [term: 1273] ignored a MsgHeartbeatResp message with lower term from 178b7f7c9122c3b5 [term: 1272]
2017-05-19 02:11:16.072684 I | raft: c17c9a8d17bc74ea received vote rejection from 178b7f7c9122c3b5 at term 1273
2017-05-19 02:11:16.072694 I | raft: c17c9a8d17bc74ea [quorum:2] has received 1 votes and 1 vote rejections
2017-05-19 02:11:16.072704 I | raft: c17c9a8d17bc74ea [term: 1273] ignored a MsgAppResp message with lower term from 617f02413611a25b [term: 1272]
2017-05-19 02:11:16.072718 I | raft: c17c9a8d17bc74ea [term: 1273] ignored a MsgHeartbeatResp message with lower term from 617f02413611a25b [term: 1272]
2017-05-19 02:11:16.072727 I | raft: c17c9a8d17bc74ea [term: 1273] ignored a MsgAppResp message with lower term from 617f02413611a25b [term: 1272]
2017-05-19 02:11:16.072740 I | raft: c17c9a8d17bc74ea [term: 1273] ignored a MsgHeartbeatResp message with lower term from 617f02413611a25b [term: 1272]
2017-05-19 02:11:16.072763 I | raft: c17c9a8d17bc74ea [logterm: 1272, index: 4260993216, vote: c17c9a8d17bc74ea] rejected vote from 617f02413611a25b [logterm: 1272, index: 4260993216] at term 1273
2017-05-19 02:11:16.072840 I | raft: c17c9a8d17bc74ea became follower at term 1273
2017-05-19 02:11:16.072874 I | raft: raft.node: c17c9a8d17bc74ea elected leader 617f02413611a25b at term 1273
2017-05-19 02:13:41.652462 I | wal: segmented wal file /var/etcd/data/member/wal/0000000000004157-00000000fdf9b006.wal is created
2017-05-19 02:13:47.720069 I | etcdserver: start to snapshot (applied: 4260999438, lastsnap: 4260989437)
2017-05-19 02:13:48.476944 I | etcdserver: saved snapshot at index 4260999438

@armstrongli (Author) commented May 19, 2017

2017-05-19 02:11:16.072840 I | raft: c17c9a8d17bc74ea became follower at term 1273
2017-05-19 02:11:16.072874 I | raft: raft.node: c17c9a8d17bc74ea elected leader 617f02413611a25b at term 1273

It looks like this member didn't purge its history, or there may be some error on the storage side that caused it to pick up the wrong history logs.

@armstrongli (Author)

The weirdest thing is that the leader election kept running for a long time. Master-2 should step down, since the cluster already has a leader, and follow that leader.
Instead it kept starting leader elections again and again.

@armstrongli (Author)

My bad about the leader election comment.

@armstrongli (Author)

@xiang90 Per my understanding, a leader election should happen when a member can't get heartbeats from the leader; it then converts to candidate and starts an election.
At the same time it stops serving traffic and waits for the election to finish.

If the member then learns who the cluster leader is and follows it, it should catch up with the cluster and serve traffic again.

Right now the member's status looks like it did catch up with the cluster and served traffic for some time, yet the data ended up in a chaotic state.

@xiang90 (Contributor) commented May 22, 2017

@armstrongli

Hi, if you can investigate the issue and find the root cause, that would be great! You will probably have to look into the codebase and understand how exactly etcd works before you start the investigation, though; it does not work the way you assume. Otherwise, the best option is probably to provide enough information for the etcd dev team to help.

We need the full logs from around when the bad thing happened (24 hours before and 24 hours after). Do not truncate the logs or skip log entries; raw logs would be best.

Yes, they were synced and had been running healthily for months.

How do you know they were synced? Do you have any monitoring data?

@armstrongli (Author)

@xiang90
Here are all the logs from 2017-05-18 and 2017-05-19:

etcd-unsync-logs.tar.gz

@xiang90 (Contributor) commented May 26, 2017

@armstrongli I read through the logs. I did not find anything in them that could cause data inconsistency. I suspect the data had already become inconsistent before 5.18.

I guess you noticed it once c17c9a8d17bc74ea (the bad one) became leader at 2017-05-18 14:31:26.446881; then things started to fall apart.

@armstrongli (Author)

@xiang90

(the bad one) became leader

My confusion:

  1. How do you define "the bad one"?
  2. How did this bad one come about?

The bad one is actually extremely dangerous.

@armstrongli (Author) commented Jun 1, 2017

Related issue:
#8009

@xiang90 I have successfully reproduced the unbalanced-data state of the etcd cluster. I have one test Kubernetes cluster and kept it running for months. The difference between k8s and a benchmark is that:

  • a benchmark keeps creating different keys and putting them into etcd storage
  • k8s keeps updating the same set of keys and compacting day after day (a rough sketch of this access pattern follows below)
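A minimal sketch of that k8s-like access pattern, assuming a small fixed key set plus periodic compaction (the endpoint, CA path, key name, and counts are made up for illustration):

#!/bin/sh
# Hypothetical workload: keep rewriting the same key and periodically compact
# old revisions, mimicking the k8s access pattern described above.
export ETCDCTL_API=3
EP="--endpoints=https://etcd-1:4001 --cacert=/etc/ssl/kubernetes/ca.crt"
i=0
while [ "$i" -lt 100000 ]; do
  i=$((i + 1))
  etcdctl $EP put /registry/test-key "value-$i" > /dev/null
  if [ $((i % 1000)) -eq 0 ]; then
    # Compact at the current store revision. Compaction frees pages inside
    # boltdb, but the db file itself only shrinks after a defrag.
    rev=$(etcdctl $EP get /registry/test-key -w json | grep -o '"revision":[0-9]*' | head -n1 | cut -d: -f2)
    etcdctl $EP compaction "$rev"
  fi
done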

Then I ran defrag on the cluster, and you can see the differences between before and after:

/ # etcdctl --endpoints [$(echo $member_urls | sed -e "s/ /,/g")] --cacert=/etc/ssl/kubernetes/ca.crt endpoint status -w table;
+---------------------------------------------------------+------------------+---------+---------+-----------+-----------+------------+
|                        ENDPOINT                         |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+---------------------------------------------------------+------------------+---------+---------+-----------+-----------+------------+
| https://kubernetes-master-46swx-1781998.19.tess.io:4001 | 43a1252010cf94a4 | 3.0.15  | 340 MB  | false     |      1368 |  317106015 |
| https://kubernetes-master-q0f4f-1783564.19.tess.io:4001 | 7856ae8f8e27e912 | 3.0.15  | 12 GB   | false     |      1368 |  317106015 |
| https://kubernetes-master-1-1728078.19.tess.io:4001     | 88d3f79cc39e44de | 3.0.15  | 14 GB   | false     |      1368 |  317106015 |
| https://kubernetes-master-md9kf-1783616.19.tess.io:4001 | ed565574a43c804e | 3.0.15  | 340 MB  | true      |      1368 |  317106016 |
+---------------------------------------------------------+------------------+---------+---------+-----------+-----------+------------+
/ # etcdctl --endpoints [$(echo $member_urls | sed -e "s/ /,/g")] --cacert=/etc/ssl/kubernetes/ca.crt defrag;
Finished defragmenting etcd member[https://kubernetes-master-46swx-1781998.19.tess.io:4001]
Failed to defragment etcd member[https://kubernetes-master-q0f4f-1783564.19.tess.io:4001] (context deadline exceeded)
Finished defragmenting etcd member[https://kubernetes-master-1-1728078.19.tess.io:4001]
Failed to defragment etcd member[https://kubernetes-master-md9kf-1783616.19.tess.io:4001] (context deadline exceeded)
/ # etcdctl --endpoints [$(echo $member_urls | sed -e "s/ /,/g")] --cacert=/etc/ssl/kubernetes/ca.crt endpoint status -w table;
+---------------------------------------------------------+------------------+---------+---------+-----------+-----------+------------+
|                        ENDPOINT                         |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+---------------------------------------------------------+------------------+---------+---------+-----------+-----------+------------+
| https://kubernetes-master-46swx-1781998.19.tess.io:4001 | 43a1252010cf94a4 | 3.0.15  | 14 MB   | false     |      1368 |  317106529 |
| https://kubernetes-master-q0f4f-1783564.19.tess.io:4001 | 7856ae8f8e27e912 | 3.0.15  | 12 GB   | false     |      1368 |  317106529 |
| https://kubernetes-master-1-1728078.19.tess.io:4001     | 88d3f79cc39e44de | 3.0.15  | 14 MB   | false     |      1368 |  317106529 |
| https://kubernetes-master-md9kf-1783616.19.tess.io:4001 | ed565574a43c804e | 3.0.15  | 340 MB  | true      |      1368 |  317106529 |
+---------------------------------------------------------+------------------+---------+---------+-----------+-----------+------------+

After defragmenting all the members (re-running defrag with a longer command timeout for the ones that had timed out), they all have the same data size:

/ # etcdctl --endpoints [$(echo $member_urls | sed -e "s/ /,/g")] --cacert=/etc/ssl/kubernetes/ca.crt endpoint status -w table;
+---------------------------------------------------------+------------------+---------+---------+-----------+-----------+------------+
|                        ENDPOINT                         |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+---------------------------------------------------------+------------------+---------+---------+-----------+-----------+------------+
| https://kubernetes-master-46swx-1781998.19.tess.io:4001 | 43a1252010cf94a4 | 3.0.15  | 14 MB   | false     |      1368 |  317106576 |
| https://kubernetes-master-q0f4f-1783564.19.tess.io:4001 | 7856ae8f8e27e912 | 3.0.15  | 12 GB   | false     |      1368 |  317106576 |
| https://kubernetes-master-1-1728078.19.tess.io:4001     | 88d3f79cc39e44de | 3.0.15  | 14 MB   | false     |      1368 |  317106576 |
| https://kubernetes-master-md9kf-1783616.19.tess.io:4001 | ed565574a43c804e | 3.0.15  | 340 MB  | true      |      1368 |  317106576 |
+---------------------------------------------------------+------------------+---------+---------+-----------+-----------+------------+
/ # etcdctl --endpoints [$(echo $member_urls | sed -e "s/ /,/g")] --cacert=/etc/ssl/kubernetes/ca.crt defrag --command-timeout=20s;
Finished defragmenting etcd member[https://kubernetes-master-46swx-1781998.19.tess.io:4001]
Finished defragmenting etcd member[https://kubernetes-master-q0f4f-1783564.19.tess.io:4001]
Finished defragmenting etcd member[https://kubernetes-master-1-1728078.19.tess.io:4001]
Finished defragmenting etcd member[https://kubernetes-master-md9kf-1783616.19.tess.io:4001]
/ # etcdctl --endpoints [$(echo $member_urls | sed -e "s/ /,/g")] --cacert=/etc/ssl/kubernetes/ca.crt endpoint status -w table;
+---------------------------------------------------------+------------------+---------+---------+-----------+-----------+------------+
|                        ENDPOINT                         |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+---------------------------------------------------------+------------------+---------+---------+-----------+-----------+------------+
| https://kubernetes-master-46swx-1781998.19.tess.io:4001 | 43a1252010cf94a4 | 3.0.15  | 14 MB   | false     |      1368 |  317111619 |
| https://kubernetes-master-q0f4f-1783564.19.tess.io:4001 | 7856ae8f8e27e912 | 3.0.15  | 14 MB   | false     |      1368 |  317111619 |
| https://kubernetes-master-1-1728078.19.tess.io:4001     | 88d3f79cc39e44de | 3.0.15  | 14 MB   | false     |      1368 |  317111619 |
| https://kubernetes-master-md9kf-1783616.19.tess.io:4001 | ed565574a43c804e | 3.0.15  | 14 MB   | true      |      1368 |  317111619 |
+---------------------------------------------------------+------------------+---------+---------+-----------+-----------+------------+
/ #

The snapshot sizes also span a large range:

bash-4.3# cd backup/
bash-4.3# du -hd 1
325M	./1496242169341370159
325M	./1496263769339524487
325M	./1496285369332812870
14M	./1496306969333338979
987M	.

@xiang90 (Contributor) commented Jun 1, 2017

@xiang90 I have successfully reproduced the unbalanced-data state of the etcd cluster.

Can you provide the steps to reproduce it, so that we can also reproduce it in our environment?

@armstrongli (Author)

I compared two snapshots, one taken before defrag and one after (after the defrag plus ~3 days of running). There are a lot of free objects in the snapshot file:

snapshot file        size  branches  meta  leaf   freelist  free
1496138366062086842  1.7G  242       3     16343  1         389989
1496748109064920008  194M  259       2     17161  1         17436

It's obvious from the page types reported by bolt pages that there are a lot of free pages ...
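For reference, a page-type breakdown like the one above can be gathered with boltdb's bolt command-line tool, run against a copy of the db file rather than one a live member is using (a sketch; the paths are placeholders):

# Hypothetical inspection of an etcd backend file with the bolt CLI.
go get github.com/boltdb/bolt/...   # installs the `bolt` tool (GOPATH-era)
bolt pages /backup/snapshot.db | awk 'NR>2 {print $2}' | sort | uniq -c   # count pages per type
bolt stats /backup/snapshot.db      # aggregate page/bucket statistics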

@armstrongli (Author)

I thought about how boltdb stores objects: it defines buckets, fills the buckets with values, and grows the file once it runs out of space.

My guess (I haven't reviewed the code details): after compaction deletes the old revisions, boltdb doesn't reuse the freed pages (blocks) and keeps allocating more space to place new data, especially because the data is not solid like stone; it's porous like bread.

@xiang90 (Contributor) commented Jun 6, 2017

@armstrongli We want to understand how your db size became unbalanced among members. Large db size is another issue; it is relevant, but not what we are discussing here. Again, we need a reproduction from you to confirm exactly what happened in your environment.

@xiang90 (Contributor) commented Jun 14, 2017

@armstrongli kindly ping.

@armstrongli (Author)

@xiang90 I have been running benchmarks on our cluster, and I can't reproduce it that way. Just keeping Kubernetes running for months is how it happens.

What's more, the db size can exceed the 8 GB quota setting.
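For context, a sketch of the backend quota and how to check whether the space quota was tripped (the 8 GiB value and endpoint are placeholders for whatever this cluster actually uses):

# Hypothetical server flag: cap the backend size at 8 GiB.
etcd --quota-backend-bytes=8589934592 ...

# Check whether any member raised a NOSPACE alarm, and clear it after
# compacting and defragmenting.
ETCDCTL_API=3 etcdctl --endpoints=https://etcd-1:4001 alarm list
ETCDCTL_API=3 etcdctl --endpoints=https://etcd-1:4001 alarm disarm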

@xiang90 (Contributor) commented Jun 16, 2017

Just keeping Kubernetes running for months is how it happens.

Well, maybe k8s was doing something strange, or something else happened to the machine; we do not know, and there is no way for us to help you debug it.

I do not really believe this is an etcd issue. If it is, it is probably a dup of #8009.

xiang90 closed this as completed Jun 16, 2017
@gjw0317 commented Jun 21, 2017

@xiang90 I've had the same problem lately, after I migrated from v2 and updated Kubernetes from 1.4.7 to 1.6.4. There are many anomalies: the etcd data grew very fast and became unbalanced across members, and the data reverted to an older version after I added the startup parameter --quota-backend-bytes=5368709120 and restarted the etcd service. Because this problem leads to periodic problems in the Kubernetes cluster, the impact is very large.

I've read a few related issues: #8009, #45037

@StrongPa55word

I am facing a similar issue. Is there any permanent fix for this?

@xiang90 (Contributor) commented Jun 23, 2017

@gjw0317 @titotp

As I mentioned, this issue is probably a dup of #8009. I closed this one since it is messy and has no reproduction steps or useful information.

Please follow up on #8009.

Or if you can reproduce your issue and believe it is different from #8009, please create a new one.

Please do not continue to comment on this closed issue.

etcd-io locked and limited conversation to collaborators Jun 23, 2017