
Commit is out of range #8935

Closed · wojtek-t opened this issue Nov 29, 2017 · 21 comments

@wojtek-t (Contributor)

We've observed a problem with clustered etcd: after recreating one of the instances, it enters a crash loop, panicking with:

2017-11-29 05:47:07.905742 I | etcdserver: advertise client URLs = http://127.0.0.1:2379 
2017-11-29 05:47:08.076040 I | etcdserver: restarting member 9af8ed310dba8214 in cluster 488ce3f49fab5e77 at commit index 20707 
2017-11-29 05:47:08.077815 C | raft: 9af8ed310dba8214 state.commit 20707 is out of range [10001, 11535] 
panic: 9af8ed310dba8214 state.commit 20707 is out of range [10001, 11535] 
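
For reference, this panic is raft's HardState sanity check at restart: the commit index recovered from the WAL must fall between the commit already covered by the on-disk snapshot and the last entry actually present in the WAL. A minimal standalone sketch (not etcd code) of that check, using the numbers from the log above:

package main

import "fmt"

func main() {
	// Bounds reported in the panic above: the commit already covered by the
	// on-disk snapshot and the index of the last entry found in the WAL.
	committedFromSnapshot := uint64(10001)
	lastWALIndex := uint64(11535)

	// Commit index recorded in the WAL's HardState for this member.
	hardStateCommit := uint64(20707)

	if hardStateCommit < committedFromSnapshot || hardStateCommit > lastWALIndex {
		// raft panics here: the persisted state claims more entries were
		// committed than the snapshot and WAL on disk can account for.
		fmt.Printf("state.commit %d is out of range [%d, %d]\n",
			hardStateCommit, committedFromSnapshot, lastWALIndex)
	}
}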

What exactly is happening is that we are recreating the VM where etcd is running (without calling remove/add member), preserving all the addresses, etcd data, etc. So after recreation, etcd is started as the old member (as if it had simply been restarted, though with a somewhat longer downtime).

We are running version 3.0.17.

I found #5664, which seems to be a similar (possibly the same?) problem. Has this been fixed? If so, in which release?

@xiang90 @gyuho @heyitsanthony @jpbetz

@xiang90 (Contributor) commented Nov 29, 2017

What exactly is happening is that we are recreating the VM where etcd is running (without calling remove/add member), preserving all the addresses, etcd data, etc. So after recreation, etcd is started as the old member (as if it had simply been restarted, though with a somewhat longer downtime).

Is the old etcd stopped before the new one is started? What is the exact sequence of events?

Also, can you provide the full log instead of just the three lines before the panic?

@wojtek-t (Contributor, Author)

Yes - it is stopped. The sequence is:

  • all 3 instances of etcd are running
  • one of them is stopped (by deleting the VM on which it is running)
  • the VM is recreated (the PersistentDisk where the etcd data lives is attached)
  • etcd is started

Also, can you provide the full log instead of just the three lines before the panic?

Yeah sorry.

@xiang90 (Contributor) commented Nov 29, 2017 via email

@xiang90 (Contributor) commented Nov 29, 2017

@wojtek-t

How large is each request you sent to etcd? Are you able to share the broken WAL file? Are you able to reproduce this problem every time you perform the sequence of operations you listed?

@wojtek-t (Contributor, Author)

No - it's not reliably reproducible. We've seen it twice in the last few days in a continuously running test suite.
Unfortunately, we no longer have the WAL file.
@mborsz - if we observe it again, we may want to try capturing this, if possible.

How large is each request you sent to etcd?

These are coming from our internal k8s tests, but the cluster is pretty small (3 nodes IIRC), so none of the requests are very big.

@xiang90 (Contributor) commented Nov 29, 2017

@wojtek-t

OK. Also, can you try the latest 3.2 release to see if you can reproduce it?

Thanks.

jpbetz self-assigned this Nov 29, 2017

@jpbetz (Contributor) commented Nov 29, 2017

@wojtek-t I'll start an investigation of this shortly.

@wojtek-t (Contributor, Author) commented Dec 3, 2017

And I will try running with 3.1, and potentially also 3.2, to see whether either of those fixes it.

@wojtek-t (Contributor, Author) commented Dec 8, 2017

FYI - we have changed our tests to run etcd version 3.1.11. So far it looks promising, but we need a bit more time to validate it.

@xiang90 (Contributor) commented Dec 8, 2017

@jpbetz @wojtek-t

We should probably move your test upstream. That will make sure etcd won't introduce a regression that breaks it.

It is fine for the test case to be specific to the GCP environment. All our e2e functional tests run against GCP: http://dash.etcd.io/dashboard/db/functional-tests?orgId=1.

@SaranBalaji90 from AWS will set up the test in an AWS environment. Azure might want to do the same.

@xiang90 (Contributor) commented Dec 11, 2017

FYI - we have changed our tests to run etcd version 3.1.11.

I would be surprised if etcd 3.1.11 fixed the root cause. I feel the root cause is somewhere in the raft package, and it is a pretty subtle one.

We should probably still spend some time understanding why it happens under 3.0.17 and, if 3.1.x fixes it, which commit exactly fixed it.

/cc @jpbetz @wojtek-t

@wojtek-t (Contributor, Author)

Definitely - Joe already said that he will be looking into that.

@jpbetz (Contributor) commented Dec 11, 2017 via email

@zrss commented Dec 12, 2017

I encountered a similar case, but it was due to no space left on the device, and the etcd instance holds both v2 and v3 data.

etcd version

2017-12-12 11:34:20.270243 I | etcdmain: etcd Version: 3.1.9

Reproduction

It was produced by the following steps:

I ran a three-node (HA) etcd cluster (node-1/node-2/node-3). Unfortunately, after a long time, node-1 ran out of disk space, while the remaining two etcd members kept running. I then cleared some useless data to free disk space, but the freed space still could not meet etcd's needs, since node-1 would receive a snapshot from the leader (its data had fallen far behind the rest of the cluster). Then things got worse.

The following logs show what node-1's etcd did after I cleared some data: it received a snapshot from the leader and restored commitIndex to 116658132, but after that, saving the snapshot to the .snap file failed because the disk was full again, since I had not freed enough space for etcd.

node-1 etcd logs

2017-12-12 10:42:28.621858 I | rafthttp: receiving database snapshot [index:116658055, from 544c9d165e070a18] ...
2017-12-12 10:42:30.253747 E | rafthttp: failed to save KV snapshot (write /var/etcd-data/etcd/etcd-0/member/snap/tmp674225970: no space left on device)
2017-12-12 10:42:30.545319 I | rafthttp: receiving database snapshot [index:116658101, from 544c9d165e070a18] ...
2017-12-12 10:42:31.938653 E | rafthttp: failed to save KV snapshot (write /var/etcd-data/etcd/etcd-0/member/snap/tmp632108265: no space left on device)
2017-12-12 10:42:32.124505 I | rafthttp: receiving database snapshot [index:116658132, from 544c9d165e070a18] ...
2017-12-12 10:42:34.587907 I | snap: saved database snapshot to disk [total bytes: 47972352]
2017-12-12 10:42:34.587937 I | rafthttp: received and saved database snapshot [index: 116658132, from: 544c9d165e070a18] successfully
2017-12-12 10:42:34.588027 I | raft: e84ebc30798eec96 [commit: 113988062, lastindex: 113988062, lastterm: 28] starts to restore snapshot [index: 116658132, term: 28]
2017-12-12 10:42:34.588106 I | raft: e84ebc30798eec96 [commit: 116658132] restored snapshot [index: 116658132, term: 28]
2017-12-12 10:42:34.589685 I | etcdserver: applying snapshot at index 113979383...
2017-12-12 10:42:34.597730 I | etcdserver: recovering lessor...
2017-12-12 10:42:34.602677 C | etcdserver: raft save snapshot error: write /var/etcd-data/etcd/etcd-0/member/snap/000000000000001c-0000000006f40fd4.snap: no space left on device

However, I found that this code segment shows that writing the WAL snapshot record and saving the snapshot file is not an atomic operation:

// SaveSnap saves the snapshot to disk and release the locked
// wal files since they will not be used.
func (st *storage) SaveSnap(snap raftpb.Snapshot) error {
	walsnap := walpb.Snapshot{
		Index: snap.Metadata.Index,
		Term:  snap.Metadata.Term,
	}
	err := st.WAL.SaveSnapshot(walsnap)
	if err != nil {
		return err
	}
	err = st.Snapshotter.SaveSnap(snap)
	if err != nil {
		return err
	}
	return st.WAL.ReleaseLockTo(snap.Metadata.Index)
}
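
To make the window concrete, here is a minimal standalone sketch (not etcd code; the two functions are only stand-ins for the WAL.SaveSnapshot and Snapshotter.SaveSnap calls above) of what is left on disk when the second step fails with ENOSPC:

package main

import (
	"errors"
	"fmt"
)

var errNoSpace = errors.New("no space left on device")

// Stand-in for WAL.SaveSnapshot: the snapshot index (and hence the commit)
// is durably recorded in the WAL and succeeds.
func saveWALSnapshotRecord(index uint64) error {
	fmt.Printf("WAL now records snapshot/commit index %d\n", index)
	return nil
}

// Stand-in for Snapshotter.SaveSnap: writing the .snap file fails because
// the disk filled up again.
func saveSnapFile(index uint64) error {
	return errNoSpace
}

func main() {
	const snapIndex = 116658132

	if err := saveWALSnapshotRecord(snapIndex); err != nil {
		return
	}
	if err := saveSnapFile(snapIndex); err != nil {
		// At this point the WAL claims commit 116658132, while the newest
		// usable state on disk only reaches entry 113988062 -- exactly the
		// mismatch reported by the panic on restart.
		fmt.Println("snapshot save failed:", err)
	}
}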

At this point, I dumped the etcd WAL:

node-1 etcd wal dump logs

Snapshot:
term=28 index=113979383 nodes=[544c9d165e070a18 bb4d3bda10d567f8 e84ebc30798eec96]
Start dupmping log entries from snapshot.
WAL metadata:
nodeID=e84ebc30798eec96 clusterID=1e81083acbac81f6 term=28 commitIndex=116658132 vote=0
WAL entries:
lastIndex=113988062

It can be seen that commitIndex has been set to 116658132, but, as the node-1 logs above show, saving the snapshot failed, so node-1's etcd panics when it is restarted:

2017-12-12 11:34:20.499718 I | etcdserver: restarting member e84ebc30798eec96 in cluster 1e81083acbac81f6 at commit index 116658132
2017-12-12 11:34:20.500412 C | raft: e84ebc30798eec96 state.commit 116658132 is out of range [113979383, 113988062]
panic: e84ebc30798eec96 state.commit 116658132 is out of range [113979383, 113988062]

Could this be a cause of this problem, or am I misunderstanding something about my case?

@xiang90 (Contributor) commented Feb 5, 2018

@zrss

I do not think your issue is the same as the one originally reported here. You hit an ENOSPC, but this issue did not.

@wojtek-t @jpbetz any progress on investigating the root cause of this issue?

@wojtek-t (Contributor, Author) commented Feb 5, 2018

@xiang90 - Joe wanted to look into that. I can only say that the 3.1.11 release fixes this issue for us.

@xiang90 (Contributor) commented Feb 5, 2018

@wojtek-t

I feel 3.1 probably includes something that mitigates the issue rather than fixing it. I am concerned that the root cause still exists and that you will eventually hit this bug again after a longer run.

@youyiprime
Hello guys! Does anyone know which commit solves or mitigates the issue in 3.1.11? Any suggestions?

@jpbetz (Contributor) commented Jun 15, 2018

The fix is in bbolt, coreos/bbolt v1.3.1-coreos.5. The main summary is in #8813 (comment), and the backport to 3.1 is #8902.

jpbetz closed this as completed Jun 15, 2018
@shsjshentao (Contributor) commented Jun 22, 2018

So, does anyone know how to fix this in 3.0.17 when this problem happens? Any workarounds? I tried removing the member and adding it again, but that does not work and shows the same error.

@shsjshentao (Contributor)

Sorry - for some unknown reason, the etcd data directory on the crashed node had not actually been removed, so it was still crashing.
I tried the commands again: remove the member, remove the crashed node's etcd data directory, add the member back, and start etcd. That deals with this problem across multiple nodes.
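
For anyone else hitting this on an older release, here is a rough sketch of that member-replacement sequence using the Go clientv3 API (the endpoints, peer URL, and member ID below are placeholders; the crashed member's data directory still has to be wiped on its host before restarting etcd there with --initial-cluster-state=existing):

package main

import (
	"context"
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	// Connect via the healthy members (placeholder endpoints).
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://10.0.0.2:2379", "http://10.0.0.3:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// 1. Remove the crashed member; its ID can be looked up with MemberList.
	crashedID := uint64(0x9af8ed310dba8214) // placeholder ID
	if _, err := cli.MemberRemove(ctx, crashedID); err != nil {
		log.Fatal(err)
	}

	// 2. Add it back under its peer URL (placeholder). After this, wipe the
	//    old data directory on that host and start etcd there with
	//    --initial-cluster-state=existing.
	if _, err := cli.MemberAdd(ctx, []string{"http://10.0.0.1:2380"}); err != nil {
		log.Fatal(err)
	}
}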
