
Commit is out of range #8935

Closed · wojtek-t opened this issue Nov 29, 2017 · 21 comments

@wojtek-t (Contributor)

We've observed a problem with clustered etcd: after recreating one of the instances, it enters a crash loop, panicking with:

2017-11-29 05:47:07.905742 I | etcdserver: advertise client URLs = http://127.0.0.1:2379 
2017-11-29 05:47:08.076040 I | etcdserver: restarting member 9af8ed310dba8214 in cluster 488ce3f49fab5e77 at commit index 20707 
2017-11-29 05:47:08.077815 C | raft: 9af8ed310dba8214 state.commit 20707 is out of range [10001, 11535] 
panic: 9af8ed310dba8214 state.commit 20707 is out of range [10001, 11535] 
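
For reference, this panic is raft's HardState sanity check at restart: the commit index recovered from the WAL must fall between the commit already covered by the on-disk snapshot and the last entry actually present in the WAL. A minimal standalone sketch (not etcd code) of that check, using the numbers from the log above:

package main

import "fmt"

func main() {
	// Bounds reported in the panic above: the commit already covered by the
	// on-disk snapshot and the index of the last entry found in the WAL.
	committedFromSnapshot := uint64(10001)
	lastWALIndex := uint64(11535)

	// Commit index recorded in the WAL's HardState for this member.
	hardStateCommit := uint64(20707)

	if hardStateCommit < committedFromSnapshot || hardStateCommit > lastWALIndex {
		// raft panics here: the persisted state claims more entries were
		// committed than the snapshot and WAL on disk can account for.
		fmt.Printf("state.commit %d is out of range [%d, %d]\n",
			hardStateCommit, committedFromSnapshot, lastWALIndex)
	}
}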

What exactly is happening is that we are recreating the VM where etcd is running (without calling remove/add member), preserving all the addresses, etcd data, etc. So after recreation, etcd is started as the old member (as if it had simply been restarted, though with a somewhat longer downtime).

We are running version 3.0.17.

I found #5664, which seems to be a similar (possibly the same?) problem. Has this been fixed? If so, in which release?

@xiang90 @gyuho @heyitsanthony @jpbetz

@xiang90 (Contributor) commented Nov 29, 2017

What exactly is happening is that we are recreating the VM where etcd is running (without calling remove/add member), preserving all the addresses, etcd data, etc. So after recreation, etcd is started as the old member (as if it had simply been restarted, though with a somewhat longer downtime).

Is the old etcd stopped before the new one is started? What is the exact sequence of events?

Also, can you provide the full log instead of just the three lines before the panic?

@wojtek-t (Contributor, Author)

Yes - it is stopped. The sequence is:

  • all 3 instances of etcd are running
  • one of them is stopped (by deleting the VM on which it is running)
  • the VM is recreated (the PersistentDisk where the etcd data lives is attached)
  • etcd is started

Also, can you provide the full log instead of just the three lines before the panic?

Yeah sorry.

@xiang90 (Contributor) commented Nov 29, 2017 via email

@xiang90 (Contributor) commented Nov 29, 2017

@wojtek-t

How large is each request you sent to etcd? Are you able to share the broken WAL file? Are you able to reproduce this problem every time you perform the sequence of operations you listed?

@wojtek-t (Contributor, Author)

No - it's not reliably reproducible. We've seen it twice in the last few days in a continuously running test suite.
Unfortunately, we no longer have the WAL file.
@mborsz - if we observe it again, we may want to try capturing this, if possible.

How large is each request you sent to etcd?

These are coming from our internal k8s tests, but the cluster is pretty small (3 nodes IIRC), so none of the requests are very big.

@xiang90 (Contributor) commented Nov 29, 2017

@wojtek-t

OK. Also, can you try the latest 3.2 release to see if you can reproduce it?

Thanks.

jpbetz self-assigned this Nov 29, 2017

@jpbetz (Contributor) commented Nov 29, 2017

@wojtek-t I'll start an investigation of this shortly.

@wojtek-t (Contributor, Author) commented Dec 3, 2017

And I will try running with 3.1, and potentially also 3.2, to see whether either of those fixes it.

@wojtek-t (Contributor, Author) commented Dec 8, 2017

FYI - we have changed our tests to run etcd version 3.1.11. So far it looks promising, but we need a bit more time to validate it.

@xiang90 (Contributor) commented Dec 8, 2017

@jpbetz @wojtek-t

We should probably move your test upstream. That will make sure etcd won't introduce a regression that breaks it.

It is fine for the test case to be specific to the GCP environment. All our e2e functional tests run against GCP: http://dash.etcd.io/dashboard/db/functional-tests?orgId=1.

@SaranBalaji90 from AWS will set up the test in an AWS environment. Azure might want to do the same.

@xiang90 (Contributor) commented Dec 11, 2017

FYI - we have changed our tests to run etcd version 3.1.11.

I would be surprised if etcd 3.1.11 fixed the root cause. I feel the root cause is somewhere in the raft package, and it is a pretty subtle one.

We should probably still spend some time understanding why it happens under 3.0.17 and, if 3.1.x fixes it, which commit exactly fixed it.

/cc @jpbetz @wojtek-t

@wojtek-t (Contributor, Author)

Definitely - Joe already said that he will be looking into that.

@jpbetz (Contributor) commented Dec 11, 2017 via email

@zrss commented Dec 12, 2017

I encountered a similar case, but it was due to no space left on the device, and the etcd instance holds both v2 and v3 data.

etcd version

2017-12-12 11:34:20.270243 I | etcdmain: etcd Version: 3.1.9

Reproduction

It was produced by the following steps:

I ran a three-node (HA) etcd cluster (node-1/node-2/node-3). Unfortunately, after a long time, node-1 ran out of disk space, while the remaining two etcd members kept running. I then cleared some useless data to free disk space, but the freed space still could not meet etcd's needs, since node-1 would receive a snapshot from the leader (its data had fallen far behind the rest of the cluster). Then things got worse.

The following logs show what node-1's etcd did after I cleared some data: it received a snapshot from the leader and restored commitIndex to 116658132, but after that, saving the snapshot to the .snap file failed because the disk was full again, since I had not freed enough space for etcd.

node-1 etcd logs

2017-12-12 10:42:28.621858 I | rafthttp: receiving database snapshot [index:116658055, from 544c9d165e070a18] ...
2017-12-12 10:42:30.253747 E | rafthttp: failed to save KV snapshot (write /var/etcd-data/etcd/etcd-0/member/snap/tmp674225970: no space left on device)
2017-12-12 10:42:30.545319 I | rafthttp: receiving database snapshot [index:116658101, from 544c9d165e070a18] ...
2017-12-12 10:42:31.938653 E | rafthttp: failed to save KV snapshot (write /var/etcd-data/etcd/etcd-0/member/snap/tmp632108265: no space left on device)
2017-12-12 10:42:32.124505 I | rafthttp: receiving database snapshot [index:116658132, from 544c9d165e070a18] ...
2017-12-12 10:42:34.587907 I | snap: saved database snapshot to disk [total bytes: 47972352]
2017-12-12 10:42:34.587937 I | rafthttp: received and saved database snapshot [index: 116658132, from: 544c9d165e070a18] successfully
2017-12-12 10:42:34.588027 I | raft: e84ebc30798eec96 [commit: 113988062, lastindex: 113988062, lastterm: 28] starts to restore snapshot [index: 116658132, term: 28]
2017-12-12 10:42:34.588106 I | raft: e84ebc30798eec96 [commit: 116658132] restored snapshot [index: 116658132, term: 28]
2017-12-12 10:42:34.589685 I | etcdserver: applying snapshot at index 113979383...
2017-12-12 10:42:34.597730 I | etcdserver: recovering lessor...
2017-12-12 10:42:34.602677 C | etcdserver: raft save snapshot error: write /var/etcd-data/etcd/etcd-0/member/snap/000000000000001c-0000000006f40fd4.snap: no space left on device

However, I found that this code segment shows that writing the WAL snapshot record and saving the snapshot file is not an atomic operation:

// SaveSnap saves the snapshot to disk and release the locked
// wal files since they will not be used.
func (st *storage) SaveSnap(snap raftpb.Snapshot) error {
	walsnap := walpb.Snapshot{
		Index: snap.Metadata.Index,
		Term:  snap.Metadata.Term,
	}
	err := st.WAL.SaveSnapshot(walsnap)
	if err != nil {
		return err
	}
	err = st.Snapshotter.SaveSnap(snap)
	if err != nil {
		return err
	}
	return st.WAL.ReleaseLockTo(snap.Metadata.Index)
}
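
To make the window concrete, here is a minimal standalone sketch (not etcd code; the two functions are only stand-ins for the WAL.SaveSnapshot and Snapshotter.SaveSnap calls above) of what is left on disk when the second step fails with ENOSPC:

package main

import (
	"errors"
	"fmt"
)

var errNoSpace = errors.New("no space left on device")

// Stand-in for WAL.SaveSnapshot: the snapshot index (and hence the commit)
// is durably recorded in the WAL and succeeds.
func saveWALSnapshotRecord(index uint64) error {
	fmt.Printf("WAL now records snapshot/commit index %d\n", index)
	return nil
}

// Stand-in for Snapshotter.SaveSnap: writing the .snap file fails because
// the disk filled up again.
func saveSnapFile(index uint64) error {
	return errNoSpace
}

func main() {
	const snapIndex = 116658132

	if err := saveWALSnapshotRecord(snapIndex); err != nil {
		return
	}
	if err := saveSnapFile(snapIndex); err != nil {
		// At this point the WAL claims commit 116658132, while the newest
		// usable state on disk only reaches entry 113988062 -- exactly the
		// mismatch reported by the panic on restart.
		fmt.Println("snapshot save failed:", err)
	}
}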

At this point, I dumped the etcd WAL:

node-1 etcd wal dump logs

Snapshot:
term=28 index=113979383 nodes=[544c9d165e070a18 bb4d3bda10d567f8 e84ebc30798eec96]
Start dupmping log entries from snapshot.
WAL metadata:
nodeID=e84ebc30798eec96 clusterID=1e81083acbac81f6 term=28 commitIndex=116658132 vote=0
WAL entries:
lastIndex=113988062

It can be seen that commitIndex has been set to 116658132, but, as the node-1 logs above show, saving the snapshot failed, so node-1's etcd panics when it is restarted:

2017-12-12 11:34:20.499718 I | etcdserver: restarting member e84ebc30798eec96 in cluster 1e81083acbac81f6 at commit index 116658132
2017-12-12 11:34:20.500412 C | raft: e84ebc30798eec96 state.commit 116658132 is out of range [113979383, 113988062]
panic: e84ebc30798eec96 state.commit 116658132 is out of range [113979383, 113988062]

Could this be a cause of this problem, or am I misunderstanding something about my case?

@xiang90 (Contributor) commented Feb 5, 2018

@zrss

I do not think your issue is the same as the one originally reported here. You hit an ENOSPC, but this issue did not.

@wojtek-t @jpbetz any progress on investigating the root cause of this issue?

@wojtek-t (Contributor, Author) commented Feb 5, 2018

@xiang90 - Joe wanted to look into that. I can only say that the 3.1.11 release fixes this issue for us.

@xiang90 (Contributor) commented Feb 5, 2018

@wojtek-t

I feel 3.1 probably includes something that mitigates the issue rather than fixing it. I am concerned that the root cause still exists and that you will eventually hit this bug again after a longer run.

@youyiprime
Hello guys! Does anyone know which commit solves or mitigates the issue in 3.1.11? Any suggestions?

@jpbetz (Contributor) commented Jun 15, 2018

The fix is in bbolt, coreos/bbolt v1.3.1-coreos.5. The main summary is in #8813 (comment), and the backport to 3.1 is #8902.

jpbetz closed this as completed Jun 15, 2018
@shsjshentao (Contributor) commented Jun 22, 2018

So, does anyone know how to fix this in 3.0.17 when this problem happens? Any workarounds? I tried removing the member and adding it again, but that does not work and shows the same error.

@shsjshentao (Contributor)

Sorry - for some unknown reason, the etcd data directory on the crashed node had not actually been removed, so it was still crashing.
I tried the commands again: remove the member, remove the crashed node's etcd data directory, add the member back, and start etcd. That deals with this problem across multiple nodes.
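
For anyone else hitting this on an older release, here is a rough sketch of that member-replacement sequence using the Go clientv3 API (the endpoints, peer URL, and member ID below are placeholders; the crashed member's data directory still has to be wiped on its host before restarting etcd there with --initial-cluster-state=existing):

package main

import (
	"context"
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	// Connect via the healthy members (placeholder endpoints).
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://10.0.0.2:2379", "http://10.0.0.3:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// 1. Remove the crashed member; its ID can be looked up with MemberList.
	crashedID := uint64(0x9af8ed310dba8214) // placeholder ID
	if _, err := cli.MemberRemove(ctx, crashedID); err != nil {
		log.Fatal(err)
	}

	// 2. Add it back under its peer URL (placeholder). After this, wipe the
	//    old data directory on that host and start etcd there with
	//    --initial-cluster-state=existing.
	if _, err := cli.MemberAdd(ctx, []string{"http://10.0.0.1:2380"}); err != nil {
		log.Fatal(err)
	}
}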
