panic: tocommit(691243) is out of range [lastIndex(691183)] causes CrashLoopBackOff #30260

Closed
cgebe opened this issue Sep 16, 2018 · 6 comments
Labels: A-kv-replication (Relating to Raft, consensus, and coordination) · C-investigation (Further steps needed to qualify; C-label will change) · O-community (Originated from the community) · S-1-stability (Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting)

Comments

cgebe commented Sep 16, 2018

Is this a bug report or a feature request?

Bug report

  • Which version of CockroachDB are you using?

v2.1.0-beta.20180910

  • What did you do?

Followed #28487 and #28487 (comment): set kv.raft_log.synchronize = false.
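(For reference, this would presumably have been applied with the cluster-setting statement SET CLUSTER SETTING kv.raft_log.synchronize = false.)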

  • What did you expect to see?

The node that panicked to come back up again.

  • What did you see instead?

Pod crashes over and over again.

panic: tocommit(691243) is out of range [lastIndex(691183)]. Was the raft log corrupted, truncated, or lost? [recovered]
	panic: tocommit(691243) is out of range [lastIndex(691183)]. Was the raft log corrupted, truncated, or lost?

goroutine 276 [running]:
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).Recover(0xc420568480, 0x2f791c0, 0xc42592aea0)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:184 +0x11f
panic(0x25e9200, 0xc4266a2030)
	/usr/local/go/src/runtime/panic.go:502 +0x229
github.com/cockroachdb/cockroach/pkg/storage.(*raftLogger).Panicf(0xc4258841c0, 0x2ad85a3, 0x5d, 0xc429300060, 0x2, 0x2)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/raft.go:106 +0xfe
github.com/cockroachdb/cockroach/vendor/github.com/coreos/etcd/raft.(*raftLog).commitTo(0xc4293020e0, 0xa8c2b)
	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/coreos/etcd/raft/log.go:201 +0x15c
github.com/cockroachdb/cockroach/vendor/github.com/coreos/etcd/raft.(*raft).handleHeartbeat(0xc4200e1200, 0x8, 0x4, 0x2, 0x541e, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/coreos/etcd/raft/raft.go:1265 +0x54
github.com/cockroachdb/cockroach/vendor/github.com/coreos/etcd/raft.stepFollower(0xc4200e1200, 0x8, 0x4, 0x2, 0x541e, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/coreos/etcd/raft/raft.go:1210 +0x44c
github.com/cockroachdb/cockroach/vendor/github.com/coreos/etcd/raft.(*raft).Step(0xc4200e1200, 0x8, 0x4, 0x2, 0x541e, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/coreos/etcd/raft/raft.go:921 +0x1308
github.com/cockroachdb/cockroach/vendor/github.com/coreos/etcd/raft.(*RawNode).Step(0xc4208fb980, 0x8, 0x4, 0x2, 0x541e, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/coreos/etcd/raft/rawnode.go:195 +0xdd
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).stepRaftGroup.func1(0xc4208fb980, 0x0, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica.go:3927 +0x434
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).withRaftGroupLocked.func1(0xc42b594b88, 0x1, 0xc4208fb980, 0x0, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica.go:572 +0x30
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).withRaftGroupLocked(0xc42086f180, 0x2b82200, 0xc42b594b88, 0xc42b594b70, 0x7909f8)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica.go:573 +0xc2
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).withRaftGroup(0xc42086f180, 0x0, 0xc42b594b88, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica.go:598 +0x8c
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).stepRaftGroup(0xc42086f180, 0xc427cb4000, 0x26cb800, 0x30)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica.go:3887 +0x55
github.com/cockroachdb/cockroach/pkg/storage.(*Store).processRaftRequestWithReplica(0xc420848580, 0x2f791c0, 0xc4292d41b0, 0xc42086f180, 0xc427cb4000, 0xc4292d41b0)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:3536 +0xeb
github.com/cockroachdb/cockroach/pkg/storage.(*Store).processRequestQueue.func1(0x2f791c0, 0xc4292d41b0, 0xc42086f180, 0x2f791c0)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:3870 +0x97
github.com/cockroachdb/cockroach/pkg/storage.(*Store).withReplicaForRequest(0xc420848580, 0x2f791c0, 0xc4292d41b0, 0xc427cb4000, 0xc42b594ed0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:3499 +0x135
github.com/cockroachdb/cockroach/pkg/storage.(*Store).processRequestQueue(0xc420848580, 0x2f791c0, 0xc42592aea0, 0x1)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:3864 +0x229
github.com/cockroachdb/cockroach/pkg/storage.(*raftScheduler).worker(0xc4201e3680, 0x2f791c0, 0xc42592aea0)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/scheduler.go:226 +0x21b
github.com/cockroachdb/cockroach/pkg/storage.(*raftScheduler).Start.func2(0x2f791c0, 0xc42592aea0)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/scheduler.go:166 +0x3e
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker.func1(0xc429221e20, 0xc420568480, 0xc429221e10)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:199 +0xe9
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:192 +0xad
  • What was the impact?

Node staying dead.

I guess the panic is legitimate but a smooth recovery would be great.
Is there another way to restore the node without wiping its data dir?

Edit: While decommissioning the node, there are still plenty of replicas assigned to the dead node, and decommissioning takes ages:

  id | is_live | replicas | is_decommissioning | is_draining  
+----+---------+----------+--------------------+-------------+
   4 |  false  |     5266 |        true        |    false     
(1 row)
..................
  id | is_live | replicas | is_decommissioning | is_draining  
+----+---------+----------+--------------------+-------------+
   4 |  false  |     5265 |        true        |    false     
(1 row)
...............
  id | is_live | replicas | is_decommissioning | is_draining  
+----+---------+----------+--------------------+-------------+
   4 |  false  |     5264 |        true        |    false     
(1 row)
..............
  id | is_live | replicas | is_decommissioning | is_draining  
+----+---------+----------+--------------------+-------------+
   4 |  false  |     5205 |        true        |    false     
(1 row)
................
  id | is_live | replicas | is_decommissioning | is_draining  
+----+---------+----------+--------------------+-------------+
   4 |  false  |     5190 |        true        |    false     
(1 row)
..............
  id | is_live | replicas | is_decommissioning | is_draining  
+----+---------+----------+--------------------+-------------+
   4 |  false  |     5191 |        true        |    false     
(1 row)
..
  id | is_live | replicas | is_decommissioning | is_draining  
+----+---------+----------+--------------------+-------------+
   4 |  false  |     5190 |        true        |    false     
(1 row)
.
  id | is_live | replicas | is_decommissioning | is_draining  
+----+---------+----------+--------------------+-------------+
   4 |  false  |     5191 |        true        |    false     
(1 row)
..
  id | is_live | replicas | is_decommissioning | is_draining  
+----+---------+----------+--------------------+-------------+
   4 |  false  |     5151 |        true        |    false     
(1 row)
...............
  id | is_live | replicas | is_decommissioning | is_draining  
+----+---------+----------+--------------------+-------------+
   4 |  false  |     5139 |        true        |    false     
(1 row)
..............
  id | is_live | replicas | is_decommissioning | is_draining  
+----+---------+----------+--------------------+-------------+
   4 |  false  |     5141 |        true        |    false     
(1 row)
.
  id | is_live | replicas | is_decommissioning | is_draining  
+----+---------+----------+--------------------+-------------+
   4 |  false  |     5140 |        true        |    false     
(1 row)

  id | is_live | replicas | is_decommissioning | is_draining  
+----+---------+----------+--------------------+-------------+
   4 |  false  |     5125 |        true        |    false     
(1 row)
...............
  id | is_live | replicas | is_decommissioning | is_draining  
+----+---------+----------+--------------------+-------------+
   4 |  false  |     5124 |        true        |    false     
(1 row)
..............
  id | is_live | replicas | is_decommissioning | is_draining  
+----+---------+----------+--------------------+-------------+
   4 |  false  |     5125 |        true        |    false     
(1 row)
.
  id | is_live | replicas | is_decommissioning | is_draining  
+----+---------+----------+--------------------+-------------+
   4 |  false  |     5124 |        true        |    false     
(1 row)
.........
  id | is_live | replicas | is_decommissioning | is_draining  
+----+---------+----------+--------------------+-------------+
   4 |  false  |     5125 |        true        |    false     
(1 row)
...
  id | is_live | replicas | is_decommissioning | is_draining  
+----+---------+----------+--------------------+-------------+
   4 |  false  |     5126 |        true        |    false     
(1 row)
.
  id | is_live | replicas | is_decommissioning | is_draining  
+----+---------+----------+--------------------+-------------+
   4 |  false  |     5125 |        true        |    false     
(1 row)
.....
  id | is_live | replicas | is_decommissioning | is_draining  
+----+---------+----------+--------------------+-------------+
   4 |  false  |     5124 |        true        |    false     
(1 row)
.
  id | is_live | replicas | is_decommissioning | is_draining  
+----+---------+----------+--------------------+-------------+
   4 |  false  |     5123 |        true        |    false     
(1 row)
.............
  id | is_live | replicas | is_decommissioning | is_draining  
+----+---------+----------+--------------------+-------------+
   4 |  false  |     5122 |        true        |    false     
(1 row)
.............
  id | is_live | replicas | is_decommissioning | is_draining  
+----+---------+----------+--------------------+-------------+
   4 |  false  |     5121 |        true        |    false     
(1 row)
.................
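(The progress output above is presumably from something along the lines of cockroach node decommission 4, which periodically reprints the node status table while replicas move off the node.)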
jordanlewis added the C-investigation, O-community and A-kv-replication labels Sep 16, 2018
petermattis assigned tbg and unassigned petermattis Sep 17, 2018
petermattis (Collaborator) commented:

@tschottdorf Could this be a different manifestation of #28918? v2.1.0-beta.20180910 did not contain the fix (#29579).

@cgebe The other possibility is that kv.raft_log.synchronize has caused a problem. That setting is not intended for production usage. The documentation associated with it should probably have a bigger warning, as the likelihood of permanent data loss is extremely high. In particular, if a node crashes for any reason, then it is likely to continue to crash with a panic like the one above. I cannot stress strongly enough that kv.raft_log.synchronize should not be used in production clusters or any cluster in which you don't want to periodically wipe the cluster and start from scratch.
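(For reference, the synchronous default can presumably be restored with SET CLUSTER SETTING kv.raft_log.synchronize = true, though that does not recover data already lost on a node that crashed while the setting was off.)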

I guess the panic is legitimate but a smooth recovery would be great.
Is there another way to restore the node without wiping its data dir?

Not currently. It is probably possible to do something better here, but it is also possible that the attempt at something better makes the situation a lot worse. Trying to recover from unexpected errors is fraught. If we were to automatically remove some of the bad data, we might inadvertently be removing the last copy of the data.

cgebe (Author) commented Sep 17, 2018

@petermattis I wiped the node and added it back in a fresh state. Since the first incident, I have not had the same problem.

The requirements of my use case, especially time series data storage, may not align fully with CockroachDB's intended usage. I have already taken pressure off insertion by placing an in-memory store in between and periodically inserting into CockroachDB, separated by logical database. As I said, I simply cannot turn kv.raft_log.synchronize on, because doing so leads to the high latencies mentioned in #28487 and makes my system not work at all (insertions are not carried out, transaction errors, etc.). At the moment, this setting is the only way I see.
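Roughly, the buffering approach looks like this (a minimal, hypothetical sketch assuming a Postgres-wire driver and an illustrative metrics table, not the actual code):

package main

import (
	"database/sql"
	"fmt"
	"log"
	"strings"
	"time"

	_ "github.com/lib/pq" // CockroachDB speaks the Postgres wire protocol
)

// sample is a hypothetical time series point buffered in memory.
type sample struct {
	ts    time.Time
	name  string
	value float64
}

// flush writes a buffered batch as a single multi-row INSERT, so the database
// sees one statement per flush interval instead of one per point. Table and
// column names are illustrative.
func flush(db *sql.DB, batch []sample) error {
	if len(batch) == 0 {
		return nil
	}
	placeholders := make([]string, 0, len(batch))
	args := make([]interface{}, 0, 3*len(batch))
	for i, s := range batch {
		placeholders = append(placeholders,
			fmt.Sprintf("($%d, $%d, $%d)", 3*i+1, 3*i+2, 3*i+3))
		args = append(args, s.ts, s.name, s.value)
	}
	_, err := db.Exec(
		"INSERT INTO metrics (ts, name, value) VALUES "+strings.Join(placeholders, ", "),
		args...)
	return err
}

func main() {
	// Connection string is illustrative; the database name stands in for the
	// "separated by logical database" part mentioned above.
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/timeseries?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	batch := []sample{{ts: time.Now(), name: "cpu", value: 0.42}}
	if err := flush(db, batch); err != nil {
		log.Fatal(err)
	}
}

The point of the buffer is simply to turn many small writes into fewer, larger transactions.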

Since my knowledge of the underlying processes is limited, I will continue reporting my experience, and I certainly look forward to optimizations. I really like CockroachDB's functionality and interfaces (SQL with views) and appreciate your activity. Switching to a specialized time series database would be a step backward in usability (especially in horizontal scaling), and is therefore unlikely.

petermattis (Collaborator) commented:

You're correct that time series data storage does not align well with CockroachDB's intended usage. You may be better served by a dedicated time series database.

Running with kv.raft_log.synchronize = false is not something we recommend or can support, and we have no plans to improve it.

I haven't looked at #28487, but there may be something else going on there that we should investigate.

tbg (Member) commented Sep 18, 2018

@tschottdorf Could this be a different manifestation of #28918?

@petermattis I don't think so, since the gap here is ~60 entries, though perhaps this is the failure mode I wasn't quite able to explain in that issue (where I saw a gap of 3). Right off the bat I'd say that this is more likely when a node didn't sync properly.

tbg added the S-1-stability label Sep 26, 2018
tbg added this to the 2.1 milestone Sep 27, 2018
bdarnell (Contributor) commented:

kv.raft_log.synchronize is almost certainly the issue here - it's known to cause this failure when a node shuts down uncleanly (and once it happens, it'll keep happening until that node is wiped).

The raft implementation is arguably being too strict here. If a single node has its log regress like this, it's safe to just catch it back up with new log entries. The problem is that if you had this happen to several nodes at once, data could be lost (silently, since there wouldn't be a more up-to-date node to tell them about the data they're missing). At this point all bets are off because various invariants could have been violated.
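For reference, the check that trips here is the commit-index invariant in the vendored etcd/raft package (raftLog.commitTo in vendor/github.com/coreos/etcd/raft/log.go, per the stack trace above). A paraphrased, self-contained sketch of that invariant, not the literal vendored code:

package raftsketch

// commitToSketch paraphrases the invariant behind the panic: a leader
// heartbeat asks a follower to advance its commit index to tocommit, but the
// follower's local raft log only reaches lastIndex. If the log is shorter
// than an index the leader already considers committed, entries that should
// have been durable are missing, so raft panics rather than silently
// continuing. (In CockroachDB the panic surfaces via (*raftLogger).Panicf,
// as in the stack trace above.)
func commitToSketch(committed, lastIndex, tocommit uint64) uint64 {
	if committed < tocommit {
		if lastIndex < tocommit {
			panic("tocommit is out of range [lastIndex]: raft log corrupted, truncated, or lost")
		}
		return tocommit // the commit index never decreases
	}
	return committed
}

With kv.raft_log.synchronize = false, entries the leader already counted as committed can disappear from a follower's log on an unclean shutdown, so the follower's lastIndex (691183 here) trails the leader's commit index (691243) and the check fires on the next heartbeat.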

tbg (Member) commented Oct 11, 2018

Closing since this is likely caused by not syncing to disk.

tbg closed this as completed Oct 11, 2018