Upgrading from v19.1.0-beta.20190318 to v19.1.0-rc.2 nodes crashed #36680

roncrdb · 2019-04-09T18:27:48Z

Describe the problem

A user was upgrading from v19.1.0-beta.20190318 to v19.1.0-rc.2 when a node crashed.
Below is the stack trace from node 20:

E190409 08:31:03.494426 433 util/log/crash_reporting.go:476  [n20,s20,r94300/?:/Table/118/1/"MT"/"GS:e{3…-4…}] Reported as error eeed501b655c46b5b3ca1dcb1b39929c
F190409 08:31:03.494431 433 storage/replica_raft.go:2350  [n20,s20,r94300/?:/Table/118/1/"MT"/"GS:e{3…-4…}] TruncatedState regressed:
old: index:556489 term:267
new: index:556486 term:266
goroutine 433 [running]:
github.com/cockroachdb/cockroach/pkg/util/log.getStacks(0xc0000da600, 0xc0000da600, 0x5336e00, 0x17)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:1018 +0xd4
github.com/cockroachdb/cockroach/pkg/util/log.(*loggingT).outputLogEntry(0x5ada300, 0xc000000004, 0x5336e54, 0x17, 0x92e, 0xc00788fcb0, 0x87)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:874 +0x95a
github.com/cockroachdb/cockroach/pkg/util/log.addStructured(0x39d8080, 0xc000ae59b0, 0x4, 0x2, 0x0, 0x0, 0xc00902ea70, 0x1, 0x1)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/structured.go:85 +0x2d5
github.com/cockroachdb/cockroach/pkg/util/log.logDepth(0x39d8080, 0xc000ae59b0, 0x1, 0x4, 0x0, 0x0, 0xc00902ea70, 0x1, 0x1)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:71 +0x8c
github.com/cockroachdb/cockroach/pkg/util/log.Fatal(0x39d8080, 0xc000ae59b0, 0xc00902ea70, 0x1, 0x1)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:191 +0x6c
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).applyRaftCommand(0xc0069e0480, 0x39d8080, 0xc000ae59b0, 0xc006bf2570, 0x8, 0x0, 0xc006d5fe00, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/replica_raft.go:2350 +0x7fd
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).processRaftCommand(0xc0069e0480, 0x39d8080, 0xc000ae59b0, 0xc006bf2570, 0x8, 0x10b, 0x87dca, 0xe0000000e, 0x7, 0xe5, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/replica_raft.go:1938 +0x5df
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).handleRaftReadyRaftMuLocked(0xc0069e0480, 0x39d8080, 0xc000ae59b0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/replica_raft.go:790 +0x13da
github.com/cockroachdb/cockroach/pkg/storage.(*Store).processRequestQueue.func1(0x39d8080, 0xc000ae59b0, 0xc0069e0480, 0x39d8080)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:3585 +0x120
github.com/cockroachdb/cockroach/pkg/storage.(*Store).withReplicaForRequest(0xc0010bd200, 0x39d8080, 0xc000ae59b0, 0xc0069385f0, 0xc00786ded0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:3232 +0x135
github.com/cockroachdb/cockroach/pkg/storage.(*Store).processRequestQueue(0xc0010bd200, 0x39d8080, 0xc005a1e8a0, 0x1705c)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:3573 +0x21b
github.com/cockroachdb/cockroach/pkg/storage.(*raftScheduler).worker(0xc000425500, 0x39d8080, 0xc005a1e8a0)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/scheduler.go:225 +0x21a
github.com/cockroachdb/cockroach/pkg/storage.(*raftScheduler).Start.func2(0x39d8080, 0xc005a1e8a0)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/scheduler.go:165 +0x3e
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker.func1(0xc005de7e20, 0xc0009a6000, 0xc005de7e10)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:200 +0xe1
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker
        /go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:193 +0xa8

Expected behavior
Users should be able to upgrade without nodes crashing.

Additional data / screenshots
Node 11 also crashed with the same error:

E190409 08:39:41.913621 495 util/log/crash_reporting.go:476  [n11,s11,r12387/?:/System/tsd/cr.node.sql.dists…] Reported as error 2338327a0bd34101803ae46205f46f9f
F190409 08:39:41.913625 495 storage/replica_raft.go:2350  [n11,s11,r12387/?:/System/tsd/cr.node.sql.dists…] TruncatedState regressed:
old: index:23937819 term:2378
new: index:23937817 term:2378
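
For readers unfamiliar with the assertion, here is a minimal sketch of the kind of check that could produce the fatal error above. The type and function names are hypothetical and are not taken from the CockroachDB source, and comparing only the index is an assumption. The values are copied from the n20 and n11 log lines.

```go
package main

import "fmt"

// truncatedState is a simplified, hypothetical stand-in for the Raft
// truncated state; it is not the CockroachDB definition.
type truncatedState struct {
	Index uint64
	Term  uint64
}

// regressed reports whether moving from prev to next would move the
// truncated index backwards. Comparing only the index is an assumption
// made for this sketch.
func regressed(prev, next truncatedState) bool {
	return next.Index < prev.Index
}

func main() {
	// Old/new pairs as reported by n20 and n11 in the logs above.
	cases := []struct{ prev, next truncatedState }{
		{truncatedState{556489, 267}, truncatedState{556486, 266}},
		{truncatedState{23937819, 2378}, truncatedState{23937817, 2378}},
	}
	for _, c := range cases {
		fmt.Printf("old=%+v new=%+v regressed=%t\n", c.prev, c.next, regressed(c.prev, c.next))
	}
}
```

In both reported cases the new index is lower than the old one, which is the condition this sketch flags.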

Environment:

CockroachDB v19.1.0-beta.20190318

tim-o · 2019-04-09T18:29:18Z

Zendesk ticket #3250 has been linked to this issue.

When I landed the change to stop sending the Raft log in snapshots, I gated this on whether the truncated state had already been unreplicated for the range. However, this wasn't enough because older 19.1 betas knew about unreplicated truncated state and yet couldn't handle a regressing truncated state, which sending these snapshots could introduce. As a result, 19.1-beta nodes could crash while running mixed with 19.1-rcX. (Simply restarting those nodes with the upgraded binary should fix the problem).

This PR breaks one of our rules around not introducing historical cluster versions, but in this case it's necessary and also shouldn't have any adverse effects.

See cockroachdb#36680.

Release note (bug fix): prevent a crash that could occur when running a cluster mixed between 19.1-beta and 19.1-rcX nodes. The crash would manifest with a fatal error stating "TruncatedState regressed". Moving all nodes to the new binary (19.1-rcX or newer) rectifies this situation. This wouldn't affect anyone migrating directly from 2.1.x into 19.1.x, as the majority of our users are expected to.

tbg · 2019-04-10T15:12:17Z

I screwed something up here -- the nodes crashing were running v19.1.0-beta.20190318. They already have the cluster version VersionUnreplicatedRaftTruncatedState but aren't actually able to handle divergent snapshots (because that was only actually necessary later, when that change was made).

However, making that change, I did not introduce a new cluster version but assumed that once the truncated state was unreplicated, it was fine to also let it diverge. Not true, as this issue demonstrates.

The solution (to prevent upgrades into 19.1 from doing the same to RC users) unfortunately requires "rewriting cluster version history".
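
A rough sketch of the gating problem described above, under stated assumptions: every identifier here is hypothetical, including `versionSnapshotsWithoutLog`, which only stands in for whatever additional cluster version the fix introduces. It is not the actual change in the PR, just one plausible shape of "gate on the range state alone" versus "also gate on a dedicated cluster version".

```go
package main

import "fmt"

// clusterVersion is a hypothetical, simplified ordering of cluster versions.
type clusterVersion int

const (
	// versionUnreplicatedTruncatedState models the version that the 19.1
	// betas already know about.
	versionUnreplicatedTruncatedState clusterVersion = iota + 1
	// versionSnapshotsWithoutLog is a made-up name for an additional
	// version that signals every node can handle a diverging truncated state.
	versionSnapshotsWithoutLog
)

// rangeState is a hypothetical per-range view used by the snapshot sender.
type rangeState struct {
	unreplicatedTruncatedState bool
}

// originalGate models the first attempt: omit the Raft log from snapshots
// as soon as the range's truncated state is unreplicated.
func originalGate(r rangeState) bool {
	return r.unreplicatedTruncatedState
}

// fixedGate additionally requires the dedicated cluster version to be active.
func fixedGate(r rangeState, active clusterVersion) bool {
	return r.unreplicatedTruncatedState && active >= versionSnapshotsWithoutLog
}

func main() {
	r := rangeState{unreplicatedTruncatedState: true}

	// Mixed beta/rc cluster: the dedicated version is not yet active.
	mixed := versionUnreplicatedTruncatedState
	fmt.Println("original gate omits log:", originalGate(r))
	fmt.Println("fixed gate omits log (mixed):", fixedGate(r, mixed))

	// All nodes upgraded and the dedicated version activated.
	fmt.Println("fixed gate omits log (upgraded):", fixedGate(r, versionSnapshotsWithoutLog))
}
```

The point of the second check is that a beta node can satisfy the per-range condition while still being unable to tolerate the snapshots it would then receive.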

For that specific user, I think the solution is to just restart the crashed nodes with the v19.1.0-rc.2 binary. Things should work at that point without damage having been done to the data.

PR is out #36714

36714: storage: prevent crash migrating from 19.1-beta into 19.1-rcX r=bdarnell a=tbg

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>

tbg · 2019-04-29T19:38:47Z

#36719 fixed this.

roncrdb added the C-investigation and A-kv-replication labels Apr 9, 2019

tbg self-assigned this Apr 10, 2019

tbg mentioned this issue Apr 10, 2019

storage: prevent crash migrating from 19.1-beta into 19.1-rcX #36714

Merged

tbg mentioned this issue Apr 10, 2019

backport-19.1: storage: prevent crash migrating from 19.1-beta into 19.1-rcX #36719

Merged

tbg closed this as completed Apr 29, 2019

tbg mentioned this issue May 1, 2019

server: TestClusterVersionUnreplicatedRaftTruncatedState failed under stress #34815

Closed
