Upgrading from v19.1.0-beta.20190318 to v19.1.0-rc.2 nodes crashed #36680

Closed
roncrdb opened this issue Apr 9, 2019 · 3 comments
Labels
A-kv-replication Relating to Raft, consensus, and coordination. C-investigation Further steps needed to qualify. C-label will change.

roncrdb commented Apr 9, 2019

Describe the problem

A user was upgrading from v19.1.0-beta.20190318 to v19.1.0-rc.2 when a node crashed.
Below is the stack trace from node 20:

E190409 08:31:03.494426 433 util/log/crash_reporting.go:476  [n20,s20,r94300/?:/Table/118/1/"MT"/"GS:e{3…-4…}] Reported as error eeed501b655c46b5b3ca1dcb1b39929c
F190409 08:31:03.494431 433 storage/replica_raft.go:2350  [n20,s20,r94300/?:/Table/118/1/"MT"/"GS:e{3…-4…}] TruncatedState regressed:
old: index:556489 term:267
new: index:556486 term:266
goroutine 433 [running]:
github.com/cockroachdb/cockroach/pkg/util/log.getStacks(0xc0000da600, 0xc0000da600, 0x5336e00, 0x17)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:1018 +0xd4
github.com/cockroachdb/cockroach/pkg/util/log.(*loggingT).outputLogEntry(0x5ada300, 0xc000000004, 0x5336e54, 0x17, 0x92e, 0xc00788fcb0, 0x87)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:874 +0x95a
github.com/cockroachdb/cockroach/pkg/util/log.addStructured(0x39d8080, 0xc000ae59b0, 0x4, 0x2, 0x0, 0x0, 0xc00902ea70, 0x1, 0x1)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/structured.go:85 +0x2d5
github.com/cockroachdb/cockroach/pkg/util/log.logDepth(0x39d8080, 0xc000ae59b0, 0x1, 0x4, 0x0, 0x0, 0xc00902ea70, 0x1, 0x1)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:71 +0x8c
github.com/cockroachdb/cockroach/pkg/util/log.Fatal(0x39d8080, 0xc000ae59b0, 0xc00902ea70, 0x1, 0x1)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:191 +0x6c
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).applyRaftCommand(0xc0069e0480, 0x39d8080, 0xc000ae59b0, 0xc006bf2570, 0x8, 0x0, 0xc006d5fe00, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/replica_raft.go:2350 +0x7fd
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).processRaftCommand(0xc0069e0480, 0x39d8080, 0xc000ae59b0, 0xc006bf2570, 0x8, 0x10b, 0x87dca, 0xe0000000e, 0x7, 0xe5, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/replica_raft.go:1938 +0x5df
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).handleRaftReadyRaftMuLocked(0xc0069e0480, 0x39d8080, 0xc000ae59b0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/replica_raft.go:790 +0x13da
github.com/cockroachdb/cockroach/pkg/storage.(*Store).processRequestQueue.func1(0x39d8080, 0xc000ae59b0, 0xc0069e0480, 0x39d8080)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:3585 +0x120
github.com/cockroachdb/cockroach/pkg/storage.(*Store).withReplicaForRequest(0xc0010bd200, 0x39d8080, 0xc000ae59b0, 0xc0069385f0, 0xc00786ded0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:3232 +0x135
github.com/cockroachdb/cockroach/pkg/storage.(*Store).processRequestQueue(0xc0010bd200, 0x39d8080, 0xc005a1e8a0, 0x1705c)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:3573 +0x21b
github.com/cockroachdb/cockroach/pkg/storage.(*raftScheduler).worker(0xc000425500, 0x39d8080, 0xc005a1e8a0)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/scheduler.go:225 +0x21a
github.com/cockroachdb/cockroach/pkg/storage.(*raftScheduler).Start.func2(0x39d8080, 0xc005a1e8a0)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/scheduler.go:165 +0x3e
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker.func1(0xc005de7e20, 0xc0009a6000, 0xc005de7e10)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:200 +0xe1
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker
        /go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:193 +0xa8
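
For readers unfamiliar with the assertion, here is a minimal sketch of the kind of check that produces a "TruncatedState regressed" failure. The `TruncatedState` type and the `checkNoRegression` function are hypothetical stand-ins, not the code at storage/replica_raft.go:2350, and the assumption that the comparison is made on the index alone is mine. The sketch returns an error where the real node logs a fatal error.

```go
package main

import "fmt"

// TruncatedState is a hypothetical stand-in for the Raft truncated state:
// the index and term of the last log entry that has been discarded.
type TruncatedState struct {
	Index uint64
	Term  uint64
}

// checkNoRegression returns an error if the proposed truncated state would
// move the truncation index backwards. Assumption: only the index is compared.
func checkNoRegression(old, proposed TruncatedState) error {
	if proposed.Index < old.Index {
		return fmt.Errorf(
			"TruncatedState regressed:\nold: index:%d term:%d\nnew: index:%d term:%d",
			old.Index, old.Term, proposed.Index, proposed.Term)
	}
	return nil
}

func main() {
	// Values copied from the n20 log lines in this report.
	old := TruncatedState{Index: 556489, Term: 267}
	proposed := TruncatedState{Index: 556486, Term: 266}
	if err := checkNoRegression(old, proposed); err != nil {
		fmt.Println(err)
	}
}
```

With the n20 values, the proposed index (556486) is lower than the current one (556489), so the check fails.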

Expected behavior
Should be able to upgrade without nodes crashing.

Additional data / screenshots
Node 11 also crashed with the same error:

E190409 08:39:41.913621 495 util/log/crash_reporting.go:476  [n11,s11,r12387/?:/System/tsd/cr.node.sql.dists…] Reported as error 2338327a0bd34101803ae46205f46f9f
F190409 08:39:41.913625 495 storage/replica_raft.go:2350  [n11,s11,r12387/?:/System/tsd/cr.node.sql.dists…] TruncatedState regressed:
old: index:23937819 term:2378
new: index:23937817 term:2378

Environment:

  • CockroachDB v19.1.0-beta.20190318
@roncrdb roncrdb added C-investigation Further steps needed to qualify. C-label will change. A-kv-replication Relating to Raft, consensus, and coordination. labels Apr 9, 2019

tim-o commented Apr 9, 2019

Zendesk ticket #3250 has been linked to this issue.

@tbg tbg self-assigned this Apr 10, 2019
tbg added a commit to tbg/cockroach that referenced this issue Apr 10, 2019
When I landed the change to stop sending the Raft log in snapshots, I
gated this on whether the truncated state had already been unreplicated
for the range. However, this wasn't enough because older 19.1 betas knew
about unreplicated truncated state and yet couldn't handle a regressing
truncated state, which sending these snapshots could introduce. As a
result, 19.1-beta nodes could crash while running mixed with 19.1-rcX.
(Simply restarting those nodes with the upgraded binary should fix the
problem).

This PR breaks one of our rules around not introducing historical
cluster versions, but in this case it's necessary and also shouldn't
have any adverse effects.

See cockroachdb#36680.

Release note (bug fix): prevent a crash that could occur when running
a cluster mixed between 19.1-beta and 19.1-rcX nodes. The crash would
manifest with a fatal error stating "TruncatedState regressed". Moving
all nodes to the new binary (19.1-rcX or newer) rectifies this
situation. This wouldn't affect anyone migrating directly from 2.1.x
into 19.1.x, as the majority of our users are expected to.
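
The gating problem described in this commit message can be sketched roughly as follows. Everything here is illustrative: `ClusterVersion`, the two version constants, `oldGate`, and `canOmitRaftLog` are hypothetical names, and the sketch assumes the fix works by requiring an additional cluster version before the sender may omit the Raft log from a snapshot.

```go
package main

import "fmt"

// ClusterVersion is a hypothetical, simplified ordered cluster version.
type ClusterVersion int

const (
	// VersionUnreplicatedTruncatedState is already active on the 19.1 betas.
	// The name and ordering are illustrative only.
	VersionUnreplicatedTruncatedState ClusterVersion = iota + 1
	// VersionSnapshotsWithoutLog is a hypothetical later version that is only
	// active once every node can tolerate a regressing truncated state.
	VersionSnapshotsWithoutLog
)

// oldGate models the original condition: omit the Raft log as soon as the
// range's truncated state is unreplicated.
func oldGate(truncStateUnreplicated bool) bool {
	return truncStateUnreplicated
}

// canOmitRaftLog models the stricter condition: the truncated state must be
// unreplicated AND the cluster must have reached the later version.
func canOmitRaftLog(active ClusterVersion, truncStateUnreplicated bool) bool {
	return truncStateUnreplicated && active >= VersionSnapshotsWithoutLog
}

func main() {
	// Mixed cluster: beta nodes are present, so the later version is inactive.
	active := VersionUnreplicatedTruncatedState
	fmt.Println("old gate omits log:", oldGate(true))
	fmt.Println("new gate omits log:", canOmitRaftLog(active, true))
}
```

In the mixed-cluster case the old condition allows log-free snapshots while the stricter one does not, which is the difference the commit message describes.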

tbg commented Apr 10, 2019

I screwed something up here: the nodes crashing were running v19.1.0-beta.20190318. They already have the cluster version VersionUnreplicatedRaftTruncatedState but aren't actually able to handle divergent snapshots (because handling them only became necessary later, when the change to stop sending the Raft log in snapshots was made).

However, when making that change, I did not introduce a new cluster version; I assumed that once the truncated state was unreplicated, it was fine to also let it diverge. That is not true, as this issue demonstrates.
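
As an illustration of what "handling divergence" could mean, here is a minimal sketch in which a replica ignores an incoming truncated state that is behind its own instead of crashing. This is an assumption made for illustration only; `TruncatedState` and `applyTolerant` are hypothetical names, and the thread does not describe how the newer binaries actually handle this case.

```go
package main

import "fmt"

// TruncatedState is a hypothetical stand-in for the Raft truncated state.
type TruncatedState struct {
	Index uint64
	Term  uint64
}

// applyTolerant keeps whichever truncated state is further ahead and reports
// whether the incoming one was applied. An incoming state at or behind the
// current index is ignored rather than treated as fatal.
func applyTolerant(current, incoming TruncatedState) (TruncatedState, bool) {
	if incoming.Index <= current.Index {
		return current, false
	}
	return incoming, true
}

func main() {
	// Values copied from the n11 log lines in this report.
	current := TruncatedState{Index: 23937819, Term: 2378}
	incoming := TruncatedState{Index: 23937817, Term: 2378}
	got, applied := applyTolerant(current, incoming)
	fmt.Printf("%+v applied=%t\n", got, applied)
}
```

With the n11 values, the incoming state is two entries behind, so the replica would keep its current truncated state.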

The solution (to prevent upgrades into 19.1 from doing the same to RC users) unfortunately requires "rewriting cluster version history".

For that specific user, I think the solution is to just restart the crashed nodes with the v19.1.0-rc.2 binary. Things should work at that point, with no damage done to the data.

The PR is out: #36714

craig bot pushed a commit that referenced this issue Apr 10, 2019
36714: storage: prevent crash migrating from 19.1-beta into 19.1-rcX r=bdarnell a=tbg

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>

tbg commented Apr 29, 2019

#36719 fixed this.
