Upgrading from v19.1.0-beta.20190318 to v19.1.0-rc.2 nodes crashed #36680
Comments
Zendesk ticket #3250 has been linked to this issue.
When I landed the change to stop sending the Raft log in snapshots, I gated this on whether the truncated state had already been unreplicated for the range. However, this wasn't enough because older 19.1 betas knew about unreplicated truncated state and yet couldn't handle a regressing truncated state, which sending these snapshots could introduce. As a result, 19.1-beta nodes could crash while running mixed with 19.1-rcX. (Simply restarting those nodes with the upgraded binary should fix the problem.)

This PR breaks one of our rules around not introducing historical cluster versions, but in this case it's necessary and also shouldn't have any adverse effects.

See cockroachdb#36680.

Release note (bug fix): prevent a crash that could occur when running a cluster mixed between 19.1-beta and 19.1-rcX nodes. The crash would manifest with a fatal error stating "TruncatedState regressed". Moving all nodes to the new binary (19.1-rcX or newer) rectifies this situation. This wouldn't affect anyone migrating directly from 2.1.x into 19.1.x, as the majority of our users are expected to.
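To make the gating problem described above concrete, here is a minimal Go sketch. It is not CockroachDB's actual code: every type, constant, and function name (`rangeState`, `versionSnapshotsWithoutLog`, `omitLogOld`, `omitLogFixed`, `applyTruncatedState`) is hypothetical, and the model assumes that a cluster version only becomes active once every node runs a binary that understands it. The original gate looks only at per-range state, so it can fire in a mixed cluster. The fixed gate additionally requires a cluster version, which keeps log-less snapshots away from beta nodes.

```go
package main

import (
	"errors"
	"fmt"
)

// truncatedState is a simplified, hypothetical stand-in for a range's
// Raft log truncation point.
type truncatedState struct {
	Index uint64
}

// applyTruncatedState models an older node that cannot tolerate a
// truncated state moving backwards. Hypothetical helper; only the
// error text comes from the issue.
func applyTruncatedState(prev, next truncatedState) error {
	if next.Index < prev.Index {
		return errors.New("TruncatedState regressed")
	}
	return nil
}

// rangeState holds the per-range flag the original gate relied on.
type rangeState struct {
	unreplicatedTruncatedState bool
}

// clusterVersion is a hypothetical ordered cluster version.
type clusterVersion int

const (
	versionBeta                clusterVersion = iota // mixed beta/rc cluster
	versionSnapshotsWithoutLog                       // assumed active only once all nodes are upgraded
)

// omitLogOld mirrors the original gating: per-range state only.
func omitLogOld(r rangeState) bool {
	return r.unreplicatedTruncatedState
}

// omitLogFixed also requires the cluster version, so the sender keeps
// including the Raft log while older nodes may still receive snapshots.
func omitLogFixed(r rangeState, active clusterVersion) bool {
	return r.unreplicatedTruncatedState && active >= versionSnapshotsWithoutLog
}

func main() {
	r := rangeState{unreplicatedTruncatedState: true}
	fmt.Println("old gate, mixed cluster:", omitLogOld(r))
	fmt.Println("fixed gate, mixed cluster:", omitLogFixed(r, versionBeta))
	fmt.Println("fixed gate, fully upgraded:", omitLogFixed(r, versionSnapshotsWithoutLog))
	fmt.Println("old receiver:", applyTruncatedState(truncatedState{Index: 20}, truncatedState{Index: 10}))
}
```

In this model the old gate returns true in a mixed cluster, which is the situation that let a beta node hit the regression check. The fixed gate stays false until the hypothetical cluster version is active.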
I screwed something up here -- the nodes crashing were running [...] However, making that change, I did not introduce a new cluster version but assumed that once the truncated state was unreplicated, it was fine to also let it diverge. Not true, as this issue demonstrates. The solution (to prevent upgrading into 19.1 to do the same to RC users) unfortunately requires "rewriting cluster version history".

For that specific user, I think the solution is to just restart the crashed nodes with the v19.1.0-rc.2 binary. Things should work at that point without damage having been done to the data.

PR is out: #36714
36714: storage: prevent crash migrating from 19.1-beta into 19.1-rcX r=bdarnell a=tbg
Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
#36719 fixed this.
Describe the problem
The user was upgrading from v19.1.0-beta.20190318 to v19.1.0-rc.2 when nodes crashed.
Below is the stack trace from node 20:
Expected behavior
The upgrade should complete without nodes crashing.
Additional data / screenshots
Node 11 also crashed with the same error:
Environment: