-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
rfcs: Finalize the cluster upgrade process automatically. #24377
Conversation
Formatting nit: try to wrap your lines at 80 characters, as in other RFCs. I think there are editor plugins which do this automatically (though last time I wrote a big markdown doc I did it by hand, annoyingly). |
Review status: 0 of 1 files reviewed at latest revision, all discussions resolved, all commit checks successful. docs/RFCS/auto_finalize_cluster_upgrade.md, line 30 at r1 (raw file):
What is the opt-out? A cluster setting? How do we ensure that we have the right setting when starting up? docs/RFCS/auto_finalize_cluster_upgrade.md, line 32 at r1 (raw file):
Formatting nit: Use monospace formatting for docs/RFCS/auto_finalize_cluster_upgrade.md, line 50 at r1 (raw file):
If a node is unreachable, it won't show up in gossip, so how would the surviving nodes know not to upgrade? I think we need to add a condition that there are no under-replicated ranges. Comments from Reviewable |
I believe that we should auto-upgrade clusters by default. I also believe that our docs should recommend disabling the auto-upgrade mechanism (the auto-upgrade is to protect users who don't read the docs). The mechanism to opt out of automatic upgrades should be on a per-version basis. If I follow the docs and upgrade from 2.0 to 2.1 with auto-upgrade disabled, that shouldn't prevent auto-upgrades when I go to 2.2 (when I may just do a rolling upgrade without reading the docs). |
Reviewed 1 of 1 files at r1. docs/RFCS/auto_finalize_cluster_upgrade.md, line 1 at r1 (raw file):
Let's call this "Finalize cluster upgrades automatically" docs/RFCS/auto_finalize_cluster_upgrade.md, line 4 at r1 (raw file):
Just list yourself here! :) docs/RFCS/auto_finalize_cluster_upgrade.md, line 5 at r1 (raw file):
Fill in #24377 on the next update. docs/RFCS/auto_finalize_cluster_upgrade.md, line 22 at r1 (raw file):
Specifically, it unlocks backwards-incompatible features. Backwards compatible features are available without finalizing the upgrade. docs/RFCS/auto_finalize_cluster_upgrade.md, line 30 at r1 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Yep. We can consistently read it from docs/RFCS/auto_finalize_cluster_upgrade.md, line 47 at r1 (raw file):
I've been thinking more about this race. I think the best way to handle it is by introducing (yet another) version, a Then the upgrade process looks like this:
I think original credit for this idea belongs to @nvanbenschoten. @bdarnell, does this overcomplicate matters? docs/RFCS/auto_finalize_cluster_upgrade.md, line 50 at r1 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Looks like we write node status keys for every node in the cluster. Could we repurpose those? That is, only upgrade once we've verified that every node with a status key is either running the right version or decommissioned and dead? Comments from Reviewable |
Review status: all files reviewed at latest revision, 8 unresolved discussions, all commit checks successful. docs/RFCS/auto_finalize_cluster_upgrade.md, line 47 at r1 (raw file): Previously, benesch (Nikhil Benesch) wrote…
It occurs to me that Comments from Reviewable |
I agree that we should auto upgrade by default. However, upgrade on a per-version basis could be confusing to the users. Let's say user turned auto_upgrade off when updating to 2.1. The auto_upgrade should remain off so users still need to manually run the finalization command when upgrading to 2.2. If we turned it on internally without notifying users and simply say it in the docs, users could treat it as a bug rather than a feature. We should always make the setting consistent by having a default value and let users flip that around if they want to. @bdarnell |
This has no mention of testing. Please add some. This might be a large section. Is there no way to set how long until this finalization step occurs? I don't see any mention of it. Also, there needs to be a UI component to this RFC. We need to surface in both the CLI and the Admin UI if a cluster is going to be upgraded and when. Or if it needs to be done manually. If we are going to step it one upgrade at a time, then this should show that. Review status: all files reviewed at latest revision, 8 unresolved discussions, all commit checks successful. Comments from Reviewable |
Do you mean give operators a grace period just in case they want to cancel the upgrade? @BramGruneir |
Yes. Exactly that. Instead of this being an on/off feature. Have it wait X mins/hours before upgrading. Thus ensuring that the cluster is stable and it give the operators a chance to turn it off, change the length of the time before the auto-upgrade happens or upgrade immediately. Review status: all files reviewed at latest revision, 8 unresolved discussions, all commit checks successful. Comments from Reviewable |
If so, setting Maybe set it to -1 to disable this feature? But I'm also wondering if duration could be negative or not. |
I'd prefer if we tackled the UI in a separate RFC. I don't want the scope of this feature to explode on Victor. Review status: all files reviewed at latest revision, 8 unresolved discussions, all commit checks successful. Comments from Reviewable |
Since we already have an issue for the web UI to warn users for this scenario that we were planning on tackling in 2.1, we could adapt the issue to instead warn them about the impending auto-update (when applicable). To avoid increasing scope (to Nikhil's point) we can coordinate the update on our end with Victor so they happen at the same time. If that works for everyone, I'll go ahead and update the linked issue with the changes we would need based on this RFC. |
It's not uncommon to use 0 to indicate that the feature is turned off, we just need to describe it in the notes. And looking at a negative time might seem weird. We could also set the minimum wait to 1h or 30min or something, to further not confuse people. If they wanted to upgrade right away, they can manually run the upgrade command.
More than that, there should also be a function to find out how long until the upgrade occurs. This would be useful for both UIs. Comments from Reviewable |
I think this is just a question of how the setting is worded. It's true that it would look buggy if they said I think that making the opt-out of auto-upgrade permanent instead of per-release will lead to more surprises since people will forget how they set it between releases.
I'm not sure a time-based auto-update makes sense. If it doesn't auto-upgrade as soon as the rolling restart is finished, users will still be confused that their new features aren't working yet. A few hours doesn't buy you much time to roll back (you at least want to see how the new version performs under your daily peak load before you commit, right?) Review status: all files reviewed at latest revision, 8 unresolved discussions, all commit checks successful. docs/RFCS/auto_finalize_cluster_upgrade.md, line 30 at r1 (raw file): Previously, benesch (Nikhil Benesch) wrote…
Ah, right. This isn't at early startup like our store-level migrations; it can wait until the cluster is up and able to serve consistent reads. docs/RFCS/auto_finalize_cluster_upgrade.md, line 47 at r1 (raw file): Previously, benesch (Nikhil Benesch) wrote…
Introducing a new cluster version variable seems over-complicated to me. Can we do this by making the checks in docs/RFCS/auto_finalize_cluster_upgrade.md, line 50 at r1 (raw file): Previously, benesch (Nikhil Benesch) wrote…
Yeah, that could do it. It would make decommissioning mandatory in a way that it's not today, but it probably makes sense for this use case. Comments from Reviewable |
@windchan7 let's find some time tomorrow to go through this feedback! Review status: all files reviewed at latest revision, 8 unresolved discussions, all commit checks successful. docs/RFCS/auto_finalize_cluster_upgrade.md, line 47 at r1 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Doesn't it already do that? Anyway, let's suppose
I'm not sure how we get around this without a separate version that limits what nodes you can talk to without actually turning on the backwards-incompatible features. Comments from Reviewable |
I had a discussion with Nikhil yesterday and agreed that |
Review status: 0 of 1 files reviewed at latest revision, 8 unresolved discussions, all commit checks successful. docs/RFCS/auto_finalize_cluster_upgrade.md, line 47 at r1 (raw file):
Yeah, I think so. I was getting it mixed up with the constant BinaryMinimumSupportedVersion.
True. A bulletproof solution here would need an extra step here. So far we've just been relying on the fact that new nodes running the old version are unlikely, especially when the upgrade is done manually a significant interval later. docs/RFCS/auto_finalize_cluster_upgrade.md, line 57 at r2 (raw file):
When I said "use manual upgrade" I didn't necessarily mean that literal syntax. I'm not sure it's worth the complexity of custom syntax here since it's only going to be used when you're copy/pasting from the docs here. docs/RFCS/auto_finalize_cluster_upgrade.md, line 60 at r2 (raw file):
Bikeshed: Comments from Reviewable |
Code review tip: hit the "Done" button once you've addressed someone's feedback! Reviewed 1 of 1 files at r3. docs/RFCS/auto_finalize_cluster_upgrade.md, line 5 at r1 (raw file): Previously, benesch (Nikhil Benesch) wrote…
Please linkify these when you get the chance:
docs/RFCS/auto_finalize_cluster_upgrade.md, line 47 at r1 (raw file):
Right. So my question is how much do you care about this race? I'm inclined to call it future work, but perhaps work that we mandate makes it into 2.1. It's looking like the daemon is going to be an isolated chunk of work that blindly runs docs/RFCS/auto_finalize_cluster_upgrade.md, line 57 at r2 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Victor and I were unable to come up with a setting name that was sufficiently self-descriptive. If we don't introduce custom syntax I think we'll be stuck with a slightly confusing setting name, but perhaps that's ok. docs/RFCS/auto_finalize_cluster_upgrade.md, line 60 at r2 (raw file):
I'll toss out docs/RFCS/auto_finalize_cluster_upgrade.md, line 54 at r3 (raw file):
s/field/setting/ docs/RFCS/auto_finalize_cluster_upgrade.md, line 110 at r3 (raw file):
The node list page already shows the node's executable version, I think. We could easily extend this to also show their view of the cluster version. Comments from Reviewable |
This is looking really good, Victor! Review status: all files reviewed at latest revision, 9 unresolved discussions, all commit checks successful. Comments from Reviewable |
Review status: all files reviewed at latest revision, 9 unresolved discussions, all commit checks successful. docs/RFCS/auto_finalize_cluster_upgrade.md, line 1 at r1 (raw file): Previously, benesch (Nikhil Benesch) wrote…
Done. docs/RFCS/auto_finalize_cluster_upgrade.md, line 4 at r1 (raw file): Previously, benesch (Nikhil Benesch) wrote…
Done. docs/RFCS/auto_finalize_cluster_upgrade.md, line 22 at r1 (raw file): Previously, benesch (Nikhil Benesch) wrote…
Done. docs/RFCS/auto_finalize_cluster_upgrade.md, line 30 at r1 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Done. docs/RFCS/auto_finalize_cluster_upgrade.md, line 32 at r1 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Done. docs/RFCS/auto_finalize_cluster_upgrade.md, line 50 at r1 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Done. Comments from Reviewable |
Review status: all files reviewed at latest revision, 8 unresolved discussions, all commit checks successful. docs/RFCS/auto_finalize_cluster_upgrade.md, line 47 at r1 (raw file): Previously, benesch (Nikhil Benesch) wrote…
I'm OK with deferring this for future work. docs/RFCS/auto_finalize_cluster_upgrade.md, line 60 at r2 (raw file): Previously, benesch (Nikhil Benesch) wrote…
I was thinking that you'd be unable to set the cluster version to 2.0 without clearing the docs/RFCS/auto_finalize_cluster_upgrade.md, line 59 at r3 (raw file):
The setting should be a string, not a float (like the Comments from Reviewable |
Review status: all files reviewed at latest revision, 9 unresolved discussions, all commit checks successful. docs/RFCS/auto_finalize_cluster_upgrade.md, line 54 at r3 (raw file): Previously, benesch (Nikhil Benesch) wrote…
Do you mind clarify what it means? Thanks Nikhil. Comments from Reviewable |
Review status: all files reviewed at latest revision, 9 unresolved discussions, all commit checks successful. docs/RFCS/auto_finalize_cluster_upgrade.md, line 54 at r3 (raw file): Previously, windchan7 (Victor Chen) wrote…
Comments from Reviewable |
I don't think this RFC needs to address the specifics of the UI implementation, but it would be useful to consider the information that we'll want to surface on the CLI and web UI. I'm sensitive to the concern about scope creep, but I'm still bruised from previous RFCs that played fast & loose with the UI side. What new status information does this RFC create, and what does the API look like to retrieve this status? From a first pass, it's something like:
Including this information is critical to making sure this RFC gives a complete view of the process. Review status: 0 of 1 files reviewed at latest revision, 9 unresolved discussions, all commit checks successful. Comments from Reviewable |
Review status: 0 of 1 files reviewed at latest revision, 6 unresolved discussions, all commit checks successful. docs/RFCS/auto_finalize_cluster_upgrade.md, line 5 at r1 (raw file): Previously, benesch (Nikhil Benesch) wrote…
Done. docs/RFCS/auto_finalize_cluster_upgrade.md, line 47 at r1 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Done. docs/RFCS/auto_finalize_cluster_upgrade.md, line 57 at r2 (raw file): Previously, benesch (Nikhil Benesch) wrote…
Done. docs/RFCS/auto_finalize_cluster_upgrade.md, line 54 at r3 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Done. docs/RFCS/auto_finalize_cluster_upgrade.md, line 59 at r3 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Done. docs/RFCS/auto_finalize_cluster_upgrade.md, line 110 at r3 (raw file): Previously, benesch (Nikhil Benesch) wrote…
Done. Comments from Reviewable |
Done. Review status: 0 of 1 files reviewed at latest revision, 6 unresolved discussions. Comments from Reviewable |
Reviewed 1 of 1 files at r5. docs/RFCS/auto_finalize_cluster_upgrade.md, line 47 at r1 (raw file): Previously, windchan7 (Victor Chen) wrote…
Ok, sounds like we're good to go. Victor, can you file an issue to track the future work here, then link it in? ("settings: version upgrade races with booting old node") docs/RFCS/auto_finalize_cluster_upgrade.md, line 60 at r2 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Ooh, I like that. Comments from Reviewable |
32720b0
to
9e6e65b
Compare
bors r+ |
bors r- |
Canceled |
Pending final review! @bdarnell, @couchand, @BramGruneir, thoughts? |
Didn't look at the whole thing, but the additional detail about the API for upgrade state LGTM |
Review status: 0 of 1 files reviewed at latest revision, 1 unresolved discussion, some commit checks failed. docs/RFCS/20180411_finalize_cluster_upgrade_automatically.md, line 16 at r6 (raw file):
s/feature feature/feature/ docs/RFCS/20180411_finalize_cluster_upgrade_automatically.md, line 107 at r6 (raw file):
We should also check for the state when all live nodes are running the new version but some nodes are down and not decommissioned, to alert the operator to the need to either revive or decommission them. Comments from Reviewable |
I'm finalizing the design, please let me know if you have any other comments. |
Issue: cockroachdb#23912, cockroachdb#22686 Release note: None
bors r+ |
Build succeeded |
This RFC proposes to automate the finalization process to upgrade a cluster.
Issue: #23912
Release note: None