-
Notifications
You must be signed in to change notification settings - Fork 9.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Etcd cluster downgrade #11716
Comments
Thank you! I have reviewed the changes some time ago. I will re-review the changes. |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions. |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions. |
Are we still working on this? |
Sorry I dropped it for a long time. Basically, the server side is almost done except the integration test which I'll add it soon later. The client and etcdctl part could be done in v3.5 or v3.6 because v3.6 is the first version that will support downgrade to v3.5. |
cc @ptabor |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions. |
Seems like there are a couple of things missing from this issue, mainly the:
@YoyinZyc Could you confirm this is the case and is this something you are going to continue working on? Thank you! |
@serathius might eventually continue work on this. |
Nice, thanks for the update! I started experimenting a bit with this as we wanted to test it for 3.5 -> 3.4 downgrade. Will report my findings here. I already patched the v3 client as I couldn't get it working otherwise if @serathius hasn't started yet, is it okay for me to open a PR for it? |
I'm happy to finish work on cluster downgrade. @lilic thanks for the help, it would be great to collaborate. Feel free to send the PR just let's make sure to split the work to avoid conflicts. |
Thanks, sounds great! Here is my PR #13083 to add Downgrade to the client, I had to do that as etcd doesn't have rpc reflection I couldn't use As mentioned there, the downgrade failed for me when I tried it, happy to provide further steps I did if you are interested. |
I tried to perform a downgrade from v3.5.0 -> v3.4.16, but ran into a few issues:
logs:
snapshot:
Maybe I ran some steps incorrectly, I can also detail them here again if needed?
I am wondering if the https://github.com/etcd-io/etcd/blob/main/CHANGELOG-3.5.md#storage-format-changes changes have anything to do with that, and we just don't support this kind of a downgrade from v3.5 -> v3.4? |
Hey, thanks for catching this up. I will take a look. |
I managed to reproduce same issues as @lilic on v3.5 -> v3.4 downgrade. Looks like downgrade policy changes #11919 were introduced in v3.5, but it shouldn't be an issue as cluster version should be lowered allowing older clusters to join. Logic in v3.4 should allow to downgrade assuming that determines that cluster version is determined to be 3.4.
Looks like this doesn't happen. Possibly because v3.4 runs from data created by v3.5 which saved to bolt that cluster version is v3.5. Will need to dig dipper here. Whats is more worrying for me was the results of perspective v3.6 -> v3.5 downgrade. Downgrade went through, but the after replacing one cluster member, all other paniced. I recreated all members, cluster looked healthy, so I decided to repeat the procedure by upgrading back to v3.6. Unfortunatelly looks like the downgrade process was not marked as finished even though all members were in correct version. I'm still learning the procedure and may not have done everything correctly, but looks like downgrade needs much more work and testing before releasing. |
Ok, found original issue that was meant to fix v3.5 -> v3.4 downgrade. #12988 |
Analysed what exactly happends on v3.6 -> v3.5 downgrade: M1, M2 - member 1 and 2 Time each row it point in time. Time increases with lower rows. Cells store versions.
Looks like there is a conflict between logic for selecting cluster version and downgrade. No matter the downgrade cluster version will be imminently changed when a new member with lower cluster version joins. Proposal: It Members with higher version should be able to join during downgrade. Reasons:
|
great sleuthing @serathius! thank you 👍 |
Thanks for the update!
+1 |
1 minor version higher are OK. Do we plan to guard 2 minor versions difference ? |
I would say yes, that should be a new downgrade then. I am not sure it would work smoothly from 3.5 -> 3.3, and we would have too much to keep in mind when developing, to not introduce breaking changes. I guess we could add something like proposed in #13022 to allow more than 1 minor version. |
@ptabor Yes, current guarding does that, but even more. We are not allowing joining members with version other then target version, so in case of downgrade from v3.5 -> v3.4, we only allow v3.4. Proposal is to expand this to allow both (v3.4 and v3.5) to handle disconnects. During downgrade v3.5 -> v3.4, members with v3.6 should not be allowed to join. |
Found another two problems, one is related to decision about upgrade being finished second to wal replay.
Logs from M2 when replaying wal:
First issueLeaders decides that clusters has been downgraded when cluster version equal target version. As cluster version is defined as minimal server version this happens imminently when first downgraded member joins. First this seem too fast for me as technically operator didn't finish the downgrade, they just downgraded one node. On the other hand from cluster perspective, cluster version was lowered which means that cluster was brought to mixed state similar that would happen doing upgrade. Proposal: Use server version to determine if cluster have been downgraded. Reasons:
This issue lead me to think about what is expected to happen when operator cancels the upgrade half way. Technically one node is enough to bring the cluster version down, which means that at this time canceling downgrade doesn't mean anything as downgrade has already happen. I think we should discuss what behavior is expected from downgrade process so we can rethink the original design. Implementing the proposal makes whole downgrade process work. I have successfully rerun the downgrade twice (on separate clusters), still there was issue with wall that makes me feel that we are not done here. Second issueWhen member tries to rejoin the cluster it panics when replaying wal. Technically it should be possible for them to join, they have version matching the cluster version. Then what happens? Downgrade was canceled and it was there to see it, so they saved this information to backed. When replaying the wal, member compares (cluster version from wal, it's local server version and latest downgrade info from backend). I'm still trying to figure out how it all works but those conclusions are based on testing case when upgrade succeeded after fixing first problem. In that case wal problem didn't happen, I assume the reason is that downgrade was not canceled, making the assertion pass. This still doesn't fix the problem as we cannot replay historical upgrades stored in wal with latest binary version. Please let me know if this analysis and conclusions make sense. In the meantime I will try to learn more about how wal, state in memory and state stored in backend interacts. |
This is definitely an issue, I think in the original design is that, during the process of downgrading, cluster version will be locked without changing to the lowest version, then we have a separate target version. The cluster version only changes when all nodes are changed to the target version. "cluster version is defined as minimal server version " This was there for the benefit of upgrading. Before, etcd only allow upgrading, and when a cluster is upgrading to a higher version, we want the cluster version to remain at the lowest version until all nodes are upgraded, which is not correct for downgrading scenario. So I think what we really want is for the cluster version to stay at the original version, not the lowest version, for both upgrading or downgrading scenario. And yes, mustDetectDowngrade is there to prevent downgrade since etcd does not support downgrade before. It needs to be relaxed now that we support it. For the determinatino about if cluster have been downgraded, I think your proposal of using server version is good. You may also want to make sure cluster version doens't change as the application who uses etcd may have some dependencies on the cluster version.
I have a feeling that @YoyinZyc has seen that before, but I have not sure. @YoyinZyc do you remember this? |
Sorry I didn't follow this thread for a long time. My knowledge may be outdated. As for your first issue, I remember my implementation will not let downgrade be cancelled when the cluster version is equal to target version. The leader will keep monitoring its followers and only if all the followers reach to target version the downgrade cancel request will be sent out and the downgrade status will be marked as done. My big guess on why M2 is crushed based on the log you pasted here is the cluster version.
Ideally, once you downgrade M1 to 3.5, the There may be some bug around restoring cluster version which cause this issue. |
For downgrade we could technically lock the cluster version for the upgrade process, but this would not work with cancel. If upgrade can be canceled we would automatically revert to original logic "pick lowest server version". With cancel, downgrading one node and canceling would be enough to consider cluster downgraded. I think we should settle on different definition of cluster version, a definition that has stable value no matter if we are upgrading, downgrading or recovering from canceled downgrade. Proposal - cluster version based on quorumI would propose to change definition of cluster version from minimum version of server, to be defined as a version of quorum of members. So two in three node clusters and three in five node clusters. This has benefits like:
Downside:
Proposal - Use a version target for both downgrade and upgradeTo eliminate downside from proposal 1 we can not only use quorum but also make upgrades and downgrades symmetrical, with only difference that upgrades are automatically initialized by leader when upgraded node is detected and downgrades are initialized by operator manually. In this case DowngradeTarget (DT), would be just changed to SecondaryVersion(SV). When set it would indicate that cluster is during version change and prevent members with version outside of [CV,SV] to join and operator to start downgrade in cluster with mixed versions. Example states for v3.5
One tricky thing is what to do when we reach majority and cluster version is updated. In that case we could swap versions CV and SV. We would need a additional bool to store direction of version change, but this way we can ensure that after reaching quorum, operator will need to finish the current downgrade/upgrade before starting a new one. Example upgrade
This idea still needs some thought, but I think it will work as it introduces symmetrical mechanism for preventing large version drift from only upgrade to also work for downgrade. Decision if member can join is done by a member itself and it can make this decision only based on it's local state (cluster version, downgrade target/secondary version, binary version). To guarantee that member version difference in cluster is max <=1 that also work for downgrades we need to store at two values (min and max). No matter what names of versions we use (secondary version, target version etc), I think we need to use two values. Please let me know if this goes in good direction or we should go back to original design. I would be happy to put it in a document where we can continue the discussions. No matter which approach we take I think we need this document to make the downgrade process understood by everyone. |
@serathius Thank you for the proposals. I think proposal 2 sounds good but I have a question. |
In this discussion I focused on the control logic for coordinating zero-downtime downgrades and upgrades. This control logic should enforce invariant that upgrades should not be done in by more than one version. This works with assumption that backed is +/-1 version compatible and backend can be downgraded/updated by just running Etcd server on it. Introducing backend compatibility guarantees that are clearly defined, tested and tooled is definitely required to make the downgrade/upgrade process reliable. I will propose something here too. |
@YoyinZyc Thanks for bringing my attention to backend or etcd data backward compatibility. After giving thing a thought I found that problem is big enough that it deserves it's own dedicated design. I have wrote down the design for etcd data versioning. It proposes to introduces idea of Storage version and proposes some restrictions to ensure that older etcd binaries can safety load data during downgrade without breaking Raft consensus requirements or causing unexpected behaviors. Please take a look https://docs.google.com/document/d/1yD0GDkxqWBPAax6jLZ97clwAz2Gp0Gux6xaTrtJ6wHE/edit?usp=sharing cc @ptabor @lilic @YoyinZyc @jingyih @hexfusion |
Created dedicated issue to track progress of implementation. #13168 |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions. |
Previous downgrade related PR(#11362, #11492) are both too big to review. Let's open this issue to track downgrade implementation progress this time. I'll break it down into small tasks and get them merged step by step.
Design doc: https://docs.google.com/document/d/1mSihXRJz8ROhXf4r5WrBGc8aka-b8HKjo2VDllOc6ac/edit#heading=h.u32uhgogqjq2
TODO (EDIT by @serathius):
Server side:
cc @jingyih @gyuho @wenjiaswe @jpbetz @lavalamp
The text was updated successfully, but these errors were encountered: