Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

*: add "--unsafe-allow-cluster-version-downgrade" for not failing cluster version downgrade #13022

Closed
wants to merge 2 commits into from

Conversation

chaochn47
Copy link
Member

Compared to the public etcd downgrade design, I am trying to simplify the manual "whitelisting target downgrade version" process and it doesn't require any server/client side API changes. Once the new --unsafe-allow-cluster-version-downgrade flag is enabled, only one minor version downgrade is allowed (e.g. v3.6 to v3.5). We can cherry-pick the same change to v3.4 and v3.3 if this feature is useful.

However, if a user is already using --experimental-enable-lease-checkpoint flag to bootstrap a v3.4 cluster, the version downgrade is not possible due to the MustUnmarshal will panic on LeaseCheckpoint internal raft message which doesn't exist in v3.3. Not to mention there is no corresponding apply method. As a result, the new added etcd v3.3 flag --unsafe-allow-cluster-version-downgrade is not expected to set to true in this case.

I've excessively tested v3.5 to v3.4 and v3.4 to v3.3 downgrade in two environments and the results look good. I will publish the backport PRs once the community agree on the approach.

  1. Tested locally first bootstrap with 3 etcd v3.4 servers and then downgrade to v3.3 one by one. It aims to test when the v2 store cluster version request replicate from v3.4 to new v3.3 server, it can be replaced by local 3.3 cluster version.
  2. Tested in a 1.17 kubernetes cluster which by default have 3 etcd v3.4 servers. It amis to test new v3.3 server can recover from the snapshot and replace the cluster version. At the same time when downgrading starts, I ran the sonobuoy e2e conformance test to verify it won't cause k8s functionality regression.

I understand this is a risky operation without controlling the set of APIs used by the etcd client. But, I would like to put it here for reference, as others may find it useful.

Any feedback is welcome, Thanks!

@gyuho gyuho self-assigned this May 21, 2021
@gyuho
Copy link
Contributor

gyuho commented May 21, 2021

@jpbetz Do we have any idea how widely lease checkpointer is used? The main motivation here is, while downgrade API helps you work around allowing version checks, it still does not prevent crashes from new message field like lease checkpointer. Then we might as well allow built-in downgrade, for whomever is willing to take the risk -- the risk can be deterministic, calculated if one controls the client behavior.

@codecov-commenter
Copy link

codecov-commenter commented May 21, 2021

Codecov Report

Merging #13022 (94f9936) into main (80ccb27) will increase coverage by 19.74%.
The diff coverage is 31.25%.

Impacted file tree graph

@@             Coverage Diff             @@
##             main   #13022       +/-   ##
===========================================
+ Coverage   47.24%   66.99%   +19.74%     
===========================================
  Files         438      420       -18     
  Lines       34110    33409      -701     
===========================================
+ Hits        16116    22382     +6266     
+ Misses      16088     9047     -7041     
- Partials     1906     1980       +74     
Flag Coverage Δ
all 66.99% <31.25%> (+19.74%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
server/config/config.go 74.24% <ø> (+34.84%) ⬆️
server/etcdserver/api/membership/downgrade.go 12.12% <0.00%> (-3.27%) ⬇️
server/etcdserver/raft.go 86.17% <0.00%> (+11.17%) ⬆️
server/etcdserver/server.go 78.10% <0.00%> (+17.93%) ⬆️
server/etcdserver/api/membership/cluster.go 78.80% <50.00%> (+21.07%) ⬆️
server/embed/config.go 68.70% <100.00%> (+23.28%) ⬆️
server/embed/etcd.go 68.65% <100.00%> (+3.90%) ⬆️
server/etcdmain/config.go 83.85% <100.00%> (+2.90%) ⬆️
server/etcdserver/cluster_util.go 68.72% <100.00%> (+28.28%) ⬆️
raft/quorum/quorum.go 0.00% <0.00%> (-100.00%) ⬇️
... and 290 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 80ccb27...94f9936. Read the comment docs.

Copy link
Contributor

@lilic lilic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense to me, but I was wondering could also work together with the downgrade subcommand, should we then maybe put it under that as a flag?
etcdctl downgrade --unsafe

Note the downgrade command work still needs to finished up. I just wonder if that is a somewhat better UX?

@@ -489,7 +499,8 @@ func NewConfig() *Config {
ExperimentalMemoryMlock: false,
ExperimentalTxnModeWriteWithSharedBuffer: true,

V2Deprecation: config.V2_DEPR_DEFAULT,
UnsafeAllowClusterVersionDowngrade: DefaultUnsafeAllowClusterVersionDowngrade,
V2Deprecation: config.V2_DEPR_DEFAULT,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick and optional: lets add a new line as these two are not associated, as per ordering of the rest of the fields.

Copy link
Member Author

@chaochn47 chaochn47 Jun 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, fixed.

@chaochn47
Copy link
Member Author

chaochn47 commented Jun 8, 2021

A big reason for adding a flag is the downgrade api is not completed in client side and we need to integrate this to the operator downgrade workflow.

So this PR provides a simpler way for minor version downgrade automation.

I can help finish the remaining etcdctl downgrade work if the original author is okay with that ==

@chaochn47 chaochn47 force-pushed the auto-rollback-v3.6-to-v3.5 branch from 9711b4f to c8d2dad Compare June 8, 2021 21:15
@chaochn47 chaochn47 force-pushed the auto-rollback-v3.6-to-v3.5 branch from c8d2dad to 3d4499a Compare June 8, 2021 23:27
@chaochn47 chaochn47 force-pushed the auto-rollback-v3.6-to-v3.5 branch from 1c4027b to 94f9936 Compare June 9, 2021 00:40
@lilic
Copy link
Contributor

lilic commented Jun 9, 2021

I added downgrade API to the client, but the part of the command still need to be worked on, I started the discussion on original issue #11716.

@stale
Copy link

stale bot commented Sep 7, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Sep 7, 2021
@stale stale bot closed this Sep 28, 2021
@chaochn47 chaochn47 deleted the auto-rollback-v3.6-to-v3.5 branch March 28, 2022 03:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Development

Successfully merging this pull request may close these issues.

4 participants