*: add "--unsafe-allow-cluster-version-downgrade" for not failing cluster version downgrade #13022

chaochn47 · 2021-05-21T00:16:08Z

Compared to the public etcd downgrade design, I am trying to simplify the manual "whitelisting target downgrade version" process and it doesn't require any server/client side API changes. Once the new --unsafe-allow-cluster-version-downgrade flag is enabled, only one minor version downgrade is allowed (e.g. v3.6 to v3.5). We can cherry-pick the same change to v3.4 and v3.3 if this feature is useful.

However, if a user is already using --experimental-enable-lease-checkpoint flag to bootstrap a v3.4 cluster, the version downgrade is not possible due to the MustUnmarshal will panic on LeaseCheckpoint internal raft message which doesn't exist in v3.3. Not to mention there is no corresponding apply method. As a result, the new added etcd v3.3 flag --unsafe-allow-cluster-version-downgrade is not expected to set to true in this case.

I've excessively tested v3.5 to v3.4 and v3.4 to v3.3 downgrade in two environments and the results look good. I will publish the backport PRs once the community agree on the approach.

Tested locally first bootstrap with 3 etcd v3.4 servers and then downgrade to v3.3 one by one. It aims to test when the v2 store cluster version request replicate from v3.4 to new v3.3 server, it can be replaced by local 3.3 cluster version.
Tested in a 1.17 kubernetes cluster which by default have 3 etcd v3.4 servers. It amis to test new v3.3 server can recover from the snapshot and replace the cluster version. At the same time when downgrading starts, I ran the sonobuoy e2e conformance test to verify it won't cause k8s functionality regression.

I understand this is a risky operation without controlling the set of APIs used by the etcd client. But, I would like to put it here for reference, as others may find it useful.

Any feedback is welcome, Thanks!

gyuho · 2021-05-21T00:35:45Z

@jpbetz Do we have any idea how widely lease checkpointer is used? The main motivation here is, while downgrade API helps you work around allowing version checks, it still does not prevent crashes from new message field like lease checkpointer. Then we might as well allow built-in downgrade, for whomever is willing to take the risk -- the risk can be deterministic, calculated if one controls the client behavior.

codecov-commenter · 2021-05-21T01:05:17Z

Codecov Report

Merging #13022 (94f9936) into main (80ccb27) will increase coverage by 19.74%.
The diff coverage is 31.25%.

@@             Coverage Diff             @@
##             main   #13022       +/-   ##
===========================================
+ Coverage   47.24%   66.99%   +19.74%     
===========================================
  Files         438      420       -18     
  Lines       34110    33409      -701     
===========================================
+ Hits        16116    22382     +6266     
+ Misses      16088     9047     -7041     
- Partials     1906     1980       +74

Flag	Coverage Δ
all	`66.99% <31.25%> (+19.74%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
server/config/config.go	`74.24% <ø> (+34.84%)`	⬆️
server/etcdserver/api/membership/downgrade.go	`12.12% <0.00%> (-3.27%)`	⬇️
server/etcdserver/raft.go	`86.17% <0.00%> (+11.17%)`	⬆️
server/etcdserver/server.go	`78.10% <0.00%> (+17.93%)`	⬆️
server/etcdserver/api/membership/cluster.go	`78.80% <50.00%> (+21.07%)`	⬆️
server/embed/config.go	`68.70% <100.00%> (+23.28%)`	⬆️
server/embed/etcd.go	`68.65% <100.00%> (+3.90%)`	⬆️
server/etcdmain/config.go	`83.85% <100.00%> (+2.90%)`	⬆️
server/etcdserver/cluster_util.go	`68.72% <100.00%> (+28.28%)`	⬆️
raft/quorum/quorum.go	`0.00% <0.00%> (-100.00%)`	⬇️
... and 290 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 80ccb27...94f9936. Read the comment docs.

lilic

This makes sense to me, but I was wondering could also work together with the downgrade subcommand, should we then maybe put it under that as a flag?
etcdctl downgrade --unsafe

Note the downgrade command work still needs to finished up. I just wonder if that is a somewhat better UX?

lilic · 2021-05-21T07:39:41Z

server/embed/config.go

@@ -489,7 +499,8 @@ func NewConfig() *Config {
 		ExperimentalMemoryMlock:                  false,
 		ExperimentalTxnModeWriteWithSharedBuffer: true,

-		V2Deprecation: config.V2_DEPR_DEFAULT,
+		UnsafeAllowClusterVersionDowngrade: DefaultUnsafeAllowClusterVersionDowngrade,
+		V2Deprecation:                      config.V2_DEPR_DEFAULT,


Nitpick and optional: lets add a new line as these two are not associated, as per ordering of the rest of the fields.

Good point, fixed.

chaochn47 · 2021-06-08T21:07:52Z

A big reason for adding a flag is the downgrade api is not completed in client side and we need to integrate this to the operator downgrade workflow.

So this PR provides a simpler way for minor version downgrade automation.

I can help finish the remaining etcdctl downgrade work if the original author is okay with that ==

lilic · 2021-06-09T07:55:20Z

I added downgrade API to the client, but the part of the command still need to be worked on, I started the discussion on original issue #11716.

stale · 2021-09-07T10:31:03Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

gyuho self-assigned this May 21, 2021

gyuho added the decision-needed/big-feature label May 21, 2021

chaochn47 force-pushed the auto-rollback-v3.6-to-v3.5 branch from 78f9d2b to 9711b4f Compare May 21, 2021 16:51

lilic reviewed Jun 8, 2021

View reviewed changes

chaochn47 force-pushed the auto-rollback-v3.6-to-v3.5 branch from 9711b4f to c8d2dad Compare June 8, 2021 21:15

add v3.6 to v3.5 downgrade support automatically

3d4499a

chaochn47 force-pushed the auto-rollback-v3.6-to-v3.5 branch from c8d2dad to 3d4499a Compare June 8, 2021 23:27

CHANGELOG-3.5.md: update

94f9936

chaochn47 force-pushed the auto-rollback-v3.6-to-v3.5 branch from 1c4027b to 94f9936 Compare June 9, 2021 00:40

lilic mentioned this pull request Jun 11, 2021

Etcd cluster downgrade #11716

Open

13 tasks

chaochn47 mentioned this pull request Aug 20, 2021

use v2 api to update cluster version #12988

Merged

stale bot added the stale label Sep 7, 2021

stale bot closed this Sep 28, 2021

grosser mentioned this pull request Nov 15, 2021

minor version downgrades instructions ? #13482

Closed

chaochn47 deleted the auto-rollback-v3.6-to-v3.5 branch March 28, 2022 03:07

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

*: add "--unsafe-allow-cluster-version-downgrade" for not failing cluster version downgrade #13022

*: add "--unsafe-allow-cluster-version-downgrade" for not failing cluster version downgrade #13022

chaochn47 commented May 21, 2021

gyuho commented May 21, 2021 •

edited

Loading

codecov-commenter commented May 21, 2021 •

edited

Loading

lilic left a comment

lilic May 21, 2021

chaochn47 Jun 8, 2021 •

edited

Loading

chaochn47 commented Jun 8, 2021 •

edited

Loading

lilic commented Jun 9, 2021

stale bot commented Sep 7, 2021

*: add "--unsafe-allow-cluster-version-downgrade" for not failing cluster version downgrade #13022

*: add "--unsafe-allow-cluster-version-downgrade" for not failing cluster version downgrade #13022

Conversation

chaochn47 commented May 21, 2021

gyuho commented May 21, 2021 • edited Loading

codecov-commenter commented May 21, 2021 • edited Loading

Codecov Report

lilic left a comment

Choose a reason for hiding this comment

lilic May 21, 2021

Choose a reason for hiding this comment

chaochn47 Jun 8, 2021 • edited Loading

Choose a reason for hiding this comment

chaochn47 commented Jun 8, 2021 • edited Loading

lilic commented Jun 9, 2021

stale bot commented Sep 7, 2021

gyuho commented May 21, 2021 •

edited

Loading

codecov-commenter commented May 21, 2021 •

edited

Loading

chaochn47 Jun 8, 2021 •

edited

Loading

chaochn47 commented Jun 8, 2021 •

edited

Loading