Downgrading cluster may corrupt data #6457

Closed
tschuy opened this issue Sep 16, 2016 · 4 comments

Comments

@tschuy

tschuy commented Sep 16, 2016

While a cluster was being upgraded in-place from 2.3.7 to 3.0.7, the first machine being upgraded was rebooted, restarting the v2 service. As the v3 and v2 services pointed to the same data directories, etcd2 failed to start.

After disabling the etcd2 service and starting etcd3, the cluster fails to start, and etcd panics with a state.commit out-of-range error:

$ sudo journalctl -f -u etcd.service
-- Logs begin at Mon 2016-09-05 14:26:55 UTC. --
Sep 16 17:54:28 discovery-etcd-0.us-west-1b.coreos.systems systemd[1]: etcd.service: Unit entered failed state.
Sep 16 17:54:28 discovery-etcd-0.us-west-1b.coreos.systems systemd[1]: etcd.service: Failed with result 'exit-code'.
Sep 16 17:54:38 discovery-etcd-0.us-west-1b.coreos.systems systemd[1]: etcd.service: Service hold-off time over, scheduling restart.
Sep 16 17:54:38 discovery-etcd-0.us-west-1b.coreos.systems systemd[1]: Stopped etcd3.
Sep 16 17:54:38 discovery-etcd-0.us-west-1b.coreos.systems systemd[1]: Starting etcd3...
Sep 16 17:54:39 discovery-etcd-0.us-west-1b.coreos.systems rkt[14883]: image: using image from local store for image name coreos.com/etcd:v3.0.7
Sep 16 17:54:39 discovery-etcd-0.us-west-1b.coreos.systems rkt[14883]: sha512-f7e8c8ac24b6b995987bc4362beeeaac
Sep 16 17:54:39 discovery-etcd-0.us-west-1b.coreos.systems systemd[1]: Started etcd3.
Sep 16 17:54:39 discovery-etcd-0.us-west-1b.coreos.systems rkt[14906]: image: using image from local store for image name coreos.com/rkt/stage1-coreos:1.14.0
Sep 16 17:54:39 discovery-etcd-0.us-west-1b.coreos.systems rkt[14906]: image: using image from local store for image name coreos.com/etcd:v3.0.7
Sep 16 17:54:50 discovery-etcd-0.us-west-1b.coreos.systems rkt[14906]: [508656.270526] etcd[5]: panic: d23f70f817a2d5b6 state.commit 2639181023 is out of range [2639065026, 2639073951]
Sep 16 17:54:50 discovery-etcd-0.us-west-1b.coreos.systems rkt[14906]: [508656.401797] etcd[5]: goroutine 1 [running]:
Sep 16 17:54:50 discovery-etcd-0.us-west-1b.coreos.systems rkt[14906]: [508656.402502] etcd[5]: panic(0xd420a0, 0xc82f586610)
Sep 16 17:54:50 discovery-etcd-0.us-west-1b.coreos.systems rkt[14906]: [508656.402928] etcd[5]:         /usr/lib/go/src/runtime/panic.go:481 +0x3e6
Sep 16 17:54:50 discovery-etcd-0.us-west-1b.coreos.systems rkt[14906]: [508656.403382] etcd[5]: github.com/coreos/etcd/cmd/vendor/github.com/coreos/pkg/capnslog.(*PackageLogger).Panicf(0xc8200f7ce0, 0x1233ee0, 0x2b, 0xc8523a8980, 0x4, 0x4)
Sep 16 17:54:50 discovery-etcd-0.us-west-1b.coreos.systems rkt[14906]: [508656.403464] etcd[5]:         /home/anthony/src/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/pkg/capnslog/pkg_logger.go:75 +0x191
Sep 16 17:54:50 discovery-etcd-0.us-west-1b.coreos.systems rkt[14906]: [508656.403540] etcd[5]: github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft.(*raft).loadState(0xc8523d8b60, 0x172908, 0x31c5cfd1106f9608, 0x9d4eb4df, 0x0, 0x0, 0x0)
Sep 16 17:54:50 discovery-etcd-0.us-west-1b.coreos.systems rkt[14906]: [508656.403613] etcd[5]:         /home/anthony/src/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft/raft.go:942 +0x2a2
Sep 16 17:54:50 discovery-etcd-0.us-west-1b.coreos.systems rkt[14906]: [508656.403687] etcd[5]: github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft.newRaft(0xc8201258d8, 0x0)
Sep 16 17:54:50 discovery-etcd-0.us-west-1b.coreos.systems rkt[14906]: [508656.403768] etcd[5]:         /home/anthony/src/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft/raft.go:225 +0x8ff
Sep 16 17:54:50 discovery-etcd-0.us-west-1b.coreos.systems rkt[14906]: [508656.403845] etcd[5]: github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft.RestartNode(0xc8201258d8, 0x0, 0x0)
Sep 16 17:54:50 discovery-etcd-0.us-west-1b.coreos.systems rkt[14906]: [508656.403917] etcd[5]:         /home/anthony/src/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft/node.go:213 +0x45
Sep 16 17:54:50 discovery-etcd-0.us-west-1b.coreos.systems rkt[14906]: [508656.403989] etcd[5]: github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver.restartNode(0xc820139080, 0xc82018a000, 0x29, 0xc820125d78, 0x0, 0x0, 0x0, 0x0)
Sep 16 17:54:50 discovery-etcd-0.us-west-1b.coreos.systems rkt[14906]: [508656.404100] etcd[5]:         /home/anthony/src/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/raft.go:369 +0x7c7
Sep 16 17:54:50 discovery-etcd-0.us-west-1b.coreos.systems rkt[14906]: [508656.404180] etcd[5]: github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver.NewServer(0xc820139080, 0x0, 0x0, 0x0)
Sep 16 17:54:50 discovery-etcd-0.us-west-1b.coreos.systems rkt[14906]: [508656.404256] etcd[5]:         /home/anthony/src/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/server.go:354 +0x4308
Sep 16 17:54:50 discovery-etcd-0.us-west-1b.coreos.systems rkt[14906]: [508656.404328] etcd[5]: github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain.startEtcd(0xc820197000, 0x0, 0x0, 0x0)
Sep 16 17:54:50 discovery-etcd-0.us-west-1b.coreos.systems rkt[14906]: [508656.404399] etcd[5]:         /home/anthony/src/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain/etcd.go:374 +0x245f
Sep 16 17:54:50 discovery-etcd-0.us-west-1b.coreos.systems rkt[14906]: [508656.404471] etcd[5]: github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain.startEtcdOrProxyV2()
Sep 16 17:54:50 discovery-etcd-0.us-west-1b.coreos.systems rkt[14906]: [508656.404542] etcd[5]:         /home/anthony/src/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain/etcd.go:116 +0x2101
Sep 16 17:54:50 discovery-etcd-0.us-west-1b.coreos.systems rkt[14906]: [508656.404614] etcd[5]: github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain.Main()
Sep 16 17:54:50 discovery-etcd-0.us-west-1b.coreos.systems rkt[14906]: [508656.404705] etcd[5]:         /home/anthony/src/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain/main.go:36 +0x21e
Sep 16 17:54:50 discovery-etcd-0.us-west-1b.coreos.systems rkt[14906]: [508656.404782] etcd[5]: main.main()
Sep 16 17:54:50 discovery-etcd-0.us-west-1b.coreos.systems rkt[14906]: [508656.404855] etcd[5]:         /home/anthony/src/gopath/src/github.com/coreos/etcd/cmd/main.go:28 +0x14
Sep 16 17:54:50 discovery-etcd-0.us-west-1b.coreos.systems systemd[1]: etcd.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Sep 16 17:54:50 discovery-etcd-0.us-west-1b.coreos.systems systemd[1]: etcd.service: Unit entered failed state.
Sep 16 17:54:50 discovery-etcd-0.us-west-1b.coreos.systems systemd[1]: etcd.service: Failed with result 'exit-code'.
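
For readers following the trace: the panic comes from raft.(*raft).loadState, which appears to reject a persisted commit index that falls outside the range of the log actually loaded from disk. The following is a minimal Go sketch of that kind of range check, not etcd's code. The function name checkCommit and its parameters are made up for illustration, and the numbers are copied from the panic line above. With those numbers, the stored commit index is about 107,000 entries past the last index in the log.

```go
package main

import "fmt"

// checkCommit is an illustrative stand-in for the range check behind the
// panic message. It is NOT etcd's API: the name and signature are assumptions.
// The commit index from the persisted state must lie within
// [committed, lastIndex] of the log that was loaded from disk.
func checkCommit(stateCommit, committed, lastIndex uint64) error {
	if stateCommit < committed || stateCommit > lastIndex {
		return fmt.Errorf("state.commit %d is out of range [%d, %d]",
			stateCommit, committed, lastIndex)
	}
	return nil
}

func main() {
	// Values copied from the panic line in the log above.
	const (
		stateCommit = 2639181023
		committed   = 2639065026
		lastIndex   = 2639073951
	)
	if err := checkCommit(stateCommit, committed, lastIndex); err != nil {
		fmt.Println("would panic:", err)
		// How far the stored commit index runs past the end of the log.
		fmt.Println("entries beyond the log:", uint64(stateCommit-lastIndex))
	}
}
```

Read this way, the state and the log on disk disagree with each other, which would be consistent with two different etcd versions having written to the same data directory. That is an interpretation of the message, not a confirmed diagnosis.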

It would be good to have a mechanism in place to prevent this kind of corruption when downgrades occur, and instead just fail safely.

@philips
Contributor

philips commented Sep 16, 2016

You mean you upgraded to v3.0.7 and then downgraded to 2.3.7 afterwards?

@xiang90
Contributor

xiang90 commented Oct 12, 2016

We do not support cluster downgrades between minor versions. I do not think we should put a lot of effort into making this super safe in the near future. So closing.

xiang90 closed this as completed Oct 12, 2016
@xiang90
Contributor

xiang90 commented Oct 12, 2016

Discussed this with @heyitsanthony again. We should probably provide better logging or instructions, so reopening.

@gyuho
Contributor

gyuho commented May 3, 2018

Merging into #9306.

gyuho closed this as completed May 3, 2018