This repository has been archived by the owner on Mar 28, 2020. It is now read-only.

e2e: TestBackupAndRestore failed to restore seed etcd member #1825

Closed
hongchaodeng opened this issue Jan 3, 2018 · 12 comments
hongchaodeng commented Jan 3, 2018

jenkins job: https://jenkins-etcd.prod.coreos.systems/view/operator/job/etcd-operator-e2eslow-pr/1138/console

log:

e2e tests failed

This means the e2e-testing pod's status.phase is not Succeeded.

But the e2e-testing pod's log seems fine:
https://jenkins-etcd.prod.coreos.systems/view/operator/job/etcd-operator-e2eslow-pr/lastSuccessfulBuild/artifact/_output/logs/e2e-testing.e2e-testing.log

ok  	github.com/coreos/etcd-operator/test/e2e/e2eslow	226.727s
test success ===
hasbro17 commented Jan 3, 2018

hongchaodeng changed the title from "jenkins: e2e test 'failed' even though from the log no test failed" to "jenkins: e2e test 'failed' even though no test failed" (Jan 3, 2018)
@hongchaodeng

I see.

hongchaodeng changed the title from "jenkins: e2e test 'failed' even though no test failed" to "e2e: restore operator blacked out for a period" (Jan 3, 2018)
@hongchaodeng

Some further debugging suggests the restore operator was gone for a period:
https://jenkins-etcd.prod.coreos.systems/view/operator/job/etcd-operator-e2eslow-pr/1138/artifact/_output/logs/etcd-operator.etcd-restore-operator.log

time="2018-01-03T21:32:54Z" level=info msg="serving backup for restore CR test-etcd-backup-restore-2122009987609794903" 
rpc error: code = Unknown desc = Error: No such container: 293f361e57e5dd1842aa1d7909a27d2c003ae2327ef4c64f28b1a64805a43671
time="2018-01-03T21:35:13Z" level=info msg="Go Version: go1.9.2" 
time="2018-01-03T21:35:13Z" level=info msg="Go OS/Arch: linux/amd64" 

hongchaodeng changed the title from "e2e: restore operator blacked out for a period" to "e2e: TestBackupAndRestore failed to restore seed etcd member" (Jan 7, 2018)
@hongchaodeng

Actually, it might not be related to the restore operator restarting. The restart is possibly caused by disruptive_test. So this is a flake that we need to reproduce.


hongchaodeng commented Jan 8, 2018

@hongchaodeng

Reproducible.

@hongchaodeng

While reproducing the bug, I found that the issue comes from DNS resolution. Digging further, I found that the pod spec isn't right -- it is missing the check-dns init container!

Digging into the code shows that the init containers are overwritten in addRecoveryToPod():

func addRecoveryToPod(pod *v1.Pod, token string, m *etcdutil.Member, cs api.ClusterSpec, backupURL *url.URL) {
	// Replaces any init containers already on the pod (e.g. check-dns).
	pod.Spec.InitContainers = makeRestoreInitContainers(backupURL, token, cs.Repository, cs.Version, m)
}

Let's fix this first and verify whether it fixes the bug.

@hongchaodeng

Found the root cause. Very weird:

  restore-datadir:
    Container ID:  docker://b35c0c401f89646c8fe7ce60b92dad6e56d6204c80e7cc6e4ae5fac1b79ea81e
    Image:         quay.io/coreos/etcd:v3.2.13
    Image ID:      docker-pullable://quay.io/coreos/etcd@sha256:e372648462335719b21c5410d5a064a470953a73d59b587611a75f6dfe0d6fba
    Port:          <none>
    Command:
      /bin/sh
      -ec
      ETCDCTL_API=3 etcdctl snapshot restore /var/etcd/latest.backup --name test-etcd-backup-restore-0-0000 --initial-cluster test-etcd-backup-restore-0-0000=https://test-etcd-backup-restore-0-0000.test-etcd-backup-restore-0.e2e-etcd-operator-flaketest-525.svc:2380 --initial-cluster-token b78631b4-bc1e-48db-be0d-6fe831409874 --initial-advertise-peer-urls https://test-etcd-backup-restore-0-0000.test-etcd-backup-restore-0.e2e-etcd-operator-flaketest-525.svc:2380 --data-dir /var/etcd/data 2>/dev/termination-log
    State:      Terminated
      Reason:   Error
      Message:  2018-01-19 05:42:47.283461 I | pkg/netutil: resolving test-etcd-backup-restore-0-0000.test-etcd-backup-restore-0.e2e-etcd-operator-flaketest-525.svc:2380 to 10.28.6.250:2380
2018-01-19 05:42:47.285023 I | pkg/netutil: resolving test-etcd-backup-restore-0-0000.test-etcd-backup-restore-0.e2e-etcd-operator-flaketest-525.svc:2380 to 10.28.5.145:2380
Error:  --initial-cluster must include test-etcd-backup-restore-0-0000=https://test-etcd-backup-restore-0-0000.test-etcd-backup-restore-0.e2e-etcd-operator-flaketest-525.svc:2380 given --initial-advertise-peer-urls=https://test-etcd-backup-restore-0-0000.test-etcd-backup-restore-0.e2e-etcd-operator-flaketest-525.svc:2380

      Exit Code:    128


hongchaodeng commented Jan 19, 2018

Actually, I suspect the fetch-backup init container might have "failed": curl returns exit code 0 on a non-200 HTTP response unless --fail is passed, so an error response can look like success. There is also no easy way to pipe the error message to /dev/termination-log.

@hongchaodeng

The other type of frequent failure happens when scaling up from one member to two:

time="2018-01-19T16:03:09Z" level=info msg="added member (test-etcd-backup-restore-0-0001)" cluster-name=test-etcd-backup-restore-0 pkg=cluster 
time="2018-01-19T16:03:09Z" level=info msg="Finish reconciling" cluster-name=test-etcd-backup-restore-0 pkg=cluster 
time="2018-01-19T16:03:17Z" level=info msg="Start reconciling" cluster-name=test-etcd-backup-restore-0 pkg=cluster 
time="2018-01-19T16:03:17Z" level=info msg="running members: test-etcd-backup-restore-0-0000" cluster-name=test-etcd-backup-restore-0 pkg=cluster 
time="2018-01-19T16:03:17Z" level=info msg="cluster membership: test-etcd-backup-restore-0-0001,test-etcd-backup-restore-0-0000" cluster-name=test-etcd-backup-restore-0 pkg=cluster 
time="2018-01-19T16:03:17Z" level=info msg="Finish reconciling" cluster-name=test-etcd-backup-restore-0 pkg=cluster 
time="2018-01-19T16:03:17Z" level=error msg="failed to reconcile: lost quorum" cluster-name=test-etcd-backup-restore-0 pkg=cluster 

@hongchaodeng

Regarding the restore failure, here is the analysis.

Logs from the failed restore:

2018-01-19 23:42:54.398661 I | pkg/netutil: resolving test-etcd-backup-restore-0-0000.test-etcd-backup-restore-0.e2e-etcd-operator-flaketest-580.svc:2380 to 10.28.5.36:2380
2018-01-19 23:42:54.402454 I | pkg/netutil: resolving test-etcd-backup-restore-0-0000.test-etcd-backup-restore-0.e2e-etcd-operator-flaketest-580.svc:2380 to 10.28.4.183:2380
Error:  --initial-cluster must include test-etcd-backup-restore-0-0000=https://test-etcd-backup-restore-0-0000.test-etcd-backup-restore-0.e2e-etcd-operator-flaketest-580.svc:2380 given --initial-advertise-peer-urls=https://test-etcd-backup-restore-0-0000.test-etcd-backup-restore-0.e2e-etcd-operator-flaketest-580.svc:2380

In snapshot_command.go VerifyBootstrap():
https://github.com/coreos/etcd/blob/68d27b2d845ed71b434bbfd1f37b64cd04a6c83d/etcdctl/ctlv3/command/snapshot_command.go#L175
It resolves both DNS names (both are test-etcd-backup-restore-0-0000...) and compares their resolved IPs.
The mismatch is due to DNS propagation delay, since we just deleted member 0000 forcefully (maybe we should wait a bit).

Note that the comparison behavior is different in etcd 3.3:
https://github.com/coreos/etcd/blob/0f1ac0cef6834f0927dec74b0f0bf4d0dad9b763/etcdserver/config.go#L120-L121
It falls back to comparing the DNS names themselves if the resolved IPs do not match.

@hongchaodeng

Fixed by #1875 (comment).
