This repository has been archived by the owner on Mar 28, 2020. It is now read-only.

Backup timing out with "context deadline exceeded" #1903

Closed
zlangbert opened this issue Feb 4, 2018 · 4 comments

Comments

@zlangbert

zlangbert commented Feb 4, 2018

I have a cluster that is failing to back up to S3, with the following logs:

time="2018-02-04T17:49:48Z" level=info msg="getMaxRev: endpoint etcd-0:2379 revision (246968983)"
time="2018-02-04T17:49:48Z" level=info msg="getMaxRev: endpoint etcd-1:2379 revision (246968983)"
time="2018-02-04T17:49:48Z" level=info msg="getMaxRev: endpoint etcd-2:2379 revision (246968984)" 
time="2018-02-04T17:50:49Z" level=error msg="error syncing etcd backup (kube-system/cluster-backup-1517764931): failed to save snapshot (failed to write snapshot (MultipartUpload: upload multipart failed 
  upload id: xxxxxxxx
caused by: ReadRequestBody: read multipart upload data failed
caused by: rpc error: code = 4 desc = context deadline exceeded))" pkg=controller 

I think the operation is being canceled because the snapshot timeout (60s) is too short. This cluster produces etcd snapshots of about 260 MB. Another cluster of mine, with snapshots of around 50 MB, saves them successfully.

Could we consider increasing the snapshot timeout, or making it configurable? Thanks

@hongchaodeng
Member

@zlangbert
Yes. We can make the snapshot timeout configurable.
Do you have stats for upload throughput?

@zlangbert
Author

zlangbert commented Feb 4, 2018

It looks like the operator may have been CPU-bound. My default CPU request/limit was 50m/50m. After I raised the operator's request/limit to 250m/250m, the backup finished successfully. I think my metrics aren't granular enough to show the CPU starvation.

I did time the snapshot save:

etcd-0 ~ $ ETCDCTL_API=3 time etcdctl --endpoints="https://127.0.0.1:2379" --cacert="/etc/ssl/etcd/ca.crt" snapshot save test.db
Snapshot saved at test.db

real	0m5.926s
user	0m1.168s
sys	0m0.499s

In any case, I think this issue is still valid: the timeout should probably be configurable. Also if the docs mentioned minimum recommended resource requirements that might save someone some trouble. Thanks!

@hongchaodeng
Member

Also if the docs mentioned minimum recommended resource requirements that might save someone some trouble.

Sounds good. It is a general problem, so I created a separate issue: #1905

@fanminshi
Contributor

Fixed via #1906.
