Backup timing out with "context deadline exceeded" #1903

zlangbert · 2018-02-04T18:02:48Z

I have a cluster that is failing to back up to S3 with the following logs:

time="2018-02-04T17:49:48Z" level=info msg="getMaxRev: endpoint etcd-0:2379 revision (246968983)"
time="2018-02-04T17:49:48Z" level=info msg="getMaxRev: endpoint etcd-1:2379 revision (246968983)"
time="2018-02-04T17:49:48Z" level=info msg="getMaxRev: endpoint etcd-2:2379 revision (246968984)" 
time="2018-02-04T17:50:49Z" level=error msg="error syncing etcd backup (kube-system/cluster-backup-1517764931): failed to save snapshot (failed to write snapshot (MultipartUpload: upload multipart failed 
  upload id: xxxxxxxx
caused by: ReadRequestBody: read multipart upload data failed
caused by: rpc error: code = 4 desc = context deadline exceeded))" pkg=controller

I think the operation is being canceled because the snapshot timeout (60s) is too short. This cluster produces etcd snapshots of about 260 MB. I have another cluster with snapshots of around 50 MB that save successfully.

Could the snapshot timeout be increased, or made configurable? Thanks
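
For illustration, here is a minimal Go sketch of how a fixed deadline around a slow operation surfaces as "context deadline exceeded", and how passing the timeout in as a parameter avoids it. The names `saveSnapshot`, `backupWithTimeout`, and `runDemo` are hypothetical and are not taken from etcd-operator; the durations are scaled down from the 60s timeout so the example finishes quickly.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// saveSnapshot is a hypothetical stand-in for the snapshot-and-upload step.
// It pretends the work takes `work` and gives up as soon as ctx expires.
func saveSnapshot(ctx context.Context, work time.Duration) error {
	select {
	case <-time.After(work):
		return nil // the simulated upload finished in time
	case <-ctx.Done():
		return ctx.Err() // context.DeadlineExceeded once the timeout passes
	}
}

// backupWithTimeout runs saveSnapshot under a caller-supplied timeout
// instead of a hard-coded one (assumed design, not the operator's code).
func backupWithTimeout(timeout, work time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	return saveSnapshot(ctx, work)
}

// runDemo runs the same simulated work under a short and a long timeout.
func runDemo() (shortErr, longErr error) {
	work := 200 * time.Millisecond
	shortErr = backupWithTimeout(50*time.Millisecond, work)
	longErr = backupWithTimeout(2*time.Second, work)
	return shortErr, longErr
}

func main() {
	shortErr, longErr := runDemo()
	fmt.Println("short timeout:", shortErr,
		"| deadline exceeded:", errors.Is(shortErr, context.DeadlineExceeded))
	fmt.Println("long timeout: ", longErr)
}
```

With the short timeout the call returns `context.DeadlineExceeded`, whose message is the same "context deadline exceeded" text seen in the log above; with the longer timeout the same work completes and the error is nil.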

hongchaodeng · 2018-02-04T18:09:25Z

@zlangbert
Yes. We can make the snapshot timeout configurable.
Do you have stats for upload throughput?
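
As a rough guide to why throughput matters here, the sketch below computes the minimum sustained rate needed to move a snapshot of a given size within the 60s timeout. The sizes are the approximate figures reported above, the function name `minThroughputMBps` is hypothetical, and the calculation assumes the whole timeout is available for the transfer, which is unlikely to hold exactly in practice.

```go
package main

import "fmt"

// minThroughputMBps returns the sustained rate, in MB/s, needed to move
// sizeMB megabytes within timeoutSec seconds. It ignores any time spent
// outside the transfer itself, so it is a lower bound.
func minThroughputMBps(sizeMB, timeoutSec float64) float64 {
	return sizeMB / timeoutSec
}

func main() {
	const timeoutSec = 60 // snapshot timeout mentioned in the report
	for _, sizeMB := range []float64{50, 260} {
		fmt.Printf("%.0f MB in %d s needs at least %.2f MB/s\n",
			sizeMB, timeoutSec, minThroughputMBps(sizeMB, timeoutSec))
	}
}
```

For the 260 MB snapshot this works out to a little over 4.3 MB/s end to end, including reading the snapshot from etcd and writing it to S3.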

zlangbert · 2018-02-04T18:30:40Z

It looks like it may have been CPU bound. My default CPU request/limit was 50m/50m. I increased the operator's to 250m/250m and the backup finished successfully. I think my metrics aren't granular enough to show the CPU starvation.

I did time the snapshot save:

etcd-0 ~ $ ETCDCTL_API=3 time etcdctl --endpoints="https://127.0.0.1:2379" --cacert="/etc/ssl/etcd/ca.crt" snapshot save test.db
Snapshot saved at test.db

real	0m5.926s
user	0m1.168s
sys	0m0.499s

In any case, I think this issue is still valid: the timeout should probably be configurable. Also if the docs mentioned minimum recommended resource requirements that might save someone some trouble. Thanks!

hongchaodeng · 2018-02-04T18:43:34Z

> Also if the docs mentioned minimum recommended resource requirements that might save someone some trouble.

Sounds good. It is a general problem, so I created another issue: #1905

fanminshi · 2018-02-07T17:21:44Z

Fixed via #1906.

hongchaodeng added the priority/P1 label Feb 4, 2018

This was referenced Feb 5, 2018

[Proposal] support periodic backups #1841

Closed

backup: make backup timeout configurable #1906

Closed

fanminshi closed this as completed Feb 7, 2018
