backup: elevated tail latencies in SQL workload while backing up to s3 #115190

dt · 2023-11-28T18:53:08Z

We have observed that backups to s3 cause increased tail latencies in foreground traffic, sometimes significantly.

We currently see some cases where 600+ of heap is in use by the chunk buffers in the SDK, leading to increased rates of GC. This happens even absent memory pressure, simply because of the buffers' size relative to the live heap if a reasonable GOGC is not set (e.g. due to #115164).
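To illustrate why buffer churn alone can drive GC frequency, here is a rough sketch of the pacing arithmetic. It assumes the simplified rule that the next cycle starts once the heap reaches the live heap plus GOGC percent of it, ignoring GC roots and any memory limit. The helper names and the live-heap sizes are hypothetical and not taken from this issue.

```go
package main

import "fmt"

// heapGoal approximates the heap size at which the next GC cycle starts:
// the live heap plus gogc percent of it. This is a simplification that
// ignores stacks, globals, and GOMEMLIMIT.
func heapGoal(liveBytes uint64, gogc uint64) uint64 {
	return liveBytes + liveBytes*gogc/100
}

// chunksPerCycle estimates how many chunk-sized buffers can be allocated
// before the heap goal is reached and another GC cycle is triggered.
func chunksPerCycle(liveBytes, gogc, chunkBytes uint64) uint64 {
	return (heapGoal(liveBytes, gogc) - liveBytes) / chunkBytes
}

func main() {
	const mib = 1 << 20
	const chunk = 8 * mib // chunk size mentioned in this issue

	// Hypothetical live-heap sizes, for illustration only.
	for _, live := range []uint64{256 * mib, 1024 * mib, 4096 * mib} {
		fmt.Printf("live=%d MiB GOGC=100: goal=%d MiB, ~%d chunk allocations per GC cycle\n",
			live/mib, heapGoal(live, 100)/mib, chunksPerCycle(live, 100, chunk))
	}
}
```

Under these assumptions, a smaller live heap leaves room for fewer chunk allocations between cycles, so the same upload traffic triggers GC more often.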

These more frequent GC runs appear to also see higher per-run pause times, sometimes much higher.

The S3 SDK hashes chunk-sized (currently 8 MB) blocks with both MD5 and SHA256, for content checksum and signing respectively. Largely due to golang/go#64417, this appears to cause long GC pause times: traces show STW pauses overlapping with block hashing.

This is a tracking issue for all related issues.

Jira issue: CRDB-33924

blathers-crl · 2023-11-28T18:53:11Z

cc @cockroachdb/disaster-recovery

dt added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-disaster-recovery T-disaster-recovery labels Nov 28, 2023

dt self-assigned this Nov 28, 2023

This was referenced Nov 28, 2023

cloud/s3: skip md5 hashing #115189

Closed

*: avoid long non-preemptable function calls #115192

Closed

cloud/s3: audit memory usage during uploads #115196

Open

dt added the O-support Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs label Nov 28, 2023

nvanbenschoten mentioned this issue Nov 28, 2023

cloud,*: bulk upload work disruptive to foreground latencies #108790

Closed

yuzefovich mentioned this issue Nov 28, 2023

release-23.1: colrpc, flowinfra, kvcoord, server: wrap sends and recvs in tracing #115081

Merged

dt removed their assignment Dec 4, 2023

benbardin added the P-2 Issues/test failures with a fix SLA of 3 months label Dec 13, 2023

irfansharif mentioned this issue Jan 11, 2024

roachtest: add + fix elastic-backup test equivalent for AWS #107770

Closed

exalate-issue-sync bot added P-3 Issues/test failures with no fix SLA and removed P-2 Issues/test failures with a fix SLA of 3 months labels Feb 27, 2024

github-project-automation bot added this to Disaster Recovery Backlog Aug 28, 2024

github-project-automation bot moved this to Backlog in Disaster Recovery Backlog Aug 28, 2024
