backup: elevated tail latencies in SQL workload while backing up to s3 #115190
Labels
A-disaster-recovery
C-bug
Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
O-support
Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs
P-3
Issues/test failures with no fix SLA
T-disaster-recovery
We have observed that backups to S3 cause increased tail latencies in foreground traffic, sometimes significantly.
We currently see some cases where 600+ of heap is in use by the chunk buffers in the SDK, leading to increased rates of GC. This happens even absent memory pressure, simply because of the buffers' size relative to the live heap when a reasonable GOGC is not set (e.g. due to #115164).
These more frequent GC runs appear to also see higher per-run pause times, sometimes much higher.
The S3 SDK hashes chunk-sized (currently 8 MB) blocks with both MD5 and SHA256, for content checksum and signing respectively. Due in large part to golang/go#64417, this appears to cause the long GC pause times we observe; traces show STW pauses overlapping with block hashing.
This is a tracking issue for all related issues.
Jira issue: CRDB-33924