backup: elevated tail latencies in SQL workload while backing up to s3 #115190

Open
5 of 6 tasks
dt opened this issue Nov 28, 2023 · 1 comment
Labels
A-disaster-recovery
C-bug (Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.)
O-support (Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs)
P-3 (Issues/test failures with no fix SLA)
T-disaster-recovery

Comments


dt commented Nov 28, 2023

We have observed that backups to s3 cause increased tail latencies in foreground traffic, sometimes significantly.

We currently see cases where 600+ MB of heap is in use by the chunk buffers in the SDK, leading to increased rates of GC, not because of memory pressure but simply because of the buffers' size relative to the live heap when a reasonable GOGC is not set (e.g. due to #115164).
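
For context, a rough sketch of the GC pacing arithmetic involved (illustrative only, not CockroachDB code; the live-heap figure below is hypothetical): the runtime triggers the next GC roughly when the heap grows to live * (1 + GOGC/100), so several hundred MB of transient SDK buffers can consume most of that allocation budget on their own.

```go
package main

import "fmt"

// nextGCTargetMB applies the rough pacer rule: the next GC is triggered when
// the heap reaches approximately live * (1 + GOGC/100).
func nextGCTargetMB(liveMB, gogc float64) float64 {
	return liveMB * (1 + gogc/100)
}

func main() {
	live := 800.0 // hypothetical live heap in MB; the real value varies by node
	fmt.Printf("GOGC=100: next GC at ~%.0f MB of heap\n", nextGCTargetMB(live, 100))
	fmt.Printf("GOGC=300: next GC at ~%.0f MB of heap\n", nextGCTargetMB(live, 300))
}
```

At GOGC=100 with an 800 MB live heap, 600+ MB of short-lived buffer allocations alone account for most of each cycle's allocation budget, so cycles fire far more often than the workload itself would cause.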

These more frequent GC runs also appear to have higher per-run pause times, sometimes much higher.

The S3 SDK hashes chunk-sized (currently 8 MB) blocks with both MD5 and SHA256, for the content checksum and request signing respectively. It appears that, due in large part to golang/go#64417, this leads to long GC pause times: traces show STW pauses overlapping with block hashing.
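
For illustration, here is a minimal Go sketch (not the AWS SDK's actual code) of one possible mitigation shape: hashing the entire 8 MB block in a single Write call keeps the goroutine in non-preemptible hashing code for the whole block, whereas feeding the hash in smaller sub-slices gives the scheduler a preemption point between calls. The 64 KiB step size is an arbitrary example.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// hashInSteps feeds buf to SHA-256 in fixed-size sub-slices instead of one
// large Write, so the goroutine reaches a preemption point between calls.
func hashInSteps(buf []byte, step int) [sha256.Size]byte {
	h := sha256.New()
	for off := 0; off < len(buf); off += step {
		end := off + step
		if end > len(buf) {
			end = len(buf)
		}
		h.Write(buf[off:end]) // hash.Hash.Write never returns an error
	}
	var sum [sha256.Size]byte
	copy(sum[:], h.Sum(nil))
	return sum
}

func main() {
	chunk := make([]byte, 8<<20) // 8 MB block, matching the SDK's chunk size
	fmt.Printf("%x\n", hashInSteps(chunk, 64<<10))
}
```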

This is a tracking issue for all related issues.

Jira issue: CRDB-33924

@dt added the C-bug, A-disaster-recovery, and T-disaster-recovery labels on Nov 28, 2023
@dt self-assigned this on Nov 28, 2023

blathers-crl bot commented Nov 28, 2023

cc @cockroachdb/disaster-recovery

@dt added the O-support label on Nov 28, 2023
@dt removed their assignment on Dec 4, 2023
@benbardin added the P-2 (fix SLA of 3 months) label on Dec 13, 2023
@exalate-issue-sync bot added the P-3 (no fix SLA) label and removed the P-2 label on Feb 27, 2024