storage: investigate fsync latency spikes #106231
Labels: A-storage (storage engine / Pebble on-disk storage), C-investigation, T-storage
We've seen many instances of fsync latency spikes in cloud clusters (including in cockroachlabs/support#2395). These spikes can last 10+ seconds, yet remain short of the 20 seconds necessary to trigger disk stall detection and terminate the node.
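To make the detection gap concrete, here's a minimal Go sketch (not CockroachDB's actual detector; the warn threshold and file handling are illustrative assumptions, the 20s figure is from above) that times each fsync and distinguishes the disruptive-but-sublethal band from the termination threshold:

```go
package main

import (
	"fmt"
	"os"
	"time"
)

const (
	warnThreshold = 1 * time.Second  // assumed: long enough to be worth flagging
	killThreshold = 20 * time.Second // per this issue: disk stall detection threshold
)

// timedSync wraps f.Sync() (fsync(2) on Linux) with latency measurement.
func timedSync(f *os.File) error {
	start := time.Now()
	err := f.Sync()
	elapsed := time.Since(start)
	switch {
	case elapsed >= killThreshold:
		fmt.Printf("fatal: fsync stalled for %s; stall detection would terminate the node\n", elapsed)
	case elapsed >= warnThreshold:
		// The spikes described in this issue land in this band: long enough
		// to be disruptive, too short to trip the stall detector.
		fmt.Printf("warn: fsync took %s\n", elapsed)
	}
	return err
}

func main() {
	f, err := os.CreateTemp("", "fsync-demo")
	if err != nil {
		panic(err)
	}
	defer os.Remove(f.Name())
	defer f.Close()

	if _, err := f.Write([]byte("hello")); err != nil {
		panic(err)
	}
	if err := timedSync(f); err != nil {
		panic(err)
	}
}
```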
These fsync latency stalls can be extremely disruptive to the cluster. In cockroachlabs/support#2395, overall throughput tanked as, one by one, every worker in the bounded worker pool became stuck on some operation waiting for the slow disk. There are already issues (eg, #88699) tracking the work to reduce the impact of one node's slow disk on overall cluster throughput. But I think there's something additional to investigate with respect to cloud platforms and why these stalls occur in the first place.
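For intuition on the failure mode, a hypothetical Go sketch (not CockroachDB code; pool size, task names, and timings are all illustrative): with a bounded worker pool, a single slow disk eventually captures every worker, and even work that never touches that disk starves behind them in the queue.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	const workers = 4
	tasks := make(chan string, 16)

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for t := range tasks {
				if t == "slow-disk-write" {
					time.Sleep(3 * time.Second) // stand-in for a multi-second fsync stall
				}
				fmt.Printf("worker %d finished %s\n", id, t)
			}
		}(i)
	}

	// Interleave slow-disk writes with unrelated fast work. Once each of
	// the four workers has picked up a slow write, the fast tasks sit
	// unserved in the channel: overall throughput tanks.
	for i := 0; i < 8; i++ {
		tasks <- "slow-disk-write"
		tasks <- "fast-task"
	}
	close(tasks)
	wg.Wait()
}
```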
Could the latency be introduced within our own code, somewhere between the start of the latency measurement (in `LogWriter.flushLoop`) and the fsync itself? This seems unlikely. Although we have non-trivial logic within the VFS stack, the fsync codepaths are very minimal and contain no locking.

We should try to reproduce across cloud providers and investigate. For example, write a roachtest that demonstrates the issues mentioned above (see the sketch below for a standalone starting point).
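As a starting point for such a reproduction, here's a hypothetical standalone probe (not a roachtest; the path argument, record size, and run duration are assumptions) that appends and fsyncs in a loop and reports worst-case latencies. Running it against each cloud provider's disks would tell us whether the spikes reproduce outside CockroachDB's VFS stack entirely:

```go
package main

import (
	"fmt"
	"os"
	"time"
)

func main() {
	// Pass the path of a file on the disk under test, e.g. /mnt/data1/fsync-probe.
	f, err := os.OpenFile(os.Args[1], os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
	if err != nil {
		panic(err)
	}
	defer f.Close()

	buf := make([]byte, 4096) // roughly WAL-block-sized appends (an assumption)
	var worst time.Duration
	deadline := time.Now().Add(10 * time.Minute)
	for time.Now().Before(deadline) {
		if _, err := f.Write(buf); err != nil {
			panic(err)
		}
		start := time.Now()
		if err := f.Sync(); err != nil {
			panic(err)
		}
		if d := time.Since(start); d > worst {
			worst = d
			fmt.Printf("new worst fsync latency: %s\n", d)
		}
	}
}
```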
Informs #107623.
Jira issue: CRDB-29450