storage: investigate fsync latency spikes #106231
Labels: A-storage (storage engine / Pebble on-disk storage), C-investigation, T-storage
We've seen many instances of fsync latency spikes in cloud clusters (including in cockroachlabs/support#2395). These spikes can last 10+ seconds, yet remain short of the 20 seconds necessary to trigger disk stall detection and terminate the node.
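To make the detection gap concrete, here's a minimal Go sketch (not CockroachDB's actual detector; the warn threshold and file handling are illustrative assumptions, the 20s figure is from above) that times each fsync and distinguishes the disruptive-but-sublethal band from the termination threshold:

```go
package main

import (
	"fmt"
	"os"
	"time"
)

const (
	warnThreshold = 1 * time.Second  // assumed: long enough to be worth flagging
	killThreshold = 20 * time.Second // per this issue: disk stall detection threshold
)

// timedSync wraps f.Sync() (fsync(2) on Linux) with latency measurement.
func timedSync(f *os.File) error {
	start := time.Now()
	err := f.Sync()
	elapsed := time.Since(start)
	switch {
	case elapsed >= killThreshold:
		fmt.Printf("fatal: fsync stalled for %s; stall detection would terminate the node\n", elapsed)
	case elapsed >= warnThreshold:
		// The spikes described in this issue land in this band: long enough
		// to be disruptive, too short to trip the stall detector.
		fmt.Printf("warn: fsync took %s\n", elapsed)
	}
	return err
}

func main() {
	f, err := os.CreateTemp("", "fsync-demo")
	if err != nil {
		panic(err)
	}
	defer os.Remove(f.Name())
	defer f.Close()

	if _, err := f.Write([]byte("hello")); err != nil {
		panic(err)
	}
	if err := timedSync(f); err != nil {
		panic(err)
	}
}
```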
These fsync latency stalls can be extremely disruptive to the cluster. In cockroachlabs/support#2395, overall throughput tanked as, one by one, every worker in the bounded worker pool became stuck on some operation waiting for the slow disk. There are already issues (eg, #88699) tracking the work to reduce the impact of one node's slow disk on overall cluster throughput. But I think there's something additional to investigate with respect to cloud platforms and why these stalls occur in the first place.
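For intuition on the failure mode, a hypothetical Go sketch (not CockroachDB code; pool size, task names, and timings are all illustrative): with a bounded worker pool, a single slow disk eventually captures every worker, and even work that never touches that disk starves behind them in the queue.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	const workers = 4
	tasks := make(chan string, 16)

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for t := range tasks {
				if t == "slow-disk-write" {
					time.Sleep(3 * time.Second) // stand-in for a multi-second fsync stall
				}
				fmt.Printf("worker %d finished %s\n", id, t)
			}
		}(i)
	}

	// Interleave slow-disk writes with unrelated fast work. Once each of
	// the four workers has picked up a slow write, the fast tasks sit
	// unserved in the channel: overall throughput tanks.
	for i := 0; i < 8; i++ {
		tasks <- "slow-disk-write"
		tasks <- "fast-task"
	}
	close(tasks)
	wg.Wait()
}
```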
Could the latency be introduced within our own code, somewhere between the start of the latency measurement (in `LogWriter.flushLoop`) and the fsync itself? This seems unlikely. Although we have non-trivial logic within the VFS stack, the fsync codepaths are very minimal and contain no locking.

We should try to reproduce across cloud providers and investigate. For example, write a roachtest that demonstrates the issues mentioned above (see the sketch below for a standalone starting point).
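As a starting point for such a reproduction, here's a hypothetical standalone probe (not a roachtest; the path argument, record size, and run duration are assumptions) that appends and fsyncs in a loop and reports worst-case latencies. Running it against each cloud provider's disks would tell us whether the spikes reproduce outside CockroachDB's VFS stack entirely:

```go
package main

import (
	"fmt"
	"os"
	"time"
)

func main() {
	// Pass the path of a file on the disk under test, e.g. /mnt/data1/fsync-probe.
	f, err := os.OpenFile(os.Args[1], os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
	if err != nil {
		panic(err)
	}
	defer f.Close()

	buf := make([]byte, 4096) // roughly WAL-block-sized appends (an assumption)
	var worst time.Duration
	deadline := time.Now().Add(10 * time.Minute)
	for time.Now().Before(deadline) {
		if _, err := f.Write(buf); err != nil {
			panic(err)
		}
		start := time.Now()
		if err := f.Sync(); err != nil {
			panic(err)
		}
		if d := time.Since(start); d > worst {
			worst = d
			fmt.Printf("new worst fsync latency: %s\n", d)
		}
	}
}
```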
Informs #107623.
Jira issue: CRDB-29450