-
Notifications
You must be signed in to change notification settings - Fork 463
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(pageserver): abort process if fsync fails #9108
Conversation
5065 tests run: 4907 passed, 0 failed, 158 skipped (full report)Flaky tests (8)Postgres 17
Postgres 15
Postgres 14
Code coverage* (full report)
* collected from Rust tests only The comment gets automatically updated with the latest test results
68a7b13 at 2024-09-27T18:35:22.810Z :recycle: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Quick search for regex \.sync_
in the codebase shows that this PR doesn't cover
VirtualFile::crashsafe_overwrite
- some usage of tokio::fs
neon/pageserver/src/tenant/remote_timeline_client/download.rs
Lines 185 to 186 in a132323
destination_file .sync_all() neon/pageserver/src/tenant/mgr.rs
Line 222 in 1708743
fs::File::open(parent).await?.sync_all().await?; tempfile.as_file().sync_all()?;
Also, there's the question how to avoid regressing the code base in the future. Can we use clippy to ban tokio::fs::File::sync_all
?
We still have usage of fsync in safekeeper, and I don't plan to fix them in this pull request for now, so we cannot enforce the clippy lint until we fix all occurrences :( |
Signed-off-by: Alex Chi Z <chi@neon.tech>
Signed-off-by: Alex Chi Z <chi@neon.tech>
af08759
to
fb2bb78
Compare
cc @arssher , please think about what safekeepers should do on sync failure / whether it would be better to just abort the process in those cases. Not in this PR though. |
created #9172 for safekeeper |
close #8140 The original issue is rather vague on what we should do. After discussion w/ @problame we decided to narrow down the problems we want to solve in that issue. * read path -- do not panic for now. * write path -- panic only on write errors (i.e., device error, fsync error), but not on no-space for now. The guideline is that if the pageserver behavior could lead to violation of persistent constraints (i.e., return an operation as successful but not actually persisting things), we should panic. Fsync is the place where both of us agree that we should panic, because if fsync fails, the kernel will mark dirty pages as clean, and the next fsync will not necessarily return false. This would make the storage client assume the operation is successful. ## Summary of changes Make fsync panic on fatal errors. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>
Problem
close #8140
The original issue is rather vague on what we should do. After discussion w/ @problame we decided to narrow down the problems we want to solve in that issue.
The guideline is that if the pageserver behavior could lead to violation of persistent constraints (i.e., return an operation as successful but not actually persisting things), we should panic. Fsync is the place where both of us agree that we should panic, because if fsync fails, the kernel will mark dirty pages as clean, and the next fsync will not necessarily return false. This would make the storage client assume the operation is successful.
Summary of changes
Make fsync panic on fatal errors.
Checklist before requesting a review
Checklist before merging