fix(pageserver): abort process if fsync fails #9108

Merged: skyzh merged 3 commits into main from skyzh/final-fatal-err-write-path on Sep 27, 2024

Conversation

@skyzh skyzh commented Sep 23, 2024

Problem

close #8140

The original issue is rather vague about what we should do. After discussing it with @problame, we decided to narrow down the problems we want to solve in that issue.

  • read path -- do not panic for now.
  • write path -- panic only on write errors (e.g., device errors, fsync errors), but not on out-of-space errors for now.

The guideline is that if pageserver behavior could violate persistence guarantees (i.e., reporting an operation as successful without actually persisting the data), we should panic. Fsync is the one place where both of us agree that we should panic: if fsync fails, the kernel marks the dirty pages as clean, so the next fsync will not necessarily return an error. The storage client would then assume the operation succeeded.
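As a rough illustration of the write-path rule above, the sketch below classifies an I/O error as either "propagate to the caller" (out of space) or "abort the process" (everything else). This is not the code from this PR: `WriteErrorPolicy`, `classify_write_error`, and the hard-coded `ENOSPC` value (28 on Linux) are assumptions made for the example.

```rust
use std::io;

/// ENOSPC as defined on Linux. Hard-coded here to avoid a `libc` dependency;
/// this value is an assumption of the sketch, not something taken from the PR.
const ENOSPC: i32 = 28;

/// Hypothetical policy type: what the write path does with an I/O error.
#[derive(Debug, PartialEq, Eq)]
enum WriteErrorPolicy {
    /// Return the error to the caller (out of space is not fatal for now).
    Propagate,
    /// Treat the error as fatal and abort the process.
    Abort,
}

/// Hypothetical classifier: out-of-space errors are propagated,
/// every other write error is considered fatal.
fn classify_write_error(err: &io::Error) -> WriteErrorPolicy {
    match err.raw_os_error() {
        Some(ENOSPC) => WriteErrorPolicy::Propagate,
        _ => WriteErrorPolicy::Abort,
    }
}
```

With this sketch, an error built with `io::Error::from_raw_os_error(28)` is propagated, while a device error such as EIO (5) maps to `WriteErrorPolicy::Abort`.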

Summary of changes

Make fsync panic on fatal errors.
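A minimal sketch of what "abort on fsync failure" could look like, using only the standard library. The helper names `sync_all_or_abort`, `on_fatal_io_error`, and `write_durably` are hypothetical and are not taken from the pageserver code; the actual change may be structured differently.

```rust
use std::fs::File;
use std::io::{self, Write};

/// Hypothetical handler for fatal I/O errors: log and abort the process.
/// Aborting (rather than returning the error) ensures no caller can
/// retry fsync and mistake a later success for durability.
fn on_fatal_io_error(err: &io::Error, context: &str) -> ! {
    eprintln!("fatal I/O error during {context}: {err}");
    std::process::abort();
}

/// Hypothetical wrapper: fsync the file, aborting the process on failure.
fn sync_all_or_abort(file: &File) {
    if let Err(e) = file.sync_all() {
        on_fatal_io_error(&e, "fsync");
    }
}

/// Write `data` and make it durable. Write errors are still returned to
/// the caller; only an fsync failure is treated as fatal in this sketch.
fn write_durably(file: &mut File, data: &[u8]) -> io::Result<()> {
    file.write_all(data)?;
    sync_all_or_abort(file);
    Ok(())
}
```

Because the wrapper never returns on failure, a caller that sees `Ok(())` from `write_durably` can assume the fsync itself reported success.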

Checklist before requesting a review

  • I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.
  • Do we need to implement analytics? If so, did you add the relevant metrics to the dashboard?
  • If this PR requires public announcement, mark it with the /release-notes label and add several sentences in this section.

Checklist before merging

  • Do not forget to reformat commit message to not include the above checklist

@skyzh skyzh requested review from jcsp and problame September 23, 2024 16:20
@skyzh skyzh requested a review from a team as a code owner September 23, 2024 16:20
github-actions bot commented Sep 23, 2024

5065 tests run: 4907 passed, 0 failed, 158 skipped (full report)


Flaky tests (8)

Postgres 17

Postgres 15

Postgres 14

Code coverage* (full report)

  • functions: 32.0% (7490 of 23395 functions)
  • lines: 50.0% (60481 of 120888 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
68a7b13 at 2024-09-27T18:35:22.810Z :recycle:

@problame problame left a comment

Quick search for regex \.sync_ in the codebase shows that this PR doesn't cover

Also, there's the question of how to avoid regressing the code base in the future. Can we use clippy to ban tokio::fs::File::sync_all ?
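For illustration only, one way such a ban might be expressed is clippy's `disallowed_methods` lint, configured through `disallowed-methods` in `clippy.toml`, with an explicit `#[allow]` at the one sanctioned call site. The sketch below is an assumption about how this could be wired up, not something this PR does; the wrapper name `checked_sync_all` and the `reason` strings are hypothetical, and it uses `std::fs::File` so that it compiles without tokio.

```rust
// clippy.toml (assumed to sit at the workspace root):
//
//   disallowed-methods = [
//       { path = "tokio::fs::File::sync_all", reason = "use the aborting fsync wrapper" },
//       { path = "std::fs::File::sync_all", reason = "use the aborting fsync wrapper" },
//   ]

use std::fs::File;

/// Hypothetical single sanctioned call site. The `allow` attribute exempts
/// only this function, so any other direct `sync_all` call is flagged by clippy.
#[allow(clippy::disallowed_methods)]
fn checked_sync_all(file: &File) {
    if let Err(e) = file.sync_all() {
        eprintln!("fatal: fsync failed: {e}");
        std::process::abort();
    }
}
```

The lint can only be enforced crate-wide once every remaining direct call has been migrated to the wrapper or explicitly allowed.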

skyzh commented Sep 24, 2024

Also, there's the question of how to avoid regressing the code base in the future. Can we use clippy to ban tokio::fs::File::sync_all ?

We still have uses of fsync in safekeeper, and I don't plan to fix them in this pull request, so we cannot enforce the clippy lint until all occurrences are fixed :(

Signed-off-by: Alex Chi Z <chi@neon.tech>
Signed-off-by: Alex Chi Z <chi@neon.tech>
@skyzh skyzh force-pushed the skyzh/final-fatal-err-write-path branch from af08759 to fb2bb78 Compare September 24, 2024 18:19
@skyzh skyzh requested a review from problame September 24, 2024 18:30
@problame

Also, there's the question of how to avoid regressing the code base in the future. Can we use clippy to ban tokio::fs::File::sync_all ?

We still have uses of fsync in safekeeper, and I don't plan to fix them in this pull request, so we cannot enforce the clippy lint until all occurrences are fixed :(

cc @arssher, please think about what safekeepers should do on sync failure and whether it would be better to just abort the process in those cases. Not in this PR, though.

@problame problame changed the title fix(pageserver): panic if fsync fails fix(pageserver): abort process if fsync fails Sep 26, 2024
skyzh commented Sep 26, 2024

created #9172 for safekeeper

@skyzh skyzh enabled auto-merge (squash) September 26, 2024 18:21
@skyzh skyzh merged commit cde1654 into main Sep 27, 2024
84 checks passed
@skyzh skyzh deleted the skyzh/final-fatal-err-write-path branch September 27, 2024 18:58
bayandin pushed a commit that referenced this pull request Sep 29, 2024
Development

Successfully merging this pull request may close these issues.

pageserver: abort process on fsync errors
2 participants