Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Close file descriptors for redo process #1834

Merged
merged 3 commits into from
May 31, 2022
Merged

Close file descriptors for redo process #1834

merged 3 commits into from
May 31, 2022

Conversation

bojanserafimov
Copy link
Contributor

@bojanserafimov bojanserafimov commented May 31, 2022

Fixes #1814

Tested manually using lsof .zenith/pageserver.pid

@bojanserafimov bojanserafimov changed the title Close fds Close file descriptors for redo process May 31, 2022
Copy link
Contributor

@hlinnaka hlinnaka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add a comment somewhere to explain why it's important to close the file descriptors:

  1. It's a security issue if we don't. The WAL process is a sandbox that's not supposed to be able to access anything in the parent process.
  2. The concrete problem with the lock file.

Copy link
Contributor

@SomeoneToIgnore SomeoneToIgnore left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

@funbringer
Copy link
Contributor

funbringer commented May 31, 2022

I believe this solution is suboptimal. If the goal is to fix just #1814, it would suffice to ensure that somebody sets the FD_CLOEXEC flag for the pidfile's fd. Since we use https://github.com/knsd/daemonize, this is exactly the entity which should do that for us. Unfortunately, that's not the case for the version 0.4.1. Compare this to the main branch: the code is there, but the fix hasn't made it to the crates.io yet (I wonder why, the project seems to be dead: last commit on 31 Jul 2021, and better yet, last release on 27 Mar 2019).

One could argue that there may be more fds like that. Fortunately, std already uses cloexec where possible (see the docs for exec). This leads me to think that we should do three things instead:

  • Somehow use the latest version of daemonize, or better yet, get rid of it. That's not how one should daemonize services in this day and age anyway (aka man 7 daemon).
  • Audit the whole project for other places where we don't set cloexec either due to the broken deps or sloppiness with libs like nix.
  • Insert an assert! to the pre_exec to blow up the fork if we detect any stray fds.

@bojanserafimov
Copy link
Contributor Author

I believe this solution is suboptimal. If the goal is to fix just #1814, it would suffice to ensure that somebody sets the FD_CLOEXEC flag for the pidfile's fd. Since we use https://github.com/knsd/daemonize, this is exactly the entity which should do that for us. Unfortunately, that's not the case for the version 0.4.1. Compare this to the main branch: the code is there, but the fix hasn't made it to the crates.io yet (I wonder why, the project seems to be dead: last commit on 31 Jul 2021).

One could argue that there may be more fds like that. Fortunately, std already uses cloexec where possible (see the docs for exec). This leads me to think that we should do two three things instead:

* Somehow use the latest version of daemonize, or better yet, get rid of it. [That's not how one should daemonize services in this day and age anyway](https://0pointer.de/public/systemd-man/daemon.html#New-Style%20Daemons).

* Audit the whole project for other places where we don't set `cloexec` either due to the broken deps or sloppiness with libs like nix.

* Insert an `assert!` to the `pre_exec` to blow up the fork if we detect any stray fds.

Yes #1814 (comment)

Short term: Would you rather deal with unreleased commits from daemonize, or this new close_fds crate? We might need this crate anyway for the pre_exec assertion that you mention, unless there's a simpler way to correctly iterate open fds in a multi-thread program. Or alternatively we can do this check after exec, in the postgres code.

Long term: Moving away from daemonize also fixes #1840, among other things. It's a bigger project though, we shouldn't block a/reliability fixes on this transition.

@bojanserafimov
Copy link
Contributor Author

merging this since it's a short-term improvement, but we can continue the daemonize discussion here #1841

@bojanserafimov bojanserafimov merged commit ca10cc1 into main May 31, 2022
@bojanserafimov bojanserafimov deleted the close-fds branch May 31, 2022 18:14
aome510 added a commit that referenced this pull request Jul 12, 2022
This PR adds a test for #1834 and fixes the error in https://app.circleci.com/pipelines/github/neondatabase/neon/7753/workflows/94d1b796-10a3-4989-b23c-4c1eb4a49cf5/jobs/79586, which happens because `pageserver.pid` is held by `initdb` command on restart.

Because the test requires `lsof` to be installed in the docker image, this PR also updates the caches and docker image specified in CircleCI config file.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Lock file is held open after pageserver process has exited
5 participants