Close file descriptors for redo process #1834

bojanserafimov · 2022-05-31T04:49:56Z

Fixes #1814

Tested manually using `lsof .zenith/pageserver.pid`.

hlinnaka

Let's add a comment somewhere to explain why it's important to close the file descriptors:

* It's a security issue if we don't. The WAL redo process is a sandbox that's not supposed to be able to access anything in the parent process.
* The concrete problem with the lock file.
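
For illustration, here is a minimal sketch of how a `pre_exec` hook could keep inherited descriptors out of a child process. It is not the code from this PR (the discussion below mentions the `close_fds` crate for that): `set_cloexec_from`, `sandboxed_command`, and the fixed `MAX_FD` bound are hypothetical, and the numeric `fcntl` constants are the Linux values. Marking descriptors close-on-exec, rather than closing them outright, is a choice made here so that any descriptor the standard library itself relies on until `exec` stays usable.

```rust
use std::os::raw::c_int;
use std::os::unix::process::CommandExt;
use std::process::Command;

// fcntl(2) from the C library that std already links against.
// (`unsafe extern` requires Rust 1.82 or newer.)
unsafe extern "C" {
    fn fcntl(fd: c_int, cmd: c_int, ...) -> c_int;
}

// Linux values of the fcntl constants used below.
const F_GETFD: c_int = 1;
const F_SETFD: c_int = 2;
const FD_CLOEXEC: c_int = 1;

// Assumption: a fixed upper bound instead of querying the real fd limit.
const MAX_FD: c_int = 1024;

/// Marks every open descriptor in `min_fd..MAX_FD` as close-on-exec.
/// Only calls fcntl, so it does not allocate or take locks.
fn set_cloexec_from(min_fd: c_int) {
    for fd in min_fd..MAX_FD {
        unsafe {
            let flags = fcntl(fd, F_GETFD);
            // A negative result means the fd is not open; skip it.
            if flags >= 0 {
                fcntl(fd, F_SETFD, flags | FD_CLOEXEC);
            }
        }
    }
}

/// Builds a command whose child keeps only stdin, stdout and stderr.
fn sandboxed_command(program: &str) -> Command {
    let mut cmd = Command::new(program);
    unsafe {
        // Runs in the forked child, right before exec.
        cmd.pre_exec(|| {
            set_cloexec_from(3);
            Ok(())
        });
    }
    cmd
}
```

With this sketch, something like `sandboxed_command("postgres")` would be configured and spawned like any other `Command`; the hook only affects which descriptors survive the `exec`.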

pageserver/src/walredo.rs

SomeoneToIgnore

Thank you!

funbringer · 2022-05-31T17:10:32Z

I believe this solution is suboptimal. If the goal is just to fix #1814, it would suffice to ensure that somebody sets the `FD_CLOEXEC` flag on the pidfile's fd. Since we use https://github.com/knsd/daemonize, this is exactly the entity which should do that for us. Unfortunately, that's not the case for version 0.4.1. Compare this to the main branch: the code is there, but the fix hasn't made it to crates.io yet (~~I wonder why,~~ the project seems to be dead: last commit on 31 Jul 2021, and better yet, last release on 27 Mar 2019).

One could argue that there may be more fds like that. Fortunately, std already uses cloexec where possible (see the docs for exec). This leads me to think that we should do three things instead:

* Somehow use the latest version of daemonize, or better yet, get rid of it. That's not how one should daemonize services in this day and age anyway (aka `man 7 daemon`).
* Audit the whole project for other places where we don't set `cloexec` either due to the broken deps or sloppiness with libs like nix.
* Insert an `assert!` to the `pre_exec` to blow up the fork if we detect any stray fds.
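
The audit and the stray-fd check proposed above can be sketched on Linux roughly as follows. This is only an illustration under stated assumptions, not code from the project: `is_cloexec` and `stray_fds` are hypothetical helper names, the sketch needs a mounted `/proc`, and it assumes the common Linux value of `O_CLOEXEC` (`0o2000000`, as on x86_64 and aarch64). Because it allocates and reads files, it is better suited to a test or a check before forking than to the body of `pre_exec` itself.

```rust
use std::fs;
use std::os::unix::io::RawFd;

// Assumption: value of O_CLOEXEC on common Linux targets (x86_64, aarch64).
const O_CLOEXEC: u32 = 0o2000000;

/// Returns Some(true) if `fd` is close-on-exec, Some(false) if it is not,
/// and None if the fd is not open or its fdinfo entry cannot be parsed.
fn is_cloexec(fd: RawFd) -> Option<bool> {
    let info = fs::read_to_string(format!("/proc/self/fdinfo/{}", fd)).ok()?;
    // The entry contains a line such as "flags:\t02100002" (octal).
    let flags = info
        .lines()
        .find_map(|line| line.strip_prefix("flags:"))?
        .trim();
    let flags = u32::from_str_radix(flags, 8).ok()?;
    Some(flags & O_CLOEXEC != 0)
}

/// Lists open descriptors >= `min_fd` that an exec'ed child would inherit.
fn stray_fds(min_fd: RawFd) -> Vec<RawFd> {
    let mut stray = Vec::new();
    if let Ok(entries) = fs::read_dir("/proc/self/fd") {
        for entry in entries.flatten() {
            let name = entry.file_name();
            let fd = name.to_str().and_then(|s| s.parse::<RawFd>().ok());
            if let Some(fd) = fd {
                if fd >= min_fd && is_cloexec(fd) == Some(false) {
                    stray.push(fd);
                }
            }
        }
    }
    stray.sort_unstable();
    stray
}
```

A test could then open a file through `std::fs::File`, confirm that `is_cloexec` reports `Some(true)` for it, and assert that `stray_fds(3)` is empty right before spawning a child.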

bojanserafimov · 2022-05-31T17:45:20Z

> I believe this solution is suboptimal. If the goal is just to fix #1814, it would suffice to ensure that somebody sets the `FD_CLOEXEC` flag on the pidfile's fd. Since we use https://github.com/knsd/daemonize, this is exactly the entity which should do that for us. Unfortunately, that's not the case for version 0.4.1. Compare this to the main branch: the code is there, but the fix hasn't made it to crates.io yet (~~I wonder why,~~ the project seems to be dead: last commit on 31 Jul 2021).
>
> One could argue that there may be more fds like that. Fortunately, std already uses cloexec where possible (see the docs for exec). This leads me to think that we should do three things instead:
>
> * Somehow use the latest version of daemonize, or better yet, get rid of it. [That's not how one should daemonize services in this day and age anyway](https://0pointer.de/public/systemd-man/daemon.html#New-Style%20Daemons).
> * Audit the whole project for other places where we don't set `cloexec` either due to the broken deps or sloppiness with libs like nix.
> * Insert an `assert!` to the `pre_exec` to blow up the fork if we detect any stray fds.

Yes #1814 (comment)

Short term: Would you rather deal with unreleased commits from `daemonize`, or this new `close_fds` crate? We might need this crate anyway for the `pre_exec` assertion that you mention, unless there's a simpler way to correctly iterate over open fds in a multi-threaded program. Alternatively, we can do this check after exec, in the postgres code.

Long term: Moving away from daemonize also fixes #1840, among other things. It's a bigger project though, we shouldn't block a/reliability fixes on this transition.

bojanserafimov · 2022-05-31T18:13:52Z

Merging this since it's a short-term improvement, but we can continue the daemonize discussion in #1841.

This PR adds a test for #1834 and fixes the error in https://app.circleci.com/pipelines/github/neondatabase/neon/7753/workflows/94d1b796-10a3-4989-b23c-4c1eb4a49cf5/jobs/79586, which happens because `pageserver.pid` is held by the `initdb` command on restart. Because the test requires `lsof` to be installed in the docker image, this PR also updates the caches and the docker image specified in the CircleCI config file.

bojanserafimov added 2 commits May 31, 2022 00:07

Close file descriptors for redo process

6681669

Simplify

1078248

bojanserafimov changed the title ~~Close fds~~ Close file descriptors for redo process May 31, 2022

hlinnaka approved these changes May 31, 2022

SomeoneToIgnore reviewed May 31, 2022

phoenix24 approved these changes May 31, 2022

Add comments

c3de411

SomeoneToIgnore approved these changes May 31, 2022

hlinnaka approved these changes May 31, 2022

bojanserafimov mentioned this pull request May 31, 2022

Remove daemonize dependency #1841

Closed

bojanserafimov merged commit ca10cc1 into main May 31, 2022

bojanserafimov deleted the close-fds branch May 31, 2022 18:14

aome510 mentioned this pull request Jul 8, 2022

Add close_fds for initdb command and add close fd test #2060

Merged

aome510 mentioned this pull request Aug 11, 2022

flaky tests because of "unable to lock pid file. could not daemonize. bailing." #2247

Closed
