
bin/environmentd: auto-restart on halt #27852

Merged
1 commit merged into MaterializeInc:main on Jun 25, 2024

Conversation

@benesch (Member) commented Jun 24, 2024

Bring the behavior of bin/environmentd more in line with Kubernetes and auto-restart envd after it halts. This facilitates demos and local testing of the new 0dt deployment flow, which involves a halt-and-restart when the new generation takes over.

Kubernetes will also restart envd after a panic, but we choose not to do so here, to avoid papering over panics during local development.
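For illustration, here is a minimal sketch of the restart-on-halt loop this describes. The HALT_EXIT_CODE value and the run_environmentd helper are hypothetical placeholders rather than the actual constants or structure of run.py:

```python
import subprocess
import sys

# Hypothetical exit code by which environmentd signals an intentional halt;
# the real value is whatever bin/environmentd actually checks for.
HALT_EXIT_CODE = 166


def run_environmentd(args: list[str]) -> int:
    """Run environmentd, restarting it whenever it halts.

    Any other exit status (including a panic) is returned to the caller so
    that crashes are not papered over during local development.
    """
    while True:
        completed = subprocess.run(["environmentd", *args])
        if completed.returncode == HALT_EXIT_CODE:
            print("environmentd halted; restarting...", file=sys.stderr)
            continue
        return completed.returncode
```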

Motivation

  • This PR adds a known-desirable feature: works towards 0dt.

Checklist

@benesch requested review from aljoscha and jkosh44 on June 24, 2024 at 16:52
@benesch marked this pull request as ready for review on June 24, 2024 at 16:52
# Kill the child's process group, which is the group that any descendant
# processes spawned by the child process will live in.
try:
    os.killpg(child_pid, signal.SIGTERM)
Contributor commented:
This approach is what I had as well, but doesn't that mean we leak child processes once we go through a restart? The restarted process will have a different child_pid, and its child processes will be in a different group, so we don't kill them here.

I didn't test your PR locally, but that's what I got with a very similar (if not the same) approach.

@benesch (Member Author) replied:

True! But those child processes should be clusterd processes that are managed by the process orchestrator, and thus readopted by the next environmentd incarnation. Though of course they'd be abandoned and leaked when the last environmentd incarnation exited. I was thinking we could live with that glitch but ...

...after thinking about this for a while I felt like the process group and session management here was overly complex. I've pushed up a new revision that vastly simplifies the management. I think it is nearly 100% robust and avoids the glitch you called out here. The only ways it can fail are 1) if run.py itself crashes (unavoidable) or 2) if environmentd intentionally moves its subprocesses out into their own process group, which would be actively adversarial and something we don't need to worry about.

@aljoscha (Contributor) left a comment:

Excellent! I verified this locally with my 0dt demo/testing setup: works as expected! 👌

Bring the behavior of bin/environmentd more in line with Kubernetes and
auto-restart envd after it halts. This facilitates demos and local
testing of the new 0dt deployment flow, which involves a
halt-and-restart when the new generation takes over.

Kubernetes will also restart envd after a panic, but we choose not to do
so here, to avoid papering over panics during local development.

This commit involves a substantial refactoring and simplification of the
process group and session management in the run.py script. Moving the
subprocess into its own process group and session appears to be wholly
unnecessary; instead, we can simply create a new process group for the
parent, thus ensuring that all descendant processes are created in the
process group for the run.py script, then send a SIGTERM to all
processes in that group when run.py exits.
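As a rough illustration of the scheme described above (a sketch under assumed details, not the actual run.py code): run.py makes itself the leader of a fresh process group, so every descendant it spawns lands in that group by default, and a cleanup hook signals the whole group when run.py exits.

```python
import atexit
import os
import signal
import subprocess
import sys


def kill_own_process_group() -> None:
    # Ignore SIGTERM in this process so the group-wide signal below does not
    # terminate run.py itself before cleanup finishes.
    signal.signal(signal.SIGTERM, signal.SIG_IGN)
    os.killpg(os.getpgrp(), signal.SIGTERM)


def main() -> None:
    # Make this script the leader of a new process group. Descendants
    # (environmentd and any clusterd processes it spawns) inherit the group
    # unless they deliberately move themselves out of it.
    os.setpgrp()
    atexit.register(kill_own_process_group)
    subprocess.run(["environmentd", *sys.argv[1:]])


if __name__ == "__main__":
    main()
```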
@benesch merged commit 5898eea into MaterializeInc:main on Jun 25, 2024
8 checks passed
@benesch deleted the bin-envd-run-restart branch on June 25, 2024 at 16:02
@benesch (Member Author) commented Jun 25, 2024

Amazing!
