
bin/environmentd: auto-restart on halt #27852

Merged
1 commit merged into MaterializeInc:main on Jun 25, 2024

Conversation

@benesch (Member) commented Jun 24, 2024

Bring the behavior of bin/environmentd more in line with Kubernetes and auto-restart envd after it halts. This facilitates demos and local testing of the new 0dt deployment flow, which involves a halt-and-restart when the new generation takes over.

Kubernetes will also restart envd after a panic, but we choose not to do so here, to avoid papering over panics during local development.
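For illustration, here is a minimal sketch of the restart-on-halt loop this describes. The HALT_EXIT_CODE value and the run_environmentd helper are hypothetical placeholders rather than the actual constants or structure of run.py:

```python
import subprocess
import sys

# Hypothetical exit code by which environmentd signals an intentional halt;
# the real value is whatever bin/environmentd actually checks for.
HALT_EXIT_CODE = 166


def run_environmentd(args: list[str]) -> int:
    """Run environmentd, restarting it whenever it halts.

    Any other exit status (including a panic) is returned to the caller so
    that crashes are not papered over during local development.
    """
    while True:
        completed = subprocess.run(["environmentd", *args])
        if completed.returncode == HALT_EXIT_CODE:
            print("environmentd halted; restarting...", file=sys.stderr)
            continue
        return completed.returncode
```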

Motivation

  • This PR adds a known-desirable feature: works towards 0dt.

Checklist

@benesch requested review from aljoscha and jkosh44 on June 24, 2024 at 16:52
@benesch marked this pull request as ready for review on June 24, 2024 at 16:52
# Kill the child's process group, which is the group that any descendant
# processes spawned by the child process will live in.
try:
    os.killpg(child_pid, signal.SIGTERM)
Contributor commented:
This approach is what I had as well, but doesn't that mean we leak child processes once we go through a restart? The restarted process will have a different child_pid, and its child processes will be in a different group, so we don't kill them here.

I didn't test your PR locally, but that's what I got with a very similar (if not the same) approach.

@benesch (Member Author) replied:

True! But those child processes should be clusterd processes that are managed by the process orchestrator, and thus readopted by the next environmentd incarnation. Though of course they'd be abandoned and leaked when the last environmentd incarnation exited. I was thinking we could live with that glitch but ...

...after thinking about this for a while I felt like the process group and session management here was overly complex. I've pushed up a new revision that vastly simplifies the management. I think it is nearly 100% robust and avoids the glitch you called out here. The only ways it can fail are 1) if run.py itself crashes (unavoidable) or 2) if environmentd intentionally moves its subprocesses out into their own process group, which would be actively adversarial and something we don't need to worry about.

@aljoscha (Contributor) left a comment:

Excellent! I verified this locally with my 0dt demo/testing setup: works as expected! 👌

Bring the behavior of bin/environmentd more in line with Kubernetes and
auto-restart envd after it halts. This facilitates demos and local
testing of the new 0dt deployment flow, which involves a
halt-and-restart when the new generation takes over.

Kubernetes will also restart envd after a panic, but we choose not to do
so here, to avoid papering over panics during local development.

This commit involves a substantial refactoring and simplification of the
process group and session management in the run.py script. Moving the
subprocess into its own process group and session appears to be wholly
unnecessary; instead, we can simply create a new process group for the
parent, thus ensuring that all descendant processes are created in the
process group for the run.py script, then send a SIGTERM to all
processes in that group when run.py exits.
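As a rough illustration of the scheme described above (a sketch under assumed details, not the actual run.py code): run.py makes itself the leader of a fresh process group, so every descendant it spawns lands in that group by default, and a cleanup hook signals the whole group when run.py exits.

```python
import atexit
import os
import signal
import subprocess
import sys


def kill_own_process_group() -> None:
    # Ignore SIGTERM in this process so the group-wide signal below does not
    # terminate run.py itself before cleanup finishes.
    signal.signal(signal.SIGTERM, signal.SIG_IGN)
    os.killpg(os.getpgrp(), signal.SIGTERM)


def main() -> None:
    # Make this script the leader of a new process group. Descendants
    # (environmentd and any clusterd processes it spawns) inherit the group
    # unless they deliberately move themselves out of it.
    os.setpgrp()
    atexit.register(kill_own_process_group)
    subprocess.run(["environmentd", *sys.argv[1:]])


if __name__ == "__main__":
    main()
```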
@benesch merged commit 5898eea into MaterializeInc:main on Jun 25, 2024
8 checks passed
@benesch deleted the bin-envd-run-restart branch on June 25, 2024 at 16:02
@benesch (Member Author) commented Jun 25, 2024

Amazing!
