-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Envoy entrypoint in Go #42
Conversation
@pglass sorry I think I approved prematurely, somehow I only looked at the diff from one of the commits. I haven't actually reviewed the entrypoint command. Let me finish reviewing and then I'll reapprove. |
if ok { | ||
return exitCode | ||
} | ||
return -1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we return -1 as opposed to 1?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is basically a case where we don't know what the exit code was for some reason (channel was closed without sending a value - more on that below).
I think I can change this to c.envoyCmd.ProcessState.ExitCode()
, which also uses -1 in some cases:
ExitCode returns the exit code of the exited process, or -1 if the process hasn't exited or was terminated by a signal.
Thoughts?
channel was closed without sending a value
How could this happen? Since we know the process started (already waited for the pid), I think the only way this would happen is if there's a panic during the cmd.Wait()
below.
consul-ecs/subcommand/envoy-entrypoint/envoy.go
Lines 39 to 59 in 6871fc4
func (e *EnvoyCmd) Run(wg *sync.WaitGroup) { | |
wg.Add(1) | |
defer wg.Done() | |
defer close(e.ExitCodeCh) | |
defer close(e.PidCh) | |
if err := e.Cmd.Start(); err != nil { | |
// Closed channels indicate the command failed to start. | |
return | |
} | |
e.PidCh <- e.Cmd.Process.Pid | |
if err := e.Cmd.Wait(); err != nil { | |
e.ExitCodeCh <- e.Cmd.ProcessState.ExitCode() | |
} else { | |
e.ExitCodeCh <- 0 | |
} | |
// Signal the process group to exit, to try to clean up subprocesses. | |
_ = syscall.Kill(-e.Cmd.Process.Pid, syscall.SIGTERM) | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made this a bit better:
- Always wait for Envoy to exit so that we always know its exit code (which will still be -1 in some cases).
- When the app monitor is done, send a SIGTERM to Envoy instead of returning (so that we no longer rely on the
cancel
in the defer to terminate Envoy). - Stop using CommandContext. This means the Envoy command is not cancellable via the context, but that should be okay since we always wait for Envoy to exit.
subcommand/envoy-entrypoint/envoy.go
Outdated
|
||
func NewEnvoyCmd(ctx context.Context, args []string) *EnvoyCmd { | ||
// CommandContext allows cancelling the command. | ||
// When cancelled, the process is sent a SIGKILL and is not waited on. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should let envoy exit cleanly so that it can drain connections and stuff. Ideal would be to call /quitquitquit
endpoint when app container exists: https://www.envoyproxy.io/docs/envoy/latest/operations/admin#post--quitquit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, this is a good point.
Envoy actually catches both SIGTERM and SIGINT to shutdown cleanly as well. But I'm not sure if there's a difference between /quitquitquit and SIGTERM/SIGINT (my quick local test and glance at the Envoy codebase indicates they do the same. Envoy's signal handlers and /quitquitquit both call Server::Instance::shutdown()
).
From Envoy docs,
By default, the Envoy server will close listeners immediately on server shutdown. To drain listeners for some duration of time prior to server shutdown, use drain_listeners before shutting down the server. The listeners will be directly stopped without any graceful draining behaviour, and cease accepting new connections immediately.
To add a graceful drain period prior to listeners being closed, use the query parameter drain_listeners?graceful. By default, Envoy will discourage requests for some period of time (as determined by --drain-time-s) but continue accepting new connections until the drain timeout. The behaviour of request discouraging is determined by the drain manager.
So, we may even need something like:
- Drain listeners: POST /drain_listeners?graceful
- Wait for some amount of time.
- Shutdown Envoy cleanly: SIGTERM or POST /quitquitquit
Thoughts @ishustava @erichaberkorn? I think for this first pass, I'd prefer to keep this simple: send a SIGTERM so Envoy exits quickly/cleanly, and then follow up with draining logic later based on user feedback.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, a SIGTERM would work too (that's how it behaves on k8s by default too), whichever one is easier to implement should be fine. I meant that we should let envoy exit gracefully and wait rather than sending a SIGKILL, but didn't express myself very well 😄
Agree, as a first pass using graceful shutdown should be fine.
} | ||
|
||
c.sigs = make(chan os.Signal, 1) | ||
c.wg = &sync.WaitGroup{} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit confused about why we need the waitgroup too. It seems that the two goroutines we're starting (envoy and app monitor) will also send some signal to some sort of done/exit channel when they're done. Wouldn't receiving on those channels be sufficient to accomplish the same behavior?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, hmm.
I need the WaitGroup because it's possible the entrypoint never receives a SIGTERM, so then the entrypoint never monitors the task metadata. What that means is the AppContainerMonitor.Done()
channel is never closed, because that happens in AppContainerMonitor.Run()
, which is not called if there is no SIGTERM.
Basically, I need a way to know if the AppContainerMonitor is not yet started. Or I always start the AppContainerMonitor (in which case it needs to be notified of when the SIGTERM happens, which I think is complicated).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reworked this a little so that there's no more WaitGroup. I think it's better! 🙂 I went with the approach of having the AppContainerMonitor independently listen for SIGTERM. We always start the app monitor in the background, which means its "Done()" channel is actually reliable for waiting on.
* Remove the WaitGroup and wait on specific channels instead for goroutines to finish. This requires that we always start the AppContainerMonitor, which is independently notified of SIGTERM to wake up and wait for the container to exit. * EnvoyCmd is no longer cancellable. CommandContext would kill the process ungracefully when cancelled through the context.. Now we send an SIGTERM, and always wait for the Envoy process to finish.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, Paul! Love the refactor in go.
Left a couple more comments but they're not blocking.
// When the application containers stop (after SIGTERM), tell Envoy to exit. | ||
if ok { | ||
c.log.Info("terminating Envoy with sigterm") | ||
_ = c.envoyCmd.Process.Signal(syscall.SIGTERM) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would be good to add a comment explaining why we're ignoring error here
// * Bash actually ignores SIGINT by default (note: CTRL-C sends SIGINT to the process group, not just the parent) | ||
// * Tests can be run in different places, so /bin/sh could be any shell with different behavior. | ||
// Why a background process + wait? Why not just a trap + sleep? | ||
// * The sleep blocks the trap. Traps are not executed until the current command completes, except for `wait`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice, thank you for answering all my questions 😄
|
||
// Not implemented for Windows. | ||
// Our Unix implementation doesn't compile on Windows, and we only need to support | ||
// Linux since this is an entrypoint to a Docker container |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To follow godoc style
// Linux since this is an entrypoint to a Docker container | |
// Linux since this is an entrypoint to a Docker container. |
subcommand/envoy-entrypoint/envoy.go
Outdated
} | ||
e.startedCh <- struct{}{} | ||
|
||
_ = e.Cmd.Wait() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why are we ignoring error here?
subcommand/envoy-entrypoint/envoy.go
Outdated
e.doneCh <- struct{}{} | ||
|
||
// Signal the process group to exit, to try to clean up subprocesses. | ||
_ = syscall.Kill(-e.Cmd.Process.Pid, syscall.SIGTERM) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here: not sure why the error is ignored. Also, do we need to wait for process to exit here?
Changes proposed in this PR:
Add a
consul-ecs envoy-entrypoint
command, which we can use as an entrypoint for the Envoy container instead of plain shell scriptThe entrypoint will:
How I've tested this PR:
How I expect reviewers to test this PR:
👀
Checklist: