Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Envoy entrypoint in Go #42

Merged
merged 15 commits into from
Nov 2, 2021
Merged

Envoy entrypoint in Go #42

merged 15 commits into from
Nov 2, 2021

Conversation

pglass
Copy link

@pglass pglass commented Oct 26, 2021

Changes proposed in this PR:

  • Add a consul-ecs envoy-entrypoint command, which we can use as an entrypoint for the Envoy container instead of plain shell script

  • The entrypoint will:

    • Spawn the given command in a subprocess (presumably Envoy)
    • Ignore SIGTERM, so that Envoy continues running into task shutdown
    • Forward other signals to the subprocess
    • Upon SIGTERM, begin polling Task metadata. When the application container(s) exit, the entrypoint terminates Envoy.

How I've tested this PR:

  • Unit tests included
  • Manual local testing
  • Acceptance tests in terraform-aws-consul-ecs (commit and CircleCI job)

How I expect reviewers to test this PR:

👀

Checklist:

  • Tests added
  • CHANGELOG entry added

subcommand/mesh-init/command.go Outdated Show resolved Hide resolved
@ishustava
Copy link
Contributor

@pglass sorry I think I approved prematurely, somehow I only looked at the diff from one of the commits. I haven't actually reviewed the entrypoint command. Let me finish reviewing and then I'll reapprove.

subcommand/envoy-entrypoint/command.go Outdated Show resolved Hide resolved
if ok {
return exitCode
}
return -1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we return -1 as opposed to 1?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is basically a case where we don't know what the exit code was for some reason (channel was closed without sending a value - more on that below).

I think I can change this to c.envoyCmd.ProcessState.ExitCode(), which also uses -1 in some cases:

ExitCode returns the exit code of the exited process, or -1 if the process hasn't exited or was terminated by a signal.

Thoughts?


channel was closed without sending a value

How could this happen? Since we know the process started (already waited for the pid), I think the only way this would happen is if there's a panic during the cmd.Wait() below.

func (e *EnvoyCmd) Run(wg *sync.WaitGroup) {
wg.Add(1)
defer wg.Done()
defer close(e.ExitCodeCh)
defer close(e.PidCh)
if err := e.Cmd.Start(); err != nil {
// Closed channels indicate the command failed to start.
return
}
e.PidCh <- e.Cmd.Process.Pid
if err := e.Cmd.Wait(); err != nil {
e.ExitCodeCh <- e.Cmd.ProcessState.ExitCode()
} else {
e.ExitCodeCh <- 0
}
// Signal the process group to exit, to try to clean up subprocesses.
_ = syscall.Kill(-e.Cmd.Process.Pid, syscall.SIGTERM)
}

Copy link
Author

@pglass pglass Oct 28, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made this a bit better:

  1. Always wait for Envoy to exit so that we always know its exit code (which will still be -1 in some cases).
  2. When the app monitor is done, send a SIGTERM to Envoy instead of returning (so that we no longer rely on the cancel in the defer to terminate Envoy).
  3. Stop using CommandContext. This means the Envoy command is not cancellable via the context, but that should be okay since we always wait for Envoy to exit.

subcommand/envoy-entrypoint/command_unix.go Outdated Show resolved Hide resolved

func NewEnvoyCmd(ctx context.Context, args []string) *EnvoyCmd {
// CommandContext allows cancelling the command.
// When cancelled, the process is sent a SIGKILL and is not waited on.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should let envoy exit cleanly so that it can drain connections and stuff. Ideal would be to call /quitquitquit endpoint when app container exists: https://www.envoyproxy.io/docs/envoy/latest/operations/admin#post--quitquit.

Copy link
Author

@pglass pglass Oct 28, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, this is a good point.

Envoy actually catches both SIGTERM and SIGINT to shutdown cleanly as well. But I'm not sure if there's a difference between /quitquitquit and SIGTERM/SIGINT (my quick local test and glance at the Envoy codebase indicates they do the same. Envoy's signal handlers and /quitquitquit both call Server::Instance::shutdown()).

From Envoy docs,

By default, the Envoy server will close listeners immediately on server shutdown. To drain listeners for some duration of time prior to server shutdown, use drain_listeners before shutting down the server. The listeners will be directly stopped without any graceful draining behaviour, and cease accepting new connections immediately.

To add a graceful drain period prior to listeners being closed, use the query parameter drain_listeners?graceful. By default, Envoy will discourage requests for some period of time (as determined by --drain-time-s) but continue accepting new connections until the drain timeout. The behaviour of request discouraging is determined by the drain manager.

So, we may even need something like:

Thoughts @ishustava @erichaberkorn? I think for this first pass, I'd prefer to keep this simple: send a SIGTERM so Envoy exits quickly/cleanly, and then follow up with draining logic later based on user feedback.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, a SIGTERM would work too (that's how it behaves on k8s by default too), whichever one is easier to implement should be fine. I meant that we should let envoy exit gracefully and wait rather than sending a SIGKILL, but didn't express myself very well 😄

Agree, as a first pass using graceful shutdown should be fine.

}

c.sigs = make(chan os.Signal, 1)
c.wg = &sync.WaitGroup{}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit confused about why we need the waitgroup too. It seems that the two goroutines we're starting (envoy and app monitor) will also send some signal to some sort of done/exit channel when they're done. Wouldn't receiving on those channels be sufficient to accomplish the same behavior?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, hmm.

I need the WaitGroup because it's possible the entrypoint never receives a SIGTERM, so then the entrypoint never monitors the task metadata. What that means is the AppContainerMonitor.Done() channel is never closed, because that happens in AppContainerMonitor.Run(), which is not called if there is no SIGTERM.

Basically, I need a way to know if the AppContainerMonitor is not yet started. Or I always start the AppContainerMonitor (in which case it needs to be notified of when the SIGTERM happens, which I think is complicated).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reworked this a little so that there's no more WaitGroup. I think it's better! 🙂 I went with the approach of having the AppContainerMonitor independently listen for SIGTERM. We always start the app monitor in the background, which means its "Done()" channel is actually reliable for waiting on.

* Remove the WaitGroup and wait on specific channels instead for
  goroutines to finish. This requires that we always start the
  AppContainerMonitor, which is independently notified of SIGTERM
  to wake up and wait for the container to exit.
* EnvoyCmd is no longer cancellable. CommandContext would kill the
  process ungracefully when cancelled through the context.. Now we
  send an SIGTERM, and always wait for the Envoy process to finish.
@pglass pglass requested a review from erichaberkorn October 28, 2021 20:34
Copy link
Contributor

@ishustava ishustava left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, Paul! Love the refactor in go.

Left a couple more comments but they're not blocking.

// When the application containers stop (after SIGTERM), tell Envoy to exit.
if ok {
c.log.Info("terminating Envoy with sigterm")
_ = c.envoyCmd.Process.Signal(syscall.SIGTERM)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would be good to add a comment explaining why we're ignoring error here

// * Bash actually ignores SIGINT by default (note: CTRL-C sends SIGINT to the process group, not just the parent)
// * Tests can be run in different places, so /bin/sh could be any shell with different behavior.
// Why a background process + wait? Why not just a trap + sleep?
// * The sleep blocks the trap. Traps are not executed until the current command completes, except for `wait`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice, thank you for answering all my questions 😄


// Not implemented for Windows.
// Our Unix implementation doesn't compile on Windows, and we only need to support
// Linux since this is an entrypoint to a Docker container
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To follow godoc style

Suggested change
// Linux since this is an entrypoint to a Docker container
// Linux since this is an entrypoint to a Docker container.

}
e.startedCh <- struct{}{}

_ = e.Cmd.Wait()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are we ignoring error here?

e.doneCh <- struct{}{}

// Signal the process group to exit, to try to clean up subprocesses.
_ = syscall.Kill(-e.Cmd.Process.Pid, syscall.SIGTERM)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here: not sure why the error is ignored. Also, do we need to wait for process to exit here?

@pglass pglass merged commit e9990d7 into main Nov 2, 2021
@pglass pglass deleted the pglass/envoy-entrypoint branch November 2, 2021 19:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants