Envoy entrypoint in Go #42

pglass · 2021-10-26T19:29:22Z

Changes proposed in this PR:

Add a consul-ecs envoy-entrypoint command, which we can use as the entrypoint for the Envoy container instead of a plain shell script.
The entrypoint will:
- Spawn the given command in a subprocess (presumably Envoy)
- Ignore SIGTERM, so that Envoy continues running into task shutdown
- Forward other signals to the subprocess
- Upon SIGTERM, begin polling Task metadata. When the application container(s) exit, the entrypoint terminates Envoy.

How I've tested this PR:

Unit tests included
Manual local testing
Acceptance tests in terraform-aws-consul-ecs (commit and CircleCI job)

How I expect reviewers to test this PR:

👀

Checklist:

Tests added
CHANGELOG entry added

ishustava · 2021-10-27T19:49:00Z

@pglass sorry I think I approved prematurely, somehow I only looked at the diff from one of the commits. I haven't actually reviewed the entrypoint command. Let me finish reviewing and then I'll reapprove.

subcommand/envoy-entrypoint/command.go

ishustava · 2021-10-27T23:15:33Z

subcommand/envoy-entrypoint/command_unix.go

+			if ok {
+				return exitCode
+			}
+			return -1


why do we return -1 as opposed to 1?

This is basically a case where we don't know what the exit code was for some reason (channel was closed without sending a value - more on that below).

I think I can change this to c.envoyCmd.ProcessState.ExitCode(), which also uses -1 in some cases:

ExitCode returns the exit code of the exited process, or -1 if the process hasn't exited or was terminated by a signal.

Thoughts?

channel was closed without sending a value

How could this happen? Since we know the process started (already waited for the pid), I think the only way this would happen is if there's a panic during the cmd.Wait() below.

consul-ecs/subcommand/envoy-entrypoint/envoy.go

Lines 39 to 59 in 6871fc4

func (e *EnvoyCmd) Run(wg *sync.WaitGroup) {
	wg.Add(1)
	defer wg.Done()
	defer close(e.ExitCodeCh)
	defer close(e.PidCh)

	if err := e.Cmd.Start(); err != nil {
		// Closed channels indicate the command failed to start.
		return
	}
	e.PidCh <- e.Cmd.Process.Pid

	if err := e.Cmd.Wait(); err != nil {
		e.ExitCodeCh <- e.Cmd.ProcessState.ExitCode()
	} else {
		e.ExitCodeCh <- 0
	}

	// Signal the process group to exit, to try to clean up subprocesses.
	_ = syscall.Kill(-e.Cmd.Process.Pid, syscall.SIGTERM)
}

I made this a bit better:

Always wait for Envoy to exit so that we always know its exit code (which will still be -1 in some cases).

When the app monitor is done, send a SIGTERM to Envoy instead of returning (so that we no longer rely on the cancel in the defer to terminate Envoy).

Stop using CommandContext. This means the Envoy command is not cancellable via the context, but that should be okay since we always wait for Envoy to exit.

subcommand/envoy-entrypoint/command_unix.go

ishustava · 2021-10-27T23:52:09Z

subcommand/envoy-entrypoint/envoy.go

+
+func NewEnvoyCmd(ctx context.Context, args []string) *EnvoyCmd {
+	// CommandContext allows cancelling the command.
+	// When cancelled, the process is sent a SIGKILL and is not waited on.


I think we should let envoy exit cleanly so that it can drain connections and stuff. Ideal would be to call /quitquitquit endpoint when app container exits: https://www.envoyproxy.io/docs/envoy/latest/operations/admin#post--quitquit.

Oh, this is a good point.

Envoy actually catches both SIGTERM and SIGINT to shut down cleanly as well. But I'm not sure if there's a difference between /quitquitquit and SIGTERM/SIGINT (my quick local test and a glance at the Envoy codebase indicate they do the same thing: Envoy's signal handlers and /quitquitquit both call Server::Instance::shutdown()).

From Envoy docs,

By default, the Envoy server will close listeners immediately on server shutdown. To drain listeners for some duration of time prior to server shutdown, use drain_listeners before shutting down the server. The listeners will be directly stopped without any graceful draining behaviour, and cease accepting new connections immediately.

To add a graceful drain period prior to listeners being closed, use the query parameter drain_listeners?graceful. By default, Envoy will discourage requests for some period of time (as determined by --drain-time-s) but continue accepting new connections until the drain timeout. The behaviour of request discouraging is determined by the drain manager.

So, we may even need something like:

Drain listeners: POST /drain_listeners?graceful

Wait for some amount of time.

Shutdown Envoy cleanly: SIGTERM or POST /quitquitquit

Thoughts @ishustava @erichaberkorn? I think for this first pass, I'd prefer to keep this simple: send a SIGTERM so Envoy exits quickly/cleanly, and then follow up with draining logic later based on user feedback.
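If the draining follow-up is ever picked up, the drain, wait, shutdown sequence listed above might look something like this. It is a sketch under assumptions, not part of this PR: `drainThenQuit` and `adminPoster` are hypothetical helpers, and the admin address `http://127.0.0.1:19000` and the five-second wait are placeholders. Only the two admin paths come from the Envoy docs quoted above.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// drainThenQuit asks Envoy to drain listeners gracefully, waits, and then
// asks it to shut down. post sends a POST to the given admin path.
func drainThenQuit(post func(path string) error, wait time.Duration) error {
	if err := post("/drain_listeners?graceful"); err != nil {
		return err
	}
	time.Sleep(wait) // fixed delay; a real implementation would likely make this cancellable
	return post("/quitquitquit")
}

// adminPoster returns a post function bound to an Envoy admin base URL.
func adminPoster(client *http.Client, adminURL string) func(string) error {
	return func(path string) error {
		resp, err := client.Post(adminURL+path, "", nil)
		if err != nil {
			return err
		}
		defer resp.Body.Close()
		if resp.StatusCode != http.StatusOK {
			return fmt.Errorf("POST %s: unexpected status %s", path, resp.Status)
		}
		return nil
	}
}

func main() {
	// The admin address is an assumption; it depends on the Envoy bootstrap config.
	post := adminPoster(http.DefaultClient, "http://127.0.0.1:19000")
	if err := drainThenQuit(post, 5*time.Second); err != nil {
		fmt.Println("shutdown sequence failed:", err)
	}
}
```

Passing the POST step in as a function keeps the ordering logic testable without a running Envoy.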

Yeah, a SIGTERM would work too (that's how it behaves on k8s by default too); whichever one is easier to implement should be fine. I meant that we should let envoy exit gracefully and wait rather than sending a SIGKILL, but didn't express myself very well 😄

Agree, as a first pass using graceful shutdown should be fine.

ishustava · 2021-10-27T23:55:17Z

subcommand/envoy-entrypoint/command_unix.go

+	}
+
+	c.sigs = make(chan os.Signal, 1)
+	c.wg = &sync.WaitGroup{}


I'm a bit confused about why we need the waitgroup too. It seems that the two goroutines we're starting (envoy and app monitor) will also send some signal to some sort of done/exit channel when they're done. Wouldn't receiving on those channels be sufficient to accomplish the same behavior?

Yeah, hmm.

I need the WaitGroup because it's possible the entrypoint never receives a SIGTERM, so then the entrypoint never monitors the task metadata. What that means is the AppContainerMonitor.Done() channel is never closed, because that happens in AppContainerMonitor.Run(), which is not called if there is no SIGTERM.

Basically, I need a way to know if the AppContainerMonitor is not yet started. Or I always start the AppContainerMonitor (in which case it needs to be notified of when the SIGTERM happens, which I think is complicated).

I reworked this a little so that there's no more WaitGroup. I think it's better! 🙂 I went with the approach of having the AppContainerMonitor independently listen for SIGTERM. We always start the app monitor in the background, which means its "Done()" channel is actually reliable for waiting on.

* Remove the WaitGroup and instead wait on specific channels for goroutines to finish. This requires that we always start the AppContainerMonitor, which is independently notified of SIGTERM so it can wake up and wait for the container to exit.
* EnvoyCmd is no longer cancellable. CommandContext would kill the process ungracefully when cancelled through the context. Now we send a SIGTERM and always wait for the Envoy process to finish.

ishustava

Looks good, Paul! Love the refactor in Go.

Left a couple more comments but they're not blocking.

ishustava · 2021-10-29T16:41:45Z

subcommand/envoy-entrypoint/command_unix.go

+			// When the application containers stop (after SIGTERM), tell Envoy to exit.
+			if ok {
+				c.log.Info("terminating Envoy with sigterm")
+				_ = c.envoyCmd.Process.Signal(syscall.SIGTERM)


would be good to add a comment explaining why we're ignoring error here

ishustava · 2021-10-29T16:43:52Z

subcommand/envoy-entrypoint/command_unix_test.go

+// * Bash actually ignores SIGINT by default (note: CTRL-C sends SIGINT to the process group, not just the parent)
+// * Tests can be run in different places, so /bin/sh could be any shell with different behavior.
+// Why a background process + wait? Why not just a trap + sleep?
+// * The sleep blocks the trap. Traps are not executed until the current command completes, except for `wait`.


nice, thank you for answering all my questions 😄

ishustava · 2021-10-29T16:48:42Z

subcommand/envoy-entrypoint/command_windows.go

+
+// Not implemented for Windows.
+// Our Unix implementation doesn't compile on Windows, and we only need to support
+// Linux since this is an entrypoint to a Docker container


To follow godoc style

Suggested change

// Linux since this is an entrypoint to a Docker container

// Linux since this is an entrypoint to a Docker container.

ishustava · 2021-10-29T16:50:24Z

subcommand/envoy-entrypoint/envoy.go

+	}
+	e.startedCh <- struct{}{}
+
+	_ = e.Cmd.Wait()


why are we ignoring error here?

ishustava · 2021-10-29T16:50:50Z

subcommand/envoy-entrypoint/envoy.go

+	e.doneCh <- struct{}{}
+
+	// Signal the process group to exit, to try to clean up subprocesses.
+	_ = syscall.Kill(-e.Cmd.Process.Pid, syscall.SIGTERM)


same here: not sure why the error is ignored. Also, do we need to wait for process to exit here?

Paul Glass added 7 commits October 20, 2021 15:49

Go-based entrypoint for Envoy

2b3de35

Bump Golang and Consul version

adf12c1

Fix windows build

6f33649

Terminate Envoy when app containers exit

745ad61

Rework entrypoint subcommand

40fd3dc

Rename: entrypoint -> envoy-entrypoint

0c93428

Tidy up entrypoint logging

6da0d1b

pglass requested review from ishustava and erichaberkorn October 26, 2021 19:29

envoy-entrypoint: correct mesh-init container name

670b5fd

erichaberkorn reviewed Oct 27, 2021

erichaberkorn reviewed Oct 27, 2021

Paul Glass added 2 commits October 27, 2021 12:00

mesh-init: replace -envoy-bootstrap-file with -envoy-bootstrap-dir

280b796

Merge remote-tracking branch 'origin/main' into pglass/envoy-entrypoint

03a5347

ishustava approved these changes Oct 27, 2021

Tidy up envoy-entrypoint

6871fc4

ishustava reviewed Oct 28, 2021

pglass requested a review from erichaberkorn October 28, 2021 20:34

Paul Glass added 2 commits October 28, 2021 15:39

Fix build error

6e5a414

Merge remote-tracking branch 'origin/main' into pglass/envoy-entrypoint

ecd4434

erichaberkorn approved these changes Oct 28, 2021

ishustava approved these changes Oct 29, 2021

Error handling in envoy-entrypoint

511e188

pglass merged commit e9990d7 into main Nov 2, 2021

pglass deleted the pglass/envoy-entrypoint branch November 2, 2021 19:37
