Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve logging on Connect proxy error #12389

Closed
mikenomitch opened this issue Mar 25, 2022 · 3 comments · Fixed by #15908
Closed

Improve logging on Connect proxy error #12389

mikenomitch opened this issue Mar 25, 2022 · 3 comments · Fixed by #15908
Assignees

Comments

@mikenomitch
Copy link
Contributor

Proposal

When there is an error creating the bootstrap configuration for Connect proxy sidecar, the error is not properly surfaced. Currently there is no indication of what happened when Envoy fails to start allocations

Use-cases

Ideally Nomad could pass this error through to the end user so they could better diagnose and fix the issue.

For instance, an ACL mismatch in Consul and/or notifications on expired Certificates would be logged/surfaced.

Attempted Solutions

The following code is used to surface the bug in a local Nomad fork:

diff --git a/client/allocrunner/taskrunner/envoy_bootstrap_hook.go b/client/allocrunner/taskrunner/envoy_bootstrap_hook.go
index fa9c63a43..494c7398a 100644
--- a/client/allocrunner/taskrunner/envoy_bootstrap_hook.go
+++ b/client/allocrunner/taskrunner/envoy_bootstrap_hook.go
@@ -274,6 +274,7 @@ func (h *envoyBootstrapHook) Prestart(ctx context.Context, req *ifs.TaskPrestart

 	// keep track of latest error returned from exec-ing consul envoy bootstrap
 	var cmdErr error
+	var cmdOut string

 	// Since Consul services are registered asynchronously with this task
 	// hook running, retry until timeout or success.
@@ -323,6 +324,8 @@ func (h *envoyBootstrapHook) Prestart(ctx context.Context, req *ifs.TaskPrestart
 		// fails.
 		_ = os.Remove(bootstrapFilePath)

+		cmdOut = buf.String()
+
 		return true, cmdErr
 	}, exptime.BackoffOptions{
 		MaxSleepTime:   h.envoyBootstrapWaitTime,
@@ -332,7 +335,7 @@ func (h *envoyBootstrapHook) Prestart(ctx context.Context, req *ifs.TaskPrestart
 		// Wrap the last error from Consul and set that as our status.
 		_, recoverable := cmdErr.(*exec.ExitError)
 		return structs.NewRecoverableError(
-			fmt.Errorf("error creating bootstrap configuration for Connect proxy sidecar: %v", cmdErr),
+			fmt.Errorf("error creating bootstrap configuration for Connect proxy sidecar: %v \n %v", cmdErr, cmdOut),
 			recoverable,
 		)
 	}
@gulducat
Copy link
Member

Heya @mikenomitch !

I tried a couple things to surface more information about the nature of bootstrap errors, and ran into d476780, released in v1.3, which at least partially addresses this, though the UX is admittedly still not ideal.

The conundrum is that on the Nomad side, we have little/no control over the output of the bootstrap command, and trying to parse it out into something that fits well in a Task Event seems pretty fraught. For example, to induce an error I added a non-existent flag to the bootstrapArgs here:

cmd := exec.CommandContext(ctx, "consul", bootstrapArgs...)

and a change similar to your suggestion yielded the entire "Usage" output in the error, which gets emitted as a task event that takes up the whole page in Nomad UI.

example output comparison

before, currently very little info:

envoy_bootstrap: error creating bootstrap configuration for Connect proxy sidecar: exit status 1

after, notice the side scroll:

envoy_bootstrap: error creating bootstrap configuration for Connect proxy sidecar: exit status 1: flag provided but not defined: -non-existent-arg Usage: -address value LAN address to advertise in the gateway service registration -admin-access-log-path string The path to write the access log for the administration server. If no access log is desired specify "/dev/null". By default it will use "/dev/null". (default "/dev/null") -admin-bind string The address:port to start envoy's admin server on. Envoy requires this but care must be taken to ensure it's not exposed to an untrusted network as it has full control over the secrets and config of the proxy. (default "localhost:19000") -bind-address <name>=<ip>:<port> Bind address to use instead of the default binding rules given as <name>=<ip>:<port> pairs. This flag may be specified multiple times to add multiple bind addresses. -bootstrap Generate the bootstrap.json but don't exec envoy -ca-file value Path to a CA file to use for TLS when communicating with Consul. This can also be specified via the CONSUL_CACERT environment variable. -ca-path value Path to a directory of CA certificates to use for TLS when communicating with Consul. This can also be specified via the CONSUL_CAPATH environment variable ..... etc etc etc etc etc .....

🤣


We could filter out "Usage:.*" but the underlying "what could the output possibly contain?" issue remains, so I'm not sure what to suggest beyond what d476780 enables.

On the Consul side, feasibly hashicorp/consul#9724 could improve the situation. It would also just be nice if we didn't need to run consul CLI for this bootstrap process.

@gulducat
Copy link
Member

Hi again @mikenomitch :) I chatted with the team about this some more, and I'm wondering how this sounds to you.

Would it help if we added some documentation describing the process laid out in d476780 and included a link to that doc in the error that gets surfaced as a task event?

So in addition to

envoy_bootstrap: error creating bootstrap configuration for Connect proxy sidecar: exit status 1

the message presented to the user could include a link like

envoy_bootstrap: error creating bootstrap configuration for Connect proxy sidecar: exit status 1 -- see: https://developer.hashicorp.com/nomad/docs/integrations/consul-connect#limitations

How does that strike you?

@mikenomitch
Copy link
Contributor Author

@gulducat and I had a chat about this offline. Given that the error messages from Consul aren't great to surface (see above comment). We probably aren't going to add the code change as is.

@gulducat is going to document the steps laid out here so this will be more avoidable in the future, then we'll close this out.

If an endpoint is added for envoy bootstrapping, then we can switch to that and surface that error for an eventual fix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants