Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

consul/connect: add support for bridge networks with connect native tasks #8443

Merged
merged 4 commits into from
Jul 29, 2020

Conversation

shoenig
Copy link
Member

@shoenig shoenig commented Jul 15, 2020

Before, Connect Native Tasks needed one of these to work:

  • To be run in host networking mode
  • To have the Consul agent configured to listen to a unix socket
  • To have the Consul agent configured to listen to a public interface

None of these are a great experience, though running in host networking is
still the best solution for non-Linux hosts. This PR establishes a connection
proxy between the Consul HTTP listener and a unix socket inside the alloc fs,
bypassing the network namespace for any Connect Native task. Similar to and
re-uses a bunch of code from the gRPC listener version for envoy sidecar proxies.

Proxy is established only if the alloc is configured for bridge networking and
there is at least one Connect Native task in the Task Group.

Fixes #8290

@shoenig shoenig force-pushed the x-cnb branch 3 times, most recently from 802fb18 to 460fbd4 Compare July 22, 2020 20:29
@shoenig shoenig marked this pull request as ready for review July 22, 2020 20:56
Copy link
Member

@schmichael schmichael left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No blockers! LGTM!

tg := h.alloc.Job.LookupTaskGroup(h.alloc.TaskGroup)
for _, s := range tg.Services {
if s.Connect.HasSidecar() {
if s.Connect.HasSidecar() { // todo(shoenig) check we are in bridge mode
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this todo because we don't need to create this socket if we're in host mode? If so maybe expand it a bit as most people see todo and think "bug" -- not "we're doing work we don't need to do"

TODO's that could cause bugs should have an issue. If this is just a slight optimization I don't think it's worth worrying about an issue. I suspect we'll fix this by re-architecting how we expose Consul to tasks.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yeah it's just a quick optimization to check for a bridge network before scanning each service. Added it to the new http socket hook and meant to do the same here.


// lock synchronizes proxy which may be mutated and read concurrently via
// Prerun, Update, and Postrun.
lock sync.Mutex
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] put fields protected by a lock next to the lock field: https://golang.org/doc/effective_go.html#commentary (the end of that section)

Nevermind, just noticed this also guards alloc in Update, so just update the comment.

// shouldRun returns true if the alloc contains at least one connect native
// task and has a network configured in bridge mode
//
// todo(shoenig): what about CNI networks?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nickethier ?

I feel like the answer is probably yes (treat like bridge) as we often use "bridge" to mean "using a network namespace" which definitely applies to CNI plugins too.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, logically CNI and bridge are the same. Bridge even reuses most of the cni bits.


if err := h.proxy.stop(); err != nil {
// Only log a failure to stop, worst case is the proxy leaks a goroutine.
h.logger.Debug("error stopping Consul HTTP proxy", "error", err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's bump the error level since it shouldn't error in normal usage (but isn't error level severity).

Suggested change
h.logger.Debug("error stopping Consul HTTP proxy", "error", err)
h.logger.Warn("error stopping Consul HTTP proxy", "error", err)

// The Consul HTTP socket should be usable by all users in case a task is
// running as a non-privileged user. Unix does not allow setting domain
// socket permissions when creating the file, so we must manually call
// chmod afterwords.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// chmod afterwords.
// chmod afterwards.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TIL -_-


select {
case <-p.doneCh:
case <-time.After(3 * time.Second):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's always comment magic numbers

Suggested change
case <-time.After(3 * time.Second):
case <-time.After(3 * time.Second):
// Give the listener some time to shutdown but don't block forever since the leaked goroutine is not costly.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we use this same value in the grpc proxy, I'd suggest declaring a const that can be used in both.

// the connect native task. This will enable the task to communicate with Consul
// if the task is running inside an alloc's network namespace (i.e. bridge mode).
//
// Sets CONSUL_HTTP_ADDR if not already set.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why would it already be set? By users who want to override it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, if someone manually sets (via env/template) any of the standard consul environment variables I believe those values should always take priority. Maybe they have Consul bound to an addressable IP?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shoenig I hope it is okay to ask here, before opening a ticket. The PR nicely adds this for bridge mode, but in host mode it still requires special configuration if eg the consul port of the local agent is changed (not sure if that makes sense to do, but…). Is there any reason to not also map this socket in in host network mode -- or at least provide CONSUL_HTTP_ADDR with the addr that nomad knows for connecting to consul? Or is that done somewhere else and I missed it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @apollo13 ! The socket dance exists in the bridge hook because that's how we get "around" the network namespace and access Consul - which is typically listening on a loopback interface not privy to the isolated task. A Connect native task using host networking OTOH shouldn't have a problem connecting to Consul directly, and so no sockets are made for them. Your suggestion of automatically setting CONSUL_HTTP_ADDR is probably reasonable - feel free to open an issue/PR to discuss more.

Copy link
Member

@nickethier nickethier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

…asks

Before, Connect Native Tasks needed one of these to work:

- To be run in host networking mode
- To have the Consul agent configured to listen to a unix socket
- To have the Consul agent configured to listen to a public interface

None of these are a great experience, though running in host networking is
still the best solution for non-Linux hosts. This PR establishes a connection
proxy between the Consul HTTP listener and a unix socket inside the alloc fs,
bypassing the network namespace for any Connect Native task. Similar to and
re-uses a bunch of code from the gRPC listener version for envoy sidecar proxies.

Proxy is established only if the alloc is configured for bridge networking and
there is at least one Connect Native task in the Task Group.

Fixes #8290
@github-actions
Copy link

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 27, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improve bridge networking story with Connect Native tasks
4 participants