
Global ACL tokens are required for Consul Connect when using local mesh gateways #7381

Open
lkysow opened this issue Mar 3, 2020 · 4 comments
Labels
needs-discussion: Topic needs discussion with the larger Consul maintainers before committing to for a release
type/enhancement: Proposed improvement or new feature

Comments

@lkysow
Member

lkysow commented Mar 3, 2020

Overview of the Issue

A sidecar proxy using a local ACL token cannot route to a Connect service in another datacenter, even when mesh gateways are in local mode.

The following will be logged on the Consul client:

2020-03-03T17:28:18.005Z [ERROR] agent.client: RPC failed to server: method=Health.ServiceNodes server=10.244.2.30:8300 error="rpc error making call: rpc error making call: ACL not found"
2020-03-03T17:28:18.005Z [ERROR] agent.proxycfg: watch error: id=upstream-target:static-server.default.dc2:static-server?dc=dc2 error="error filling agent cache: rpc error making call: rpc error making call: ACL not found"
2020-03-03T17:28:18.066Z [ERROR] agent.client: RPC failed to server: method=Health.ServiceNodes server=10.244.2.30:8300 error="rpc error making call: rpc error making call: ACL not found"

This is a particular problem on Kubernetes because we use consul login to create our tokens, and consul login always returns a local token.

Reproduction Steps

  1. Create two datacenters with ACLs enabled and federate them
  2. Set a proxy-defaults config:

            {
              "kind": "proxy-defaults",
              "name": "global",
              "mesh_gateway": {
                "mode": "local"
              }
            }

  3. Start mesh gateways in both dcs
  4. Create a local ACL token in dc1
  5. Start a sidecar proxy in dc1 using the local ACL token with an upstream of a service in dc2
  6. You should see the errors in the logs

Suggested Solution

We should short-circuit where we iterate over the upstreams and start blocking queries:

for _, u := range s.proxyCfg.Upstreams {
	dc := s.source.Datacenter
	if u.Datacenter != "" {
		// TODO(rb): if we ASK for a specific datacenter, do we still use the chain?
		dc = u.Datacenter
	}
	ns := currentNamespace
	if u.DestinationNamespace != "" {
		ns = u.DestinationNamespace
	}

	cfg, err := parseReducedUpstreamConfig(u.Config)
	if err != nil {
		// Don't hard fail on a config typo, just warn. We'll fall back on
		// the plain discovery chain if there is an error so it's safe to
		// continue.
		s.logger.Warn("failed to parse upstream config",
			"upstream", u.Identifier(),
			"error", err,
		)
	}

	switch u.DestinationType {
	case structs.UpstreamDestTypePreparedQuery:
		err = s.cache.Notify(s.ctx, cachetype.PreparedQueryName, &structs.PreparedQueryExecuteRequest{
			Datacenter:    dc,
			QueryOptions:  structs.QueryOptions{Token: s.token, MaxAge: defaultPreparedQueryPollInterval},
			QueryIDOrName: u.DestinationName,
			Connect:       true,
			Source:        *s.source,
		}, "upstream:"+u.Identifier(), s.ch)
		if err != nil {
			return err
		}

	case structs.UpstreamDestTypeService:
		fallthrough

	case "": // Treat unset as the default Service type
		err = s.cache.Notify(s.ctx, cachetype.CompiledDiscoveryChainName, &structs.DiscoveryChainRequest{
			Datacenter:             s.source.Datacenter,
			QueryOptions:           structs.QueryOptions{Token: s.token},
			Name:                   u.DestinationName,
			EvaluateInDatacenter:   dc,
			EvaluateInNamespace:    ns,
			OverrideMeshGateway:    s.proxyCfg.MeshGateway.OverlayWith(u.MeshGateway),
			OverrideProtocol:       cfg.Protocol,
			OverrideConnectTimeout: cfg.ConnectTimeout(),
		}, "discovery-chain:"+u.Identifier(), s.ch)
		if err != nil {
			return err
		}

	default:
		// ... (remainder elided in the original snippet)
	}
}

Instead, we should check whether we're using local mesh gateways and skip these calls. Their results are discarded later when local mesh gateways are in use, so we don't need them.
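A minimal, self-contained sketch of that short-circuit (the names here are illustrative, not Consul's actual types): skip starting the cross-DC watch for an upstream when the effective mesh gateway mode is local, since the proxy only needs the local gateway's endpoints and the cross-DC RPC would fail with a local token anyway.

```go
package main

import "fmt"

// MeshGatewayMode loosely mirrors the shape of Consul's
// structs.MeshGatewayMode; for illustration only.
type MeshGatewayMode string

const (
	MeshGatewayModeDefault MeshGatewayMode = ""
	MeshGatewayModeLocal   MeshGatewayMode = "local"
	MeshGatewayModeRemote  MeshGatewayMode = "remote"
)

// skipUpstreamWatch is a hypothetical helper: when the upstream lives in a
// different DC and the effective gateway mode is local, the sidecar routes
// through the local mesh gateway, so the cross-DC health/discovery-chain
// watch (which a local token cannot authorize) can be skipped entirely.
func skipUpstreamWatch(upstreamDC, sourceDC string, mode MeshGatewayMode) bool {
	return upstreamDC != "" && upstreamDC != sourceDC && mode == MeshGatewayModeLocal
}

func main() {
	fmt.Println(skipUpstreamWatch("dc2", "dc1", MeshGatewayModeLocal))  // true: no cross-DC RPC needed
	fmt.Println(skipUpstreamWatch("dc2", "dc1", MeshGatewayModeRemote)) // false: still needs the watch
}
```

The real fix would live inside the loop above, consulting the overlaid per-upstream mesh gateway mode before calling s.cache.Notify.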

Future

In order to use mesh gateways in remote mode or to use tokens from consul login to make cross-dc calls, we need another solution. We could make the consul login tokens global, but then login would require the primary DC to be available. A better long-term solution would be to federate trust such that locally minted tokens can be trusted globally.

@jorgemarey
Contributor

Hi, I just found out about this while opening hashicorp/nomad#8063. Would it be possible to fix at least the local mode of gateways for now? We're using Connect with Nomad and are considering federating our Consul clusters, but we don't want Nomad to stop working (unable to start new allocations) if the primary DC is unreachable. When starting new allocations, Nomad currently requests a global token from Consul.

@mkeeler
Member

mkeeler commented Sep 14, 2020

@jorgemarey There are two things at work in secondary DCs that make the primary DC's unreachability less of a concern.

First, you can enable token replication in the secondary DC (the acl.enable_token_replication configuration item). When this is enabled, all global tokens are replicated to the secondary DC by a background routine, and token resolution no longer requires calls to the primary DC.

Second, client agents and servers without token replication enabled cache the results of token resolution. How long those cached results live depends on the value of the acl.token_ttl configuration item.
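As a concrete illustration of the first option, a secondary-DC server configuration enabling token replication might look like the following. This is a sketch: the replication token value is a placeholder, and it assumes Consul's JSON agent configuration format with the acl.enable_token_replication, acl.token_ttl, and acl.tokens.replication fields.

```json
{
  "datacenter": "dc2",
  "primary_datacenter": "dc1",
  "acl": {
    "enabled": true,
    "enable_token_replication": true,
    "token_ttl": "30s",
    "tokens": {
      "replication": "<token-with-sufficient-privileges-in-dc1>"
    }
  }
}
```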

The core problem here is that the mesh gateway setup requires making RPC requests to the other DCs to discover the mesh gateways running there. Local tokens, such as those created by auth methods and consul login, are only valid within the local DC, so all of those cross-DC requests are guaranteed to fail with "ACL not found" errors.

We have been exploring alternative approaches to multi-DC federation, especially for ACLs, to get around these problems. There isn't anything to report on those efforts yet, but the team is aware of the problem and is working toward a solution.

@mikemorris mikemorris added the needs-discussion Topic needs discussion with the larger Consul maintainers before committing to for a release label Feb 1, 2021
@david-yu
Contributor

david-yu commented Mar 9, 2021

Hi folks, Consul Kubernetes PM here doing some research on this feature request related to using mesh gateways in local mode for cross-datacenter service communication.

I would love to chat with you about your use case for mesh gateways and your feedback around ACLs, and to understand the architecture of your Consul deployment.
Please use this link to schedule a 30-minute chat about this issue: https://calendly.com/dyu-hashicorp/cross-dc-communication-using-consul-acl-tokens. I'm based in PST, but I can adjust my availability on a case-by-case basis!

@jorgemarey
Contributor

Hi, any updates on this? Just saw this comment on the nomad issue: hashicorp/nomad#8063 (comment)
