
[question] Change consul SI tokens to be local? #8063

Closed · jorgemarey opened this issue May 27, 2020 · 13 comments · Fixed by #12586
@jorgemarey (Contributor) commented May 27, 2020

Hi! I was testing Nomad + Consul Connect in some test clusters. I tried shutting down the primary Consul datacenter and saw that I wasn't able to run any Connect task, because the Consul SI token can't be created while it is global. Does it make sense for the token to be local, so that if the primary DC fails for some reason, other datacenters can still work properly?

I will try changing the code here to set Local: true and test it.

https://github.com/hashicorp/nomad/blob/master/nomad/consul.go#L220
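
For reference, the change being tested boils down to setting the Local flag on the token create request. Below is a minimal sketch against the Consul Go API (github.com/hashicorp/consul/api), not the actual Nomad code: the description and service name are hypothetical placeholders, and nomad/consul.go carries much more context.

package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Create a service identity token that stays in the local datacenter
	// instead of being created in, and replicated from, the primary DC.
	token, _, err := client.ACL().TokenCreate(&api.ACLToken{
		Description: "si-token: example-task", // hypothetical description
		ServiceIdentities: []*api.ACLServiceIdentity{
			{ServiceName: "example-service"}, // hypothetical service name
		},
		Local: true, // the change under discussion
	}, nil)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("created local SI token:", token.AccessorID)
}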

Edit: to clarify, everything that's already running keeps running OK, but new allocations can't be deployed.

@shoenig (Member) commented May 27, 2020

Good catch, @jorgemarey!

Considering each SI token is tied to the allocation in which a task is running, and each SI token is created/destroyed only by the Nomad Servers associated with the cluster of the requesting Nomad Client, yeah I think it makes sense to create them as local tokens.

Did your test work out as expected?

@jorgemarey (Contributor Author)

Hey @shoenig. Just tested it and everything seems to work fine. The tokens are created locally, and I was able to deploy new allocations even when the primary DC was down.

I don't know if this has any other implications, but I could make a PR with this change if you think that's ok.

@shoenig (Member) commented May 28, 2020

A PR would be great @jorgemarey , thanks!

@shoenig shoenig added this to Needs Triage in Nomad - Community Issues Triage via automation May 28, 2020
@shoenig shoenig added this to the 0.11.3 milestone May 28, 2020
@shoenig (Member) commented May 28, 2020

Ahh, so unfortunately there is a problem with using Local: true tokens right now: if the Connect service tries to contact an upstream in a remote datacenter, the local SI token won't be accepted. The issue is tracked on the Consul side in hashicorp/consul#7381 and hashicorp/consul#7899.

We'll keep this issue and PR open for now, and merge it when the functionality is in place.

@shoenig shoenig modified the milestones: 0.11.3, 0.12.0 May 28, 2020
@shoenig shoenig moved this from Needs Triage to Done in Nomad - Community Issues Triage May 28, 2020
@jorgemarey (Contributor Author)

Thanks @shoenig, I'll be looking forward to this fix. We're working on federating our Consul DCs, but we don't want Nomad to fail (i.e., fail to run new allocations) if the connection to the primary Consul DC is lost.

@shoenig (Member) commented May 12, 2021

We should revisit this; according to the Consul team, this may just work now, as long as the remote DC's Consul default agent token is privileged enough for agent:read.

@shoenig shoenig moved this from Done to Needs Roadmapping in Nomad - Community Issues Triage May 12, 2021
@jorgemarey (Contributor Author)

Hi @shoenig, any news on this?

We're having some problems, and I think they're due to this. Sometimes the Envoy sidecar fails to start (after a few tries it starts correctly); I think it's because token creation happens in the primary datacenter, so until the token gets replicated the task fails to start.

@jorgemarey (Contributor Author)

Hi, sorry to ping again over here. Any news on this?

@shoenig (Member) commented Apr 15, 2022

Hey @jorgemarey, sorry this has taken 2 years, but we may be in a reasonable place now to make the switch. The fundamental change comes from hashicorp/consul#7414, which shipped in Consul 1.8 (a release now beyond EOL, so any supported Consul version includes it). As described in that issue, the implication is that the Consul agent in the remote DC will now require its anonymous ACL token to contain the permissions

service_prefix "" { policy = "read" }
node_prefix    "" { policy = "read" }

but that's not an unusual configuration (see acl.tokens.default).
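
For anyone applying this, the anonymous token's permissions can also be granted through the Consul API. A minimal sketch, assuming ACLs are already bootstrapped: the policy name is a hypothetical placeholder, while the accessor ID is Consul's well-known ID for the built-in anonymous token.

package main

import (
	"log"

	"github.com/hashicorp/consul/api"
)

// Well-known accessor ID of Consul's built-in anonymous token.
const anonymousTokenID = "00000000-0000-0000-0000-000000000002"

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Create a policy carrying the read permissions quoted above.
	policy, _, err := client.ACL().PolicyCreate(&api.ACLPolicy{
		Name: "anonymous-read", // hypothetical policy name
		Rules: `service_prefix "" { policy = "read" }
node_prefix    "" { policy = "read" }`,
	}, nil)
	if err != nil {
		log.Fatal(err)
	}

	// Attach the policy to the built-in anonymous token.
	if _, _, err := client.ACL().TokenUpdate(&api.ACLToken{
		AccessorID: anonymousTokenID,
		Policies:   []*api.ACLTokenPolicyLink{{ID: policy.ID}},
	}, nil); err != nil {
		log.Fatal(err)
	}
}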

@shoenig shoenig self-assigned this Apr 15, 2022
@jorgemarey (Contributor Author)

Hi @shoenig. That's great. We have several federated DCs distributed all over the world, so this change will allow me to sleep better :D (we never had any problem related to this, but the fear of losing contact with the primary and not being able to deploy on a secondary is there).
The requirements on the anonymous ACL token are OK by me. The only thing I wonder about is that these tokens are also generated for the mesh gateways to use. I don't know if changing this will affect the deployment of mesh gateways with Nomad, or whether Consul needs global tokens for the mesh gateways to work.

Nomad - Community Issues Triage automation moved this from Needs Roadmapping to Done Apr 19, 2022
@shoenig shoenig added this to the 1.3.0 milestone Apr 19, 2022
@MagicRB commented Jul 30, 2022

Even after setting the default token to something with sufficient ACLs (I've verified this), Consul Connect still doesn't work, nor does accessing dc-1 as proxied by a server in dc-2 with a local token. My guess is that the token stripping isn't taking place as it should.

Jul 30 09:06:39 toothpick consul[363230]: 2022-07-30T09:06:39.273+0200 [ERROR] agent.http: Request error: method=GET url=/v1/internal/ui/services?dc=dc-1 from=x:42152 error="rpc error making call: ACL not found"
Jul 30 09:06:51 toothpick consul[363230]: 2022-07-30T09:06:51.125+0200 [ERROR] agent.http: Request error: method=GET url=/v1/acl/token/self?dc=dc-1 from=x:42152 error="ACL not found"
Jul 30 09:06:54 toothpick consul[363230]: 2022-07-30T09:06:54.395+0200 [ERROR] agent.http: Request error: method=POST url=/v1/internal/acl/authorize?dc=dc-2 from=x.x.x.x:42152 error="ACL not found"
Jul 30 09:06:54 toothpick consul[363230]: 2022-07-30T09:06:54.452+0200 [ERROR] agent.http: Request error: method=GET url=/v1/internal/ui/services?dc=dc-2 from=x:42152 error="ACL not found"

If I log into Consul through dc-2 with the Nomad-generated token that is set local, I can see no services, so it looks like an ACL issue. Said token works in dc-1. The anonymous token I generated works everywhere.

@MagicRB commented Jul 30, 2022

Indeed, reverting this change fixes all my issues.

@github-actions (bot)

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 28, 2022