clients require node:read on anon ACL policy to join cluster via Consul discovery (1.5.1) #16470

Closed
CarelvanHeerden opened this issue Mar 13, 2023 · 8 comments · Fixed by #16490


@CarelvanHeerden

Nomad version

v1.5.1

Operating system and Environment details

Ubuntu 20.04

Issue

Our test system was running v1.4.3 with the ACL system configured. We upgraded to v1.5.1 using a rolling upgrade (replacing all servers and clients).
The servers started up correctly and joined the cluster.
The clients, however, could not join and failed with an RPC error: Permission denied.

The only way to get the clients to join was to add back the anonymous policy.

Client Config

/etc/nomad.d/acl.hcl

acl {
  enabled = true
}

/etc/nomad.d/client.hcl

client {
  enabled = true
  disable_remote_exec = true
  max_kill_timeout = "90s"
  options = {
    docker.privileged.enabled = true
    docker.volumes.enabled = true
  }
  reserved {
    cpu = 500
    memory = 512
  }
}

/etc/nomad.d/nomad.hcl

datacenter = "DMZ"
data_dir = "/opt/nomad"
region = "sa"
telemetry {
  collection_interval = "1s"
  disable_hostname = true
  prometheus_metrics = true
  publish_allocation_metrics = true
  publish_node_metrics = true
}

Server Config

/etc/nomad.d/acl.hcl

acl {
  enabled = true
}

/etc/nomad.d/server.hcl

server {
  enabled = true
  bootstrap_expect = 3
  default_scheduler_config {
    scheduler_algorithm = "spread"
    memory_oversubscription_enabled = true
    preemption_config {
      batch_scheduler_enabled    = true
      enable_event_broker        = true
      system_scheduler_enabled   = true
      service_scheduler_enabled  = true
      sysbatch_scheduler_enabled = true
    }
  }
}

/etc/nomad.d/nomad.hcl

datacenter = "GENERAL"
data_dir = "/opt/nomad"
region = "sa"
telemetry {
  collection_interval = "1s"
  disable_hostname = true
  prometheus_metrics = true
  publish_allocation_metrics = true
  publish_node_metrics = true
}

CLI output

nomad acl token list

Name             Type        Global  Accessor ID                           Expired
Bootstrap Token  management  true    4a03f056-XXXXX-XXXXX-XXXXX  false
Terraform        management  true    6e26d344-XXXXX-XXXXX-XXXXX  false

Reproduction steps

Upgrade working Nomad Cluster from 1.4.3 to 1.5.1

Expected Result

Clients should be able to join the cluster without the need for the Anonymous policy

Actual Result

Clients unable to join the cluster

Nomad Client logs (if appropriate)

/usr/local/bin/nomad agent -config /etc/nomad.d -log- 
==> Loaded configuration from /etc/nomad.d/acl.hcl, /etc/nomad.d/client.hcl, /etc/nomad.d/consul.hcl, /etc/nomad.d/nomad.hcl, /etc/nomad.d/vault.hcl
==> Starting Nomad agent...
==> Nomad agent configuration:

       Advertise Addrs: HTTP: 172.1.5.5:4646
            Bind Addrs: HTTP: [0.0.0.0:4646]
                Client: true
             Log Level: DEBUG
                Region: sa (DC: DMZ)
                Server: false
               Version: 1.5.1

==> Nomad agent started! Log data will stream in below:

    2023-03-13T20:17:42.046Z [WARN]  agent.plugin_loader: skipping external plugins since plugin_dir doesn't exist: plugin_dir=/opt/nomad/plugins
    2023-03-13T20:17:42.046Z [DEBUG] agent.plugin_loader.docker: using client connection initialized from environment: plugin_dir=/opt/nomad/plugins
    2023-03-13T20:17:42.046Z [DEBUG] agent.plugin_loader.docker: using client connection initialized from environment: plugin_dir=/opt/nomad/plugins
    2023-03-13T20:17:42.046Z [ERROR] agent.plugin_loader.docker: failed to list pause containers: plugin_dir=/opt/nomad/plugins error=<nil>
    2023-03-13T20:17:42.050Z [INFO]  agent: detected plugin: name=raw_exec type=driver plugin_version=0.1.0
    2023-03-13T20:17:42.050Z [INFO]  agent: detected plugin: name=exec type=driver plugin_version=0.1.0
    2023-03-13T20:17:42.050Z [INFO]  agent: detected plugin: name=qemu type=driver plugin_version=0.1.0
    2023-03-13T20:17:42.050Z [INFO]  agent: detected plugin: name=java type=driver plugin_version=0.1.0
    2023-03-13T20:17:42.050Z [INFO]  agent: detected plugin: name=docker type=driver plugin_version=0.1.0
    2023-03-13T20:17:42.051Z [ERROR] agent.plugin_loader.docker: failed to list pause containers: plugin_dir=/opt/nomad/plugins error=<nil>
    2023-03-13T20:17:42.052Z [WARN]  client.cpuset.v1: failed to ensure reserved cpuset.cpus interface exists; disable cpuset management: error="mkdir /sys/fs/cgroup/cpuset/nomad/reserved: file exists"
    2023-03-13T20:17:42.052Z [INFO]  client: using state directory: state_dir=/opt/nomad/client
    2023-03-13T20:17:42.052Z [INFO]  client: using alloc directory: alloc_dir=/opt/nomad/alloc
    2023-03-13T20:17:42.052Z [INFO]  client: using dynamic ports: min=20000 max=32000 reserved=""
    2023-03-13T20:17:42.089Z [DEBUG] client.fingerprint_mgr: built-in fingerprints: fingerprinters=["arch", "bridge", "cgroup", "cni", "consul", "cpu", "host", "landlock", "memory", "network", "nomad", "plugins_cni", "signal", "storage", "vault", "env_azure", "env_digitalocean", "env_aws", "env_gce"]
    2023-03-13T20:17:42.090Z [INFO]  client.fingerprint_mgr.cgroup: cgroups are available
    2023-03-13T20:17:42.090Z [DEBUG] client.fingerprint_mgr: CNI config dir is not set or does not exist, skipping: cni_config_dir=/opt/cni/config
    2023-03-13T20:17:42.090Z [DEBUG] client.fingerprint_mgr: fingerprinting periodically: fingerprinter=cgroup initial_period=15s
    2023-03-13T20:17:42.097Z [INFO]  client.fingerprint_mgr.consul: consul agent is available
    2023-03-13T20:17:42.098Z [DEBUG] client.fingerprint_mgr: fingerprinting periodically: fingerprinter=consul initial_period=15s
    2023-03-13T20:17:42.098Z [DEBUG] client.fingerprint_mgr.cpu: detected cpu frequency: MHz=2294
    2023-03-13T20:17:42.098Z [DEBUG] client.fingerprint_mgr.cpu: detected core count: cores=2
    2023-03-13T20:17:42.098Z [DEBUG] client.fingerprint_mgr.cpu: detected reservable cores: cpuset=[0, 1]
    2023-03-13T20:17:42.106Z [DEBUG] client.fingerprint_mgr.network: link speed detected: interface=eth0 mbits=40000
    2023-03-13T20:17:42.107Z [DEBUG] client.fingerprint_mgr.network: detected interface IP: interface=eth0 IP=172.1.5.5
    2023-03-13T20:17:42.111Z [WARN]  client.fingerprint_mgr.network: unable to parse speed: path=/usr/sbin/ethtool device=lo
    2023-03-13T20:17:42.111Z [DEBUG] client.fingerprint_mgr.network: unable to read link speed: path=/sys/class/net/lo/speed device=lo
    2023-03-13T20:17:42.111Z [DEBUG] client.fingerprint_mgr.network: link speed could not be detected, falling back to default speed: interface=lo mbits=1000
    2023-03-13T20:17:42.126Z [WARN]  client.fingerprint_mgr.network: unable to parse speed: path=/usr/sbin/ethtool device=docker0
    2023-03-13T20:17:42.126Z [DEBUG] client.fingerprint_mgr.network: unable to parse link speed: path=/sys/class/net/docker0/speed device=docker0
    2023-03-13T20:17:42.126Z [DEBUG] client.fingerprint_mgr.network: link speed could not be detected, falling back to default speed: interface=docker0 mbits=1000
    2023-03-13T20:17:42.199Z [INFO]  client.fingerprint_mgr.vault: Vault is available
    2023-03-13T20:17:42.199Z [DEBUG] client.fingerprint_mgr: fingerprinting periodically: fingerprinter=vault initial_period=41.545320649s
    2023-03-13T20:17:42.206Z [DEBUG] client.fingerprint_mgr.env_gce: could not read value for attribute: attribute=machine-type resp_code=404
    2023-03-13T20:17:42.250Z [DEBUG] client.fingerprint_mgr.env_azure: read an empty value: attribute=public-ipv4
    2023-03-13T20:17:42.258Z [DEBUG] client.fingerprint_mgr.env_azure: could not read value for attribute: attribute=network/interface/0/ipv6/ipAddress/0/privateIpAddress resp_code=404
    2023-03-13T20:17:42.258Z [DEBUG] client.fingerprint_mgr.env_azure: read an empty value: attribute=local-ipv6
    2023-03-13T20:17:42.277Z [DEBUG] client.fingerprint_mgr.env_azure: could not read value for attribute: attribute=network/interface/0/ipv6/ipAddress/0/publicIpAddress resp_code=404
    2023-03-13T20:17:42.277Z [DEBUG] client.fingerprint_mgr.env_azure: read an empty value: attribute=public-ipv6
    2023-03-13T20:17:42.350Z [DEBUG] client.fingerprint_mgr.env_digitalocean: could not read value for attribute: attribute=region resp_code=400
    2023-03-13T20:17:42.350Z [DEBUG] client.fingerprint_mgr: detected fingerprints: node_attrs=["arch", "bridge", "cgroup", "consul", "cpu", "host", "network", "nomad", "plugins_cni", "signal", "storage", "vault", "env_azure"]
    2023-03-13T20:17:42.351Z [INFO]  client.plugin: starting plugin manager: plugin-type=csi
    2023-03-13T20:17:42.351Z [INFO]  client.plugin: starting plugin manager: plugin-type=driver
    2023-03-13T20:17:42.351Z [INFO]  client.plugin: starting plugin manager: plugin-type=device
    2023-03-13T20:17:42.351Z [DEBUG] client.device_mgr: exiting since there are no device plugins
    2023-03-13T20:17:42.351Z [DEBUG] client.driver_mgr: initial driver fingerprint: driver=exec health=healthy description=Healthy
    2023-03-13T20:17:42.351Z [DEBUG] client.plugin: waiting on plugin manager initial fingerprint: plugin-type=driver
    2023-03-13T20:17:42.351Z [DEBUG] client.plugin: waiting on plugin manager initial fingerprint: plugin-type=device
    2023-03-13T20:17:42.351Z [DEBUG] client.plugin: finished plugin manager initial fingerprint: plugin-type=device
    2023-03-13T20:17:42.352Z [DEBUG] client.driver_mgr: initial driver fingerprint: driver=raw_exec health=undetected description=disabled
    2023-03-13T20:17:42.352Z [DEBUG] client.driver_mgr: initial driver fingerprint: driver=qemu health=undetected description=""
    2023-03-13T20:17:42.352Z [DEBUG] client.driver_mgr: initial driver fingerprint: driver=java health=undetected description=""
    2023-03-13T20:17:42.355Z [DEBUG] client.consul: bootstrap contacting Consul DCs: consul_dcs=["southafricanorth"]
    2023-03-13T20:17:42.379Z [ERROR] client: error discovering nomad servers:
  error=
  | 5 errors occurred:
  | \t* rpc error: Permission denied
  | \t* rpc error: Permission denied
  | \t* rpc error: Permission denied
  | \t* rpc error: Permission denied
  | \t* rpc error: Permission denied
  | 
  
    2023-03-13T20:17:42.388Z [DEBUG] client.driver_mgr: initial driver fingerprint: driver=docker health=healthy description=Healthy
    2023-03-13T20:17:42.388Z [DEBUG] client.driver_mgr: detected drivers: drivers="map[healthy:[exec docker] undetected:[raw_exec qemu java]]"
    2023-03-13T20:17:42.388Z [DEBUG] client.plugin: finished plugin manager initial fingerprint: plugin-type=driver
    2023-03-13T20:17:42.388Z [INFO]  client: started client: node_id=aaef0be8-ae81-7420-d056-c44a72adb7a4
    2023-03-13T20:17:42.389Z [WARN]  client.server_mgr: no servers available
    2023-03-13T20:17:42.389Z [DEBUG] http: UI is enabled
    2023-03-13T20:17:42.389Z [WARN]  client.server_mgr: no servers available
    2023-03-13T20:17:42.389Z [DEBUG] http: UI is enabled
    2023-03-13T20:17:42.445Z [WARN]  client.server_mgr: no servers available
    2023-03-13T20:17:42.477Z [DEBUG] consul.sync: sync complete: registered_services=1 deregistered_services=0 registered_checks=1 deregistered_checks=0
    2023-03-13T20:17:42.569Z [WARN]  client.server_mgr: no servers available
    2023-03-13T20:17:42.609Z [WARN]  client.server_mgr: no servers available
    2023-03-13T20:17:42.842Z [WARN]  client.server_mgr: no servers available
    2023-03-13T20:17:42.874Z [WARN]  client.server_mgr: no servers available
    2023-03-13T20:17:42.926Z [WARN]  client.server_mgr: no servers available
    2023-03-13T20:17:43.101Z [WARN]  client.server_mgr: no servers available
    2023-03-13T20:17:43.204Z [WARN]  client.server_mgr: no servers available
    2023-03-13T20:17:43.218Z [WARN]  client.server_mgr: no servers available
    2023-03-13T20:17:43.218Z [WARN]  client.server_mgr: no servers available
    2023-03-13T20:17:43.311Z [WARN]  client.server_mgr: no servers available
    2023-03-13T20:17:43.371Z [WARN]  client.server_mgr: no servers available
    2023-03-13T20:17:43.586Z [WARN]  client.server_mgr: no servers available
    2023-03-13T20:17:43.592Z [WARN]  client.server_mgr: no servers available
    2023-03-13T20:17:43.666Z [WARN]  client.server_mgr: no servers available
    2023-03-13T20:17:43.762Z [WARN]  client.server_mgr: no servers available
    2023-03-13T20:17:43.819Z [WARN]  client.server_mgr: no servers available
    2023-03-13T20:17:43.883Z [WARN]  client.server_mgr: no servers available
    2023-03-13T20:17:43.890Z [WARN]  client.server_mgr: no servers available
    2023-03-13T20:17:43.927Z [WARN]  client.server_mgr: no servers available
    2023-03-13T20:17:44.015Z [WARN]  client.server_mgr: no servers available
    2023-03-13T20:17:44.054Z [WARN]  client.server_mgr: no servers available
    2023-03-13T20:17:44.145Z [WARN]  client.server_mgr: no servers available
    2023-03-13T20:17:44.321Z [WARN]  client.server_mgr: no servers available
    2023-03-13T20:17:44.351Z [WARN]  client.server_mgr: no servers available
    2023-03-13T20:17:44.490Z [WARN]  client.server_mgr: no servers available
    2023-03-13T20:17:44.570Z [WARN]  client.server_mgr: no servers available
    2023-03-13T20:17:44.612Z [WARN]  client.server_mgr: no servers available
    2023-03-13T20:17:44.617Z [WARN]  client.server_mgr: no servers available
    2023-03-13T20:17:44.634Z [WARN]  client.server_mgr: no servers available
^C==> Caught signal: interrupt
    2023-03-13T20:17:44.756Z [INFO]  agent: requesting shutdown
    2023-03-13T20:17:44.756Z [INFO]  client: shutting down
    2023-03-13T20:17:44.756Z [INFO]  client.plugin: shutting down plugin manager: plugin-type=device
    2023-03-13T20:17:44.756Z [INFO]  client.plugin: plugin manager finished: plugin-type=device
    2023-03-13T20:17:44.756Z [INFO]  client.plugin: shutting down plugin manager: plugin-type=driver
    2023-03-13T20:17:44.756Z [DEBUG] client.vault: stopped
    2023-03-13T20:17:44.834Z [WARN]  client.server_mgr: no servers available
    2023-03-13T20:17:44.873Z [INFO]  client.plugin: plugin manager finished: plugin-type=driver
    2023-03-13T20:17:44.873Z [INFO]  client.plugin: shutting down plugin manager: plugin-type=csi
    2023-03-13T20:17:44.873Z [INFO]  client.plugin: plugin manager finished: plugin-type=csi
    2023-03-13T20:17:44.873Z [DEBUG] client: registration waiting on servers
    2023-03-13T20:17:44.873Z [DEBUG] client.server_mgr: shutting down
    2023-03-13T20:17:44.920Z [INFO]  agent: shutdown complete
    2023-03-13T20:17:44.921Z [DEBUG] http: shutting down http server
@tgross
Member

tgross commented Mar 13, 2023

Hi @CarelvanHeerden! I'm pretty sure this has to do with #16217 where the node is now hitting our Status.Members RPC instead of Status.Peers when we're using Consul service discovery. I forgot to add the same exception we use for other endpoints that allow the node to authenticate with its own secret.

As a workaround, you can tighten the anonymous policy to just the following:

node {
  policy = "read"
}

That limits anonymous requests to node:read, but obviously we need to fix the underlying bug. I'm at the end of my day here but I'll circle back to this tomorrow to see what the best approach to fix is.

@tgross self-assigned this on Mar 13, 2023
@tgross added this to "Needs Triage" in "Nomad - Community Issues Triage" via automation on Mar 13, 2023
@tgross moved this from "Needs Triage" to "Triaging" in "Nomad - Community Issues Triage" on Mar 13, 2023
@tgross added this to the 1.5.x milestone on Mar 13, 2023
@CarelvanHeerden
Author

Thanks so much @tgross

@hynek
Contributor

hynek commented Mar 14, 2023

JFTR for those Googling, this problem happens when updating from 1.5.0 to 1.5.1, too.

@tgross changed the title from "Upgrading from v1.4.3 to v1.5.1, clients can no longer join cluster. rpc error: Permission denied" to "clients require node:read on anon ACL to join cluster via Consul discovery (1.5.1)" on Mar 14, 2023
@tgross changed the title from "clients require node:read on anon ACL to join cluster via Consul discovery (1.5.1)" to "clients require node:read on anon ACL policy to join cluster via Consul discovery (1.5.1)" on Mar 14, 2023
@tgross
Member

tgross commented Mar 14, 2023

I've retitled the issue to reflect the specific problem and workaround; it's not related to upgrades per se and if you start a new cluster from scratch on 1.5.1 you'll hit the same problem. I'll have a patch up later today and we'll get that out in the next patch release of Nomad.

Some notes:

  • The main workaround is to add node:read to the anonymous policy.
  • Alternate workarounds that should work are to use one of the cloud auto-join options, set a specific IP address in the server_join block, or manually join the clients.
  • Once a client has joined, it should be fine thereafter because it gets the updated list of servers from the server it sends heartbeats to. (Although if it restarts you'd need to re-join it.)
  • Note that the data available from the various node RPC endpoints that node:read allows access to will still be protected via mTLS, which is required for secure operation.
  • How'd we miss this in testing? I added multi-home networking to our E2E suite in order to test "client: use RPC address and not serf after initial Consul discovery" (#16217) and it worked fine there with ACLs and mTLS enabled. Unfortunately I missed that our E2E environment has node:read on the anonymous policy. So the workaround we have here is what was running in the test environment. I've opened "E2E: tighten anonymous ACL policy" (#16483) to clean that up in the near future.
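
For anyone trying the alternate workarounds from the notes above, the client configuration could look roughly like the following. This is only a sketch: the address 10.0.0.10, the retry interval, and the cloud tag key and value are placeholders rather than values from this issue, and 4647 is Nomad's default RPC port.

```hcl
client {
  enabled = true

  server_join {
    # Hypothetical server address; replace with one of your own servers.
    # 4647 is Nomad's default RPC port.
    retry_join     = ["10.0.0.10:4647"]
    retry_interval = "15s"

    # Cloud auto-join alternative (tag key and value are placeholders):
    # retry_join = ["provider=aws tag_key=nomad-role tag_value=server"]
  }
}
```

With a static or cloud-discovered address, the client should not need Consul discovery for its first contact with the servers, which is the code path affected here.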

@tgross
Member

tgross commented Mar 14, 2023

I've got a draft PR #16490 up with the fix. It turned out to be a little more complicated than I expected, because the problem wasn't simply that we didn't add the client secret but that the client secret isn't meaningful to the server at that point -- we haven't registered yet! I've got another round of end-to-end testing to wrap up tomorrow morning with this and then it should be ready to land.

@dcarbone

dcarbone commented Mar 16, 2023

For what it's worth, attempting to manually join a 1.5.1 client to my 1.5.1 server cluster using either "nomad node config -servers xxx" or "nomad node config -update-servers xxx" results in "Error updating server list: Unexpected response code: 500 (no servers)".

Defining a retry_join list works just fine, however.

@tgross
Member

tgross commented Mar 16, 2023

Still working on #16490, as it's just been a bit more complicated than we wanted. The resulting changes should actually reduce round-trips to the server as part of client startup though.

> For what it's worth, attempting to manually join a 1.5.1 client to my 1.5.1 server cluster using either nomad node config -servers xxx or nomad node config -update-servers xxx results in Error updating server list: Unexpected response code: 500 (no servers)

The -update-servers flag is what you want there. Looking at the setServersImpl method where that error originates, I don't see too many code paths that could return that error without including another one. What do the client logs say when you do that?

@tgross
Member

tgross commented Mar 16, 2023

#16490 has been merged and will ship in the next patch version. We're still working on figuring out the schedule for that.
