workload identity auth failure when audit is enabled #15768

Closed
louievandyke opened this issue Jan 13, 2023 · 3 comments
Labels
hcc/cst Admin - internal hcc/jira stage/accepted Confirmed, and intend to work on. No timeline committment though. theme/auth theme/enterprise Issues related to Enterprise features theme/workload-identity type/bug

Comments

@louievandyke
Contributor

Nomad version

Output from nomad version

root@ubuntu-focal:/home/vagrant# nomad --version
Nomad v1.4.0+ent (ea16107)

Operating system and Environment details

root@ubuntu-focal:/home/vagrant# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 20.04.2 LTS
Release: 20.04
Codename: focal

Issue

After upgrading to Nomad v1.4.0+ent I noticed that restarting certain jobs resulted in a pending error related to the Nomad service defined in my jobspec template.

Template | Missing: nomad.services

Before the upgrade I had acl { enabled = true } and audit { enabled = true } in both the server and client agent configs. This worked with no issues.

After the upgrade I restarted the two jobs. The first job (sleep1), which registers the Nomad service and then sleeps forever, started fine. The second job (service-discovery), which tries to discover that service from a template, fails and reports the service as missing.

Missing: nomad.service(consensus)

A workaround I found: keep acl { enabled = true } and audit { enabled = true } on the servers, and on the client set acl { enabled = false } while leaving audit { enabled = true }. Service discovery starts working after the client agent is restarted. I believe this may leave only the /v1/client endpoints vulnerable, since you still need a token to reach the UI and to run CLI commands against both the servers and clients.
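As a minimal sketch, the client-side part of that workaround might look like the fragment below. It uses only the acl and audit blocks quoted above. The file name in the comment is an assumption, and the rest of the client agent configuration (client block, data_dir, and so on) is omitted.

```hcl
# client.hcl (hypothetical file name) - client agent only.
# The server agents keep both acl and audit enabled, unchanged.

# Workaround: disable ACLs on the client agent.
acl {
  enabled = false
}

# Audit logging (Enterprise) stays enabled on the client.
audit {
  enabled = true
}
```

The client agent has to be restarted for the change to take effect. Because this turns off ACL enforcement on the client, it is a stopgap for reproducing or bypassing the failure rather than a recommended long-term configuration.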

Reproduction steps

On v1.3.2+ent I ran the two jobspecs pasted below, sleep1 and service-discovery, with acl { enabled = true } and audit { enabled = true } set on both the servers and clients.

job "sleep1" {

  datacenters = ["dc1"]

  meta {
    mydata = "hello2"
  }

  type = "service"

  group "testing1" {

    network {
      mode = "host"

      port "test-port" {
        to     = 80
        static = 80
      }

    }

    task "sleepy-1" {

      service {
        name     = "consensus"
        provider = "nomad"
        port     = "test-port"
        tags     = ["test1-sleep"]
      }

      driver = "exec"
      config {
        command = "sleep"
        args    = ["infinity"]
      }
    }

  }

}

job "service-discovery" {

  datacenters = ["dc1"]

  meta {
    test = "a"
  }

  type = "service"

  group "discovering" {

    task "fill-template" {

      driver = "exec"
      config {
        command = "sleep"
        args    = ["infinity"]
      }

      template {
        data        = <<-EOH
TESTING THINGS

consensus loop test
{{ range $i, $s :=  nomadService "consensus" }}
    Service beacon-{{ $i }} : {{ .Address }}:{{ .Port }};
{{ end }}
EOH
        destination = "local/test.cfg"
      }

    }

  }
}

I then upgraded the binary to Nomad v1.4.0+ent on both the server and the client and restarted the Nomad agents. Both jobs continued to run fine, but when I restarted each of them, the service-discovery job would not start because of errors discovering the service. The only way I found to fix this is to remove the acl block from the client's agent config and restart the Nomad agent.

Expected Result

Audit and ACL behavior should remain consistent across upgrades.

Actual Result

ACL appears to block the client after upgrading to v1.4.0+ent

Job file (if appropriate)

see above

Nomad Server logs (if appropriate)

Some logs related to this behavior...

2023-01-12T18:51:16.847Z [ERROR] http: request failed: method=GET path="/v1/service/consensus?namespace=default&stale=&wait=60000ms" error="rpc error: acl token lookup failed: index error: UUID must be 36 characters" code=500
2023-01-12T18:51:16.847Z [DEBUG] http: request complete: method=GET path="/v1/service/consensus?namespace=default&stale=&wait=60000ms" duration=4.850001ms
2023-01-12T18:51:16.849Z [WARN]  agent: (view) nomad.service(consensus): Unexpected response code: 500 (rpc error: acl token lookup failed: index error: UUID must be 36 characters) (retry attempt 8 after "32s")
2023-01-12T18:51:48.855Z [ERROR] client.rpc: error performing RPC to server: error="rpc error: acl token lookup failed: index error: UUID must be 36 characters" rpc=ACL.ResolveToken server=172.16.66.130:4647

Nomad Client logs (if appropriate)

@lgfa29
Contributor

lgfa29 commented Jan 24, 2023

Hi @louievandyke 👋

I think this may have been fixed by #15140. Would you mind trying this upgrade path but going to 1.4.3+ent instead of 1.4.0+ent?

Thanks!

@lgfa29 lgfa29 self-assigned this Jan 24, 2023
@lgfa29 lgfa29 moved this from Needs Triage to In Progress in Nomad - Community Issues Triage Jan 24, 2023
@louievandyke louievandyke added the hcc/cst Admin - internal label Jan 30, 2023
@louievandyke
Contributor Author

Hi @lgfa29

I just tried the upgrade path going to 1.4.3+ent (from 1.3.2+ent) and I ran into the same situation.

The service discovery fails with the below display messages in the events.

"DisplayMessage": "Missing: nomad.service(consensus)",

"DisplayMessage": "Template failed: nomad.service(consensus): Unexpected response code: 500 (rpc error: acl token lookup failed: index error: UUID must be 36 characters)",

@tgross tgross added theme/enterprise Issues related to Enterprise features stage/needs-verification Issue needs verifying it still exists and removed stage/waiting-reply labels Jun 24, 2024
@tgross tgross changed the title Upgrading to Nomad v1.4.0+ent from Nomad v1.3.2+ent causes ACL issues when audit is enabled. workload identity auth failure when audit is enabled Jun 24, 2024
@tgross tgross added theme/workload-identity hcc/jira stage/accepted Confirmed, and intend to work on. No timeline committment though. and removed stage/needs-verification Issue needs verifying it still exists labels Jun 24, 2024
@tgross
Member

tgross commented Jul 2, 2024

I investigated this and was able to reproduce on 1.4.3+ent but not on main. After doing a git bisect on the Enterprise code base, I determined that this bug was fixed by @schmichael in #16254. That shipped in Nomad 1.5.0, which is currently older than the oldest supported version of Nomad or Nomad Enterprise, so all users on supported versions have the fix.

@tgross tgross closed this as completed Jul 2, 2024