
allow periodic jobs to use workload identity ACL policies #17018

Merged · 3 commits · May 22, 2023

Conversation

@unRob (Contributor) commented Apr 28, 2023


Prevents all calls to the [Task API](https://developer.hashicorp.com/nomad/api-docs/task-api) from a periodically-dispatched job from failing with `403`.

This happens because no policy for the dispatched job ID (one with a `periodic-\d+` suffix) matches the generated token claims, which specifically use the [parent job id](https://github.com/hashicorp/nomad/blob/891999789689240006b2a6076b73ddc67249d48d/nomad/structs/structs.go#L10912C19-L10914).
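For context, a periodic dispatch creates a child job whose ID is the parent ID plus a `periodic-<timestamp>` suffix. A rough Python sketch of the relationship (illustrative only — Nomad's Go code stores `ParentID` explicitly rather than re-deriving it from the child ID):

```python
import re

def parent_job_id(job_id: str) -> str:
    """Strip a periodic-dispatch suffix, if any, to recover the parent job ID.

    Hypothetical helper for illustration; not Nomad's actual implementation.
    """
    return re.sub(r"/periodic-\d+$", "", job_id)

# The workload-identity token claims carry the parent ID...
assert parent_job_id("example/periodic-1682647044") == "example"
# ...while ACL policies were being looked up under the child ID,
# so the two never matched and every Task API call got a 403.
assert parent_job_id("example") == "example"
```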

## Steps to reproduce

Create a policy with

```sh
nomad acl policy apply -namespace default -job example example-job <(cat <<EOF
namespace "default" {
  policy = "write"
}
EOF
)
```

Create a job

```hcl
# example.nomad
job "example" {
  datacenters = ["casa"]
  type = "batch"
  priority = 10

  periodic {
    cron             = "*/15 * * * * *"
    prohibit_overlap = true
  }

  group "example" {
    task "example" {
      driver = "docker"

      config {
        image = "curlimages/curl:7.87.0"
        args = [
          "--unix-socket", "${NOMAD_SECRETS_DIR}/api.sock",
          "-H", "Authorization: Bearer ${NOMAD_TOKEN}",
          "--fail-with-body",
          "--verbose",
          "localhost/v1/client/metadata",
        ]
      }


      identity {
        env = true
        file = false
      }
    }
  }

}
```

Run and dispatch
```sh
nomad run example.nomad 
echo "{}" | nomad job dispatch example -
```

It'll fail with a 403, and upon inspecting the claims we find

```sh
echo "the second part of the JWT, possibly padded with equal signs" | base64 -d | jq
```

```json
{
  "nomad_namespace": "default",
  "nomad_job_id": "example",
  "nomad_allocation_id": "dcc6477e-1b20-afbd-b46a-d3810d198b53",
  "nomad_task": "example",
  "nbf": 1682647044,
  "iat": 1682647044
}
```

but the current code searches for policies attached to the dispatched job ID, `example/periodic-\d+`.
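The claims above can be extracted from any JWT with a short helper that handles the missing base64url padding (a standard-library sketch, not part of Nomad; it does not verify the signature):

```python
import base64
import json

def decode_jwt_payload(token: str) -> dict:
    """Decode the claims (second segment) of a JWT without verifying it."""
    payload = token.split(".")[1]
    # base64url segments omit trailing '=' padding; restore it before decoding
    payload += "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(payload))
```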
@tgross tgross added this to Needs Triage in Nomad - Community Issues Triage via automation May 15, 2023
@tgross tgross self-requested a review May 17, 2023 20:32
@tgross tgross moved this from Needs Triage to Triaging in Nomad - Community Issues Triage May 17, 2023
@tgross tgross self-assigned this May 17, 2023
@tgross tgross added theme/auth type/bug theme/batch Issues related to batch jobs and scheduling labels May 17, 2023
@tgross (Member) commented May 17, 2023

Thanks for the PR @unRob! This looks right to me, but I'm going to pull this down to run some tests to make sure we're not missing something or introducing this fix at the wrong layer. Sorry for the delayed review, I'll try to get this looked at in detail tomorrow.

@unRob (Contributor, Author) commented May 18, 2023

Should I invert the jobId passed to tests here and here? Not sure what to do about the TestRPC_TLS_Enforcement_RPC errors, though; I'll try rebasing.

@tgross (Member) commented May 19, 2023

Hello again @unRob! This PR confused me because I was pretty sure we'd handled the case of the parent ID when we create the identity claims for the job using the parent's ID up front. See structs.go#L10835. And then when I looked at those test failures they sure looked like they were testing what we wanted.

So then I went back and looked at your reproduction and I think the problem is actually the ACL policy you attached. The Node Read Metadata API requires the node:read permission. So if I use the following policy instead:

```hcl
namespace "default" {
  policy = "write"
}

node {
  policy = "read"
}
```

And then I run your example, everything works just as I'd expect:

```sh
$ nomad acl policy apply -namespace default -job example example-job example-job.hcl
Successfully wrote "example-job" ACL policy!

$ nomad job run example.nomad
==> 2023-05-19T11:17:50-04:00: Monitoring evaluation "e0ee0a6e"
    2023-05-19T11:17:50-04:00: Evaluation triggered by job "example"
    2023-05-19T11:17:50-04:00: Allocation "e026f7cd" created: node "beb06965", group "example"
    2023-05-19T11:17:51-04:00: Allocation "e026f7cd" status changed: "pending" -> "running" (Tasks are running)
    2023-05-19T11:17:51-04:00: Evaluation status changed: "pending" -> "complete"
==> 2023-05-19T11:17:51-04:00: Evaluation "e0ee0a6e" finished with status "complete"

$ nomad alloc logs e02
{"Dynamic":{},"Meta":{"connect.proxy_concurrency":"1","connect.sidecar_image":"envoyproxy/envoy:v${NOMAD_envoy_version}","connect.gateway_image":"envoyproxy/envoy:v${NOMAD_envoy_version}","connect.log_level":"info"},"Static":{"connect.sidecar_image":"envoyproxy/envoy:v${NOMAD_envoy_version}","connect.gateway_image":"envoyproxy/envoy:v${NOMAD_envoy_version}","connect.log_level":"info","connect.proxy_concurrency":"1"}}%
```

So I think this is going to be safe to close out. But thanks so much for making the effort!

@tgross tgross closed this May 19, 2023
Nomad - Community Issues Triage automation moved this from Triaging to Done May 19, 2023
@unRob (Contributor, Author) commented May 19, 2023

Thanks for the follow up and example @tgross, and sorry to insist! My simplified example was probably a bad choice to represent the actual problem I'm seeing: a task dispatching another task, with a policy like:

```sh
nomad acl policy apply \
  -namespace default \
  -job example \
  -group example \
  -task example task-service-dispatches-other-job \
  <(cat <<EOF
namespace "default" {
  policy = "read" // also tried with "write" and no capabilities
  capabilities = ["dispatch-job"]
}
node {
  policy = "read"
}
EOF
)
```

for the example job:

```hcl
# example.nomad
job "example" {
  datacenters = ["casa"]
  type = "batch"
  priority = 10

  periodic {
    cron             = "*/15 * * * * *"
    prohibit_overlap = true
  }

  group "example" {
    task "example" {
      driver = "docker"

      config {
        image = "curlimages/curl:7.87.0"
        args = [
          "--fail-with-body", "--verbose",
          "--unix-socket", "${NOMAD_SECRETS_DIR}/api.sock",
          "-H", "Authorization: Bearer ${NOMAD_TOKEN}",
          "-H", "Content-type: application/json",
          "-X", "POST",
          "localhost/v1/job/other-job/dispatch",
          "--data-binary", "{}",
        ]
      }

      identity {
        env = true
        file = false
      }
    }
  }
}
```

and the `other-job` spec:

```hcl
job "other-job" {
  datacenters = ["casa"]
  type = "batch"
  parameterized {}

  group "other-job" {
    task "other-job" {
      // ...
    }
  }
}
```
The dispatch call from inside the task then fails with a `403`:

```
+ curl --fail-with-body --verbose \
  --unix-socket /secrets/api.sock \
  -H 'Authorization: Bearer eyJh...UWDg' \
  -H 'Content-type: application/json' \
  -XPOST localhost/v1/job/other-job/dispatch \
  --data-binary '{}'
Note: Unnecessary use of -X or --request, POST is already inferred.
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying /secrets/api.sock:0...
* Connected to localhost (/volume1/nomad/alloc/f549e581-66ad-a6ab-0c9b-) port 80 (#0)
> POST /v1/job/other-job/dispatch HTTP/1.1
> Host: localhost
> User-Agent: curl/8.0.1
> Accept: */*
> Authorization: Bearer eyJh...UWDg
> Content-type: application/json
> Content-Length: 2
> 
} [2 bytes data]
< HTTP/1.1 403 Forbidden
< Date: Fri, 19 May 2023 16:01:08 GMT
< Content-Length: 17
< Content-Type: text/plain; charset=utf-8
< 
{ [17 bytes data]
```

I can create an issue instead if that's preferred.

@tgross (Member) commented May 19, 2023

Oops, no, you're right @unRob. I must have been rushing and not rebuilt after testing your patch. 😊 I'll reopen the PR. Thanks for pushing back a bit on this!

Now, that leaves the question of why the heck the tests are (a) passing on main and (b) failing on this PR. I'm pretty sure we can't just flip the test assertion, because we're signing the claims for "alloc3", which has a parent ID, and verifying that the claims work against the parent ID and don't work against the alloc's own job ID.
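The property the test is meant to pin down can be sketched roughly like this (hypothetical Python, not the actual Go test; `policies_for_token` stands in for the server-side policy lookup):

```python
def policies_for_token(claims: dict, policies_by_job: dict) -> list:
    """After the fix: workload-identity claims carry the parent job ID,
    so the policy lookup must key on that same ID."""
    return policies_by_job.get(claims["nomad_job_id"], [])

# Claims for a periodically-dispatched alloc are signed with the parent ID...
claims = {"nomad_namespace": "default", "nomad_job_id": "example"}

# ...so a policy attached to the parent job matches,
assert policies_for_token(claims, {"example": ["example-job"]}) == ["example-job"]
# while a policy attached only to the child ID does not.
assert policies_for_token(claims, {"example/periodic-123": ["x"]}) == []
```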

@tgross tgross reopened this May 19, 2023
Nomad - Community Issues Triage automation moved this from Done to Needs Triage May 19, 2023
@tgross (Member) commented May 19, 2023

Ok @unRob, I've figured out why the tests failed, and it's very silly. We're setting the parent of alloc3 to be the job ID of alloc1 at L50, but then we attach a policy to the parent job that changes access to its own variables at L88-L100, so that has a ripple effect down to the allocation we're testing. That unfortunately lined up with exactly the opposite of what we wanted, and that's how the bug slipped through.

@tgross (Member) left a review comment

LGTM @unRob! Thanks for sticking with it!

The 1.5.6 release pipeline is already in progress as I write this, so this will get merged once that release is out, and it will land in the next release (with backports to 1.5.x and 1.4.x).

@tgross tgross moved this from Needs Triage to In Progress in Nomad - Community Issues Triage May 19, 2023
@tgross tgross added this to the 1.6.0 milestone May 19, 2023
@tgross tgross added backport/1.4.x backport to 1.4.x release line backport/1.5.x backport to 1.5.x release line labels May 22, 2023
@tgross tgross merged commit a6f15ba into hashicorp:main May 22, 2023
16 checks passed
Nomad - Community Issues Triage automation moved this from In Progress to Done May 22, 2023