Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Backport of csi: prevent panic on volume delete into release/1.6.x #18243

Conversation

hc-github-team-nomad-core
Copy link
Contributor

Backport

This PR is auto-generated from #18234 to be assessed for backporting due to the inclusion of the label backport/1.6.x.

The below text is copied from the body of the original PR.


When a CSI volume is deleted when its plugin is not running, the function volAndPluginLookup returns a nil plugin value resulting a panic in the request handler.

Note to reviewers: the other place where volAndPluginLookup is used already checks for a nil plugin.

nomad/nomad/csi_endpoint.go

Lines 513 to 518 in e21ab7d

// Some plugins support controllers for create/snapshot but not attach. So
// if there's no plugin or the plugin doesn't attach volumes, then we can
// skip the controller publish workflow and return nil.
if plug == nil || !plug.HasControllerCapability(structs.CSIControllerSupportsAttachDetach) {
return nil
}

lgfa29 and others added 30 commits July 11, 2023 10:59
Document and test that if a namespace does not provide an `allow` or
`deny` list than those are treated as `nil` and have a different
behaviour from an empty list (`[]string{}`).
Cannot set a user for raw_exec tasks, because doing so does not work
with the 0700 root owned client data directory that we setup in the e2e
cluster in accordance with the Nomad hardening guide.
* docs: add plugin docs for pledge task driver

Add pledge driver to the set of Community drivers.

* docs: cr feedback
This adds a quick smoke test of our binaries to verify we haven't exceeeded the
maximum GLIBC (2.17) version during linking which would break our ability to
execute on EL7 machines.
* e2e: setup nomad for pledge driver

* e2e: add some e2e tests for pledge task driver
Support for UDS sockets was added to Windows 10.
An ACL policy with a block without label generates unexpected results.
For example, a policy such as this:

```
namespace {
  policy = "read"
}
```

Is applied to a namespace called `policy` instead of the documented
behaviour of applying it to the `default` namespace.

This happens because of the way HCL1 decodes blocks. Since it doesn't
know if a block is expected to have a label it applies the `key` tag to
the content of the block and, in the example above, the first key is
`policy`, so it sets that as the `namespace` block label.

Since this happens internally in the HCL decoder it's not possible to
detect the problem externally.

Fixing the problem inside the decoder is challenging because the JSON
and HCL parsers generate different ASTs that makes impossible to
differentiate between a JSON tree from an invalid HCL tree within the
decoder.

The fix in this commit consists of manually parsing the policy after
decoding to clear labels that were not set in the file. This allows the
validation rules to consistently catch and return any errors, no matter
if the policy is an invalid HCL or JSON.
ACL permissions for the search endpoints are done in three passes. The
first (the `sufficientSearchPerms` method) is for performance and coarsely
rejects requests based on the passed-in context parameter if the user has no
permissions to any object in that context. The second (the
`filteredSearchContexts` method) filters out contexts based on whether the user
has permissions either to the requested namespace or again by context (to catch
the "all" context). Finally, when iterating over the objects available, we do
the usual filtering in the iterator.

Internal testing found several bugs in this filtering:
* CSI plugins can be searched by any authenticated user.
* Variables can be searched if the user has `job:read` permissions to the
  variable's namespace instead of `variable:list`.
* Variables cannot be searched by wildcard namespace.

This is an information leak of the plugin names and variable paths, which we
don't consider to be privileged information but intended to protect anyways.

This changeset fixes these bugs by ensuring CSI plugins are filtered in the 1st
and 2nd pass ACL filters, and changes variables to check `variable:list` in the
2nd pass filter unless the wildcard namespace is passed (at which point we'll
fallback to filtering in the iterator).

Fixes: CVE-2023-3300
Fixes: #17906
sarahethompson and others added 21 commits August 14, 2023 07:08
update copywrite.hcl to exclude MPL subdirs
* build: update to go1.21

* go: eliminate helpers in favor of min/max

* build: run go mod tidy

* build: swap depguard for semgrep

* command: fixup broken tls error check on go1.21
* readme: update readme license badge

* tweak badge color

---------

Co-authored-by: Seth Hoenig <shoenig@duck.com>
…#18191)

Bumps [github.com/shoenig/test](https://github.com/shoenig/test) from 0.6.6 to 0.6.7.
- [Release notes](https://github.com/shoenig/test/releases)
- [Commits](shoenig/test@v0.6.6...v0.6.7)

---
updated-dependencies:
- dependency-name: github.com/shoenig/test
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
the `&` symbol messes up the command when copy pasting into a shell
We've seen test flakiness in the `TestJobEndpoint_Register_NonOverlapping` test,
which asserts that we don't try to placed allocations for blocked evals until
resources have been actually freed by setting the client status of the previous
alloc to complete.

The flaky assertion includes sorting the two allocations by CreateIndex and this
appears to be a non-stable sort in the context of the test run, which results in
failures that shouldn't exist. There's no reason to sort the allocations instead
of just examining them by ID. This changeset does so.
The `TestDrainer_AllTypes_NoDeadline` test has been flaky. It looks like this
might be because the final update of batch allocations to complete is improperly
updating the state store directly rather than by RPC. If the service jobs have
restarted in the meantime, the `allocClientStateSimulator` will have updated the
index on the allocations table and that will prevent the drainer from
unblocking (and being marked complete) when the batch jobs are written with an
earlier index.

This changeset attempts to fix that by making the update via RPC (as it normally
would be in real code).
Allows for multiple `identity{}` blocks for tasks along with user-specified audiences. This is a building block to allow workload identities to be used with Consul, Vault and 3rd party JWT based auth methods.

Expiration is still unimplemented and is necessary for JWTs to be used securely, so that's up next.

---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>
…as root (#18206)

Although nomad officially does not support running the client as a non-root
user, doing so has been more or less possible with the raw_exec driver as
long as you don't expect features to work like networking or running tasks
as specific users. In the cgroups refactoring I bulldozed right over the
special casing we had in place for raw_exec to continue working if the cgroups
were unable to be created. This PR restores that behavior - you can now
(as before) run the nomad client as a non-root user and make use of the
raw_exec task driver.
We use capped exponential backoff in several places in the code when handling
failures. The code we've copy-and-pasted all over has a check to see if the
backoff is greater than the limit, but this check happens after the bitshift and
we always increment the number of attempts. This causes an overflow with a
fairly small number of failures (ex. at one place I tested it occurs after only
24 iterations), resulting in a negative backoff which then never recovers. The
backoff becomes a tight loop consuming resources and/or DoS'ing a Nomad RPC
handler or an external API such as Vault. Note this doesn't occur in places
where we cap the number of iterations so the loop breaks (usually to return an
error), so long as the number of iterations is reasonable.

Introduce a helper with a check on the cap before the bitshift to avoid overflow in all 
places this can occur.

Fixes: #18199
Co-authored-by: stswidwinski <stan.swidwinski@gmail.com>
* lang: note that Stack is not concurrency-safe

* client: use more descriptive name for wrangler hook in logs

* numalib: use correct name for receiver parameter
6747ef8 fixes the Nomad client to support using the raw_exec
driver while running as a non-root user. Remove the use of sudo
in the test-e2e workflow for running integration (vaultcompat)
tests.
This directory and its subdirectories (packages) contain files licensed with the MPLv2 `LICENSE` file in this directory and are intentionally licensed separately from the BSL `LICENSE` file at the root of this repository.

Co-authored-by: hashicorp-copywrite[bot] <110428419+hashicorp-copywrite[bot]@users.noreply.github.com>
@hc-github-team-nomad-core hc-github-team-nomad-core force-pushed the backport/b-fix-csi-panic/tightly-civil-viper branch 2 times, most recently from c626ba0 to a93def1 Compare August 17, 2023 13:58
@hc-github-team-nomad-core hc-github-team-nomad-core merged commit 52e2ad7 into release/1.6.x Aug 17, 2023
@hc-github-team-nomad-core hc-github-team-nomad-core force-pushed the backport/b-fix-csi-panic/tightly-civil-viper branch from 5389a5d to 0e01d9a Compare August 17, 2023 13:58
@hc-github-team-nomad-core hc-github-team-nomad-core deleted the backport/b-fix-csi-panic/tightly-civil-viper branch August 17, 2023 13:58
@github-actions
Copy link

Ember Test Audit comparison

release/1.6.x 0e01d9a change
passes 1506 1506 0
failures 1 1 0
flaky 0 0 0
duration 000ms 000ms -000ms

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet