Backport of csi: prevent panic on volume delete into release/1.6.x #18243

hc-github-team-nomad-core · 2023-08-17T13:58:13Z

Backport

This PR is auto-generated from #18234 to be assessed for backporting due to the inclusion of the label backport/1.6.x.

The below text is copied from the body of the original PR.

When a CSI volume is deleted when its plugin is not running, the function volAndPluginLookup returns a nil plugin value resulting a panic in the request handler.

Note to reviewers: the other place where volAndPluginLookup is used already checks for a nil plugin.

nomad/nomad/csi_endpoint.go

Lines 513 to 518 in e21ab7d

    
           // Some plugins support controllers for create/snapshot but not attach. So 
        
           // if there's no plugin or the plugin doesn't attach volumes, then we can 
        
           // skip the controller publish workflow and return nil. 
        
           if plug == nil || !plug.HasControllerCapability(structs.CSIControllerSupportsAttachDetach) { 
        
           	return nil 
        
           }

Document and test that if a namespace does not provide an `allow` or `deny` list than those are treated as `nil` and have a different behaviour from an empty list (`[]string{}`).

Cannot set a user for raw_exec tasks, because doing so does not work with the 0700 root owned client data directory that we setup in the e2e cluster in accordance with the Nomad hardening guide.

* docs: add plugin docs for pledge task driver Add pledge driver to the set of Community drivers. * docs: cr feedback

Post 1.6.0 rc.1 release

This adds a quick smoke test of our binaries to verify we haven't exceeeded the maximum GLIBC (2.17) version during linking which would break our ability to execute on EL7 machines.

* e2e: setup nomad for pledge driver * e2e: add some e2e tests for pledge task driver

)

Support for UDS sockets was added to Windows 10.

An ACL policy with a block without label generates unexpected results. For example, a policy such as this: ``` namespace { policy = "read" } ``` Is applied to a namespace called `policy` instead of the documented behaviour of applying it to the `default` namespace. This happens because of the way HCL1 decodes blocks. Since it doesn't know if a block is expected to have a label it applies the `key` tag to the content of the block and, in the example above, the first key is `policy`, so it sets that as the `namespace` block label. Since this happens internally in the HCL decoder it's not possible to detect the problem externally. Fixing the problem inside the decoder is challenging because the JSON and HCL parsers generate different ASTs that makes impossible to differentiate between a JSON tree from an invalid HCL tree within the decoder. The fix in this commit consists of manually parsing the policy after decoding to clear labels that were not set in the file. This allows the validation rules to consistently catch and return any errors, no matter if the policy is an invalid HCL or JSON.

ACL permissions for the search endpoints are done in three passes. The first (the `sufficientSearchPerms` method) is for performance and coarsely rejects requests based on the passed-in context parameter if the user has no permissions to any object in that context. The second (the `filteredSearchContexts` method) filters out contexts based on whether the user has permissions either to the requested namespace or again by context (to catch the "all" context). Finally, when iterating over the objects available, we do the usual filtering in the iterator. Internal testing found several bugs in this filtering: * CSI plugins can be searched by any authenticated user. * Variables can be searched if the user has `job:read` permissions to the variable's namespace instead of `variable:list`. * Variables cannot be searched by wildcard namespace. This is an information leak of the plugin names and variable paths, which we don't consider to be privileged information but intended to protect anyways. This changeset fixes these bugs by ensuring CSI plugins are filtered in the 1st and 2nd pass ACL filters, and changes variables to check `variable:list` in the 2nd pass filter unless the wildcard namespace is passed (at which point we'll fallback to filtering in the iterator). Fixes: CVE-2023-3300 Fixes: #17906

update copywrite.hcl to exclude MPL subdirs

* build: update to go1.21 * go: eliminate helpers in favor of min/max * build: run go mod tidy * build: swap depguard for semgrep * command: fixup broken tls error check on go1.21

* readme: update readme license badge * tweak badge color --------- Co-authored-by: Seth Hoenig <shoenig@duck.com>

…#18191) Bumps [github.com/shoenig/test](https://github.com/shoenig/test) from 0.6.6 to 0.6.7. - [Release notes](https://github.com/shoenig/test/releases) - [Commits](shoenig/test@v0.6.6...v0.6.7) --- updated-dependencies: - dependency-name: github.com/shoenig/test dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

the `&` symbol messes up the command when copy pasting into a shell

We've seen test flakiness in the `TestJobEndpoint_Register_NonOverlapping` test, which asserts that we don't try to placed allocations for blocked evals until resources have been actually freed by setting the client status of the previous alloc to complete. The flaky assertion includes sorting the two allocations by CreateIndex and this appears to be a non-stable sort in the context of the test run, which results in failures that shouldn't exist. There's no reason to sort the allocations instead of just examining them by ID. This changeset does so.

The `TestDrainer_AllTypes_NoDeadline` test has been flaky. It looks like this might be because the final update of batch allocations to complete is improperly updating the state store directly rather than by RPC. If the service jobs have restarted in the meantime, the `allocClientStateSimulator` will have updated the index on the allocations table and that will prevent the drainer from unblocking (and being marked complete) when the batch jobs are written with an earlier index. This changeset attempts to fix that by making the update via RPC (as it normally would be in real code).

Allows for multiple `identity{}` blocks for tasks along with user-specified audiences. This is a building block to allow workload identities to be used with Consul, Vault and 3rd party JWT based auth methods. Expiration is still unimplemented and is necessary for JWTs to be used securely, so that's up next. --------- Co-authored-by: Tim Gross <tgross@hashicorp.com>

…as root (#18206) Although nomad officially does not support running the client as a non-root user, doing so has been more or less possible with the raw_exec driver as long as you don't expect features to work like networking or running tasks as specific users. In the cgroups refactoring I bulldozed right over the special casing we had in place for raw_exec to continue working if the cgroups were unable to be created. This PR restores that behavior - you can now (as before) run the nomad client as a non-root user and make use of the raw_exec task driver.

We use capped exponential backoff in several places in the code when handling failures. The code we've copy-and-pasted all over has a check to see if the backoff is greater than the limit, but this check happens after the bitshift and we always increment the number of attempts. This causes an overflow with a fairly small number of failures (ex. at one place I tested it occurs after only 24 iterations), resulting in a negative backoff which then never recovers. The backoff becomes a tight loop consuming resources and/or DoS'ing a Nomad RPC handler or an external API such as Vault. Note this doesn't occur in places where we cap the number of iterations so the loop breaks (usually to return an error), so long as the number of iterations is reasonable. Introduce a helper with a check on the cap before the bitshift to avoid overflow in all places this can occur. Fixes: #18199 Co-authored-by: stswidwinski <stan.swidwinski@gmail.com>

* lang: note that Stack is not concurrency-safe * client: use more descriptive name for wrangler hook in logs * numalib: use correct name for receiver parameter

6747ef8 fixes the Nomad client to support using the raw_exec driver while running as a non-root user. Remove the use of sudo in the test-e2e workflow for running integration (vaultcompat) tests.

This directory and its subdirectories (packages) contain files licensed with the MPLv2 `LICENSE` file in this directory and are intentionally licensed separately from the BSL `LICENSE` file at the root of this repository. Co-authored-by: hashicorp-copywrite[bot] <110428419+hashicorp-copywrite[bot]@users.noreply.github.com>

github-actions · 2023-08-17T14:20:03Z

Ember Test Audit comparison

	release/1.6.x	`0e01d9a`	change
passes	1506	1506	0
failures	1	1	0
flaky	0	0	0
duration	000ms	000ms	-000ms

lgfa29 and others added 30 commits July 11, 2023 10:59

np: update docs and add test for nil lists (#17899)

99fb36e

Document and test that if a namespace does not provide an `allow` or `deny` list than those are treated as `nil` and have a different behaviour from an empty list (`[]string{}`).

website: ignore user .env.* files (#17898)

01cb47b

docs: clarify update stagger description and alternatives (#17896)

ef89b69

docs: clarify using user on raw_exec driver (#17897)

80b9ff6

docs: update podman driver docs with v0.5.0 changes (#17824)

a4d0dcd

e2e: do not set a user for raw_exec tasks (#17901)

fd50f2b

Cannot set a user for raw_exec tasks, because doing so does not work with the 0700 root owned client data directory that we setup in the e2e cluster in accordance with the Nomad hardening guide.

docs: add plugin docs for pledge task driver (#17823)

8253ec8

* docs: add plugin docs for pledge task driver Add pledge driver to the set of Community drivers. * docs: cr feedback

changelog: fix link to unsupported changelog file. (#17913)

3cfa267

Prepare release 1.6.0-rc.1

3656de6

Generate files for 1.6.0-rc.1 release

335bb8b

Prepare for next release

09c89e7

Merge pull request #17914 from hashicorp/post-1.6.0-rc.1-release

3c481d3

Post 1.6.0 rc.1 release

Smoke test binaries for EL7 compatiblity (#17706)

995b936

This adds a quick smoke test of our binaries to verify we haven't exceeeded the maximum GLIBC (2.17) version during linking which would break our ability to execute on EL7 machines.

ci: add copywrite action to check file headers. (#17889)

74335b3

e2e: add some e2e tests for pledge task driver (#17909)

159bf51

* e2e: setup nomad for pledge driver * e2e: add some e2e tests for pledge task driver

Tells the token to be 2 seconds faster in a 5.5 second test (#17924)

4379418

fix allowed metric values documentation for the Nomad APM plugin (#17928

e53955b

)

docs: fix QEMU driver HCL plugin options example formatting. (#17930)

7f5d39f

docs: note windows requirement for workload identity (#17950)

1e7726c

Support for UDS sockets was added to Windows 10.

Fix typos (#17962)

3a68793

copywrite: add placeholder for OSS/ENT ignore split. (#17965)

3abb112

test: enable exec test previously disabled due to CircleCI (#17975)

81c14de

qemu: fix log lines to use correct QEMU capitalization. (#17961)

81aa274

qemu: add test to cover task store functions. (#17967)

0015d25

redact token before passing to sentinel

d23aaed

Generate files for 1.6.0 release

573cab2

Prepare for next release

bc8b4bd

Merge release 1.6.0 files

47fb70b

sarahethompson and others added 21 commits August 14, 2023 07:08

update linux package license to BUSL-1.1 (#18192)

fd1ae34

update copywrite.hcl to exclude MPL subdirs

update go1.21 (#18184)

d9341f0

* build: update to go1.21 * go: eliminate helpers in favor of min/max * build: run go mod tidy * build: swap depguard for semgrep * command: fixup broken tls error check on go1.21

readme: update readme license badge (#18188)

f9d7016

* readme: update readme license badge * tweak badge color --------- Co-authored-by: Seth Hoenig <shoenig@duck.com>

config: add configurable content security policy (#18085)

65d562b

docs: ampersand and bash backgrounding problem (#18175)

a170499

the `&` symbol messes up the command when copy pasting into a shell

build: use modtool to format go.mod file (#18195)

77e139e

followup to numa/cgroups refactor (#18214)

8833452

* lang: note that Stack is not concurrency-safe * client: use more descriptive name for wrangler hook in logs * numalib: use correct name for receiver parameter

test-e2e: no need to run vaultcomat tests as root (#18215)

6fca4fa

6747ef8 fixes the Nomad client to support using the raw_exec driver while running as a non-root user. Remove the use of sudo in the test-e2e workflow for running integration (vaultcompat) tests.

[COMPLIANCE] License update (#18218)

9af2a9b

docs: expand documentation on node pools (#18109)

01d71ca

docs: move glossary to a top-level menu item (#18223)

6d1a2a0

docs: fix job dispatch documentation (#18225)

e21ab7d

backport of commit f65fc0e

7ac37e4

backport of commit a95ceca

0e01d9a

hc-github-team-nomad-core requested review from a team as code owners August 17, 2023 13:58

hc-github-team-nomad-core force-pushed the backport/b-fix-csi-panic/tightly-civil-viper branch 2 times, most recently from c626ba0 to a93def1 Compare August 17, 2023 13:58

hc-github-team-nomad-core merged commit 52e2ad7 into release/1.6.x Aug 17, 2023

hc-github-team-nomad-core force-pushed the backport/b-fix-csi-panic/tightly-civil-viper branch from 5389a5d to 0e01d9a Compare August 17, 2023 13:58

hc-github-team-nomad-core deleted the backport/b-fix-csi-panic/tightly-civil-viper branch August 17, 2023 13:58

vercel bot deployed to Preview – nomad-storybook-and-ui August 17, 2023 14:08 View deployment

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Backport of csi: prevent panic on volume delete into release/1.6.x #18243

Backport of csi: prevent panic on volume delete into release/1.6.x #18243

hc-github-team-nomad-core commented Aug 17, 2023

github-actions bot commented Aug 17, 2023

	// Some plugins support controllers for create/snapshot but not attach. So
	// if there's no plugin or the plugin doesn't attach volumes, then we can
	// skip the controller publish workflow and return nil.
	if plug == nil \|\| !plug.HasControllerCapability(structs.CSIControllerSupportsAttachDetach) {
	return nil
	}

Backport of csi: prevent panic on volume delete into release/1.6.x #18243

Backport of csi: prevent panic on volume delete into release/1.6.x #18243

Conversation

hc-github-team-nomad-core commented Aug 17, 2023

Backport

github-actions bot commented Aug 17, 2023

Ember Test Audit comparison