
UI: Change Run Job availability based on ACLs #5944

Merged

merged 52 commits into master from f-ui/respect-acl on Jan 20, 2020

Conversation

Contributor

@backspace backspace commented Jul 9, 2019

This builds on API changes in #6017 and #6021 to conditionally turn off the
“Run Job” button based on the current token’s capabilities, or the capabilities
of the anonymous policy if no token is present.

If you try to visit the job-run route directly, it redirects to the job list.

Here’s a GIF to get a sense:

[GIF: run-job-acl]
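For orientation, here is a hedged sketch of what the redirect guard in the run route might look like; the file is ui/app/routes/jobs/run.js per the review threads below, but the body shown is illustrative rather than the exact diff:

```js
// ui/app/routes/jobs/run.js (illustrative sketch, not the exact diff)
import Route from '@ember/routing/route';
import { inject as service } from '@ember/service';

export default Route.extend({
  can: service(),

  beforeModel() {
    // If the current token (or the anonymous policy) can't run jobs,
    // bounce back to the job list instead of rendering the run form.
    if (this.can.cannot('run job')) {
      this.transitionTo('jobs');
    }
  },
});
```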

This isn’t a valid solution, but it gets closer to one by
ensuring the token is loaded when the application boots. Then
I can add another step to load and parse the token’s policies.
The endpoint doesn’t actually support this 😳
This should have been part of dc98403.
It’s cut off by the edge of the viewport for now!
These unrelated tests were failing because they assumed that
the first 404 was stable.
Here’s another similar instance.
@backspace backspace changed the title from “UI: Fetch and reflect ACLs” to “UI: Change Run Job availability based on ACLs” on Jul 10, 2019
My placeholder test was incorrect.
The “greatest number of matched characters” part is still
forthcoming.
It’s too bad reduce is so unwieldy in JavaScript… maybe this
would be better as a for loop ☹️
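As a rough illustration of the reduce-based “greatest number of matched characters” approach described in these commits (the rule shape and names here are assumptions, not the actual diff):

```js
// Among glob rules that match the active namespace, pick the one whose
// pattern has the most concrete (non-wildcard) characters. Escaping of
// other regex metacharacters is elided for brevity.
function bestMatchingRule(rules, namespaceName) {
  return rules.reduce((best, rule) => {
    const pattern = new RegExp(`^${rule.Name.split('*').join('.*')}$`);
    if (!pattern.test(namespaceName)) return best;

    const matchedCharacters = rule.Name.split('*').join('').length;
    return best && best.matchedCharacters >= matchedCharacters
      ? best
      : { rule, matchedCharacters };
  }, null);
}
```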
…pect-acl

# Conflicts:
#	ui/mirage/scenarios/default.js
#	ui/package.json
#	ui/yarn.lock
Since the API expands a policy shorthand like “write” into
its constituent capabilities, examining the policy is
no longer needed.
I’m hesitant to add storage-clearing before every test.
@backspace backspace marked this pull request as ready for review December 4, 2019 17:17
@backspace
Contributor Author

I made this gist with scripts and configuration to help test this out locally. Note that after changing tokens, a refresh is required, due to #6492.

Some implementation notes:

  1. The is-right-aligned class for the tooltip is a somewhat-arbitrary translateX(-75%); otherwise, the tooltip runs off the edge of the page because of where the “Run Job” button lives.

  2. The button already appears twice, for mobile and non-mobile, so now the logic to turn it off is also duplicated… I wasn’t sure if it reached the level of duplication that makes extraction desirable.

  3. It may be overkill to include ember-can for just one “ability” but I foresee using this for conditionally turning off/removing the exec button too, so I think it’ll be of further use.
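For reference, a self-contained sketch of the ability in question, assembled from the fragments quoted later in this review (policiesSupportRunning is elided for brevity):

```js
// ui/app/abilities/job.js (sketch; the real file has more going on)
import { Ability } from 'ember-can';
import { inject as service } from '@ember/service';
import { equal, or } from '@ember/object/computed';

export default Ability.extend({
  token: service(),

  // {{can "run job"}} in a template resolves to this property.
  canRun: or('selfTokenIsManagement', 'policiesSupportRunning'),

  selfTokenIsManagement: equal('token.selfToken.type', 'management'),
});
```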

@backspace backspace requested review from DingoEatingFuzz and a team December 4, 2019 17:40
Contributor

@johncowen johncowen left a comment


Hey @backspace !

I left quite a few comments, I think! Most are just me thinking out loud though. The most important one is the CSS selector one; you'll probably want to change that before merging. The others are kind of up to you, but I'd defo consider using a disabled <button> here rather than a <div>, although maybe there's a reason I'm not aware of as to why you can't use a button?

I did a little run of the app, but I don't know how to set things up so I can see things working (the default dev setup just gives me default, namespace-1 and namespace-2). No biggie though, as Michael can probably give it a once-over also.

Oh also, is there test coverage here for when you are running Nomad without namespace support?

P.S. Guess who missed a couple of commits off the end of here cos he didn't pull 😆 ! Ignore the CSS selector one!

{{#if (can "run job")}}
{{#link-to "jobs.run" data-test-run-job class="button is-primary"}}Run Job{{/link-to}}
{{else}}
<div data-test-run-job class="button tooltip is-right-aligned" aria-label="You don’t have permission to run jobs" disabled>Run Job</div>
Contributor


Noticed the disabled here. I think I've seen this working on form-like things such as fieldsets; does it work with divs also? Actually, should this be a button, not a div? I suppose it's a disabled, non-interactive button, which may as well be a div 😁, but I'm not sure what accessibility things might come into play here. If it was me, personally I'd prefer semantic HTML.

Contributor


Also just noticed the curly apostrophe. Super nit, I know; I'm guessing it will come out OK on all platforms, but thought I'd check just in case. Do you use curly punctuation elsewhere in Nomad?

Contributor Author


I guess I chose a div over a button because the thing it’s in parallel to is an a, but it’s true that having it be a true button makes the most sense, so I’ve changed it, thanks.

Contributor Author


re the curly apostrophe, maybe it’s the only one that isn’t in code only; I type this way automatically so it doesn’t occur to me. Maybe @DingoEatingFuzz has a preference but I like them haha

Contributor


I am a fan of using proper typographic marks, therefore I am a fan of the curly apostrophe. That said, I haven't been disciplined about ensuring things like curly quotes, ellipses, en dashes, and the like.

@@ -49,6 +49,7 @@
"d3-transition": "^1.1.0",
"ember-ajax": "^5.0.0",
"ember-auto-import": "^1.2.21",
"ember-can": "^2.0.0",
Contributor


I am literally about to work on something very similar to this PR, so thanks for the hint about ember-can!

.catch(() => []);
}
} catch (e) {
return null;
Contributor


Pretty sure you did this because you want things to fail silently here. I also noticed the empty catch up above somewhere; just thought I'd check that you don't want any user-visible errors here, as I'm guessing that's what this means.

Also, just curious more than anything, but there is a mix of try/catch and .then()/.catch() code here; is there any way to write it so that you always use one style? No prob if not, just wondering really.

Contributor Author


It’s possible that something should happen if policy-fetching fails… I’m not sure what that would be, it’s maybe somewhat of a design question. But it’s true that it was overly convoluted, I’ve removed the redundancy so it’s more idiomatic, thanks.
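A hedged sketch of the single-style version being described (method names are hypothetical; the fallback behavior is assumed from the `.catch(() => [])` above):

```js
// One promise chain instead of try/catch wrapped around .then()/.catch();
// a fetch failure resolves silently to an empty policy set.
fetchSelfTokenPolicies() {
  return this.fetchPolicies()
    .then(policies => policies || [])
    .catch(() => []);
},
```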

ui/app/styles/components/tooltip.scss (resolved)
ui/app/abilities/job.js (outdated, resolved)
systemService.set('activeNamespace.name', '000-abc-999');
assert.ok(jobAbility.canRun, 'expected to be able to match against more than one wildcard');
});
});
Contributor


Bit of a nitpick here, but to me these are all integration tests for canRun. They are testing more than a 'unit'. A unit test for canRun would be a test to see if `or` works (Ember owns `or`, so it wouldn't be worth unit testing it yourself). A lot of what you are testing here is Ember code and how you integrate with it.

I'd say the best thing to unit test here would be the _findMatchingNamespace method, which seems to be the actual logic for this feature. The rest of the code is pretty much integrating that into the framework/app. If you could make it so _findMatchingNamespace was importable you could even test it without all the mocking code you have here, so your test code would be way shorter.
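A hedged sketch of that suggestion, with a deliberately simplified matcher (the module path, app name, and fallback order are assumptions):

```js
// ui/app/utils/find-matching-namespace.js (hypothetical extraction)
export default function findMatchingNamespace(namespaceNames, activeNamespace) {
  if (namespaceNames.includes(activeNamespace)) return activeNamespace;
  if (namespaceNames.includes('*')) return '*';
  if (namespaceNames.includes('default')) return 'default';
}

// ui/tests/unit/utils/find-matching-namespace-test.js
import { module, test } from 'qunit';
import findMatchingNamespace from 'nomad-ui/utils/find-matching-namespace';

module('Unit | Util | find-matching-namespace', function() {
  test('falls back to default when nothing else matches', function(assert) {
    assert.equal(findMatchingNamespace(['default', 'other'], 'missing'), 'default');
  });
});
```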

Contributor Author


Testing taxonomy/nomenclature is always subjective but to me this does qualify as a unit test because I’m mocking the interfaces with the rest of the application that the unit is interacting with; for me, an integration test wouldn’t use mocking. I don’t tend to extract things like the namespace-matching until they’re useful elsewhere in the application, which this likely will never be, unless I need to for testing reasons, which I don’t think I do in this case. But I recognise this is a realm of many opinions and little consensus 🤓

Contributor


Cool, thanks for the thoughts!

@backspace backspace mentioned this pull request Dec 9, 2019
Contributor

@DingoEatingFuzz DingoEatingFuzz left a comment


Great work and excellent test coverage!

I'm happy with this as is but I left a bunch of "food for thought" type comments too.

Comment on lines +10 to +12
canRun: or('selfTokenIsManagement', 'policiesSupportRunning'),

selfTokenIsManagement: equal('token.selfToken.type', 'management'),
Contributor


I can see this pattern being repeated a bunch. It might be worth one of us thinking about this as we repeat it for both exec and node drain.
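A hedged sketch of one way to factor out the repetition when exec and node drain come along (the base-class file name is hypothetical):

```js
// ui/app/abilities/abstract.js (hypothetical shared base)
import { Ability } from 'ember-can';
import { inject as service } from '@ember/service';
import { equal } from '@ember/object/computed';

export default Ability.extend({
  token: service(),

  // Shared across abilities: management tokens can do everything.
  selfTokenIsManagement: equal('token.selfToken.type', 'management'),
});

// Each concrete ability would then add its capability check, e.g.
//   canExec: or('selfTokenIsManagement', 'policiesSupportExec')
```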

} else if (namespaceNames.includes('default')) {
return 'default';
}
},
Contributor


After reading through this whole file, I could make some nitpicks about potential performance gains around caching intermediate values and sorting lists before scanning them, but instead I won't 😛 There are unlikely to ever be so many namespaces that this becomes performance-critical code.

However I will point out that there is a lot going on in this ability file. It speaks to the possible need for a better policy primitive that can be used to make authoring abilities easier.

I'm curious what your thoughts here are since you have spent more time working through policy parsing and traversing.

Contributor Author


I agree that there’s a lot happening and that it’ll be worth extracting; I just tend to push that kind of thing off into the future, when it’s actually needed, to avoid premature abstraction. There was a time when I’d create elaborate generalised structures in anticipation and end up with something that didn’t work as well when it came time to be used elsewhere, so I now err on the side of solving the immediate problem and then generalising when it becomes useful, so the solution can be informed by real needs.

I’m planning to check ACLs for the exec button so that time isn’t far off 😆

ui/app/routes/jobs/run.js (resolved)
{{#link-to "jobs.run" data-test-run-job class="button is-primary"}}Run Job{{/link-to}}
{{else}}
<button data-test-run-job class="button tooltip is-right-aligned" aria-label="You don’t have permission to run jobs" disabled>Run Job</button>
{{/if}}
Contributor


The duplication is a bummer, but it's also not the worst. It's right at the point where it could be dangerous, but at least the two occurrences are co-located.

I also wouldn't be opposed to a job-run-button component.
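A hedged sketch of that extraction: the conditional moves into a single component template, and both call sites shrink to an invocation.

```handlebars
{{!-- ui/app/templates/components/job-run-button.hbs (hypothetical) --}}
{{#if (can "run job")}}
  {{#link-to "jobs.run" data-test-run-job class="button is-primary"}}Run Job{{/link-to}}
{{else}}
  <button data-test-run-job class="button tooltip is-right-aligned" aria-label="You don’t have permission to run jobs" disabled>Run Job</button>
{{/if}}

{{!-- Both the mobile and non-mobile call sites become: {{job-run-button}} --}}
```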

ui/tests/acceptance/allocation-detail-test.js (outdated, resolved)
ui/tests/acceptance/job-evaluations-test.js (resolved)
ui/tests/helpers/resolver.js (outdated, resolved)
ui/tests/pages/jobs/list.js (resolved)
This must have seemed/been necessary at some point but
doesn’t break anything when removed!
@backspace
Contributor Author

I’m going to merge! 😯

@backspace backspace merged commit 3adb3cd into master Jan 20, 2020
@backspace backspace deleted the f-ui/respect-acl branch January 20, 2020 20:57
spavell added a commit to samsungnext/nomad that referenced this pull request Mar 17, 2020
* actually always canonicalize alloc.Job

alloc.Job may be stale as well and needs to be migrated. It does cost
extra cycles, but the cost should be negligible.

* e2e: improve reusability of provisioning scripts (hashicorp#6942)

This changeset is part of the work to improve our E2E provisioning
process to allow our upgrade tests:

* Move more of the setup into the AMI image creation so it's a little
 more obvious to provisioning config authors which bits are essential
 to deploying a specific version of Nomad.

* Make the service file update do a systemd daemon-reload so that we
  can update an already-running cluster with the same script we use to
  deploy it initially.

* Avoid unnecessary golang version reference

* add a script to update golang version

* Update golang to 1.12.15

* Update ecs.html.md

* Update configuring-tasks.html.md

* ui: Change Run Job availability based on ACLs (hashicorp#5944)

This builds on API changes in hashicorp#6017 and hashicorp#6021 to conditionally turn off the
“Run Job” button based on the current token’s capabilities, or the capabilities
of the anonymous policy if no token is present.

If you try to visit the job-run route directly, it redirects to the job list.

* Update changelog

* e2e: use valid jobspec for group check test (hashicorp#6967)

Group service checks cannot interpolate task fields, because the task
fields are not available at the time the script check hook is created
for the group service. When f31482a was merged this e2e test began
failing because we are now correctly matching the script check ID to
the service ID, which revealed this jobspec was invalid.

* UI: Migrate to Storybook (hashicorp#6507)

I originally planned to add component documentation, but as this dragged on and I found that JSDoc-to-Markdown sometimes needed hand-tuning, I decided to skip it and focus on replicating what was already present in Freestyle. Adding documentation is a finite task that can be revisited in the future.

My goal was to migrate everything from Freestyle with as few changes as possible. Some adaptations that I found necessary:
• the DelayedArray and DelayedTruth utilities that delay component rendering until slightly after initial render because without them:
  ◦ charts were rendering with zero width
  ◦ the JSON viewer was rendering with empty content
• Storybook in Ember renders components in a routerless/controllerless context by default, so some component stories needed changes:
  ◦ table pagination/sorting stories need access to query params, which necessitates some reaching into Ember internals to start routing and dynamically generate a Storybook route/controller to render components into
  ◦ some stories have a faux controller as part of their Storybook context that hosts setInterval-linked dynamic computed properties
• some jiggery-pokery with anchor tags
  ◦ inert href='#' had to become href='javascript:;'
  ◦ links that are actually meant to navigate need target='_parent' so they don’t navigate inside the Storybook iframe

Maybe some of these could be addressed by fixes in ember-cli-storybook but I’m wary of digging around in there any more than I already have, as I’ve lost a lot of time to Storybook confusion and frustrations already 😞

The STORYBOOK=true environment variable tweaks some environment settings to get things working as expected in the Storybook context.

I chose to:
• use angle bracket invocation within stories rather than have to migrate them soon after having moved to Storybook
• keep Freestyle around for now for its palette and typeface components

* e2e: update framework to allow deploying Nomad (hashicorp#6969)

The e2e framework instantiates clients for Nomad/Consul but the
provisioning of the actual Nomad cluster is left to Terraform. The
Terraform provisioning process uses `remote-exec` to deploy specific
versions of Nomad so that we don't have to bake an AMI every time we
want to test a new version. But Terraform treats the resulting
instances as immutable, so we can't use the same tooling to update the
version of Nomad in-place. This is a prerequisite for upgrade testing.

This changeset extends the e2e framework to provide the option of
deploying Nomad (and, in the future, Consul/Vault) with specific
versions to running infrastructure. This initial implementation is
focused on deploying to a single cluster via `ssh` (because that's our
current need), but provides interfaces to hook the test run at the
start of the run, the start of each suite, or the start of a given
test case.

Terraform work includes:
* provides Terraform output that is written to JSON and used by the
  framework to configure provisioning via `terraform output provisioning`.
* provides Terraform output that can be used by test operators to
  configure their shell via `$(terraform output environment)`
* drops `remote-exec` provisioning steps from Terraform
* makes changes to the deployment scripts to ensure they can be run
  multiple times w/ different versions against the same host.

* e2e: ensure group script check tests interpolation (hashicorp#6972)

Fixes a bug introduced in 0aa58b9 where we're writing a test file to
a taskdir-interpolated location, which works when we `alloc exec` but
not in the jobspec for a group script check.

This changeset also makes the test safe to run multiple times by
namespacing the file with the alloc ID, which has the added bonus of
exercising our alloc interpolation code for group script checks.

* Return FailedTGAlloc metric instead of no node err

If an existing system allocation is running and the node it's running on
is marked as ineligible, subsequent plan/applies return an RPC error
instead of a more helpful plan result.

This change logs the error, and appends a failedTGAlloc for the
placement.

* update changelog

* extract leader step function

* Handle Nomad leadership flapping

Fixes a deadlock in leadership handling if leadership flapped.

Raft propagates leadership transitions to Nomad through a NotifyCh channel.
Raft blocks when writing to this channel, so the channel must be buffered or
aggressively consumed[1]. Otherwise, Raft blocks indefinitely in `raft.runLeader`
until the channel is consumed[1] and does not move on to executing
follower-related logic (in `raft.runFollower`).

While the Raft `runLeader` defer function blocks, Raft cannot process any other
Raft operations.  For example, the `run{Leader|Follower}` methods consume
`raft.applyCh`, and while the runLeader defer is blocked, all Raft log
applications or config lookups will block indefinitely.

Sadly, `leaderLoop` and `establishLeader` make a few Raft calls!
`establishLeader` attempts to auto-create autopilot/scheduler config [3]; and
`leaderLoop` attempts to check raft configuration [4].  All of these calls occur
without a timeout.

Thus, if leadership flapped quickly while `leaderLoop/establishLeadership` was
invoked and hit any of these Raft calls, the Raft handler would _deadlock_ forever.

Depending on how many times it flapped and where exactly we get stuck, I suspect
it's possible to get into the following case:

* Agent metrics/stats http and RPC calls hang as they check raft.Configurations
* raft.State remains in Leader state, and server attempts to handle RPC calls
  (e.g. node/alloc updates) and these hang as well

As we create goroutines per RPC call, the number of goroutines grows over time
and may trigger out-of-memory errors in addition to missed updates.

[1] https://github.com/hashicorp/raft/blob/d90d6d6bdacf1b35d66940b07be515b074d89e88/config.go#L190-L193
[2] https://github.com/hashicorp/raft/blob/d90d6d6bdacf1b35d66940b07be515b074d89e88/raft.go#L425-L436
[3] https://github.com/hashicorp/nomad/blob/2a89e477465adbe6a88987f0dcb9fe80145d7b2f/nomad/leader.go#L198-L202
[4] https://github.com/hashicorp/nomad/blob/2a89e477465adbe6a88987f0dcb9fe80145d7b2f/nomad/leader.go#L877

* e2e: document e2e provisioning process (hashicorp#6976)

* Add the digital marketing team as the code owners for the website dir

* Mock the eligibility endpoint in mirage

* Implement eligibility toggling in the data layer

* Add isMigrating property to the allocation model

* Mock the drain endpoint

* drain and forceDrain adapter methods

* Update drain methods to properly wrap DrainSpec params

* cancelDrain adapter method

* Reformat the client detail page to use the two-row header design

* Add tooltip to the eligibility control

* Update the underlying node model when toggling eligibility in mirage

* Eligibility toggling behavior

* PopoverMenu component

* Update the dropdown styles to be more similar to button styles

* Multiline modifier for tooltips

* More form styles as needed for the drain form

* Initial layout of the drain options popover

* Let dropdowns assume their full width

* Add triggerClass support to the popover menu

* Factor out the drain popover and implement its behaviors

* Extract the duration parsing into a util

* Test coverage for the parse duration util

* Refactor parseDuration to support multi-character units

* Polish for the drain popover

* Stub out all the markup for the new drain strategy view

* Fill in the drain strategy ribbon values

* Fill out the metrics and time since values in the drain summary

* Drain complete notification

* Drain stop and update and notifications

* Modifiers to the two-step-button

* Make outline buttons have a solid white background

* Force drain button in the drain info box

* New toggle component

* Swap the eligibility checkbox out for a toggle

* Toggle bugs: focus and multiline alignment

* Switch drain popover checkboxes for toggles

* Clear all notifications when resetting the controller

* Model the notification pattern as a page object component

* Update the client detail page object

* Integration tests for the toggle component

* PopoverMenu integration tests

* Update existing tests

* New test coverage for the drain capabilities

* Stack the popover menu under the subnav

* Use qunit-dom where applicable

* Increase the size and spacing of the toggle component

* Remove superfluous information from the client details ribbon

* Tweak vertical spacing of headings

* Update client detail test given change to the compositeStatus property

* Replace custom parse-duration implementation with an existing lib

* fix comment

* consul: add support for canary meta

* website: add canary meta to api docs

* docs: add Go versioning policy

* consul: fix var name from rebase

* docs: resetting bootstrap doesn't invalidate token

* consul: fix var name from rebase

* Update website/source/guides/security/acl.html.markdown

Co-Authored-By: Tim Gross <tim@0x74696d.com>

* e2e: packer builds should not be public (hashicorp#6998)

* docs: tweaks

* include test and address review comments

* handle channel close signal

Always deliver last value then send close signal.

* tweak leadership flapping log messages

* tests: defer closing shutdownCh

* client: canonicalize alloc.Job on restore

There is a case for always canonicalizing alloc.Job field when
canonicalizing the alloc.  I'm less certain of implications though, and
the job canonicalize hasn't changed for a long time.

Here, we special case client restore from database as it's probably the
most relevant part.  When receiving an alloc from RPC, the data should
be fresh enough.

* Support customizing full scheduler config

* tests: run_for is already a string

* canary_meta will be part of 0.10.3 (not 0.10.2)

I assume this is just an oversight. I tried adding the `canary_meta` stanza to an existing v0.10.2 setup (Nomad v0.10.2 (0d2d6e3)) and it did show the error message:
```
* group: 'ggg', task: 'tttt', invalid key: canary_meta
```

* use golang 1.12.16

* Allow nomad monitor command to lookup server UUID

Allows addressing servers with nomad monitor using the server's name or
ID.

Also unifies logic for addressing servers for client_agent_endpoint
commands and makes addressing logic region aware.

rpc getServer test

* fix tests, update changelog

* e2e: add a -suite flag to e2e.Framework

This change allows for providing the -suite=<Name> flag when
running the e2e framework. If set, only the matching e2e/Framework.TestSuite.Component
will be run, and all other suites will be skipped.

* Document default_scheduler_config option

* document docker's disable_log_collection flag

* batch mahmood's changelog entries

[ci skip]

* incorporate review feedback

* core: add limits to unauthorized connections

Introduce limits to prevent unauthorized users from exhausting all
ephemeral ports on agents:

 * `{https,rpc}_handshake_timeout`
 * `{http,rpc}_max_conns_per_client`

The handshake timeout closes connections that have not completed the TLS
handshake by the deadline (5s by default). For RPC connections this
timeout also separately applies to the first byte being read, so RPC
connections with TLS enabled have `rpc_handshake_timeout * 2` as their
deadline.

The connection limit per client prevents a single remote TCP peer from
exhausting all ephemeral ports. The default is 100, but can be lowered
to a minimum of 26. Since streaming RPC connections create a new TCP
connection (until MultiplexV2 is used), 20 connections are reserved for
Raft and non-streaming RPCs to prevent connection exhaustion due to
streaming RPCs.

All limits are configurable and may be disabled by setting them to `0`.

This also includes a fix that closes connections that attempt to create
TLS RPC connections recursively. While only users with valid mTLS
certificates could perform such an operation, it was added as a
safeguard to prevent programming errors before they could cause resource
exhaustion.

* docs: document limits

Taken more or less verbatim from Consul.

* Merge pull request hashicorp#160 from hashicorp/b-mtls-hostname

server: validate role and region for RPC w/ mTLS

* docs: bump 0.10.2 -> 0.10.3

* docs: add v0.10.3 release to changelog

* Add an ability for client permissions

* Refactor ability tests to use a setup hook for ability lookup

* Enable the eligibility toggle conditionally based on acls

* Refetch all ACL things when the token changes

* New disabled buttons story

* Disabled button styles

* Disable options for popover and drain-popover

* hclfmt a test jobspec (hashicorp#7011)

* Update disabled 'Run Job' button to use standard disabled style

* Add an explanatory tooltip to the unauthorized node drain popover

* Fix token referencing from the token controller, as well as resetting

* Handle the case where ACLs aren't enabled in abilities

* Account for disabled ACLs in ability tests

* Acceptance test for disabled node write controls

* Use secret ID for NOMAD_TOKEN

Use secret ID for NOMAD_TOKEN as the accessor ID doesn't seem to work.

I tried with a local micro cluster following the tutorials, and if I do:

```console
$ export NOMAD_TOKEN=85310d07-9afa-ef53-0933-0c043cd673c7
```

Using the accessor ID as in this example, I get an error:

```
Error querying jobs: Unexpected response code: 403 (ACL token not found)
```

But when using the secret ID in that env var it seems to work correctly.

* Pass stats interval collection to executor

This fixes a bug where executor-based drivers emit stats every second,
regardless of user configuration.

When serializing the Stats request across gRPC, the Nomad agent dropped
the Interval value, and the executor then uses 1s as a default value.

* changelog

* Some fixes to connection pooling

Pick up some fixes from Consul:

* If a stream returns an EOF error, clear session from cache/pool and start a
new one.
* Close the codec when closing StreamClient

* Allow for an icon within the node status light

* Add an icon inside the node status light

* Assign icons to node statuses

* New node initializing icon

* Redo the node-status-light CSS to be icon-based

* Add an animation for the initializing state

* Call out the 'down' status too, since it's a pretty bad one

* command, docs: create and document consul token configuration for connect acls (hashicorp#6716)

This change provides an initial pass at setting up the configuration necessary to
enable use of Connect with Consul ACLs. Operators will be able to pass in a Consul
Token through `-consul-token` or `$CONSUL_TOKEN` in the `job run` and `job revert`
commands (similar to Vault tokens).

These values are not actually used yet in this changeset.

* nomad: ensure a unique ClusterID exists when leader (hashicorp#6702)

Enable any Server to lookup the unique ClusterID. If one has not been
generated, and this node is the leader, generate a UUID and attempt to
apply it through raft.

The value is not yet used anywhere in this changeset, but is a prerequisite
for hashicorp#6701.

* client: enable nomad client to request and set SI tokens for tasks

When a job is configured with Consul Connect aware tasks (i.e. sidecar),
the Nomad Client should be able to request from Consul (through Nomad Server)
Service Identity tokens specific to those tasks.

* nomad: proxy requests for Service Identity tokens between Clients and Consul

Nomad jobs may be configured with a TaskGroup which contains a Service
definition that is Consul Connect enabled. These service definitions end
up establishing a Consul Connect Proxy Task (e.g. envoy, by default). In
the case where Consul ACLs are enabled, a Service Identity token is required
for these tasks to run & connect, etc. This changeset enables the Nomad Server
to recieve RPC requests for the derivation of SI tokens on behalf of instances
of Consul Connect using Tasks. Those tokens are then relayed back to the
requesting Client, which then injects the tokens in the secrets directory of
the Task.

* client: enable envoy bootstrap hook to set SI token

When creating the envoy bootstrap configuration, we should append
the "-token=<token>" argument in the case where the sidsHook placed
the token in the secrets directory.

* nomad: fixup token policy validation

* nomad: handle SI token revocations concurrently

Be able to revoke SI token accessors concurrently, and also
ratelimit the requests being made to Consul for the various
ACL API uses.

* agent: re-enable the server in dev mode

* client: remove unused indirection for referencing consul executable

Was thinking about using the testing pattern where you create executable
shell scripts as test resources which "mock" the process a bit of code
is meant to fork+exec. Turns out that wasn't really necessary in this case.

* client: skip task SI token file load failure if testing as root

The TestEnvoyBootstrapHook_maybeLoadSIToken test case only works when
running as a non-privileged user, since it deliberately tries to read
an un-readable file to simulate a failure loading the SI token file.

* comments: cleanup some leftover debug comments and such

* nomad,client: apply smaller PR suggestions

Apply smaller suggestions like doc strings, variable names, etc.

Co-Authored-By: Nick Ethier <nethier@hashicorp.com>
Co-Authored-By: Michael Schurter <mschurter@hashicorp.com>

* nomad,client: apply more comment/style PR tweaks

* client: set context timeout around SI token derivation

The derivation of an SI token needs to be safeguarded by a context
timeout, otherwise an unresponsive Consul could cause the siHook
to block forever on Prestart.

* client: manage TR kill from parent on SI token derivation failure

Re-orient the management of the tr.kill to happen in the parent of
the spawned goroutine that is doing the actual token derivation. This
makes the code a little more straightforward, making it easier to
reason about not leaking the worker goroutine.

* nomad: fix leftover missed refactoring in consul policy checking

* nomad: make TaskGroup.UsesConnect helper a public helper

* client: PR cleanup - shadow context variable

* client: PR cleanup - improved logging around kill task in SIDS hook

* client: additional test cases around failures in SIDS hook

* tests: skip some SIDS hook tests if running tests as root

* e2e: e2e test for connect with consul acls

Provide script for managing Consul ACLs on a TF provisioned cluster for
e2e testing. Script can be used to 'enable' or 'disable' Consul ACLs,
and automatically takes care of the bootstrapping process if necessary.

The bootstrapping process takes a long time, so we may need to
extend the overall e2e timeout (20 minutes seems fine).

Introduces basic tests for Consul Connect with ACLs.

* e2e: remove forgotten unused field from new struct

* e2e: do not use eventually when waiting for allocs

This test is causing panics. Unlike the other similar tests, this
one is using require.Eventually which is doing something bad, and
this change replaces it with a for-loop like the other tests.

Failure:

=== RUN   TestE2E/Connect
=== RUN   TestE2E/Connect/*connect.ConnectE2ETest
=== RUN   TestE2E/Connect/*connect.ConnectE2ETest/TestConnectDemo
=== RUN   TestE2E/Connect/*connect.ConnectE2ETest/TestMultiServiceConnect
=== RUN   TestE2E/Connect/*connect.ConnectClientStateE2ETest
panic: Fail in goroutine after TestE2E/Connect/*connect.ConnectE2ETest has completed

goroutine 38 [running]:
testing.(*common).Fail(0xc000656500)
	/opt/google/go/src/testing/testing.go:565 +0x11e
testing.(*common).Fail(0xc000656100)
	/opt/google/go/src/testing/testing.go:559 +0x96
testing.(*common).FailNow(0xc000656100)
	/opt/google/go/src/testing/testing.go:587 +0x2b
testing.(*common).Fatalf(0xc000656100, 0x1512f90, 0x10, 0xc000675f88, 0x1, 0x1)
	/opt/google/go/src/testing/testing.go:672 +0x91
github.com/hashicorp/nomad/e2e/connect.(*ConnectE2ETest).TestMultiServiceConnect.func1(0x0)
	/home/shoenig/go/src/github.com/hashicorp/nomad/e2e/connect/multi_service.go:72 +0x296
github.com/hashicorp/nomad/vendor/github.com/stretchr/testify/assert.Eventually.func1(0xc0004962a0, 0xc0002338f0)
	/home/shoenig/go/src/github.com/hashicorp/nomad/vendor/github.com/stretchr/testify/assert/assertions.go:1494 +0x27
created by github.com/hashicorp/nomad/vendor/github.com/stretchr/testify/assert.Eventually
	/home/shoenig/go/src/github.com/hashicorp/nomad/vendor/github.com/stretchr/testify/assert/assertions.go:1493 +0x272
FAIL	github.com/hashicorp/nomad/e2e	21.427s

* e2e: uncomment test case that is not broken

* e2e: use hclfmt on consul acls policy config files

* e2e: agent token was only being set for server0

* e2e: remove redundant extra API call for getting allocs

* e2e: setup consul ACLs a little more correctly

* tests: set consul token for nomad client for testing SIDS TR hook

* nomad: min cluster version for connect ACLs is now v0.10.4

* nomad: remove unused default scheduler variable

This is from a merge conflict resolution that went the wrong direction.

I assumed the block had been added, but really it had been removed. Now,
it is removed once again.

* docs: update changelog to mention connect with acls

* nomad/docs: increment version number to 0.10.4

* sentinel: copy jobs to prevent mutation

It's unclear whether Sentinel code can mutate values passed to the eval,
so ensure it cannot by copying the job.

* ignore computed diffs if node is ineligible

test is flaky; add temp sleeps for debugging

fix computed class

* make diffSystemAllocsForNode aware of eligibility

diffSystemAllocs -> diffSystemAllocsForNode; this function is only used
for diffing system allocations, but it lacked awareness of eligible
nodes and the node ID that the allocation was going to be placed on.

This change now ignores a change if its existing allocation is on an
ineligible node. For a new allocation, it also checks tainted and
ineligible nodes in the same function instead of nil-ing out the diff
after computation in diffSystemAllocs

* add test for node eligibility

* comment for filtering reason

* update changelog

* vagrant: disable audio interference

Avoid Vagrant/VirtualBox interfering with host audio when the VM boots.

* prehook: fix enterprise repo remote value

* dev: Tweaks to cluster dev scripts

Consolidate all Nomad data dirs in a single root
`/tmp/nomad-dev-cluster`.  Eases cleanup.

Allow running the script from any path - don't require devs to cd into
the `dev/cluster` directory first.

Also, block while nomad processes are running and propagate
SIGTERM/SIGINT to nomad processes to shut them down.

* e2e: remove leftover debug println statement

* run "make hclfmt"

* make: emit explanation for /api isolation

Emit a slightly helpful message when /api depends on nomad internal
packages.

* pool: Clear connection before releasing

This is to be consistent with other connection cleanup handlers, as well as Consul's https://github.com/hashicorp/consul/blob/v1.6.3/agent/pool/pool.go#L468-L479 .

* Fix panic when monitoring a local client node

Fixes a panic when accessing a.agent.Server() when the agent is a client
instead. This PR removes a redundant ACL check, since ACLs are validated
at the RPC layer. It also nil-checks the agent server and uses Client()
when appropriate.

* agent Profile req nil check s.agent.Server()

clean up logic and tests

* update changelog

* docs: fix misspelling

* keep placed canaries aligned with alloc status

* nomad state store must be modified through raft, rm local state change

* add state store test to ensure PlacedCanaries is updated

* docs: add link & reorg hashicorp#6690 in changelog

* docs: fix typo, ordering, & style in changelog

* e2e: turn no-ACLs connect tests back on

Also clean up more missed debugging things >.>

* e2e: improve provisioning defaults and documentation (hashicorp#7062)

This changeset improves the ergonomics of running the Nomad e2e test
provisioning process by defaulting to a blank `nomad_sha` in the
Terraform configuration. By default, a user will now need to pass in
one of the Nomad version flags. But they won't have to manually edit
the `provisioning.json` file for the common case of deploying a
released version of Nomad, and won't need to put dummy values for
`nomad_sha`.

Includes general documentation improvements.

* e2e: rename linux runner to avoid implicit build tag (hashicorp#7070)

Go implicitly treats files ending with `_linux.go` as build tagged for
Linux only. This broke the e2e provisioning framework on macOS once we
tried importing it into the `e2e/consulacls` module.

* e2e: wait 2m rather than 10s after disabling consul acls

Pretty sure Consul / Nomad clients are often not ready yet after
the ConsulACLs test disables ACLs, by the time the next test starts
running.

Running locally things tend to work, but in TeamCity this seems to be a
recurring problem. However, when running locally I do sometimes see in
the "show status" step after disabling ACLs that some nodes are still
initializing, suggesting we're right on the border of not waiting long
enough:

    nomad node status
    ID        DC   Name              Class   Drain  Eligibility  Status
    0e4dfce2  dc1  EC2AMAZ-JB3NF9P   <none>  false  eligible     ready
    6b90aa06  dc2  ip-172-31-16-225  <none>  false  eligible     ready
    7068558a  dc2  ip-172-31-20-143  <none>  false  eligible     ready
    e0ae3c5c  dc1  ip-172-31-25-165  <none>  false  eligible     ready
    15b59ed6  dc1  ip-172-31-23-199  <none>  false  eligible     initializing

Going to try waiting a full 2 minutes after disabling ACLs, hopefully that
will help things Just Work. In the future, we should probably be parsing the
output of the status checks and actually confirming all nodes are ready.

Even better, maybe that's something shipyard will have built-in.

* add e2e test for system sched ineligible nodes

* get test passing, new util func to wait for not pending

* clean up

* rm unused field

* fix check

* simplify job, better error

* docs: hashicorp#6065 shipped in v0.10.0, not v0.9.6

PR hashicorp#6065 was intended to be backported to v0.9.6 to fix issue hashicorp#6223.
However it appears to have not been backported:

 * https://github.com/hashicorp/nomad/blob/v0.9.6/client/allocrunner/taskrunner/task_runner.go#L1349-L1351
 * https://github.com/hashicorp/nomad/blob/v0.9.7/client/allocrunner/taskrunner/task_runner.go#L1349-L1351

The fix was included in v0.10.0:

 * https://github.com/hashicorp/nomad/blob/v0.10.0/client/allocrunner/taskrunner/task_runner.go#L1363-L1370

* e2e: add --quiet flag to s3 copy to reduce log spam (hashicorp#7085)

* Explicit transparent bg on popover actions

* Override the max-width on mobile to avoid losing space due to non-existent gutter menu

* changelog windows binaries being signed

Note that as of 0.10.4, Nomad Windows binaries will be signed.

[ci skip]

* change log for remote pprof endpoints

* nomad: unset consul token on job register

* nomad: assert consul token is unset on job register in tests

* command: use consistent CONSUL_HTTP_TOKEN name

Consul CLI uses CONSUL_HTTP_TOKEN, so Nomad should use the same.
Note that consul-template uses CONSUL_TOKEN, which Nomad also uses,
so be careful to preserve any reference to that in the consul-template
context.

* docs: update changelog mentioning consul token passthrough

* release: prep 0.10.4

* Generate files for 0.10.4 release

* Release v0.10.4

Co-authored-by: Mahmood Ali <mahmood@hashicorp.com>
Co-authored-by: Tim Gross <tgross@hashicorp.com>
Co-authored-by: Tim Higgison <TimHiggison@users.noreply.github.com>
Co-authored-by: Buck Doyle <buck@hashicorp.com>
Co-authored-by: Drew Bailey <2614075+drewbailey@users.noreply.github.com>
Co-authored-by: Charlie Voiselle <464492+angrycub@users.noreply.github.com>
Co-authored-by: Michael Schurter <mschurter@hashicorp.com>
Co-authored-by: Michael Lange <dingoeatingfuzz@gmail.com>
Co-authored-by: Nick Ethier <nethier@hashicorp.com>
Co-authored-by: Tim Gross <tim@0x74696d.com>
Co-authored-by: Shantanu Gadgil <shantanugadgil@users.noreply.github.com>
Co-authored-by: Seth Hoenig <shoenig@hashicorp.com>
Co-authored-by: Sebastián Ramírez <tiangolo@gmail.com>
Co-authored-by: Nomad Release bot <nomad@hashicorp.com>
@github-actions

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jan 21, 2023