
Support Consul Connect with Consul ACLs enabled #6701

Closed
shoenig opened this issue Nov 14, 2019 · 2 comments

shoenig commented Nov 14, 2019

Implement support for Consul Connect with Consul ACLs enabled.

This will be largely centered around Consul Service Identity (SI) Tokens being automatically managed per Nomad Task.
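
For a rough idea of the per-task flow, here is a minimal Go sketch: the client asks the server to derive an SI token for a task and drops it into that task's secrets directory for later hooks to consume. The helper names `deriveSIToken` and `writeSIToken`, the `si_token` file name, and the timeout are illustrative assumptions, not the actual Nomad implementation.

```go
// Hypothetical sketch of per-task SI token management.
package main

import (
	"context"
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// deriveSIToken stands in for the Client -> Server -> Consul RPC that mints a
// Service Identity token for one task.
func deriveSIToken(ctx context.Context, allocID, taskName string) (string, error) {
	// ... RPC to the Nomad Server would happen here ...
	return "si-token-for-" + taskName, nil
}

// writeSIToken persists the derived token into the task's secrets directory,
// where later hooks (e.g. the envoy bootstrap hook) can read it.
func writeSIToken(secretsDir, token string) error {
	return os.WriteFile(filepath.Join(secretsDir, "si_token"), []byte(token), 0o600)
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	token, err := deriveSIToken(ctx, "alloc-1234", "connect-proxy-web")
	if err != nil {
		fmt.Fprintln(os.Stderr, "derive failed:", err)
		os.Exit(1)
	}
	if err := writeSIToken(os.TempDir(), token); err != nil {
		fmt.Fprintln(os.Stderr, "write failed:", err)
		os.Exit(1)
	}
}
```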

@shoenig shoenig self-assigned this Nov 14, 2019
@shoenig shoenig added theme/consul/connect Consul Connect integration type/enhancement labels Nov 14, 2019
@shoenig shoenig added this to the 0.10.3 milestone Nov 14, 2019
shoenig added a commit that referenced this issue Nov 14, 2019
As leader, check to see if a ClusterID has already been generated
and replicated through the raft log. If one has not been generated,
generate a UUID and push it through. If a ClusterID has been generated,
its value is applied into the state store. The value is not actually
used anywhere in this changeset, but is a prerequisite for gh-6701.
shoenig added a commit that referenced this issue Nov 26, 2019
Enable any Server to lookup the unique ClusterID. If one has not been
generated, and this node is the leader, generate a UUID and push it through.
The value is not yet used anywhere in this changeset, but is a prerequisite
for gh-6701.
shoenig added a commit that referenced this issue Dec 3, 2019
Enable any Server to lookup the unique ClusterID. If one has not been
generated, and this node is the leader, generate a UUID and attempt to
apply it through raft.

The value is not yet used anywhere in this changeset, but is a prerequisite
for gh-6701.
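
For illustration, a minimal Go sketch of the leader-side bootstrap described in this commit, with simplified stand-ins for the state store and the raft apply; none of these names are the real Nomad types.

```go
// Sketch of the ClusterID bootstrap: reuse a replicated ID if present,
// otherwise (as leader) generate a UUID and push it through raft.
package main

import (
	"crypto/rand"
	"fmt"
)

type stateStore struct{ clusterID string }

func (s *stateStore) ClusterID() (string, bool) { return s.clusterID, s.clusterID != "" }

// raftApplyClusterID stands in for pushing the new ID through the raft log,
// after which every server sees it applied into its state store.
func raftApplyClusterID(s *stateStore, id string) error {
	s.clusterID = id
	return nil
}

func generateUUID() (string, error) {
	buf := make([]byte, 16)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	return fmt.Sprintf("%x-%x-%x-%x-%x", buf[0:4], buf[4:6], buf[6:8], buf[8:10], buf[10:16]), nil
}

// ensureClusterID is roughly what the leader runs: reuse an ID that has
// already been replicated, otherwise generate one and apply it through raft.
func ensureClusterID(s *stateStore, isLeader bool) (string, error) {
	if id, ok := s.ClusterID(); ok {
		return id, nil
	}
	if !isLeader {
		return "", fmt.Errorf("cluster ID has not been generated yet")
	}
	id, err := generateUUID()
	if err != nil {
		return "", err
	}
	return id, raftApplyClusterID(s, id)
}

func main() {
	s := &stateStore{}
	fmt.Println(ensureClusterID(s, true))
}
```
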
@shoenig shoenig closed this as completed Feb 13, 2020
spavell added a commit to samsungnext/nomad that referenced this issue Mar 17, 2020
* actually always canonicalize alloc.Job

alloc.Job may be stale as well and needs to be migrated. It does cost
extra cycles but should be negligible.

* e2e: improve reusability of provisioning scripts (hashicorp#6942)

This changeset is part of the work to improve our E2E provisioning
process to allow our upgrade tests:

* Move more of the setup into the AMI image creation so it's a little
 more obvious to provisioning config authors which bits are essential
 to deploying a specific version of Nomad.

* Make the service file update do a systemd daemon-reload so that we
  can update an already-running cluster with the same script we use to
  deploy it initially.

* Avoid unnecessary golang version reference

* add a script to update golang version

* Update golang to 1.12.15

* Update ecs.html.md

* Update configuring-tasks.html.md

* ui: Change Run Job availability based on ACLs (hashicorp#5944)

This builds on API changes in hashicorp#6017 and hashicorp#6021 to conditionally turn off the
“Run Job” button based on the current token’s capabilities, or the capabilities
of the anonymous policy if no token is present.

If you try to visit the job-run route directly, it redirects to the job list.

* Update changelog

* e2e: use valid jobspec for group check test (hashicorp#6967)

Group service checks cannot interpolate task fields, because the task
fields are not available at the time the script check hook is created
for the group service. When f31482a was merged this e2e test began
failing because we are now correctly matching the script check ID to
the service ID, which revealed this jobspec was invalid.

* UI: Migrate to Storybook (hashicorp#6507)

I originally planned to add component documentation, but as this dragged on and I found that JSDoc-to-Markdown sometimes needed hand-tuning, I decided to skip it and focus on replicating what was already present in Freestyle. Adding documentation is a finite task that can be revisited in the future.

My goal was to migrate everything from Freestyle with as few changes as possible. Some adaptations that I found necessary:
• the DelayedArray and DelayedTruth utilities that delay component rendering until slightly after initial render because without them:
  ◦ charts were rendering with zero width
  ◦ the JSON viewer was rendering with empty content
• Storybook in Ember renders components in a routerless/controllerless context by default, so some component stories needed changes:
  ◦ table pagination/sorting stories need access to query params, which necessitates reaching into Ember internals to start routing and dynamically generate a Storybook route/controller to render components into
  ◦ some stories have a faux controller as part of their Storybook context that hosts setInterval-linked dynamic computed properties
• some jiggery-pokery with anchor tags
  ◦ inert href='#' had to become href='javascript:;'
  ◦ links that are actually meant to navigate need target='_parent' so they don’t navigate inside the Storybook iframe

Maybe some of these could be addressed by fixes in ember-cli-storybook but I’m wary of digging around in there any more than I already have, as I’ve lost a lot of time to Storybook confusion and frustrations already 😞

The STORYBOOK=true environment variable tweaks some environment settings to get things working as expected in the Storybook context.

I chose to:
• use angle bracket invocation within stories rather than have to migrate them soon after having moved to Storybook
• keep Freestyle around for now for its palette and typeface components

* e2e: update framework to allow deploying Nomad (hashicorp#6969)

The e2e framework instantiates clients for Nomad/Consul but the
provisioning of the actual Nomad cluster is left to Terraform. The
Terraform provisioning process uses `remote-exec` to deploy specific
versions of Nomad so that we don't have to bake an AMI every time we
want to test a new version. But Terraform treats the resulting
instances as immutable, so we can't use the same tooling to update the
version of Nomad in-place. This is a prerequisite for upgrade testing.

This changeset extends the e2e framework to provide the option of
deploying Nomad (and, in the future, Consul/Vault) with specific
versions to running infrastructure. This initial implementation is
focused on deploying to a single cluster via `ssh` (because that's our
current need), but provides interfaces to hook the test run at the
start of the run, the start of each suite, or the start of a given
test case.

Terraform work includes:
* provides Terraform output that is written to JSON and used by the framework
  to configure provisioning via `terraform output provisioning`.
* provides Terraform output that can be used by test operators to
  configure their shell via `$(terraform output environment)`
* drops `remote-exec` provisioning steps from Terraform
* makes changes to the deployment scripts to ensure they can be run
  multiple times w/ different versions against the same host.

* e2e: ensure group script check tests interpolation (hashicorp#6972)

Fixes a bug introduced in 0aa58b9 where we're writing a test file to
a taskdir-interpolated location, which works when we `alloc exec` but
not in the jobspec for a group script check.

This changeset also makes the test safe to run multiple times by
namespacing the file with the alloc ID, which has the added bonus of
exercising our alloc interpolation code for group script checks.

* Return FailedTGAlloc metric instead of no node err

If an existing system allocation is running and the node it's running on
is marked as ineligible, subsequent plan/apply operations return an RPC error
instead of a more helpful plan result.

This change logs the error, and appends a failedTGAlloc for the
placement.

* update changelog

* extract leader step function

* Handle Nomad leadership flapping

Fixes a deadlock in leadership handling if leadership flapped.

Raft propagates leadership transitions to Nomad through a NotifyCh channel.
Raft blocks when writing to this channel, so the channel must be buffered or
aggressively consumed[1]. Otherwise, Raft blocks indefinitely in `raft.runLeader`
until the channel is consumed[1] and does not move on to executing the
follower-related logic (in `raft.runFollower`).

While the Raft `runLeader` defer function blocks, Raft cannot process any other
operations. For example, the `run{Leader|Follower}` methods consume
`raft.applyCh`, and while the `runLeader` defer is blocked, all raft log
applications and config lookups block indefinitely.

Sadly, `leaderLoop` and `establishLeader` make a few Raft calls!
`establishLeader` attempts to auto-create the autopilot/scheduler config [3], and
`leaderLoop` attempts to check the raft configuration [4]. All of these calls
occur without a timeout.

Thus, if leadership flaps quickly while `leaderLoop`/`establishLeadership` is
invoked and hits any of these Raft calls, the Raft handler _deadlocks_ forever.

Depending on how many times it flapped and where exactly we get stuck, I suspect
it's possible to get into the following case:

* Agent metrics/stats http and RPC calls hang as they check raft.Configurations
* raft.State remains in Leader state, and server attempts to handle RPC calls
  (e.g. node/alloc updates) and these hang as well

As we create goroutines per RPC call, the number of goroutines grows over time
and may trigger out-of-memory errors in addition to missed updates.

[1] https://github.com/hashicorp/raft/blob/d90d6d6bdacf1b35d66940b07be515b074d89e88/config.go#L190-L193
[2] https://github.com/hashicorp/raft/blob/d90d6d6bdacf1b35d66940b07be515b074d89e88/raft.go#L425-L436
[3] https://github.com/hashicorp/nomad/blob/2a89e477465adbe6a88987f0dcb9fe80145d7b2f/nomad/leader.go#L198-L202
[4] https://github.com/hashicorp/nomad/blob/2a89e477465adbe6a88987f0dcb9fe80145d7b2f/nomad/leader.go#L877
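
A minimal sketch of the "buffer or aggressively consume" remedy: a small pump goroutine drains the notify channel so Raft never blocks on it, while the slower consumer still sees the most recent leadership state and the close signal. The helper name and the channel wiring are illustrative, not the actual fix.

```go
// Sketch: drain a leadership notify channel, always delivering the latest
// value to a slow consumer without ever blocking the producer.
package main

import (
	"fmt"
	"time"
)

// dropButLastCh forwards values from src to the returned channel; if the
// consumer is slow, the pending value is overwritten instead of blocking src.
func dropButLastCh(src <-chan bool) <-chan bool {
	dst := make(chan bool, 1)
	go func() {
		defer close(dst)
		for v := range src {
			select {
			case dst <- v:
			default:
				// Consumer hasn't read the previous value; replace it.
				select {
				case <-dst:
				default:
				}
				dst <- v
			}
		}
	}()
	return dst
}

func main() {
	raftNotify := make(chan bool) // unbuffered, like raft's NotifyCh
	leaderCh := dropButLastCh(raftNotify)

	// Simulate rapid leadership flapping; these writes never block.
	go func() {
		for _, v := range []bool{true, false, true} {
			raftNotify <- v
		}
		close(raftNotify)
	}()

	time.Sleep(100 * time.Millisecond)
	for v := range leaderCh {
		fmt.Println("leadership:", v) // prints only the most recent state
	}
}
```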

* e2e: document e2e provisioning process (hashicorp#6976)

* Add the digital marketing team as the code owners for the website dir

* Mock the eligibility endpoint in mirage

* Implement eligibility toggling in the data layer

* Add isMigrating property to the allocation model

* Mock the drain endpoint

* drain and forceDrain adapter methods

* Update drain methods to properly wrap DrainSpec params

* cancelDrain adapter method

* Reformat the client detail page to use the two-row header design

* Add tooltip to the eligibility control

* Update the underlying node model when toggling eligibility in mirage

* Eligibility toggling behavior

* PopoverMenu component

* Update the dropdown styles to be more similar to button styles

* Multiline modifier for tooltips

* More form styles as needed for the drain form

* Initial layout of the drain options popover

* Let dropdowns assume their full width

* Add triggerClass support to the popover menu

* Factor out the drain popover and implement its behaviors

* Extract the duration parsing into a util

* Test coverage for the parse duration util

* Refactor parseDuration to support multi-character units

* Polish for the drain popover

* Stub out all the markup for the new drain strategy view

* Fill in the drain strategy ribbon values

* Fill out the metrics and time since values in the drain summary

* Drain complete notification

* Drain stop and update and notifications

* Modifiers to the two-step-button

* Make outline buttons have a solid white background

* Force drain button in the drain info box

* New toggle component

* Swap the eligibility checkbox out for a toggle

* Toggle bugs: focus and multiline alignment

* Switch drain popover checkboxes for toggles

* Clear all notifications when resetting the controller

* Model the notification pattern as a page object component

* Update the client detail page object

* Integration tests for the toggle component

* PopoverMenu integration tests

* Update existing tests

* New test coverage for the drain capabilities

* Stack the popover menu under the subnav

* Use qunit-dom where applicable

* Increase the size and spacing of the toggle component

* Remove superfluous information from the client details ribbon

* Tweak vertical spacing of headings

* Update client detail test given change to the compositeStatus property

* Replace custom parse-duration implementation with an existing lib

* fix comment

* consul: add support for canary meta

* website: add canary meta to api docs

* docs: add Go versioning policy

* consul: fix var name from rebase

* docs: reseting bootstrap doesn't invalidate token

* consul: fix var name from rebase

* Update website/source/guides/security/acl.html.markdown

Co-Authored-By: Tim Gross <tim@0x74696d.com>

* e2e: packer builds should not be public (hashicorp#6998)

* docs: tweaks

* include test and address review comments

* handle channel close signal

Always deliver last value then send close signal.

* tweak leadership flapping log messages

* tests: defer closing shutdownCh

* client: canonicalize alloc.Job on restore

There is a case for always canonicalizing the alloc.Job field when
canonicalizing the alloc.  I'm less certain of the implications though, and
the job canonicalize logic hasn't changed for a long time.

Here, we special case client restore from database as it's probably the
most relevant part.  When receiving an alloc from RPC, the data should
be fresh enough.

* Support customizing full scheduler config

* tests: run_for is already a string

* canary_meta will be part of 0.10.3 (not 0.10.2)

I assume this is just an oversight. I tried adding the `canary_meta` stanza to an existing v0.10.2 setup (Nomad v0.10.2 (0d2d6e3)) and it did show the error message:
```
* group: 'ggg', task: 'tttt', invalid key: canary_meta
```

* use golang 1.12.16

* Allow nomad monitor command to lookup server UUID

Allows addressing servers with nomad monitor using the server's name or
ID.

Also unifies logic for addressing servers for client_agent_endpoint
commands and makes addressing logic region aware.

rpc getServer test

* fix tests, update changelog

* e2e: add a -suite flag to e2e.Framework

This change allows providing the -suite=<Name> flag when
running the e2e framework. If set, only the matching e2e/Framework.TestSuite.Component
will be run, and all other suites will be skipped.

* Document default_scheduler_config option

* document docker's disable_log_collection flag

* batch mahmood's changelog entries

[ci skip]

* incorporate review feedback

* core: add limits to unauthorized connections

Introduce limits to prevent unauthorized users from exhausting all
ephemeral ports on agents:

 * `{https,rpc}_handshake_timeout`
 * `{http,rpc}_max_conns_per_client`

The handshake timeout closes connections that have not completed the TLS
handshake by the deadline (5s by default). For RPC connections this
timeout also applies separately to the first byte being read, so RPC
connections with TLS enabled have `rpc_handshake_time * 2` as their
deadline.

The connection limit per client prevents a single remote TCP peer from
exhausting all ephemeral ports. The default is 100, but can be lowered
to a minimum of 26. Since streaming RPC connections create a new TCP
connection (until MultiplexV2 is used), 20 connections are reserved for
Raft and non-streaming RPCs to prevent connection exhaustion due to
streaming RPCs.

All limits are configurable and may be disabled by setting them to `0`.

This also includes a fix that closes connections that attempt to create
TLS RPC connections recursively. While only users with valid mTLS
certificates could perform such an operation, it was added as a
safeguard to prevent programming errors before they could cause resource
exhaustion.
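
For illustration, a simplified Go sketch of the two mechanisms (handshake deadline and per-client connection cap); the types and wiring are assumptions, not the actual Nomad listener code.

```go
// Sketch: cap connections per client IP and enforce a TLS handshake deadline.
package main

import (
	"crypto/tls"
	"net"
	"sync"
	"time"
)

type connLimiter struct {
	mu       sync.Mutex
	perIP    map[string]int
	maxPerIP int
}

// accept returns false when the remote IP already holds too many connections.
func (l *connLimiter) accept(remote net.Addr) bool {
	host, _, err := net.SplitHostPort(remote.String())
	if err != nil {
		return false
	}
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.maxPerIP > 0 && l.perIP[host] >= l.maxPerIP {
		return false
	}
	l.perIP[host]++
	return true
}

func (l *connLimiter) release(remote net.Addr) {
	host, _, _ := net.SplitHostPort(remote.String())
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.perIP[host] > 0 {
		l.perIP[host]--
	}
}

// serveTLS drops clients that exceed the per-IP limit or never finish the
// TLS handshake before the deadline, instead of holding the socket forever.
func serveTLS(conn net.Conn, cfg *tls.Config, handshakeTimeout time.Duration, lim *connLimiter) {
	defer conn.Close()
	if !lim.accept(conn.RemoteAddr()) {
		return // over the per-client limit
	}
	defer lim.release(conn.RemoteAddr())

	tlsConn := tls.Server(conn, cfg)
	_ = tlsConn.SetDeadline(time.Now().Add(handshakeTimeout))
	if err := tlsConn.Handshake(); err != nil {
		return // slow or malicious client; drop it
	}
	_ = tlsConn.SetDeadline(time.Time{}) // clear the deadline for normal traffic
	// ... hand tlsConn off to the RPC layer ...
}

func main() {
	lim := &connLimiter{perIP: map[string]int{}, maxPerIP: 100}
	_, _ = lim, serveTLS // sketch only; wiring up a real listener is omitted
}
```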

* docs: document limits

Taken more or less verbatim from Consul.

* Merge pull request hashicorp#160 from hashicorp/b-mtls-hostname

server: validate role and region for RPC w/ mTLS

* docs: bump 0.10.2 -> 0.10.3

* docs: add v0.10.3 release to changelog

* Add an ability for client permissions

* Refactor ability tests to use a setup hook for ability lookup

* Enable the eligibility toggle conditionally based on acls

* Refetch all ACL things when the token changes

* New disabled buttons story

* Disabled button styles

* Disable options for popover and drain-popover

* hclfmt a test jobspec (hashicorp#7011)

* Update disabled 'Run Job' button to use standard disabled style

* Add an explanatory tooltip to the unauthorized node drain popover

* Fix token referencing from the token controller, as well as resetting

* Handle the case where ACLs aren't enabled in abilities

* Account for disabled ACLs in ability tests

* Acceptance test for disabled node write controls

* Use secret ID for NOMAD_TOKEN

Use secret ID for NOMAD_TOKEN as the accessor ID doesn't seem to work.

I tried with a local micro cluster following the tutorials, and if I do:

```console
$ export NOMAD_TOKEN=85310d07-9afa-ef53-0933-0c043cd673c7
```

Using the accessor ID as in this example, I get an error:

```
Error querying jobs: Unexpected response code: 403 (ACL token not found)
```

But when using the secret ID in that env var it seems to work correctly.

* Pass stats interval collection to executor

This fixes a bug where executor based drivers emit stats every second,
regardless of user configuration.

When serializing the Stats request across gRPC, the Nomad agent dropped
the Interval value, and the executor then used 1s as a default value.
* changelog

* Some fixes to connection pooling

Pick up some fixes from Consul:

* If a stream returns an EOF error, clear session from cache/pool and start a
new one.
* Close the codec when closing StreamClient

* Allow for an icon within the node status light

* Add an icon inside the node status light

* Assign icons to node statuses

* New node initializing icon

* Redo the node-status-light CSS to be icon-based

* Add an animation for the initializing state

* Call out the 'down' status too, since it's a pretty bad one

* command, docs: create and document consul token configuration for connect acls (gh-6716)

This change provides an initial pass at setting up the configuration necessary to
enable use of Connect with Consul ACLs. Operators will be able to pass in a Consul
Token through `-consul-token` or `$CONSUL_TOKEN` in the `job run` and `job revert`
commands (similar to Vault tokens).

These values are not actually used yet in this changeset.
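
A small sketch of the flag-vs-environment precedence implied here; the exact precedence, and the CONSUL_HTTP_TOKEN fallback a later commit standardizes on, are assumptions on my part.

```go
// Sketch: resolve the Consul token from -consul-token or the environment.
package main

import (
	"flag"
	"fmt"
	"os"
)

func resolveConsulToken(flagValue string) string {
	if flagValue != "" {
		return flagValue // an explicit -consul-token wins
	}
	if v := os.Getenv("CONSUL_HTTP_TOKEN"); v != "" {
		return v
	}
	return os.Getenv("CONSUL_TOKEN")
}

func main() {
	token := flag.String("consul-token", "", "Consul ACL token used when registering a Connect-enabled job")
	flag.Parse()
	fmt.Println("consul token provided?", resolveConsulToken(*token) != "")
}
```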

* nomad: ensure a unique ClusterID exists when leader (gh-6702)

Enable any Server to lookup the unique ClusterID. If one has not been
generated, and this node is the leader, generate a UUID and attempt to
apply it through raft.

The value is not yet used anywhere in this changeset, but is a prerequisite
for gh-6701.

* client: enable nomad client to request and set SI tokens for tasks

When a job is configured with Consul Connect aware tasks (i.e. sidecar),
the Nomad Client should be able to request Service Identity tokens specific
to those tasks from Consul (through the Nomad Server).

* nomad: proxy requests for Service Identity tokens between Clients and Consul

Nomad jobs may be configured with a TaskGroup which contains a Service
definition that is Consul Connect enabled. These service definitions end
up establishing a Consul Connect Proxy Task (e.g. envoy, by default). In
the case where Consul ACLs are enabled, a Service Identity token is required
for these tasks to run & connect, etc. This changeset enables the Nomad Server
to receive RPC requests for the derivation of SI tokens on behalf of instances
of tasks using Consul Connect. Those tokens are then relayed back to the
requesting Client, which then injects the tokens into the secrets directory of
the Task.
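
As a sketch of the server-side half, the derivation boils down to asking Consul to mint an ACL token carrying a Service Identity and relaying the SecretID back to the client. This uses the consul/api client's token-create call as I understand it; the surrounding function and the RPC plumbing are illustrative only.

```go
// Sketch: derive an SI token for a Connect-enabled service via Consul's ACL API.
package main

import (
	"fmt"

	capi "github.com/hashicorp/consul/api"
)

// deriveSITokenForService creates a Consul ACL token carrying a Service
// Identity for the given Connect-enabled service.
func deriveSITokenForService(consul *capi.Client, service, allocID string) (*capi.ACLToken, error) {
	token, _, err := consul.ACL().TokenCreate(&capi.ACLToken{
		Description: fmt.Sprintf("si token for %s (alloc %s)", service, allocID),
		ServiceIdentities: []*capi.ACLServiceIdentity{
			{ServiceName: service},
		},
	}, nil)
	return token, err
}

func main() {
	consul, err := capi.NewClient(capi.DefaultConfig())
	if err != nil {
		panic(err)
	}
	tok, err := deriveSITokenForService(consul, "count-api", "alloc-1234")
	if err != nil {
		panic(err)
	}
	// The SecretID is what gets relayed back to the requesting Nomad client,
	// which writes it into the task's secrets directory; the AccessorID is
	// kept server-side so the token can be revoked later.
	fmt.Println("accessor:", tok.AccessorID)
}
```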

* client: enable envoy bootstrap hook to set SI token

When creating the envoy bootstrap configuration, we should append
the "-token=<token>" argument in the case where the sidsHook placed
the token in the secrets directory.
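
A hedged sketch of that bootstrap-hook behavior; the file name, sidecar ID, and argument layout are assumptions, and only the `-token=<token>` flag comes from the commit above.

```go
// Sketch: append -token to the envoy bootstrap args when an SI token exists.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

func envoyBootstrapArgs(secretsDir, sidecarFor string) []string {
	args := []string{"connect", "envoy", "-bootstrap", "-sidecar-for", sidecarFor}
	// Only present when the SI token hook ran (i.e. Consul ACLs are enabled).
	if b, err := os.ReadFile(filepath.Join(secretsDir, "si_token")); err == nil {
		args = append(args, "-token="+strings.TrimSpace(string(b)))
	}
	return args
}

func main() {
	fmt.Println("consul", strings.Join(envoyBootstrapArgs(os.TempDir(), "web"), " "))
}
```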

* nomad: fixup token policy validation

* nomad: handle SI token revocations concurrently

Be able to revoke SI token accessors concurrently, and also
rate-limit the requests being made to Consul for the various
ACL API uses.
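
A minimal sketch of concurrent, rate-limited revocation using golang.org/x/time/rate; the `revokeAccessor` helper stands in for the real Consul ACL delete call.

```go
// Sketch: revoke accessors in parallel while capping the request rate to Consul.
package main

import (
	"context"
	"fmt"
	"sync"

	"golang.org/x/time/rate"
)

func revokeAccessor(ctx context.Context, accessorID string) error {
	// ... the Consul ACL token delete would happen here ...
	fmt.Println("revoked", accessorID)
	return nil
}

func revokeAll(ctx context.Context, accessors []string, rps float64) {
	lim := rate.NewLimiter(rate.Limit(rps), 1)
	var wg sync.WaitGroup
	for _, id := range accessors {
		wg.Add(1)
		go func(id string) {
			defer wg.Done()
			if err := lim.Wait(ctx); err != nil {
				return // context cancelled while waiting for a slot
			}
			_ = revokeAccessor(ctx, id)
		}(id)
	}
	wg.Wait()
}

func main() {
	revokeAll(context.Background(), []string{"a1", "a2", "a3"}, 5)
}
```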

* agent: re-enable the server in dev mode

* client: remove unused indirection for referencing consul executable

Was thinking about using the testing pattern where you create executable
shell scripts as test resources which "mock" the process that a bit of code
is meant to fork+exec. Turns out that wasn't really necessary in this case.

* client: skip task SI token file load failure if testing as root

The TestEnvoyBootstrapHook_maybeLoadSIToken test case only works when
running as a non-privileged user, since it deliberately tries to read
an unreadable file to simulate a failure loading the SI token file.

* comments: cleanup some leftover debug comments and such

* nomad,client: apply smaller PR suggestions

Apply smaller suggestions like doc strings, variable names, etc.

Co-Authored-By: Nick Ethier <nethier@hashicorp.com>
Co-Authored-By: Michael Schurter <mschurter@hashicorp.com>

* nomad,client: apply more comment/style PR tweaks

* client: set context timeout around SI token derivation

The derivation of an SI token needs to be safeguarded by a context
timeout, otherwise an unresponsive Consul could cause the siHook
to block forever on Prestart.

* client: manage TR kill from parent on SI token derivation failure

Re-orient the management of the tr.kill to happen in the parent of
the spawned goroutine that is doing the actual token derivation. This
makes the code a little more straightforward, making it easier to
reason about not leaking the worker goroutine.

* nomad: fix leftover missed refactoring in consul policy checking

* nomad: make TaskGroup.UsesConnect helper a public helper

* client: PR cleanup - shadow context variable

* client: PR cleanup - improved logging around kill task in SIDS hook

* client: additional test cases around failures in SIDS hook

* tests: skip some SIDS hook tests if running tests as root

* e2e: e2e test for connect with consul acls

Provide a script for managing Consul ACLs on a TF-provisioned cluster for
e2e testing. The script can be used to 'enable' or 'disable' Consul ACLs,
and it automatically takes care of the bootstrapping process if necessary.

The bootstrapping process takes a long time, so we may need to
extend the overall e2e timeout (20 minutes seems fine).

Introduces basic tests for Consul Connect with ACLs.

* e2e: remove forgotten unused field from new struct

* e2e: do not use eventually when waiting for allocs

This test is causing panics. Unlike the other similar tests, this
one is using require.Eventually which is doing something bad, and
this change replaces it with a for-loop like the other tests.

Failure:

=== RUN   TestE2E/Connect
=== RUN   TestE2E/Connect/*connect.ConnectE2ETest
=== RUN   TestE2E/Connect/*connect.ConnectE2ETest/TestConnectDemo
=== RUN   TestE2E/Connect/*connect.ConnectE2ETest/TestMultiServiceConnect
=== RUN   TestE2E/Connect/*connect.ConnectClientStateE2ETest
panic: Fail in goroutine after TestE2E/Connect/*connect.ConnectE2ETest has completed

goroutine 38 [running]:
testing.(*common).Fail(0xc000656500)
	/opt/google/go/src/testing/testing.go:565 +0x11e
testing.(*common).Fail(0xc000656100)
	/opt/google/go/src/testing/testing.go:559 +0x96
testing.(*common).FailNow(0xc000656100)
	/opt/google/go/src/testing/testing.go:587 +0x2b
testing.(*common).Fatalf(0xc000656100, 0x1512f90, 0x10, 0xc000675f88, 0x1, 0x1)
	/opt/google/go/src/testing/testing.go:672 +0x91
github.com/hashicorp/nomad/e2e/connect.(*ConnectE2ETest).TestMultiServiceConnect.func1(0x0)
	/home/shoenig/go/src/github.com/hashicorp/nomad/e2e/connect/multi_service.go:72 +0x296
github.com/hashicorp/nomad/vendor/github.com/stretchr/testify/assert.Eventually.func1(0xc0004962a0, 0xc0002338f0)
	/home/shoenig/go/src/github.com/hashicorp/nomad/vendor/github.com/stretchr/testify/assert/assertions.go:1494 +0x27
created by github.com/hashicorp/nomad/vendor/github.com/stretchr/testify/assert.Eventually
	/home/shoenig/go/src/github.com/hashicorp/nomad/vendor/github.com/stretchr/testify/assert/assertions.go:1493 +0x272
FAIL	github.com/hashicorp/nomad/e2e	21.427s
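
The replacement pattern looks roughly like the sketch below: a plain for-loop with a deadline, so every assertion runs on the test's own goroutine; the helper name and timings are illustrative.

```go
// Sketch of the for-loop pattern that replaces require.Eventually: assertions
// never fire from a stray goroutine after the parent test has completed.
package e2e

import (
	"testing"
	"time"
)

func waitForRunningAllocs(t *testing.T, countAllocs func() (int, error), want int) {
	t.Helper()
	deadline := time.Now().Add(2 * time.Minute)
	for {
		n, err := countAllocs()
		if err == nil && n >= want {
			return
		}
		if time.Now().After(deadline) {
			t.Fatalf("timed out waiting for %d allocs (last saw %d, err: %v)", want, n, err)
		}
		time.Sleep(2 * time.Second)
	}
}
```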

* e2e: uncomment test case that is not broken

* e2e: use hclfmt on consul acls policy config files

* e2e: agent token was only being set for server0

* e2e: remove redundant extra API call for getting allocs

* e2e: setup consul ACLs a little more correctly

* tests: set consul token for nomad client for testing SIDS TR hook

* nomad: min cluster version for connect ACLs is now v0.10.4

* nomad: remove unused default scheduler variable

This is from a merge conflict resolution that went the wrong direction.

I assumed the block had been added, but really it had been removed. Now,
it is removed once again.

* docs: update changelog to mention connect with acls

* nomad/docs: increment version number to 0.10.4

* sentinel: copy jobs to prevent mutation

It's unclear whether Sentinel code can mutate values passed to the eval,
so ensure it cannot by copying the job.

* ignore computed diffs if node is ineligible

test flaky, add temp sleeps for debugging

fix computed class

* make diffSystemAllocsForNode aware of eligibility

diffSystemAllocs -> diffSystemAllocsForNode; this function is only used
for diffing system allocations, but it lacked awareness of eligible
nodes and of the node ID where the allocation was going to be placed.

This change now ignores a change if its existing allocation is on an
ineligible node. For a new allocation, it also checks tainted and
ineligible nodes in the same function instead of nil-ing out the diff
after computation in diffSystemAllocs.
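
A simplified sketch of that eligibility awareness; the types and signature are illustrative, not the real scheduler code.

```go
// Sketch: leave existing allocs on ineligible/tainted nodes alone instead of
// churning them, and skip such nodes for new placements up front.
package main

import "fmt"

type alloc struct {
	ID     string
	NodeID string
}

func diffSystemAllocsForNode(nodeID string, eligible, tainted map[string]bool, existing []alloc) (ignore, update []alloc) {
	for _, a := range existing {
		if !eligible[a.NodeID] || tainted[a.NodeID] {
			// Existing alloc sits on an ineligible/tainted node: don't churn it.
			ignore = append(ignore, a)
			continue
		}
		update = append(update, a)
	}
	// A brand-new placement on nodeID would likewise be skipped here when
	// !eligible[nodeID] || tainted[nodeID], instead of nil-ing the diff later.
	return ignore, update
}

func main() {
	eligible := map[string]bool{"node-a": true}
	tainted := map[string]bool{}
	ig, up := diffSystemAllocsForNode("node-b", eligible, tainted,
		[]alloc{{ID: "alloc-1", NodeID: "node-b"}})
	fmt.Println("ignored:", len(ig), "updated:", len(up))
}
```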

* add test for node eligibility

* comment for filtering reason

* update changelog

* vagrant: disable audio interference

Avoid Vagrant/VirtualBox interfering with host audio when the VM boots.

* prehook: fix enterprise repo remote value

* dev: Tweaks to cluster dev scripts

Consolidate all nomad data dirs under a single root
`/tmp/nomad-dev-cluster`.  Eases clean up.

Allow running script from any path - don't require devs to cd into
`dev/cluster` directory first.

Also, block while nomad processes are running and propagate
SIGTERM/SIGINT to nomad processes to shut down.

* e2e: remove leftover debug println statement

* run "make hclfmt"

* make: emit explanation for /api isolation

Emit a slightly helpful message when /api depends on nomad internal
packages.

* pool: Clear connection before releasing

This is to be consistent with the other connection cleanup handlers, as well as Consul's https://github.com/hashicorp/consul/blob/v1.6.3/agent/pool/pool.go#L468-L479 .

* Fix panic when monitoring a local client node

Fixes a panic when accessing a.agent.Server() when the agent is a client
instead. This PR removes a redundant ACL check since ACLs are validated
at the RPC layer. It also nil-checks the agent server and uses Client()
when appropriate.

* agent Profile req nil check s.agent.Server()

clean up logic and tests

* update changelog

* docs: fix misspelling

* keep placed canaries aligned with alloc status

* nomad state store must be modified through raft, rm local state change

* add state store test to ensure PlacedCanaries is updated

* docs: add link & reorg hashicorp#6690 in changelog

* docs: fix typo, ordering, & style in changelog

* e2e: turn no-ACLs connect tests back on

Also cleanup more missed debugging things >.>

* e2e: improve provisioning defaults and documentation (hashicorp#7062)

This changeset improves the ergonomics of running the Nomad e2e test
provisioning process by defaulting to a blank `nomad_sha` in the
Terraform configuration. By default, a user will now need to pass in
one of the Nomad version flags. But they won't have to manually edit
the `provisioning.json` file for the common case of deploying a
released version of Nomad, and won't need to put dummy values for
`nomad_sha`.

Includes general documentation improvements.

* e2e: rename linux runner to avoid implicit build tag (hashicorp#7070)

Go implicitly treats files ending with `_linux.go` as build tagged for
Linux only. This broke the e2e provisioning framework on macOS once we
tried importing it into the `e2e/consulacls` module.

* e2e: wait 2m rather than 10s after disabling consul acls

Pretty sure Consul / Nomad clients are often not ready yet after
the ConsulACLs test disables ACLs, by the time the next test starts
running.

Running locally things tend to work, but in TeamCity this seems to
be a recurring problem. However, when running locally I do sometimes
see, in the "show status" step after disabling ACLs, that some nodes are
still initializing, suggesting we're right on the border of not waiting
long enough:

    nomad node status
    ID        DC   Name              Class   Drain  Eligibility  Status
    0e4dfce2  dc1  EC2AMAZ-JB3NF9P   <none>  false  eligible     ready
    6b90aa06  dc2  ip-172-31-16-225  <none>  false  eligible     ready
    7068558a  dc2  ip-172-31-20-143  <none>  false  eligible     ready
    e0ae3c5c  dc1  ip-172-31-25-165  <none>  false  eligible     ready
    15b59ed6  dc1  ip-172-31-23-199  <none>  false  eligible     initializing

Going to try waiting a full 2 minutes after disabling ACLs, hopefully that
will help things Just Work. In the future, we should probably be parsing the
output of the status checks and actually confirming all nodes are ready.

Even better, maybe that's something shipyard will have built-in.

* add e2e test for system sched ineligible nodes

* get test passing, new util func to wait for not pending

* clean up

* rm unused field

* fix check

* simplify job, better error

* docs: hashicorp#6065 shipped in v0.10.0, not v0.9.6

PR hashicorp#6065 was intended to be backported to v0.9.6 to fix issue hashicorp#6223.
However, it appears not to have been backported:

 * https://github.com/hashicorp/nomad/blob/v0.9.6/client/allocrunner/taskrunner/task_runner.go#L1349-L1351
 * https://github.com/hashicorp/nomad/blob/v0.9.7/client/allocrunner/taskrunner/task_runner.go#L1349-L1351

The fix was included in v0.10.0:

 * https://github.com/hashicorp/nomad/blob/v0.10.0/client/allocrunner/taskrunner/task_runner.go#L1363-L1370

* e2e: add --quiet flag to s3 copy to reduce log spam (hashicorp#7085)

* Explicit transparent bg on popover actions

* Override the max-width on mobile to avoid losing space due to non-existent gutter menu

* changelog windows binaries being signed

Note that as of 0.10.4, Nomad Windows binaries will be signed.

[ci skip]

* change log for remote pprof endpoints

* nomad: unset consul token on job register

* nomad: assert consul token is unset on job register in tests

* command: use consistent CONSUL_HTTP_TOKEN name

Consul CLI uses CONSUL_HTTP_TOKEN, so Nomad should use the same.
Note that consul-template uses CONSUL_TOKEN, which Nomad also uses,
so be careful to preserve any reference to that in the consul-template
context.

* docs: update changelog mentioning consul token passthrough

* release: prep 0.10.4

* Generate files for 0.10.4 release

* Release v0.10.4

Co-authored-by: Mahmood Ali <mahmood@hashicorp.com>
Co-authored-by: Tim Gross <tgross@hashicorp.com>
Co-authored-by: Tim Higgison <TimHiggison@users.noreply.github.com>
Co-authored-by: Buck Doyle <buck@hashicorp.com>
Co-authored-by: Drew Bailey <2614075+drewbailey@users.noreply.github.com>
Co-authored-by: Charlie Voiselle <464492+angrycub@users.noreply.github.com>
Co-authored-by: Michael Schurter <mschurter@hashicorp.com>
Co-authored-by: Michael Lange <dingoeatingfuzz@gmail.com>
Co-authored-by: Nick Ethier <nethier@hashicorp.com>
Co-authored-by: Tim Gross <tim@0x74696d.com>
Co-authored-by: Shantanu Gadgil <shantanugadgil@users.noreply.github.com>
Co-authored-by: Seth Hoenig <shoenig@hashicorp.com>
Co-authored-by: Sebastián Ramírez <tiangolo@gmail.com>
Co-authored-by: Nomad Release bot <nomad@hashicorp.com>

spuder commented Mar 27, 2020

Is there documentation about how to use nomad with consul-connect when ACLs are enabled yet?

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 10, 2022