Backport of fix 'default' alias not added to interface specified by network_interface into release/1.4.x #18114

Conversation

hc-github-team-nomad-core
Contributor

Backport

This PR is auto-generated from #18096 to be assessed for backporting due to the inclusion of the label backport/1.4.x.

The text below is copied from the body of the original PR.


closes #18097

Consider the following `config.hcl`:

```hcl
# config.hcl
client {
  network_interface = "tailscale0"

  host_network "tailscale" {
    interface = "tailscale0"
  }

  host_network "public" {
    interface = "wlan0"
  }
}
```

When you schedule the following job:

```hcl
# job.hcl
job "docs" {
  datacenters = ["dc1"]

  group "example" {
    network {
      port "http" {}
    }
    task "server" {
      driver = "docker"

      config {
        image = "hashicorp/http-echo"
        ports = ["http"]
        args = [
          "-listen",
          ":5678",
          "-text",
          "hello world",
        ]
      }
    }
  }
}
```

you would see the scheduler error `* Constraint "missing host network \"default\" for port \"http\"": 1 nodes excluded by filter`, because the `default` host_network alias is being replaced with `tailscale`. This PR appends the `default` alias to the addresses specified by `network_interface` regardless.
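A minimal Go sketch of the fix's idea; the types and function are illustrative, not Nomad's actual fingerprinting code:

```go
// Sketch (not Nomad's actual code): when collecting addresses for the
// interface set by network_interface, register them under any user-defined
// host_network aliases AND under the implicit "default" alias, so jobs
// that don't name a host_network still match.
package main

import "fmt"

type hostNetwork struct {
	alias string
	iface string
}

func aliasesFor(iface string, configured []hostNetwork) []string {
	var aliases []string
	for _, hn := range configured {
		if hn.iface == iface {
			aliases = append(aliases, hn.alias)
		}
	}
	// The fix: always append "default" for the network_interface addresses,
	// even when another alias already covers the same interface.
	return append(aliases, "default")
}

func main() {
	cfg := []hostNetwork{{"tailscale", "tailscale0"}, {"public", "wlan0"}}
	fmt.Println(aliasesFor("tailscale0", cfg)) // [tailscale default]
}
```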

lgfa29 and others added 30 commits June 6, 2023 10:43
Implements the HTTP API associated with the `NodePool.ListJobs` RPC, including
the `api` package for the public API and documentation.

Update the `NodePool.ListJobs` RPC to fix the missing handling of the special
"all" pool.
Implement scheduler support for node pool:

* When a scheduler is invoked, we get a set of the ready nodes in the DCs that
  are allowed for that job. Extend the filter to include the node pool.
* Ensure that changes to a job's node pool are picked up as destructive
  allocation updates.
* Add `NodesInPool` as a metric to all reporting done by the scheduler.
* Add the node-in-pool filter to the `Node.Register` RPC so that we don't
  generate spurious evals for nodes in the wrong pool.
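A toy Go sketch of the first point above (shapes invented for illustration):

```go
package main

import "fmt"

type Node struct {
	ID       string
	NodePool string
	Ready    bool
}

// filterByPool narrows the set of ready nodes to those in the job's node
// pool; the resulting count is what feeds the NodesInPool metric.
func filterByPool(nodes []*Node, pool string) []*Node {
	var out []*Node
	for _, n := range nodes {
		if n.Ready && n.NodePool == pool {
			out = append(out, n)
		}
	}
	return out
}

func main() {
	nodes := []*Node{
		{"n1", "default", true},
		{"n2", "gpu", true},
	}
	fmt.Println(len(filterByPool(nodes, "gpu"))) // 1 -> NodesInPool
}
```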
Go released a security update to fix build-time code injection and execution via
CGO. This doesn't impact already-released versions of Nomad, just the build
toolchain, so we won't be releasing a Nomad security update to go with it.
During shutdown of a client with drain_on_shutdown there is a race between
the Client ending the cgroup and the task's cpuset manager cleaning up
the cgroup. During the path traversal, skip anything we cannot read, which
avoids the nil DirEntry we try to dereference now.
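A small Go sketch of that traversal guard, using the standard library's `filepath.WalkDir` (the path is illustrative):

```go
package main

import (
	"fmt"
	"io/fs"
	"path/filepath"
)

func main() {
	// Walk the cgroup tree; if an entry vanishes or is unreadable mid-walk
	// (the shutdown race described above), err is non-nil and d may be nil.
	// Returning nil on error skips the entry instead of dereferencing d.
	_ = filepath.WalkDir("/sys/fs/cgroup", func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return nil // skip anything we cannot read
		}
		if d.IsDir() {
			fmt.Println(path)
		}
		return nil
	})
}
```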
* Exam to parallelize tests

* Logging to try to solve test flakiness

* Logging in another failure

* Hardening for one test and snapshot for another

* Explicitly set the first one as the servicedAlloc instead of randomly picking

* A wild CircleCI test failure appears

* de-log
…#17455)

This PR fixes a bug where the docker network pause container would not be
stopped and removed in the case where a node is restarted, the alloc is
moved to another node, the node comes back up. See the issue below for
full repro conditions.

Basically in the DestroyNetwork PostRun hook we would depend on the
NetworkIsolationSpec field not being nil - which is only the case
if the Client stays alive all the way from network creation to network
teardown. If the node is rebooted we lose that state and previously
would not be able to find the pause container to remove. Now, we manually
find the pause container by scanning them and looking for the associated
allocID.

Fixes #17299
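A hedged sketch of that scan using the Docker SDK; the `nomad_init_<allocID>` naming is an assumption of this sketch:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"strings"

	"github.com/docker/docker/api/types"
	"github.com/docker/docker/client"
)

// findPauseContainer scans all containers for the pause container belonging
// to allocID, without relying on in-memory NetworkIsolationSpec state that a
// reboot would have wiped.
func findPauseContainer(ctx context.Context, cli *client.Client, allocID string) (string, error) {
	containers, err := cli.ContainerList(ctx, types.ContainerListOptions{All: true})
	if err != nil {
		return "", err
	}
	want := "nomad_init_" + allocID // assumed naming scheme
	for _, c := range containers {
		for _, name := range c.Names {
			if strings.TrimPrefix(name, "/") == want {
				return c.ID, nil
			}
		}
	}
	return "", fmt.Errorf("no pause container for alloc %s", allocID)
}

func main() {
	cli, err := client.NewClientWithOpts(client.FromEnv)
	if err != nil {
		log.Fatal(err)
	}
	id, err := findPauseContainer(context.Background(), cli, "example-alloc-id")
	fmt.Println(id, err)
}
```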
…ning (#17465)

* Fix: don't show a service as healthy when its parent alloc is not running

* Test for Health Unknown
If the authoritative region has been upgraded to a version of Nomad that has new
replicated objects (such as ACL Auth Methods, ACL Binding Rules, etc.), the
non-authoritative regions will start replicating those objects as soon as their
leader is upgraded. If a server in the non-authoritative region is upgraded and
then becomes the leader before all the other servers in the region have been
upgraded, then it will attempt to write a Raft log entry that the followers
don't understand. The followers will then panic.

Add the same minimum version checks that we do for RPC writes to the leader's
replication loop.
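A simplified stand-in for that gate, using `hashicorp/go-version` (the real check presumably inspects server version metadata from serf member tags):

```go
package main

import (
	"fmt"

	version "github.com/hashicorp/go-version"
)

// serversMeetMinimum returns true only when every server in the region runs
// at least the minimum version, so a new-format Raft entry is safe to write.
func serversMeetMinimum(serverVersions []string, minimum string) (bool, error) {
	min := version.Must(version.NewVersion(minimum))
	for _, raw := range serverVersions {
		v, err := version.NewVersion(raw)
		if err != nil {
			return false, err
		}
		if v.LessThan(min) {
			return false, nil
		}
	}
	return true, nil
}

func main() {
	ok, _ := serversMeetMinimum([]string{"1.6.0", "1.5.6"}, "1.6.0")
	fmt.Println(ok) // false: don't replicate node pools yet
}
```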
Whenever we write a Raft log entry for node pools, we need to first make sure
that all servers can safely apply the log without panicking. Gate upsert and
delete RPCs on all servers being upgraded to the minimum version.
Upserts and deletes of node pools are forwarded to the authoritative region,
just like we do for namespaces, quotas, ACL policies, etc. Replicate node pools
from the authoritative region.
all of our workflows are in GitHub Actions now 🎉
… be removed (#17481)

Add preemption_config to the set of keys which should be pruned from the server
config as described in #17480.
We don't want to delete node pools that have nodes or non-terminal jobs. Add a
check in the `DeleteNodePools` RPC to check locally and in federated regions,
similar to how we check that it's safe to delete namespaces.
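A toy Go version of that guard (the data shapes are invented for illustration):

```go
package main

import "fmt"

type poolUsage struct {
	Nodes           int
	NonTerminalJobs int
}

// canDeletePool mirrors the rule: a pool is deletable only when it has no
// nodes and no non-terminal jobs, locally and in every federated region.
func canDeletePool(usageByRegion map[string]poolUsage) error {
	for region, u := range usageByRegion {
		if u.Nodes > 0 || u.NonTerminalJobs > 0 {
			return fmt.Errorf("node pool in use in region %q: %d nodes, %d non-terminal jobs",
				region, u.Nodes, u.NonTerminalJobs)
		}
	}
	return nil
}

func main() {
	err := canDeletePool(map[string]poolUsage{
		"us-east": {Nodes: 0, NonTerminalJobs: 2},
	})
	fmt.Println(err)
}
```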
When registering a node with a new node pool in a non-authoritative
region we can't create the node pool because this new pool will not be
replicated to other regions.

This commit modifies the node registration logic to only allow automatic
node pool creation in the authoritative region.

In non-authoritative regions, the client is registered, but the node
pool is not created. The client is kept in the `initializing` status until
its node pool is created in the authoritative region and replicated to
the client's region.
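The decision table, sketched in Go with invented names:

```go
package main

import "fmt"

const (
	statusInitializing = "initializing"
	statusReady        = "ready"
)

// registerStatus sketches the rule: auto-create unknown pools only in the
// authoritative region; elsewhere, park the node in "initializing" until
// the pool replicates in.
func registerStatus(poolExists, isAuthoritative bool) (createPool bool, status string) {
	if poolExists {
		return false, statusReady
	}
	if isAuthoritative {
		return true, statusReady
	}
	return false, statusInitializing
}

func main() {
	create, status := registerStatus(false, false)
	fmt.Println(create, status) // false initializing
}
```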
Implement a `nomad node pool init` command that generates an example spec file
in either HCL or JSON format.
Provide a no-op implementation of the drivers.DriverNetworkManager
interface to be used by systems that don't support network isolation and
prevent panics where a network manager is expected.
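A sketch of what such a no-op implementation looks like, against a locally declared stand-in for the interface (the real signatures live in Nomad's `drivers` package and may differ):

```go
package main

// Local stand-in for the interface shape; illustrative only.
type NetworkIsolationSpec struct{ Path string }

type DriverNetworkManager interface {
	CreateNetwork(allocID string) (*NetworkIsolationSpec, bool, error)
	DestroyNetwork(allocID string, spec *NetworkIsolationSpec) error
}

// noopNetworkManager satisfies the interface without doing any work, so
// code paths that expect a manager never dereference a nil value.
type noopNetworkManager struct{}

func (noopNetworkManager) CreateNetwork(string) (*NetworkIsolationSpec, bool, error) {
	return nil, false, nil
}

func (noopNetworkManager) DestroyNetwork(string, *NetworkIsolationSpec) error {
	return nil
}

var _ DriverNetworkManager = noopNetworkManager{}

func main() {}
```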
* Tooltip on individual allocs in the panel

* Isolate allocation cells to their own component

* Tipsy trigger

* Aria label for failed-or-lost tooltips

* Buildfix

* Try adding percy exec back to exam run
This changeset includes some documentation fixes discovered while working on
node pools that we didn't want to include in the node pool PRs, so they can be
backported easily:

* namespace apply/delete commands are forwarded to the authoritative region
* deleting a namespace requires there are no non-terminal jobs in any of the
  federated regions
* fixed a typo in the name of the `nomad.client.allocated.disk` metric
The `var init` command was intended to have support for a `-quiet` flag but it
was not documented and never parsed.
This changeset adds the node pool as a label anywhere we're already emitting
labels with additional information such as node class or ID about the client.
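For instance, with `armon/go-metrics` (which Nomad uses), the pattern looks roughly like this; the metric name here is illustrative:

```go
package main

import (
	metrics "github.com/armon/go-metrics"
)

// emitClientGauge shows the pattern: wherever client metrics already carry
// node ID/class labels, the node pool rides along as one more label.
func emitClientGauge(nodeID, nodeClass, nodePool string, allocatedMHz float32) {
	labels := []metrics.Label{
		{Name: "node_id", Value: nodeID},
		{Name: "node_class", Value: nodeClass},
		{Name: "node_pool", Value: nodePool},
	}
	metrics.SetGaugeWithLabels([]string{"client", "allocated", "cpu"}, allocatedMHz, labels)
}

func main() {
	emitClientGauge("node-1", "batch", "default", 2500)
}
```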
philrenaud and others added 24 commits July 24, 2023 14:25
Makefile changes required for supporting s390x builds and a corresponding
changelog entry.
Add JWKS endpoint to HTTP API for exposing the root public signing keys used for signing workload identity JWTs.

Part 1 of N components as part of making workload identities consumable by third party services such as Consul and Vault. Identity attenuation (audience) and expiration (+renewal) are necessary to securely use workload identities with 3rd parties, so this merge does not yet document this endpoint.

---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>
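A hedged client-side sketch of consuming the endpoint; the `/.well-known/jwks.json` path and default address are assumptions based on common JWKS conventions:

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

// jwks is the subset of an RFC 7517 key set a verifier needs.
type jwks struct {
	Keys []struct {
		Kty string `json:"kty"`
		Kid string `json:"kid"`
		Use string `json:"use"`
	} `json:"keys"`
}

func main() {
	// Fetch the root public signing keys from a Nomad agent; adjust the
	// address for your cluster.
	resp, err := http.Get("http://127.0.0.1:4646/.well-known/jwks.json")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var ks jwks
	if err := json.NewDecoder(resp.Body).Decode(&ks); err != nil {
		log.Fatal(err)
	}
	for _, k := range ks.Keys {
		fmt.Println(k.Kid, k.Kty)
	}
}
```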
Add missing help entry for the `-consul-namespace` flag in `nomad job
run`.
Trusted Supply Chain Component Registry (TSCCR) enforcement starts Monday and an
internal report shows our semgrep action is pinned to a version that's not
currently permitted. Update all the action versions to whatever's the new
hotness to maximize the time-to-live on these until we have automated pinning
set up.

Also version bumps our chromedriver action, which randomly broke upstream today.
This feature is necessary when users want to explicitly re-render all templates on task restart,
e.g. to fetch new secrets from Vault even if the lease on the existing secrets has not expired.
…18100)

In #18054 we introduced a new field `render_templates` in the `restart`
block. Previously changes to the `restart` block were always non-destructive in
the scheduler but we now need to check the new field so that we can update the
template runner. The check assumed that the block was always non-nil, which
causes panics in our scheduler tests.
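A minimal sketch of the nil-safe comparison, with an invented `RestartPolicy` shape:

```go
package main

import "fmt"

type RestartPolicy struct {
	Attempts        int
	RenderTemplates bool
}

// renderTemplatesChanged compares the new field with nil-safety: either
// block may be absent, which is what the scheduler tests tripped over.
func renderTemplatesChanged(a, b *RestartPolicy) bool {
	av, bv := false, false
	if a != nil {
		av = a.RenderTemplates
	}
	if b != nil {
		bv = b.RenderTemplates
	}
	return av != bv
}

func main() {
	fmt.Println(renderTemplatesChanged(nil, &RestartPolicy{RenderTemplates: true})) // true
	fmt.Println(renderTemplatesChanged(nil, nil))                                   // false
}
```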
* Attempt at a varied end-result when sorting and searching

* Consider sort direction as well

* computed property dep update

* prioritizeSearchOrder and test

* Side-effecty but resets sort on search etc

* changelog
The alloc exec and filesystem/logs commands allow passing the `-job` flag to
select a random allocation. If the namespace for the command is set to `*`, the
RPC handler doesn't handle this correctly as it's expecting to query for a
specific job. Most commands handle this ambiguity by first verifying that only a
single object of the type in question exists (ex. a single node or job).

Update these commands so that when the `-job` flag is set we first verify
there's a single job that matches. This also allows us to extend the
functionality to allow for the `-job` flag to support prefix matching.

Fixes: #12097
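An illustrative Go sketch of the verification step (names invented, not the actual CLI code):

```go
package main

import (
	"fmt"
	"strings"
)

// resolveJob mimics the new behavior: with -job and a wildcard namespace,
// first verify exactly one job matches (prefix matching included) before
// picking a random allocation from it.
func resolveJob(jobIDs []string, prefix string) (string, error) {
	var matches []string
	for _, id := range jobIDs {
		if strings.HasPrefix(id, prefix) {
			matches = append(matches, id)
		}
	}
	switch len(matches) {
	case 0:
		return "", fmt.Errorf("no job matches prefix %q", prefix)
	case 1:
		return matches[0], nil
	default:
		return "", fmt.Errorf("prefix %q is ambiguous: %v", prefix, matches)
	}
}

func main() {
	id, err := resolveJob([]string{"web-prod", "web-dev"}, "web-p")
	fmt.Println(id, err) // web-prod <nil>
}
```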
* Bones of a component that has job variable awareness

* Got vars listed woo

* Variables as its own subnav and some pathLinkedVariable perf fixes

* Automatic Access to Variables alerter

* Helper and component to conditionally render the right link

* A bit of cleanup post-template stuff

* testfix for looping right-arrow keynav bc we have a new subnav section

* A very roundabout way of ensuring that, if a job exists when saving a variable with a pathLinkedEntity of that job, it's saved right through to the job itself

* hacky but an async version of pathLinkedVariable

* model-driven and async fetcher driven with cleanup

* Only run the update-job func if jobname is detected in var path

* Test cases begun

* Management token for variables to appear in tests

* It's a management token so it gets to see the clients tab under system jobs

* Pre-review cleanup

* More tests

* Number of requests test and small fix to groups-by-way-of-resource-arrays elsewhere

* Variable intro text tests

* Variable name re-use

* Simplifying our wording a bit

* parse json vs plainId

* Addressed PR feedback, including de-waterfalling
Bumps [word-wrap](https://github.com/jonschlinkert/word-wrap) from 1.2.3 to 1.2.4.
- [Release notes](https://github.com/jonschlinkert/word-wrap/releases)
- [Commits](jonschlinkert/word-wrap@1.2.3...1.2.4)

---
updated-dependencies:
- dependency-name: word-wrap
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
@hc-github-team-nomad-core hc-github-team-nomad-core force-pushed the backport/kschoon/multi-alias-default/largely-well-quagga branch from d3485fe to 8b4adff on August 1, 2023 12:36
@hc-github-team-nomad-core hc-github-team-nomad-core merged commit cdd9d05 into release/1.4.x Aug 1, 2023
@hc-github-team-nomad-core hc-github-team-nomad-core deleted the backport/kschoon/multi-alias-default/largely-well-quagga branch August 1, 2023 12:36