Backport of fix 'default' alias not added to interface specified by network_interface into release/1.4.x #18114

Conversation

hc-github-team-nomad-core
Contributor

Backport

This PR is auto-generated from #18096 to be assessed for backporting due to the inclusion of the label backport/1.4.x.

The text below is copied from the body of the original PR.


closes #18097

Consider the following `config.hcl`:

```hcl
# config.hcl
client {
  network_interface = "tailscale0"

  host_network "tailscale" {
    interface = "tailscale0"
  }

  host_network "public" {
    interface = "wlan0"
  }
}
```

When you schedule the following job:

```hcl
# job.hcl
job "docs" {
  datacenters = ["dc1"]

  group "example" {
    network {
      port "http" {}
    }
    task "server" {
      driver = "docker"

      config {
        image = "hashicorp/http-echo"
        ports = ["http"]
        args = [
          "-listen",
          ":5678",
          "-text",
          "hello world",
        ]
      }
    }
  }
}
```

you would see the scheduler error `* Constraint "missing host network \"default\" for port \"http\"": 1 nodes excluded by filter`, because the `default` host_network alias is being replaced with `tailscale`. This PR appends the `default` alias to the addresses specified by `network_interface` regardless.
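A minimal Go sketch of the fix's idea; the types and function are illustrative, not Nomad's actual fingerprinting code:

```go
// Sketch (not Nomad's actual code): when collecting addresses for the
// interface set by network_interface, register them under any user-defined
// host_network aliases AND under the implicit "default" alias, so jobs
// that don't name a host_network still match.
package main

import "fmt"

type hostNetwork struct {
	alias string
	iface string
}

func aliasesFor(iface string, configured []hostNetwork) []string {
	var aliases []string
	for _, hn := range configured {
		if hn.iface == iface {
			aliases = append(aliases, hn.alias)
		}
	}
	// The fix: always append "default" for the network_interface addresses,
	// even when another alias already covers the same interface.
	return append(aliases, "default")
}

func main() {
	cfg := []hostNetwork{{"tailscale", "tailscale0"}, {"public", "wlan0"}}
	fmt.Println(aliasesFor("tailscale0", cfg)) // [tailscale default]
}
```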

lgfa29 and others added 30 commits June 6, 2023 10:43
Implements the HTTP API associated with the `NodePool.ListJobs` RPC, including
the `api` package for the public API and documentation.

Update the `NodePool.ListJobs` RPC to fix the missing handling of the special
"all" pool.
Implement scheduler support for node pool:

* When a scheduler is invoked, we get a set of the ready nodes in the DCs that
  are allowed for that job. Extend the filter to include the node pool.
* Ensure that changes to a job's node pool are picked up as destructive
  allocation updates.
* Add `NodesInPool` as a metric to all reporting done by the scheduler.
* Add the node-in-pool filter to the `Node.Register` RPC so that we don't
  generate spurious evals for nodes in the wrong pool.
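A toy Go sketch of the first point above (shapes invented for illustration):

```go
package main

import "fmt"

type Node struct {
	ID       string
	NodePool string
	Ready    bool
}

// filterByPool narrows the set of ready nodes to those in the job's node
// pool; the resulting count is what feeds the NodesInPool metric.
func filterByPool(nodes []*Node, pool string) []*Node {
	var out []*Node
	for _, n := range nodes {
		if n.Ready && n.NodePool == pool {
			out = append(out, n)
		}
	}
	return out
}

func main() {
	nodes := []*Node{
		{"n1", "default", true},
		{"n2", "gpu", true},
	}
	fmt.Println(len(filterByPool(nodes, "gpu"))) // 1 -> NodesInPool
}
```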
Go released a security update to fix build-time code injection and execution via
CGO. This doesn't impact already-released versions of Nomad, just the build
toolchain, so we won't be releasing a Nomad security update to go with it.
During shutdown of a client with drain_on_shutdown there is a race between
the Client ending the cgroup and the task's cpuset manager cleaning up
the cgroup. During the path traversal, skip anything we cannot read, which
avoids the nil DirEntry we try to dereference now.
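A small Go sketch of that traversal guard, using the standard library's `filepath.WalkDir` (the path is illustrative):

```go
package main

import (
	"fmt"
	"io/fs"
	"path/filepath"
)

func main() {
	// Walk the cgroup tree; if an entry vanishes or is unreadable mid-walk
	// (the shutdown race described above), err is non-nil and d may be nil.
	// Returning nil on error skips the entry instead of dereferencing d.
	_ = filepath.WalkDir("/sys/fs/cgroup", func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return nil // skip anything we cannot read
		}
		if d.IsDir() {
			fmt.Println(path)
		}
		return nil
	})
}
```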
* Exam to parallelize tests

* Logging to try to solve test flakiness

* Logging in another failure

* Hardening for one test and snapshot for another

* Explicitly set the first one as the servicedAlloc instead of randomly picking

* A wild CircleCI test failure appears

* de-log
…#17455)

This PR fixes a bug where the docker network pause container would not be
stopped and removed in the case where a node is restarted, the alloc is
moved to another node, the node comes back up. See the issue below for
full repro conditions.

Basically in the DestroyNetwork PostRun hook we would depend on the
NetworkIsolationSpec field not being nil - which is only the case
if the Client stays alive all the way from network creation to network
teardown. If the node is rebooted we lose that state and previously
would not be able to find the pause container to remove. Now, we manually
find the pause container by scanning them and looking for the associated
allocID.

Fixes #17299
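A hedged sketch of that scan using the Docker SDK; the `nomad_init_<allocID>` naming is an assumption of this sketch:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"strings"

	"github.com/docker/docker/api/types"
	"github.com/docker/docker/client"
)

// findPauseContainer scans all containers for the pause container belonging
// to allocID, without relying on in-memory NetworkIsolationSpec state that a
// reboot would have wiped.
func findPauseContainer(ctx context.Context, cli *client.Client, allocID string) (string, error) {
	containers, err := cli.ContainerList(ctx, types.ContainerListOptions{All: true})
	if err != nil {
		return "", err
	}
	want := "nomad_init_" + allocID // assumed naming scheme
	for _, c := range containers {
		for _, name := range c.Names {
			if strings.TrimPrefix(name, "/") == want {
				return c.ID, nil
			}
		}
	}
	return "", fmt.Errorf("no pause container for alloc %s", allocID)
}

func main() {
	cli, err := client.NewClientWithOpts(client.FromEnv)
	if err != nil {
		log.Fatal(err)
	}
	id, err := findPauseContainer(context.Background(), cli, "example-alloc-id")
	fmt.Println(id, err)
}
```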
…ning (#17465)

* Fix: don't show a service as healthy when its parent alloc is not running

* Test for Health Unknown
If the authoritative region has been upgraded to a version of Nomad that has new
replicated objects (such as ACL Auth Methods, ACL Binding Rules, etc.), the
non-authoritative regions will start replicating those objects as soon as their
leader is upgraded. If a server in the non-authoritative region is upgraded and
then becomes the leader before all the other servers in the region have been
upgraded, then it will attempt to write a Raft log entry that the followers
don't understand. The followers will then panic.

Add the same minimum version checks that we do for RPC writes to the leader's
replication loop.
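A simplified stand-in for that gate, using `hashicorp/go-version` (the real check presumably inspects server version metadata from serf member tags):

```go
package main

import (
	"fmt"

	version "github.com/hashicorp/go-version"
)

// serversMeetMinimum returns true only when every server in the region runs
// at least the minimum version, so a new-format Raft entry is safe to write.
func serversMeetMinimum(serverVersions []string, minimum string) (bool, error) {
	min := version.Must(version.NewVersion(minimum))
	for _, raw := range serverVersions {
		v, err := version.NewVersion(raw)
		if err != nil {
			return false, err
		}
		if v.LessThan(min) {
			return false, nil
		}
	}
	return true, nil
}

func main() {
	ok, _ := serversMeetMinimum([]string{"1.6.0", "1.5.6"}, "1.6.0")
	fmt.Println(ok) // false: don't replicate node pools yet
}
```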
Whenever we write a Raft log entry for node pools, we need to first make sure
that all servers can safely apply the log without panicking. Gate upsert and
delete RPCs on all servers being upgraded to the minimum version.
Upserts and deletes of node pools are forwarded to the authoritative region,
just like we do for namespaces, quotas, ACL policies, etc. Replicate node pools
from the authoritative region.
all of our workflows are in GitHub Actions now 🎉
… be removed (#17481)

Add preemption_config to the set of keys which should be pruned from the server
config as described in #17480.
We don't want to delete node pools that have nodes or non-terminal jobs. Add a
check in the `DeleteNodePools` RPC to check locally and in federated regions,
similar to how we check that it's safe to delete namespaces.
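A toy Go version of that guard (the data shapes are invented for illustration):

```go
package main

import "fmt"

type poolUsage struct {
	Nodes           int
	NonTerminalJobs int
}

// canDeletePool mirrors the rule: a pool is deletable only when it has no
// nodes and no non-terminal jobs, locally and in every federated region.
func canDeletePool(usageByRegion map[string]poolUsage) error {
	for region, u := range usageByRegion {
		if u.Nodes > 0 || u.NonTerminalJobs > 0 {
			return fmt.Errorf("node pool in use in region %q: %d nodes, %d non-terminal jobs",
				region, u.Nodes, u.NonTerminalJobs)
		}
	}
	return nil
}

func main() {
	err := canDeletePool(map[string]poolUsage{
		"us-east": {Nodes: 0, NonTerminalJobs: 2},
	})
	fmt.Println(err)
}
```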
When registering a node with a new node pool in a non-authoritative
region we can't create the node pool because this new pool will not be
replicated to other regions.

This commit modifies the node registration logic to only allow automatic
node pool creation in the authoritative region.

In non-authoritative regions, the client is registered, but the node
pool is not created. The client is kept in the `initializing` status until
its node pool is created in the authoritative region and replicated to
the client's region.
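The decision table, sketched in Go with invented names:

```go
package main

import "fmt"

const (
	statusInitializing = "initializing"
	statusReady        = "ready"
)

// registerStatus sketches the rule: auto-create unknown pools only in the
// authoritative region; elsewhere, park the node in "initializing" until
// the pool replicates in.
func registerStatus(poolExists, isAuthoritative bool) (createPool bool, status string) {
	if poolExists {
		return false, statusReady
	}
	if isAuthoritative {
		return true, statusReady
	}
	return false, statusInitializing
}

func main() {
	create, status := registerStatus(false, false)
	fmt.Println(create, status) // false initializing
}
```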
Implement a `nomad node pool init` command that generates an example spec file
in either HCL or JSON format.
Provide a no-op implementation of the drivers.DriverNetworkManager
interface to be used by systems that don't support network isolation and
prevent panics where a network manager is expected.
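A sketch of what such a no-op implementation looks like, against a locally declared stand-in for the interface (the real signatures live in Nomad's `drivers` package and may differ):

```go
package main

// Local stand-in for the interface shape; illustrative only.
type NetworkIsolationSpec struct{ Path string }

type DriverNetworkManager interface {
	CreateNetwork(allocID string) (*NetworkIsolationSpec, bool, error)
	DestroyNetwork(allocID string, spec *NetworkIsolationSpec) error
}

// noopNetworkManager satisfies the interface without doing any work, so
// code paths that expect a manager never dereference a nil value.
type noopNetworkManager struct{}

func (noopNetworkManager) CreateNetwork(string) (*NetworkIsolationSpec, bool, error) {
	return nil, false, nil
}

func (noopNetworkManager) DestroyNetwork(string, *NetworkIsolationSpec) error {
	return nil
}

var _ DriverNetworkManager = noopNetworkManager{}

func main() {}
```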
* Tooltip on individual allocs in the panel

* Isolate allocation cells to their own component

* Tipsy trigger

* Aria label for failed-or-lost tooltips

* Buildfix

* Try adding percy exec back to exam run
This changeset includes some documentation fixes discovered while working on
node pools that we didn't want to include in the node pool PRs, so they can be
backported easily:

* namespace apply/delete commands are forwarded to the authoritative region
* deleting a namespace requires there are no non-terminal jobs in any of the
  federated regions
* fixed a typo in the name of the `nomad.client.allocated.disk` metric
The `var init` command was intended to have support for a `-quiet` flag but it
was not documented and never parsed.
This changeset adds the node pool as a label anywhere we're already emitting
labels with additional information such as node class or ID about the client.
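For instance, with `armon/go-metrics` (which Nomad uses), the pattern looks roughly like this; the metric name here is illustrative:

```go
package main

import (
	metrics "github.com/armon/go-metrics"
)

// emitClientGauge shows the pattern: wherever client metrics already carry
// node ID/class labels, the node pool rides along as one more label.
func emitClientGauge(nodeID, nodeClass, nodePool string, allocatedMHz float32) {
	labels := []metrics.Label{
		{Name: "node_id", Value: nodeID},
		{Name: "node_class", Value: nodeClass},
		{Name: "node_pool", Value: nodePool},
	}
	metrics.SetGaugeWithLabels([]string{"client", "allocated", "cpu"}, allocatedMHz, labels)
}

func main() {
	emitClientGauge("node-1", "batch", "default", 2500)
}
```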
philrenaud and others added 24 commits July 24, 2023 14:25
Makefile changes required for supporting s390x builds and a corresponding
changelog entry.
Add JWKS endpoint to HTTP API for exposing the root public signing keys used for signing workload identity JWTs.

Part 1 of N components as part of making workload identities consumable by third party services such as Consul and Vault. Identity attenuation (audience) and expiration (+renewal) are necessary to securely use workload identities with 3rd parties, so this merge does not yet document this endpoint.

---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>
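A hedged client-side sketch of consuming the endpoint; the `/.well-known/jwks.json` path and default address are assumptions based on common JWKS conventions:

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

// jwks is the subset of an RFC 7517 key set a verifier needs.
type jwks struct {
	Keys []struct {
		Kty string `json:"kty"`
		Kid string `json:"kid"`
		Use string `json:"use"`
	} `json:"keys"`
}

func main() {
	// Fetch the root public signing keys from a Nomad agent; adjust the
	// address for your cluster.
	resp, err := http.Get("http://127.0.0.1:4646/.well-known/jwks.json")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var ks jwks
	if err := json.NewDecoder(resp.Body).Decode(&ks); err != nil {
		log.Fatal(err)
	}
	for _, k := range ks.Keys {
		fmt.Println(k.Kid, k.Kty)
	}
}
```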
Add missing help entry for the `-consul-namespace` flag in `nomad job
run`.
Trusted Supply Chain Component Registry (TSCCR) enforcement starts Monday and an
internal report shows our semgrep action is pinned to a version that's not
currently permitted. Update all the action versions to whatever's the new
hotness to maximize the time-to-live on these until we have automated pinning
set up.

Also version bumps our chromedriver action, which randomly broke upstream today.
This feature is necessary when users want to explicitly re-render all templates on task restart,
e.g. to fetch new secrets from Vault even if the lease on the existing secrets has not expired.
…18100)

In #18054 we introduced a new field `render_templates` in the `restart`
block. Previously changes to the `restart` block were always non-destructive in
the scheduler but we now need to check the new field so that we can update the
template runner. The check assumed that the block was always non-nil, which
causes panics in our scheduler tests.
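A minimal sketch of the nil-safe comparison, with an invented `RestartPolicy` shape:

```go
package main

import "fmt"

type RestartPolicy struct {
	Attempts        int
	RenderTemplates bool
}

// renderTemplatesChanged compares the new field with nil-safety: either
// block may be absent, which is what the scheduler tests tripped over.
func renderTemplatesChanged(a, b *RestartPolicy) bool {
	av, bv := false, false
	if a != nil {
		av = a.RenderTemplates
	}
	if b != nil {
		bv = b.RenderTemplates
	}
	return av != bv
}

func main() {
	fmt.Println(renderTemplatesChanged(nil, &RestartPolicy{RenderTemplates: true})) // true
	fmt.Println(renderTemplatesChanged(nil, nil))                                   // false
}
```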
* Attempt at a varied end-result when sorting and searching

* Consider sort direction as well

* computed property dep update

* prioritizeSearchOrder and test

* Side-effecty but resets sort on search etc

* changelog
The alloc exec and filesystem/logs commands allow passing the `-job` flag to
select a random allocation. If the namespace for the command is set to `*`, the
RPC handler doesn't handle this correctly as it's expecting to query for a
specific job. Most commands handle this ambiguity by first verifying that only a
single object of the type in question exists (ex. a single node or job).

Update these commands so that when the `-job` flag is set we first verify
there's a single job that matches. This also allows us to extend the
functionality to allow for the `-job` flag to support prefix matching.

Fixes: #12097
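An illustrative Go sketch of the verification step (names invented, not the actual CLI code):

```go
package main

import (
	"fmt"
	"strings"
)

// resolveJob mimics the new behavior: with -job and a wildcard namespace,
// first verify exactly one job matches (prefix matching included) before
// picking a random allocation from it.
func resolveJob(jobIDs []string, prefix string) (string, error) {
	var matches []string
	for _, id := range jobIDs {
		if strings.HasPrefix(id, prefix) {
			matches = append(matches, id)
		}
	}
	switch len(matches) {
	case 0:
		return "", fmt.Errorf("no job matches prefix %q", prefix)
	case 1:
		return matches[0], nil
	default:
		return "", fmt.Errorf("prefix %q is ambiguous: %v", prefix, matches)
	}
}

func main() {
	id, err := resolveJob([]string{"web-prod", "web-dev"}, "web-p")
	fmt.Println(id, err) // web-prod <nil>
}
```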
* Bones of a component that has job variable awareness

* Got vars listed woo

* Variables as its own subnav and some pathLinkedVariable perf fixes

* Automatic Access to Variables alerter

* Helper and component to conditionally render the right link

* A bit of cleanup post-template stuff

* testfix for looping right-arrow keynav bc we have a new subnav section

* A very roundabout way of ensuring that, if a job exists when saving a variable with a pathLinkedEntity of that job, it's saved right through to the job itself

* hacky but an async version of pathLinkedVariable

* model-driven and async fetcher driven with cleanup

* Only run the update-job func if jobname is detected in var path

* Test cases begun

* Management token for variables to appear in tests

* It's a management token so it gets to see the clients tab under system jobs

* Pre-review cleanup

* More tests

* Number of requests test and small fix to groups-by-way-of-resource-arrays elsewhere

* Variable intro text tests

* Variable name re-use

* Simplifying our wording a bit

* parse json vs plainId

* Addressed PR feedback, including de-waterfalling
Bumps [word-wrap](https://github.com/jonschlinkert/word-wrap) from 1.2.3 to 1.2.4.
- [Release notes](https://github.com/jonschlinkert/word-wrap/releases)
- [Commits](jonschlinkert/word-wrap@1.2.3...1.2.4)

---
updated-dependencies:
- dependency-name: word-wrap
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
@hc-github-team-nomad-core hc-github-team-nomad-core force-pushed the backport/kschoon/multi-alias-default/largely-well-quagga branch from d3485fe to 8b4adff on August 1, 2023 12:36
@hc-github-team-nomad-core hc-github-team-nomad-core merged commit cdd9d05 into release/1.4.x Aug 1, 2023
@hc-github-team-nomad-core hc-github-team-nomad-core deleted the backport/kschoon/multi-alias-default/largely-well-quagga branch August 1, 2023 12:36