Backport of Multiple instances of a periodic job are run simultaneously, when prohibit_overlap is true into release/1.4.x #16660

hc-github-team-nomad-core · 2023-03-27T15:25:58Z

Backport

This PR is auto-generated from #16583 to be assessed for backporting due to the inclusion of the label backport/1.4.x.

WARNING: automatic cherry-pick of commits failed. Commits will require human attention.

The below text is copied from the body of the original PR.

This PR addresses the bug reported on #11052

When a leader change happens, the periodic dispatcher on the new leader starts by force-running all periodic jobs, without checking whether an instance of the job is already running.
This PR introduces a check that skips the job if prohibit_overlap is set and an instance is already running.
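
The check described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the `PeriodicJob` type, the `shouldForceLaunch` function, and the `runningChildren` map are hypothetical names made up for this example, and the real dispatcher would consult the state store rather than a plain map.

```go
package main

import "fmt"

// PeriodicJob is a hypothetical, simplified stand-in for a periodic job.
type PeriodicJob struct {
	ID              string
	ProhibitOverlap bool
}

// shouldForceLaunch reports whether a periodic job should be force-launched
// when the dispatcher is restored on a new leader. runningChildren is an
// assumed lookup from parent job ID to the number of child instances that
// are still running.
func shouldForceLaunch(job PeriodicJob, runningChildren map[string]int) bool {
	// Skip the launch when overlap is prohibited and a child is still running.
	if job.ProhibitOverlap && runningChildren[job.ID] > 0 {
		return false
	}
	return true
}

func main() {
	running := map[string]int{"backup": 1}
	jobs := []PeriodicJob{
		{ID: "backup", ProhibitOverlap: true},
		{ID: "report", ProhibitOverlap: true},
	}
	for _, j := range jobs {
		fmt.Printf("%s: launch=%v\n", j.ID, shouldForceLaunch(j, running))
	}
}
```

Jobs that do not set prohibit_overlap are unaffected in this sketch and are launched regardless of running children.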

Bumps [github.com/docker/cli](https://github.com/docker/cli) from 20.10.22+incompatible to 20.10.23+incompatible. - [Release notes](https://github.com/docker/cli/releases) - [Commits](docker/cli@v20.10.22...v20.10.23) --- updated-dependencies: - dependency-name: github.com/docker/cli dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

…15847) Bumps [github.com/hashicorp/vault/api](https://github.com/hashicorp/vault) from 1.8.2 to 1.8.3. - [Release notes](https://github.com/hashicorp/vault/releases) - [Changelog](https://github.com/hashicorp/vault/blob/main/CHANGELOG.md) - [Commits](hashicorp/vault@v1.8.2...v1.8.3) --- updated-dependencies: - dependency-name: github.com/hashicorp/vault/api dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

….1 (#15846) Bumps [github.com/brianvoe/gofakeit/v6](https://github.com/brianvoe/gofakeit) from 6.19.0 to 6.20.1. - [Release notes](https://github.com/brianvoe/gofakeit/releases) - [Commits](brianvoe/gofakeit@v6.19.0...v6.20.1) --- updated-dependencies: - dependency-name: github.com/brianvoe/gofakeit/v6 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

… to 20.10.23+incompatible (#15848) * build(deps): bump github.com/docker/docker Bumps [github.com/docker/docker](https://github.com/docker/docker) from 20.10.21+incompatible to 20.10.23+incompatible. - [Release notes](https://github.com/docker/docker/releases) - [Commits](moby/moby@v20.10.21...v20.10.23) --- updated-dependencies: - dependency-name: github.com/docker/docker dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> * changelog: add entry for docker/docker --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Seth Hoenig <shoenig@duck.com>

* Ensure infra_image gets proper label used for reconciliation Currently infra containers are not cleaned up as part of the dangling container cleanup routine. The reason is that Nomad checks if a container is a Nomad owned container by verifying the existence of the: `com.hashicorp.nomad.alloc_id` label. Ensure we set this label on the infra container as well. * fix unit test * changelog: add entry --------- Co-authored-by: Seth Hoenig <shoenig@duck.com>

… multiple tags (#15962) * docker: set force=true on remove image to handle images referenced by multiple tags This PR changes our call of docker client RemoveImage() to RemoveImageExtended with the Force=true option set. This fixes a bug where an image referenced by more than one tag could never be garbage collected by Nomad. The Force option only applies to stopped containers; it does not affect running workloads. * docker: add note about image_delay and multiple tags

Prior to 2409f72 the code compared the modification index of a job to itself. Afterwards, the code compared the creation index of the job to itself. In either case there should never be a case of re-parenting of allocs causing the evaluation to trivially always result in false, which leads to unreclaimable memory. Prior to this change allocations and evaluations for batch jobs were never garbage collected until the batch job was explicitly stopped. The new `batch_eval_gc_threshold` server configuration controls how often they are collected. The default threshold is `24h`.

The ACL token decoding was not correctly handling time duration syntax such as "1h" which forced people to use the nanosecond representation via the HTTP API. The change adds an unmarshal function which allows this syntax to be used, along with other styles correctly.

#16612) This changeset refactors the tests of the draining node watcher so that we don't mock the node watcher's `Remove` and `Update` methods for its own tests. Instead we'll mock the node watcher's dependencies (the job watcher and deadline notifier) and now unit tests can cover the real code. This allows us to remove a bunch of TODOs in `watch_nodes.go` around testing and clarify some important behaviors: * Nodes that are down or disconnected will still be watched until the scheduler decides what to do with their allocations. This will drive the job watcher but not the node watcher, and that lets the node watcher gracefully handle cases where a heartbeat fails but the node heartbeats again before its allocs can be evicted. * Stop watching nodes that have been deleted. The blocking query for nodes set the maximum index to the highest index of a node it found, rather than the index of the nodes table. This misses updates to the index from deleting nodes. This was done as a performance optimization to avoid excessive unblocking, but because the query is over all nodes anyway, there's no optimization to be had here. Remove the optimization so we can detect deleted nodes without having to wait for an update to an unrelated node.

Implement the new `nomad job restart` command that allows operators to restart allocation tasks or reschedule the entire allocation. Restarts can be batched to target multiple allocations in parallel. Between each batch the command can stop and hold for a predefined time or until the user confirms that the process should proceed. This implements the "Stateless Restarts" alternative from the original RFC (https://gist.github.com/schmichael/e0b8b2ec1eb146301175fd87ddd46180). The original concept is still worth implementing, as it allows this functionality to be exposed over an API that can be consumed by the Nomad UI and other clients. But the implementation turned out to be more complex than we initially expected, so we thought it would be better to release a stateless CLI-based implementation first to gather feedback and validate the restart behaviour. Co-authored-by: Shishir Mahajan <smahajan@roblox.com>

When a disconnected client reconnects, the `allocReconciler` must find the allocations that were created to replace the original disconnected allocations. This process was being done in only a subset of non-terminal untainted allocations, meaning that, if the replacement allocations were not in this state, the reconciler didn't stop them, leaving the job in an inconsistent state. This inconsistency is only solved in a future job evaluation, but at that point the allocation is considered reconnected and so the specific reconnection logic was not applied, leading to unexpected outcomes. This commit fixes the problem by running reconnecting allocation reconciliation logic earlier in the process, leaving the rest of the reconciler oblivious of reconnecting allocations. It also uses the full set of allocations to search for replacements, stopping them even if they are not in the `untainted` set. The `SystemScheduler` is not affected by this bug because disconnected clients don't trigger replacements: every eligible client is already running an allocation.

jrasell and others added 30 commits January 30, 2023 11:00

docs: add ACL concepts page to introduce objects. (#15895)

06e0393

cli: separate auth method config output for easier reading. (#15892)

166aee7

Fix documentation for meta block: string replacement in key from `-…

031765b

…` to `_` (#15940) Signed-off-by: dttung2905 <ttdao.2015@accountancy.smu.edu.sg>

volume: Add the missing option propagation_mode (#15626)

69b08bb

renamed stanza to block for consistency with other projects (#15941)

949a6f6

Rename fields on proxyConfig (#15541)

340ad2d

* Change api Fields for expose and paths * Add changelog entry * changelog: add deprecation notes about connect fields * api: minor style tweaks --------- Co-authored-by: Seth Hoenig <shoenig@duck.com>

docs: Add info about variable item key name restrictions (#15966)

ef3a42c

While you can use any string value for a variable Item's key name using characters that are outside of the set [unicode.Letter, unicode.Number,`_`] will require the `index` function for direct access.
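
The restriction can be illustrated with Go's standard `text/template` package, on which the template language is based. This is only a sketch with made-up key names (`db_password`, `db-password`) and a hypothetical `render` helper; it uses a plain map rather than Nomad's own variable rendering.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// render parses tmpl and executes it against items. It is a hypothetical
// helper for this example only.
func render(tmpl string, items map[string]string) (string, error) {
	t, err := template.New("item").Parse(tmpl)
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	if err := t.Execute(&sb, items); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	items := map[string]string{
		"db_password": "from-underscore-key",
		"db-password": "from-dash-key",
	}

	// Letters, numbers, and underscores work with direct dot access.
	out, err := render(`{{ .db_password }}`, items)
	fmt.Println(out, err)

	// A dash is not valid in dot access, so the template fails to parse.
	_, err = render(`{{ .db-password }}`, items)
	fmt.Println("dot access with dash failed:", err != nil)

	// The index function reaches the same key by its string name.
	out, err = render(`{{ index . "db-password" }}`, items)
	fmt.Println(out, err)
}
```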

Corrected a typo (#15942)

a5f568a

Fix broken link, typo, style edits. (#15968)

ca597f7

Fix typo in documentation (#15970)

ecf5a51

tests: bump consul and vault versions in test-core (#15979)

4a7a721

Increases max variable size to 64KiB from 16KiB (#15983)

c2491e9

docs: removed extra 'end' in one of the code blocks in template stanz…

813fd6e

…a documentation (#15963)

15154/alloc redirect (#15969)

b8bd6bb

* refact: add conditional error handling * test: test conditional logic

docs: add upgrade notice for batch GC changes (#15985)

e23e366

e2e: remove unused consulacls directory (#15995)

d375f60

This pile was deprecated when we started using HCP Consul for e2e instead of standing up our own cluster and managing Consuls at test runtime.

acl: return 400 not 404 code when creating an invalid policy. (#16000)

0052596

consul: restore consul token when reverting a job (#15996)

fcc6cfa

* consul: reset consul token on job during registration of a reversion * e2e: add test for reverting a job with a consul service * cl: fixup cl entry

fix(#13844): canonicalize job to avoid nil pointer dereference (#13845)

67f8f22

job parsing: fix panic when variable validation is missing condition (#…

00d5749

…16018)

changelog: fix entries for #15522 and #15819 (#15998)

41065ef

Allow wildcard datacenters to be specified in job file (#11170)

46f3977

Also allows for default value of `datacenters = ["*"]`

schmichael and others added 15 commits March 21, 2023 14:38

taskapi: use HasSuffix to detect errors from rpcs (#16594)

4d31fd3

Matches the "normal" HTTP error detection logic in the same file.

docs: detail support for Nomad checks in service block. (#16598)

39ec124

Fix broken test for quotas CLI (#16610)

cb9ce8b

* fix: fix broken test * fix: fix broken test for quota status

[ui] Copyable server and client attribute values (#16548)

2a22d71

* Copyable server and client attribute values * Changelog

Post 1.5.2 release (#16614)

1a53d9c

* Generate files for 1.5.2 release * Prepare for next release * add 1.4.7 and 1.3.12 to the changelog --------- Co-authored-by: hc-github-team-nomad-core <github-team-nomad-core@hashicorp.com>

ci: send notification when prepare is complete (#16627)

1061ddd

docs: added section of needed ACL rules for Nomad UI (#16494)

b84c455

style: rename ForceRun to ForceEval, for clarity (#16617)

6626965

Multiple instances of a periodic job are run simultaneously, when pro…

51249fc

…hibit_overlap is true Fixes #11052 When restoring periodic dispatcher, all periodic jobs are forced without checking for previous children.

backport of commit 51249fc

45442a5

Merge 51249fc into backport/b-gh-11052/actually-enormous-mammoth

71a09f5

backport of commit e9850f3

f2a900f

hc-github-team-nomad-core requested a review from a team March 27, 2023 15:25

hc-github-team-nomad-core requested a review from a team as a code owner March 27, 2023 15:25

hc-github-team-nomad-core removed the request for review from a team March 27, 2023 15:25

hc-github-team-nomad-core force-pushed the backport/b-gh-11052/actually-enormous-mammoth branch from 29f5258 to f2a900f Compare March 27, 2023 15:25

hc-github-team-nomad-core requested review from sarahethompson, claire-labry and Juanadelacuesta March 27, 2023 15:26

hc-github-team-nomad-core force-pushed the backport/b-gh-11052/actually-enormous-mammoth branch 2 times, most recently from 6ddd97a to f2a900f Compare March 27, 2023 15:26

Merge branch 'release/1.4.x' into backport/b-gh-11052/actually-enormo…

50e83d7

…us-mammoth

vercel bot deployed to Preview – nomad-storybook-and-ui March 27, 2023 16:30

Update leader.go

729c523

vercel bot deployed to Preview – nomad-storybook-and-ui March 27, 2023 16:36

Juanadelacuesta closed this Mar 28, 2023

Juanadelacuesta deleted the backport/b-gh-11052/actually-enormous-mammoth branch May 10, 2023 14:44
