template: trigger change_mode for dynamic secrets on restore #9636

tgross · 2020-12-14T22:04:24Z

When a task is restored after a client restart, the template runner will
create a new lease for any dynamic secret (ex. Consul or PKI secrets
engines). But because this lease is being created in the prestart hook, we
don't trigger the change_mode.

This changeset uses the the existence of the task handle to detect a
previously running task that's been restored, so that we can trigger the
template change_mode if the template is changed, as it will be only with
dynamic secrets.

Note to reviewers: because this requires a bit of a refactor of the onTemplateRendered so that it can be used for both first-pass and re-render, this PR is probably best reviewed commit-by-commit. #9491 (comment) and the diagram in the comment that follow it can walk you thru the problem.

notnoop

LGTM. I have one question mostly so I understand the low level mechanics of the consul-template library

notnoop · 2020-12-15T14:22:01Z

client/allocrunner/taskrunner/interfaces/lifecycle.go

+	// driver, which is useful for distinguishing restored tasks during
+	// prestart hooks. Note: prestart should be idempotent whenever possible
+	// to handle restored tasks safely; use this as an escape hatch.
+	HasHandle() bool


Is this true only if the task is running? If so, would IsRunning() be a more appropriate name? I also suspect we need to watch out for check-then-act races (e.g. process killed immediately after check) - I suspect in this context, it's fine. Might be worth noting the issue in the comment.

Thanks for pointing that out... it's probably worth me double-checking that if the driver handle goes away we can't NPE in the template runner. At that point we're in a non-working state for this task to be sure but it shouldn't blow up the client process.

I went back-and-forth on the name because we only know that there's a driver handle, not that the process is running and I didn't want future callers to be confused by that. But given the TOCTOU concern you've pointed out, I don't think the caller can assume all that much here except a best effort so I'll change the name and add some commentary about that.

Name changed in 8c66608

notnoop · 2020-12-15T14:27:48Z

client/allocrunner/taskrunner/template/template.go


-			events := tm.runner.RenderEvents()
-			for id, event := range events {
+func (tm *TaskTemplateManager) onTemplateRendered(handledRenders map[string]time.Time, allRenderedTime time.Time) {


👍 on extracting the function. This commit is best viewed with ?w=1: 815f01c?w=1 .

notnoop · 2020-12-15T14:48:06Z

client/allocrunner/taskrunner/template/template_test.go

+	select {
+	case <-harness.mockHooks.RestartCh:
+		t.Fatalf("should not have restarted: %+v", harness.mockHooks)
+	case <-harness.mockHooks.SignalCh:
+		t.Fatalf("should not have restarted: %+v", harness.mockHooks)
+	case <-time.After(time.Duration(1*testutil.TestMultiplier()) * time.Second):
+	}


👍 to testing that process doesn't get restarted if the template doesn't change. However, I'm unclear how consul-template state management handles that? I assumed consul-template library kept template rendition in memory to compare with re-renditions. On nomad agent restart, how does CT library refresh the latest state - does it read the template file and refresh its state?

On nomad agent restart, how does CT library refresh the latest state - does it read the template file and refresh its state?

Yeah, there are effectively 3 chunks of state that CT has:

The source which gets reloaded on every client restart.

The Consul/Vault contents, which gets refreshed on every client restart. That's actually the source of our problem here because for dynamic secrets that's a write which creates a new lease and not a read. Every time this triggers, the RenderEvent.WouldRender is true. And that includes a client restart.

The destination: the CT library then compares the results with the current contents of the destination and the RenderEvent.DidRender is set to false if it didn't actually change on-disk.

So what we're doing is differentiating between the RenderEvent.WouldRender and RenderEvent.DidRender cases to avoid doing extra restarts we don't need.

Also: this is making me realize we should clarify the behavior around changing the source out of band; if you do that while Nomad is running, CT doesn't see it. But then if you restart it will see the change.

I'll have a CHANGELOG and upgrade guide docs PR to go with this (holding off until the giant docs replatform change lands). I'll update the docs on source in that PR.

client/allocrunner/taskrunner/template/template_test.go

tgross · 2020-12-15T20:12:07Z

Move this back into draft as per #9491 (comment)

After a long soak test it looks like the logmon error is a little misleading and the problem really is in the task restart, so I've got to rework the solution I've got in #9636. I haven't quite pinned down the problem yet.

tgross · 2020-12-16T16:33:04Z

Added changelog and upgrade docs.

See #9491 (comment) for why this is now ready-to-go.
Broken UI test most likely blocked by #9649

Pull this function out so it can be used by the first-pass renderer when we restore a task from a restarted client.

In the task hook implementation, `HasHandle` returns true if the task runner has a handle to the task driver, so that we can distinguish restored tasks during prestart hooks without passing around additional state.

When a task is restored after a client restart, the template runner will create a new lease for any dynamic secret (ex. Consul or PKI secrets engines). But because this lease is being created in the prestart hook, we don't trigger the `change_mode`. This changeset uses the the existence of the task handle to detect a previously running task that's been restored, so that we can trigger the template `change_mode` if the template is changed, as it will be only with dynamic secrets.

krishicks

This code is a bit beyond my ability to review in-depth; there's too much that I'm unfamiliar with.

client/allocrunner/taskrunner/interfaces/lifecycle.go

client/allocrunner/taskrunner/template/template.go

When a task is restored after a client restart, the template runner will create a new lease for any dynamic secret (ex. Consul or PKI secrets engines). But because this lease is being created in the prestart hook, we don't trigger the `change_mode`. This changeset uses the the existence of the task handle to detect a previously running task that's been restored, so that we can trigger the template `change_mode` if the template is changed, as it will be only with dynamic secrets.

github-actions · 2022-12-05T02:18:04Z

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

tgross marked this pull request as ready for review December 14, 2020 22:10

tgross requested review from notnoop and krishicks December 14, 2020 22:10

notnoop approved these changes Dec 15, 2020

View reviewed changes

vercel bot temporarily deployed to Preview – nomad December 15, 2020 15:44 Inactive

vercel bot temporarily deployed to Preview – nomad-storybook-and-ui December 15, 2020 15:44 Inactive

tgross force-pushed the b-template-first-render branch from 2da0b1f to 8c66608 Compare December 15, 2020 15:48

vercel bot temporarily deployed to Preview – nomad-storybook-and-ui December 15, 2020 15:48 Inactive

vercel bot temporarily deployed to Preview – nomad December 15, 2020 15:48 Inactive

tgross mentioned this pull request Dec 15, 2020

Nomad doesn’t retain watcher state on restarts which causes secret engine lease issues #9491

Closed

tgross marked this pull request as draft December 15, 2020 20:12

tgross marked this pull request as ready for review December 16, 2020 16:06

tgross force-pushed the b-template-first-render branch from 8c66608 to b40ae67 Compare December 16, 2020 16:33

vercel bot deployed to Preview – nomad December 16, 2020 16:33 View deployment

vercel bot temporarily deployed to Preview – nomad-storybook-and-ui December 16, 2020 16:33 Inactive

tgross requested a review from notnoop December 16, 2020 16:33

tgross added 7 commits December 16, 2020 12:02

template: factor onTemplateRendered out of template re-renderer

43bd410

Pull this function out so it can be used by the first-pass renderer when we restore a task from a restarted client.

template: add HasHandle method to task hook lifecycle interface

0eb32ea

In the task hook implementation, `HasHandle` returns true if the task runner has a handle to the task driver, so that we can distinguish restored tasks during prestart hooks without passing around additional state.

unit test exercising template change_mode for restore

1c34a3d

address code review: use require in template test

77e660d

address code review: change interface method name to IsRunning

745a2b4

changelog and upgrade docs

f357837

tgross force-pushed the b-template-first-render branch from b40ae67 to f357837 Compare December 16, 2020 17:02

vercel bot temporarily deployed to Preview – nomad-storybook-and-ui December 16, 2020 17:02 Inactive

vercel bot deployed to Preview – nomad December 16, 2020 17:02 View deployment

tgross added this to the 1.0.2 milestone Dec 16, 2020

krishicks reviewed Dec 16, 2020

View reviewed changes

client/allocrunner/taskrunner/interfaces/lifecycle.go Outdated Show resolved Hide resolved

client/allocrunner/taskrunner/template/template.go Outdated Show resolved Hide resolved

code review: comment fixes

62bf55f

vercel bot temporarily deployed to Preview – nomad-storybook-and-ui December 16, 2020 18:13 Inactive

vercel bot temporarily deployed to Preview – nomad December 16, 2020 18:13 Inactive

tgross merged commit 004f1c9 into master Dec 16, 2020

tgross deleted the b-template-first-render branch December 16, 2020 18:36

tgross mentioned this pull request Dec 16, 2020

Nomad agent does not notify the task, about the change of dynamic secrets after restart #4226

Closed

idrennanvmware mentioned this pull request Mar 9, 2021

Nomad Agent Restart causes jobs to restart when job has template stanza referencing Vault #10152

Closed

tgross mentioned this pull request Jun 10, 2022

Improvements to Template + Vault during Nomad Client restarts #13313

Open

github-actions bot locked as resolved and limited conversation to collaborators Dec 5, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

template: trigger change_mode for dynamic secrets on restore #9636

template: trigger change_mode for dynamic secrets on restore #9636

tgross commented Dec 14, 2020 •

edited

Loading

notnoop left a comment

notnoop Dec 15, 2020

tgross Dec 15, 2020

tgross Dec 15, 2020 •

edited

Loading

notnoop Dec 15, 2020

notnoop Dec 15, 2020

tgross Dec 15, 2020

tgross Dec 15, 2020

tgross commented Dec 15, 2020

tgross commented Dec 16, 2020 •

edited

Loading

krishicks left a comment

github-actions bot commented Dec 5, 2022

template: trigger change_mode for dynamic secrets on restore #9636

template: trigger change_mode for dynamic secrets on restore #9636

Conversation

tgross commented Dec 14, 2020 • edited Loading

notnoop left a comment

Choose a reason for hiding this comment

notnoop Dec 15, 2020

Choose a reason for hiding this comment

tgross Dec 15, 2020

Choose a reason for hiding this comment

tgross Dec 15, 2020 • edited Loading

Choose a reason for hiding this comment

notnoop Dec 15, 2020

Choose a reason for hiding this comment

notnoop Dec 15, 2020

Choose a reason for hiding this comment

tgross Dec 15, 2020

Choose a reason for hiding this comment

tgross Dec 15, 2020

Choose a reason for hiding this comment

tgross commented Dec 15, 2020

tgross commented Dec 16, 2020 • edited Loading

krishicks left a comment

Choose a reason for hiding this comment

github-actions bot commented Dec 5, 2022

tgross commented Dec 14, 2020 •

edited

Loading

tgross Dec 15, 2020 •

edited

Loading

tgross commented Dec 16, 2020 •

edited

Loading