Nomad doesn’t retain watcher state on restarts which causes secret engine lease issues #9491

Closed
gulavanir opened this issue Dec 2, 2020 · 24 comments · Fixed by #9636
Labels: stage/accepted (Confirmed, and intend to work on. No timeline commitment though.), theme/consul-template, theme/vault, type/bug

@gulavanir

Nomad version

v0.12.5

Operating system and Environment details

Linux

Issue

The Consul token is not auto-renewed by Nomad when dynamic secrets are used in a Nomad job template with the Consul secrets engine.

Reproduction steps

Followed the steps specified in https://learn.hashicorp.com/tutorials/nomad/vault-postgres:

  1. Enable the Consul secrets engine
  2. Add a role and policy
  3. Configure the Nomad job template stanza to retrieve a Consul token

The default TTL for this token is 30 days, after which the lease expires and Nomad doesn't auto-renew the token.

Job file (if appropriate)

template {
data = <<EOH
{{ with secret "consul/creds/service-fabio" }}
registry.consul.token = {{ .Data.token }}
{{ end }}
registry.consul.kvpath = /path-to-config
registry.consul.noroutehtmlpath = /fabio/noroute.html
proxy.addr =
ui.addr =
registry.consul.register.addr =
metrics.target =
metrics.statsd.addr =
metrics.prefix =
metrics.names =
EOH
destination = "secrets/fabio.properties"
}

Nomad logs

agent.http: Request error: method=GET url=/v1/kv/fabio/noroute.html?consistent=&index=1143000&recurse= from=127.0.0.1:60596 error="rpc error making call

@idrennanvmware
Contributor

@gulavanir - I was able to get this to work by setting an aggressive renewal time.

Basically I configured the Consul secrets engine to NOT use the default of 32 days by running

vault secrets tune -default-lease-ttl=30s -max-lease-ttl=1m consul/

I was able to see Nomad/consul-template refreshing the token every few seconds (given the aggressive expiry I set).

With the above settings I could see Nomad refresh the token, and then when it reached the max-lease-ttl value it deprecated the old token and issued a new one (which then caused Nomad to re-render the template and start the container).

I am wondering if what you're seeing has to do with a watcher failing (since the renew time is so long by default) and never recovering?

@gulavanir gulavanir changed the title Nomad job doesn't renew consul token generated using nomad job templates and consul secrets engine CRITICAL: Nomad job doesn't renew consul token generated using nomad job templates and consul secrets engine Dec 3, 2020
@idrennanvmware
Contributor

idrennanvmware commented Dec 3, 2020

@fredwangwang was able to reproduce this with some pretty catastrophic results. I've asked him to update this thread, but here's his comment on the Hashicorp Discuss forums https://discuss.hashicorp.com/t/nomad-job-doesnt-renew-consul-token-generated-using-nomad-job-templates-and-consul-secrets-engine/18316/9

@tgross - this is a nasty one

@fredwangwang
Contributor

fredwangwang commented Dec 3, 2020

tl;dr: state loss during a Nomad restart causes a token lease leak in the template

link to discussion for context: https://discuss.hashicorp.com/t/nomad-job-doesnt-renew-consul-token-generated-using-nomad-job-templates-and-consul-secrets-engine/18316/9

The issue is that during a Nomad restart, Nomad loses its state. The new Nomad client will:

  1. start the job again, assuming it's new
  2. render the new template with a new token lease
  3. put a watcher on the new token lease
  4. start the container (which was already running before the Nomad restart, but the new Nomad client does not know this because it does not have that state)

However, at this point the configuration (and hence the token) held by the process running inside the container is still the old one. Once the old token lease expires, the process starts to fail.

Logs (after nomad restart):

2020/12/03 05:30:24.733069 [INFO] (runner) creating watcher             ### nomad creates a new watcher, unaware of the previous state
2020/12/03 05:30:24.733194 [INFO] (runner) starting
2020/12/03 05:30:24.733247 [DEBUG] (runner) running initial templates
2020/12/03 05:30:24.733311 [DEBUG] (runner) initiating run                 ### nomad renders the new template assuming a new job start, however the container was already running
2020/12/03 05:30:24.733362 [DEBUG] (runner) checking template 8bac18e0a2e861cfa119af4401831db6
2020/12/03 05:30:24.733883 [DEBUG] (runner) missing data for 1 dependencies
2020/12/03 05:30:24.733984 [DEBUG] (runner) missing dependency: vault.read(consul/creds/service-fabio)
2020/12/03 05:30:24.734036 [DEBUG] (runner) add used dependency vault.read(consul/creds/service-fabio) to missing since isLeader but do not have a watcher
2020/12/03 05:30:24.734103 [DEBUG] (runner) was not watching 1 dependencies
2020/12/03 05:30:24.734174 [DEBUG] (watcher) adding vault.read(consul/creds/service-fabio)
2020/12/03 05:30:24.734257 [TRACE] (watcher) vault.read(consul/creds/service-fabio) starting

@gulavanir gulavanir changed the title CRITICAL: Nomad job doesn't renew consul token generated using nomad job templates and consul secrets engine Catastrophic: Nomad doesn’t retain watcher state on restarts which causes secret engine lease issues Dec 3, 2020
@tgross tgross changed the title Catastrophic: Nomad doesn’t retain watcher state on restarts which causes secret engine lease issues Nomad doesn’t retain watcher state on restarts which causes secret engine lease issues Dec 3, 2020
@tgross
Member

tgross commented Dec 3, 2020

I think this is hitting a case that the long-open PR #4226 is attempting to fix, but for which we didn't have enough context to understand what was going on. I'm not sure why that hasn't had further review yet, but we should make sure it gets onto our radar.

Also folks, please let's avoid editorializing in issue titles.

@idrennanvmware
Contributor

Ahh - thanks @tgross - good point on the title, and noted.

Seems like the long-open PR is related. We can reliably replicate it at this point, so our team can provide any details you need.

@fredwangwang
Contributor

@tgross was thinking something similar to #4226 (comment) last night. It works; however, it will trigger a coordinated restart, every time Nomad restarts, of all the jobs on that particular node whose templates get re-rendered.
That is fine for use cases where only a small number of jobs have dynamic credentials, but it would be a non-insignificant impact on the system if almost all jobs running on that node restart at the same time.

So it is a quick patch for this urgent problem, but it feels to me like the proper fix is for Nomad to persist state across restarts.

@tgross
Member

tgross commented Dec 8, 2020

Some "thinking out loud" after some initial investigation:

  • It looks like we do persist a token to the allocation's secrets directory in the vault hook: vault_hook.go#L129-L143, which gets used during the restoration of the task runner during a client restart (a simplified sketch of this pattern follows after this list).
  • The RenewToken should be getting a renewal from that token. I haven't yet identified if that's failing or what.
  • I still need to dig into the interaction between the consul-template runner and the vault hook.
  • If the token can't be renewed, I'm not sure we have much choice but to restart the tasks after a client restart, as disruptive as that would be. So we should try to avoid that situation for sure.
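
As a reference for the first bullet, here is a minimal sketch of the persist-on-prestart / read-on-restore pattern, assuming a plain file in the allocation's secrets directory. This is not the actual vault_hook.go code; the file name, helper names, and error handling here are illustrative only:

package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// persistToken writes the task's Vault token into the allocation's secrets
// directory so it can be found again after a client restart.
// (Illustrative only; the real hook has more bookkeeping.)
func persistToken(secretsDir, token string) error {
	return os.WriteFile(filepath.Join(secretsDir, "vault_token"), []byte(token), 0o600)
}

// recoverToken reads a previously persisted token back during task-runner
// restore. An empty result means nothing was persisted and a fresh token
// must be derived from the server.
func recoverToken(secretsDir string) (string, error) {
	b, err := os.ReadFile(filepath.Join(secretsDir, "vault_token"))
	if errors.Is(err, fs.ErrNotExist) {
		return "", nil
	}
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	dir, err := os.MkdirTemp("", "secrets")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	_ = persistToken(dir, "s.example-token")
	tok, _ := recoverToken(dir)
	fmt.Println("recovered token:", tok)
}

Note that this only covers Nomad's own token; as the later comments show, it does not cover the secrets that consul-template renders into templates.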

@tgross tgross self-assigned this Dec 8, 2020
@tgross
Member

tgross commented Dec 9, 2020

Wanted to leave a status update on the investigation so far. There are a few layers to this problem as presented that make it a little harder to understand:

  • The Nomad client asks the server for its Vault and Consul tokens.
  • The Nomad client passes these tokens to our embedded consul-template runner so it can talk to Consul and Vault.
  • consul-template accesses Vault's Consul secrets engine to fetch a Consul ACL token for direct use by our workload.

When Nomad persists a token for the task runner in vault_hook.go#L129-L143, that's Nomad's token. The secrets for anything we render to a template have no saved state, and that's the TTL that's expiring between restarts. This is effectively the same thing that would happen if we ran consul-template in watch mode from the command line, stopped it, and then ran it again. Even if we had a root Vault token here, it wouldn't help if that were the problem.

That also means we should be able to reproduce this with a simpler setup for any dynamic secret, like an x509 cert. Fortunately we have an E2E test we can borrow for that.

setup and configuration files

Single-node Vault config:

listener "tcp" {
  address     = "0.0.0.0:8200"
  tls_disable = 1
}

storage "raft" {
  path    = "/var/vault/raft"
  node_id = "raft_node_1"
}

api_addr = "http://127.0.0.1:8200"
cluster_addr = "https://127.0.0.1:8201"
ui = true

Bootstrap Vault with Nomad policy:

export VAULT_ADDR="http://127.0.0.1:8200"
vault operator init -key-shares=1 -key-threshold=1
vault operator unseal
export VAULT_TOKEN=<redacted>
vault policy write nomad-server ./e2e/terraform/scripts/vault-nomad-server-policy.hcl
vault write /auth/token/roles/nomad-cluster @./e2e/terraform/scripts/vault-nomad-cluster-role.json

vault token create -policy nomad-server -period 72h -orphan
# get the resulting token for our Nomad config

Single-node Nomad config:

name = "standalone"

log_level = "DEBUG"
enable_debug = true
data_dir = "/var/nomad"

server {
  enabled          = true
  bootstrap_expect = 1
}

client {
  enabled = true
}

plugin "docker" {
  config {
    volumes {
      enabled = true
    }
  }
}

vault {
  enabled          = true
  address          = "http://localhost:8200"
  create_from_role = "nomad-cluster"
  token            = "<redacted>"
}

Configure PKI engine

pki-policy.hcl file:

path "pki-test9491/issue/nomad" {
  capabilities = ["create", "update", "read"]
}

vault secrets enable -path=pki-test9491 pki
vault write pki-test9491/root/generate/internal common_name=service.consul ttl=168h
vault write pki-test9491/roles/nomad \
    allowed_domains=service.consul \
    allow_subdomains=true \
    generate_lease=true \
    max_ttl=1m
vault secrets tune -max-lease-ttl=1m pki-test9491
vault policy write pki-test9491 ./pki-policy.hcl

jobspec

job "example" {
  datacenters = ["dc1"]

  group "group" {

    task "task" {

      driver = "docker"

      config {
        image   = "busybox:1"
        command = "httpd"
        args    = ["-v", "-f", "-p", "8001", "-h", "/www"]
        volumes = ["/tmp/www:/www:ro"]
      }

      vault {
        policies = ["pki-test9491"]
      }

      template {
        data = <<EOT
{{ with secret "pki-test9491/issue/nomad" "common_name=nomad.service.consul" "ip_sans=127.0.0.1" }}
{{- .Data.certificate -}}
{{ end }}
EOT

        destination = "${NOMAD_SECRETS_DIR}/certificate.crt"
        change_mode = "noop"
      }

      resources {
        cpu    = 128
        memory = 64
      }
    }

  }
}

Note in our jobspec that the change_mode is "noop". Let's verify the normal secret renewal behavior works as we'd expect:

$ nomad job run ./example.nomad
==> Monitoring evaluation "8ff50f77"
    Evaluation triggered by job "example"
==> Monitoring evaluation "8ff50f77"
    Evaluation within deployment: "0aee964a"
    Allocation "d9b1a5c8" created: node "53d42465", group "group"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "8ff50f77" finished with status "complete"

$ nomad alloc status d9b
...
Recent Events:
Time                  Type        Description
2020-12-09T19:19:58Z  Started     Task started by client
2020-12-09T19:19:57Z  Task Setup  Building Task Directory
2020-12-09T19:19:57Z  Received    Task received by client

# fingerprint the cert
$ nomad alloc exec d9b cat /secrets/certificate.crt  | md5sum
2e5c01b4bf659e298664a88bf3798663  -

# wait until the TTL expires (plus some room), and verify the cert has changed
$ sleep 90 && nomad alloc exec d9b cat /secrets/certificate.crt  | md5sum
7cae1b2510b24b02a0194ca6a231239e  -

Now let's test restarting Nomad and checking that the certificate is renewed when we expect:

test1.sh
#!/bin/bash

alloc_id=$1

now=$(date +%s)
echo "now: $now"
expiry=$(nomad alloc exec $alloc_id cat /secrets/certificate.crt |
    openssl x509 -noout -enddate |
    cut -d= -f2 |
    xargs -I^ date -d '^' +%s)
echo "expires at: $expiry"

echo "restarting Nomad..."
sudo systemctl restart nomad

now=$(date +%s)
sleep_for=$(( $expiry - $now ))
echo "waiting ${sleep_for}s for cert to expire..."
sleep $sleep_for

new_expiry=$(nomad alloc exec $alloc_id cat /secrets/certificate.crt |
    openssl x509 -noout -enddate |
    cut -d= -f2 |
    xargs -I^ date -d '^' +%s)
echo "new expiry: $new_expiry"

if [[ "$new_expiry" != "$expiry" ]]; then
    echo "certificate was renewed!"
fi

Results look good:


$ ./test1.sh d9b
now: 1607544427
expires at: 1607544475
restarting Nomad...
waiting 48s for cert to expire...
new expiry: 1607544493
certificate was renewed!

Now we'll test stopping Nomad, waiting for expiry, and then restarting Nomad. In this test we wait to stop Nomad so that we're not risking the allocation being marked as lost and confounding the problem.

test2.sh
#!/bin/bash

alloc_id=$1

start=$(date +%s)
echo "now: $start"
expiry=$(nomad alloc exec $alloc_id cat /secrets/certificate.crt |
    openssl x509 -noout -enddate |
    cut -d= -f2 |
    xargs -I^ date -d '^' +%s)
echo "expires at: $expiry"

sleep_for=$(( $expiry - $start ))
if [[ "$sleep_for" -gt "30" ]]; then
    echo "waiting 30s to stop nomad..."
    sleep 30
else
    echo "only ${sleep_for}s left till expiry, stopping now..."
fi

echo "stopping Nomad..."
sudo systemctl stop nomad

start=$(date +%s)
sleep_for=$(( $expiry - $start ))

if [[ "$sleep_for" -gt "0" ]]; then
    echo "waiting ${sleep_for}s for cert to expire, +5s..."
    sleep $sleep_for
    sleep 5
fi

echo "restarting nomad..."
sudo systemctl start nomad

# wait for nomad to be live again before querying it
while :
do
    nomad status 2> /dev/null && break
    sleep 1
done

new_expiry=$(nomad alloc exec $alloc_id cat /secrets/certificate.crt |
    openssl x509 -noout -enddate |
    cut -d= -f2 |
    xargs -I^ date -d '^' +%s)
echo "new expiry: $new_expiry"

if [[ "$new_expiry" != "$expiry" ]]; then
    echo "certificate was renewed!"
else
    echo "certificate was not renewed!"
fi

The results look good there too!

$ ./test2.sh d9b
now: 1607546652
expires at: 1607546709
waiting 30s to stop nomad...
stopping Nomad...
waiting 26s for cert to expire, +5s...
restarting nomad...
ID       Type     Priority  Status   Submit Date
example  service  50        running  2020-12-09T20:41:46Z
new expiry: 1607546776
certificate was renewed!

Logs, for posterity.

logs
    2020/12/09 20:45:16.365188 [INFO] (runner) creating watcher
    2020/12/09 20:45:16.365289 [INFO] (runner) starting
    2020/12/09 20:45:16.365296 [DEBUG] (runner) running initial templates
    2020/12/09 20:45:16.365299 [DEBUG] (runner) initiating run
    2020/12/09 20:45:16.365302 [DEBUG] (runner) checking template 9b5c5086546eb25832bf8367ee8a95da
    2020/12/09 20:45:16.365560 [DEBUG] (runner) missing data for 1 dependencies
    2020/12/09 20:45:16.365571 [DEBUG] (runner) missing dependency: vault.write(pki-test9491/issue/nomad -> 6d0de468)
    2020/12/09 20:45:16.365634 [DEBUG] (runner) add used dependency vault.write(pki-test9491/issue/nomad -> 6d0de468) to missing since isLeader but do not have a wat
    2020/12/09 20:45:16.365670 [DEBUG] (runner) was not watching 1 dependencies
    2020/12/09 20:45:16.365676 [DEBUG] (watcher) adding vault.write(pki-test9491/issue/nomad -> 6d0de468)
    2020/12/09 20:45:16.365687 [DEBUG] (runner) diffing and updating dependencies
    2020/12/09 20:45:16.365691 [DEBUG] (runner) watching 1 dependencies
    2020/12/09 20:45:16.473485 [DEBUG] (runner) receiving dependency vault.write(pki-test9491/issue/nomad -> 6d0de468)
    2020/12/09 20:45:16.473554 [DEBUG] (runner) initiating run
    2020/12/09 20:45:16.473562 [DEBUG] (runner) checking template 9b5c5086546eb25832bf8367ee8a95da
    2020/12/09 20:45:16.474122 [DEBUG] (runner) rendering "(dynamic)" => "/var/nomad/alloc/d9b1a5c8-741d-228a-3b8f-c81118041b72/task/secrets/certificate.crt"
    2020/12/09 20:45:16.474241 [INFO] (runner) rendered "(dynamic)" => "/var/nomad/alloc/d9b1a5c8-741d-228a-3b8f-c81118041b72/task/secrets/certificate.crt"
    2020/12/09 20:45:16.474249 [DEBUG] (runner) diffing and updating dependencies
    2020/12/09 20:45:16.474256 [DEBUG] (runner) vault.write(pki-test9491/issue/nomad -> 6d0de468) is still needed
    2020/12/09 20:45:16.474262 [DEBUG] (runner) watching 1 dependencies
    2020/12/09 20:45:16.474267 [DEBUG] (runner) all templates rendered

So we see the secret being renewed as expected. But what about the allocation itself? Let's change the change_mode to restart:

$ ./test2.sh eb0
now: 1607547812
expires at: 1607547841
only 29s left till expiry, stopping now...
stopping Nomad...
waiting 28s for cert to expire, +5s...
restarting nomad...
ID       Type     Priority  Status   Submit Date
example  service  50        running  2020-12-09T21:02:56Z
new expiry: 1607547910
certificate was renewed!

But the task isn't being restarted:

$ nomad alloc status eb0
...
Recent Events:
Time                  Type        Description
2020-12-09T21:03:02Z  Started     Task started by client
2020-12-09T21:03:01Z  Task Setup  Building Task Directory
2020-12-09T21:02:56Z  Received    Task received by client

If we wait a minute for that new cert to expire, we're back in business:

$ nomad alloc status eb0
...
Recent Events:
Time                  Type              Description
2020-12-09T21:05:17Z  Started           Task started by client
2020-12-09T21:05:14Z  Restarting        Task restarting in 0s
2020-12-09T21:05:14Z  Terminated        Exit Code: 137, Exit Message: "Docker container exited with non-zero exit code: 137"
2020-12-09T21:05:09Z  Restart Signaled  Template with change_mode restart re-rendered
2020-12-09T21:03:02Z  Started           Task started by client
2020-12-09T21:03:01Z  Task Setup        Building Task Directory
2020-12-09T21:02:56Z  Received          Task received by client

So it appears that the problem is not that Nomad is losing access to the secret engine, or that consul-template isn't trying to renew the secret when it's supposed to. It's that when we restart the Nomad client we don't treat a change to the secret as a trigger for a task restart (or signal). We have a new secret, but haven't told the workload that there is one. I feel reasonably certain about this conclusion but don't yet have an answer as to how to fix it. That'll be what I dig into tomorrow morning.

@tgross
Member

tgross commented Dec 10, 2020

Some diagrams summarizing the problem... mostly for me and so I can share this with colleagues to figure out the best fix.

Here's how template rendering with dynamic secrets is supposed to work:

[diagram GH9491_001: the expected flow, in which the re-rendered secret triggers the template's change_mode for the task]

Here's how it's working when the client is restarted. Note that the template runner is getting the new secret, but we never trigger the change_mode, so at some point later, when the TTL of the original secret expires, our task will be unhappy:

[diagram GH9491_002: the flow after a client restart, in which the new secret is rendered but change_mode never fires]

@idrennanvmware
Contributor

idrennanvmware commented Dec 10, 2020

This is fantastic stuff @tgross - thanks for taking the time to write such a thorough analysis. I need to read it a few times to absorb it all, but the first pass makes sense to me, I understand what you're saying, and it clearly lines up with the behavior we have seen.

@tgross
Member

tgross commented Dec 10, 2020

After doing a bit of smoke testing and printf debugging, it looks like the RenderEvent we get from consul-template after rendering gives us the hooks we'd need here.

// WouldRender determines if the template would have been rendered. A template
// would have been rendered if all the dependencies are satisfied, but may
// not have actually rendered if the file was already present or if an error
// occurred when trying to write the file.
WouldRender bool

// DidRender determines if the Template was actually written to disk. In dry
// mode, this will always be false, since templates are not written to disk
// in dry mode. A template is only rendered to disk if all dependencies are
// satisfied and the template is not already in place with the same contents.
DidRender bool

The first time a template is rendered, these fields will both be true. After a client restart, if the secret hasn't changed DidRender will be false. But because a dynamic secret is always changed (because it's a write and not a read), DidRender will be true after a client restart if the task has a dynamic secret. So that can give the task runner the information we need to know that we need to trigger the change_mode of the task, even if we're in the prestart. That does mean that tasks with dynamic secrets would always be restarted if the client restarts. That's not ideal but it at least prevents a situation where the secret becomes suddenly stale and causes application-level issues.
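
To make that concrete, here's a minimal sketch of the decision those two fields enable. This is not the actual Nomad patch; the struct only mirrors the two consul-template fields quoted above, and the helper name and the taskAlreadyRunning flag are illustrative:

package main

import "fmt"

// renderEvent mirrors the two consul-template RenderEvent fields quoted above.
type renderEvent struct {
	WouldRender bool // all dependencies were satisfied
	DidRender   bool // contents were actually written to disk (i.e. they changed)
}

// shouldTriggerChange decides whether the template's change_mode needs to
// fire. On the very first render of a brand-new task there is no running
// workload to notify, but if the task was already running before a client
// restart and the rendered contents changed (always the case for a dynamic
// secret, since fetching it is a write rather than a read), the change_mode
// must fire even though we're still in the prestart path.
func shouldTriggerChange(ev renderEvent, taskAlreadyRunning bool) bool {
	return ev.WouldRender && ev.DidRender && taskAlreadyRunning
}

func main() {
	ev := renderEvent{WouldRender: true, DidRender: true}
	fmt.Println(shouldTriggerChange(ev, false)) // fresh task: false, nothing to notify yet
	fmt.Println(shouldTriggerChange(ev, true))  // client restart + dynamic secret: true
}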

@idrennanvmware
Contributor

idrennanvmware commented Dec 10, 2020

"That does mean that tasks with dynamic secrets would always be restarted if the client restarts."

FWIW that's exactly the hack we are putting in place. We're using a consul-template task in jobs with secrets to watch a KV entry for the node (which we change every time a Nomad agent restarts), then using that to restart the container. Effectively the same thing, but it impacts all our jobs since each one needs to implement this hack. I can say for sure that we would rather have Nomad restart our containers with secrets than have to handle that ourselves in our jobs :)

@fredwangwang
Contributor

Hi @tgross, thanks for this great analysis! Regarding the potential solution in this comment ("the RenderEvent that we get from consul-template after the rendering seems to give us the hooks we'd need here"), it seems like that's pretty much what #4226 (comment) has implemented, is that correct?

@tgross
Member

tgross commented Dec 11, 2020

Yes. But that patch is against an older version of Nomad that has since seen a good bit of change. #6324 from the same author is more recent (although it doesn't merge cleanly with the current HEAD). There are some problems in the details of that approach:

  • The on-change handling is copied from handleTemplateRerenders but with some small tweaks, which is likely to cause a maintenance burden as the implementations drift apart.
  • The way the restore state is saved is not correct (because prestart hooks can fail).
  • Most importantly: in that implementation all tasks with templates will get restarted on client restart, which we don't need to do; we only need to restart tasks with dynamic secrets.

So rather than trying to inject state into the hooks that bridges restarts, I'm working up a patch that detects whether there's an existing driver handle present (which covers a wider range of edge cases). Then I'm factoring out the on-change handling so that both the handleFirstRender and handleTemplateRerenders can use it.
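
For illustration, here's a rough sketch of what that factoring-out could look like. This is not the code in the working branch; the interface, type, and function names are made up for the example:

package main

import "fmt"

// lifecycle stands in for the task-runner operations a template hook can invoke.
type lifecycle interface {
	Signal(sig string) error
	Restart() error
}

// onTemplateChange is the shared on-change handler: both the first-render
// path (taken when a driver handle already exists because the task survived
// a client restart) and the re-render path call the same function, so the
// two implementations cannot drift apart.
func onTemplateChange(changeMode, changeSignal string, task lifecycle) error {
	switch changeMode {
	case "signal":
		return task.Signal(changeSignal)
	case "restart":
		return task.Restart()
	default:
		return nil // "noop": render only
	}
}

// fakeTask just prints what would happen, for demonstration.
type fakeTask struct{}

func (fakeTask) Signal(sig string) error { fmt.Println("would signal:", sig); return nil }
func (fakeTask) Restart() error          { fmt.Println("would restart"); return nil }

func main() {
	task := fakeTask{}
	// First render after a client restart, task has a dynamic secret and change_mode = "restart":
	_ = onTemplateChange("restart", "", task)
	// Later re-render of a template using change_mode = "signal":
	_ = onTemplateChange("signal", "SIGHUP", task)
}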

@fredwangwang
Contributor

in that implementation all tasks with templates will get restarted on client restart, which we don't need to do

💯 agree, thanks for the update!

@tgross tgross added the stage/accepted label and removed the stage/needs-investigation label Dec 11, 2020
@tgross
Member

tgross commented Dec 11, 2020

My working branch for the patch is https://github.com/hashicorp/nomad/compare/b-template-first-render. I've only smoke-tested this locally and haven't written automated testing to explore the edge cases (so definitely don't run this build anywhere you care about! 😀 ). Hoping to wrap that up early next week.

@gulavanir
Author

That's great. Thanks @tgross for the detailed write-up and quick turnaround.

@tgross
Member

tgross commented Dec 14, 2020

That branch works fine in automated unit testing, and when I ran it through an end-to-end test the change signal is triggered as we wanted. This works perfectly with change_mode = "signal".

But with change_mode = "restart", the task restart causes a problem with other task hooks: the restart fails and then Nomad reschedules the allocation. That's not the behavior we want here. I suspect we didn't really design the prestart hooks to allow for restarts within the prestart process, as opposed to simply exiting and getting unwound by the task runner. The logmon error we're hitting here happens before the template hook, which implies some interesting ordering issues. That'll be the next thing I dig into.

Alloc status for the failed alloc looks like:

Recent Events:
Time                  Type              Description
2020-12-14T19:35:33Z  Alloc Unhealthy   Unhealthy because of failed task
2020-12-14T19:35:33Z  Not Restarting    Error was unrecoverable
2020-12-14T19:35:33Z  Task hook failed  logmon: rpc error: code = Unknown desc = failed to create stdout logfile for "task.stdout": open /var/nomad/alloc/25c75368-4a19-8796-7767-5905dae4e76d/alloc/logs: no such file or directory
2020-12-14T19:35:16Z  Restarting        Task restarting in 16.491523779s
2020-12-14T19:35:16Z  Terminated        Exit Code: 137, Exit Message: "Docker container exited with non-zero exit code: 137"
2020-12-14T19:35:12Z  Started           Task started by client
2020-12-14T19:35:12Z  Task Setup        Building Task Directory
2020-12-14T19:35:12Z  Received          Task received by client
2020-12-14T19:34:07Z  Started           Task started by client
2020-12-14T19:34:06Z  Task Setup        Building Task Directory

Edit: ☹️ I've only hit the above problem the one time and haven't been able to reproduce it. Which may mean it's racy or it may mean that there's a completely unrelated bug in the logmon code.
Edit 2: Doing some soak testing of this to make sure but it looks like the logmon thing is unrelated.

@idrennanvmware
Contributor

"Edit: ☹️ I've only hit the above problem the one time and haven't been able to reproduce it. Which may mean it's racy or it may mean that there's a completely unrelated bug in the logmon code.
Edit 2: Doing some soak testing of this to make sure but it looks like the logmon thing is unrelated."

Better now than later! Appreciate the updates. We are tracking this one really closely

@tgross
Member

tgross commented Dec 15, 2020

After a long soak test it looks like the logmon error is a little misleading and the problem really is in the task restart, so I've got to rework the solution I've got in #9636. I haven't quite pinned down the problem yet.

(The fun bit of this is that turning up logging to trace changes the timing and therefore the symptoms!)

@tgross
Member

tgross commented Dec 16, 2020

Ok, I have that conclusively figured out now after a whole lot of log grubbing. It's mostly an artifact of the test rig: the PKI cert expires after 1 minute, so my test rig has to line up the restart such that it lands before the cert expires. But that timing requirement can cause the client to be shut down long enough that when it comes back up, the server thinks the allocation is lost and tries to reschedule it. (I'd missed the single log line from the server asking for the alloc to stop among the ~30k entries. 🤦 )

There's a kind of gross but mostly unavoidable scenario here: because I'm testing with a single client, the scheduler will always place the allocation on the same host, and when that happens right on top of a client restart we can be in the middle of trying to start the alloc while the shutdown notice from the server arrives, which results in a task-runner hook state that isn't viable, so we're forced to reschedule again. It looks like there's a narrow possibility of running into this any time the client is shut down long enough to be marked lost while the task is actually still running. But the reschedule is everything working as intended from that state.

@tgross tgross added this to the 1.0.2 milestone Dec 16, 2020
@tgross
Member

tgross commented Dec 16, 2020

Fixed in #9636, which will ship in the upcoming Nomad 1.0.2.

@idrennanvmware
Contributor

This is great stuff @tgross, thanks for all the work on it - we will definitely be upgrading to 1.0.2 and will be using this :)

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 26, 2022