Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

client: enforce max_kill_timeout client configuration #13626

Merged
merged 1 commit into from
Jul 7, 2022

Conversation

shoenig
Copy link
Member

@shoenig shoenig commented Jul 6, 2022

This PR fixes a bug where client configuration max_kill_timeout was
not being enforced. The feature was introduced in 9f44780 but seems
to have been accidentally dropped during the major drivers refactoring.

We can make sure the value is enforced by pluming it through the DriverHandler,
which now uses the lesser of the task::killTimeout or client::maxKillTimeout.
Also updates Event.SetKillTimeout to require both the task.killTimeout and
client.maxKillTimeout so that we don't make the mistake of using the wrong
value - as it was being given only the task.killTimeout before.

Fixes #10005

This PR fixes a bug where client configuration max_kill_timeout was
not being enforced. The feature was introduced in 9f44780 but seems
to have been removed during the major drivers refactoring.

We can make sure the value is enforced by pluming it through the DriverHandler,
which now uses the lesser of the task.killTimeout or client.maxKillTimeout.
Also updates Event.SetKillTimeout to require both the task.killTimeout and
client.maxKillTimeout so that we don't make the mistake of using the wrong
value - as it was being given only the task.killTimeout before.
@shoenig
Copy link
Member Author

shoenig commented Jul 6, 2022

Spot checking

nomad.hcl

client {
  max_kill_timeout = "5s"
}

rexec.hcl

job "rexec" {
  datacenters = ["dc1"]

  group "rexec" {
    task "rexec" {
      driver       = "raw_exec"
      kill_timeout = "1m"

      config {
        command = "/bin/bash"
        args    = ["local/script.sh"]
      }

      template {
        data        = <<EOF
trap 'echo "Received SIGINT"' INT
while true
do
  echo 'Running'
  sleep 1
done
EOF
        destination = "local/script.sh"
      }
    }
  }
}

start nomad with config

➜ make dev && sudo nomad agent -dev -config=/home/shoenig/Work/tickets/max-timeout/nomad.hcl

run job

➜ nomad job run rexec.nomad

stop alloc

➜ nomad alloc stop d9

Task is correctly force-killed after max_kill_timeout

Recent Events:
Time                       Type        Description
2022-07-06T16:48:49-05:00  Killed      Task successfully killed
2022-07-06T16:48:49-05:00  Terminated  Exit Code: 137, Signal: 9
2022-07-06T16:48:43-05:00  Killing     Sent interrupt. Waiting 5s before force killing
2022-07-06T16:46:43-05:00  Started     Task started by client
2022-07-06T16:46:43-05:00  Task Setup  Building Task Directory
2022-07-06T16:46:43-05:00  Received    Task received by client

Making sure we respect the task setting if it is less than max_kill_timeout:

      kill_timeout = "2s"
Recent Events:
Time                       Type        Description
2022-07-06T16:52:04-05:00  Killed      Task successfully killed
2022-07-06T16:52:04-05:00  Terminated  Exit Code: 137, Signal: 9
2022-07-06T16:52:02-05:00  Killing     Sent interrupt. Waiting 2s before force killing
2022-07-06T16:51:15-05:00  Started     Task started by client
2022-07-06T16:51:15-05:00  Task Setup  Building Task Directory
2022-07-06T16:51:15-05:00  Received    Task received by client

@shoenig shoenig added this to the 1.3.2 milestone Jul 6, 2022
@shoenig shoenig added backport/1.2.x backport to 1.1.x release line backport/1.3.x backport to 1.3.x release line labels Jul 6, 2022
@shoenig shoenig marked this pull request as ready for review July 6, 2022 21:56
@shoenig shoenig requested review from lgfa29 and schmichael July 6, 2022 21:56
Copy link
Contributor

@lgfa29 lgfa29 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems to be working well 🎉
image

I think it would be worth adding an upgrade notice. Even though this is how it was always supposed to work, this fix will impact any task with kill_timeout > 30s if the client config is not set.

@lgfa29
Copy link
Contributor

lgfa29 commented Jul 7, 2022

Just notice that it doesn't have the backport/1.1.x label, is there a reason we're not able to backport?

@shoenig shoenig added the backport/1.1.x backport to 1.1.x release line label Jul 7, 2022
@shoenig
Copy link
Member Author

shoenig commented Jul 7, 2022

Added backport/1.1.x! totally just forgot

I'll add the upgrade note in a follow up PR ... don't wanna miss this chance for a perfect score on CI

@github-actions
Copy link

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 23, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
backport/1.1.x backport to 1.1.x release line backport/1.2.x backport to 1.1.x release line backport/1.3.x backport to 1.3.x release line
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Client configuration max_kill_timeout is not used.
3 participants