set killmode/killsignal in systemd example #4305

Merged
merged 2 commits into from
May 21, 2018

insanejudge
Contributor

@insanejudge insanejudge commented May 17, 2018

Fixes #4302

The default kill mode in systemd (control-group) kills every process in the unit's control group, including the associated executors. This leads to the commonly seen behavior of nomad client restarts losing all current allocations.

Setting the KillSignal to SIGINT seems to be a reasonable default, and allows things like leave_on_interrupt to function.
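
For illustration, the two settings described above might look like this in a unit file. This is a minimal sketch, not the file merged in this PR: the binary path and config directory are assumptions, and only KillMode and KillSignal come from the description.

```ini
[Service]
# Assumed install path and config directory; adjust to your environment.
ExecStart=/usr/local/bin/nomad agent -config /etc/nomad.d
# Signal only the main nomad process on stop, leaving executors (and their tasks) running.
KillMode=process
# Send SIGINT instead of the default SIGTERM, so settings like leave_on_interrupt can take effect.
KillSignal=SIGINT
```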

@dadgar
Contributor

dadgar commented May 21, 2018

Thanks @insanejudge

@dadgar dadgar merged commit 56c5ff0 into hashicorp:master May 21, 2018
@onlyjob
Contributor

onlyjob commented Aug 14, 2018

I think KillMode=process can be very dangerous. In my case I need to ensure that no more than one instance of a particular job is running at any time (on any node), because it is an application that accesses a shared network folder assuming exclusive access (without locks). With KillMode=process, when I stop nomad its jobs continue to run, and the other nomad nodes then start the job somewhere else, so more than one instance is running at the same time. Without KillMode=process the job is nicely terminated when nomad is stopped, and the remaining nomad nodes promptly restart the job on another node, as expected.

Certain apps just must not be started more than once with the same shared data folder (e.g. MySQL and MariaDB, where this causes data (binary log) corruption), to avoid race conditions. With KillMode=process, is it possible to guarantee job exclusivity, i.e. that the job is not started elsewhere without first stopping the currently running one?

@insanejudge
Contributor Author

There are excellent facilities in nomad for draining nodes in a controlled way, and you should definitely be doing that instead of unceremoniously killing all of the executors. That is not 'nicely terminated', and you have no guarantee that the job has actually exited before it is rescheduled.

Having a nomad node drain -enable -self as part of an ExecStop= could be argued for as a default, but this would require a corresponding nomad node drain -disable -self in ExecStartPost=, and it starts to become prescriptive about how you're gating scheduling.

In short: with KillMode=process you have control over how scheduling is gated; without it you are forcing an awkward termination of every job whenever the agent restarts for any reason (e.g. an agent upgrade).
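
The ExecStop=/ExecStartPost= idea mentioned above could be sketched as a systemd drop-in like the one below. Treat it as an untested sketch under assumptions: the drop-in path and binary location are hypothetical, and -yes is added on the assumption that the commands run non-interactively. ExecStartPost= may fire before the agent is ready to accept API calls, so a real setup would likely need a retry or readiness check, and TimeoutStopSec= may need raising if drains take longer than the default stop timeout.

```ini
# /etc/systemd/system/nomad.service.d/drain.conf (hypothetical drop-in path)
[Service]
# Before the agent is signalled, drain this node so allocations migrate elsewhere.
# Without -detach the command waits for the drain to complete.
ExecStop=/usr/local/bin/nomad node drain -enable -self -yes
# After the agent starts again, turn drain mode off so the node can receive work.
ExecStartPost=/usr/local/bin/nomad node drain -disable -self -yes
```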

@schmichael
Member

@insanejudge did a good job of describing why KillMode=process is the desired behavior.

> it is an application that accesses a shared network folder assuming exclusive access (without locks)

If you're also using Consul this is a perfect use case for distributed locks / leader election as there are a variety of circumstances in which Nomad could be running a single job more than once: https://www.consul.io/docs/guides/leader-election.html

In the future we're planning on adding the ability to automatically drain-on-shutdown which should also address your use case.

Also please open new issues instead of commenting on old closed pull requests.

Hope this helps! Thanks!

@ahjohannessen

@schmichael Is there anything new regarding automatic drain-on-shutdown for clients? I'm thinking of distros like Flatcar / Fedora CoreOS that reboot automatically. Perhaps until Nomad supports this I could do as @insanejudge suggested:

> Having a nomad node drain -enable -self as part of an ExecStop= could be argued for as a default, but this would require a corresponding nomad node drain -disable -self in ExecStartPost=, and it starts to become prescriptive about how you're gating scheduling.

@github-actions

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 25, 2023