
[question] What is the best way to run a job on all nodes while draining them? #9857

Closed
ghost opened this issue Jan 20, 2021 · 5 comments

@ghost

ghost commented Jan 20, 2021

There is a job I would like to run on all machines from time to time; a simple maintenance script that will ensure the dependencies on my nodes are always up to date. To do this, I will first need to drain all nodes of all my production jobs.

While drained, I would like to run one instance of my maintenance job on every one of the nodes. When this is done, I will allow traffic back.

This question has two parts:

  • Is it possible to veto only a specific type of job when draining, so I can still schedule the maintenance job?
  • What is the best practice to run one instance of a parameterized job across all nodes?

NOTE: I believe a system job is not what I need here. I need to be able to run it on demand, not triggered by any events like the node becoming ready (e.g. a restart).

@tgross
Member

tgross commented Jan 20, 2021

Is it possible to veto only a specific type of job when draining, so I can still schedule the maintenance job?

Normally there are two "knobs" you have: drain and eligibility. Toggling eligibility is useful if you want to prevent scheduling of new tasks without draining the ones that are currently running. The case you have here, though, is that you want to run tasks on a node that isn't otherwise eligible for scheduling, and the scheduler isn't going to place workloads on ineligible nodes.
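
For reference, the two knobs from the CLI look roughly like this (the node ID is a placeholder):

    # Eligibility: stop new placements without touching running allocations
    nomad node eligibility -disable <node-id>
    nomad node eligibility -enable <node-id>

    # Drain: migrate allocations off the node entirely
    nomad node drain -enable <node-id>
    nomad node drain -disable <node-id>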

But what you're trying to do here might be possible with a little bit of cleverness, given the scenario you've described. You could give jobs a constraint on a node's metadata, and change that metadata as part of the update procedure you're running. So something like:

  • drain the node
  • update meta { ready_to_use = "0" }
  • restart the client
  • disable drain on the node
  • run the update job
  • update meta { ready_to_use = "1" }
  • restart the client
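
As a rough sketch of how the pieces fit together (the meta key follows the example above; gating the production jobs on ready_to_use = "1" is an assumption about how you'd structure them), the node meta lives in the client config and the production jobspecs carry the matching constraint:

    # Client agent config on each node (changing it requires a client restart):
    client {
      meta {
        ready_to_use = "1"
      }
    }

    # In each production jobspec, so it only lands on nodes marked ready:
    constraint {
      attribute = "${meta.ready_to_use}"
      value     = "1"
    }

The maintenance job simply omits that constraint, so it can still be placed while ready_to_use is "0".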

What is the best practice to run one instance of a parameterized job across all nodes?

There's a "system batch" scheduler type being worked on in #9160, likely to ship in Nomad 1.1. In the meantime you might be able to work around that with a batch job that has a count equal to the number of nodes and the distinct_hosts constraint.
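
A minimal sketch of that workaround, assuming a 5-node cluster and a hypothetical update script (job and task names are illustrative):

    job "maintenance" {
      type = "batch"

      # Force each of the count instances onto a different node
      constraint {
        operator = "distinct_hosts"
        value    = "true"
      }

      group "maintain" {
        count = 5   # assumption: set to the number of client nodes

        task "update-deps" {
          driver = "raw_exec"
          config {
            command = "/usr/local/bin/update-deps.sh"   # hypothetical script
          }
        }
      }
    }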

@ghost
Author

ghost commented Jan 20, 2021

Thank you, that makes sense.

I’m also wondering if there is a way to do this by shifting part of the administrative responsibilities to the node itself. In that case,

  • the administrative job could be run with highest priority,
  • this job would read the allocation, infer the node ID and mark "itself" as ineligible,
  • wait until all jobs in progress are finished,
  • complete the maintenance and
  • mark itself as eligible again.

If my understanding is correct, this requires that each node have rather broad management permissions on the cluster, but leaving that aside for a second, does this sound like a plausible scenario?

@tgross
Member

tgross commented Jan 21, 2021

If my understanding is correct, this requires that each node have rather broad management permissions on the cluster

You can scope this down a bit by giving the administrative job a Nomad ACL token that has only read-job, list-jobs, and node:write (sourcing the ACL token from a secrets store like Vault or whatever you're using). Those permissions aren't much worse than the information the node itself already has access to.
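
For reference, a minimal sketch of an ACL policy granting just those capabilities (the policy file name and namespace are assumptions):

    # maintenance-policy.hcl
    namespace "default" {
      capabilities = ["read-job", "list-jobs"]
    }

    node {
      policy = "write"
    }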

does this sound like a plausible scenario?

That could definitely work! You're relying on the notion that the jobs you care about will all finish, which is only going to be the case with batch workloads. But if that's the case then you're all set and don't need to worry about draining.
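
As a rough sketch of what that task could do, assuming jq is available in the task environment, the ACL token is exposed as NOMAD_TOKEN, and the actual maintenance step is a hypothetical placeholder:

    #!/usr/bin/env bash
    set -euo pipefail

    # Find the node this allocation is running on via the allocation API.
    NODE_ID=$(curl -sf -H "X-Nomad-Token: ${NOMAD_TOKEN}" \
      "${NOMAD_ADDR:-http://127.0.0.1:4646}/v1/allocation/${NOMAD_ALLOC_ID}" | jq -r '.NodeID')

    # Mark this node ineligible so no new work is placed on it.
    nomad node eligibility -disable "${NODE_ID}"

    # Wait for the other allocations on this node to finish, then run the maintenance.
    # (Polling for completion is elided; the update script is a placeholder.)
    /usr/local/bin/update-deps.sh

    # Mark the node eligible again so production jobs can return.
    nomad node eligibility -enable "${NODE_ID}"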

@ghost
Author

ghost commented Jan 21, 2021

Thank you very much. Once again, very insightful!

ghost closed this as completed Jan 21, 2021