client: prevent start on cgroups init error #19915

Merged on Feb 9, 2024 (2 commits)

@lgfa29 lgfa29 commented Feb 7, 2024

The Nomad client expects certain cgroups paths to exist in order to manage tasks. These paths are created when the agent first starts. Previously, if that creation step failed, the agent would only log the error and continue its initialization, even though it could not run tasks.

This commit surfaces the errors back to the client initialization so that the process stops early and makes it clear to operators that something went wrong.
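
The pattern described above, returning the setup error to the caller instead of only logging it, can be sketched in Go roughly as follows. This is a minimal illustration and not the code from this PR: `initCgroups`, `newClient`, and the `demo.slice/...` path names are hypothetical placeholders.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// initCgroups is a hypothetical stand-in for the agent's cgroup setup step.
// It creates each required directory under root and returns the first
// failure to the caller instead of only logging it.
func initCgroups(root string, names ...string) error {
	for _, name := range names {
		path := filepath.Join(root, name)
		if err := os.MkdirAll(path, 0o755); err != nil {
			return fmt.Errorf("failed to create cgroup path %q: %w", path, err)
		}
	}
	return nil
}

// newClient is a hypothetical constructor. Because it propagates the setup
// error, the caller can refuse to start rather than run a client that
// cannot manage tasks.
func newClient(cgroupRoot string) error {
	// The path names below are placeholders chosen for illustration.
	if err := initCgroups(cgroupRoot, "demo.slice/share", "demo.slice/reserve"); err != nil {
		return fmt.Errorf("client init: %w", err)
	}
	return nil
}

func main() {
	// /dev/null is not a directory, so creating paths beneath it fails.
	if err := newClient("/dev/null/cgroup"); err != nil {
		// A real agent would exit non-zero here; the sketch only reports it.
		fmt.Println("refusing to start:", err)
		return
	}
	fmt.Println("client started")
}
```

Wrapping with `%w` keeps the underlying OS error available to callers via `errors.Is` and `errors.As`, which helps operators see why the path could not be created.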

Closes #19847

@tgross tgross left a comment

LGTM

@lgfa29 lgfa29 added the backport/1.7.x label (backport to 1.7.x release line) Feb 9, 2024
@lgfa29 lgfa29 merged commit db5ffde into main Feb 9, 2024
20 of 21 checks passed
@lgfa29 lgfa29 deleted the b-prevent-start-on-cgroup-error branch February 9, 2024 18:45
nvanthao pushed a commit to nvanthao/nomad that referenced this pull request Mar 1, 2024
lgfa29 added a commit to hashicorp-forge/nomad-bench that referenced this pull request Mar 27, 2024
PR hashicorp/nomad#19915 added an explicit error
check to prevent silent failures when clients are unable to properly
set up cgroups.

This prevents `nomad-nodesim` jobs from starting unless `/sys/fs/cgroup`
is writable inside the container.

Since `nomad-nodesim` doesn't run real allocations, mounting the host
path _should_ be fine (famous last words).
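
To see ahead of time whether a container would hit this check, one option is a small write probe against the cgroup mount. The sketch below is an assumption-laden illustration, not part of `nomad-nodesim` or Nomad: `checkWritable` is a hypothetical helper. It probes with a directory rather than a file because cgroup filesystems accept directory creation but not arbitrary regular files.

```go
package main

import (
	"fmt"
	"os"
)

// checkWritable is a hypothetical helper that reports whether a directory
// can be created under dir. It creates a uniquely named probe directory and
// removes it again. On a cgroup mount this briefly creates an empty cgroup.
func checkWritable(dir string) error {
	probe, err := os.MkdirTemp(dir, "probe-")
	if err != nil {
		return fmt.Errorf("%s is not writable: %w", dir, err)
	}
	return os.Remove(probe)
}

func main() {
	if err := checkWritable("/sys/fs/cgroup"); err != nil {
		fmt.Println("cgroup setup would likely fail:", err)
		return
	}
	fmt.Println("/sys/fs/cgroup is writable")
}
```

A passing probe only shows that the mount accepts new directories at that moment; it does not guarantee that every path the client needs can be created.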
Successfully merging this pull request may close these issues.

migrate cpuset reserved partition when upgrading to 1.7+