
scheduler: config option to reject job registration #11610

Merged: 2 commits into main on Dec 6, 2021

Conversation

tgross (Member) commented Dec 2, 2021:

Fixes #11450

During incident response, operators may find that automated processes elsewhere in the organization are generating new workloads that the Nomad clusters cannot handle. This changeset adds a field to the SchedulerConfiguration API that causes all job registration calls to be rejected unless the request carries a management ACL token.


Note this doesn't include any new CLI commands (see #11450 (comment) for why)
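The gate itself is small. As a rough, self-contained sketch of the kind of check this adds (the `SchedulerConfiguration` stand-in and the `checkSubmission` helper are illustrative, not Nomad's actual code; only the error message is taken from the diff below):

```go
package main

import (
	"errors"
	"fmt"
)

// ErrJobRegistrationDisabled mirrors the error message added in this PR.
var ErrJobRegistrationDisabled = errors.New(
	"Job registration, dispatch, and scale are disabled by the scheduler configuration")

// SchedulerConfiguration is a pared-down, illustrative stand-in for Nomad's
// real scheduler configuration; only the new field is modeled here.
type SchedulerConfiguration struct {
	RejectJobRegistration bool
}

// checkSubmission sketches the gate: when RejectJobRegistration is set,
// submissions are rejected unless the caller holds a management ACL token.
func checkSubmission(cfg *SchedulerConfiguration, isManagementToken bool) error {
	if cfg != nil && cfg.RejectJobRegistration && !isManagementToken {
		return ErrJobRegistrationDisabled
	}
	return nil
}

func main() {
	cfg := &SchedulerConfiguration{RejectJobRegistration: true}
	fmt.Println(checkSubmission(cfg, false)) // the sentinel error: non-management token
	fmt.Println(checkSubmission(cfg, true))  // <nil>: management tokens still pass
}
```

Note how the default (`false`) leaves registration open, so existing clusters are unaffected until an operator flips the flag.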

nomad/job_endpoint.go (outdated review thread, resolved)
```diff
@@ -13,6 +13,7 @@ const (
 	errNoRegionPath            = "No path to region"
 	errTokenNotFound           = "ACL token not found"
 	errPermissionDenied        = "Permission denied"
+	errJobRegistrationDisabled = "Job registration, dispatch, and scale are disabled by the scheduler configuration"
```
tgross (Member, author):
@mikenomitch @davemay99 I updated the error message we emit as discussed, and it'll look like so:

```shell
$ nomad job run ./example.nomad
Error submitting job: Unexpected response code: 403 (Job registration, dispatch, and scale are disabled by the scheduler configuration)
```

I'd be happy to take any suggestions on improving this message!

DerekStrickland (Contributor) left a review:
Made some suggestions. Non-blocking, but interested in your thoughts.

```diff
@@ -451,6 +451,9 @@ func errCodeFromHandler(err error) (int, string) {
 	} else if strings.HasSuffix(errMsg, structs.ErrTokenNotFound.Error()) {
 		errMsg = structs.ErrTokenNotFound.Error()
 		code = 403
+	} else if strings.HasSuffix(errMsg, structs.ErrJobRegistrationDisabled.Error()) {
+		errMsg = structs.ErrJobRegistrationDisabled.Error()
+		code = 403
```
DerekStrickland (Contributor):
How do you feel about 409 here instead of 403? We're trying to get more intentional and consistent with status codes and error messages when the opportunity arises. 409 is documented as:

409 Conflict
The request could not be completed due to a conflict with the current state of the resource

This seems more in line, and the error message should let clients know what the conflict is.

tgross (Member, author):
The intent of 409 is to avoid write skew, and the direction in the standard is to adjust the request and submit it again, which is exactly what we don't want here.

tgross (Member, author) commented Dec 6, 2021:
I'm open to the idea of finding a more specific error code, but I feel like it needs to be one that doesn't tell any automated system "keep retrying!" (e.g. 420 or 429). The 403 tells them "you can't submit because the permissions to submit have been changed."
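The retry concern can be made concrete. A hedged sketch of the kind of retry policy common in automated clients (`shouldRetry` is illustrative only, not Nomad's client code): a 429 or 5xx invites another attempt, while a 403 does not.

```go
package main

import "fmt"

// shouldRetry sketches the retry policy many automated job submitters use:
// 429 and transient 5xx responses invite another attempt, while 403 says
// the request will keep failing until permissions change.
func shouldRetry(statusCode int) bool {
	switch {
	case statusCode == 429: // Too Many Requests: explicitly "try again later"
		return true
	case statusCode >= 500 && statusCode <= 599: // transient server-side errors
		return true
	default: // 403 and the rest of 4xx: retrying cannot help
		return false
	}
}

func main() {
	fmt.Println(shouldRetry(403)) // false: the load-shedding gate holds
	fmt.Println(shouldRetry(429)) // true: automation would keep hammering the cluster
}
```

Under this reading, 403 is exactly the signal that sheds load, which is the point of the feature.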

DerekStrickland (Contributor):
Yeah, I was afraid this might be the answer. I'll keep looking for a possible alternative code.

```diff
@@ -487,6 +490,9 @@ func (s *HTTPServer) wrap(handler func(resp http.ResponseWriter, req *http.Reque
 	} else if strings.HasSuffix(errMsg, structs.ErrTokenNotFound.Error()) {
 		errMsg = structs.ErrTokenNotFound.Error()
 		code = 403
+	} else if strings.HasSuffix(errMsg, structs.ErrJobRegistrationDisabled.Error()) {
+		errMsg = structs.ErrJobRegistrationDisabled.Error()
+		code = 403
```
DerekStrickland (Contributor):

Same question here.

nomad/job_endpoint.go (two more outdated review threads, resolved)
```diff
@@ -128,6 +130,8 @@ server state is authoritative.

 - `MemoryOversubscriptionEnabled` `(bool: false)` <sup>1.1 Beta</sup> - When `true`, tasks may exceed their reserved memory limit, if the client has excess memory capacity. Tasks must specify [`memory_max`](/docs/job-specification/resources#memory_max) to take advantage of memory oversubscription.

+- `RejectJobRegistration` `(bool: false)` - When `true`, the server will return permission denied errors for job registration, job dispatch, and job scale APIs, unless the ACL token for the request is a management token. If ACLs are disabled, no user will be able to register jobs. This allows operators to shed load from automated processes during incident response.
```
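Assuming the field is written back through the scheduler configuration API (the `SchedulerConfigPayload` struct, the `binpack` value, and the read-modify-write advice below are illustrative assumptions, not taken from this PR), the JSON body for a PUT to `/v1/operator/scheduler/configuration` might look like:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// SchedulerConfigPayload models only the fields this example exercises;
// Nomad's real scheduler configuration carries more fields, so callers
// should generally read the current configuration first and modify it.
type SchedulerConfigPayload struct {
	SchedulerAlgorithm    string `json:"SchedulerAlgorithm"`
	RejectJobRegistration bool   `json:"RejectJobRegistration"`
}

// marshalConfig builds the JSON body for the configuration write.
func marshalConfig(reject bool) ([]byte, error) {
	return json.Marshal(SchedulerConfigPayload{
		SchedulerAlgorithm:    "binpack", // placeholder: keep your cluster's current value
		RejectJobRegistration: reject,    // true sheds load from non-management tokens
	})
}

func main() {
	body, err := marshalConfig(true)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body)) // {"SchedulerAlgorithm":"binpack","RejectJobRegistration":true}
}
```

Flipping the flag back to `false` restores normal job registration once the incident is over.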
DerekStrickland (Contributor):
If you switch to 409 this text will need to be updated to say conflict errors rather than permission denied errors. Thank you for explaining the load shedding usage clearly ❤️

tgross (Member, author) commented Dec 6, 2021:

@DerekStrickland I've addressed most of your comments, but for the error code I'm going to merge as-is for the time being. We can always go back and trivially tweak the error code any time up until the day we ship. 😀

@tgross tgross merged commit 2c3db7e into main Dec 6, 2021
@tgross tgross deleted the config-to-load-shed-job-register branch December 6, 2021 20:20
DerekStrickland (Contributor):

For sure! Sorry. Wasn't trying to block. Also, I can't find anything better that isn't repurposing something else inappropriately.

github-actions bot commented Nov 8, 2022:

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 8, 2022
Successfully merging this pull request may close these issues.

add operator API to reject all job registrations