Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Separate BatchJobDeregister evals updates #4744

Conversation

notnoopci
Copy link
Contributor

@notnoopci notnoopci commented Oct 3, 2018

Fixes #4299

There is a race between job purging and job-deregister evaluations
handling that seem to cause completed jobs to re-run unexpectedly when
they are being GCed.

Here, we make BatchDeregister behave just like Deregister, where we
submit the job-deregister evals after JobBatchDeregisterRequest is
committed to raft.

Open Questions:

  • How can we test this racey behavior effectively in local/integration tests?
    • I have run this patch with my test environment setup in https://github.com/notnoopci/nomad-0.8.3-job-rerun-bug for over 12 hours, ran ~12k jobs, with ~7k gc invocation and ~211 invocations of BatchDeregister over that period without ever invoking the race and running a batch job multiple times -- while the bug would occur for as little as 15 minutes without this patch.

Fixes hashicorp#4299

There is a race between job purging and job-deregister evaluations
handling that seem to cause completed jobs to re-run unexpectedly when
they are being GCed.

Here, we make `BatchDeregister` behave just like `Deregister`, where we
submit the `job-deregister` evals after `JobBatchDeregisterRequest` is
committed to raft.
}

// Populate the reply with job information
reply.Index = index
Copy link
Contributor Author

@notnoopci notnoopci Oct 3, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure if this is actually needed but followed the pattern in Dispatch - I guess it's somewhat relevant if we exit early if we hit an error below - but I'm unsure of the ramifications. Are Dispatch or BatchDispatch expected to be atomic operations?

ID: uuid.Generate(),
Namespace: jobNS.Namespace,
Priority: priority,
Type: jtype,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noticed that previous code sets priority and type from the job - but Deregister method doesn't. Should we do that there as well?

@notnoop
Copy link
Contributor

notnoop commented Nov 9, 2018

Will using a different approach to fix the bug. Closing and opening a new PR shortly.

@notnoop notnoop closed this Nov 9, 2018
@github-actions
Copy link

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 25, 2023
@brentmmarks brentmmarks deleted the apply-batch-deregister-evals-separately branch November 20, 2023 16:17
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants