Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

scheduler: detect and log unexpected scheduling collisions #11793

Merged
merged 29 commits into from
Jan 15, 2022

Conversation

lgfa29
Copy link
Contributor

@lgfa29 lgfa29 commented Jan 6, 2022

#9506 and others have reported sporadically eval plan failures on job submission.

The root of the problem is the PlanApplier rejecting plans from scheduler workers due to port conflicts. Since the scheduler did not observe the conflict it will likely make the same plan again and the job will never finish scheduling. We need more data on why the port conflict happens in the first place.

This PR doesn't solve the problem, as there are multiple causes and not all are known, but brings the worker scheduler logic closer to the logic used by the PlanApplier during scheduling.

More specifically, one discrepancy we noticed is the BinPackIterator not ignoring potential port collisions, whereas the PlanApplier would not and reject the plan if a collision is detected.

This flow violates the assumption made by AllocsFit:

// If the netIdx is provided, it is assumed that the client has already
// ensured there are no collisions.

The PlanApplier passes a nil netIdx to AllocsFit, while the scheduler worker passes a non-nil value, but also doesn't check for conflicts.

We believe the initial logic behaved like this because, when the iterator is evaluating a node, the netIdx is initialized with its current state, which should never have a collision, but this may not always hold true (corrupt state, incorrect copying values, mutating the state store directly, incorrect upgrade paths etc.).

Since we're not sure what could be causing this issue, this PR also adds additional logging when this problem is detected. To aid debugging, extensive data is collected (full node struct, list of allocations, NetworkIndex etc.). Since this can result in a lot of data, these events are sent back to the server to be centralized and deduplicated by node ID to avoid polluting the agent log.

After these changes, any unexpected port collision will result in the node being skipped, which will allow for evaluations to progress into a healthy node.

nomad/job_endpoint.go Outdated Show resolved Hide resolved
scheduler/testing.go Outdated Show resolved Hide resolved
@lgfa29 lgfa29 changed the title tmp scheduler: detect and log unexpected scheduling collisions Jan 10, 2022
@lgfa29 lgfa29 added this to the 1.2.4 milestone Jan 10, 2022
@lgfa29 lgfa29 self-assigned this Jan 10, 2022
@vercel vercel bot temporarily deployed to Preview – nomad January 11, 2022 00:31 Inactive
@vercel vercel bot temporarily deployed to Preview – nomad January 11, 2022 01:59 Inactive
@vercel vercel bot temporarily deployed to Preview – nomad January 11, 2022 02:12 Inactive
@vercel vercel bot temporarily deployed to Preview – nomad January 11, 2022 02:26 Inactive
@vercel vercel bot temporarily deployed to Preview – nomad January 12, 2022 00:18 Inactive
@vercel vercel bot temporarily deployed to Preview – nomad January 12, 2022 01:02 Inactive
@vercel vercel bot temporarily deployed to Preview – nomad January 12, 2022 23:09 Inactive
@vercel vercel bot temporarily deployed to Preview – nomad January 12, 2022 23:39 Inactive
scheduler/context.go Outdated Show resolved Hide resolved
nomad/server.go Outdated Show resolved Hide resolved
scheduler/rank.go Outdated Show resolved Hide resolved
scheduler/context.go Outdated Show resolved Hide resolved
@vercel vercel bot temporarily deployed to Preview – nomad January 15, 2022 00:36 Inactive
@lgfa29 lgfa29 merged commit 8a427a4 into main Jan 15, 2022
@lgfa29 lgfa29 deleted the improve-port-collision-handling branch January 15, 2022 01:09
lgfa29 added a commit that referenced this pull request Jan 15, 2022
lgfa29 added a commit that referenced this pull request Jan 17, 2022
lgfa29 added a commit that referenced this pull request Jan 17, 2022
lgfa29 added a commit that referenced this pull request Jan 17, 2022
lgfa29 added a commit that referenced this pull request Jan 18, 2022
lgfa29 added a commit that referenced this pull request Jan 19, 2022
lgfa29 added a commit that referenced this pull request Jan 19, 2022
lgfa29 added a commit that referenced this pull request Jan 19, 2022
lgfa29 added a commit that referenced this pull request Jan 19, 2022
@github-actions
Copy link

github-actions bot commented Nov 3, 2022

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 3, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants