Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

scheduler: always edge merge in one direction #4559

Merged
merged 5 commits into from
Jan 18, 2024

Conversation

jedevc
Copy link
Member

@jedevc jedevc commented Jan 16, 2024

Resolves #4347 (comment).

Previously, the scheduler could create cycles in the edges it merged:

Essentially, what appears to happen is that we have two vertexes, A and B. Both of these vertexes seem to have at least two edges, A1/A2 and B1/B2. Somehow, we perform a merge that merges A1 into B1, but then merges B2 into A2.

Prior to #4347 this could (at worst) lead to a slight unnecessary inefficiency, where we would execute both A and B (even though both of their edges be merged so we'd never need to perform both of them). However, after this PR, we can accidentally deadlock, due to attempting to nest progress writers.

@jedevc jedevc marked this pull request as ready for review January 16, 2024 13:47
@jedevc jedevc force-pushed the fix-edge-merge-progress-deadlock branch 2 times, most recently from 41fd3e9 to 9290b84 Compare January 16, 2024 17:28
@jedevc
Copy link
Member Author

jedevc commented Jan 16, 2024

Woops, initial pass wasn't right - this was still vulnerable to deadlocks if:

  • A0 merges to B0
  • A1 merges to C1
  • B2 merges to A2 (causing a cycle - and the previous merge would have set the "owner" to C, so we wouldn't find this)

The correct answer is to track every owner we merged into.

solver/jobs.go Outdated Show resolved Hide resolved
@jedevc jedevc marked this pull request as draft January 16, 2024 18:05
@tonistiigi
Copy link
Member

Essentially, what appears to happen is that we have two vertexes, A and B. Both of these vertexes seem to have at least two edges, A1/A2 and B1/B2. Somehow, we perform a merge that merges A1 into B1, but then merges B2 into A2.

Some stupid questions: Are these edges for different indexes? Then why would it matter that E1=A1->B1 and E2=B2->A2. As the edges are different there isn't a loop between edges but vertexes, and vertexes are not being merged.

@jedevc
Copy link
Member Author

jedevc commented Jan 16, 2024

Essentially, what appears to happen is that we have two vertexes, A and B. Both of these vertexes seem to have at least two edges, A1/A2 and B1/B2. Somehow, we perform a merge that merges A1 into B1, but then merges B2 into A2.

Some stupid questions: Are these edges for different indexes? Then why would it matter that E1=A1->B1 and E2=B2->A2. As the edges are different there isn't a loop between edges but vertexes, and vertexes are not being merged.

Yeah sorry 😁

The edges are unique, so are for different indexes.

The vertices aren't merged, true - they're still entirely separate. The issue is that:

  1. With the progress writer change, there's now a dependency between states (which map to vertices). That dependency is directed, and so cycles in the vertices means cycles in the progress writer - and cycles in the progress multi writer means deadlocks.
  2. If we have the cycle in the graph and just don't allow them in the progress writer we get suboptimal behavior (the "easy" fix) - we have to execute both of the vertices (A and B) to compute both the edges, while if we could eliminate the cycles (so that both edges went from A to B or vice versa), we'd only need to execute the "owner" (which is super useful if it's an expensive operation, that's run lots of times).

solver/jobs.go Outdated
@@ -178,6 +180,7 @@ func (s *state) setEdge(index Index, targetEdge *edge, targetState *state) {
targetState.mpw.Add(s.mpw)
targetState.allPw[s.mpw] = struct{}{}
}
targetState.owners = append(targetState.owners, s.vtx)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this extra property? Can't we just cycle the .owner chains of all sibling edges?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, looks like that works as well 🎉

solver/jobs.go Outdated
@@ -280,6 +283,39 @@ func NewSolver(opts SolverOpt) *Solver {
return jl
}

func (jl *Solver) findMergeDirection(dest *edge, src *edge) (*edge, *edge) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be clearer if instead we used langauge "find if any sibling edge has already owned by the vertex of the targetedge"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have refactored this into a hasOwner helper, then the caller determines whether to switch.

@@ -351,6 +352,7 @@ func (s *scheduler) mergeTo(target, src *edge) bool {
type edgeFactory interface {
getEdge(Edge) *edge
setEdge(Edge, *edge)
findMergeDirection(*edge, *edge) (*edge, *edge)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't this logic be inside the setEdge instead of exposed in this interface.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sadly not (this would be nice though).

We need to make sure we call s.ef.setEdge in the right direction, but we also need to make sure we call s.mergeTo in the right direction as well.

Signed-off-by: Justin Chadwell <me@jedevc.com>
Signed-off-by: Justin Chadwell <me@jedevc.com>
Previously, if this case occured, we would deadlock, so as to make this
case more easily identifiable and debuggable, we should panic instead.
Potentially, we could also just *skip* this case, but given that this is
an internal logical error that should not occur, it would be better to
identify this.

Signed-off-by: Justin Chadwell <me@jedevc.com>
When we perform a vertex merge, we should explicitly track the vertex
that it was merged into. This way, we can avoid the case where we merge
an index 0 edge from A->B and, later an index 1 edge from B->A.

With this patch, this scenario instead flips the direction of the merge
to merge from A->B for index 1.

Signed-off-by: Justin Chadwell <me@jedevc.com>
Signed-off-by: Justin Chadwell <me@jedevc.com>
@jedevc jedevc force-pushed the fix-edge-merge-progress-deadlock branch from 9290b84 to b7a0282 Compare January 17, 2024 15:15
@jedevc jedevc marked this pull request as ready for review January 17, 2024 15:16
@jedevc jedevc merged commit 9d84cdc into moby:master Jan 18, 2024
60 checks passed
@jedevc jedevc deleted the fix-edge-merge-progress-deadlock branch January 18, 2024 11:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants