fix: Create new boundary to run onExit nodes #5478
Codecov Report
@@            Coverage Diff             @@
##           master    #5478      +/-   ##
==========================================
- Coverage   16.50%   16.48%   -0.03%
==========================================
  Files         243      243
  Lines       43774    43779       +5
==========================================
- Hits         7225     7215      -10
- Misses      35568    35583      +15
  Partials      981      981
workflow/controller/operator.go
Outdated
if templateRef != "" && woc.GetShutdownStrategy().ShouldExecute(true) {
	woc.log.Infof("Running OnExit handler: %s", templateRef)
	onExitNodeName := common.GenerateOnExitNodeName(parentDisplayName)
`parentDisplayName` is not unique, so nodes that share a display name (e.g. when using `withParam`) create onExit nodes with the same name. After the first onExit node is created, subsequent onExit nodes created by different parent nodes with the same display name fail to be created because a node of that name already exists. Use `parentNodeName`, which is always unique, instead.
This fix has been split to #5486
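The collision described in this thread can be illustrated with a minimal, self-contained sketch. It does not use Argo's types or helpers: `node`, `onExitName`, `createOnExitNodes`, and the node names are all hypothetical stand-ins, and the `.onExit` suffix is an assumption made for illustration only.

```go
package main

import "fmt"

// node is a simplified stand-in for a workflow node; it is NOT Argo's NodeStatus.
type node struct {
	Name        string // full name, assumed unique
	DisplayName string // short name, may repeat across expanded nodes
}

// onExitName is a hypothetical helper that derives an onExit node name
// from some parent identifier (the ".onExit" suffix is an assumption).
func onExitName(parent string) string {
	return parent + ".onExit"
}

// createOnExitNodes registers one onExit node per parent, keyed by the name
// derived from key(parent). It returns the set of created names and the
// parents whose onExit node could not be created because the name was taken.
func createOnExitNodes(parents []node, key func(node) string) (map[string]bool, []string) {
	created := make(map[string]bool)
	var collisions []string
	for _, p := range parents {
		name := onExitName(key(p))
		if created[name] {
			collisions = append(collisions, p.Name)
			continue
		}
		created[name] = true
	}
	return created, collisions
}

func main() {
	// Two distinct parents that happen to share a display name (illustrative names).
	parents := []node{
		{Name: "wf.outer(0:a).inner", DisplayName: "inner"},
		{Name: "wf.outer(1:b).inner", DisplayName: "inner"},
	}
	_, byDisplay := createOnExitNodes(parents, func(n node) string { return n.DisplayName })
	_, byName := createOnExitNodes(parents, func(n node) string { return n.Name })
	fmt.Println("collisions keyed by display name:", len(byDisplay))
	fmt.Println("collisions keyed by node name:", len(byName))
}
```

Keying by the display name leaves the second parent without an onExit node, while keying by the full node name gives each parent its own.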
workflow/controller/steps.go
Outdated
@@ -260,7 +260,7 @@ func (woc *wfOperationCtx) executeStepGroup(ctx context.Context, stepGroup []wfv
 	if !childNode.Fulfilled() {
 		completed = false
 	} else if childNode.Completed() {
-		hasOnExitNode, onExitNode, err := woc.runOnExitNode(ctx, step.OnExit, step.Name, childNode.Name, stepsCtx.boundaryID, stepsCtx.tmplCtx)
+		hasOnExitNode, onExitNode, err := woc.runOnExitNode(ctx, step.OnExit, childNode.Name, childNodeID, stepsCtx.tmplCtx)
onExit nodes should run under the boundary of the node that called them (the "exited" node).
Before:
STEP                 TEMPLATE  PODNAME                       DURATION  MESSAGE
✔ dag-diamond-wzqnc  diamond
├─✔ A                echo      dag-diamond-wzqnc-2374547985  6s
├─✔ A.onExit         oe        dag-diamond-wzqnc-1294891656  6s
├─✔ B                echo      dag-diamond-wzqnc-2324215128  6s
├─✔ C                echo      dag-diamond-wzqnc-2340992747  6s
├─✔ B.onExit         oe        dag-diamond-wzqnc-250094443   6s
├─✔ C.onExit         oe        dag-diamond-wzqnc-728915506   6s
├─✔ D                echo      dag-diamond-wzqnc-2424880842  6s
└─✔ D.onExit         oe        dag-diamond-wzqnc-2186811989  6s
After:
STEP                 TEMPLATE  PODNAME                       DURATION  MESSAGE
✔ dag-diamond-ztvr2  diamond
├─✔ A                echo      dag-diamond-ztvr2-3230871262  2s
│ └─✔ onExit         oe        dag-diamond-ztvr2-1416350393  6s
├─✔ B                echo      dag-diamond-ztvr2-3214093643  6s
│ └─✔ onExit         oe        dag-diamond-ztvr2-2959263954  6s
├─✔ C                echo      dag-diamond-ztvr2-3197316024  6s
│ └─✔ onExit         oe        dag-diamond-ztvr2-3577844235  6s
└─✔ D                echo      dag-diamond-ztvr2-3314759357  6s
  └─✔ onExit         oe        dag-diamond-ztvr2-2103494724  6s
Note that this change is not only cosmetic: boundary parallelism is the main reason for making it.
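To make the parallelism point concrete, here is a rough sketch under a stated assumption: that a parallelism limit is checked by counting the active nodes that share a boundary ID. The `node` type, `activePerBoundary`, and the boundary IDs below are hypothetical and are not taken from the controller code.

```go
package main

import "fmt"

// node is a simplified stand-in for a workflow node; it is NOT Argo's NodeStatus.
type node struct {
	Name       string
	BoundaryID string // boundary the node is accounted under (hypothetical IDs)
	Running    bool
}

// activePerBoundary counts running nodes grouped by boundary ID. This models
// the ASSUMPTION that parallelism is enforced per boundary.
func activePerBoundary(nodes []node) map[string]int {
	counts := make(map[string]int)
	for _, n := range nodes {
		if n.Running {
			counts[n.BoundaryID]++
		}
	}
	return counts
}

func main() {
	// Before: the onExit node is accounted under the enclosing template's boundary.
	before := []node{
		{Name: "B", BoundaryID: "diamond", Running: true},
		{Name: "C", BoundaryID: "diamond", Running: true},
		{Name: "A.onExit", BoundaryID: "diamond", Running: true},
	}
	// After: the onExit node is accounted under the node that exited.
	after := []node{
		{Name: "B", BoundaryID: "diamond", Running: true},
		{Name: "C", BoundaryID: "diamond", Running: true},
		{Name: "onExit", BoundaryID: "A", Running: true},
	}
	fmt.Println("before, active in diamond:", activePerBoundary(before)["diamond"])
	fmt.Println("after, active in diamond:", activePerBoundary(after)["diamond"])
}
```

In this model, moving the onExit node to its own boundary means it no longer consumes a slot in the enclosing template's count.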
What will happen to in-flight workflows at the time of upgrade?
They will not fail, but there is a small chance that they end up redoing work (e.g. an onExit node being executed twice) if the onExit node with the old name has started executing but has not completed. If it has executed and completed, its parent node will already be marked as succeeded and Argo won't try to schedule a new onExit node. Regardless, I don't believe we provide any guarantee that you can upgrade seamlessly while workflows are running.
This PR currently fixes two issues in one:
Fixes: #5463