
fix: Create new boundary to run onExit nodes #5478

Closed · 1 commit

Conversation

simster7 (Member):

Fixes: #5463

codecov bot commented Mar 22, 2021

Codecov Report

Merging #5478 (1ca5f6a) into master (5bd7ce8) will decrease coverage by 0.02%.
The diff coverage is 44.33%.

❗ Current head 1ca5f6a differs from pull request most recent head 6cb9fac. Consider uploading reports for the commit 6cb9fac to get more accurate results

@@            Coverage Diff             @@
##           master    #5478      +/-   ##
==========================================
- Coverage   16.50%   16.48%   -0.03%     
==========================================
  Files         243      243              
  Lines       43774    43779       +5     
==========================================
- Hits         7225     7215      -10     
- Misses      35568    35583      +15     
  Partials      981      981              
Impacted Files Coverage Δ
cmd/argo/commands/clustertemplate/lint.go 0.00% <0.00%> (ø)
cmd/argo/commands/cron/lint.go 0.00% <0.00%> (ø)
cmd/argo/commands/lint.go 0.00% <0.00%> (ø)
cmd/argo/lint/lint.go 53.12% <40.00%> (-7.71%) ⬇️
cmd/argo/lint/formatter_pretty.go 68.75% <66.66%> (-20.14%) ⬇️
cmd/argo/lint/formatter_simple.go 83.33% <81.81%> (+1.51%) ⬆️
cmd/argo/commands/get.go 56.95% <100.00%> (-1.05%) ⬇️
workflow/controller/dag.go 72.36% <100.00%> (ø)
workflow/controller/operator.go 70.25% <100.00%> (-0.26%) ⬇️
workflow/controller/steps.go 70.50% <100.00%> (ø)
... and 6 more


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5bd7ce8...6cb9fac.

if templateRef != "" && woc.GetShutdownStrategy().ShouldExecute(true) {
woc.log.Infof("Running OnExit handler: %s", templateRef)
onExitNodeName := common.GenerateOnExitNodeName(parentDisplayName)
simster7 (Member Author):
parentDisplayName is non-unique, so nodes with the same display name (e.g. when using withParam) create onExit nodes with the same name. After the first onExit node is created, subsequent onExit nodes created by different parent nodes with the same display name fail creation because a node with that name already exists. Use parentNodeName, which is always unique, instead.

simster7 (Member Author):

This fix has been split to #5486

@@ -260,7 +260,7 @@ func (woc *wfOperationCtx) executeStepGroup(ctx context.Context, stepGroup []wfv
if !childNode.Fulfilled() {
completed = false
} else if childNode.Completed() {
-	hasOnExitNode, onExitNode, err := woc.runOnExitNode(ctx, step.OnExit, step.Name, childNode.Name, stepsCtx.boundaryID, stepsCtx.tmplCtx)
+	hasOnExitNode, onExitNode, err := woc.runOnExitNode(ctx, step.OnExit, childNode.Name, childNodeID, stepsCtx.tmplCtx)
simster7 (Member Author):

onExit nodes should run under the boundary of the node that called them (the "exited" node).

simster7 (Member Author):

Before:

STEP                  TEMPLATE  PODNAME                       DURATION  MESSAGE
 ✔ dag-diamond-wzqnc  diamond
 ├─✔ A                echo      dag-diamond-wzqnc-2374547985  6s
 ├─✔ A.onExit         oe        dag-diamond-wzqnc-1294891656  6s
 ├─✔ B                echo      dag-diamond-wzqnc-2324215128  6s
 ├─✔ C                echo      dag-diamond-wzqnc-2340992747  6s
 ├─✔ B.onExit         oe        dag-diamond-wzqnc-250094443   6s
 ├─✔ C.onExit         oe        dag-diamond-wzqnc-728915506   6s
 ├─✔ D                echo      dag-diamond-wzqnc-2424880842  6s
 └─✔ D.onExit         oe        dag-diamond-wzqnc-2186811989  6s

After:

STEP                  TEMPLATE  PODNAME                       DURATION  MESSAGE
 ✔ dag-diamond-ztvr2  diamond
 ├─✔ A                echo      dag-diamond-ztvr2-3230871262  2s
 │ └─✔ onExit         oe        dag-diamond-ztvr2-1416350393  6s
 ├─✔ B                echo      dag-diamond-ztvr2-3214093643  6s
 │ └─✔ onExit         oe        dag-diamond-ztvr2-2959263954  6s
 ├─✔ C                echo      dag-diamond-ztvr2-3197316024  6s
 │ └─✔ onExit         oe        dag-diamond-ztvr2-3577844235  6s
 └─✔ D                echo      dag-diamond-ztvr2-3314759357  6s
   └─✔ onExit         oe        dag-diamond-ztvr2-2103494724  6s

simster7 (Member Author):

Note this is not only for cosmetic reasons: boundary parallelism is the main reason for making this change.

alexec (Contributor) left a comment:

what will happen to in-flight workflows at time of upgrade?

simster7 (Member Author):

> what will happen to in-flight workflows at time of upgrade?

They will not fail, but there is a small chance they end up redoing work (e.g. an onExit node executed twice) if an onExit node with the old name has started but not completed. If it has started and completed, then its parent node will have succeeded and Argo won't try to schedule a new onExit node.

Regardless, I don't believe we guarantee that upgrading while workflows are running will be seamless.


simster7 commented Mar 22, 2021

This PR currently fixes two issues in one:

Should I split this into two PRs so that we can merge the first fix independently? Done

Signed-off-by: Simon Behar <simbeh7@gmail.com>
Successfully merging this pull request may close these issues.

Weird behaviours of exit handlers in loop