-
Notifications
You must be signed in to change notification settings - Fork 3.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(controller): Use node.Name instead of node.DisplayName for onExit nodes #5486
Conversation
Signed-off-by: Simon Behar <simbeh7@gmail.com>
workflow/controller/steps.go
Outdated
@@ -260,7 +260,7 @@ func (woc *wfOperationCtx) executeStepGroup(ctx context.Context, stepGroup []wfv | |||
if !childNode.Fulfilled() { | |||
completed = false | |||
} else if childNode.Completed() { | |||
hasOnExitNode, onExitNode, err := woc.runOnExitNode(ctx, step.OnExit, step.Name, childNode.Name, stepsCtx.boundaryID, stepsCtx.tmplCtx) | |||
hasOnExitNode, onExitNode, err := woc.runOnExitNode(ctx, step.OnExit, childNode.Name, stepsCtx.boundaryID, stepsCtx.tmplCtx) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
parentDisplayName
is non unique, causing nodes with the same display name (e.g. when using withParam
) to create onExit nodes of the same name. After the first onExit node is created, subsequent onExit nodes created by different parent nodes (but with the same display name) will fail creation as they already exist. Use parentNodeName
which is always unique instead.
Codecov Report
@@ Coverage Diff @@
## master #5486 +/- ##
==========================================
+ Coverage 46.54% 46.74% +0.20%
==========================================
Files 240 240
Lines 15001 15004 +3
==========================================
+ Hits 6982 7014 +32
+ Misses 7119 7091 -28
+ Partials 900 899 -1
Continue to review full report at Codecov.
|
what happens to in-flight workflows at the time of upgrade? can do they end up with 2x on-exit nodes? |
Yes, there is a window of time where onExit nodes could get duplicated in the event of an upgrade while a cluster is running. note this window is only from the time an onExit node with the old name is created until the time its boundary node (Step Group, DAG) completes. Once the boundary node completes, we won't recurse down to the onExit node again. I don't believe we make guarantees on the consistency of a running workflow if there is a version upgrade while it's running. |
I don't think many users expect an upgrade to typically break workflows. Is there an alternative you can see? |
@alexec I don't believe so. This is a narrow issue: are we using the right name or not? We are currently not using the right name, which means there are bugs in the code (see #5486 (comment)). If it's critical that no running workflows are broken during the upgrade, I could introduce some code that:
We'd have to keep this scaffold code in the codebase for a couple of versions to ensure that users upgrading and skipping some patch versions (e.g. |
Also perhaps we could include a notice in the release notes the advise users to not upgrade with running workflows. And we can include that notice whenever applicable. It seems to me that we must have already made similar bug fixes that would have broken running nodes in the past. This could be a good middle ground between users not seeing unexpected, one-time issues during upgraded and us being able to fix bugs that currently exist in the codebase. |
Decision: add scaffolding code |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alexec Now fully backwards compatible, with tests that ensure it. Ready for another review 🙂
@@ -301,6 +301,31 @@ func withAnnotation(key, val string) with { | |||
return func(pod *apiv1.Pod) { pod.Annotations[key] = val } | |||
} | |||
|
|||
// createRunningPods creates the pods that are marked as running in a given test so that they can be accessed by the | |||
// pod assessor | |||
func createRunningPods(ctx context.Context, woc *wfOperationCtx, with ...with) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Useful utility testing function to create pods in the client that are supposed to be running according to the test workflow status.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unused? delete?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only the with
parameter was unused. Removed.
// Previously we used `depNode.DisplayName` to generate all onExit node names. However, as these can be non-unique | ||
// we transitioned to using `depNode.Name` instead, which are guaranteed to be unique. In order to not disrupt | ||
// running workflows during upgrade time, we do an additional check to see if there is an onExit node with the old | ||
// name (`depNode.DisplayName`). | ||
// TODO: This scaffold code should be removed after a couple of "grace period" version upgrades to allow transitions. It was introduced in v3.0.0 | ||
// See more: https://github.com/argoproj/argo-workflows/issues/5502 | ||
legacyOnExitNodeName := common.GenerateOnExitNodeName(depNode.DisplayName) | ||
if onExitNode := d.wf.GetNodeByName(legacyOnExitNodeName); onExitNode != nil && d.wf.GetNodeByName(depNode.Name).HasChild(onExitNode.ID) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is tested by TestOnExitDAGStatusCompatibility
in this PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice
@@ -6314,3 +6314,223 @@ func TestGenerateOutputResultRegex(t *testing.T) { | |||
assert.Equal(t, `steps\.template-name\.outputs\.result`, ref) | |||
assert.Equal(t, `steps\[['\"]template-name['\"]\]\.outputs.result`, expr) | |||
} | |||
|
|||
const testOnExitNameBackwardsCompatibility = ` | |||
apiVersion: argoproj.io/v1alpha1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've already ensured these test workflows are minimal.
// Previously we used `parentDisplayName` to generate all onExit node names. However, as these can be non-unique | ||
// we transitioned to using `parentNodeName` instead, which are guaranteed to be unique. In order to not disrupt | ||
// running workflows during upgrade time, we first check if there is an onExit node that currently exists with the | ||
// legacy name AND said node is a child of the parent node. If it does, we continue execution with the legacy name. | ||
// If it doesn't, we use the new (and unique) name for all operations henceforth. | ||
// TODO: This scaffold code should be removed after a couple of "grace period" version upgrades to allow transitions. It was introduced in v3.0.0 | ||
// When the scaffold code is removed, we should only have the following: | ||
// | ||
// onExitNodeName := common.GenerateOnExitNodeName(parentNodeName) | ||
// | ||
// See more: https://github.com/argoproj/argo-workflows/issues/5502 | ||
onExitNodeName := common.GenerateOnExitNodeName(parentNodeName) | ||
legacyOnExitNodeName := common.GenerateOnExitNodeName(parentDisplayName) | ||
if legacyNameNode := woc.wf.GetNodeByName(legacyOnExitNodeName); legacyNameNode != nil && woc.wf.GetNodeByName(parentNodeName).HasChild(legacyNameNode.ID) { | ||
onExitNodeName = legacyOnExitNodeName | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is tested by TestOnExitNameBackwardsCompatibility
in this PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the block of code looks to be duplicated - maybe create a utility func?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had a utility function before, but once all the parameters were taken into account it didn't really reduce complexity so I decided to keep it like so.
@@ -1806,6 +1806,15 @@ func (n NodeStatus) GetDuration() time.Duration { | |||
return n.FinishedAt.Sub(n.StartedAt.Time) | |||
} | |||
|
|||
func (n NodeStatus) HasChild(childID string) bool { | |||
for _, nodeID := range n.Children { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unit test please
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Working on it right now, will push soon
@@ -301,6 +301,31 @@ func withAnnotation(key, val string) with { | |||
return func(pod *apiv1.Pod) { pod.Annotations[key] = val } | |||
} | |||
|
|||
// createRunningPods creates the pods that are marked as running in a given test so that they can be accessed by the | |||
// pod assessor | |||
func createRunningPods(ctx context.Context, woc *wfOperationCtx, with ...with) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unused? delete?
// Previously we used `depNode.DisplayName` to generate all onExit node names. However, as these can be non-unique | ||
// we transitioned to using `depNode.Name` instead, which are guaranteed to be unique. In order to not disrupt | ||
// running workflows during upgrade time, we do an additional check to see if there is an onExit node with the old | ||
// name (`depNode.DisplayName`). | ||
// TODO: This scaffold code should be removed after a couple of "grace period" version upgrades to allow transitions. It was introduced in v3.0.0 | ||
// See more: https://github.com/argoproj/argo-workflows/issues/5502 | ||
legacyOnExitNodeName := common.GenerateOnExitNodeName(depNode.DisplayName) | ||
if onExitNode := d.wf.GetNodeByName(legacyOnExitNodeName); onExitNode != nil && d.wf.GetNodeByName(depNode.Name).HasChild(onExitNode.ID) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice
// Previously we used `parentDisplayName` to generate all onExit node names. However, as these can be non-unique | ||
// we transitioned to using `parentNodeName` instead, which are guaranteed to be unique. In order to not disrupt | ||
// running workflows during upgrade time, we first check if there is an onExit node that currently exists with the | ||
// legacy name AND said node is a child of the parent node. If it does, we continue execution with the legacy name. | ||
// If it doesn't, we use the new (and unique) name for all operations henceforth. | ||
// TODO: This scaffold code should be removed after a couple of "grace period" version upgrades to allow transitions. It was introduced in v3.0.0 | ||
// When the scaffold code is removed, we should only have the following: | ||
// | ||
// onExitNodeName := common.GenerateOnExitNodeName(parentNodeName) | ||
// | ||
// See more: https://github.com/argoproj/argo-workflows/issues/5502 | ||
onExitNodeName := common.GenerateOnExitNodeName(parentNodeName) | ||
legacyOnExitNodeName := common.GenerateOnExitNodeName(parentDisplayName) | ||
if legacyNameNode := woc.wf.GetNodeByName(legacyOnExitNodeName); legacyNameNode != nil && woc.wf.GetNodeByName(parentNodeName).HasChild(legacyNameNode.ID) { | ||
onExitNodeName = legacyOnExitNodeName | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the block of code looks to be duplicated - maybe create a utility func?
… nodes (#5486) Signed-off-by: Simon Behar <simbeh7@gmail.com>
Split from #5478