Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(controller): Use node.Name instead of node.DisplayName for onExit nodes #5486

Merged
merged 8 commits into from
Mar 24, 2021

Conversation

simster7
Copy link
Member

@simster7 simster7 commented Mar 22, 2021

Split from #5478

Signed-off-by: Simon Behar <simbeh7@gmail.com>
@@ -260,7 +260,7 @@ func (woc *wfOperationCtx) executeStepGroup(ctx context.Context, stepGroup []wfv
if !childNode.Fulfilled() {
completed = false
} else if childNode.Completed() {
hasOnExitNode, onExitNode, err := woc.runOnExitNode(ctx, step.OnExit, step.Name, childNode.Name, stepsCtx.boundaryID, stepsCtx.tmplCtx)
hasOnExitNode, onExitNode, err := woc.runOnExitNode(ctx, step.OnExit, childNode.Name, stepsCtx.boundaryID, stepsCtx.tmplCtx)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

parentDisplayName is non unique, causing nodes with the same display name (e.g. when using withParam) to create onExit nodes of the same name. After the first onExit node is created, subsequent onExit nodes created by different parent nodes (but with the same display name) will fail creation as they already exist. Use parentNodeName which is always unique instead.

@codecov
Copy link

codecov bot commented Mar 22, 2021

Codecov Report

Merging #5486 (885c05c) into master (3065941) will increase coverage by 0.20%.
The diff coverage is 94.73%.

❗ Current head 885c05c differs from pull request most recent head 63dec32. Consider uploading reports for the commit 63dec32 to get more accurate results
Impacted file tree graph

@@            Coverage Diff             @@
##           master    #5486      +/-   ##
==========================================
+ Coverage   46.54%   46.74%   +0.20%     
==========================================
  Files         240      240              
  Lines       15001    15004       +3     
==========================================
+ Hits         6982     7014      +32     
+ Misses       7119     7091      -28     
+ Partials      900      899       -1     
Impacted Files Coverage Δ
workflow/common/util.go 8.10% <0.00%> (ø)
pkg/apis/workflow/v1alpha1/workflow_types.go 45.78% <100.00%> (+0.25%) ⬆️
util/errors/errors.go 100.00% <100.00%> (ø)
workflow/controller/dag.go 72.67% <100.00%> (+0.31%) ⬆️
workflow/controller/operator.go 70.69% <100.00%> (+0.43%) ⬆️
pkg/apiclient/apiclient.go 5.55% <0.00%> (-4.45%) ⬇️
cmd/argo/commands/get.go 56.66% <0.00%> (ø)
pkg/apiclient/http1-client.go 0.00% <0.00%> (ø)
pkg/apiclient/logs-intermediary.go 0.00% <0.00%> (ø)
pkg/apiclient/http1/event-watch-client.go 0.00% <0.00%> (ø)
... and 5 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3065941...63dec32. Read the comment docs.

@simster7 simster7 requested a review from alexec March 22, 2021 20:06
@alexec
Copy link
Contributor

alexec commented Mar 22, 2021

what happens to in-flight workflows at the time of upgrade? can do they end up with 2x on-exit nodes?

@simster7
Copy link
Member Author

what happens to in-flight workflows at the time of upgrade? can do they end up with 2x on-exit nodes?

Yes, there is a window of time where onExit nodes could get duplicated in the event of an upgrade while a cluster is running. note this window is only from the time an onExit node with the old name is created until the time its boundary node (Step Group, DAG) completes. Once the boundary node completes, we won't recurse down to the onExit node again.

I don't believe we make guarantees on the consistency of a running workflow if there is a version upgrade while it's running.

@alexec
Copy link
Contributor

alexec commented Mar 22, 2021

I don't think many users expect an upgrade to typically break workflows. Is there an alternative you can see?

@simster7
Copy link
Member Author

I don't think many users expect an upgrade to typically break workflows. Is there an alternative you can see?

@alexec I don't believe so. This is a narrow issue: are we using the right name or not? We are currently not using the right name, which means there are bugs in the code (see #5486 (comment)).

If it's critical that no running workflows are broken during the upgrade, I could introduce some code that:

  • first checks if a node exists with the old name
  • if it does, proceed with the old name
  • if it doesn't, proceed with the new name

We'd have to keep this scaffold code in the codebase for a couple of versions to ensure that users upgrading and skipping some patch versions (e.g. v3.0.1 -> v3.0.3) are also not affected. We'd then remove the scaffold after a "tolerance period". (How would we determine the length of that tolerance period?)

@simster7
Copy link
Member Author

Also perhaps we could include a notice in the release notes the advise users to not upgrade with running workflows. And we can include that notice whenever applicable. It seems to me that we must have already made similar bug fixes that would have broken running nodes in the past.

This could be a good middle ground between users not seeing unexpected, one-time issues during upgraded and us being able to fix bugs that currently exist in the codebase.

@simster7
Copy link
Member Author

Decision: add scaffolding code

Signed-off-by: Simon Behar <simbeh7@gmail.com>
Signed-off-by: Simon Behar <simbeh7@gmail.com>
Signed-off-by: Simon Behar <simbeh7@gmail.com>
Copy link
Member Author

@simster7 simster7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alexec Now fully backwards compatible, with tests that ensure it. Ready for another review 🙂

@@ -301,6 +301,31 @@ func withAnnotation(key, val string) with {
return func(pod *apiv1.Pod) { pod.Annotations[key] = val }
}

// createRunningPods creates the pods that are marked as running in a given test so that they can be accessed by the
// pod assessor
func createRunningPods(ctx context.Context, woc *wfOperationCtx, with ...with) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Useful utility testing function to create pods in the client that are supposed to be running according to the test workflow status.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unused? delete?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only the with parameter was unused. Removed.

Comment on lines +704 to +711
// Previously we used `depNode.DisplayName` to generate all onExit node names. However, as these can be non-unique
// we transitioned to using `depNode.Name` instead, which are guaranteed to be unique. In order to not disrupt
// running workflows during upgrade time, we do an additional check to see if there is an onExit node with the old
// name (`depNode.DisplayName`).
// TODO: This scaffold code should be removed after a couple of "grace period" version upgrades to allow transitions. It was introduced in v3.0.0
// See more: https://github.com/argoproj/argo-workflows/issues/5502
legacyOnExitNodeName := common.GenerateOnExitNodeName(depNode.DisplayName)
if onExitNode := d.wf.GetNodeByName(legacyOnExitNodeName); onExitNode != nil && d.wf.GetNodeByName(depNode.Name).HasChild(onExitNode.ID) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is tested by TestOnExitDAGStatusCompatibility in this PR

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice

@@ -6314,3 +6314,223 @@ func TestGenerateOutputResultRegex(t *testing.T) {
assert.Equal(t, `steps\.template-name\.outputs\.result`, ref)
assert.Equal(t, `steps\[['\"]template-name['\"]\]\.outputs.result`, expr)
}

const testOnExitNameBackwardsCompatibility = `
apiVersion: argoproj.io/v1alpha1
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've already ensured these test workflows are minimal.

Signed-off-by: Simon Behar <simbeh7@gmail.com>
Signed-off-by: Simon Behar <simbeh7@gmail.com>
Comment on lines +2944 to +2959
// Previously we used `parentDisplayName` to generate all onExit node names. However, as these can be non-unique
// we transitioned to using `parentNodeName` instead, which are guaranteed to be unique. In order to not disrupt
// running workflows during upgrade time, we first check if there is an onExit node that currently exists with the
// legacy name AND said node is a child of the parent node. If it does, we continue execution with the legacy name.
// If it doesn't, we use the new (and unique) name for all operations henceforth.
// TODO: This scaffold code should be removed after a couple of "grace period" version upgrades to allow transitions. It was introduced in v3.0.0
// When the scaffold code is removed, we should only have the following:
//
// onExitNodeName := common.GenerateOnExitNodeName(parentNodeName)
//
// See more: https://github.com/argoproj/argo-workflows/issues/5502
onExitNodeName := common.GenerateOnExitNodeName(parentNodeName)
legacyOnExitNodeName := common.GenerateOnExitNodeName(parentDisplayName)
if legacyNameNode := woc.wf.GetNodeByName(legacyOnExitNodeName); legacyNameNode != nil && woc.wf.GetNodeByName(parentNodeName).HasChild(legacyNameNode.ID) {
onExitNodeName = legacyOnExitNodeName
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is tested by TestOnExitNameBackwardsCompatibility in this PR

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the block of code looks to be duplicated - maybe create a utility func?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had a utility function before, but once all the parameters were taken into account it didn't really reduce complexity so I decided to keep it like so.

@@ -1806,6 +1806,15 @@ func (n NodeStatus) GetDuration() time.Duration {
return n.FinishedAt.Sub(n.StartedAt.Time)
}

func (n NodeStatus) HasChild(childID string) bool {
for _, nodeID := range n.Children {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unit test please

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Working on it right now, will push soon

@@ -301,6 +301,31 @@ func withAnnotation(key, val string) with {
return func(pod *apiv1.Pod) { pod.Annotations[key] = val }
}

// createRunningPods creates the pods that are marked as running in a given test so that they can be accessed by the
// pod assessor
func createRunningPods(ctx context.Context, woc *wfOperationCtx, with ...with) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unused? delete?

Comment on lines +704 to +711
// Previously we used `depNode.DisplayName` to generate all onExit node names. However, as these can be non-unique
// we transitioned to using `depNode.Name` instead, which are guaranteed to be unique. In order to not disrupt
// running workflows during upgrade time, we do an additional check to see if there is an onExit node with the old
// name (`depNode.DisplayName`).
// TODO: This scaffold code should be removed after a couple of "grace period" version upgrades to allow transitions. It was introduced in v3.0.0
// See more: https://github.com/argoproj/argo-workflows/issues/5502
legacyOnExitNodeName := common.GenerateOnExitNodeName(depNode.DisplayName)
if onExitNode := d.wf.GetNodeByName(legacyOnExitNodeName); onExitNode != nil && d.wf.GetNodeByName(depNode.Name).HasChild(onExitNode.ID) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice

Comment on lines +2944 to +2959
// Previously we used `parentDisplayName` to generate all onExit node names. However, as these can be non-unique
// we transitioned to using `parentNodeName` instead, which are guaranteed to be unique. In order to not disrupt
// running workflows during upgrade time, we first check if there is an onExit node that currently exists with the
// legacy name AND said node is a child of the parent node. If it does, we continue execution with the legacy name.
// If it doesn't, we use the new (and unique) name for all operations henceforth.
// TODO: This scaffold code should be removed after a couple of "grace period" version upgrades to allow transitions. It was introduced in v3.0.0
// When the scaffold code is removed, we should only have the following:
//
// onExitNodeName := common.GenerateOnExitNodeName(parentNodeName)
//
// See more: https://github.com/argoproj/argo-workflows/issues/5502
onExitNodeName := common.GenerateOnExitNodeName(parentNodeName)
legacyOnExitNodeName := common.GenerateOnExitNodeName(parentDisplayName)
if legacyNameNode := woc.wf.GetNodeByName(legacyOnExitNodeName); legacyNameNode != nil && woc.wf.GetNodeByName(parentNodeName).HasChild(legacyNameNode.ID) {
onExitNodeName = legacyOnExitNodeName
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the block of code looks to be duplicated - maybe create a utility func?

@alexec alexec changed the title fix: Use node.Name instead of node.DisplayName for onExit nodes fix(controller): Use node.Name instead of node.DisplayName for onExit nodes Mar 24, 2021
Signed-off-by: Simon Behar <simbeh7@gmail.com>
@simster7 simster7 merged commit d964fe4 into argoproj:master Mar 24, 2021
This was referenced Mar 29, 2021
simster7 added a commit that referenced this pull request Mar 30, 2021
… nodes (#5486)

Signed-off-by: Simon Behar <simbeh7@gmail.com>
@simster7 simster7 mentioned this pull request Apr 19, 2021
50 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants