feat: fix bugs in retryWorkflow if failed pod node has children nodes. Fix #9244 #9285

Merged
merged 5 commits into from
Aug 11, 2022

smile-luobin
Contributor

Fixes bugs in retrying a workflow when the failed node has child nodes. Fixes #9244.

For example, in the workflow below, node t2 fails but still has a child node, t3.

{
  "apiVersion": "argoproj.io/v1alpha1",
  "kind": "Workflow",
  "metadata": {
    "generateName": "test-pipeline-"
  },
  "spec": {
    "arguments": {},
    "entrypoint": "test-pipeline-dag",
    "templates": [
      {
        "dag": {
          "tasks": [
            {
              "arguments": {},
              "name": "t1",
              "template": "succeeded"
            },
            {
              "arguments": {},
              "depends": "t1",
              "name": "t2",
              "template": "failed"
            },
            {
              "arguments": {},
              "depends": "t2 || t2.Failed",
              "name": "t3",
              "template": "succeeded"
            },
            {
              "arguments": {},
              "depends": "t3",
              "name": "t4-1",
              "template": "succeeded"
            },
            {
              "arguments": {},
              "depends": "t3",
              "name": "t4-2",
              "template": "succeeded"
            },
            {
              "arguments": {},
              "depends": "t3",
              "name": "t4-3",
              "template": "failed"
            }
          ]
        },
        "inputs": {},
        "metadata": {},
        "name": "test-pipeline-dag",
        "outputs": {}
      },
      {
        "container": {
          "command": [
            "true"
          ],
          "image": "alpine"
        },
        "name": "succeeded"
      },
      {
        "container": {
          "command": [
            "false"
          ],
          "image": "alpine",
          "resources": {}
        },
        "name": "failed"
      }
    ]
  },
  "status": {
    "finishedAt": null,
    "startedAt": null
  }
}
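
To make the scenario concrete, here is a minimal sketch of collecting every descendant of a failed node so that those nodes can be reset on retry. The `Node` type and the `descendantIDs` function are hypothetical simplifications for illustration; they are not the real `wfv1.NodeStatus` type or the `getDescendantNodeIDs` helper from this PR.

```go
package main

import (
	"fmt"
	"sort"
)

// Node is a hypothetical, simplified stand-in for a workflow status node.
type Node struct {
	ID       string
	Children []string
}

// descendantIDs returns the IDs of all nodes reachable from root through
// Children links, excluding root itself. Each ID appears at most once,
// even if several parents point to the same child.
func descendantIDs(nodes map[string]Node, root string) []string {
	seen := map[string]bool{root: true}
	queue := []string{root}
	var out []string
	for len(queue) > 0 {
		cur := queue[0]
		queue = queue[1:]
		for _, child := range nodes[cur].Children {
			if seen[child] {
				continue
			}
			seen[child] = true
			out = append(out, child)
			queue = append(queue, child)
		}
	}
	sort.Strings(out) // deterministic order for display and testing
	return out
}

func main() {
	// Mirrors the DAG above: t1 -> t2 -> t3 -> {t4-1, t4-2, t4-3}.
	nodes := map[string]Node{
		"t1":   {ID: "t1", Children: []string{"t2"}},
		"t2":   {ID: "t2", Children: []string{"t3"}},
		"t3":   {ID: "t3", Children: []string{"t4-1", "t4-2", "t4-3"}},
		"t4-1": {ID: "t4-1"},
		"t4-2": {ID: "t4-2"},
		"t4-3": {ID: "t4-3"},
	}
	fmt.Println(descendantIDs(nodes, "t2")) // prints [t3 t4-1 t4-2 t4-3]
}
```

Because t3 depends on `t2 || t2.Failed`, it runs even though t2 failed, so retrying t2 in this sketch means t3 and all three t4 nodes are candidates for reset as well.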

Signed-off-by: smile-luobin <smile.luobin@gmail.com>
Signed-off-by: smile-luobin <smile.luobin@gmail.com>
@smile-luobin smile-luobin reopened this Aug 5, 2022
@smile-luobin smile-luobin changed the title from "feat: fix bugs in retryWorkflow if failed node has children nodes. Fix #9244" to "feat: fix bugs in retryWorkflow if failed pod node has children nodes. Fix #9244" Aug 5, 2022
@terrytangyuan terrytangyuan self-assigned this Aug 5, 2022
Signed-off-by: smile-luobin <smile.luobin@gmail.com>
Member

@terrytangyuan terrytangyuan left a comment

Thank you for the fix! I left some comments.

Comment on lines 863 to 871
descendantNodeIDs := getDescendantNodeIDs(wf, node)
for _, descendantNodeID := range descendantNodeIDs {
	deletedNodes[descendantNodeID] = true
	descendantNode := wf.Status.Nodes[descendantNodeID]
	if descendantNode.Type == wfv1.NodeTypePod {
		templateName := getTemplateFromNode(descendantNode)
		version := GetWorkflowPodNameVersion(wf)
		podName := PodName(wf.Name, descendantNode.Name, templateName, descendantNode.ID, version)
		podsToDelete = append(podsToDelete, podName)
Member

I suspect there will be redundant pod names here.

Contributor Author

done
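
One common way to avoid the redundant pod names raised above is to track names already queued for deletion in a set. This is only a sketch of that general technique; `appendUnique` and the pod names are hypothetical, and the merged change may have resolved the comment differently.

```go
package main

import "fmt"

// appendUnique appends name to names only if it has not already been
// recorded in seen, so the resulting slice never contains duplicates.
func appendUnique(names []string, seen map[string]struct{}, name string) []string {
	if _, ok := seen[name]; ok {
		return names
	}
	seen[name] = struct{}{}
	return append(names, name)
}

func main() {
	seen := map[string]struct{}{}
	var podsToDelete []string
	// Hypothetical pod names; the same pod may be reached via several failed ancestors.
	for _, name := range []string{"wf-t3", "wf-t4-1", "wf-t3"} {
		podsToDelete = appendUnique(podsToDelete, seen, name)
	}
	fmt.Println(podsToDelete) // prints [wf-t3 wf-t4-1]
}
```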

Comment on lines 869 to 871
version := GetWorkflowPodNameVersion(wf)
podName := PodName(wf.Name, descendantNode.Name, templateName, descendantNode.ID, version)
podsToDelete = append(podsToDelete, podName)
Member

Can we refactor this? There is similar code above.

Contributor Author

done

Comment on lines 929 to 930
//assert.Equal(t, wfv1.NodeRunning, wf.Status.Nodes["3"].Phase)
//assert.Equal(t, wfv1.NodeRunning, wf.Status.Nodes["4"].Phase)
Member

Please remove the commented-out code.

Contributor Author

done

assert.Equal(t, wfv1.NodeSucceeded, wf.Status.Nodes["my-nested-dag-2"].Phase)
// This should be running since it's node #4's parent node.
assert.Equal(t, wfv1.NodeRunning, wf.Status.Nodes["1"].Phase)
// This should be running since it's node #1's child node and node #1 is being retried.
assert.Equal(t, wfv1.NodeRunning, wf.Status.Nodes["2"].Phase)
assert.Equal(t, wfv1.NodeSucceeded, wf.Status.Nodes["3"].Phase)
assert.Equal(t, wfv1.NodeRunning, wf.Status.Nodes["4"].Phase)
//assert.Equal(t, wfv1.NodeRunning, wf.Status.Nodes["4"].Phase)
Member

Same here

Contributor Author

done

workflows.argoproj.io/completed: "true"
workflows.argoproj.io/phase: Failed
workflows.argoproj.io/resubmitted-from-workflow: test-pipeline-t5h77
managedFields:
Member

Is this needed? Can we make this YAML more concise and only contain what's needed for the test?

Contributor Author

done

Signed-off-by: smile-luobin <smile.luobin@gmail.com>
Signed-off-by: smile-luobin <smile.luobin@gmail.com>
Member

@terrytangyuan terrytangyuan left a comment

Thank you!

@terrytangyuan terrytangyuan merged commit e12c697 into argoproj:master Aug 11, 2022
juchaosong pushed a commit to juchaosong/argo-workflows that referenced this pull request Nov 3, 2022
…Fix argoproj#9244 (argoproj#9285)

* feat: fix bugs in retry workflow if failed node has children nodes.

Signed-off-by: smile-luobin <smile.luobin@gmail.com>

* Fix bugs in retryWorkflow

Signed-off-by: smile-luobin <smile.luobin@gmail.com>

* refactor the method that gets the descendantNodes

Signed-off-by: smile-luobin <smile.luobin@gmail.com>

* do some refactoring

Signed-off-by: smile-luobin <smile.luobin@gmail.com>

* fix bugs, and add more checks in test

Signed-off-by: smile-luobin <smile.luobin@gmail.com>

Signed-off-by: smile-luobin <smile.luobin@gmail.com>
Signed-off-by: juchao <juchao@coscene.io>
Successfully merging this pull request may close these issues.

argo retry ends up in runtime error: invalid memory address or nil pointer dereference