fix: cache configmap don't create with workflow has retrystrategy. Fixes: #12490 #10426 #12491

Conversation

shuangkun
Member

@shuangkun shuangkun commented Jan 9, 2024

I think MemoizationStatus should be handled like inputs: if the executed template has Memoize set, the node should have a MemoizationStatus.

I tested this: if I set the MemoizationStatus, the unit test passes. If I don't, the unit test fails.

Fixes #12490
Fixes #10426

Motivation

Modifications

Verification

@Joibel Joibel self-requested a review January 9, 2024 16:14
@Joibel
Member

Joibel commented Jan 9, 2024

Would also fix the second part of #10426

@Joibel
Member

Joibel commented Jan 9, 2024

I'm going to have a build and test of this tomorrow, but the code looks like it does the right thing. Thanks @shuangkun.

@juliev0 juliev0 added the prioritized-review For members of the Sustainability Effort label Jan 9, 2024
@shuangkun
Member Author

> I'm going to have a build and test of this tomorrow, but the code looks like it does the right thing. Thanks @shuangkun.

Thanks for your review!

@Joibel
Member

Joibel commented Jan 10, 2024

This breaks the workflow example from #10426. Without this change the workflow doesn't correctly memoize but does run correctly to completion. With this change you get

level=error msg="Recovered from panic" namespace=argo r="runtime error: invalid memory address or nil pointer dereference" stack="goroutine 307 [running]:\nruntime/debug.Stack()\n\t/usr/local/go/src/runtime/debug/stack.go:24 +0x5e\ngithub.com/argoproj/argo-workflows/v3/workflow/controller.(*wfOperationCtx).operate.func2()\n\t/home/vscode/go/src/github.com/argoproj/argo-workflows/workflow/controller/operator.go:200 +0xb4\npanic({0x20c1e00?, 0x39f0490?})\n\t/usr/local/go/src/runtime/panic.go:920 +0x270\ngithub.com/argoproj/argo-workflows/v3/workflow/controller.(*wfOperationCtx).runOnExitNode(0xc00150a0c0, {0x27b8560, 0x3a5c2c0}, 0x0, 0xc000d07980, {0xc0015265a0, 0x1b}, 0xc000fe9650?, {0xc000aa1db8, 0x14}, ...)\n\t/home/vscode/go/src/github.com/argoproj/argo-workflows/workflow/controller/exit_handler.go:20 +0xa6\ngithub.com/argoproj/argo-workflows/v3/workflow/controller.(*wfOperationCtx).executeDAGTask(0xc00150a0c0, {0x27b8560, 0x3a5c2c0}, 0xc00034b490, {0xc000b92a90, 0xe})\n\t/home/vscode/go/src/github.com/argoproj/argo-workflows/workflow/controller/dag.go:614 +0x2232\ngithub.com/argoproj/argo-workflows/v3/workflow/controller.(*wfOperationCtx).executeDAG(0xc00150a0c0, {0x27b8560, 0x3a5c2c0}, {0xc001526540, 0x1b}, 0xc000404b80, {0xc000acab40, 0x21}, 0xc000adfd40, {0x27bbe20, ...}, ...)\n\t/home/vscode/go/src/github.com/argoproj/argo-workflows/workflow/controller/dag.go:269 +0x42a\ngithub.com/argoproj/argo-workflows/v3/workflow/controller.(*wfOperationCtx).executeTemplate(0xc00150a0c0, {0x27b8560, 0x3a5c2c0}, {0xc001526540, 0x1b}, {0x27bbe20, 0xc000b46000?}, 0x46add3?, {{0x0, 0x0, ...}, ...}, ...)\n\t/home/vscode/go/src/github.com/argoproj/argo-workflows/workflow/controller/operator.go:2120 +0x2d4a\ngithub.com/argoproj/argo-workflows/v3/workflow/controller.(*wfOperationCtx).operate(0xc00150a0c0, {0x27b8560?, 0x3a5c2c0})\n\t/home/vscode/go/src/github.com/argoproj/argo-workflows/workflow/controller/operator.go:364 +0x18cb\ngithub.com/argoproj/argo-workflows/v3/workflow/controller.(*WorkflowController).processNextItem(0xc00042f900, {0x27b8560, 0x3a5c2c0})\n\t/home/vscode/go/src/github.com/argoproj/argo-workflows/workflow/controller/controller.go:802 +0x687\ngithub.com/argoproj/argo-workflows/v3/workflow/controller.(*WorkflowController).runWorker(0xc000d846a0?)\n\t/home/vscode/go/src/github.com/argoproj/argo-workflows/workflow/controller/controller.go:719 +0x88\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x30?)\n\t/home/vscode/go/pkg/mod/k8s.io/apimachinery@v0.24.3/pkg/util/wait/wait.go:155 +0x33\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil(0x0?, {0x27909c0, 0xc000d0a870}, 0x1, 0xc0006b6240)\n\t/home/vscode/go/pkg/mod/k8s.io/apimachinery@v0.24.3/pkg/util/wait/wait.go:156 +0xaf\nk8s.io/apimachinery/pkg/util/wait.JitterUntil(0x0?, 0x3b9aca00, 0x0, 0x0?, 0x0?)\n\t/home/vscode/go/pkg/mod/k8s.io/apimachinery@v0.24.3/pkg/util/wait/wait.go:133 +0x7f\nk8s.io/apimachinery/pkg/util/wait.Until(0x0?, 0x0?, 0x0?)\n\t/home/vscode/go/pkg/mod/k8s.io/apimachinery@v0.24.3/pkg/util/wait/wait.go:90 +0x1e\ncreated by github.com/argoproj/argo-workflows/v3/workflow/controller.(*WorkflowController).Run in goroutine 50\n\t/home/vscode/go/src/github.com/argoproj/argo-workflows/workflow/controller/controller.go:339 +0x185a\n" workflow=memoized-workflow-testgbx2b

That's how it comes out in the logs, sorry it's horridly formatted.

Could you look into this?

Member

@Joibel Joibel left a comment

This causes the workflow given as an example in #10426 to crash the controller.

@Joibel Joibel self-assigned this Jan 10, 2024
@shuangkun
Member Author

Thanks, I will test it tomorrow!

@shuangkun shuangkun force-pushed the fix/MemoizeNotCreatedInRetryStrategy branch from 44f0ce3 to a8cc3d0 Compare January 11, 2024 15:47
@shuangkun shuangkun force-pushed the fix/MemoizeNotCreatedInRetryStrategy branch from 7248c81 to e745202 Compare January 11, 2024 16:59
@shuangkun
Member Author

> This causes the workflow given as an example in #10426 to crash the controller.

Thanks for the good find. I tested it and found the root cause: when a node hits the memoization cache, it has no children, even though it is a Retry-type node. So I think we should take the node's outputs from its last child only when it did not hit the cache.
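To make the root cause concrete, here is a minimal, self-contained sketch. It is an illustration under stated assumptions, not the controller's code: `NodeStatus`, `Outputs`, and `MemoizationStatus` are trimmed-down stand-ins for the `wfv1` types, `lastChild` is a hypothetical substitute for `getChildNodeIndex(node, nodes, -1)`, and `outputsUnguarded` is assumed to mirror the pre-fix logic in `runOnExitNode`.

```go
package main

import "fmt"

// Simplified stand-ins for the wfv1 types (hypothetical; the real structs
// carry many more fields).
type NodeType string

const NodeTypeRetry NodeType = "Retry"

type MemoizationStatus struct{ Hit bool }

type Outputs struct{ Result string }

type NodeStatus struct {
	Type              NodeType
	Children          []string
	Outputs           *Outputs
	MemoizationStatus *MemoizationStatus
}

// lastChild is a hypothetical stand-in for getChildNodeIndex(node, nodes, -1):
// it returns the last child, or nil when the node has no children.
func lastChild(node *NodeStatus, nodes map[string]*NodeStatus) *NodeStatus {
	if len(node.Children) == 0 {
		return nil
	}
	return nodes[node.Children[len(node.Children)-1]]
}

// outputsUnguarded assumes every Retry node has children (assumed pre-fix behavior).
func outputsUnguarded(node *NodeStatus, nodes map[string]*NodeStatus) *Outputs {
	outputs := node.Outputs
	if node.Type == NodeTypeRetry {
		outputs = lastChild(node, nodes).Outputs // nil dereference on a cache hit
	}
	return outputs
}

// outputsGuarded uses the condition from the diff: skip the child lookup on a cache hit.
func outputsGuarded(node *NodeStatus, nodes map[string]*NodeStatus) *Outputs {
	outputs := node.Outputs
	if node.Type == NodeTypeRetry && !(node.MemoizationStatus != nil && node.MemoizationStatus.Hit) {
		outputs = lastChild(node, nodes).Outputs
	}
	return outputs
}

// panics reports whether f panics.
func panics(f func()) (panicked bool) {
	defer func() {
		if recover() != nil {
			panicked = true
		}
	}()
	f()
	return false
}

func main() {
	nodes := map[string]*NodeStatus{}
	// A Retry node served from the cache: it has outputs but no children.
	hit := &NodeStatus{
		Type:              NodeTypeRetry,
		Outputs:           &Outputs{Result: "cached"},
		MemoizationStatus: &MemoizationStatus{Hit: true},
	}
	fmt.Println("unguarded panics:", panics(func() { outputsUnguarded(hit, nodes) }))
	fmt.Println("guarded result:", outputsGuarded(hit, nodes).Result)
}
```

The unguarded variant dereferences a nil child for the cache-hit node, which matches the kind of nil pointer panic reported at `exit_handler.go:20`; the guarded variant falls back to the retry node's own outputs.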

@shuangkun shuangkun requested a review from Joibel January 11, 2024 17:06
@shuangkun shuangkun force-pushed the fix/MemoizeNotCreatedInRetryStrategy branch from e745202 to 6f98655 Compare January 12, 2024 02:20
@@ -15,7 +15,7 @@ import (

 func (woc *wfOperationCtx) runOnExitNode(ctx context.Context, exitHook *wfv1.LifecycleHook, parentNode *wfv1.NodeStatus, boundaryID string, tmplCtx *templateresolution.Context, prefix string, scope *wfScope) (bool, *wfv1.NodeStatus, error) {
 	outputs := parentNode.Outputs
-	if parentNode.Type == wfv1.NodeTypeRetry {
+	if parentNode.Type == wfv1.NodeTypeRetry && !(parentNode.MemoizationStatus != nil && parentNode.MemoizationStatus.Hit) {
Member

This is a complex thing to get your head around (not that it wasn't before your change). I agree it does the right thing, but I'd like to wrap up this if and the subsequent getChildNodeIndex() call into a single helper function.
It feels like it may result in breakage in the future unless we do this.

// Check if we have a retry node which wasn't memoized and return that if we do
func (woc *wfOperationCtx) possiblyGetRetryChildNode(node *wfv1.NodeStatus) *wfv1.NodeStatus {
	if node.Type == wfv1.NodeTypeRetry && !(node.MemoizationStatus != nil && node.MemoizationStatus.Hit) {
		return getChildNodeIndex(node, woc.wf.Status.Nodes, -1)
	}
	return nil
}

and then this becomes

if lastChildNode := woc.possiblyGetRetryChildNode(node); lastChildNode != nil {
	outputs = lastChildNode.Outputs
}
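As a rough, runnable illustration of the suggested refactor, the sketch below wires the helper into a single place that resolves a node's effective outputs. Everything except the helper's name and condition is an assumption made for the example: `operationCtx`, `NodeStatus`, `Outputs`, `lastChild`, and `effectiveOutputs` are hypothetical stand-ins, not the real `wfOperationCtx`, `wfv1` types, or `getChildNodeIndex`.

```go
package main

import "fmt"

// Trimmed-down stand-ins for the wfv1 types (hypothetical).
type NodeType string

const (
	NodeTypeRetry NodeType = "Retry"
	NodeTypePod   NodeType = "Pod"
)

type MemoizationStatus struct{ Hit bool }

type Outputs struct{ Result string }

type NodeStatus struct {
	Type              NodeType
	Children          []string
	Outputs           *Outputs
	MemoizationStatus *MemoizationStatus
}

// operationCtx is a hypothetical stand-in for wfOperationCtx; it only holds the node map.
type operationCtx struct {
	nodes map[string]*NodeStatus
}

// lastChild stands in for getChildNodeIndex(node, nodes, -1).
func lastChild(node *NodeStatus, nodes map[string]*NodeStatus) *NodeStatus {
	if len(node.Children) == 0 {
		return nil
	}
	return nodes[node.Children[len(node.Children)-1]]
}

// possiblyGetRetryChildNode returns the last child of a Retry node that was
// not served from the memoization cache, and nil in every other case.
func (woc *operationCtx) possiblyGetRetryChildNode(node *NodeStatus) *NodeStatus {
	if node.Type == NodeTypeRetry && !(node.MemoizationStatus != nil && node.MemoizationStatus.Hit) {
		return lastChild(node, woc.nodes)
	}
	return nil
}

// effectiveOutputs is a hypothetical caller: it prefers the last retry
// attempt's outputs and otherwise keeps the node's own outputs.
func (woc *operationCtx) effectiveOutputs(node *NodeStatus) *Outputs {
	outputs := node.Outputs
	if lastChildNode := woc.possiblyGetRetryChildNode(node); lastChildNode != nil {
		outputs = lastChildNode.Outputs
	}
	return outputs
}

func main() {
	woc := &operationCtx{nodes: map[string]*NodeStatus{
		"attempt-1": {Type: NodeTypePod, Outputs: &Outputs{Result: "from-attempt"}},
	}}
	retried := &NodeStatus{Type: NodeTypeRetry, Children: []string{"attempt-1"}}
	memoized := &NodeStatus{
		Type:              NodeTypeRetry,
		Outputs:           &Outputs{Result: "from-cache"},
		MemoizationStatus: &MemoizationStatus{Hit: true},
	}
	fmt.Println(woc.effectiveOutputs(retried).Result)
	fmt.Println(woc.effectiveOutputs(memoized).Result)
}
```

Because the nil check lives at the call site, a Retry node without children no longer causes a dereference, regardless of why the children are missing.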

Member Author

Thanks! I agree with you, and I have changed all of these call sites.

@@ -75,7 +75,7 @@ func (woc *wfOperationCtx) executeTmplLifeCycleHook(ctx context.Context, scope *
 	// executeTemplated should be invoked when hookedNode != nil, because we should reexecute the function to check mutex condition, etc.
 	if execute || hookedNode != nil {
 		outputs := parentNode.Outputs
-		if parentNode.Type == wfv1.NodeTypeRetry {
+		if parentNode.Type == wfv1.NodeTypeRetry && !(parentNode.MemoizationStatus != nil && parentNode.MemoizationStatus.Hit) {
Member

And we can use that helper here.

@@ -2978,7 +2988,7 @@ func (woc *wfOperationCtx) requeueIfTransientErr(err error, nodeName string) (*w
 func (woc *wfOperationCtx) buildLocalScope(scope *wfScope, prefix string, node *wfv1.NodeStatus) {
 	// It may be that the node is a retry node, in which case we want to get the outputs of the last node
 	// in the retry group instead of the retry node itself.
-	if node.Type == wfv1.NodeTypeRetry {
+	if node.Type == wfv1.NodeTypeRetry && !(node.MemoizationStatus != nil && node.MemoizationStatus.Hit) {
Member

And here.

@shuangkun shuangkun force-pushed the fix/MemoizeNotCreatedInRetryStrategy branch from 6f98655 to 20d5acc Compare January 15, 2024 16:16
Signed-off-by: shuangkun <tsk2013uestc@163.com>
@shuangkun shuangkun force-pushed the fix/MemoizeNotCreatedInRetryStrategy branch from 20d5acc to bca2466 Compare January 15, 2024 16:19
Member

@Joibel Joibel left a comment

Nice, thank you.
LGTM. @terrytangyuan, could you take a look please?

Member

@terrytangyuan terrytangyuan left a comment

Can you fix the PR title?

@shuangkun shuangkun changed the title fix: cache configmap don't create with workflow has retrystrategy. Fi… fix: cache configmap don't create with workflow has retrystrategy. Fixes: #12490 #10426 Jan 19, 2024
@shuangkun
Member Author

> Can you fix the PR title?

Yes. Is this okay?

@terrytangyuan terrytangyuan merged commit 46c1324 into argoproj:main Jan 19, 2024
28 checks passed
@agilgur5 agilgur5 added area/memoization area/retry-manual Manual workflow "Retry" Action (API/CLI/UI). See retryStrategy for template-level retries labels Jan 19, 2024
isubasinghe pushed a commit to isubasinghe/argo-workflows that referenced this pull request Feb 28, 2024
: argoproj#12490 argoproj#10426 (argoproj#12491)

Signed-off-by: Isitha Subasinghe <isubasinghe@student.unimelb.edu.au>
@agilgur5 agilgur5 added area/retryStrategy Template-level retryStrategy and removed area/retry-manual Manual workflow "Retry" Action (API/CLI/UI). See retryStrategy for template-level retries labels Apr 25, 2024
Labels
area/memoization area/retryStrategy Template-level retryStrategy prioritized-review For members of the Sustainability Effort