
fix: add guard against NodeStatus. Fixes #11102 #11451

Merged
8 commits merged into argoproj:master on Aug 1, 2023

Conversation

@isubasinghe (Member) commented Jul 26, 2023

Fixes #11102 and also issues in the same class (such as #10285)

Motivation

Adds a guard to all NodeStatus accesses; this forces callers to check the HashMap before using a value from it.
This is needed because the HashMap will otherwise return a default-initialised NodeStatus. Default-initialised NodeStatus values have triggered edge-case bugs for our customers. This fix has existed internally at Pipekit for a while now and resolved those customers' issues with container sets.

Modifications

Fairly simple changes were added.

  • panic when we are absolutely certain a NodeStatus should exist but we see an error
  • bubble up errors instead of relying on a default-initialised NodeStatus (a sketch of this pattern follows the list)
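
A minimal sketch of the guarded access pattern described above; the function and type names here are illustrative stand-ins, not the exact API introduced by this PR:

```go
package main

import (
	"errors"
	"fmt"
)

// NodeStatus is a stand-in for the real workflow node status type.
type NodeStatus struct{ Name, Phase string }

var ErrNodeNotFound = errors.New("node not found")

// getNode returns an error for missing entries instead of a zero-valued struct.
func getNode(nodes map[string]NodeStatus, name string) (NodeStatus, error) {
	node, ok := nodes[name]
	if !ok {
		return NodeStatus{}, fmt.Errorf("%w: %s", ErrNodeNotFound, name)
	}
	return node, nil
}

func main() {
	nodes := map[string]NodeStatus{"step-a": {Name: "step-a", Phase: "Succeeded"}}

	// Normal case: bubble the error up to the caller.
	if _, err := getNode(nodes, "step-b"); err != nil {
		fmt.Println("propagate:", err)
	}

	// Case where the node is known to exist: a missing entry is a programming error.
	node, err := getNode(nodes, "step-a")
	if err != nil {
		panic(err)
	}
	fmt.Println(node.Phase)
}
```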

Verification

  • Manual testing

Signed-off-by: isubasinghe <isitha@pipekit.io>

fix: get rid of return for Set, remove logs

Signed-off-by: isubasinghe <isitha@pipekit.io>

fix: has returns correct value

Signed-off-by: isubasinghe <isitha@pipekit.io>

fix: remove debug logs

Signed-off-by: isubasinghe <isitha@pipekit.io>

fix: ensure tests pass

Signed-off-by: isubasinghe <isitha@pipekit.io>

fix: restore back to /bin/bash

Signed-off-by: isubasinghe <isitha@pipekit.io>

fix: remove logging

Signed-off-by: isubasinghe <isitha@pipekit.io>

fix: remove logrus and replace with log

Signed-off-by: isubasinghe <isitha@pipekit.io>

fix: replace panic with errors

Signed-off-by: isubasinghe <isitha@pipekit.io>

fix: add comments and use Get for helper fn

Signed-off-by: isubasinghe <isitha@pipekit.io>

fix: always diagnose as failed when shutdown

Signed-off-by: Isitha Subasinghe <isitha@pipekit.io>

fix: remove error return in taskresult reconcilation

Signed-off-by: Isitha Subasinghe <isitha@pipekit.io>
@isubasinghe marked this pull request as draft July 26, 2023 08:31
@isubasinghe marked this pull request as ready for review August 1, 2023 09:32
@isubasinghe (Member, Author) commented Aug 1, 2023

Simple changes; they just make error handling mandatory.

We encountered too many randomly failing container sets without these fixes.

Invalid access errors are now just bubbled up, instead of relying on default-initialised structs.

@isubasinghe (Member, Author) commented Aug 1, 2023

Blocking #11493

@caelan-io requested a review from Joibel August 1, 2023 09:55
@Joibel (Member) left a comment:

Looks good to me.

@terrytangyuan merged commit 1f6b19f into argoproj:master Aug 1, 2023
42 of 43 checks passed
@agilgur5 (Member) left a comment:

LGTM. Simple change, but so many places needed changing; nice work getting through them!

For my own edification, did the edge case bugs primarily happen on retried workflows? Reading through this code, it seems that retried workflows are the only place where Nodes get deleted?
Please correct me if I'm wrong, just trying to understand when/how this class of bugs would happen

@isubasinghe (Member, Author) commented:

@agilgur5 some areas of the code weren't checking whether the NodeStatus actually existed; thanks to Go's (in)sane way of dealing with missing map entries (zero initialised), we were getting valid structs.

x := make(map[string]NodeStatus)
node := x["blah"] // no panic or error: node is a default-initialised NodeStatus
_ = node          // just making a point: the zero value looks like a valid struct

This didn't just happen for retried workflows; for some reason, workflows with container sets pretty much randomly failed all the time. I believe this was because some code path was falsely relying on the default-initialised values.

This change now forces you to explicitly handle the case where a NodeStatus is missing.
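
For reference, a minimal example of the comma-ok map access that makes the missing-key case explicit; the NodeStatus type and key names here are stand-ins, not the controller's real types:

```go
package main

import "fmt"

// NodeStatus is a stand-in for the real workflow node status type.
type NodeStatus struct{ Phase string }

func main() {
	nodes := map[string]NodeStatus{}

	// Plain indexing hides the missing entry behind a zero value...
	zero := nodes["missing"]
	fmt.Printf("%+v\n", zero) // prints {Phase:}

	// ...while the comma-ok form lets the caller handle it explicitly.
	if node, ok := nodes["missing"]; ok {
		fmt.Println(node.Phase)
	} else {
		fmt.Println("node does not exist")
	}
}
```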

@agilgur5 (Member) commented Aug 3, 2023

@agilgur5 some areas of the code weren't checking whether the NodeStatus actually existed; thanks to Go's (in)sane way of dealing with missing map entries (zero initialised), we were getting valid structs.

This change now forces you to explicitly handle the case where a NodeStatus is missing.

Yea I understood the purpose of the PR. Another Go gotcha 😕

This didn't just happen for retried workflows; for some reason, workflows with container sets pretty much randomly failed all the time. I believe this was because some code path was falsely relying on the default-initialised values.

This is the part I was curious about, as I would think the only time an uninitialized value would be accessed is if the key previously existed, i.e. it was deleted. Otherwise, I would think that the Nodes map would only grow (until it is GC'd), and so a non-existent key should never occur.
Or, to put it differently, I'm trying to understand the root cause of when this would happen: not just that the node could not exist, but why it would not exist.

@isubasinghe (Member, Author) commented:

This is the part I was curious about, as I would think the only time an uninitialized value would be accessed is if the key previously existed, i.e. it was deleted. Otherwise, I would think that the Nodes map would only grow (until it is GC'd), and so a non-existent key should never occur.
Or, to put it differently, I'm trying to understand the root cause of when this would happen: not just that the node could not exist, but why it would not exist.

This is a good question, unfortunately one I do not have a definitive answer to. I believe that the codebase is somewhat loosely consistent: code paths are allowed to fail as long as they eventually succeed. That is my educated guess anyway.
I dealt with this problem for months before realising that this was the cause. This simple change was likely the most difficult PR I opened in the codebase.

@agilgur5 (Member) commented:

This is a good question, unfortunately one I do not have a definitive answer to.

Gotcha. Was hoping you might know. Guess we still have to watch out for the root cause somewhere (which may give this new error now at least!). If it's not just impacting retries, there might be a thread safety issue somewhere 🤔

This simple change was likely the most difficult PR I opened in the codebase.

I know that feeling all too well 😅 I like to use these types of fixes as great examples that the amount of code is often not what makes a contribution impactful!

@agilgur5 (Member) commented Sep 1, 2023

Jotting down some notes here as I attempted to find the root cause:

  1. Searched the codebase for places where a NodeStatus could have been accessed or removed (keywords: "delete" and "Status.Nodes").
  2. delete indeed seems to be used only during retries.
  3. The only other place I could find where nodes are removed is in this line in compressWorkflow, which I believe is only used for node status offload.
    • Which does lead to the question of whether this has been reproduced without node status offload?
  4. The only other potential anomaly is that some places seem to lock when modifying the nodes and others don't. I did not inspect carefully why that is, though.
    • I also thought that the goroutines did not share a workflow between each other (roughly this codepath), so I thought locking wouldn't be necessary. But I may have misinterpreted the code; the controller is the most complex part of the codebase, after all.
    • It would be nice if the Go compiler had more advanced race condition detection (like Rust, Pony, etc.). I also don't think(?) we use Go's race detector (a sketch of the kind of access it would flag follows this list).
    • Even if there were some race conditions when modifying, that should never result in a Node not existing, so those cases don't really explain it either 😕
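
For reference, a minimal sketch of the kind of unsynchronised map access that Go's race detector (enabled via go test -race ./... or go run -race) would flag; the NodeStatus type here is a stand-in, not the real controller type:

```go
package main

import "sync"

// NodeStatus is a stand-in for the real workflow node status type.
type NodeStatus struct{ Phase string }

func main() {
	nodes := map[string]NodeStatus{}

	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		nodes["a"] = NodeStatus{Phase: "Running"} // write from a goroutine
	}()
	nodes["b"] = NodeStatus{Phase: "Pending"} // concurrent write from main

	wg.Wait()
}
```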

@agilgur5 added the area/controller label Oct 18, 2023
Joibel added a commit to Joibel/argo-workflows that referenced this pull request Nov 3, 2023
The merge with argoproj#11451 reverted this, so this commit is just to
reinstate that.
The tests included in argoproj#11379 failed to catch this, I've raised argoproj#12129
for this, but in the interests of matching the documentation and
kubecon next week I'm putting this PR in now.

Fixes argoproj#12117

Signed-off-by: Alan Clucas <alan@clucas.org>
@agilgur5 added this to the v3.5.x patches milestone Jun 18, 2024
Labels
area/controller Controller issues, panics
Development

Successfully merging this pull request may close these issues.

Default initialised NodeStatus causes random false failures.
4 participants