
fix: add guard against NodeStatus. Fixes #11102 #11451

Merged
8 commits merged into argoproj:master on Aug 1, 2023

Conversation

@isubasinghe (Member) commented Jul 26, 2023

Fixes #11102 and also issues in the same class (such as #10285)

Motivation

Adds a guard to all NodeStatus accesses; this forces callers to check the HashMap before using a value from it.
This is needed because the HashMap will otherwise return a default-initialised NodeStatus. Default-initialised NodeStatus values have triggered edge-case bugs for our customers. This fix has existed internally at Pipekit for a while now and resolved those customers' issues with container sets.

Modifications

Fairly simple changes were added.

  • panic when we are absolutely certain a NodeStatus should exist but we see an error
  • bubble up errors instead of relying on a default-initialised NodeStatus (a sketch of this pattern follows the list)
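
A minimal sketch of the guarded access pattern described above; the function and type names here are illustrative stand-ins, not the exact API introduced by this PR:

```go
package main

import (
	"errors"
	"fmt"
)

// NodeStatus is a stand-in for the real workflow node status type.
type NodeStatus struct{ Name, Phase string }

var ErrNodeNotFound = errors.New("node not found")

// getNode returns an error for missing entries instead of a zero-valued struct.
func getNode(nodes map[string]NodeStatus, name string) (NodeStatus, error) {
	node, ok := nodes[name]
	if !ok {
		return NodeStatus{}, fmt.Errorf("%w: %s", ErrNodeNotFound, name)
	}
	return node, nil
}

func main() {
	nodes := map[string]NodeStatus{"step-a": {Name: "step-a", Phase: "Succeeded"}}

	// Normal case: bubble the error up to the caller.
	if _, err := getNode(nodes, "step-b"); err != nil {
		fmt.Println("propagate:", err)
	}

	// Case where the node is known to exist: a missing entry is a programming error.
	node, err := getNode(nodes, "step-a")
	if err != nil {
		panic(err)
	}
	fmt.Println(node.Phase)
}
```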

Verification

  • Manual testing

Signed-off-by: isubasinghe <isitha@pipekit.io>

fix: get rid of return for Set, remove logs

Signed-off-by: isubasinghe <isitha@pipekit.io>

fix: has returns correct value

Signed-off-by: isubasinghe <isitha@pipekit.io>

fix: remove debug logs

Signed-off-by: isubasinghe <isitha@pipekit.io>

fix: ensure tests pass

Signed-off-by: isubasinghe <isitha@pipekit.io>

fix: restore back to /bin/bash

Signed-off-by: isubasinghe <isitha@pipekit.io>

fix: remove logging

Signed-off-by: isubasinghe <isitha@pipekit.io>

fix: remove logrus and replace with log

Signed-off-by: isubasinghe <isitha@pipekit.io>

fix: replace panic with errors

Signed-off-by: isubasinghe <isitha@pipekit.io>

fix: add comments and use Get for helper fn

Signed-off-by: isubasinghe <isitha@pipekit.io>

fix: always diagnose as failed when shutdown

Signed-off-by: Isitha Subasinghe <isitha@pipekit.io>

fix: remove error return in taskresult reconcilation

Signed-off-by: Isitha Subasinghe <isitha@pipekit.io>
@isubasinghe marked this pull request as draft July 26, 2023 08:31
@isubasinghe marked this pull request as ready for review August 1, 2023 09:32
@isubasinghe (Member, Author) commented Aug 1, 2023

Simple changes; they just make error handling mandatory.

We encountered too many randomly failing container sets without these fixes.

Invalid access errors are now just bubbled up, instead of relying on default-initialised structs.

@isubasinghe (Member, Author) commented Aug 1, 2023

Blocking #11493

@caelan-io requested a review from Joibel August 1, 2023 09:55
@Joibel (Member) left a comment:

Looks good to me.

@terrytangyuan merged commit 1f6b19f into argoproj:master Aug 1, 2023
42 of 43 checks passed
@agilgur5 (Member) left a comment:

LGTM. Simple change, but so many places needed changing; nice work getting through them!

For my own edification, did the edge case bugs primarily happen on retried workflows? Reading through this code, it seems that retried workflows are the only place where Nodes get deleted?
Please correct me if I'm wrong, just trying to understand when/how this class of bugs would happen

@isubasinghe (Member, Author) commented:

@agilgur5 some areas of the code weren't checking whether the NodeStatus actually existed; thanks to Go's (in)sane way of dealing with missing map entries (zero initialised), we were getting valid structs.

x := make(map[string]NodeStatus)
node := x["blah"] // no panic or error: node is a default-initialised NodeStatus
_ = node          // just making a point: the zero value looks like a valid struct

This didn't just happen for retried workflows; for some reason, workflows with container sets pretty much randomly failed all the time. I believe this was because some code path was falsely relying on the default-initialised values.

This change now forces you to explicitly handle the case where a NodeStatus is missing.
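
For reference, a minimal example of the comma-ok map access that makes the missing-key case explicit; the NodeStatus type and key names here are stand-ins, not the controller's real types:

```go
package main

import "fmt"

// NodeStatus is a stand-in for the real workflow node status type.
type NodeStatus struct{ Phase string }

func main() {
	nodes := map[string]NodeStatus{}

	// Plain indexing hides the missing entry behind a zero value...
	zero := nodes["missing"]
	fmt.Printf("%+v\n", zero) // prints {Phase:}

	// ...while the comma-ok form lets the caller handle it explicitly.
	if node, ok := nodes["missing"]; ok {
		fmt.Println(node.Phase)
	} else {
		fmt.Println("node does not exist")
	}
}
```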

@agilgur5 (Member) commented Aug 3, 2023

@agilgur5 some areas of the code weren't checking whether the NodeStatus actually existed; thanks to Go's (in)sane way of dealing with missing map entries (zero initialised), we were getting valid structs.

This change now forces you to explicitly handle the case where a NodeStatus is missing.

Yea I understood the purpose of the PR. Another Go gotcha 😕

This didn't just happen for retried workflows; for some reason, workflows with container sets pretty much randomly failed all the time. I believe this was because some code path was falsely relying on the default-initialised values.

This is the part I was curious about, as I would think the only time an uninitialized value would be accessed is if the key previously existed, i.e. it was deleted. Otherwise, I would think that the Nodes map would only grow (until it is GC'd), and so a non-existent key should never occur.
Or, to put it differently, I'm trying to understand the root cause of when this would happen: not just that the node could not exist, but why it would not exist.

@isubasinghe (Member, Author) commented:

This is the part I was curious about, as I would think the only time an uninitialized value would be accessed is if the key previously existed, i.e. it was deleted. Otherwise, I would think that the Nodes map would only grow (until it is GC'd), and so a non-existent key should never occur.
Or, to put it differently, I'm trying to understand the root cause of when this would happen: not just that the node could not exist, but why it would not exist.

This is a good question, unfortunately one I do not have a definitive answer to. I believe that the codebase is somewhat loosely consistent: code paths are allowed to fail as long as they eventually succeed. That is my educated guess anyway.
I dealt with this problem for months before realising that this was the cause. This simple change was likely the most difficult PR I opened in the codebase.

@agilgur5 (Member) commented:

This is a good question, unfortunately one I do not have a definitive answer to.

Gotcha. Was hoping you might know. Guess we still have to watch out for the root cause somewhere (which may give this new error now at least!). If it's not just impacting retries, there might be a thread safety issue somewhere 🤔

This simple change was likely the most difficult PR I opened in the codebase.

I know that feeling all too well 😅 I like to use these types of fixes as great examples that the amount of code is often not what makes a contribution impactful!

@agilgur5 (Member) commented Sep 1, 2023

Jotting down some notes here as I attempted to find the root cause:

  1. Searched the codebase for places where a NodeStatus could have been accessed or removed (keywords: "delete" and "Status.Nodes").
  2. delete indeed seems to be used only during retries.
  3. The only other place I could find where nodes are removed is in this line in compressWorkflow, which I believe is only used for node status offload.
    • Which does lead to the question of whether this has been reproduced without node status offload?
  4. The only other potential anomaly is that some places seem to lock when modifying the nodes and others don't. I did not inspect carefully why that is, though.
    • I also thought that the goroutines did not share a workflow between each other (roughly this codepath), so I thought locking wouldn't be necessary. But I may have misinterpreted the code; the controller is the most complex part of the codebase, after all.
    • It would be nice if the Go compiler had more advanced race condition detection (like Rust, Pony, etc.). I also don't think(?) we use Go's race detector (a sketch of the kind of access it would flag follows this list).
    • Even if there were some race conditions when modifying, that should never result in a Node not existing, so those cases don't really explain it either 😕
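
For reference, a minimal sketch of the kind of unsynchronised map access that Go's race detector (enabled via go test -race ./... or go run -race) would flag; the NodeStatus type here is a stand-in, not the real controller type:

```go
package main

import "sync"

// NodeStatus is a stand-in for the real workflow node status type.
type NodeStatus struct{ Phase string }

func main() {
	nodes := map[string]NodeStatus{}

	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		nodes["a"] = NodeStatus{Phase: "Running"} // write from a goroutine
	}()
	nodes["b"] = NodeStatus{Phase: "Pending"} // concurrent write from main

	wg.Wait()
}
```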

@agilgur5 added the area/controller label Oct 18, 2023
Joibel added a commit to Joibel/argo-workflows that referenced this pull request Nov 3, 2023
The merge with argoproj#11451 reverted this, so this commit is just to
reinstate that.
The tests included in argoproj#11379 failed to catch this, I've raised argoproj#12129
for this, but in the interests of matching the documentation and
kubecon next week I'm putting this PR in now.

Fixes argoproj#12117

Signed-off-by: Alan Clucas <alan@clucas.org>
@agilgur5 added this to the v3.5.x patches milestone Jun 18, 2024
Labels
area/controller Controller issues, panics
Development

Successfully merging this pull request may close these issues.

Default initialised NodeStatus causes random false failures.
4 participants