
Flood of "Unable to obtain node for" error messages #12132

Open
2 of 3 tasks
terrytangyuan opened this issue Nov 3, 2023 · 15 comments
Labels
area/controller, P2, type/bug

Comments

@terrytangyuan
Member

Pre-requisites

  • I have double-checked my configuration
  • I can confirm the issue exists when I tested with :latest
  • I'd like to contribute the fix myself (see contributing guide)

What happened/what you expected to happen?

Seeing a lot of these errors:

time="2023-11-02T12:45:44.922Z" level=error msg="was unable to obtain node for " namespace=argo workflow=hello-world-h7bpn
time="2023-11-02T12:45:44.922Z" level=error msg="was unable to obtain node for " namespace=argo workflow=hello-world-7vspx

This was introduced in #11665

cc @isubasinghe

Version

latest

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

Hello world example

Logs from the workflow controller

kubectl logs -n argo deploy/workflow-controller | grep ${workflow}

Logs from your workflow's wait container

kubectl logs -n argo -c wait -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded
@agilgur5
Member

agilgur5 commented Nov 3, 2023

latest

This was introduced in #11665

Rather, #11451 if you're on 3.5, since #11665 was the backport.

Seeing a lot of these errors:

Is there a specific circumstance under which you see them? #11451 should mostly just reveal existing bugs that were previously silently missed due to zero values.

Also is anything going wrong in Workflow evaluation or is it just printing errors while still operating correctly?

More details would be very useful as the error itself is not a root cause.

agilgur5 added the area/controller and P3 (low priority) labels on Nov 3, 2023
@terrytangyuan
Member Author

The workflow is fine. We should check whether these errors are necessary. Maybe the node was just not initialized yet.

@agilgur5
Member

agilgur5 commented Nov 4, 2023

We should check whether these errors are necessary. Maybe node was just not initialized yet.

So if the node was not initialized yet, then there shouldn't be a call to Nodes.Get at all. The previous code already made that call though, meaning the previous code was incorrect.
The error message is an indication of a bug somewhere. If we fix the root cause of that bug, then the error should no longer show up.

The error message is not the problem; the underlying bug is.
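
To make the distinction concrete, here is a minimal Go sketch using simplified stand-in types; the names and signatures are illustrative only, not the actual Argo controller API:

// Minimal sketch with simplified stand-in types; not the actual controller API.
package main

import "fmt"

type NodeStatus struct {
    ID    string
    Phase string
}

type Nodes map[string]NodeStatus

// Get mirrors the fallible-lookup style introduced in #11451: it returns an
// error instead of silently handing back a zero value for a missing node.
func (n Nodes) Get(id string) (*NodeStatus, error) {
    node, ok := n[id]
    if !ok {
        return nil, fmt.Errorf("was unable to obtain node for %s", id)
    }
    return &node, nil
}

func main() {
    nodes := Nodes{
        "hello-world-h7bpn-123": {ID: "hello-world-h7bpn-123", Phase: "Running"},
    }

    // Problematic pattern: calling Get for a node that may simply not have
    // been created yet, then logging the error unconditionally.
    if _, err := nodes.Get("hello-world-h7bpn-456"); err != nil {
        fmt.Println("error:", err) // floods the logs even when nothing is actually wrong
    }

    // Preferred pattern: only call Get when the node is expected to exist,
    // or check for existence first and treat absence as a normal state.
    if node, ok := nodes["hello-world-h7bpn-123"]; ok {
        fmt.Println("found node:", node.ID)
    }
}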

@isubasinghe
Member

isubasinghe commented Nov 4, 2023

@terrytangyuan and @agilgur5 I know exactly what is going on here: it is speculative execution in the DAG section of the code.

It is most likely looking for the boundaryNode here. The boundaryNode is represented by the empty string.
To me this logging is a good thing and indicates something wrong on the code side (speculation in the DAG code).

We can of course do an if statement on the boundaryID and not print if it is the empty string; I opted not to do this because it would mean that any serious bug would also not get logged.

The unfortunate thing is that NodeIDs are not quite strings, at least from a logical perspective.
They are much closer to a sum type.

data NodeId
  = NodeID String
  | BoundaryNode

or

pub enum NodeID {
  NodeID(String),
  BoundaryID
}
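
In Go, where the controller is written, the same idea could be approximated with a small wrapper type. The following is only an illustrative sketch under that assumption, not code from the repository:

package main

import "fmt"

// NodeID approximates the sum type sketched above: either a named node or the
// boundary node, instead of overloading the empty string. Illustrative only.
type NodeID struct {
    name       string
    isBoundary bool
}

func NamedNode(name string) NodeID { return NodeID{name: name} }
func BoundaryNode() NodeID         { return NodeID{isBoundary: true} }

func (id NodeID) IsBoundary() bool { return id.isBoundary }

func (id NodeID) String() string {
    if id.isBoundary {
        return "<boundary>"
    }
    return id.name
}

func main() {
    fmt.Println(NamedNode("hello-world-h7bpn-123")) // hello-world-h7bpn-123
    fmt.Println(BoundaryNode())                     // <boundary>
}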

I see this as two issues:

  1. Introduce something more type-safe for the NodeID representation
  2. Refactor the DAG code to remove speculation.

Of course, I'm happy to add the if statement as a temporary solution if the logging is too annoying.
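
For reference, that temporary workaround could look roughly like this; a hypothetical sketch with made-up names, not the actual dag.go code:

package main

import (
    "fmt"
    "log"
)

func main() {
    // Simplified stand-in for the workflow's node map; names are hypothetical.
    nodes := map[string]string{}

    // Today the boundary node is represented by the empty string.
    boundaryID := ""

    if _, ok := nodes[boundaryID]; !ok {
        if boundaryID != "" {
            // A real node name is missing: worth logging, likely a bug.
            log.Printf("was unable to obtain node for %s", boundaryID)
        } else {
            // Empty string means a speculative boundary-node lookup: stay quiet.
            fmt.Println("boundary node lookup, nothing logged")
        }
    }
}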

@isubasinghe
Member

So if the node was not initialized yet, then there shouldn't be a call to Nodes.Get at all. The previous code already did that though, meaning the previous code was incorrect.

Very much agreed

@agilgur5
Member

agilgur5 commented Nov 4, 2023

It most likely is looking for the boundaryNode here.

Might've missed a code link in "here"?

We can of course do an if statement on the boundaryID and not print if it is the empty string

So #12020 did something similar, but in that case the boundaryID is an option in every caller and so it indeed may be unset.
I'm not sure which case this one is, since a code link is missing, per above.

@tooptoop4
Contributor

@isubasinghe

@isubasinghe
Member

I suspect this arises from the DAG, as mentioned previously. I am now going to find out why we can ignore the error, and I will document this. If we can safely ignore the error, I will remove the log for that particular section.

@isubasinghe
Member

isubasinghe commented Jan 20, 2024

I don't really know how to fix this without rewriting the dag.go logic. The core problem is that the DAG logic is entirely speculative, which means it is impossible to differentiate between a node that is genuinely missing and one that simply doesn't exist yet.

I'm going to do something I don't particularly like and open a PR to remove the logging in this section. The argument for this is simply that, while the situation is problematic, the code does seem to work.

@agilgur5, what are your thoughts? I'd presume you'd also ideally want the speculation itself to be removed?

A correction to the previous comment: this is not just for the boundaryNode; the speculation exists for every node. (Confirmed by running a DAG and viewing the DAG and logs in real time.)

@drawlerr
Contributor

May be worth seeing if this still reproduces after #12233

@isubasinghe
Member

@drawlerr the DAG code will always produce this logging, unfortunately.

agilgur5 added the P2 (important) label and removed the P3 (low priority) label on Feb 12, 2024
@k-kaminskas

Bumping this issue.

After a while of getting warning level logs, a fatal level log is received for the same workflow, which kills the whole controller.

time="2024-04-02T16:39:45.097Z" level=fatal msg="was unable to obtain node for xx-1712057400-2227577750" namespace=xx workflow=xx-cron-1712057400

It happens on version 3.5.5

I had to roll back to argo-workflows 3.4.10, which is not a real option either, since it has an issue where DAG step output results are not resolved for lifecycle hooks.

@ryancurrah
Contributor

ryancurrah commented Jun 4, 2024

Also getting a flood of these log messages on Argo Workflows v3.5.7.

level=warning msg="was unable to obtain the node for foo-8c78r-235979071, taskName beforetest"

@agilgur5
Member

agilgur5 commented Jun 4, 2024

Per my previous comments, this message is generally* indicative of a bug somewhere. It can appear in many different places, so context is critical, e.g. if it happens on a certain Workflow every time. Without those details, comments are not very helpful.

* = as Isitha wrote above, sometimes it's not a bug, but an unconditional log; in which case, the code should be modified to not log and to have a comment about why the error is ignored, or to not use the code which errors in the first place (e.g. it could be skipped entirely if the node will always be uninitialized).
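
As a concrete illustration of that approach, here is a hypothetical sketch (simplified stand-in types, not the actual controller code) of deliberately ignoring the expected error, with a comment explaining why:

package main

import "fmt"

type Nodes map[string]string

// Get is a simplified stand-in for a fallible node lookup; not the real API.
func (n Nodes) Get(id string) (string, error) {
    v, ok := n[id]
    if !ok {
        return "", fmt.Errorf("was unable to obtain node for %s", id)
    }
    return v, nil
}

func main() {
    nodes := Nodes{}

    // At this point in DAG evaluation the child node may legitimately not
    // exist yet, so a failed lookup is an expected state rather than a bug.
    // Deliberately ignore the error instead of logging it.
    node, err := nodes.Get("speculative-child")
    if err != nil {
        return // expected when the child has not been created yet; nothing to do
    }
    fmt.Println(node)
}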

@terrytangyuan
Member Author

in which case, the code should be modified to not log and to have a comment about why the error is ignored, or to not use the code which errors in the first place (e.g. it could be skipped entirely if the node will always be uninitialized).

Agreed with this approach. Does anyone want to pick this up?
