-
Notifications
You must be signed in to change notification settings - Fork 3.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Flood of "Unable to obtain node for" error messages #12132
Comments
Rather #11451 if you're on 3.5, since #11665 was the backport.
Is there a specific circumstance under which you see them? Since #11451 should mostly just reveal any existing bugs that were previously silently missed due to zero values. Also is anything going wrong in Workflow evaluation or is it just printing errors while still operating correctly? More details would be very useful as the error itself is not a root cause. |
The workflow is fine. We should check whether these errors are necessary. Maybe node was just not initialized yet. |
So if the node was not initialized yet, then there shouldn't be a call to The error message is not the problem, the underlying bug is. |
@terrytangyuan and @agilgur5 I know exactly what is going on here, it is speculative execution on the DAG section of the code. It most likely is looking for the boundaryNode here. The boundaryNode is represented by the empty string. We can of course do an if statement on the boundaryID and not print if it is the empty string, I opted not to do this because it would mean that any serious bug would also not get logged. The unfortunate thing is that NodeIDs are not quite strings, at least from a logical perspective. data NodeId
= NodeID String
| BoundaryNode or pub enum NodeID {
NodeID(String),
BoundaryID
} I see this as two issues:
Of course happy to do the if statement as a temporary solution if the logging is too annoying |
Very much agreed |
Might've missed a code link in "here"?
So #12020 did something similar, but in that case the |
I suspect this arises from the dag as mentioned previously, I am now going to find out why we can ignore the error and I will document this. If we can successfully ignore the error, I will remove the log for that particular section. |
I don't really know how to fix this without rewriting the dag.go logic, the core problem is that the dag logic is entirely speculative, this means that it is impossible to differentiate between a missing node and one that doesn't exist yet. I'm going to do something that I don't particularly like and open a PR to remove the logging in this section, the argument I am going to use for this is just the fact that while this is problematic, the code does seem to work. @agilgur5 what are your thoughts, I'd presume you'd also ideally want the speculation itself to be removed? A correction on the previous comment would be that this is not just for the boundaryNode, its for every node that this speculation exists. (Confirmed via running a dag and viewing the dag and logs in real time). |
May be worth seeing if this still reproduces after #12233 |
@drawlerr the DAG code will always produce this logging unfortunately |
Bumping this issue. After a while of getting warning level logs, a fatal level log is received for the same workflow, which kills the whole controller.
It happens on version 3.5.5 I had to rollback to argo-workflows 3.4.10 which is not an option as well, since it has an issue of not resolving the dag step output results to lifecycle-hooks. |
Also getting a flood of these log messages on Argo Workflows v3.5.7.
|
Per my previous comments, this message is generally* indicative of a bug somewhere. It can appear in many different places, so context is critical, e.g. if it happens on a certain Workflow every time. Without those details, comments are not very helpful.
|
Agreed with this approach. Does anyone want to pick this up? |
Pre-requisites
:latest
What happened/what you expected to happen?
Seeing a lot of these errors:
This was introduced in #11665
cc @isubasinghe
Version
latest
Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflows that uses private images.
Hello world example
Logs from the workflow controller
Logs from in your workflow's wait container
The text was updated successfully, but these errors were encountered: