
Change log level for was unable to obtain the node for <workflow name>, <step name> #12382

Open
nicolas-vivot opened this issue Dec 17, 2023 · 5 comments
Labels
area/controller Controller issues, panics type/feature Feature request

Comments

@nicolas-vivot

nicolas-vivot commented Dec 17, 2023

Summary

This proposes changing the log level of the following log message from warning to info:

was unable to obtain the node for <workflow name>, <step name>

Use Cases

This log message happens when the workflow controller wants to schedule a step's pod, but there is no available node with the necessary resources on the cluster.
In such a case, this message is simply an indication that no node or resources are available right now (based on potential selector/taint usage). With cluster auto-scaling, a new node is provisioned and the step's pod gets scheduled there anyway.

Therefore, this log should be changed from warning to info.

Of course, there might be cases where, even with the auto-scaler, we have reached maximum capacity. In such a case the step's pod should simply be re-queued by the Kubernetes scheduler and scheduled once a node or resources become available.

Even in that situation, not being able to obtain a node at a given time is neither an error nor a warning.
This log contributes to log pollution on big clusters that handle heavy load with auto-scaling.
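
For illustration, here is a minimal standalone sketch of the proposed severity change, assuming the message is emitted through logrus (which the controller uses); the actual call sites, message formatting, and structured fields in the controller may differ:

```go
package main

import (
	log "github.com/sirupsen/logrus"
)

func main() {
	// Hypothetical values, for illustration only.
	workflowName := "my-workflow"
	stepName := "my-step"

	// Current behaviour as reported: the message is emitted at warning level.
	log.Warnf("was unable to obtain the node for %s, %s", workflowName, stepName)

	// Proposed behaviour: emit the same message at info level so it does not
	// pollute warning-level logs on large auto-scaled clusters.
	log.Infof("was unable to obtain the node for %s, %s", workflowName, stepName)
}
```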


Message from the maintainers:

Love this enhancement proposal? Give it a 👍. We prioritise the proposals with the most 👍.

@nicolas-vivot nicolas-vivot added the type/feature Feature request label Dec 17, 2023
@terrytangyuan
Member

Related #12132

@agilgur5 agilgur5 added the area/controller Controller issues, panics label Dec 21, 2023
@agilgur5
Contributor

agilgur5 commented Dec 21, 2023

This log message happens when the workflow controller wants to schedule a step's pod, but there is no available node with the necessary resources on the cluster.

Are you saying this because this is what you observed happening, or because that's how you interpreted the log message?

That log message does not refer to k8s nodes; it refers to nodes of the DAG. If you're getting that message a lot, it is usually indicative of a bug somewhere (see also #12132 (comment)).
This does indeed look like a duplicate of #12132. Do you have more details about when specifically you're receiving this message and which line of code it comes from (this is the error message for several error catches)? Or do you have a Workflow and step that reproduce this message consistently?
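
To make that distinction concrete, here is a self-contained sketch (illustrative only, not the actual Argo source; the types and lookup are stand-ins) of a DAG-node lookup miss, which is the kind of condition this message describes, independent of Kubernetes node availability:

```go
package main

import (
	"fmt"

	log "github.com/sirupsen/logrus"
)

// NodeStatus stands in for a workflow DAG node (a step/task recorded in the
// workflow's status), not a Kubernetes node.
type NodeStatus struct {
	ID   string
	Name string
}

// Nodes mimics a map of DAG nodes keyed by node ID.
type Nodes map[string]NodeStatus

// Get returns an error when the requested DAG node has not been recorded yet.
func (n Nodes) Get(id string) (*NodeStatus, error) {
	node, ok := n[id]
	if !ok {
		return nil, fmt.Errorf("node %s not found", id)
	}
	return &node, nil
}

func main() {
	nodes := Nodes{
		"wf-123-step-a": {ID: "wf-123-step-a", Name: "my-workflow.step-a"},
	}

	// A lookup miss on the DAG node map, not a lack of cluster capacity, is
	// the kind of condition behind "was unable to obtain the node for ...".
	if _, err := nodes.Get("wf-123-step-b"); err != nil {
		log.Warnf("was unable to obtain the node for %s, %s: %v", "my-workflow", "step-b", err)
	}
}
```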

@agilgur5 agilgur5 added the problem/more information needed Not enough information has been provided to diagnose this issue. label Dec 21, 2023
@github-actions

This issue has been automatically marked as stale because it has not had recent activity and needs more information. It will be closed if no further activity occurs.

@github-actions github-actions bot added the problem/stale This has not had a response in some time label Jan 10, 2024
@nicolas-vivot
Author

nicolas-vivot commented Jan 10, 2024

Are you saying this because this is what you observed happening, or because that's how you interpreted the log message?

Based on what I observed.

We are scheduling step pods on specific node pools using label selectors/tolerations, with CPU/memory limits. When the node pool is full and needs to provision a new node (which can take about 90 seconds), then we start to see that kind of log.

Reading #12132, it looks like this is actually quite a generic log that can happen in many situations.
It was not present in the previous version due to a silent drop, but it was probably already happening for a long time / several releases, yes.

@nicolas-vivot
Author

nicolas-vivot commented Jan 10, 2024

I agree that this log message actually seems to be quite important and can highlight some underlying issues/bugs.

If this message can pop up for many different reasons/situations, then there is a clear lack of logging somewhere else. Potentially problematic situations should be warned about at the source, with contextual information if possible. Of course, this is on paper and easy to say from here.
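
As a hypothetical example of what warning at the source with contextual information could look like (field names and the message are made up for illustration, not taken from the controller):

```go
package main

import (
	log "github.com/sirupsen/logrus"
)

func main() {
	// Hypothetical structured warning emitted where the failure actually
	// happens, with enough context to diagnose it; all values are made up.
	log.WithFields(log.Fields{
		"workflow": "my-workflow",
		"node":     "my-workflow.step-a",
		"nodeID":   "wf-123-step-a",
		"reason":   "node not yet recorded in workflow status",
	}).Warn("unable to resolve DAG node")
}
```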

I also noticed quite a lot of backend/front-end state inconsistencies, generating two different types of "zombie" workflows.

  • one where the underlying workflow/step pod is not running anymore: probably more of a state issue inside Argo Workflows between the state itself and the UI.
  • one where the underlying workflow/step pod is still running on Kubernetes, but its internal process got killed and so it does not do anything anymore, so the Argo Workflows controller is probably waiting for a completion that will never come.

(I will open a dedicated issue for the zombie workflow cases when I have time)

All of this without any error logs anywhere... just saying, no criticism here.

@github-actions github-actions bot removed problem/stale This has not had a response in some time problem/more information needed Not enough information has been provided to diagnose this issue. labels Jan 11, 2024