
Change log level for was unable to obtain the node for <workflow name>, <step name> #12382

Open
nicolas-vivot opened this issue Dec 17, 2023 · 5 comments
Labels
area/controller Controller issues, panics type/feature Feature request

Comments

@nicolas-vivot

nicolas-vivot commented Dec 17, 2023

Summary

This proposes changing the log level of the following log message from warning to info:

was unable to obtain the node for <workflow name>, <step name>

Use Cases

This log message happens when the workflow controller wants to schedule a step's pod, but there is no available node with the necessary resources on the cluster.
In such a case, this message is simply an indication that no node or resources are available right now (based on potential selector/taint usage). With cluster auto-scaling, a new node is provisioned and the step's pod gets scheduled there anyway.

Therefore, this log should be changed from warning to info.

Of course, there might be cases where, even with the auto-scaler, we have reached maximum capacity. In such a case the step's pod should simply be re-queued by the Kubernetes scheduler and scheduled once a node or resources become available.

Even in that situation, not being able to obtain a node at a given time is neither an error nor a warning.
This log contributes to log pollution on big clusters that handle heavy load with auto-scaling.
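
For illustration, here is a minimal standalone sketch of the proposed severity change, assuming the message is emitted through logrus (which the controller uses); the actual call sites, message formatting, and structured fields in the controller may differ:

```go
package main

import (
	log "github.com/sirupsen/logrus"
)

func main() {
	// Hypothetical values, for illustration only.
	workflowName := "my-workflow"
	stepName := "my-step"

	// Current behaviour as reported: the message is emitted at warning level.
	log.Warnf("was unable to obtain the node for %s, %s", workflowName, stepName)

	// Proposed behaviour: emit the same message at info level so it does not
	// pollute warning-level logs on large auto-scaled clusters.
	log.Infof("was unable to obtain the node for %s, %s", workflowName, stepName)
}
```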


Message from the maintainers:

Love this enhancement proposal? Give it a 👍. We prioritise the proposals with the most 👍.

@nicolas-vivot nicolas-vivot added the type/feature Feature request label Dec 17, 2023
@terrytangyuan
Member

Related #12132

@agilgur5 agilgur5 added the area/controller Controller issues, panics label Dec 21, 2023
@agilgur5
Contributor

agilgur5 commented Dec 21, 2023

This log message happens when the workflow controller wants to schedule a step's pod, but there is no available node with the necessary resources on the cluster.

Are you saying this because this is what you observed happening, or because that's how you interpreted the log message?

That log message does not refer to k8s nodes; it refers to nodes of the DAG. If you're getting that message a lot, it is usually indicative of a bug somewhere (see also #12132 (comment)).
This does indeed look like a duplicate of #12132. Do you have more details about when specifically you're receiving this message and which line of code it comes from (this is the error message for several error catches)? Or do you have a Workflow and step that reproduce this message consistently?
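
To make that distinction concrete, here is a self-contained sketch (illustrative only, not the actual Argo source; the types and lookup are stand-ins) of a DAG-node lookup miss, which is the kind of condition this message describes, independent of Kubernetes node availability:

```go
package main

import (
	"fmt"

	log "github.com/sirupsen/logrus"
)

// NodeStatus stands in for a workflow DAG node (a step/task recorded in the
// workflow's status), not a Kubernetes node.
type NodeStatus struct {
	ID   string
	Name string
}

// Nodes mimics a map of DAG nodes keyed by node ID.
type Nodes map[string]NodeStatus

// Get returns an error when the requested DAG node has not been recorded yet.
func (n Nodes) Get(id string) (*NodeStatus, error) {
	node, ok := n[id]
	if !ok {
		return nil, fmt.Errorf("node %s not found", id)
	}
	return &node, nil
}

func main() {
	nodes := Nodes{
		"wf-123-step-a": {ID: "wf-123-step-a", Name: "my-workflow.step-a"},
	}

	// A lookup miss on the DAG node map, not a lack of cluster capacity, is
	// the kind of condition behind "was unable to obtain the node for ...".
	if _, err := nodes.Get("wf-123-step-b"); err != nil {
		log.Warnf("was unable to obtain the node for %s, %s: %v", "my-workflow", "step-b", err)
	}
}
```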

@agilgur5 agilgur5 added the problem/more information needed Not enough information has been provided to diagnose this issue. label Dec 21, 2023
@github-actions

This issue has been automatically marked as stale because it has not had recent activity and needs more information. It will be closed if no further activity occurs.

@github-actions github-actions bot added the problem/stale This has not had a response in some time label Jan 10, 2024
@nicolas-vivot
Author

nicolas-vivot commented Jan 10, 2024

Are you saying this because this is what you observed happening, or because that's how you interpreted the log message?

Based on what I observed.

We are scheduling step pods on specific node pools using label selectors/tolerations, with CPU/memory limits. When the node pool is full and needs to provision a new node (which can take about 90 seconds), then we start to see that kind of log.

Reading #12132, it looks like this is actually quite a generic log that can happen in many situations.
It was not present in the previous version due to a silent drop, but it was probably already happening for a long time / several releases, yes.

@nicolas-vivot
Author

nicolas-vivot commented Jan 10, 2024

I agree that this log message actually seems to be quite important and can highlight some underlying issues/bugs.

If this message can pop up for many different reasons/situations, then there is a clear lack of logging somewhere else. Potentially problematic situations should be warned about at the source, with contextual information if possible. Of course, this is on paper and easy to say from here.
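
As a hypothetical example of what warning at the source with contextual information could look like (field names and the message are made up for illustration, not taken from the controller):

```go
package main

import (
	log "github.com/sirupsen/logrus"
)

func main() {
	// Hypothetical structured warning emitted where the failure actually
	// happens, with enough context to diagnose it; all values are made up.
	log.WithFields(log.Fields{
		"workflow": "my-workflow",
		"node":     "my-workflow.step-a",
		"nodeID":   "wf-123-step-a",
		"reason":   "node not yet recorded in workflow status",
	}).Warn("unable to resolve DAG node")
}
```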

I also noticed quite a lot of backend/front-end state inconsistencies, generating two different types of "zombie" workflows.

  • one where the underlying workflow/step pod is not running anymore: probably more of a state issue inside Argo Workflows between the state itself and the UI.
  • one where the underlying workflow/step pod is still running on Kubernetes, but its internal process got killed and so it does not do anything anymore, so the Argo Workflows controller is probably waiting for a completion that will never come.

(I will open a dedicated issue for the zombie workflow cases when I have time)

All of this without any error logs anywhere... just saying, no criticism here.

@github-actions github-actions bot removed problem/stale This has not had a response in some time problem/more information needed Not enough information has been provided to diagnose this issue. labels Jan 11, 2024