Change log level for "was unable to obtain the node for <workflow name>, <step name>"
#12382
Comments
Related: #12132
Are you saying this because it is what you observed happening, or because that is how you interpreted the log message? That log message does not refer to k8s nodes; it refers to nodes of the DAG. If you're getting that message a lot, it is usually indicative of a bug somewhere (see also #12132 (comment)).
This issue has been automatically marked as stale because it has not had recent activity and needs more information. It will be closed if no further activity occurs.
Based on what I observed. We are scheduling steps' pods on specific node pools using label selectors/tolerations and CPU/memory limits. When the node pool is full and a new node needs to be provisioned (which can take about 90 seconds), we start to see that kind of log. Reading #12132, it looks like this is actually quite a generic log and it could happen in many situations.
I agree that this log message seems to be quite important and can highlight underlying issues/bugs. If this message can appear for many different reasons/situations, then there is a clear lack of logging somewhere else: a potentially problematic situation should be warned about at its source, with contextual information if possible. Of course, that is on paper and easy to say here. I also noticed quite a lot of backend/front-end state inconsistencies, generating two different types of "zombie" workflows (I will open a dedicated issue with the zombie workflow cases when I have time). All of this without any error logs anywhere... just saying, no criticism here.
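For context, here is a minimal sketch of the kind of step-pod constraints described in the comment above, written with the Kubernetes client-go API types. The node-pool label, toleration key/value, image, and resource sizes are hypothetical placeholders, not the reporter's actual configuration.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// examplePodSpec builds an illustrative pod spec that targets a dedicated
// node pool via a node selector and a matching toleration, with CPU/memory
// limits. When the pool is full, such a pod stays Pending until the cluster
// auto-scaler provisions a new node.
func examplePodSpec() corev1.PodSpec {
	return corev1.PodSpec{
		NodeSelector: map[string]string{
			"pool": "workflow-steps", // hypothetical node-pool label
		},
		Tolerations: []corev1.Toleration{{
			Key:      "dedicated", // hypothetical taint key on the pool
			Operator: corev1.TolerationOpEqual,
			Value:    "workflow-steps",
			Effect:   corev1.TaintEffectNoSchedule,
		}},
		Containers: []corev1.Container{{
			Name:  "main",
			Image: "example/image:latest",
			Resources: corev1.ResourceRequirements{
				Limits: corev1.ResourceList{
					corev1.ResourceCPU:    resource.MustParse("2"),
					corev1.ResourceMemory: resource.MustParse("4Gi"),
				},
			},
		}},
	}
}

func main() {
	fmt.Printf("%+v\n", examplePodSpec())
}
```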
Summary
This proposes changing the log level of the following log message from `warning` to `info`:
`was unable to obtain the node for <workflow name>, <step name>`
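As a rough sketch of what the proposed change amounts to, assuming a logrus-style logger (which the workflow controller uses), the call would simply be emitted at info level instead of warning level. This is illustrative, not the actual controller code; the field names and call site are hypothetical.

```go
package main

import (
	log "github.com/sirupsen/logrus"
)

// reportMissingNode sketches the log call in question with structured fields.
func reportMissingNode(workflowName, nodeName string) {
	entry := log.WithFields(log.Fields{"workflow": workflowName, "node": nodeName})

	// Current behaviour: the message is emitted at warning level.
	entry.Warnf("was unable to obtain the node for %s, %s", workflowName, nodeName)

	// Proposed behaviour: emit the same message at info level instead.
	entry.Infof("was unable to obtain the node for %s, %s", workflowName, nodeName)
}

func main() {
	reportMissingNode("my-workflow", "my-step")
}
```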
Use Cases
This log message happens when the workflow controller wants to schedule a step's pod, but there is no available node with the necessary resources on the cluster.
In such a case, this message is just an indication that there is no available node or resources (depending on selector/taint usage) right now. With cluster auto-scaling, a new node is provisioned and the step's pod eventually gets scheduled there.
Therefore, this log should be changed from `warning` to `info`.
Of course, there might be cases where, even with the auto-scaler, we have reached maximum capacity. In such a case, the step's pod should simply remain pending with the Kubernetes scheduler, be re-queued, and get scheduled when a node/resources become available.
Even in such a situation, not being able to obtain a node at a certain time is neither an error nor a warning.
This log contributes to log pollution on big clusters that handle heavy load with auto-scaling.
Message from the maintainers:
Love this enhancement proposal? Give it a 👍. We prioritise the proposals with the most 👍.