-
Notifications
You must be signed in to change notification settings - Fork 443
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[UI] Fix Trial Logs when Kubernetes Job Fails #2164
[UI] Fix Trial Logs when Kubernetes Job Fails #2164
Conversation
@andreyvelich: GitHub didn't allow me to assign the following users: d-gol. Note that only kubeflow members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@andreyvelich Thank you for fixing this bug!
} | ||
|
||
// Otherwise, return the first Failed Pod. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// Otherwise, return the first Failed Pod. | |
// Otherwise, return the first Failed or Pending Pod. | |
} |
Is this included in Pending Pods too, right?
So, should we return a failed pod if there are pending pods and failed pods in podList?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this included in Pending Pods too, right?
Yes, that's right. In case of Pending pod, UI shows the similar error with my change:
One of the Pod containers is is ContainerCreating stage
@tenzen-y Do you think we should process only Failed pod after we processed Succeeded and Running (like you mentioned above: So, should we return a failed pod if there are pending pods and failed pods in podList
?
And then, if there are no more Failed pods, just return podList.Items[0].Name
(Pending Pod) and UI should show the above error ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think we should process only Failed pod after we processed Succeeded and Running ?
Yes, I do. I think users may prefer that show logs for a failed pod rather than a pending pod. wdyt?
Also in the current implementation, If there are pending pods and failed pods, maybe, the selected pods will differ each time we run clientset.CoreV1().Pods(trial.ObjectMeta.Namespace).List...
.
Hence, for either a pending or failed pod, we should guarantee which phase this function returns.
And then, if there are no more Failed pods, just return podList.Items[0].Name (Pending Pod) and UI should show the above error ?
Uhm. I'm not sure the above message is appropriate. However, I don't think that pending means ContainerCreating
. For example, when using a custom scheduler to enforce pending pods based on priority or requested resources, it will take a bit much time until is scheduled into a Node and starts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logs for a failed pod rather than a pending pod. wdyt
That makes sense.
The selected pods will differ each time we run clientset.CoreV1().Pods(trial.ObjectMeta.Namespace).
From my testing however, the [0] element is always the last created Pod (despite that list in client-go
doesn't guarantee ordering).
I guess, that is how caching works behind list operation.
However, I don't think that pending means ContainerCreating
That is right, it can be different message I just added this as an example. We can always log this message to the user (when Pod is the Pending state): Failed to get logs for this Trial. Pod is in the Pending state
. WDYT @tenzen-y ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From my testing however, the [0] element is always the last created Pod (despite that list in client-go doesn't guarantee ordering).
I guess, that is how caching works behind list operation.
It's interesting. Thanks for sharing the result of the investigation!
FYI: I guess the informer cache might affect the result. And the informer cache generally is rotated in intervals. IIRC, an interval is 10 hours as a default.
Failed to get logs for this Trial. Pod is in the Pending state.
Looks great :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tenzen-y @johnugeorge I've changed it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: andreyvelich, tenzen-y The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
I fixed UI backend that should return proper value when Kubernetes Job Fails. When Job Fails, it spawns multiple Pods and our UI shows this:
More than one master replica found
.To fix this, I follow this:
/assign @tenzen-y @kimwnasptd @johnugeorge @apo-ger @d-gol