[UI] Fix Trial Logs when Kubernetes Job Fails #2164

andreyvelich · 2023-06-19T20:09:05Z

I fixed the UI backend so that it returns the proper value when a Kubernetes Job fails. When a Job fails, it spawns multiple Pods and our UI shows this error: More than one master replica found.

To fix this, I apply the following rule:

If a Succeeded or Running Pod exists, we print its logs.
Otherwise, we print logs from one of the Failed Pods.
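
The rule above can be sketched in Go as follows. This is a minimal illustration, not the code from this PR: the Pod struct and selectPodForLogs are hypothetical stand-ins for corev1.Pod and the backend helper, and the phase strings mirror the Kubernetes Pod phase values.

```go
package main

import "fmt"

// Pod is a hypothetical, trimmed-down stand-in for corev1.Pod.
type Pod struct {
	Name  string
	Phase string // "Pending", "Running", "Succeeded", or "Failed"
}

// selectPodForLogs returns the name of the Pod whose logs should be shown.
// It prefers the first Succeeded or Running Pod and falls back to the
// first Failed Pod. The boolean is false when neither rule matches.
func selectPodForLogs(pods []Pod) (string, bool) {
	for _, p := range pods {
		if p.Phase == "Succeeded" || p.Phase == "Running" {
			return p.Name, true
		}
	}
	for _, p := range pods {
		if p.Phase == "Failed" {
			return p.Name, true
		}
	}
	return "", false
}

func main() {
	// A failed Job that was retried: two Failed Pods and one Running Pod.
	pods := []Pod{
		{Name: "trial-a", Phase: "Failed"},
		{Name: "trial-b", Phase: "Failed"},
		{Name: "trial-c", Phase: "Running"},
	}
	name, ok := selectPodForLogs(pods)
	fmt.Println(name, ok) // prints "trial-c true"
}
```

With several Pods per Job, this avoids the "More than one master replica found" situation by always narrowing the list down to a single Pod name.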

/assign @tenzen-y @kimwnasptd @johnugeorge @apo-ger @d-gol

google-oss-prow · 2023-06-19T20:09:09Z

@andreyvelich: GitHub didn't allow me to assign the following users: d-gol.

Note that only kubeflow members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide


tenzen-y

@andreyvelich Thank you for fixing this bug!

tenzen-y · 2023-06-19T20:30:11Z

pkg/new-ui/v1beta1/backend.go

 	}
+	// Otherwise, return the first Failed Pod.

Suggested change

-	// Otherwise, return the first Failed Pod.
+	// Otherwise, return the first Failed or Pending Pod.

This also includes Pending Pods, right?
So, should we return a Failed Pod when podList contains both Pending and Failed Pods?

andreyvelich

> This also includes Pending Pods, right?

Yes, that's right. For a Pending Pod, the UI shows a similar error with my change:

One of the Pod containers is in ContainerCreating stage

@tenzen-y Do you think we should process only Failed Pods after we have processed Succeeded and Running Pods? As you mentioned above: "So, should we return a Failed Pod when podList contains both Pending and Failed Pods?"

And then, if there are no Failed Pods, just return podList.Items[0].Name (a Pending Pod) and let the UI show the above error?

tenzen-y

> Do you think we should process only Failed Pods after we have processed Succeeded and Running Pods?

Yes, I do. I think users would prefer to see logs for a Failed Pod rather than a Pending Pod. WDYT?

Also, in the current implementation, if there are both Pending and Failed Pods, the selected Pod may differ each time we run clientset.CoreV1().Pods(trial.ObjectMeta.Namespace).List....

Hence, whether it is a Pending or a Failed Pod, we should guarantee which phase this function returns.

> And then, if there are no Failed Pods, just return podList.Items[0].Name (a Pending Pod) and let the UI show the above error?

Hmm, I'm not sure the above message is appropriate, because I don't think Pending means ContainerCreating. For example, when a custom scheduler keeps Pods pending based on priority or requested resources, it can take quite a while until the Pod is scheduled onto a Node and starts.
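
One way to make the result independent of list order is to sort the Pods before picking one. The sketch below is an assumption-laden illustration, not the merged code: the Pod type, its CreatedUnix field (standing in for the Pod's creation timestamp), phaseRank, and pickDeterministic are all hypothetical names, and the "newest first, then name" tiebreak is just one possible choice.

```go
package main

import (
	"fmt"
	"sort"
)

// Pod is a hypothetical stand-in for corev1.Pod.
type Pod struct {
	Name        string
	Phase       string // "Pending", "Running", "Succeeded", or "Failed"
	CreatedUnix int64  // creation time in Unix seconds (assumed field)
}

// phaseRank orders phases by preference; lower values are preferred.
func phaseRank(phase string) int {
	switch phase {
	case "Succeeded", "Running":
		return 0
	case "Failed":
		return 1
	default: // Pending and anything else
		return 2
	}
}

// pickDeterministic returns the same Pod for the same set of Pods,
// regardless of the order in which the API returned them.
func pickDeterministic(pods []Pod) (Pod, bool) {
	if len(pods) == 0 {
		return Pod{}, false
	}
	sorted := make([]Pod, len(pods))
	copy(sorted, pods) // do not reorder the caller's slice
	sort.SliceStable(sorted, func(i, j int) bool {
		ri, rj := phaseRank(sorted[i].Phase), phaseRank(sorted[j].Phase)
		if ri != rj {
			return ri < rj
		}
		if sorted[i].CreatedUnix != sorted[j].CreatedUnix {
			return sorted[i].CreatedUnix > sorted[j].CreatedUnix // newest first
		}
		return sorted[i].Name < sorted[j].Name // final tiebreak
	})
	return sorted[0], true
}

func main() {
	a := Pod{Name: "trial-a", Phase: "Pending", CreatedUnix: 300}
	b := Pod{Name: "trial-b", Phase: "Failed", CreatedUnix: 100}
	c := Pod{Name: "trial-c", Phase: "Failed", CreatedUnix: 200}

	first, _ := pickDeterministic([]Pod{a, b, c})
	second, _ := pickDeterministic([]Pod{c, a, b})
	fmt.Println(first.Name, second.Name) // prints "trial-c trial-c"
}
```

Sorting a copy keeps the caller's slice untouched, and ranking by phase first means a Failed Pod always wins over a Pending one, which matches the preference expressed in this thread.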

andreyvelich

> logs for a Failed Pod rather than a Pending Pod. WDYT?

That makes sense.

> the selected Pod may differ each time we run clientset.CoreV1().Pods(trial.ObjectMeta.Namespace).

In my testing, however, the [0] element is always the most recently created Pod (even though List in client-go doesn't guarantee ordering).
I guess that is how caching works behind the List operation.

> I don't think Pending means ContainerCreating

That's right, the message can be different; I just added this one as an example. We can always show this message to the user when the Pod is in the Pending state: Failed to get logs for this Trial. Pod is in the Pending state. WDYT @tenzen-y?
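
The proposed handling of Pending Pods could look roughly like the sketch below. Only the message text comes from this thread; errPodPending and checkLoggable are hypothetical names, and the real backend may structure this differently.

```go
package main

import (
	"errors"
	"fmt"
)

// errPodPending carries the user-facing message proposed in this thread.
// The capitalized wording is kept verbatim because it is shown in the UI.
var errPodPending = errors.New("Failed to get logs for this Trial. Pod is in the Pending state.")

// checkLoggable reports whether logs can be fetched for a Pod in the given
// phase. It returns errPodPending for a Pending Pod and nil otherwise.
func checkLoggable(phase string) error {
	if phase == "Pending" {
		return errPodPending
	}
	return nil
}

func main() {
	for _, phase := range []string{"Running", "Failed", "Pending"} {
		if err := checkLoggable(phase); err != nil {
			fmt.Printf("%s: %v\n", phase, err)
			continue
		}
		fmt.Printf("%s: logs can be fetched\n", phase)
	}
}
```

Returning a sentinel error lets a caller distinguish the Pending case with errors.Is and show the message without guessing at a more specific reason such as ContainerCreating.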

tenzen-y

> In my testing, however, the [0] element is always the most recently created Pod (even though List in client-go doesn't guarantee ordering).
> I guess that is how caching works behind the List operation.

That's interesting. Thanks for sharing the result of the investigation!

FYI: I guess the informer cache might affect the result. The informer cache is generally rotated at intervals. IIRC, the default interval is 10 hours.

> Failed to get logs for this Trial. Pod is in the Pending state.

Looks great :)

andreyvelich

@tenzen-y @johnugeorge I've changed it.

tenzen-y

Thanks!
/lgtm
/approve

google-oss-prow · 2023-06-20T02:20:42Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich, tenzen-y

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~pkg/new-ui/v1beta1/OWNERS~~ [andreyvelich,tenzen-y]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

[UI] Fix Trial Logs when Kubernetes Job Fails

bb16872

google-oss-prow bot assigned apo-ger, johnugeorge, kimwnasptd and tenzen-y Jun 19, 2023

google-oss-prow bot requested review from anencore94, johnugeorge and kimwnasptd June 19, 2023 20:09

google-oss-prow bot added size/XS approved labels Jun 19, 2023

tenzen-y reviewed Jun 19, 2023


Return error when Pod is in the Pending state

94cf6af

google-oss-prow bot added size/S and removed size/XS labels Jun 19, 2023

tenzen-y reviewed Jun 20, 2023


google-oss-prow bot added the lgtm label Jun 20, 2023

google-oss-prow bot merged commit ede6e74 into kubeflow:master Jun 20, 2023

andreyvelich deleted the fix-ui-trial-log branch June 20, 2023 13:13
