It is unclear why the trial failed. This is especially confusing because an observation metric is reported, which would suggest that the trial succeeded.
I looked for Kubernetes events associated with the trial and actually found none. I would have expected Kubernetes events to be associated with the trial indicating creation of the training jobs and metric collectors as well as reporting any failures.
It would also be good to include a more informative message in the TrialFailed condition about why the trial failed.
@johnugeorge the jobs were deleted, presumably by the trial when it entered the failed state. The events for the Job show that pods were created, but nothing else. I looked at the pods, and it looks like the container may not have started because of a "failed to set up sandbox container" error.
@gaocegege My expectation is that a trial should emit events corresponding to the actions it takes. So when a trial creates a job it should emit an event indicating that the job was created.
Similarly, when a Trial enters a failed state, it should emit an event indicating that, and the event should carry a reason. The reason should indicate why the Trial decided that it had failed. I wouldn't expect that to just be an aggregation of job events.
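To make the expectation concrete, here is a rough, self-contained sketch of the behavior I have in mind. It is not Katib code and does not use any Kubernetes client library. The `Trial`, `Event`, and `Condition` classes, the `record`, `create_job`, and `mark_failed` methods, and the reason strings `JobCreated` and `PodSandboxFailed` are all hypothetical names chosen for illustration. The only idea borrowed from Kubernetes is that an event carries a type (`Normal` or `Warning`), a short reason, and a message.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Event:
    """One recorded event, loosely shaped like a Kubernetes Event (hypothetical)."""
    event_type: str  # "Normal" or "Warning"
    reason: str      # short, machine-readable reason
    message: str     # human-readable detail


@dataclass
class Condition:
    """A status condition that carries its own reason and message (hypothetical)."""
    type: str
    reason: str
    message: str


@dataclass
class Trial:
    """Toy stand-in for a trial; it only records what a controller would emit."""
    name: str
    events: List[Event] = field(default_factory=list)
    conditions: List[Condition] = field(default_factory=list)

    def record(self, event_type: str, reason: str, message: str) -> None:
        self.events.append(Event(event_type, reason, message))

    def create_job(self, job_name: str) -> None:
        # Emit an event for the action the trial just took.
        self.record("Normal", "JobCreated", f"Created job {job_name}")

    def mark_failed(self, reason: str, message: str) -> None:
        # State why the trial itself decided it failed, both in the
        # condition and in a Warning event, instead of relaying job events.
        self.conditions.append(Condition("Failed", reason, message))
        self.record("Warning", reason, message)


if __name__ == "__main__":
    trial = Trial("example-trial")
    trial.create_job("example-trial-job")
    trial.mark_failed(
        "PodSandboxFailed",  # hypothetical reason string
        "Pod for job example-trial-job did not start: failed to set up sandbox container",
    )
    for event in trial.events:
        print(event.event_type, event.reason, event.message)
```

With something along these lines, the trial's events alone would show both that the job was created and why the trial ended up failed, even after the job itself has been deleted.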
/kind feature
Describe the solution you'd like
I have a trial in the failed state.