Better explanation of why trials failed - Trial is missing K8s events and failure condition could use a better reason #848

Closed
jlewi opened this issue Oct 3, 2019 · 3 comments · Fixed by #852

Comments

@jlewi
Contributor

jlewi commented Oct 3, 2019

/kind feature

Describe the solution you'd like

I have a trial in the failed state.

apiVersion: kubeflow.org/v1alpha2
kind: Trial
metadata:
  ...
spec:
  ...
status:
  completionTime: 2019-10-02T14:24:36Z
  conditions:
  - lastTransitionTime: 2019-10-02T13:24:25Z
    lastUpdateTime: 2019-10-02T13:24:25Z
    message: Trial is created
    reason: TrialCreated
    status: "True"
    type: Created
  - lastTransitionTime: 2019-10-02T14:24:36Z
    lastUpdateTime: 2019-10-02T14:24:36Z
    message: Trial is running
    reason: TrialRunning
    status: "False"
    type: Running
  - lastTransitionTime: 2019-10-02T14:24:36Z
    lastUpdateTime: 2019-10-02T14:24:36Z
    message: Trial has failed
    reason: TrialFailed
    status: "True"
    type: Failed
  observation:
    metrics:
    - name: accuracy
      value: -0.08421557
  startTime: 2019-10-02T13:24:25Z

It is unclear why the trial failed. This is especially confusing because an observation metric is reported, which would suggest that the trial succeeded.
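To make the inconsistency concrete, here is a minimal sketch that reads the status shown above and flags a trial that is marked Failed while still reporting metrics. The `status` dict is a hand-trimmed copy of the YAML, and the helper names `active_conditions` and `failed_with_metrics` are hypothetical, not part of Katib.

```python
# Hand-trimmed copy of the Trial status above; only the fields inspected here.
status = {
    "conditions": [
        {"type": "Created", "status": "True", "reason": "TrialCreated"},
        {"type": "Running", "status": "False", "reason": "TrialRunning"},
        {"type": "Failed", "status": "True", "reason": "TrialFailed"},
    ],
    "observation": {"metrics": [{"name": "accuracy", "value": -0.08421557}]},
}


def active_conditions(status):
    """Return the types of all conditions whose status is "True"."""
    return [
        c["type"]
        for c in status.get("conditions", [])
        if c.get("status") == "True"
    ]


def failed_with_metrics(status):
    """True when a trial is marked Failed yet still reports observation metrics."""
    metrics = status.get("observation", {}).get("metrics", [])
    return "Failed" in active_conditions(status) and len(metrics) > 0


print(active_conditions(status))
print(failed_with_metrics(status))
```

Nothing in the status itself explains the combination, which is the gap this issue is about.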

I looked for Kubernetes events associated with the trial and actually found none. I would have expected Kubernetes events to be associated with the trial indicating creation of the training jobs and metric collectors as well as reporting any failures.

It would also be good to include a more informative message in the TrialFailed condition about why the trial failed.
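As a rough illustration of what a more informative condition could look like, the sketch below builds a Failed condition with the same fields as the YAML above but appends the cause to the message. The helper `failed_condition` and the cause string are assumptions for illustration only; this is not how the Katib controller is implemented.

```python
from datetime import datetime, timezone


def failed_condition(cause, now=None):
    """Build a Failed condition whose message carries the cause (hypothetical helper)."""
    now = now or datetime.now(timezone.utc)
    # Same timestamp layout as the conditions in the Trial status above.
    stamp = now.strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "type": "Failed",
        "status": "True",
        "reason": "TrialFailed",
        "message": f"Trial has failed: {cause}",
        "lastTransitionTime": stamp,
        "lastUpdateTime": stamp,
    }


# Example cause string; the wording is made up for this sketch.
cond = failed_condition("training job pod never started")
print(cond["message"])
```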

@gaocegege
Member

Can we aggregate the Job/TFJob/PyTorchJob events into the trial's events? Would that work for you?

@johnugeorge
Member

@jlewi Did the job actually error out when you looked at its status?

@jlewi
Contributor Author

jlewi commented Oct 3, 2019

@johnugeorge the jobs were deleted, presumably by the trial when it entered the failed state. The events for the Job show pods created, but that's it. I looked at the pods, and it looks like the container may not have started because of "failed to set up sandbox container".

@gaocegege My expectation is that a trial should emit events corresponding to the actions it takes. So when a trial creates a job it should emit an event indicating that the job was created.

Similarly, when a Trial enters a failed state, it should emit an event indicating that, and the event should carry a reason. The reason should indicate why the Trial decided that it had failed. I wouldn't expect that to be just an aggregation of job events.
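A rough sketch of the behavior described above, using an in-memory stand-in rather than a real Kubernetes event recorder. `Event`, `TrialRecorder`, the `JobCreated` reason, and the trial and job names are all hypothetical; only the `Normal`/`Warning` event types are taken from Kubernetes.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    type: str     # "Normal" or "Warning", the two Kubernetes event types
    reason: str   # short machine-readable reason (names here are hypothetical)
    message: str  # human-readable detail


@dataclass
class TrialRecorder:
    """In-memory stand-in for an event recorder attached to one trial."""
    trial: str
    events: list = field(default_factory=list)

    def job_created(self, job):
        # One event per action the trial takes.
        self.events.append(Event("Normal", "JobCreated", f"Created job {job}"))

    def trial_failed(self, why):
        # The failure event states why the trial decided it had failed.
        self.events.append(
            Event("Warning", "TrialFailed", f"Trial {self.trial} failed: {why}")
        )


rec = TrialRecorder("example-trial")
rec.job_created("example-job")
rec.trial_failed("pod never started (failed to set up sandbox container)")
for e in rec.events:
    print(e.type, e.reason, e.message)
```

The point of the sketch is that the failure event is produced by the trial's own decision, with its own reason, instead of being copied from the job's events.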
