This repository has been archived by the owner on May 13, 2024. It is now read-only.

[Improvement] Increase error state visibility to end users #996

Open
perdasilva opened this issue Jan 7, 2020 · 3 comments
Labels
enhancement (New feature or request), good first issue (Good for newcomers)

Comments

@perdasilva
Contributor

perdasilva commented Jan 7, 2020

Currently, it's not necessarily clear to the user whether an error has occurred along the way. We'll use this ticket to track how we can increase this visibility.

Note: Status messages should be emitted to this status topic.

There are a few different sources for errors, including, but not limited to:

  • Toml validation errors (bff)
  • Problems fetching data (fetcher)
  • Problems transpiling or creating kubernetes resources (executor)
  • Problems with the execution of the benchmark in k8s (should be caught by the watcher)
  • Problems executing a benchmark on SageMaker (sm-executor)

On the Python side, most (if not all) services extend the KafkaService class, which contains utility methods that can be used to emit status messages.

For the bff, there is a function that can be used to emit status events.

My suggestion is to go through each of the services and ensure that whichever method is handling an event appropriately catches any errors and emits a helpful error status message.
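To make the suggestion concrete, here is a minimal sketch of the pattern: the event handler wraps its work in a try/except and emits a status message either way. `KafkaService` and `send_status_message_event` are stand-ins here, not the exact names in this repo; the real class publishes to the status topic rather than a list.

```python
class KafkaService:
    """Stand-in for the project's KafkaService base class."""

    def __init__(self):
        # The real service would publish to the Kafka status topic;
        # here we just collect messages for illustration.
        self.emitted = []

    def send_status_message_event(self, event, message):
        self.emitted.append((event, message))


class FetcherService(KafkaService):
    """Example service whose handler always surfaces errors to the user."""

    def handle_event(self, event):
        try:
            self._process(event)
            self.send_status_message_event(event, "Fetch succeeded")
        except Exception as err:
            # Catch the failure and emit a helpful error status message
            # instead of failing silently.
            self.send_status_message_event(event, f"Fetch failed: {err}")

    def _process(self, event):
        if event.get("bad"):
            raise RuntimeError("could not fetch dataset")


svc = FetcherService()
svc.handle_event({"bad": True})
print(svc.emitted[0][1])  # prints "Fetch failed: could not fetch dataset"
```

The key point is that every handler, not just the happy path, reports its outcome, so users see failures on the status topic instead of having to dig through pod logs.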

@gavinmbell gavinmbell added good first issue Good for newcomers enhancement New feature or request labels Jan 7, 2020
@haohanchen-aws

haohanchen-aws commented Jan 8, 2020

A credential error happens with the data-puller, like this:
[screenshot: data-puller credential error]
We cannot see this error unless we use `kubectl logs`.
Usually this is fixed by deleting and restarting the pod.
mrcnn_singlenode.toml.txt

@perdasilva
Contributor Author

Yeah, it seems the problem is kube2iam: jtblin/kube2iam#136 - I'll post a PR with a workaround.

@perdasilva
Contributor Author

Ok, so I've had a deeper look (even though I haven't been able to reproduce the issue yet). The puller is an init container. According to the k8s documentation: "If a Pod's init container fails, Kubernetes repeatedly restarts the Pod until the init container succeeds. However, if the Pod has a restartPolicy of Never, Kubernetes does not restart the Pod." If the benchmark pod is stuck initializing rather than restarting, this can only mean that the restartPolicy is Never.

@haohanchen-yagao it would be good if you could describe the pod when this issue happens, so we can try to verify the restart policy. If it is indeed Never, we should also update the horovod job template in the executor to set the restart policy to OnFailure.
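For reference, `kubectl describe pod <pod-name>` shows the pod's `Restart Policy` field. If it turns out to be Never, the relevant part of the pod spec in the job template would change to something like the sketch below (field names follow the Kubernetes Pod spec; container names and images here are placeholders, not the actual template contents):

```yaml
apiVersion: v1
kind: Pod
spec:
  # OnFailure lets Kubernetes retry the pod when an init container
  # (such as the data puller) fails, instead of leaving it stuck.
  restartPolicy: OnFailure
  initContainers:
    - name: data-puller        # placeholder name
      image: example/puller    # placeholder image
  containers:
    - name: benchmark          # placeholder name
      image: example/benchmark # placeholder image
```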
