This repository has been archived by the owner on May 13, 2024. It is now read-only.

[Improvement] Increase error state visibility to end users #996

Open
perdasilva opened this issue Jan 7, 2020 · 3 comments
Labels
enhancement (New feature or request), good first issue (Good for newcomers)

Comments

@perdasilva
Contributor

perdasilva commented Jan 7, 2020

Currently, it's not necessarily clear to the user whether an error has occurred along the way. We'll use this ticket to track how we can increase this visibility.

Note: Status messages should be emitted to this status topic.

There are a few different sources for errors, including, but not limited to:

  • Toml validation errors (bff)
  • Problems fetching data (fetcher)
  • Problems transpiling or creating kubernetes resources (executor)
  • Problems with the execution of the benchmark in k8s (should be caught by the watcher)
  • Problems executing a benchmark on SageMaker (sm-executor)

On the Python side, most (if not all) services extend the KafkaService class, which contains utility methods that can be used to emit status messages.

For the bff, there is a function that can be used to emit status events.

My suggestion is to go through each of the services and ensure that whichever method is handling an event appropriately catches any errors and emits a helpful error status message.
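To make the suggestion concrete, here is a minimal sketch of the pattern: the event handler wraps its work in a try/except and emits a status message either way. `KafkaService` and `send_status_message_event` are stand-ins here, not the exact names in this repo; the real class publishes to the status topic rather than a list.

```python
class KafkaService:
    """Stand-in for the project's KafkaService base class."""

    def __init__(self):
        # The real service would publish to the Kafka status topic;
        # here we just collect messages for illustration.
        self.emitted = []

    def send_status_message_event(self, event, message):
        self.emitted.append((event, message))


class FetcherService(KafkaService):
    """Example service whose handler always surfaces errors to the user."""

    def handle_event(self, event):
        try:
            self._process(event)
            self.send_status_message_event(event, "Fetch succeeded")
        except Exception as err:
            # Catch the failure and emit a helpful error status message
            # instead of failing silently.
            self.send_status_message_event(event, f"Fetch failed: {err}")

    def _process(self, event):
        if event.get("bad"):
            raise RuntimeError("could not fetch dataset")


svc = FetcherService()
svc.handle_event({"bad": True})
print(svc.emitted[0][1])  # prints "Fetch failed: could not fetch dataset"
```

The key point is that every handler, not just the happy path, reports its outcome, so users see failures on the status topic instead of having to dig through pod logs.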

@gavinmbell gavinmbell added good first issue Good for newcomers enhancement New feature or request labels Jan 7, 2020
@haohanchen-aws

haohanchen-aws commented Jan 8, 2020

A credential error happens with the data-puller, like this:
[screenshot: data-puller credential error]
We cannot see this error unless we use `kubectl logs`.
Usually this is fixed by deleting and restarting the pod.
mrcnn_singlenode.toml.txt

@perdasilva
Contributor Author

Yeah, it seems the problem is kube2iam: jtblin/kube2iam#136 - I'll post a PR with a workaround.

@perdasilva
Contributor Author

Ok, so I've had a deeper look (even though I haven't been able to reproduce the issue yet). The puller is an init container. According to the k8s documentation: "If a Pod's init container fails, Kubernetes repeatedly restarts the Pod until the init container succeeds. However, if the Pod has a restartPolicy of Never, Kubernetes does not restart the Pod." If the benchmark pod is stuck initializing rather than restarting, this can only mean that the restartPolicy is Never.

@haohanchen-yagao it would be good if you could describe the pod when this issue happens, so we can try to verify the restart policy. If it is indeed Never, we should also update the horovod job template in the executor to set the restart policy to OnFailure.
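For reference, `kubectl describe pod <pod-name>` shows the pod's `Restart Policy` field. If it turns out to be Never, the relevant part of the pod spec in the job template would change to something like the sketch below (field names follow the Kubernetes Pod spec; container names and images here are placeholders, not the actual template contents):

```yaml
apiVersion: v1
kind: Pod
spec:
  # OnFailure lets Kubernetes retry the pod when an init container
  # (such as the data puller) fails, instead of leaving it stuck.
  restartPolicy: OnFailure
  initContainers:
    - name: data-puller        # placeholder name
      image: example/puller    # placeholder image
  containers:
    - name: benchmark          # placeholder name
      image: example/benchmark # placeholder image
```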
