Currently, it's not necessarily clear to the user if an error has occurred along the way. We'll use this ticket to track how we can increase this visibility.
Note: Status messages should be emitted to this status topic.
There are a few different sources for errors, including, but not limited to:
TOML validation errors (bff)
Problems fetching data (fetcher)
Problems transpiling or creating Kubernetes resources (executor)
Problems with the execution of the benchmark in k8s (should be caught by the watcher)
Problems executing a benchmark on SageMaker (sm-executor)
On the Python side, most (if not all) services extend the KafkaService class, which contains utility methods that can be used to emit status messages.
For the bff, there is a function that can be used to emit status events.
My suggestion is to go through each of the services and ensure that whichever method is handling an event appropriately catches any errors and emits a helpful error status message.
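As a rough sketch of what that could look like (the base class, method, and exception names below are placeholders, not the real KafkaService API):

```python
# Minimal sketch only: the names below are stand-ins for whatever the
# real KafkaService and its status-emitting utilities actually expose.

class KafkaService:  # placeholder for the real KafkaService base class
    def emit_status(self, event, status, message):
        # The real implementation would publish to the status topic.
        print(f"[{status}] {message} (event={event})")


class DatasetFetchError(Exception):
    """Raised when a dataset cannot be downloaded."""


class FetcherService(KafkaService):  # hypothetical fetcher service
    def handle_event(self, event):
        try:
            self._fetch_datasets(event)
        except DatasetFetchError as err:
            # Surface the failure on the status topic so the user sees it
            # without having to dig through pod logs.
            self.emit_status(event, status="ERROR",
                             message=f"Failed to fetch dataset: {err}")
            return
        self.emit_status(event, status="SUCCEEDED",
                         message="All datasets fetched")

    def _fetch_datasets(self, event):
        # Placeholder for the real download logic.
        raise DatasetFetchError("403 Forbidden: bad credentials")


if __name__ == "__main__":
    FetcherService().handle_event({"benchmark_id": "example"})
```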
A credential error happens with the data-puller, like this:
We cannot see this error unless we use kubectl logs.
Usually this is fixed by deleting and restarting the pod. mrcnn_singlenode.toml.txt
ok - so I've had a deeper look (even though I haven't been able to reproduce the issue yet). The puller is an init container. According to the k8s documentation: "If a Pod’s init container fails, Kubernetes repeatedly restarts the Pod until the init container succeeds. However, if the Pod has a restartPolicy of Never, Kubernetes does not restart the Pod". If the benchmark pod is stuck initializing, this can only mean that the restartPolicy is Never.
@haohanchen-yagao it would be good if you could describe the pod when this issue happens, so we can try to verify the restart policy. If it is indeed Never, we should also update the horovod job template in the executor to set the restart policy to OnFailure.
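For reference, the relevant field is restartPolicy on the pod spec. Below is a minimal sketch using the official Kubernetes Python client; the container names and images are made up, and the executor may well render this from a YAML template instead, but the idea is the same: set the restart policy to OnFailure so a failing init container (like the data puller) is retried rather than leaving the pod stuck.

```python
# Sketch only: the real executor may generate YAML from a template instead.
# Container names and images are invented for illustration.
from kubernetes import client

pod_spec = client.V1PodSpec(
    # OnFailure lets Kubernetes retry the pod when an init container
    # (e.g. the data puller) fails, instead of leaving it stuck.
    restart_policy="OnFailure",
    init_containers=[
        client.V1Container(
            name="data-puller",
            image="benchmark-ai/puller:latest",  # placeholder image
        )
    ],
    containers=[
        client.V1Container(
            name="benchmark",
            image="benchmark-ai/horovod-benchmark:latest",  # placeholder image
        )
    ],
)

template = client.V1PodTemplateSpec(
    metadata=client.V1ObjectMeta(labels={"app": "benchmark"}),
    spec=pod_spec,
)
```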