Need alarms for unhealthy GPU cases #2192
Comments
Please help add more cases.
@mzmssg how to get their cgroup in
How to get the ECC error code?
job-exporter call
@xudifsd For ECC errors, an nvidia-smi response looks as follows; the ECC error code is the top-right number for every GPU, and it could be a number.
@mzaman1 job-exporter reads the output of nvidia-smi
Maybe I should use
@xudifsd By the way, I think a zombie GPU container should trigger the unhealthy-GPU alert instead of the zombie-container alert, because a GPU issue has higher priority and might impact scheduling, while a zombie container won't.
@xudifsd, how long did you test the calling latency to get the 0.5~2s 95th-percentile latency? Can you log the timeouts? I suggest setting the timeout to tens of seconds; based on previous experience, it can sometimes take as long as several minutes.
@fanyangCS For about two weeks; you can see the latency from the prod bed. The timeout in the prod bed is set to 3 seconds now, so the latency line should stay below 3 seconds. From the graph, it seems no nvidia-smi call has been interrupted by the timeout. I'm fine with setting the timeout to 10s, and maybe setting an alert at 5s? We may need more cases of nvidia hangs, so it's better to set a lower alert threshold.
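For reference, here is a minimal sketch (Python, not the actual job-exporter code) of how the nvidia-smi call could be wrapped with a hard timeout and a log line when it expires, so that slow or hanging calls can be counted and alerted on; the 10s default and the function name are illustrative:

```python
import logging
import subprocess

logger = logging.getLogger(__name__)

# Illustrative default; the discussion above suggests tens of seconds
# rather than the 3s currently configured in the prod bed.
NVIDIA_SMI_TIMEOUT = 10  # seconds

def call_nvidia_smi(args, timeout=NVIDIA_SMI_TIMEOUT):
    """Run nvidia-smi with a hard timeout and log when it expires."""
    cmd = ["nvidia-smi"] + list(args)
    try:
        return subprocess.check_output(cmd, timeout=timeout).decode("utf-8")
    except subprocess.TimeoutExpired:
        # Logging the timeout makes it possible to count and alert on hangs.
        logger.warning("nvidia-smi did not return within %ss: %s", timeout, cmd)
        return None
```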
A memory leak should be larger than 20 MB to be troublesome.
Related issue: #2146
Test case checklist:
Cases 3~5 are hard to simulate.
Maybe we should close this issue since the PR has been merged.
Closed |
Organization Name: Microsoft
Short summary about the issue/question:
There are some GPU exceptions PAI can't perfectly handle. For these cases, as a first phase, we will alert and introduce manual operations.
Currently, we have summarized some unhealthy cases, including:
Detection: Filter the process list for the command 'nvidia-smi' with status 'D'; to reduce false-positive alarms, we could add a timeout.
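A minimal sketch of this detection, assuming psutil is available; re-checking after a delay is one illustrative way to add the timeout and is not necessarily the exact job-exporter logic:

```python
import time
import psutil  # assumed available; the real exporter may read /proc directly

def find_hung_nvidia_smi(recheck_after=30):
    """Return PIDs of nvidia-smi processes stuck in 'D' (uninterruptible sleep).

    A PID is reported only if it is still in 'D' after `recheck_after` seconds,
    which reduces false positives from short, legitimate disk waits.
    """
    def in_d_state():
        return {p.pid for p in psutil.process_iter(["name", "status"])
                if p.info["name"] == "nvidia-smi"
                and p.info["status"] == psutil.STATUS_DISK_SLEEP}

    first = in_d_state()
    if not first:
        return set()
    time.sleep(recheck_after)
    return first & in_d_state()
```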
Detection: Call nvidia-smi to get the GPUs whose memory is in use and the GPUs that have running processes; the difference between the two sets is the GPUs with leaked memory.
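A minimal sketch of this detection, assuming the XML layout of nvidia-smi -q -x (an fb_memory_usage/used value per GPU plus one process_info element per running process); the ~20 MB threshold comes from the comment above:

```python
import subprocess
import xml.etree.ElementTree as ET

LEAK_THRESHOLD_MIB = 20  # leaks smaller than ~20M are not considered troublesome

def find_leaked_gpus():
    """Return minor numbers of GPUs that hold memory but run no process."""
    xml = subprocess.check_output(["nvidia-smi", "-q", "-x"], timeout=10)
    leaked = []
    for gpu in ET.fromstring(xml).iter("gpu"):
        used_mib = float(gpu.findtext("fb_memory_usage/used", "0 MiB").split()[0])
        has_process = gpu.find("processes/process_info") is not None
        if used_mib > LEAK_THRESHOLD_MIB and not has_process:
            leaked.append(gpu.findtext("minor_number"))
    return leaked
```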
Detection: Call nvidia-smi to get the running processes, then read each process's cgroup from /proc/$pid/cgroup and check whether it is under a job container.
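A minimal sketch of this detection; it assumes the docker cgroup path contains the 64-character container id (the exact layout depends on the cgroup driver) and resolves the container name via docker inspect, which is not necessarily how job-exporter implements it:

```python
import re
import subprocess

DOCKER_ID_RE = re.compile(r"docker[/-]([0-9a-f]{64})")

def job_container_of(pid):
    """Return the docker container name owning `pid`, or None if not containerized.

    A PAI job container can then be recognized by its YARN-style name
    (e.g. the "container_e09_..." name used in the test case below).
    """
    try:
        with open("/proc/{}/cgroup".format(pid)) as f:
            match = DOCKER_ID_RE.search(f.read())
    except IOError:
        return None  # the process already exited
    if match is None:
        return None  # not running inside a docker container
    name = subprocess.check_output(
        ["docker", "inspect", "--format", "{{.Name}}", match.group(1)],
        timeout=10)
    return name.decode("utf-8").strip().lstrip("/")
```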
Detection: Based on case 2, further check whether the corresponding YARN container is still alive.
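A minimal sketch of this detection, assuming the NodeManager REST API is reachable on its default port; the real implementation may track YARN containers differently:

```python
import requests  # assumed available

def yarn_container_alive(container_id, nm_host="localhost", nm_port=8042):
    """Check whether the NodeManager still tracks the given container.

    Uses the NodeManager REST API (/ws/v1/node/containers/<id>); a non-200
    answer means YARN no longer knows the container, so a docker container
    that still holds a GPU under this id is a zombie. Host and port are
    illustrative defaults.
    """
    url = "http://{}:{}/ws/v1/node/containers/{}".format(nm_host, nm_port, container_id)
    try:
        return requests.get(url, timeout=10).status_code == 200
    except requests.RequestException:
        # An unreachable NodeManager is reported as "not alive" here; a real
        # implementation should treat this case separately.
        return False
```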
Detection: Call nvidia-smi to get the ECC error code.
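A minimal sketch of this detection using the --query-gpu CSV interface of nvidia-smi rather than parsing the default table; picking the volatile uncorrected counter is an assumption:

```python
import subprocess

def ecc_error_counts():
    """Return {gpu_index: volatile uncorrected ECC error count}.

    GPUs without ECC report "[N/A]" (or "[Not Supported]"), mapped to 0 here.
    """
    out = subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=index,ecc.errors.uncorrected.volatile.total",
         "--format=csv,noheader,nounits"],
        timeout=10).decode("utf-8")
    counts = {}
    for line in out.strip().splitlines():
        index, errors = (field.strip() for field in line.split(","))
        counts[int(index)] = int(errors) if errors.isdigit() else 0
    return counts
```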
OpenPAI Environment:
Anything else we need to know:
Alarms and manual operations are the first step.
As the next step, PAI itself should tolerate such cases.
Test Cases
Zombie container test:
On any worker node, run the following command to produce a "zombie" container:
sudo docker run --name admin-tensorflow-serving-container_e09_1551848189332_0010_01_000004 --rm --init -d --privileged=false --oom-score-adj=1000 --cap-add=SYS_ADMIN --network=host --cpus=4 --memory=8192m --shm-size=64m --device=/dev/nvidiactl --device=/dev/nvidia-uvm --device=/dev/nvidia0 --device=/dev/fuse --security-opt apparmor:unconfined --volume /var/drivers/nvidia/current:/usr/local/nvidia:ro --entrypoint= openpai/pai.example.tensorflow-serving /bin/bash -c "bazel-bin/tensorflow_serving/example/mnist_saved_model /tmp/mnist_model && while :; do tensorflow_model_server --port=5001 --model_name=mnist --model_base_path=/tmp/mnist_model; done"
Then check the Prometheus alerts at http://master_node:9091/prometheus/alerts after a few minutes. There should be two alerts: GpuUsedByZombieContainer and PaiJobsZombie.
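A small sketch for checking this programmatically, assuming the standard Prometheus /api/v1/alerts endpoint is exposed behind the /prometheus prefix shown above:

```python
import requests  # assumed available in the dev-box

resp = requests.get("http://master_node:9091/prometheus/api/v1/alerts", timeout=10)
firing = {a["labels"]["alertname"] for a in resp.json()["data"]["alerts"]}
expected = {"GpuUsedByZombieContainer", "PaiJobsZombie"}
missing = expected - firing
print("missing alerts:", missing or "none")
```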
Then, in the dev-box, execute the following commands: