This repository has been archived by the owner on Jun 6, 2024. It is now read-only.

Need alarms for unhealthy GPU cases #2192

Closed
1 task done
mzmssg opened this issue Feb 22, 2019 · 14 comments · Fixed by #2209

Comments

@mzmssg
Member

mzmssg commented Feb 22, 2019

Organization Name: Microsoft

Short summary about the issue/question:

There are some GPU exceptions that PAI can't handle perfectly. For these cases, as a first phase, we will alert and introduce manual operations:

  1. job-exporter collects GPU status
  2. alert the admin
  3. the admin manually decommissions the nodes

Currently, we have summarized the following unhealthy cases:

  • Nvidia-smi hangs
    1. Hang in kernel mode
      Detection: filter the process list for processes whose command is 'nvidia-smi' and whose state is 'D' (uninterruptible sleep). To reduce false-positive alarms, we could also add a timeout (see the sketch after this list).
  • Nvidia-smi works
    1. GPU memory leak
      Detection: call nvidia-smi to get the GPUs with memory in use and the GPUs with running processes;
      the GPUs in the first set but not the second are the ones with leaked memory.
    2. GPU used by external processes
      Detection: call nvidia-smi to get the running processes, then read each process's cgroup from /proc/$pid/cgroup and check whether it belongs to a job container.
    3. GPU used by zombie container
      Detection: based on 2, further check whether the corresponding YARN container is still alive.
    4. GPU ECC error
      Detection: call nvidia-smi to get the ECC error count.
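
For the kernel-mode hang case, a minimal sketch of the D-state check, scanning /proc directly (the find_hung_nvidia_smi helper is hypothetical, not the actual job-exporter code):

    import os

    def find_hung_nvidia_smi():
        """Return pids of nvidia-smi processes stuck in 'D' (uninterruptible sleep)."""
        hung_pids = []
        for pid in os.listdir("/proc"):
            if not pid.isdigit():
                continue
            try:
                with open("/proc/%s/comm" % pid) as f:
                    comm = f.read().strip()
                with open("/proc/%s/stat" % pid) as f:
                    # /proc/<pid>/stat looks like "<pid> (<comm>) <state> ..."; the state follows ")"
                    state = f.read().rsplit(")", 1)[1].split()[0]
            except (IOError, OSError):
                continue  # the process exited while we were scanning
            if comm == "nvidia-smi" and state == "D":
                hung_pids.append(int(pid))
        return hung_pids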

OpenPAI Environment:

  • OpenPAI version: master

Anything else we need to know:

Alarms and manual operations are the first step.
As the next step, PAI itself should tolerate such cases.

Test Cases

  • Zombie container test:
    On any worker node, run the following command to produce a "zombie" container:

    sudo docker run --name admin-tensorflow-serving-container_e09_1551848189332_0010_01_000004 --rm --init -d --privileged=false --oom-score-adj=1000 --cap-add=SYS_ADMIN --network=host --cpus=4 --memory=8192m --shm-size=64m --device=/dev/nvidiactl --device=/dev/nvidia-uvm --device=/dev/nvidia0 --device=/dev/fuse --security-opt apparmor:unconfined --volume /var/drivers/nvidia/current:/usr/local/nvidia:ro --entrypoint= openpai/pai.example.tensorflow-serving /bin/bash -c "bazel-bin/tensorflow_serving/example/mnist_saved_model /tmp/mnist_model && while :; do tensorflow_model_server --port=5001 --model_name=mnist --model_base_path=/tmp/mnist_model; done"

    Then check the Prometheus alerts at http://master_node:9091/prometheus/alerts after a few minutes. There should be two alerts: GpuUsedByZombieContainer and PaiJobsZombie.
    Then, in the dev-box, execute the following commands:

    Get alerting GPUs:
    cd src/tools
    python node_maintain.py badgpus get -m {prometheus_ip} 
    
    Add nodes to blacklist:
    python node_maintain.py blacklist add -n {unhealthy_nodes} -m {api-server-ip}
    
    Decommission:
    python node_maintain.py blacklist enforce -m {master_ip} [--api-server-ip api-server-ip] [--resource-manager-ip resource-manager-ip]
    
@mzmssg
Member Author

mzmssg commented Feb 22, 2019

Please help add more cases.

@xudifsd
Member

xudifsd commented Feb 25, 2019

@mzmssg how do we get their cgroup from /proc/$pid/cgroup? Its content looks like:

12:rdma:/
11:hugetlb:/docker/74ce4b73b4a46b937129ca40ab4bb131105a1e14a12b8839cf05a04cc4e90b3c
10:pids:/docker/74ce4b73b4a46b937129ca40ab4bb131105a1e14a12b8839cf05a04cc4e90b3c
9:memory:/docker/74ce4b73b4a46b937129ca40ab4bb131105a1e14a12b8839cf05a04cc4e90b3c
8:cpuset:/docker/74ce4b73b4a46b937129ca40ab4bb131105a1e14a12b8839cf05a04cc4e90b3c
7:devices:/docker/74ce4b73b4a46b937129ca40ab4bb131105a1e14a12b8839cf05a04cc4e90b3c
6:perf_event:/docker/74ce4b73b4a46b937129ca40ab4bb131105a1e14a12b8839cf05a04cc4e90b3c
5:blkio:/docker/74ce4b73b4a46b937129ca40ab4bb131105a1e14a12b8839cf05a04cc4e90b3c
4:net_cls,net_prio:/docker/74ce4b73b4a46b937129ca40ab4bb131105a1e14a12b8839cf05a04cc4e90b3c
3:freezer:/docker/74ce4b73b4a46b937129ca40ab4bb131105a1e14a12b8839cf05a04cc4e90b3c
2:cpu,cpuacct:/docker/74ce4b73b4a46b937129ca40ab4bb131105a1e14a12b8839cf05a04cc4e90b3c
1:name=systemd:/docker/74ce4b73b4a46b937129ca40ab4bb131105a1e14a12b8839cf05a04cc4e90b3c

And how do we get the ECC error code?

@xudifsd
Member

xudifsd commented Feb 25, 2019

job-exporter calls nvidia-smi with a 3-second timeout. I have checked that the 95th-percentile latency for calling nvidia-smi is 0.5~1.2 seconds, so we could enlarge the timeout to 5 seconds and trigger an alert at 3 seconds.
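
For reference, a minimal Python 3 sketch of "call nvidia-smi with a hard timeout and measure the latency" (illustrative only, not the actual job-exporter code; the constant and helper name are assumptions):

    import subprocess
    import time

    NVIDIA_SMI_TIMEOUT = 5  # proposed enlarged hard timeout, in seconds

    def call_nvidia_smi():
        """Run nvidia-smi -q -x, returning (output_or_None, latency_in_seconds)."""
        start = time.time()
        try:
            output = subprocess.check_output(["nvidia-smi", "-q", "-x"],
                                             timeout=NVIDIA_SMI_TIMEOUT)
        except subprocess.TimeoutExpired:
            output = None  # treat as a hang; the exporter would bump a hang/latency metric
        return output, time.time() - start

The measured latency would then feed whatever metric the alert threshold is evaluated against.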

@mzmssg
Member Author

mzmssg commented Feb 25, 2019

@xudifsd
Yes, in this case the process belongs to a Docker container with ID 74ce4b73b4a46b937129ca40ab4bb131105a1e14a12b8839cf05a04cc4e90b3c.
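
A rough sketch of that lookup (hypothetical helper, not the job-exporter implementation): read /proc/<pid>/cgroup, pull out the Docker container ID, then compare it against the job containers PAI knows about.

    import re

    def container_id_of(pid):
        """Return the Docker container ID a process runs in, or None if it is not containerized."""
        try:
            with open("/proc/%d/cgroup" % pid) as f:
                for line in f:
                    # each line looks like "9:memory:/docker/<64-hex-char container id>"
                    m = re.search(r"/docker/([0-9a-f]{64})", line)
                    if m:
                        return m.group(1)
        except (IOError, OSError):
            pass  # the process may have exited in the meantime
        return None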

For ECC errors, a sample nvidia-smi response is shown below. The ECC error count is the number in the top-right corner of each GPU's row; it can be a number, N/A, or Off, and a number greater than 0 means the GPU is abnormal.

   +-----------------------------------------------------------------------------+
   | NVIDIA-SMI 384.111                Driver Version: 384.111                   |
   |-------------------------------+----------------------+----------------------+
   | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
   | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
   |===============================+======================+======================|
   |   0  Tesla K80           Off  | 00006B24:00:00.0 Off |                    0 |
   | N/A   26C    P8    34W / 149W |   3322MiB / 11439MiB |      0%      Default |
   +-------------------------------+----------------------+----------------------+

@xudifsd
Member

xudifsd commented Feb 25, 2019

@mzmssg job-exporter reads the output of nvidia-smi -q -x, which has the following ECC-related fields:

                <ecc_mode>
                        <current_ecc>Enabled</current_ecc>
                        <pending_ecc>Enabled</pending_ecc>
                </ecc_mode>
                <ecc_errors>
                        <volatile>
                                <single_bit>
                                        <device_memory>0</device_memory>
                                        <register_file>0</register_file>
                                        <l1_cache>0</l1_cache>
                                        <l2_cache>0</l2_cache>
                                        <texture_memory>0</texture_memory>
                                        <texture_shm>N/A</texture_shm>
                                        <cbu>N/A</cbu>
                                        <total>0</total>
                                </single_bit>
                                <double_bit>
                                        <device_memory>0</device_memory>
                                        <register_file>0</register_file>
                                        <l1_cache>0</l1_cache>
                                        <l2_cache>0</l2_cache>
                                        <texture_memory>0</texture_memory>
                                        <texture_shm>N/A</texture_shm>
                                        <cbu>N/A</cbu>
                                        <total>0</total>
                                </double_bit>
                        </volatile>
                        <aggregate>
                                <single_bit>
                                        <device_memory>0</device_memory>
                                        <register_file>0</register_file>
                                        <l1_cache>0</l1_cache>
                                        <l2_cache>0</l2_cache>
                                        <texture_memory>0</texture_memory>
                                        <texture_shm>N/A</texture_shm>
                                        <cbu>N/A</cbu>
                                        <total>0</total>
                                </single_bit>
                                <double_bit>
                                        <device_memory>9</device_memory>
                                        <register_file>0</register_file>
                                        <l1_cache>0</l1_cache>
                                        <l2_cache>0</l2_cache>
                                        <texture_memory>0</texture_memory>
                                        <texture_shm>N/A</texture_shm>
                                        <cbu>N/A</cbu>
                                        <total>9</total>
                                </double_bit>
                        </aggregate>
                </ecc_errors>

Maybe I should use ecc_errors.volatile.single_bit.total and ecc_errors.volatile.double_bit.total?
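
A rough sketch of pulling those two fields out of the nvidia-smi -q -x XML (illustrative only; the helper names are assumptions):

    import xml.etree.ElementTree as ET

    def _to_int(node):
        # the counters can be a number or "N/A"; treat anything non-numeric as 0
        if node is None or node.text is None or not node.text.isdigit():
            return 0
        return int(node.text)

    def volatile_ecc_totals(xml_text):
        """Return a list of (single_bit_total, double_bit_total), one tuple per GPU."""
        root = ET.fromstring(xml_text)
        totals = []
        for gpu in root.iter("gpu"):
            single = _to_int(gpu.find("ecc_errors/volatile/single_bit/total"))
            double = _to_int(gpu.find("ecc_errors/volatile/double_bit/total"))
            totals.append((single, double))
        return totals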

@mzmssg
Member Author

mzmssg commented Feb 25, 2019

@xudifsd
Seems right.

By the way, I think a zombie GPU container should trigger the unhealthy-GPU alert instead of the zombie-container alert, because a GPU issue has higher priority: it might impact scheduling, while a zombie container won't.

@fanyangCS
Contributor

fanyangCS commented Feb 25, 2019

job-exporter calls nvidia-smi with a 3-second timeout. I have checked that the 95th-percentile latency for calling nvidia-smi is 0.5~1.2 seconds, so we could enlarge the timeout to 5 seconds and trigger an alert at 3 seconds.

@xudifsd, over how long a period did you measure the call latency to get the 0.5~1.2 s 95th-percentile figure? Can you log the timeouts? I suggest setting the timeout to tens of seconds; based on previous experience, it can sometimes be as long as several minutes.

@xudifsd
Member

xudifsd commented Feb 26, 2019

@fanyangCS for about two weeks; you can see the latency from the prod bed. The timeout set in the prod bed is currently 3 seconds, so the latency lines should all be below 3 seconds. From the graph, it seems no nvidia-smi call has been interrupted by the timeout.

I'm fine with setting the timeout to 10s, and maybe setting the alert at 5s? We may need to catch more cases of nvidia-smi hangs, so it is probably better to set a lower alert threshold.

@scarlett2018 scarlett2018 added this to the 0.11.0 milestone Feb 26, 2019
@xudifsd
Member

xudifsd commented Feb 26, 2019

@mzmssg do we need to alert on ECC errors? From the manual:

Single bit ECC errors are automatically corrected by the HW and do not result in data corruption. Double bit errors are detected but not corrected.

Maybe we should just alert on double-bit errors?

@xudifsd
Member

xudifsd commented Feb 26, 2019

A memory leak should be larger than 20 MiB to be troublesome.
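
A rough sketch of that heuristic (hypothetical helper; the 20 MiB threshold is the one suggested above):

    LEAK_THRESHOLD_MIB = 20  # assumed unit: MiB, as reported by nvidia-smi

    def leaked_gpus(used_mib_by_gpu, gpus_with_processes):
        """used_mib_by_gpu: {gpu_index: used memory in MiB};
        gpus_with_processes: set of GPU indices that have at least one running process."""
        return [gpu for gpu, used in used_mib_by_gpu.items()
                if gpu not in gpus_with_processes and used > LEAK_THRESHOLD_MIB]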

@xudifsd xudifsd mentioned this issue Feb 26, 2019
@fanyangCS
Contributor

related issue: #2146

@mzmssg
Member Author

mzmssg commented Mar 11, 2019

Test case checklist:

  1. GPU used by zombie container
  2. GPU used by external process
  3. GPU ECC error
  4. GPU memory leak
  5. nvidia-smi hanging

Cases 3~5 are hard to simulate.

@xudifsd
Member

xudifsd commented Mar 13, 2019

Maybe we should close this issue since the PR has been merged.

@mzmssg
Member Author

mzmssg commented Mar 18, 2019

Closed

@mzmssg mzmssg closed this as completed Mar 18, 2019