This repository has been archived by the owner on Jun 6, 2024. It is now read-only.

Need alarms for unhealthy GPU cases #2192

Closed
1 task done
mzmssg opened this issue Feb 22, 2019 · 14 comments · Fixed by #2209

Comments

@mzmssg
Member

mzmssg commented Feb 22, 2019

Organization Name: Microsoft

Short summary about the issue/question:

There are some GPU exceptions that PAI can't handle perfectly. For these cases, as a first phase, we will alert and introduce manual operations:

  1. job-exporter collects GPU status
  2. alert the admin
  3. the admin manually decommissions the nodes

Currently, we have summarized the following unhealthy cases:

  • Nvidia-smi hangs
    1. Hang in kernel mode
      Detection: filter the process list for processes whose command is 'nvidia-smi' and whose state is 'D' (uninterruptible sleep). To reduce false-positive alarms, we could also add a timeout (see the sketch after this list).
  • Nvidia-smi works
    1. GPU memory leak
      Detection: call nvidia-smi to get the GPUs with memory in use and the GPUs with running processes;
      the GPUs in the first set but not the second are the ones with leaked memory.
    2. GPU used by external processes
      Detection: call nvidia-smi to get the running processes, then read each process's cgroup from /proc/$pid/cgroup and check whether it belongs to a job container.
    3. GPU used by zombie container
      Detection: based on 2, further check whether the corresponding YARN container is still alive.
    4. GPU ECC error
      Detection: call nvidia-smi to get the ECC error count.
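
For the kernel-mode hang case, a minimal sketch of the D-state check, scanning /proc directly (the find_hung_nvidia_smi helper is hypothetical, not the actual job-exporter code):

    import os

    def find_hung_nvidia_smi():
        """Return pids of nvidia-smi processes stuck in 'D' (uninterruptible sleep)."""
        hung_pids = []
        for pid in os.listdir("/proc"):
            if not pid.isdigit():
                continue
            try:
                with open("/proc/%s/comm" % pid) as f:
                    comm = f.read().strip()
                with open("/proc/%s/stat" % pid) as f:
                    # /proc/<pid>/stat looks like "<pid> (<comm>) <state> ..."; the state follows ")"
                    state = f.read().rsplit(")", 1)[1].split()[0]
            except (IOError, OSError):
                continue  # the process exited while we were scanning
            if comm == "nvidia-smi" and state == "D":
                hung_pids.append(int(pid))
        return hung_pids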

OpenPAI Environment:

  • OpenPAI version: master

Anything else we need to know:

Alarms and manual operations are the first step.
As the next step, PAI itself should tolerate such cases.

Test Cases

  • Zombie container test:
    On any worker node, run the following command to produce a "zombie" container:

    sudo docker run --name admin-tensorflow-serving-container_e09_1551848189332_0010_01_000004 --rm --init -d --privileged=false --oom-score-adj=1000 --cap-add=SYS_ADMIN --network=host --cpus=4 --memory=8192m --shm-size=64m --device=/dev/nvidiactl --device=/dev/nvidia-uvm --device=/dev/nvidia0 --device=/dev/fuse --security-opt apparmor:unconfined --volume /var/drivers/nvidia/current:/usr/local/nvidia:ro --entrypoint= openpai/pai.example.tensorflow-serving /bin/bash -c "bazel-bin/tensorflow_serving/example/mnist_saved_model /tmp/mnist_model && while :; do tensorflow_model_server --port=5001 --model_name=mnist --model_base_path=/tmp/mnist_model; done"

    Then check the Prometheus alerts at http://master_node:9091/prometheus/alerts after a few minutes. There should be two alerts: GpuUsedByZombieContainer and PaiJobsZombie.
    Then, in the dev-box, execute the following commands:

    Get alerting GPUs:
    cd src/tools
    python node_maintain.py badgpus get -m {prometheus_ip} 
    
    Add nodes to blacklist:
    python node_maintain.py blacklist add -n {unhealthy_nodes} -m {api-server-ip}
    
    Decommission:
    python node_maintain.py blacklist enforce -m {master_ip} [--api-server-ip api-server-ip] [--resource-manager-ip resource-manager-ip]
    
@mzmssg
Member Author

mzmssg commented Feb 22, 2019

Please help add more cases.

@xudifsd
Member

xudifsd commented Feb 25, 2019

@mzmssg how do we get their cgroup from /proc/$pid/cgroup? Its content looks like:

12:rdma:/
11:hugetlb:/docker/74ce4b73b4a46b937129ca40ab4bb131105a1e14a12b8839cf05a04cc4e90b3c
10:pids:/docker/74ce4b73b4a46b937129ca40ab4bb131105a1e14a12b8839cf05a04cc4e90b3c
9:memory:/docker/74ce4b73b4a46b937129ca40ab4bb131105a1e14a12b8839cf05a04cc4e90b3c
8:cpuset:/docker/74ce4b73b4a46b937129ca40ab4bb131105a1e14a12b8839cf05a04cc4e90b3c
7:devices:/docker/74ce4b73b4a46b937129ca40ab4bb131105a1e14a12b8839cf05a04cc4e90b3c
6:perf_event:/docker/74ce4b73b4a46b937129ca40ab4bb131105a1e14a12b8839cf05a04cc4e90b3c
5:blkio:/docker/74ce4b73b4a46b937129ca40ab4bb131105a1e14a12b8839cf05a04cc4e90b3c
4:net_cls,net_prio:/docker/74ce4b73b4a46b937129ca40ab4bb131105a1e14a12b8839cf05a04cc4e90b3c
3:freezer:/docker/74ce4b73b4a46b937129ca40ab4bb131105a1e14a12b8839cf05a04cc4e90b3c
2:cpu,cpuacct:/docker/74ce4b73b4a46b937129ca40ab4bb131105a1e14a12b8839cf05a04cc4e90b3c
1:name=systemd:/docker/74ce4b73b4a46b937129ca40ab4bb131105a1e14a12b8839cf05a04cc4e90b3c

And how do we get the ECC error code?

@xudifsd
Member

xudifsd commented Feb 25, 2019

job-exporter calls nvidia-smi with a 3-second timeout. I have checked that the 95th-percentile latency for calling nvidia-smi is 0.5~1.2 seconds, so we could enlarge the timeout to 5 seconds and trigger an alert at 3 seconds.
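
For reference, a minimal Python 3 sketch of "call nvidia-smi with a hard timeout and measure the latency" (illustrative only, not the actual job-exporter code; the constant and helper name are assumptions):

    import subprocess
    import time

    NVIDIA_SMI_TIMEOUT = 5  # proposed enlarged hard timeout, in seconds

    def call_nvidia_smi():
        """Run nvidia-smi -q -x, returning (output_or_None, latency_in_seconds)."""
        start = time.time()
        try:
            output = subprocess.check_output(["nvidia-smi", "-q", "-x"],
                                             timeout=NVIDIA_SMI_TIMEOUT)
        except subprocess.TimeoutExpired:
            output = None  # treat as a hang; the exporter would bump a hang/latency metric
        return output, time.time() - start

The measured latency would then feed whatever metric the alert threshold is evaluated against.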

@mzmssg
Member Author

mzmssg commented Feb 25, 2019

@xudifsd
Yes, in this case the process belongs to a Docker container with ID 74ce4b73b4a46b937129ca40ab4bb131105a1e14a12b8839cf05a04cc4e90b3c.
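
A rough sketch of that lookup (hypothetical helper, not the job-exporter implementation): read /proc/<pid>/cgroup, pull out the Docker container ID, then compare it against the job containers PAI knows about.

    import re

    def container_id_of(pid):
        """Return the Docker container ID a process runs in, or None if it is not containerized."""
        try:
            with open("/proc/%d/cgroup" % pid) as f:
                for line in f:
                    # each line looks like "9:memory:/docker/<64-hex-char container id>"
                    m = re.search(r"/docker/([0-9a-f]{64})", line)
                    if m:
                        return m.group(1)
        except (IOError, OSError):
            pass  # the process may have exited in the meantime
        return None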

For ECC errors, a sample nvidia-smi response is shown below. The ECC error count is the number in the top-right corner of each GPU's row; it can be a number, N/A, or Off, and a number greater than 0 means the GPU is abnormal.

   +-----------------------------------------------------------------------------+
   | NVIDIA-SMI 384.111                Driver Version: 384.111                   |
   |-------------------------------+----------------------+----------------------+
   | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
   | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
   |===============================+======================+======================|
   |   0  Tesla K80           Off  | 00006B24:00:00.0 Off |                    0 |
   | N/A   26C    P8    34W / 149W |   3322MiB / 11439MiB |      0%      Default |
   +-------------------------------+----------------------+----------------------+

@xudifsd
Member

xudifsd commented Feb 25, 2019

@mzmssg job-exporter reads the output of nvidia-smi -q -x, which has the following ECC-related fields:

                <ecc_mode>
                        <current_ecc>Enabled</current_ecc>
                        <pending_ecc>Enabled</pending_ecc>
                </ecc_mode>
                <ecc_errors>
                        <volatile>
                                <single_bit>
                                        <device_memory>0</device_memory>
                                        <register_file>0</register_file>
                                        <l1_cache>0</l1_cache>
                                        <l2_cache>0</l2_cache>
                                        <texture_memory>0</texture_memory>
                                        <texture_shm>N/A</texture_shm>
                                        <cbu>N/A</cbu>
                                        <total>0</total>
                                </single_bit>
                                <double_bit>
                                        <device_memory>0</device_memory>
                                        <register_file>0</register_file>
                                        <l1_cache>0</l1_cache>
                                        <l2_cache>0</l2_cache>
                                        <texture_memory>0</texture_memory>
                                        <texture_shm>N/A</texture_shm>
                                        <cbu>N/A</cbu>
                                        <total>0</total>
                                </double_bit>
                        </volatile>
                        <aggregate>
                                <single_bit>
                                        <device_memory>0</device_memory>
                                        <register_file>0</register_file>
                                        <l1_cache>0</l1_cache>
                                        <l2_cache>0</l2_cache>
                                        <texture_memory>0</texture_memory>
                                        <texture_shm>N/A</texture_shm>
                                        <cbu>N/A</cbu>
                                        <total>0</total>
                                </single_bit>
                                <double_bit>
                                        <device_memory>9</device_memory>
                                        <register_file>0</register_file>
                                        <l1_cache>0</l1_cache>
                                        <l2_cache>0</l2_cache>
                                        <texture_memory>0</texture_memory>
                                        <texture_shm>N/A</texture_shm>
                                        <cbu>N/A</cbu>
                                        <total>9</total>
                                </double_bit>
                        </aggregate>
                </ecc_errors>

Maybe I should use ecc_errors.volatile.single_bit.total and ecc_errors.volatile.double_bit.total?
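
A rough sketch of pulling those two fields out of the nvidia-smi -q -x XML (illustrative only; the helper names are assumptions):

    import xml.etree.ElementTree as ET

    def _to_int(node):
        # the counters can be a number or "N/A"; treat anything non-numeric as 0
        if node is None or node.text is None or not node.text.isdigit():
            return 0
        return int(node.text)

    def volatile_ecc_totals(xml_text):
        """Return a list of (single_bit_total, double_bit_total), one tuple per GPU."""
        root = ET.fromstring(xml_text)
        totals = []
        for gpu in root.iter("gpu"):
            single = _to_int(gpu.find("ecc_errors/volatile/single_bit/total"))
            double = _to_int(gpu.find("ecc_errors/volatile/double_bit/total"))
            totals.append((single, double))
        return totals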

@mzmssg
Member Author

mzmssg commented Feb 25, 2019

@xudifsd
Seems right.

By the way, I think a zombie GPU container should trigger the unhealthy-GPU alert instead of the zombie-container alert, because a GPU issue has higher priority: it might impact scheduling, while a zombie container won't.

@fanyangCS
Contributor

fanyangCS commented Feb 25, 2019

job-exporter calls nvidia-smi with a 3-second timeout. I have checked that the 95th-percentile latency for calling nvidia-smi is 0.5~1.2 seconds, so we could enlarge the timeout to 5 seconds and trigger an alert at 3 seconds.

@xudifsd, over how long a period did you measure the call latency to get the 0.5~1.2 s 95th-percentile figure? Can you log the timeouts? I suggest setting the timeout to tens of seconds; based on previous experience, it can sometimes be as long as several minutes.

@xudifsd
Member

xudifsd commented Feb 26, 2019

@fanyangCS for about two weeks; you can see the latency from the prod bed. The timeout set in the prod bed is currently 3 seconds, so the latency lines should all be below 3 seconds. From the graph, it seems no nvidia-smi call has been interrupted by the timeout.

I'm fine with setting the timeout to 10s, and maybe setting the alert at 5s? We may need to catch more cases of nvidia-smi hangs, so it is probably better to set a lower alert threshold.

@scarlett2018 scarlett2018 added this to the 0.11.0 milestone Feb 26, 2019
@xudifsd
Member

xudifsd commented Feb 26, 2019

@mzmssg do we need to alert on ECC errors? From the manual:

Single bit ECC errors are automatically corrected by the HW and do not result in data corruption. Double bit errors are detected but not corrected.

Maybe we should just alert on double-bit errors?

@xudifsd
Member

xudifsd commented Feb 26, 2019

A memory leak should be larger than 20 MiB to be troublesome.
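
A rough sketch of that heuristic (hypothetical helper; the 20 MiB threshold is the one suggested above):

    LEAK_THRESHOLD_MIB = 20  # assumed unit: MiB, as reported by nvidia-smi

    def leaked_gpus(used_mib_by_gpu, gpus_with_processes):
        """used_mib_by_gpu: {gpu_index: used memory in MiB};
        gpus_with_processes: set of GPU indices that have at least one running process."""
        return [gpu for gpu, used in used_mib_by_gpu.items()
                if gpu not in gpus_with_processes and used > LEAK_THRESHOLD_MIB]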

@xudifsd xudifsd mentioned this issue Feb 26, 2019
@fanyangCS
Contributor

related issue: #2146

@mzmssg
Member Author

mzmssg commented Mar 11, 2019

Test case checklist:

  1. GPU used by zombie container
  2. GPU used by external process
  3. GPU ECC error
  4. GPU memory leak
  5. nvidia-smi hanging

Cases 3~5 are hard to simulate.

@xudifsd
Member

xudifsd commented Mar 13, 2019

Maybe we should close this issue since the PR has been merged.

@mzmssg
Member Author

mzmssg commented Mar 18, 2019

Closed

@mzmssg mzmssg closed this as completed Mar 18, 2019