a100-MIG multiple mig devices not supported with torchserve #1237

Open
LydiaXiaohongLi opened this issue Sep 7, 2021 · 8 comments
Labels: triaged_wait (Waiting for the Reporter's resp)

@LydiaXiaohongLi

I have created multiple MIG devices on one machine with a single NVIDIA A100 GPU. When I start TorchServe with no limit on number_of_gpu, I expect it to use all of the MIG devices created (6 in total); however, only MIG device 0 is used.

xiaohong_li@semantic-torchserve-a100-mig:~/title_semantic_relevance_api_torchserve$ nvidia-smi
Tue Sep 7 06:25:02 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01 Driver Version: 460.73.01 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 A100-SXM4-40GB Off | 00000000:00:04.0 Off | On |
| N/A 34C P0 47W / 400W | 2084MiB / 40536MiB | N/A Default |
| | | Enabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| MIG devices: |
+------------------+----------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG|
| | | ECC| |
|==================+======================+===========+=======================|
| 0 3 0 0 | 2074MiB / 9984MiB | 28 0 | 2 0 1 0 0 |
| | 4MiB / 16383MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 0 9 0 1 | 1MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 0 10 0 2 | 1MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 0 11 0 3 | 1MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 0 12 0 4 | 1MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 0 13 0 5 | 1MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------+-----------+-----------------------+

+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 3 0 17401 C /opt/conda/bin/python3.7 2067MiB |
+-----------------------------------------------------------------------------+

Context

  • torchserve version: 0.4.2
  • torch-model-archiver version: 0.4.2
  • torch version: 1.9.0+cu111
  • java version: 11.0.12
  • Operating System and version: debian

Your Environment

  • Installed using source? [yes/no]: no
  • Is it a CPU or GPU environment?: GPU
  • Using a default/custom handler?: custom handler
  • What kind of model is it e.g. vision, text, audio?: transformers, distilled bert
  • Are you planning to use local models from model-store or a public URL (e.g. from an S3 bucket)?: local model
  • Provide config.properties, logs [ts.log] and parameters used for model registration/update APIs:

Expected Behavior

TorchServe should use all MIG devices, i.e. load models on all MIG devices and run inference across all of them.

Current Behavior

TorchServe only uses MIG device 0 (out of 6 in total).

Steps to Reproduce

  1. Create an A100 instance with MIG enabled and multiple MIG devices created, e.g. following https://cloud.google.com/compute/docs/gpus/create-vm-with-gpus#mig-gpu
  2. Run any sample TorchServe model

Thanks!

@HamidShojanazeri (Collaborator) commented Sep 8, 2021

Thanks @LydiaXiaohongLi for opening this ticket. I believe this is happening because devices are assigned using GPU physical IDs; as indicated in the link you shared, all the partitioned GPUs are on GPU 0. You might be able to modify the device assignment in a custom handler to support the partitioned GPUs; an example of a custom handler can be found here.
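
For context, a rough sketch of the device-selection logic such a custom handler would need to replace is below; it mirrors the pattern in TorchServe's BaseHandler, but the function name and details here are illustrative rather than the exact source:

import torch

def pick_device(context):
    # TorchServe passes the assigned physical GPU index via system_properties.
    properties = context.system_properties
    gpu_id = properties.get("gpu_id")
    if torch.cuda.is_available() and gpu_id is not None:
        # With a single MIG-enabled A100 this is always "cuda:0",
        # which is why every worker ends up on MIG device 0.
        return torch.device("cuda:" + str(gpu_id))
    return torch.device("cpu")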

@HamidShojanazeri HamidShojanazeri self-assigned this Sep 8, 2021
@HamidShojanazeri HamidShojanazeri added the triaged_wait Waiting for the Reporter's resp label Sep 8, 2021
@LydiaXiaohongLi (Author)

Hihi, thanks a lot @HamidShojanazeri for your prompt reply!

Sorry, may I understand more about how to do device assignment with the partitioned GPUs, when all of the partitioned GPUs are assigned the same GPU physical ID?

I understand A100 MIG assigns a unique UUID to each partitioned GPU, as shown below.

xxxx@xxxx:~$ nvidia-smi -L
GPU 0: A100-SXM4-40GB (UUID: GPU-6482a92e-d06b-dc68-c272-e3d8f7ecabbf)
  MIG 1g.5gb Device 0: (UUID: MIG-GPU-6482a92e-d06b-dc68-c272-e3d8f7ecabbf/7/0)
  MIG 1g.5gb Device 1: (UUID: MIG-GPU-6482a92e-d06b-dc68-c272-e3d8f7ecabbf/8/0)
  MIG 1g.5gb Device 2: (UUID: MIG-GPU-6482a92e-d06b-dc68-c272-e3d8f7ecabbf/9/0)
  MIG 1g.5gb Device 3: (UUID: MIG-GPU-6482a92e-d06b-dc68-c272-e3d8f7ecabbf/11/0)
  MIG 1g.5gb Device 4: (UUID: MIG-GPU-6482a92e-d06b-dc68-c272-e3d8f7ecabbf/12/0)
  MIG 1g.5gb Device 5: (UUID: MIG-GPU-6482a92e-d06b-dc68-c272-e3d8f7ecabbf/13/0)
  MIG 1g.5gb Device 6: (UUID: MIG-GPU-6482a92e-d06b-dc68-c272-e3d8f7ecabbf/14/0)

I am only able to use the command line with CUDA_VISIBLE_DEVICES to bring up processes that run on different partitioned MIG devices, as in the commands below. However, I am not sure how to integrate this with TorchServe. It would be much appreciated if you could share some sample code for A100 MIG partitioned GPUs. Thanks a lot!

CUDA_VISIBLE_DEVICES=MIG-GPU-6482a92e-d06b-dc68-c272-e3d8f7ecabbf/7/0 python test.py
CUDA_VISIBLE_DEVICES=MIG-GPU-6482a92e-d06b-dc68-c272-e3d8f7ecabbf/8/0 python test.py
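
My understanding is that this works per process because CUDA_VISIBLE_DEVICES has to be set before CUDA is initialized; after that, the process sees the selected MIG slice as an ordinary cuda:0 device. A minimal illustration (same MIG UUID as above):

import os
# Must be set before CUDA is initialized, e.g. before importing torch.
os.environ["CUDA_VISIBLE_DEVICES"] = "MIG-GPU-6482a92e-d06b-dc68-c272-e3d8f7ecabbf/7/0"

import torch
print(torch.cuda.device_count())  # 1 -- only the selected MIG slice is visible
device = torch.device("cuda:0")   # refers to that MIG slice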

@LydiaXiaohongLi (Author)

Hihi @HamidShojanazeri, any updates? Thank you!

@HamidShojanazeri (Collaborator) commented Sep 22, 2021

@LydiaXiaohongLi sorry for the late reply. I am looking into this issue; it seems MIG device selection is currently only available from the command line, via nvidia-smi and the extended CUDA_VISIBLE_DEVICES support. I am looking into a way to programmatically access this info through nvidia-smi or other packages. It is not supported through the CUDA utilities in PyTorch yet.
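
As a rough, untested sketch of the programmatic route, one option would be to shell out to nvidia-smi -L and parse the MIG UUIDs out of its output:

import re
import subprocess

def list_mig_uuids():
    # Lines look like: "  MIG 1g.5gb Device 0: (UUID: MIG-GPU-.../7/0)"
    out = subprocess.run(["nvidia-smi", "-L"],
                         capture_output=True, text=True, check=True).stdout
    return re.findall(r"\(UUID:\s*(MIG-[^)]+)\)", out)

print(list_mig_uuids())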

In the meantime, if you are blocked, one hacky way that should work (though it is not scalable) is to assign devices by iterating over the "GPU instance ID"s, which you already know.

It seems the structure of the device name is: MIG-<GPU-UUID>/<GPU instance ID>/<compute instance ID>

something like (inside the handler's initialize):

import random
import torch

GPU_UUID = "MIG-GPU-63feeb45-94c6-b9cb-78ea-98e9b7a5be6b"
GPU_instance_ID_list = [0, 1, 2, 3, 4, 5, 6, 7]

# Fall back to CPU when CUDA is not available.
self.map_location = GPU_UUID if torch.cuda.is_available() else "cpu"

# Pick one GPU instance at random and build the MIG device string.
self.device = torch.device(
    self.map_location + "/{}/0".format(random.choice(GPU_instance_ID_list))
)

@LydiaXiaohongLi (Author)

Hi @HamidShojanazeri,

I have tried your suggestion; however, the device string format GPU_UUID/GPU_instance_ID/0 is not a valid device string.
I have tried some other formats for the device string as well, and all of them fail with "Invalid device string".

RuntimeError: Invalid device string: 'GPU-58a224da-d632-bb7e-bdce-3e0e117ad6e4/7/0'

Does it allow you to create a torch device with the UUID string format on your side?

I am using:
torchserve version: 0.4.2
torch version: 1.9.0+cu111.

Many thanks!
Regards
Xiaohong

@HamidShojanazeri (Collaborator) commented Sep 22, 2021

@LydiaXiaohongLi Sorry, I think a "MIG-" prefix was missing from the start of the string, so it should be "MIG-GPU-6482a92e-d06b-dc68-c272-e3d8f7ecabbf/7/0". I will also look into the NVML Python bindings; they might be helpful for listing the available MIG devices. You might want to give them a shot as well. Here is the doc on the Python bindings.
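
Roughly, the NVML route could look like the sketch below (untested here, using the pynvml package from nvidia-ml-py):

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)            # physical GPU 0
max_mig = pynvml.nvmlDeviceGetMaxMigDeviceCount(handle)  # max MIG slots
for i in range(max_mig):
    try:
        mig = pynvml.nvmlDeviceGetMigDeviceHandleByIndex(handle, i)
    except pynvml.NVMLError:
        continue  # slot not populated
    print(pynvml.nvmlDeviceGetUUID(mig))  # MIG device UUID (bytes on older versions)
pynvml.nvmlShutdown()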

@LydiaXiaohongLi (Author)

Hihi @HamidShojanazeri, thanks for the references. By the way, I have tried various device string formats, including with the 'MIG-' prefix, and they all still fail for the same reason.

@nickisworking commented Jan 23, 2024

@LydiaXiaohongLi @HamidShojanazeri
Hi, this issue seems to have been around for a while... Is there any update on this issue?
