a100-MIG multiple mig devices not supported with torchserve #1237

Open
LydiaXiaohongLi opened this issue Sep 7, 2021 · 8 comments
Labels: triaged_wait (Waiting for the Reporter's resp)

@LydiaXiaohongLi

I have created multiple MIG devices on one machine with a single NVIDIA A100 GPU. When I start TorchServe with no limit on number_of_gpu, I expect it to use all of the MIG devices created (6 in total); however, only MIG device 0 is used.

xiaohong_li@semantic-torchserve-a100-mig:~/title_semantic_relevance_api_torchserve$ nvidia-smi
Tue Sep 7 06:25:02 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01 Driver Version: 460.73.01 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 A100-SXM4-40GB Off | 00000000:00:04.0 Off | On |
| N/A 34C P0 47W / 400W | 2084MiB / 40536MiB | N/A Default |
| | | Enabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| MIG devices: |
+------------------+----------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG|
| | | ECC| |
|==================+======================+===========+=======================|
| 0 3 0 0 | 2074MiB / 9984MiB | 28 0 | 2 0 1 0 0 |
| | 4MiB / 16383MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 0 9 0 1 | 1MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 0 10 0 2 | 1MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 0 11 0 3 | 1MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 0 12 0 4 | 1MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 0 13 0 5 | 1MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------+-----------+-----------------------+

+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 3 0 17401 C /opt/conda/bin/python3.7 2067MiB |
+-----------------------------------------------------------------------------+

Context

  • torchserve version: 0.4.2
  • torch-model-archiver version: 0.4.2
  • torch version: 1.9.0+cu111
  • java version: 11.0.12
  • Operating System and version: debian

Your Environment

  • Installed using source? [yes/no]: no
  • Is it a CPU or GPU environment?: GPU
  • Using a default/custom handler?: custom handler
  • What kind of model is it e.g. vision, text, audio?: transformers, distilled bert
  • Are you planning to use local models from model-store or a public URL (e.g. from an S3 bucket)?: local model
  • Provide config.properties, logs [ts.log] and parameters used for model registration/update APIs:

Expected Behavior

TorchServe should use all MIG devices, i.e. load models on all MIG devices and run inference across all of them.

Current Behavior

TorchServe only uses MIG device 0 (out of 6 in total).

Steps to Reproduce

  1. Create an A100 instance with MIG enabled and multiple MIG devices created, e.g. following https://cloud.google.com/compute/docs/gpus/create-vm-with-gpus#mig-gpu
  2. Run any sample TorchServe model

Thanks!

@HamidShojanazeri (Collaborator) commented Sep 8, 2021

Thanks @LydiaXiaohongLi for opening this ticket. I believe this is happening because devices are assigned using GPU physical IDs; as indicated in the link you shared, all the partitioned GPUs are on GPU 0. You might be able to modify the device assignment in a custom handler to support the partitioned GPUs; an example of a custom handler can be found here.
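
For context, a rough sketch of the device-selection logic such a custom handler would need to replace is below; it mirrors the pattern in TorchServe's BaseHandler, but the function name and details here are illustrative rather than the exact source:

import torch

def pick_device(context):
    # TorchServe passes the assigned physical GPU index via system_properties.
    properties = context.system_properties
    gpu_id = properties.get("gpu_id")
    if torch.cuda.is_available() and gpu_id is not None:
        # With a single MIG-enabled A100 this is always "cuda:0",
        # which is why every worker ends up on MIG device 0.
        return torch.device("cuda:" + str(gpu_id))
    return torch.device("cpu")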

@HamidShojanazeri HamidShojanazeri self-assigned this Sep 8, 2021
@HamidShojanazeri HamidShojanazeri added the triaged_wait Waiting for the Reporter's resp label Sep 8, 2021
@LydiaXiaohongLi (Author)

Hihi, thanks a lot @HamidShojanazeri for your prompt reply!

Sorry, may I understand more about how to do device assignment with the partitioned GPUs, when all of the partitioned GPUs are assigned the same GPU physical ID?

I understand A100 MIG assigns a unique UUID to each partitioned GPU, as shown below.

xxxx@xxxx:~$ nvidia-smi -L
GPU 0: A100-SXM4-40GB (UUID: GPU-6482a92e-d06b-dc68-c272-e3d8f7ecabbf)
  MIG 1g.5gb Device 0: (UUID: MIG-GPU-6482a92e-d06b-dc68-c272-e3d8f7ecabbf/7/0)
  MIG 1g.5gb Device 1: (UUID: MIG-GPU-6482a92e-d06b-dc68-c272-e3d8f7ecabbf/8/0)
  MIG 1g.5gb Device 2: (UUID: MIG-GPU-6482a92e-d06b-dc68-c272-e3d8f7ecabbf/9/0)
  MIG 1g.5gb Device 3: (UUID: MIG-GPU-6482a92e-d06b-dc68-c272-e3d8f7ecabbf/11/0)
  MIG 1g.5gb Device 4: (UUID: MIG-GPU-6482a92e-d06b-dc68-c272-e3d8f7ecabbf/12/0)
  MIG 1g.5gb Device 5: (UUID: MIG-GPU-6482a92e-d06b-dc68-c272-e3d8f7ecabbf/13/0)
  MIG 1g.5gb Device 6: (UUID: MIG-GPU-6482a92e-d06b-dc68-c272-e3d8f7ecabbf/14/0)

I am only able to use the command line with CUDA_VISIBLE_DEVICES to bring up processes that run on different partitioned MIG devices, as in the commands below. However, I am not sure how to integrate this with TorchServe. It would be much appreciated if you could share some sample code for A100 MIG partitioned GPUs. Thanks a lot!

CUDA_VISIBLE_DEVICES=MIG-GPU-6482a92e-d06b-dc68-c272-e3d8f7ecabbf/7/0 python test.py
CUDA_VISIBLE_DEVICES=MIG-GPU-6482a92e-d06b-dc68-c272-e3d8f7ecabbf/8/0 python test.py
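
My understanding is that this works per process because CUDA_VISIBLE_DEVICES has to be set before CUDA is initialized; after that, the process sees the selected MIG slice as an ordinary cuda:0 device. A minimal illustration (same MIG UUID as above):

import os
# Must be set before CUDA is initialized, e.g. before importing torch.
os.environ["CUDA_VISIBLE_DEVICES"] = "MIG-GPU-6482a92e-d06b-dc68-c272-e3d8f7ecabbf/7/0"

import torch
print(torch.cuda.device_count())  # 1 -- only the selected MIG slice is visible
device = torch.device("cuda:0")   # refers to that MIG slice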

@LydiaXiaohongLi (Author)

Hihi @HamidShojanazeri, any updates? Thank you!

@HamidShojanazeri (Collaborator) commented Sep 22, 2021

@LydiaXiaohongLi sorry for the late reply. I am looking into this issue; it seems MIG device selection is currently only available from the command line, via nvidia-smi and the extended CUDA_VISIBLE_DEVICES support. I am looking into a way to programmatically access this info through nvidia-smi or other packages. It is not supported through the CUDA utilities in PyTorch yet.
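
As a rough, untested sketch of the programmatic route, one option would be to shell out to nvidia-smi -L and parse the MIG UUIDs out of its output:

import re
import subprocess

def list_mig_uuids():
    # Lines look like: "  MIG 1g.5gb Device 0: (UUID: MIG-GPU-.../7/0)"
    out = subprocess.run(["nvidia-smi", "-L"],
                         capture_output=True, text=True, check=True).stdout
    return re.findall(r"\(UUID:\s*(MIG-[^)]+)\)", out)

print(list_mig_uuids())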

In the meantime, if you are blocked, one hacky way that should work (though it is not scalable) is to assign devices by iterating over the "GPU instance ID"s, which you already know.

It seems the structure of the device name is: MIG-<GPU-UUID>/<GPU instance ID>/<compute instance ID>

something like (inside the handler's initialize):

import random
import torch

GPU_UUID = "MIG-GPU-63feeb45-94c6-b9cb-78ea-98e9b7a5be6b"
GPU_instance_ID_list = [0, 1, 2, 3, 4, 5, 6, 7]

# Fall back to CPU when CUDA is not available.
self.map_location = GPU_UUID if torch.cuda.is_available() else "cpu"

# Pick one GPU instance at random and build the MIG device string.
self.device = torch.device(
    self.map_location + "/{}/0".format(random.choice(GPU_instance_ID_list))
)

@LydiaXiaohongLi (Author)

Hi @HamidShojanazeri,

I have tried your suggestion; however, the device string format GPU_UUID/GPU_instance_ID/0 is not a valid device string.
I have tried some other formats for the device string as well, and all of them fail with "Invalid device string".

RuntimeError: Invalid device string: 'GPU-58a224da-d632-bb7e-bdce-3e0e117ad6e4/7/0'

Does it allow you to create a torch device with the UUID string format on your side?

I am using:
torchserve version: 0.4.2
torch version: 1.9.0+cu111.

Many thanks!
Regards
Xiaohong

@HamidShojanazeri (Collaborator) commented Sep 22, 2021

@LydiaXiaohongLi Sorry, I think a "MIG-" prefix was missing from the start of the string, so it should be "MIG-GPU-6482a92e-d06b-dc68-c272-e3d8f7ecabbf/7/0". I will also look into the NVML Python bindings; they might be helpful for listing the available MIG devices. You might want to give them a shot as well. Here is the doc on the Python bindings.
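
Roughly, the NVML route could look like the sketch below (untested here, using the pynvml package from nvidia-ml-py):

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)            # physical GPU 0
max_mig = pynvml.nvmlDeviceGetMaxMigDeviceCount(handle)  # max MIG slots
for i in range(max_mig):
    try:
        mig = pynvml.nvmlDeviceGetMigDeviceHandleByIndex(handle, i)
    except pynvml.NVMLError:
        continue  # slot not populated
    print(pynvml.nvmlDeviceGetUUID(mig))  # MIG device UUID (bytes on older versions)
pynvml.nvmlShutdown()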

@LydiaXiaohongLi (Author)

Hihi @HamidShojanazeri, thanks for the references. By the way, I have tried various device string formats, including with the 'MIG-' prefix, and they all still fail for the same reason.

@nickisworking commented Jan 23, 2024

@LydiaXiaohongLi @HamidShojanazeri
Hi, this issue seems to have been around for a while... Is there any update on this issue?
