a100-MIG multiple mig devices not supported with torchserve #1237
Comments
Thanks @LydiaXiaohongLi for opening this ticket. I believe this is happening because devices are assigned using GPU physical IDs; as indicated in the link you shared, all the partitioned GPUs are on GPU:0. You might be able to modify the device assignment in a custom handler to support the partitioned GPUs; an example of a custom handler can be found here.
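To make that concrete, here is a minimal sketch of pinning each worker to one MIG instance from a custom handler. Everything below is an assumption for illustration: the GPU UUID and GPU instance IDs are the ones that appear later in this thread, the PID-based spreading is a hack rather than a guaranteed one-worker-per-instance mapping, and it only works if no CUDA call has been made in the worker process before initialize() runs.

import os

from ts.torch_handler.base_handler import BaseHandler

# MIG instance UUIDs for this machine. The GPU UUID and the GPU instance
# IDs (3, 9, 10, 11, 12, 13; compute instance 0) are taken from this thread.
GPU_UUID = "GPU-6482a92e-d06b-dc68-c272-e3d8f7ecabbf"
MIG_DEVICES = ["MIG-%s/%d/0" % (GPU_UUID, gi) for gi in (3, 9, 10, 11, 12, 13)]


class MigHandler(BaseHandler):
    def initialize(self, context):
        # Hack: spread workers across MIG instances by PID. CUDA reads
        # CUDA_VISIBLE_DEVICES lazily, so this must happen before the
        # first CUDA call in this process.
        os.environ["CUDA_VISIBLE_DEVICES"] = MIG_DEVICES[os.getpid() % len(MIG_DEVICES)]
        # BaseHandler now resolves its device to cuda:0, which maps to the
        # single MIG instance made visible above.
        super().initialize(context)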
Hihi, thanks a lot @HamidShojanazeri for your prompt reply! Sorry, may I understand more: how do I do device assignment with partitioned GPUs when all of them are assigned the same GPU physical ID? I understand each A100 MIG partition has its own unique UUID (as listed by nvidia-smi -L).
And I am only able to use the command line with CUDA_VISIBLE_DEVICES to bring up processes that run on different MIG partitions, such as the commands below. However, I am not sure how to integrate this with torchserve. It would be much appreciated if you could share some sample code for A100 MIG partitioned GPUs. Thanks a lot!
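For illustration, commands of this shape (the UUID is the one quoted later in this thread; the script name is a placeholder):

# One process per MIG instance, pinned via CUDA_VISIBLE_DEVICES.
CUDA_VISIBLE_DEVICES=MIG-GPU-6482a92e-d06b-dc68-c272-e3d8f7ecabbf/3/0 python3 run_inference.py &
CUDA_VISIBLE_DEVICES=MIG-GPU-6482a92e-d06b-dc68-c272-e3d8f7ecabbf/9/0 python3 run_inference.py &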
hihi @HamidShojanazeri, any updates? Thank you!
@LydiaXiaohongLi sorry for the late reply. I am looking into this issue; it seems MIG addressing is only available from the command line, where nvidia-smi and CUDA_VISIBLE_DEVICES have been extended to support it. I am looking into a way to programmatically access this info through nvidia-smi or other packages; it is not supported through the CUDA utilities in PyTorch yet. In the meantime, if you are blocked, one hacky way that should work (though it is not scalable) is to assign devices by iterating on the "GPU instance ID", which you are aware of. It seems the structure of the device_name is MIG-"GPU-UUID"/"GPU instance ID"/"compute instance ID", something like:
GPU-6482a92e-d06b-dc68-c272-e3d8f7ecabbf/7/0
hi @HamidShojanazeri, I have tried your suggestion; however, the device string format GPU_UUID/GPU_instance_ID/0 is not a valid device string.
Does it allow you to create the torch device with the UUID string format? I am using:
Many thanks!
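A minimal repro of the failure mode being described, assuming a recent PyTorch; the exact error text is an assumption and may vary by version:

import torch

# torch.device only parses device types such as "cpu" and "cuda[:index]",
# so a MIG UUID string is rejected outright.
try:
    torch.device("MIG-GPU-6482a92e-d06b-dc68-c272-e3d8f7ecabbf/7/0")
except RuntimeError as err:
    print(err)  # invalid device string

# What does parse: an ordinal index into the devices CUDA can currently see.
device = torch.device("cuda:0")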
@LydiaXiaohongLi Sorry, I think a "MIG-" was missing from the start of the string, so it should be "MIG-GPU-6482a92e-d06b-dc68-c272-e3d8f7ecabbf/7/0". I will look into the NVML Python bindings; they might be helpful for listing the available MIG devices. You might want to give it a shot as well. Here is the doc on the Python bindings.
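A sketch of what that enumeration might look like with the NVML Python bindings (the pynvml package); the approach is an assumption, not code from this thread:

import pynvml

pynvml.nvmlInit()
try:
    gpu = pynvml.nvmlDeviceGetHandleByIndex(0)  # physical GPU 0
    # Walk the MIG slots on this GPU and print each instance's UUID.
    for i in range(pynvml.nvmlDeviceGetMaxMigDeviceCount(gpu)):
        try:
            mig = pynvml.nvmlDeviceGetMigDeviceHandleByIndex(gpu, i)
        except pynvml.NVMLError:
            continue  # slot not populated
        print(pynvml.nvmlDeviceGetUUID(mig))
finally:
    pynvml.nvmlShutdown()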
hihi @HamidShojanazeri, thanks for the references. Btw, I have tried various formats of the device string, including with the 'MIG-' prefix, and it still fails for the same reason.
@LydiaXiaohongLi @HamidShojanazeri
I have created multiple MIG devices on one machine with one NVIDIA A100 GPU. When I started torchserve with no limit on number_of_gpu, I expected it to use all the MIG devices created (6 in total); however, only MIG Dev 0 is used.
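For reference, the config knob mentioned above; this config.properties fragment is illustrative, not taken from the reporter's setup:

# config.properties: number_of_gpu caps how many GPUs torchserve assigns
# to workers; left unset, all visible GPUs are used. With MIG enabled,
# the CUDA runtime exposes at most one MIG instance per process, so
# torchserve sees a single device regardless of this value.
number_of_gpu=1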
xiaohong_li@semantic-torchserve-a100-mig:~/title_semantic_relevance_api_torchserve$ nvidia-smi
Tue Sep 7 06:25:02 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01 Driver Version: 460.73.01 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 A100-SXM4-40GB Off | 00000000:00:04.0 Off | On |
| N/A 34C P0 47W / 400W | 2084MiB / 40536MiB | N/A Default |
| | | Enabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| MIG devices: |
+------------------+----------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG|
| | | ECC| |
|==================+======================+===========+=======================|
| 0 3 0 0 | 2074MiB / 9984MiB | 28 0 | 2 0 1 0 0 |
| | 4MiB / 16383MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 0 9 0 1 | 1MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 0 10 0 2 | 1MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 0 11 0 3 | 1MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 0 12 0 4 | 1MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 0 13 0 5 | 1MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------+-----------+-----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 3 0 17401 C /opt/conda/bin/python3.7 2067MiB |
+-----------------------------------------------------------------------------+
Context
Your Environment
Expected Behavior
torchserve should use all MIG devices, i.e. load models on all MIG devices and run inference on all of them
Current Behavior
torchserve only uses MIG device 0 (out of 6 in total)
Steps to Reproduce
Thanks!