[Bug] SDK does not support multiple GPUs #2380

ChrisKong93 · 2023-08-28T01:37:19Z

Checklist

1. I have searched related issues but cannot get the expected help.
2. I have read the FAQ documentation but cannot get the expected help.
3. The bug has not been fixed in the latest version.

Describe the bug

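In a single Python script, I want to load the model onto two GPUs at the same time and have the two GPUs run inference in a loop. The first inference succeeds, but the second one fails with an error.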

Reproduction

The main code is as follows:
`gpu_count = len(gpus_id)

try:
    for i in range(len(gpus_id)):
        gpu_id = int(gpus_id[i])
        print(gpu_id)
        model_path = "'./resnet50{}'".format(i)
        exec('classifier{} = Classifier(model_path= {} ,device_name= {}, device_id = {})'.format(
            i, model_path, "'cuda'", gpu_id))
except RuntimeError as e:
    classifier = Classifier(model_path='./resnet50', device_name='cpu', device_id=0)`

Environment

08/28 09:30:11 - mmengine - INFO - 

08/28 09:30:11 - mmengine - INFO - **********Environmental information**********
08/28 09:30:11 - mmengine - INFO - sys.platform: linux
08/28 09:30:11 - mmengine - INFO - Python: 3.8.17 (default, Jul  5 2023, 21:04:15) [GCC 11.2.0]
08/28 09:30:11 - mmengine - INFO - CUDA available: True
08/28 09:30:11 - mmengine - INFO - GPU 0,1: NVIDIA GeForce RTX 3090
08/28 09:30:11 - mmengine - INFO - CUDA_HOME: /usr/local/cuda-11.3
08/28 09:30:11 - mmengine - INFO - NVCC: Cuda compilation tools, release 11.3, V11.3.109
08/28 09:30:11 - mmengine - INFO - GCC: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
08/28 09:30:11 - mmengine - INFO - PyTorch: 1.12.1
08/28 09:30:11 - mmengine - INFO - PyTorch compiling details: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201402
  - Intel(R) oneAPI Math Kernel Library Version 2021.4-Product Build 20210904 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.6.0 (Git Hash 52b5f107dd9cf10910aaa19cb47f3abf9b349815)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 11.3
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37
  - CuDNN 8.3.2  (built against CUDA 11.5)
  - Magma 2.5.2
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.3, CUDNN_VERSION=8.3.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -fabi-version=11 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.12.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, 

08/28 09:30:11 - mmengine - INFO - TorchVision: 0.13.1
08/28 09:30:11 - mmengine - INFO - OpenCV: 4.8.0
08/28 09:30:11 - mmengine - INFO - MMCV: 1.5.1
08/28 09:30:11 - mmengine - INFO - MMCV Compiler: GCC 7.5
08/28 09:30:11 - mmengine - INFO - MMCV CUDA Compiler: 11.3
08/28 09:30:11 - mmengine - INFO - MMDeploy: 1.2.0+553f9b8
08/28 09:30:11 - mmengine - INFO - 

08/28 09:30:11 - mmengine - INFO - **********Backend information**********
08/28 09:30:11 - mmengine - INFO - tensorrt:    None
08/28 09:30:11 - mmengine - INFO - ONNXRuntime: None
08/28 09:30:11 - mmengine - INFO - ONNXRuntime-gpu:     1.8.1
08/28 09:30:11 - mmengine - INFO - ONNXRuntime custom ops:      Available
08/28 09:30:11 - mmengine - INFO - pplnn:       None
08/28 09:30:11 - mmengine - INFO - ncnn:        None
08/28 09:30:11 - mmengine - INFO - snpe:        None
08/28 09:30:11 - mmengine - INFO - openvino:    None
08/28 09:30:11 - mmengine - INFO - torchscript: 1.12.1
08/28 09:30:11 - mmengine - INFO - torchscript custom ops:      NotAvailable
08/28 09:30:11 - mmengine - INFO - rknn-toolkit:        None
08/28 09:30:11 - mmengine - INFO - rknn-toolkit2:       None
08/28 09:30:11 - mmengine - INFO - ascend:      None
08/28 09:30:11 - mmengine - INFO - coreml:      None
08/28 09:30:11 - mmengine - INFO - tvm: None
08/28 09:30:11 - mmengine - INFO - vacc:        None
08/28 09:30:11 - mmengine - INFO - 

08/28 09:30:11 - mmengine - INFO - **********Codebase information**********
08/28 09:30:11 - mmengine - INFO - mmdet:       None
08/28 09:30:11 - mmengine - INFO - mmseg:       None
08/28 09:30:11 - mmengine - INFO - mmpretrain:  None
08/28 09:30:11 - mmengine - INFO - mmocr:       None
08/28 09:30:11 - mmengine - INFO - mmagic:      None
08/28 09:30:11 - mmengine - INFO - mmdet3d:     None
08/28 09:30:11 - mmengine - INFO - mmpose:      None
08/28 09:30:11 - mmengine - INFO - mmrotate:    None
08/28 09:30:11 - mmengine - INFO - mmaction:    None
08/28 09:30:11 - mmengine - INFO - mmrazor:     None
08/28 09:30:11 - mmengine - INFO - mmyolo:      None

Error traceback

[ERROR][2023-08-28 09:35:33.983][resize.cu:1202] CUDA error: invalid resource handle
Aborted (core dumped)

irexyc · 2023-08-29T11:49:58Z

@ChrisKong93

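There is a problem with how devices are managed; we will fix it in the next release. For now, you can call cudaSetDevice to bind the thread to the device.

If you use multiple threads and a thread never switches to a different device, binding once is enough.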

pip install cuda-python==11.5

from mmdeploy_runtime import Classifier
import cv2
import numpy as np
from cuda import cudart

img = cv2.imread('/root/workspace/mmpretrain/demo/demo.JPEG')

model = []
for i in range(2):
    model.append(Classifier('/root/workspace/mmdeploy/work-dir/ort', 'cuda', i))

while True:
    for i in range(2):
        cudart.cudaSetDevice(i)
        res = model[i](img)
        print(res)
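The multi-threaded case mentioned above (each thread stays on one device, so it binds once) might look roughly like the sketch below. This is only an illustration, not the SDK's own API: `run_on_devices`, `models`, and `set_device` are hypothetical names, and the device-binding call is passed in as a parameter so the pattern can be read without a GPU. Whether this is sufficient for a given SDK version is not confirmed in this thread.

```python
import threading
from typing import Any, Callable, Dict, List


def run_on_devices(
    models: Dict[int, Callable[[Any], Any]],
    set_device: Callable[[int], Any],
    inputs: List[Any],
) -> Dict[int, List[Any]]:
    """Run all inputs on every device, using one worker thread per device.

    `models` maps a device id to an already-created model for that device.
    `set_device` is the call that binds the calling thread to a device
    (assumed here to be something like cudart.cudaSetDevice).
    """
    results: Dict[int, List[Any]] = {device_id: [] for device_id in models}

    def worker(device_id: int) -> None:
        # Bind this thread to its device once; it never switches afterwards.
        set_device(device_id)
        model = models[device_id]
        for item in inputs:
            # Each thread appends only to its own list, so no lock is needed.
            results[device_id].append(model(item))

    threads = [
        threading.Thread(target=worker, args=(device_id,))
        for device_id in models
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

With the snippet above, a call could then be `run_on_devices({i: model[i] for i in range(2)}, cudart.cudaSetDevice, [img])`. Note that exceptions raised inside a worker thread are not propagated by this sketch.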

ChrisKong93 · 2023-08-29T13:04:38Z

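OK, thanks. I tried this method and it resolved the problem. It gives the result I wanted, but I am not sure whether it is the correct way to fix it.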

github-actions · 2023-09-19T01:49:22Z

This issue is closed because it has been stale for 5 days. Please open a new issue if you have similar issues or you have any new updates now.

RunningLeon assigned irexyc Aug 28, 2023

RunningLeon added the SDK label Aug 28, 2023

irexyc mentioned this issue Sep 7, 2023

Fix sdk error for multi-gpu execution #2411

Merged

RunningLeon added the Stale label Sep 13, 2023

github-actions bot closed this as not planned Sep 19, 2023
