
Disable cudaDeviceReset() calls in pmcx #188

Merged: 1 commit into fangq:master, Sep 28, 2023

Conversation

lkeegan (Contributor) commented Sep 28, 2023

  • If MCX_DISABLE_CUDA_DEVICE_RESET is defined, the cudaDeviceReset() calls in mcx_core.cu are skipped (a sketch of the guard is shown below)
  • cudaDeviceReset() destroys all resources on the GPU associated with the current process
  • for pmcx this can leave other Python GPU libraries, such as PyTorch, in a broken state
  • resolves "Pytorch seems to be broken after calling pmcx.run()" #187
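A minimal sketch of how such a guard could look at each call site in mcx_core.cu (illustrative only, not the exact patch; any surrounding error-checking macro is omitted):

```c
/* Illustrative only: guard each cudaDeviceReset() call site so that embedding
 * hosts such as pmcx can compile with -DMCX_DISABLE_CUDA_DEVICE_RESET and
 * keep the process-wide CUDA context (and other libraries' GPU state) intact. */
#ifndef MCX_DISABLE_CUDA_DEVICE_RESET
    cudaDeviceReset();   /* destroys all GPU resources owned by this process */
#endif
```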

@fangq fangq merged commit b6e84b6 into fangq:master Sep 28, 2023
25 checks passed
fangq added a commit that referenced this pull request Sep 28, 2023
@lkeegan lkeegan deleted the remove_reset_cuda_device_calls branch September 28, 2023 12:20
fangq (Owner) commented Sep 28, 2023

Thanks a lot. The cudaDeviceReset() call was originally added for mcxlab (and then inherited by pmcx) to handle exceptions thrown from the GPU kernel: when the GPU host/kernel code throws a CUDA error, we found that a new session could not be launched because the GPU was still occupied by the previous run. cudaDeviceReset() was a quick fix for this, but I see it won't play nicely with PyTorch, which relies on persistent GPU memory.

I don't know whether the GPU lockup issue will reappear after this patch; we can run more tests to find out.

lkeegan (Contributor, Author) commented Sep 29, 2023

So in the event of a CUDA error, you had to call cudaDeviceReset() before the process terminated to be able to use the GPU again in a new process?

If so, and this issue remains, I guess the equivalent of this for pmcx would be to call cudaDeviceReset() just before the Python process terminates.

I think one way to do this would be for pmcx to register a function with atexit that calls e.g. _pmcx.reset_device(). This would then run (at least in most cases) when the Python process terminates.
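A minimal sketch of that idea, assuming a hypothetical `_pmcx.reset_device()` binding that simply wraps `cudaDeviceReset()` (the module and function names here are placeholders, not an existing pmcx API):

```python
import atexit

def _reset_gpu_on_exit():
    """Release this process's CUDA context right before the interpreter exits."""
    try:
        import _pmcx                # hypothetical extension module name
        _pmcx.reset_device()        # hypothetical binding around cudaDeviceReset()
    except Exception:
        pass                        # never let GPU cleanup block interpreter shutdown

# registered once, e.g. at pmcx import time
atexit.register(_reset_gpu_on_exit)
```

Note that atexit handlers are skipped when the interpreter is killed or crashes hard, so this covers most, but not all, exit paths.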
