
Disable cudaDeviceReset() calls in pmcx #188

Merged: 1 commit into fangq:master, Sep 28, 2023

Conversation

lkeegan (Contributor) commented Sep 28, 2023

  • If MCX_DISABLE_CUDA_DEVICE_RESET is defined, the cudaDeviceReset() calls in mcx_core.cu are skipped (a sketch of the guard is shown below)
  • cudaDeviceReset() destroys all resources on the GPU associated with the current process
  • for pmcx this can leave other Python GPU libraries, such as PyTorch, in a broken state
  • resolves "Pytorch seems to be broken after calling pmcx.run()" #187
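A minimal sketch of how such a guard could look at each call site in mcx_core.cu (illustrative only, not the exact patch; any surrounding error-checking macro is omitted):

```c
/* Illustrative only: guard each cudaDeviceReset() call site so that embedding
 * hosts such as pmcx can compile with -DMCX_DISABLE_CUDA_DEVICE_RESET and
 * keep the process-wide CUDA context (and other libraries' GPU state) intact. */
#ifndef MCX_DISABLE_CUDA_DEVICE_RESET
    cudaDeviceReset();   /* destroys all GPU resources owned by this process */
#endif
```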

@fangq fangq merged commit b6e84b6 into fangq:master Sep 28, 2023
25 checks passed
fangq added a commit that referenced this pull request Sep 28, 2023
@lkeegan lkeegan deleted the remove_reset_cuda_device_calls branch September 28, 2023 12:20
fangq (Owner) commented Sep 28, 2023

Thanks a lot. The cudaDeviceReset() call was originally added for mcxlab (and then inherited by pmcx) to handle exceptions thrown from the GPU kernel: when the GPU host/kernel code throws a CUDA error, we found that a new session could not be launched because the GPU was still occupied by the previous run. cudaDeviceReset() was a quick fix for this, but I see it won't play nicely with PyTorch, which relies on persistent GPU memory.

I don't know whether the GPU lockup issue will reappear after this patch; we can run more tests to find out.

lkeegan (Contributor, Author) commented Sep 29, 2023

So in the event of a CUDA error, you had to call cudaDeviceReset() before the process terminated to be able to use the GPU again in a new process?

If so, and this issue remains, I guess the equivalent of this for pmcx would be to call cudaDeviceReset() just before the Python process terminates.

I think one way to do this would be for pmcx to register a function with atexit that calls e.g. _pmcx.reset_device(). This would then run (at least in most cases) when the Python process terminates.
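A minimal sketch of that idea, assuming a hypothetical `_pmcx.reset_device()` binding that simply wraps `cudaDeviceReset()` (the module and function names here are placeholders, not an existing pmcx API):

```python
import atexit

def _reset_gpu_on_exit():
    """Release this process's CUDA context right before the interpreter exits."""
    try:
        import _pmcx                # hypothetical extension module name
        _pmcx.reset_device()        # hypothetical binding around cudaDeviceReset()
    except Exception:
        pass                        # never let GPU cleanup block interpreter shutdown

# registered once, e.g. at pmcx import time
atexit.register(_reset_gpu_on_exit)
```

Note that atexit handlers are skipped when the interpreter is killed or crashes hard, so this covers most, but not all, exit paths.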
