fix(pt): improve out-of-memory handling (#3836)
## Summary by CodeRabbit

- **Bug Fixes**
  - Improved handling of out-of-memory errors: the cuSOLVER error
    "CUSOLVER_STATUS_INTERNAL_ERROR" is now also treated as an out-of-memory
    error, and unoccupied cached memory is released when one is detected, to
    prevent crashes.

---------

Signed-off-by: Jinzhe Zeng <jinzhe.zeng@rutgers.edu>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Han Wang <92130845+wanghan-iapcm@users.noreply.github.com>
3 people authored May 30, 2024
1 parent 710cad3 commit 84b711e
Showing 1 changed file with 11 additions and 1 deletion.
12 changes: 11 additions & 1 deletion deepmd/pt/utils/auto_batch_size.py
@@ -52,7 +52,17 @@ def is_oom_error(self, e: Exception) -> bool:
         e : Exception
             Exception
         """
-        return isinstance(e, RuntimeError) and "CUDA out of memory." in e.args[0]
+        # several sources think CUSOLVER_STATUS_INTERNAL_ERROR is another out-of-memory error,
+        # such as https://github.com/JuliaGPU/CUDA.jl/issues/1924
+        # (the meaningless error message should be considered as a bug in cusolver)
+        if isinstance(e, RuntimeError) and (
+            "CUDA out of memory." in e.args[0]
+            or "cusolver error: CUSOLVER_STATUS_INTERNAL_ERROR" in e.args[0]
+        ):
+            # Release all unoccupied cached memory
+            torch.cuda.empty_cache()
+            return True
+        return False

     def execute_all(
         self, callable: Callable, total_size: int, natoms: int, *args, **kwargs
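
The check in the diff above can be sketched as a standalone function, which makes it easy to try without a GPU. This is a minimal illustration, not the DeePMD-kit implementation: `looks_like_oom`, `OOM_MARKERS`, `release_cache`, and `run_with_shrinking_batch` are hypothetical names, and halving the batch size on each failure is an assumption made only for the example. The two marker strings are copied from the diff. With PyTorch installed, `torch.cuda.empty_cache` could be passed as `release_cache`.

```python
from typing import Callable, Optional

# Substrings copied from the diff; both are treated as signs of GPU memory exhaustion.
OOM_MARKERS = (
    "CUDA out of memory.",
    "cusolver error: CUSOLVER_STATUS_INTERNAL_ERROR",
)


def looks_like_oom(
    e: Exception, release_cache: Optional[Callable[[], None]] = None
) -> bool:
    """Return True if `e` is a RuntimeError whose message contains an OOM marker.

    `release_cache` is a hypothetical hook that is called once on a match.
    """
    if not isinstance(e, RuntimeError):
        return False
    # str(e) rather than e.args[0]: this also tolerates an exception with no args.
    message = str(e)
    if any(marker in message for marker in OOM_MARKERS):
        if release_cache is not None:
            release_cache()  # free cached memory before the caller retries
        return True
    return False


def run_with_shrinking_batch(fn, batch_size, release_cache=None):
    """Hypothetical retry loop: halve the batch size after each OOM-like error."""
    while batch_size >= 1:
        try:
            return fn(batch_size), batch_size
        except Exception as e:
            if not looks_like_oom(e, release_cache):
                raise  # unrelated errors are not swallowed
            batch_size //= 2
    raise RuntimeError("batch size reached zero without success")
```

Using `str(e)` instead of `e.args[0]` is a deliberate deviation from the diff: it avoids an `IndexError` when a `RuntimeError` is raised without a message, at the cost of matching against the full string form of the exception.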