fix(pt): improve out-of-memory handling (#3836)

## Summary by CodeRabbit

- **Bug Fixes**
  - Improved handling of out-of-memory errors by including "CUSOLVER_STATUS_INTERNAL_ERROR" and releasing cached memory to prevent crashes.

---------

Signed-off-by: Jinzhe Zeng <jinzhe.zeng@rutgers.edu>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Han Wang <92130845+wanghan-iapcm@users.noreply.github.com>
deepmodeling · May 30, 2024 · 84b711e
1 parent 710cad3
Showing 1 changed file with 11 additions and 1 deletion.
diff --git a/deepmd/pt/utils/auto_batch_size.py b/deepmd/pt/utils/auto_batch_size.py
@@ -52,7 +52,17 @@ def is_oom_error(self, e: Exception) -> bool:
         e : Exception
             Exception
         """
-        return isinstance(e, RuntimeError) and "CUDA out of memory." in e.args[0]
+        # several sources think CUSOLVER_STATUS_INTERNAL_ERROR is another out-of-memory error,
+        # such as https://github.com/JuliaGPU/CUDA.jl/issues/1924
+        # (the meaningless error message should be considered as a bug in cusolver)
+        if isinstance(e, RuntimeError) and (
+            "CUDA out of memory." in e.args[0]
+            or "cusolver error: CUSOLVER_STATUS_INTERNAL_ERROR" in e.args[0]
+        ):
+            # Release all unoccupied cached memory
+            torch.cuda.empty_cache()
+            return True
+        return False
 
     def execute_all(
         self, callable: Callable, total_size: int, natoms: int, *args, **kwargs
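
The check in the hunk above can be read as part of a retry loop: when an error is classified as out-of-memory, the caller shrinks the batch and tries again. The following is a minimal sketch of that idea, not the project's actual scheduler. The names `OOM_MARKERS`, `run_with_shrinking_batch`, and the halving policy are assumptions made for illustration; only the two message substrings and the `torch.cuda.empty_cache()` call come from the diff. The guard on `e.args` and the optional `torch` import are also additions, so the sketch runs on machines without a GPU.

```python
from typing import Callable

try:  # torch is optional here so the sketch also runs without it installed
    import torch
except ImportError:
    torch = None

# Substrings treated as out-of-memory signals, taken from the diff above.
OOM_MARKERS = (
    "CUDA out of memory.",
    "cusolver error: CUSOLVER_STATUS_INTERNAL_ERROR",
)


def is_oom_error(e: Exception) -> bool:
    """Standalone variant of the check in the diff (illustrative helper)."""
    # Assumption: also guard against exceptions raised without arguments.
    if not isinstance(e, RuntimeError) or not e.args:
        return False
    message = str(e.args[0])
    if any(marker in message for marker in OOM_MARKERS):
        if torch is not None and torch.cuda.is_available():
            # Release all unoccupied cached memory before retrying.
            torch.cuda.empty_cache()
        return True
    return False


def run_with_shrinking_batch(fn: Callable[[int], int], batch_size: int) -> int:
    """Call fn(batch_size), halving batch_size after each OOM error.

    Hypothetical retry policy, chosen only to show where is_oom_error fits.
    """
    while batch_size >= 1:
        try:
            return fn(batch_size)
        except Exception as e:
            if not is_oom_error(e):
                raise  # unrelated errors propagate unchanged
            batch_size //= 2
    raise RuntimeError("out of memory even at batch size 1")
```

Matching on message substrings is fragile by nature, which is presumably why the diff's comment calls the cuSOLVER message a bug; the sketch keeps the same approach so that it stays comparable to the committed code.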