[BUG] VRAM is wasted when running LAMMPS with multiple GPUs #4171
Fix deepmodeling#4171. Signed-off-by: Jinzhe Zeng <jinzhe.zeng@rutgers.edu>
Fix #4171.

Summary by CodeRabbit:
- **New Features**
  - Enhanced GPU selection logic for improved resource management.
  - Added support for single-frame and multi-frame computations with new parameters for atom energy and virial calculations.
  - Extended functionality for mixed-type computations in the model.
- **Bug Fixes**
  - Improved error handling during initialization and model execution.
  - Added output tensor dimension validations to ensure expected structures are maintained.
- **Documentation**
  - Clarified output tensor validation to ensure expected dimensions are maintained.

Signed-off-by: Jinzhe Zeng <jinzhe.zeng@rutgers.edu>
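The release notes above mention enhanced GPU selection logic. As a rough, hedged sketch of the general idea only (not the code from this PR; the function name `select_device_for_rank` and the rank-to-GPU mapping are assumptions), choosing one device per node-local MPI rank could look like this:

```cpp
// Hypothetical sketch of per-rank GPU selection; not the code from the PR.
// Assumes MPI and the CUDA runtime are available at build time.
#include <mpi.h>
#include <cuda_runtime.h>

int select_device_for_rank(MPI_Comm comm) {
  // Ranks that share a node get consecutive indices in a node-local communicator.
  MPI_Comm local_comm;
  MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &local_comm);
  int local_rank = 0;
  MPI_Comm_rank(local_comm, &local_rank);
  MPI_Comm_free(&local_comm);

  int n_devices = 0;
  cudaGetDeviceCount(&n_devices);
  if (n_devices == 0) {
    return -1;  // CPU-only run; nothing to select
  }
  // Map each node-local rank to one GPU and make it this process's current
  // CUDA device; the backend still needs to be pointed at the same device.
  int device_id = local_rank % n_devices;
  cudaSetDevice(device_id);
  return device_id;
}
```

Binding by node-local rather than global rank matters because, as the bug report below notes, each process can only see the GPUs of its own node.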
I found a similar issue with the PyTorch backend:
- LAMMPS version: 29Aug2024 update1
- Backend stack: PyTorch 2.4.1
For PyTorch, I guess this is a related discussion: https://discuss.pytorch.org/t/cuda-extension-with-multiple-gpus/160053/6
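In the spirit of that discussion, a minimal libtorch sketch (the helper name `load_on_device` is invented; this is not DeePMD-kit's actual code) of loading a TorchScript model onto an explicitly chosen device instead of letting everything default to cuda:0:

```cpp
// Minimal sketch, not DeePMD-kit's actual code: deserialize the model
// directly onto the device chosen for this rank so nothing defaults to cuda:0.
#include <string>
#include <torch/cuda.h>
#include <torch/script.h>

torch::jit::script::Module load_on_device(const std::string& model_path, int gpu_id) {
  // Fall back to the CPU when no GPU was assigned or CUDA is unavailable.
  const torch::Device device =
      (gpu_id >= 0 && torch::cuda::is_available())
          ? torch::Device(torch::kCUDA, static_cast<c10::DeviceIndex>(gpu_id))
          : torch::Device(torch::kCPU);
  // torch::jit::load accepts a target device, so parameters and buffers are
  // placed there during deserialization; inputs must be moved to the same
  // device before calling forward().
  return torch::jit::load(model_path, device);
}
```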
As a user, I just know that …
@Entropy-Enthalpy Please tell me if #4261 works.
The compiler threw an error:
Update: I fixed that by …
Looking at this DPA-2 MD case, it seems the VRAM issue has been settled:
This is not the latest commit...
I'm sorry. I just tested the latest commit 9bee6f4, and it also works:
I don't use this resolution because the file seems to be unavailable in the CPU version of libtorch (including it throws an error), and there is no way to check whether it is available.
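For illustration of the constraint described here only: a compile-time guard would need a flag exported by the build system. `DP_LIBTORCH_HAS_CUDA` is an invented macro, `c10/cuda/CUDAGuard.h` is merely an example of a CUDA-only header, and this is not what the PR does; as noted above, no reliable detection mechanism is currently available.

```cpp
// Sketch only. DP_LIBTORCH_HAS_CUDA is a hypothetical macro a build system
// might define when a CUDA build of libtorch is detected; c10/cuda/CUDAGuard.h
// is just an example of a header shipped only in that build.
#ifdef DP_LIBTORCH_HAS_CUDA
#include <c10/cuda/CUDAGuard.h>
#endif

void run_on_selected_device(int gpu_id) {
#ifdef DP_LIBTORCH_HAS_CUDA
  // Scoped guard: switches the current CUDA device and restores it on exit.
  c10::cuda::CUDAGuard guard(static_cast<c10::DeviceIndex>(gpu_id));
#else
  (void)gpu_id;  // CPU-only build: nothing to select
#endif
  // ... run the model ...
}
```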
Fix #4171.

Summary by CodeRabbit:
- **New Features**
  - Improved GPU initialization to ensure the correct device is utilized.
  - Enhanced error handling for clearer context on exceptions.
- **Bug Fixes**
  - Updated error handling in multiple methods to catch and rethrow specific exceptions.
  - Added logic to handle communication-related tensors during computation.

Signed-off-by: Jinzhe Zeng <jinzhe.zeng@rutgers.edu>
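As a rough illustration of the error-handling items above (the wrapper name and message prefix are invented; this is not the PR's actual code), catching a backend exception and rethrowing it with clearer context might look like:

```cpp
// Illustrative sketch only: surface libtorch errors to the caller with a
// clearer prefix instead of letting an opaque c10::Error propagate.
#include <stdexcept>
#include <string>
#include <c10/util/Exception.h>

template <typename Fn>
auto with_backend_error_context(Fn&& fn) -> decltype(fn()) {
  try {
    return fn();  // e.g. the forward pass of the model
  } catch (const c10::Error& e) {
    throw std::runtime_error(std::string("DeePMD-kit backend error: ") + e.what());
  }
}
```

A wrapper of this kind can be applied around each compute method so the caller (e.g. the LAMMPS pair style) sees the message rather than an unexplained failure.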
Bug summary
I have been using DP for a long time, and in every version I have used I have encountered this issue: when running a LAMMPS MD simulation on multiple GPUs via `mpirun`, each MPI rank consumes VRAM on all GPUs, even though the computation of each rank actually runs on only one GPU.

For example, in the picture below, I requested 4 V100-SXM2-16GB GPUs for a single MD job and started 4 MPI ranks. In reality, each GPU has (4-1)*0.3 = 0.9 GiB of VRAM "wasted". For an 8-GPU job, this would "waste" (8-1)*0.3 = 2.1 GiB of VRAM per GPU. If MPS is used, the "wasted" VRAM would be doubled.
On the surface, it seems that this issue arises because the TensorFlow `gpu_device` runtime executes a "create device" operation for each GPU in every MPI rank (as can be seen in the logs), but I don't know how to avoid this problem (see the sketch at the end of this report for one way to restrict the visible devices per rank). It is noteworthy that TensorFlow "can't see" the GPUs on other nodes, so when running LAMMPS MD across multiple nodes with only one GPU per node, there is no such issue.

DeePMD-kit Version
3.0.0b4
Backend and its version
TensorFlow v2.15.2, LAMMPS 29Aug2024
How did you download the software?
Offline packages
Input Files, Running Commands, Error Log, etc.
Running Commands:
mpirun -np 4 lmp_mpi -in input.lammps
Part of Log:
Steps to Reproduce
N/A
Further Information, Files, and Links
No response
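As referenced in the bug summary above, a minimal sketch of the usual mitigation, assuming the TensorFlow C++ session API; the function name `new_single_gpu_session` and its parameters are invented here, and this is not DeePMD-kit's actual initialization code. It limits one MPI rank's session to a single visible GPU so the runtime does not create a device (and reserve VRAM) on every card it can see:

```cpp
// Minimal sketch, not DeePMD-kit's actual session setup: expose exactly one
// physical GPU to this rank's TensorFlow session.
#include <stdexcept>
#include <string>
#include <tensorflow/core/public/session.h>

tensorflow::Session* new_single_gpu_session(int local_rank, int gpus_per_node) {
  tensorflow::SessionOptions options;
  auto* gpu_options = options.config.mutable_gpu_options();
  // Only the listed physical device is visible to this session, so no
  // per-device scratch memory is reserved on the other GPUs.
  gpu_options->set_visible_device_list(std::to_string(local_rank % gpus_per_node));
  // Allocate VRAM on demand instead of grabbing a large pool up front.
  gpu_options->set_allow_growth(true);

  tensorflow::Session* session = nullptr;
  tensorflow::Status status = tensorflow::NewSession(options, &session);
  if (!status.ok()) {
    throw std::runtime_error(status.ToString());
  }
  return session;
}
```

With the visible device list set per rank (for example from the node-local rank, as in the MPI sketch earlier in this thread), each process would only ever initialize one GPU.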