[BUG] VRAM is wasted when running Lammps with multiple GPUs #4171

Closed
Entropy-Enthalpy opened this issue Sep 30, 2024 · 8 comments · Fixed by #4172 or #4261


Entropy-Enthalpy commented Sep 30, 2024

Bug summary

I have been using DP for a long time, and every version I have used shows this issue: when running a LAMMPS MD simulation on multiple GPUs via mpirun, each MPI rank consumes VRAM on all of the GPUs, even though the computation of each MPI rank actually runs on only one GPU.

For example, in the picture below, I requested 4 V100-SXM2-16GB GPUs for a single MD job and started 4 MPI ranks. As a result, each GPU has (4-1)*0.3 = 0.9 GiB of VRAM "wasted". For an 8-GPU job, this would "waste" (8-1)*0.3 = 2.1 GiB of VRAM, and if MPS is used, the "wasted" VRAM is doubled.

(image: nvidia-smi screenshot of the 4-GPU job)

On the surface, this issue seems to arise because the TensorFlow gpu_device runtime performs a "create device" operation for every GPU in every MPI rank (as can be seen in the logs), but I don't know how to avoid it. Notably, TensorFlow "can't see" GPUs on other nodes, so when running LAMMPS MD across multiple nodes with only one GPU per node, the issue does not occur.
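
For illustration only, one direction that could be tried (I have not verified it, and it is not necessarily what the eventual fix does) is to restrict TensorFlow's visible device list per rank before the session is created. Below is a minimal sketch of that idea, assuming each rank already knows which GPU index it should use; the `options_for_rank` helper and `gpu_rank` parameter are hypothetical names.

```cpp
#include <string>

#include "tensorflow/core/public/session.h"
#include "tensorflow/core/public/session_options.h"

// Sketch: build SessionOptions that expose only this rank's GPU to TensorFlow,
// so the runtime never creates (and reserves VRAM on) the other devices.
tensorflow::SessionOptions options_for_rank(int gpu_rank) {
  tensorflow::SessionOptions options;
  // GPUOptions.visible_device_list is a comma-separated list of physical GPU
  // indices; listing a single index hides all other GPUs from this session.
  options.config.mutable_gpu_options()->set_visible_device_list(
      std::to_string(gpu_rank));
  return options;
}
```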

DeePMD-kit Version

3.0.0b4

Backend and its version

TensorFlow v2.15.2, LAMMPS 29Aug2024

How did you download the software?

Offline packages

Input Files, Running Commands, Error Log, etc.

Running Commands:
mpirun -np 4 lmp_mpi -in input.lammps

Part of Log:

...
2024-10-01 03:13:12.619343: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 14529 MB memory:  -> device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:84:00.0, compute capability: 7.0
2024-10-01 03:13:12.620016: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 14529 MB memory:  -> device: 1, name: Tesla V100-SXM2-16GB, pci bus id: 0000:85:00.0, compute capability: 7.0
2024-10-01 03:13:12.620570: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /job:localhost/replica:0/task:0/device:GPU:2 with 14529 MB memory:  -> device: 2, name: Tesla V100-SXM2-16GB, pci bus id: 0000:c4:00.0, compute capability: 7.0
2024-10-01 03:13:12.621108: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /job:localhost/replica:0/task:0/device:GPU:3 with 14529 MB memory:  -> device: 3, name: Tesla V100-SXM2-16GB, pci bus id: 0000:c5:00.0, compute capability: 7.0
2024-10-01 03:13:12.640945: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 14529 MB memory:  -> device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:84:00.0, compute capability: 7.0
2024-10-01 03:13:12.641605: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 14529 MB memory:  -> device: 1, name: Tesla V100-SXM2-16GB, pci bus id: 0000:85:00.0, compute capability: 7.0
2024-10-01 03:13:12.642124: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /job:localhost/replica:0/task:0/device:GPU:2 with 14529 MB memory:  -> device: 2, name: Tesla V100-SXM2-16GB, pci bus id: 0000:c4:00.0, compute capability: 7.0
2024-10-01 03:13:12.642635: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /job:localhost/replica:0/task:0/device:GPU:3 with 14529 MB memory:  -> device: 3, name: Tesla V100-SXM2-16GB, pci bus id: 0000:c5:00.0, compute capability: 7.0
2024-10-01 03:13:12.659556: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 14529 MB memory:  -> device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:84:00.0, compute capability: 7.0
2024-10-01 03:13:12.660457: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 14529 MB memory:  -> device: 1, name: Tesla V100-SXM2-16GB, pci bus id: 0000:85:00.0, compute capability: 7.0
2024-10-01 03:13:12.661253: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 14529 MB memory:  -> device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:84:00.0, compute capability: 7.0
2024-10-01 03:13:12.661270: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /job:localhost/replica:0/task:0/device:GPU:2 with 14529 MB memory:  -> device: 2, name: Tesla V100-SXM2-16GB, pci bus id: 0000:c4:00.0, compute capability: 7.0
2024-10-01 03:13:12.662060: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 14529 MB memory:  -> device: 1, name: Tesla V100-SXM2-16GB, pci bus id: 0000:85:00.0, compute capability: 7.0
2024-10-01 03:13:12.662095: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /job:localhost/replica:0/task:0/device:GPU:3 with 14529 MB memory:  -> device: 3, name: Tesla V100-SXM2-16GB, pci bus id: 0000:c5:00.0, compute capability: 7.0
2024-10-01 03:13:12.662639: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /job:localhost/replica:0/task:0/device:GPU:2 with 14529 MB memory:  -> device: 2, name: Tesla V100-SXM2-16GB, pci bus id: 0000:c4:00.0, compute capability: 7.0
2024-10-01 03:13:12.663289: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /job:localhost/replica:0/task:0/device:GPU:3 with 14529 MB memory:  -> device: 3, name: Tesla V100-SXM2-16GB, pci bus id: 0000:c5:00.0, compute capability: 7.0
...

Steps to Reproduce

N/A

Further Information, Files, and Links

No response

njzjz added a commit to njzjz/deepmd-kit that referenced this issue Sep 30, 2024
Fix deepmodeling#4171.

Signed-off-by: Jinzhe Zeng <jinzhe.zeng@rutgers.edu>
njzjz linked a pull request Sep 30, 2024 that will close this issue
github-merge-queue bot pushed a commit that referenced this issue Oct 6, 2024
Fix #4171.

## Summary by CodeRabbit

- **New Features**
  - Enhanced GPU selection logic for improved resource management.
  - Added support for single-frame and multi-frame computations with new parameters for atom energy and virial calculations.
  - Extended functionality for mixed-type computations in the model.

- **Bug Fixes**
  - Improved error handling during initialization and model execution.
  - Added output tensor dimension validations to ensure expected structures are maintained.

- **Documentation**
  - Clarified output tensor validation to ensure expected dimensions are maintained.

---------

Signed-off-by: Jinzhe Zeng <jinzhe.zeng@rutgers.edu>
njzjz closed this as completed Oct 7, 2024
Entropy-Enthalpy (Author) commented

I found a similar issue with the PyTorch backend, but only GPU 0's VRAM was "wasted".

For an 8-GPU job, it looks like this:
(image: nvidia-smi screenshot of the 8-GPU job)

DeePMD-kit Version

source:             v3.0.0b4-17-g8174cf11
source branch:      devel
source commit:      8174cf11
source commit at:   2024-10-11 03:20:55 +0000

LAMMPS version

LAMMPS 29Aug2024 update1

Backend stack

PyTorch 2.4.1
cuDNN 9.3.0
NVHPC 24.5 (nompi)
OpenMPI 5.0.5 (CUDA-Aware)
UCX 1.17.0 (CUDA + GDRCopy)

njzjz reopened this Oct 12, 2024

njzjz commented Oct 13, 2024

For PyTorch, I guess c10::cuda::set_device should work. This API is not documented, though.

related discussion: https://discuss.pytorch.org/t/cuda-extension-with-multiple-gpus/160053/6

Entropy-Enthalpy (Author) commented

> For PyTorch, I guess c10::cuda::set_device should work. This API is not documented, though.
>
> related discussion: https://discuss.pytorch.org/t/cuda-extension-with-multiple-gpus/160053/6

As a user, I just know that source/api_cc/src/DeepPotPT.cc might need to be modified, but I don't know how... 🥺

njzjz added a commit to njzjz/deepmd-kit that referenced this issue Oct 26, 2024
Fix deepmodeling#4171.

Signed-off-by: Jinzhe Zeng <jinzhe.zeng@rutgers.edu>
njzjz linked a pull request Oct 26, 2024 that will close this issue

njzjz commented Oct 26, 2024

@Entropy-Enthalpy Please tell me if #4261 works.


Entropy-Enthalpy commented Oct 26, 2024

> @Entropy-Enthalpy Please tell me if #4261 works.

The compiler threw an error:

/shared_apps/deepmd-kit/test/deepmd-kit/source/api_cc/src/DeepPotPT.cc: In member function ‘virtual void deepmd::DeepPotPT::init(const string&, const int&, const string&)’:
/shared_apps/deepmd-kit/test/deepmd-kit/source/api_cc/src/DeepPotPT.cc:83:16: error: ‘set_device’ is not a member of ‘c10::cuda’; did you mean ‘_set_device’?
   83 |     c10::cuda::set_device(gpu_id);
      |                ^~~~~~~~~~
      |                _set_device

Update:

I fixed that by adding #include <c10/cuda/CUDAFunctions.h>, and then it works.
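
In other words, the combination that compiles and works for me is roughly the following (a minimal sketch; `select_gpu_for_this_rank` is just an illustrative wrapper around the call shown in the compiler error, not the exact patch):

```cpp
#include <c10/cuda/CUDAFunctions.h>  // declares c10::cuda::set_device

// Sketch: call this once per MPI rank, before any model or tensor is created,
// so every subsequent CUDA allocation in this process lands on its own GPU.
void select_gpu_for_this_rank(int gpu_id) {
  c10::cuda::set_device(static_cast<c10::DeviceIndex>(gpu_id));
}
```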

Looking at this DPA-2 MD case, it seems the VRAM issue has been resolved:

Sun Oct 27 05:00:21 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla V100-SXM2-16GB           On  |   00000000:85:00.0 Off |                    0 |
| N/A   47C    P0            140W /  300W |    7546MiB /  16384MiB |     72%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  Tesla V100-SXM2-16GB           On  |   00000000:C4:00.0 Off |                    0 |
| N/A   50C    P0            158W /  300W |    7176MiB /  16384MiB |     82%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A    809059      C   lmp                                          7542MiB |
|    1   N/A  N/A    809060      C   lmp                                          7172MiB |
+-----------------------------------------------------------------------------------------+


njzjz commented Oct 26, 2024

> The compiler threw an error:

This is not the latest commit...

Entropy-Enthalpy (Author) commented

> The compiler threw an error:
>
> This is not the latest commit...

I'm sorry. I just tested the latest commit 9bee6f4, and it also works:

Sun Oct 27 05:15:36 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla V100-SXM2-16GB           On  |   00000000:05:00.0 Off |                    0 |
| N/A   42C    P0            138W /  300W |    4128MiB /  16384MiB |     74%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  Tesla V100-SXM2-16GB           On  |   00000000:85:00.0 Off |                    0 |
| N/A   41C    P0            125W /  300W |    3164MiB /  16384MiB |     63%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  Tesla V100-SXM2-16GB           On  |   00000000:C4:00.0 Off |                    0 |
| N/A   42C    P0            138W /  300W |    3826MiB /  16384MiB |     71%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A    816231      C   lmp                                          4124MiB |
|    1   N/A  N/A    816232      C   lmp                                          3160MiB |
|    2   N/A  N/A    816233      C   lmp                                          3822MiB |
+-----------------------------------------------------------------------------------------+


njzjz commented Oct 26, 2024

> I fixed that by adding #include <c10/cuda/CUDAFunctions.h>, and then it works.

I didn't use this resolution because that header appears to be unavailable in the CPU version of libtorch (including it would throw an error), and there is no way to check at compile time whether it is available.
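
A possible alternative, sketched here only as an illustration of the idea and not necessarily what #4261 ends up doing, is to call the CUDA runtime directly and guard it with the project's GPU build flag, so CPU-only libtorch builds never need the header (`USE_CUDA_TOOLKIT` stands in for whatever macro the GPU build defines):

```cpp
#ifdef USE_CUDA_TOOLKIT
#include <cuda_runtime.h>  // cudaSetDevice
#endif

// Sketch: bind this process to its assigned GPU without including the libtorch
// CUDA headers, which are absent from CPU-only builds.
inline void select_gpu_for_this_rank(int gpu_id) {
#ifdef USE_CUDA_TOOLKIT
  cudaSetDevice(gpu_id);
#else
  (void)gpu_id;  // CPU-only build: nothing to select
#endif
}
```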

github-merge-queue bot pushed a commit that referenced this issue Oct 28, 2024
Fix #4171.

## Summary by CodeRabbit

- **New Features**
  - Improved GPU initialization to ensure the correct device is utilized.
  - Enhanced error handling for clearer context on exceptions.

- **Bug Fixes**
  - Updated error handling in multiple methods to catch and rethrow specific exceptions.
  - Added logic to handle communication-related tensors during computation.

---------

Signed-off-by: Jinzhe Zeng <jinzhe.zeng@rutgers.edu>
njzjz closed this as completed Oct 28, 2024