
[BUG] CUDA out of memory, when only 1600 atoms, using the pytorch model with spin #3969

Closed
shiruosong opened this issue Jul 12, 2024 · 7 comments · Fixed by #4006
Labels: bug, reproduced (This bug has been reproduced by developers)

@shiruosong (Contributor)

Bug summary

I trained a PyTorch version of the BiFeO3 model with DPSPIN. When I use the model to run a minimization with only 1600 atoms, I get a CUDA out-of-memory error. The machine type is c12_m92_1 * NVIDIA V100.

I had previously run DPLR with 10,000-20,000 atoms without problems, and the plain DP TensorFlow model with even more atoms. For DPSPIN with TensorFlow, 1600 atoms is also far from the limit, but with the DPSPIN PyTorch model this no longer works.

The error is listed below:
Setting up Verlet run ...
Unit style : metal
Current step : 0
Time step : 0.0001
terminate called after throwing an instance of 'std::runtime_error'
what(): The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
File "code/torch/deepmd/pt/model/model/transform_output.py", line 154, in forward_lower
vvi = split_vv1[_44]
svvi = split_svv1[_44]
_45 = _36(vvi, svvi, coord_ext, do_virial, do_atomic_virial, )
~~~ <--- HERE
ffi, aviri, = _45
ffi0 = torch.unsqueeze(ffi, -2)
File "code/torch/deepmd/pt/model/model/transform_output.py", line 201, in task_deriv_one
extended_virial0 = torch.matmul(_53, torch.unsqueeze(extended_coord, -2))
if do_atomic_virial:
extended_virial_corr = _50(extended_coord, atom_energy, )
~~~ <--- HERE
extended_virial2 = torch.add(extended_virial0, extended_virial_corr)
extended_virial1 = extended_virial2
File "code/torch/deepmd/pt/model/model/transform_output.py", line 234, in atomic_virial_corr
ops.prim.RaiseException("AssertionError: ")
extended_virial_corr00 = _55
_61 = torch.autograd.grad([sumce1], [extended_coord], lst, None, True)
~~~~~~~~~~~~~~~~~~~ <--- HERE
extended_virial_corr1 = _61[0]
_62 = torch.isnot(extended_virial_corr1, None)

Traceback of TorchScript, original code (most recent call last):
File "/opt/mamba/envs/DeepSpin_devel/lib/python3.9/site-packages/deepmd/pt/model/model/transform_output.py", line 120, in forward_lower
for vvi, svvi in zip(split_vv1, split_svv1):
# nf x nloc x 3, nf x nloc x 9
ffi, aviri = task_deriv_one(
~~~~~~~~~~~~~~ <--- HERE
vvi,
svvi,
File "/opt/mamba/envs/DeepSpin_devel/lib/python3.9/site-packages/deepmd/pt/model/model/transform_output.py", line 76, in task_deriv_one
# the correction sums to zero, which does not contribute to global virial
if do_atomic_virial:
extended_virial_corr = atomic_virial_corr(extended_coord, atom_energy)
~~~~~~~~~~~~~~~~~~ <--- HERE
extended_virial = extended_virial + extended_virial_corr
# to [...,3,3] -> [...,9]
File "/opt/mamba/envs/DeepSpin_devel/lib/python3.9/site-packages/deepmd/pt/model/model/transform_output.py", line 39, in atomic_virial_corr
)[0]
assert extended_virial_corr0 is not None
extended_virial_corr1 = torch.autograd.grad(
~~~~~~~~~~~~~~~~~~~ <--- HERE
[sumce1], [extended_coord], grad_outputs=lst, create_graph=True
)[0]
RuntimeError: CUDA out of memory. Tried to allocate 220.00 MiB. GPU 0 has a total capacty of 31.74 GiB of which 202.12 MiB is free. Process 19403 has 31.54 GiB memory in use. Of the allocated memory 30.12 GiB is allocated by PyTorch, and 425.75 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

DeePMD-kit Version

DeePMD-kit v3.0.0a1.dev107+ga26b6803.d20240430

Backend and its version

torch v2.1.0+cu118

How did you download the software?

Offline packages

Input Files, Running Commands, Error Log, etc.

test.zip

Steps to Reproduce

lmp_mpi -i input.lammps

Further Information, Files, and Links

No response

@iProzd (Collaborator) commented Jul 18, 2024

Hi @shiruosong, the difference in the atomic virial calculation implementations of the LAMMPS interface between PyTorch and TensorFlow may be causing the increased memory usage. Should we consider disabling do_atomic_virial in the PyTorch LAMMPS interface by default? @njzjz @wanghan-iapcm
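
A minimal sketch of what that default would mean on the caller side (the forward_lower method and the do_atomic_virial flag appear in the traceback above; the argument names and model path here are assumed for illustration, not taken from the actual LAMMPS plugin):

import torch

# Placeholder path; in practice this is the TorchScript model loaded by the pair style.
model = torch.jit.load("frozen_model.pth", map_location="cuda")

def evaluate(coord_ext, atype_ext, nlist):
    # Request only energy, forces and the global virial. With
    # do_atomic_virial=False, the atomic_virial_corr branch (an extra
    # autograd pass over extended_coord) is never built.
    return model.forward_lower(
        extended_coord=coord_ext,
        extended_atype=atype_ext,
        nlist=nlist,
        do_atomic_virial=False,
    )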

@shiruosong (Contributor, Author)

The output of the virial should already be turned off; at least according to the LAMMPS results, the virial is always 0.

But the model is still calculating the virial, and that costs a lot of memory, right?

@wanghan-iapcm (Collaborator)

> Hi @shiruosong, the difference in the atomic virial calculation implementations of the LAMMPS interface between PyTorch and TensorFlow may be causing the increased memory usage. Should we consider disabling do_atomic_virial in the PyTorch LAMMPS interface by default? @njzjz @wanghan-iapcm

In this case, we need to revise the C++ interface to pass in the information of whether the user needs the atomic virial. @njzjz, what do you think?

@njzjz (Member) commented Jul 18, 2024

> the difference in the atomic virial calculation implementations of the LAMMPS interface between PyTorch and TensorFlow may be causing the increased memory usage

Do you have evidence to support this? For example, the memory usage with and without the atomic virial.

> In this case, we need to revise the C++ interface to pass in the information of whether the user needs the atomic virial. @njzjz, what do you think?

Should we revert #3145? Or add request_deriv, as in DeepTensor?
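
For comparison, the Python inference API already exposes a per-call switch of this kind; a rough sketch, assuming the standard DeepPot.eval signature (model file and system are placeholders, and the spin variant may differ):

import numpy as np
from deepmd.infer import DeepPot

dp = DeepPot("frozen_model.pth")        # placeholder model file

coord = np.random.rand(1, 1600 * 3)     # 1 frame, 1600 atoms, flattened xyz
cell = (20.0 * np.eye(3)).reshape(1, 9)
atype = [0] * 1600

# Default: only total energy, forces and the global virial.
e, f, v = dp.eval(coord, cell, atype, atomic=False)

# Opt-in: additionally request per-atom energy and per-atom virial,
# i.e. the expensive branch discussed in this thread.
e, f, v, ae, av = dp.eval(coord, cell, atype, atomic=True)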

@njzjz (Member) commented Jul 18, 2024

It seems that create_graph=True is used even during inference (in which case, the graph is no longer needed after calculating the forces and atomic virials).
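
A minimal, self-contained sketch of that distinction (illustrative only, not the deepmd-kit implementation):

import torch

def forces_from_energy(energy: torch.Tensor, coord: torch.Tensor, training: bool) -> torch.Tensor:
    # create_graph=True keeps the backward graph alive so that higher-order
    # derivatives can be taken later (needed when training on forces/virials);
    # at inference time it only holds extra memory, so it can be dropped.
    (grad,) = torch.autograd.grad([energy.sum()], [coord], create_graph=training)
    return -grad

coord = torch.rand(1600, 3, requires_grad=True)
energy = (coord**2).sum(dim=-1)         # stand-in for per-atom energies
forces = forces_from_energy(energy, coord, training=False)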

njzjz added a commit to njzjz/deepmd-kit that referenced this issue Jul 19, 2024
See deepmodeling#3969 for the background.

njzjz added the "reproduced (This bug has been reproduced by developers)" label Jul 19, 2024
iProzd pushed a commit to iProzd/deepmd-kit that referenced this issue Jul 23, 2024
See deepmodeling#3969 for the background.

njzjz linked a pull request Jul 23, 2024 that will close this issue
github-merge-queue bot pushed a commit that referenced this issue Jul 24, 2024
See #3969 for the background.

Summary by CodeRabbit
- **New Features**: Introduced an 'atomic' parameter in various compute functions to enable atomic energy and virial calculations, providing more granular control over computations.
@njzjz (Member) commented Jul 26, 2024

The memory usage has been reduced by several PRs: #3996, #4006, #4010, and #4012.

njzjz closed this as completed Jul 26, 2024
github-project-automation bot moved this from Todo to Done in Bugfixes for DeePMD-kit Jul 26, 2024
@njzjz (Member) commented Jul 26, 2024

Note: the C++ interface for the PyTorch backend doesn't actually support the spin model (tracked in #4023).

mtaillefumier pushed a commit to mtaillefumier/deepmd-kit that referenced this issue Sep 18, 2024 (…ng#3996)

See deepmodeling#3969 for the background.
