
[BUG] Model converted from PT to TF backend could not run with TF #3997

Closed
Cloudac7 opened this issue Jul 19, 2024 · 4 comments · Fixed by #4007
Assignees
Labels
bug reproduced This bug has been reproduced by developers

Comments

@Cloudac7
Contributor

Cloudac7 commented Jul 19, 2024

Bug summary

I am working on multi-task training with DeePMD-kit v3.0.0b0, and after the freezing step I get a model head that uses the se_a descriptor. I then ran dp --pt convert-backend frozen_model.pth frozen_model.pb (with and without --pt, the result is the same) to obtain frozen_model.pb. However, the converted model could not be used when running LAMMPS with either v2.2.9 or v3.0.0b0, raising the following error:

Setting up Verlet run ...
  Unit style    : metal
  Current step  : 0
  Time step     : 0.0005
INVALID_ARGUMENT: 2 root error(s) found.
  (0) INVALID_ARGUMENT: Input to reshape is a tensor with 504000 values, but the requested shape requires a multiple of 1608
	 [[{{node Reshape_33}}]]
	 [[o_atom_energy/_37]]
  (1) INVALID_ARGUMENT: Input to reshape is a tensor with 504000 values, but the requested shape requires a multiple of 1608
	 [[{{node Reshape_33}}]]
0 successful operations.
0 derived errors ignored.
ERROR on proc 0: DeePMD-kit C API Error: DeePMD-kit Error: TensorFlow Error: INVALID_ARGUMENT: 2 root error(s) found.
  (0) INVALID_ARGUMENT: Input to reshape is a tensor with 504000 values, but the requested shape requires a multiple of 1608
	 [[{{node Reshape_33}}]]
	 [[o_atom_energy/_37]]
  (1) INVALID_ARGUMENT: Input to reshape is a tensor with 504000 values, but the requested shape requires a multiple of 1608
	 [[{{node Reshape_33}}]]
0 successful operations.
0 derived errors ignored. (/public/groups/ai4ec/libs/conda/deepmd/3.0.0b0-cuda118/source/deepmd-kit/source/lmp/pair_deepmd.cpp:586)
Last command: run             ${NSTEPS} upto

Something seems to go wrong when converting the model, which looks like a bug.
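The error class above can be reproduced in isolation: the exported TF graph tries to reshape a flat tensor into rows of a fixed per-atom width, and the width baked into the graph (1608) does not divide the number of values actually produced (504000). A minimal sketch with NumPy (illustrative only, not DeePMD-kit code):

```python
import numpy as np

# The reported sizes from the LAMMPS error log.
n_values = 504000   # values in the incoming tensor
row_width = 1608    # width the Reshape node requires a multiple of

print(n_values % row_width)  # 696 — not zero, so the reshape must fail

try:
    np.zeros(n_values).reshape(-1, row_width)
except ValueError as e:
    # Same failure mode as the TF INVALID_ARGUMENT error above.
    print("reshape failed:", e)
```

This points at a mismatch between the output width the converted graph advertises and the width the tensor actually has, rather than a problem in the LAMMPS input itself.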

DeePMD-kit Version

DeePMD-kit v3.0.0b0

Backend and its version

PyTorch v2.0.0.post200, TensorFlow v2.14.0

How did you download the software?

Offline packages

Input Files, Running Commands, Error Log, etc.

Running command:

dp --pt freeze -o frozen_model.pth --head ener
dp convert-backend frozen_model.pth frozen_model.pb

or use --pt.

The LAMMPS error log is attached below:
slurm-2623892.txt

Steps to Reproduce

Please use the attached frozen_model.pth and the attached LAMMPS task to reproduce the bug.

Further Information, Files, and Links

No response

@Cloudac7 Cloudac7 added the bug label Jul 19, 2024
@njzjz
Member

njzjz commented Jul 19, 2024

DescrptDPA1Compat has the wrong get_dim_out() when concat_output_tebd is true. cc @iProzd
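The diagnosis above can be sketched as follows. All names here are illustrative, not the actual DeePMD-kit API: when the descriptor concatenates the type-embedding output to its own output, the true per-atom width is the descriptor width plus the type-embedding width, but a get_dim_out() that ignores the concatenation reports only the former. Downstream reshape nodes built from the smaller width then fail at runtime, as in the log above.

```python
# Hypothetical sketch of the diagnosed bug (names are illustrative).
class DescriptorSketch:
    def __init__(self, dim_emb, tebd_dim, concat_output_tebd):
        self.dim_emb = dim_emb                    # descriptor output width
        self.tebd_dim = tebd_dim                  # type-embedding width
        self.concat_output_tebd = concat_output_tebd

    def get_dim_out_buggy(self):
        # Ignores the concatenated type embedding.
        return self.dim_emb

    def get_dim_out_fixed(self):
        # Accounts for the type embedding when it is concatenated.
        return self.dim_emb + (self.tebd_dim if self.concat_output_tebd else 0)

d = DescriptorSketch(dim_emb=128, tebd_dim=8, concat_output_tebd=True)
print(d.get_dim_out_buggy(), d.get_dim_out_fixed())  # 128 136
```

With concat_output_tebd true, any graph node sized from the buggy value expects the wrong row width, which matches the "requested shape requires a multiple of N" failure seen when running the converted model.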

@njzjz
Member

njzjz commented Jul 26, 2024

Fixed in #4007.

@njzjz njzjz closed this as completed Jul 26, 2024
@github-project-automation github-project-automation bot moved this from Todo to Done in Bugfixes for DeePMD-kit Jul 26, 2024
njzjz added a commit to njzjz/deepmd-kit that referenced this issue Jul 26, 2024
- [x] (Tomorrow) Test if it works for deepmodeling#3997. 

deepmodeling#3997 needs another fix in deepmodeling#4022 .

## Summary by CodeRabbit

- **New Features**
  - Introduced a method to dynamically determine the output dimension of the descriptor, enhancing its functionality and interaction with other components.
  - Improved tensor dimensionality handling in tests to ensure compatibility with the new output dimension method.

---------

Signed-off-by: Jinzhe Zeng <jinzhe.zeng@rutgers.edu>
Co-authored-by: Jinzhe Zeng <jinzhe.zeng@rutgers.edu>
mtaillefumier pushed a commit to mtaillefumier/deepmd-kit that referenced this issue Sep 18, 2024
@njzjz
Member

njzjz commented Oct 23, 2024

Reopening: #4007 may not fix this issue; more validation is needed.

@njzjz njzjz reopened this Oct 23, 2024
@njzjz
Member

njzjz commented Nov 13, 2024

#4320 should fix the issue.

@njzjz njzjz closed this as completed Nov 13, 2024