
[BUG] finetuning property fitting with multi-dimensional data causes error #4108

Closed
theAfish opened this issue Sep 6, 2024 · 6 comments · Fixed by #4145
theAfish commented Sep 6, 2024

Bug summary

I have tested the new property fitting model in the fine-tuning procedure with the pre-trained OpenLAM_2.2.0_27heads_beta3.pt.
The dataset I used is in the examples folder and has a property dimension of 3. This raises an error about a tensor size mismatch; see the error log below.

DeePMD-kit Version

DeePMD-kit v3.0.0a1.dev320+g46632f90

Backend and its version

torch 2.4.1+cu121

How did you download the software?

Built from source

Input Files, Running Commands, Error Log, etc.

Command: dp --pt train input_finetune.json --finetune OpenLAM_2.2.0_27heads_beta3.pt

Input File:
input_finetune.json

The data files I used are in examples/property/data; the fitting-head settings are sketched below.
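
For reference, the property head in such an input roughly follows the fragment below. The key names are taken from the property example shipped with the repository at the time and should be treated as assumptions; the attached input_finetune.json is authoritative:

```json
"fitting_net": {
    "type": "property",
    "task_dim": 3,
    "intensive": true,
    "neuron": [240, 240, 240],
    "resnet_dt": true,
    "seed": 1
}
```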

Error Log:

To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, DP_INTRA_OP_PARALLELISM_THREADS, and DP_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
[2024-09-06 17:33:46,629] DEEPMD INFO    DeePMD version: 3.0.0a1.dev320+g46632f90
[2024-09-06 17:33:46,629] DEEPMD INFO    Configuration path: input_finetune.json
[2024-09-06 17:33:46,672] DEEPMD INFO     _____               _____   __  __  _____           _     _  _
[2024-09-06 17:33:46,672] DEEPMD INFO    |  __ \             |  __ \ |  \/  ||  __ \         | |   (_)| |
[2024-09-06 17:33:46,672] DEEPMD INFO    | |  | |  ___   ___ | |__) || \  / || |  | | ______ | | __ _ | |_
[2024-09-06 17:33:46,672] DEEPMD INFO    | |  | | / _ \ / _ \|  ___/ | |\/| || |  | ||______|| |/ /| || __|
[2024-09-06 17:33:46,672] DEEPMD INFO    | |__| ||  __/|  __/| |     | |  | || |__| |        |   < | || |_
[2024-09-06 17:33:46,672] DEEPMD INFO    |_____/  \___| \___||_|     |_|  |_||_____/         |_|\_\|_| \__|
[2024-09-06 17:33:46,672] DEEPMD INFO    Please read and cite:
[2024-09-06 17:33:46,672] DEEPMD INFO    Wang, Zhang, Han and E, Comput.Phys.Comm. 228, 178-184 (2018)
[2024-09-06 17:33:46,672] DEEPMD INFO    Zeng et al, J. Chem. Phys., 159, 054801 (2023)
[2024-09-06 17:33:46,672] DEEPMD INFO    See https://deepmd.rtfd.io/credits/ for details.
[2024-09-06 17:33:46,672] DEEPMD INFO    -------------------------------------------------------------------------------
[2024-09-06 17:33:46,672] DEEPMD INFO    installed to:          /home/notfish/dev/dc-dev/deepmd
[2024-09-06 17:33:46,672] DEEPMD INFO                           /home/notfish/dev/dp/lib/python3.10/site-packages/deepmd
[2024-09-06 17:33:46,672] DEEPMD INFO    source:                v3.0.0a0-320-g46632f90
[2024-09-06 17:33:46,672] DEEPMD INFO    source brach:          devel
[2024-09-06 17:33:46,672] DEEPMD INFO    source commit:         46632f90
[2024-09-06 17:33:46,672] DEEPMD INFO    source commit at:      2024-09-04 00:33:34 +0000
[2024-09-06 17:33:46,672] DEEPMD INFO    use float prec:        double
[2024-09-06 17:33:46,672] DEEPMD INFO    build variant:         cpu
[2024-09-06 17:33:46,672] DEEPMD INFO    Backend:               PyTorch
[2024-09-06 17:33:46,672] DEEPMD INFO    PT ver:                v2.4.1+cu121-g38b96d3399a
[2024-09-06 17:33:46,672] DEEPMD INFO    Enable custom OP:      False
[2024-09-06 17:33:46,672] DEEPMD INFO    running on:            theNotfish
[2024-09-06 17:33:46,672] DEEPMD INFO    computing device:      cuda:0
[2024-09-06 17:33:46,672] DEEPMD INFO    CUDA_VISIBLE_DEVICES:  unset
[2024-09-06 17:33:46,672] DEEPMD INFO    Count of visible GPUs: 1
[2024-09-06 17:33:46,672] DEEPMD INFO    num_intra_threads:     0
[2024-09-06 17:33:46,672] DEEPMD INFO    num_inter_threads:     0
[2024-09-06 17:33:46,672] DEEPMD INFO    -------------------------------------------------------------------------------
/home/notfish/dev/dc-dev/deepmd/pt/utils/finetune.py:139: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  state_dict = torch.load(finetune_model, map_location=env.DEVICE)
[2024-09-06 17:33:47,902] DEEPMD WARNING The fitting net will be re-init instead of using that in the pretrained model! The bias_adjust_mode will be set-by-statistic!
[2024-09-06 17:33:47,927] DEEPMD INFO    Calculate neighbor statistics... (add --skip-neighbor-stat to skip this step)
[2024-09-06 17:33:47,959] DEEPMD INFO    If you encounter the error 'an illegal memory access was encountered', this may be due to a TensorFlow issue. To avoid this, set the environment variable DP_INFER_BATCH_SIZE to a smaller value than the last adjusted batch size. The environment variable DP_INFER_BATCH_SIZE controls the inference batch size (nframes * natoms).
[2024-09-06 17:33:47,987] DEEPMD INFO    Adjust batch size from 1024 to 2048
[2024-09-06 17:33:48,113] DEEPMD INFO    training data with min nbor dist: 0.9608642172055677
[2024-09-06 17:33:48,113] DEEPMD INFO    training data with max nbor size: [21]
[2024-09-06 17:33:48,117] DEEPMD INFO    If you encounter the error 'an illegal memory access was encountered', this may be due to a TensorFlow issue. To avoid this, set the environment variable DP_INFER_BATCH_SIZE to a smaller value than the last adjusted batch size. The environment variable DP_INFER_BATCH_SIZE controls the inference batch size (nframes * natoms).
[2024-09-06 17:33:48,119] DEEPMD INFO    Adjust batch size from 1024 to 2048
[2024-09-06 17:33:48,273] DEEPMD INFO    training data with min nbor dist: 0.9608642172055677
[2024-09-06 17:33:48,274] DEEPMD INFO    training data with max nbor size: [21]
[2024-09-06 17:33:48,547] DEEPMD INFO    ---Summary of DataSystem: training     -----------------------------------------------
[2024-09-06 17:33:48,548] DEEPMD INFO    found 2 system(s):
[2024-09-06 17:33:48,548] DEEPMD INFO                                        system  natoms  bch_sz   n_bch       prob  pbc
[2024-09-06 17:33:48,548] DEEPMD INFO                                ../data/data_0      20       1      80  5.000e-01    F
[2024-09-06 17:33:48,548] DEEPMD INFO                                ../data/data_1      22       1      80  5.000e-01    F
[2024-09-06 17:33:48,548] DEEPMD INFO    --------------------------------------------------------------------------------------
[2024-09-06 17:33:48,551] DEEPMD INFO    ---Summary of DataSystem: validation   -----------------------------------------------
[2024-09-06 17:33:48,551] DEEPMD INFO    found 1 system(s):
[2024-09-06 17:33:48,551] DEEPMD INFO                                        system  natoms  bch_sz   n_bch       prob  pbc
[2024-09-06 17:33:48,551] DEEPMD INFO                                ../data/data_2      24       1      80  1.000e+00    F
[2024-09-06 17:33:48,551] DEEPMD INFO    --------------------------------------------------------------------------------------
[2024-09-06 17:33:48,552] DEEPMD INFO    Resuming from OpenLAM_2.2.0_27heads_beta3.pt.
/home/notfish/dev/dc-dev/deepmd/pt/train/training.py:404: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  state_dict = torch.load(resume_model, map_location=DEVICE)
Traceback (most recent call last):
  File "/home/notfish/dev/dp/bin/dp", line 8, in <module>
    sys.exit(main())
  File "/home/notfish/dev/dc-dev/deepmd/main.py", line 923, in main
    deepmd_main(args)
  File "/home/notfish/dev/dp/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/home/notfish/dev/dc-dev/deepmd/pt/entrypoints/main.py", line 563, in main
    train(FLAGS)
  File "/home/notfish/dev/dc-dev/deepmd/pt/entrypoints/main.py", line 327, in train
    trainer = get_trainer(
  File "/home/notfish/dev/dc-dev/deepmd/pt/entrypoints/main.py", line 190, in get_trainer
    trainer = training.Trainer(
  File "/home/notfish/dev/dc-dev/deepmd/pt/train/training.py", line 516, in __init__
    self.wrapper.load_state_dict(state_dict)
  File "/home/notfish/dev/dp/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2215, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for ModelWrapper:
        size mismatch for model.Default.atomic_model.out_bias: copying a param with shape torch.Size([1, 118, 1]) from checkpoint, the shape in current model is torch.Size([1, 118, 3]).
        size mismatch for model.Default.atomic_model.out_std: copying a param with shape torch.Size([1, 118, 1]) from checkpoint, the shape in current model is torch.Size([1, 118, 3]).
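
For context, the same failure mode can be reproduced outside DeePMD-kit with plain PyTorch: strict `load_state_dict` refuses to copy a buffer whose shape differs between the checkpoint and the current model. A minimal sketch (`AtomicModel` here is a hypothetical stand-in, not the real class):

```python
import torch

class AtomicModel(torch.nn.Module):
    def __init__(self, out_dim: int):
        super().__init__()
        # mirrors model.Default.atomic_model.out_bias: shape (1, ntypes, out_dim)
        self.register_buffer("out_bias", torch.zeros(1, 118, out_dim))

pretrained = AtomicModel(out_dim=1)  # energy-style head: scalar output
finetuned = AtomicModel(out_dim=3)   # property head with task dimension 3

# Raises RuntimeError: size mismatch for out_bias: copying a param with shape
# torch.Size([1, 118, 1]) from checkpoint, the shape in current model is
# torch.Size([1, 118, 3]).
finetuned.load_state_dict(pretrained.state_dict())
```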

Steps to Reproduce

Just run the command above with the datasets and the input file.

Further Information, Files, and Links

No response

theAfish added the bug label Sep 6, 2024
wanghan-iapcm (Collaborator) commented:

The model OpenLAM_2.2.0_27heads_beta3.pt is for the v3.0.0 beta release, but you are using v3.0.0 alpha. Please upgrade to the beta release.

theAfish (Author) commented Sep 9, 2024

I am using the latest version of the devel branch, which contains the new PropertyFitting functions. OpenLAM_2.2.0_27heads_beta3.pt seems to work properly for fine-tuning with 1D property data in my case. Since v3.0.0b3 does not include this specific feature, I'm not sure whether this issue with fitting multi-dimensional property data is caused by the model's version or by the code.

njzjz (Member) commented Sep 9, 2024

Which commit are you using? 46632f9 does not contain PropertyFitting.

theAfish (Author) commented:

It should be #3867.

njzjz (Member) commented Sep 13, 2024

It looks like a bug in finetune, not related to the property fitting itself: out_bias should not be loaded. @iProzd, could you take a look at the finetune code?

Chengqian-Zhang (Collaborator) commented Sep 14, 2024

This bug appears when the finetune task's label is multi-dimensional: DOS fitting, property fitting, polar fitting, and dipole fitting all report this error when fine-tuning from a multitask pretrained model.

Chengqian-Zhang self-assigned this Sep 14, 2024
github-merge-queue bot pushed a commit that referenced this issue Sep 25, 2024
…nsional data causes error (#4145)

Fix issue #4108 

If a pretrained model is labeled with energy, its `out_bias` is one-dimensional. If we want to finetune a dos/polar/dipole/property model from this pretrained model, the `out_bias` of the finetuning model is multi-dimensional (for example, numb_dos = 250), and an error occurs:
RuntimeError: Error(s) in loading state_dict for ModelWrapper:
        size mismatch for model.Default.atomic_model.out_bias: copying a param with shape torch.Size([1, 118, 1]) from checkpoint, the shape in current model is torch.Size([1, 118, 250]).
        size mismatch for model.Default.atomic_model.out_std: copying a param with shape torch.Size([1, 118, 1]) from checkpoint, the shape in current model is torch.Size([1, 118, 250]).

When using a new fitting head, the old out_bias is useless because the new bias is recomputed later in the code, so we do not need to load the old out_bias when fine-tuning with a new fitting head.
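
One way to express that idea (an illustrative sketch, not the actual #4145 patch; the helper name and state-dict layout are assumptions):

```python
def strip_stale_out_stats(state_dict: dict) -> dict:
    """Drop per-type output statistics; they are recomputed by the
    set-by-statistic pass for the newly initialized fitting head."""
    stale = ("out_bias", "out_std")
    return {k: v for k, v in state_dict.items() if not k.endswith(stale)}

# usage sketch: load everything except the stale statistics, then let the
# bias/std of the new head be filled in from the training data
# model.load_state_dict(strip_stale_out_stats(model_state), strict=False)
```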

## Summary by CodeRabbit

- **New Features**
  - Enhanced parameter collection for fine-tuning, refining criteria for parameter retention.
  - Introduced a model checkpoint file for saving and resuming training states, facilitating iterative development.

- **Tests**
  - Added a new test class to validate training and fine-tuning processes, ensuring model performance consistency across configurations.

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
iProzd closed this as completed Sep 26, 2024
njzjz added this to the v3.0.0 milestone Sep 26, 2024