
[BUG] PT: Loss of the last step is not printed in lcurve.out #4206

Closed
njzjz opened this issue Oct 11, 2024 · 0 comments · Fixed by #4221

@njzjz
Member

njzjz commented Oct 11, 2024

Bug summary

In the PyTorch backend, the loss of the last step is not printed in lcurve.out.

DeePMD-kit Version

3939786

Backend and its version

PT ver: v2.3.1+cu121-gd44533f9d07

How did you download the software?

Built from source

Input Files, Running Commands, Error Log, etc.

When the number of training steps is set to 1000, lcurve.out ends at step 900; the loss of the final training step is never printed:

#  step      rmse_val    rmse_trn    rmse_e_val  rmse_e_trn    rmse_f_val  rmse_f_trn         lr
# If there is no available reference data, rmse_*_{val,trn} will print nan
      0      2.54e+01    2.51e+01      7.16e-01    1.03e+00      8.01e-01    7.92e-01    1.0e-03
    100      7.69e+00    7.79e+00      2.42e-02    1.29e-03      4.03e-01    4.09e-01    3.6e-04
    200      3.76e+00    3.72e+00      1.83e-02    2.34e-02      3.26e-01    3.23e-01    1.3e-04
    300      2.22e+00    2.21e+00      1.10e-02    2.69e-03      3.18e-01    3.18e-01    4.8e-05
    400      1.36e+00    1.30e+00      2.79e-03    3.29e-03      3.18e-01    3.04e-01    1.7e-05
    500      7.98e-01    9.00e-01      2.19e-03    2.98e-03      2.96e-01    3.34e-01    6.2e-06
    600      5.59e-01    6.10e-01      3.11e-03    3.90e-04      3.08e-01    3.38e-01    2.3e-06
    700      4.39e-01    4.43e-01      2.84e-03    9.52e-03      3.24e-01    3.14e-01    8.2e-07
    800      3.56e-01    3.42e-01      2.73e-03    2.63e-03      3.10e-01    2.99e-01    3.0e-07
    900      3.23e-01    3.29e-01      2.68e-03    6.37e-03      3.04e-01    3.02e-01    1.1e-07
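
The behavior matches a simple off-by-one in the logging condition. The following toy loop is illustrative only and is not the actual deepmd-kit training code; it assumes a zero-based step counter and a display interval of 100, matching the log above.

```python
# Toy illustration of the reported off-by-one (not the actual deepmd-kit code).
# With a zero-based step counter and a display interval of 100, the condition
# below holds for steps 0, 100, ..., 900 but never for the final step 999.
num_steps = 1000
disp_freq = 100

logged = []
for step in range(num_steps):        # PT backend iterates steps 0..999
    # ... one training step would run here ...
    if step % disp_freq == 0:
        logged.append(step)

print(logged[-1])                    # 900 -- the last step's loss is never written
```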

Steps to Reproduce

Go to the water example and set the number of steps to 1000. Train the model.

Further Information, Files, and Links

No response

@njzjz njzjz added the bug label Oct 11, 2024
@njzjz njzjz added this to the v3.0.0 milestone Oct 11, 2024
njzjz added a commit to njzjz/deepmd-kit that referenced this issue Oct 15, 2024
Fix deepmodeling#4206.
@njzjz njzjz linked a pull request Oct 15, 2024 that will close this issue
github-merge-queue bot pushed a commit that referenced this issue Oct 16, 2024
Fix #4206.

Currently, the training step index displayed in TF and PT has different
meanings:
- In TF, step 0 means no training; step 1 means a training step has been
performed. The maximum training step is equal to the number of steps.
- In PT, step 0 means a training step has been performed. The maximum
training step is the number of steps minus 1.

This PR corrects the definition of the step index in PT and makes them
consistent.

There is still a difference after this PR: TF shows step 0, but PT shows
step 1. Showing the loss of step 0 in PT needs heavy refactoring and is
thus not included in this PR.
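
As a rough illustration of the convention described above (a sketch under assumptions, not the merged implementation): the displayed step now counts completed training steps, so it runs from 1 to the requested number of steps and the final step can be logged. The explicit `step == num_steps` guard below is an assumption added to show how the final loss can always be emitted.

```python
# Sketch of the corrected convention (illustrative only, not the merged code).
# "step N" now means N training steps have been completed, so the loop runs
# 1..num_steps and the final step can be written to lcurve.out.
num_steps = 1000
disp_freq = 100

logged = []
for step in range(1, num_steps + 1):
    # ... one training step would run here ...
    if step % disp_freq == 0 or step == num_steps:   # assumed guard for the last step
        logged.append(step)

print(logged[0], logged[-1])         # 100 1000 -- the final step is now printed
```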

## Summary by CodeRabbit

- **New Features**
  - Improved logging for training progress, starting the step count from 1 for better clarity.
  - Enhanced TensorBoard logging for consistent step tracking.

- **Bug Fixes**
  - Adjusted logging conditions to ensure the first step's results are included in the output.

---------

Signed-off-by: Jinzhe Zeng <jinzhe.zeng@rutgers.edu>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
@njzjz njzjz closed this as completed Oct 16, 2024