
[BUG] PT: Loss of the last step is not printed in lcurve.out #4206

Closed
njzjz opened this issue Oct 11, 2024 · 0 comments · Fixed by #4221

@njzjz
Member

njzjz commented Oct 11, 2024

Bug summary

In the PyTorch backend, the loss of the last step is not printed in lcurve.out.

DeePMD-kit Version

3939786

Backend and its version

PT ver: v2.3.1+cu121-gd44533f9d07

How did you download the software?

Built from source

Input Files, Running Commands, Error Log, etc.

When the number of training steps is set to 1000, lcurve.out ends at step 900; the loss of the final training step is never printed:

#  step      rmse_val    rmse_trn    rmse_e_val  rmse_e_trn    rmse_f_val  rmse_f_trn         lr
# If there is no available reference data, rmse_*_{val,trn} will print nan
      0      2.54e+01    2.51e+01      7.16e-01    1.03e+00      8.01e-01    7.92e-01    1.0e-03
    100      7.69e+00    7.79e+00      2.42e-02    1.29e-03      4.03e-01    4.09e-01    3.6e-04
    200      3.76e+00    3.72e+00      1.83e-02    2.34e-02      3.26e-01    3.23e-01    1.3e-04
    300      2.22e+00    2.21e+00      1.10e-02    2.69e-03      3.18e-01    3.18e-01    4.8e-05
    400      1.36e+00    1.30e+00      2.79e-03    3.29e-03      3.18e-01    3.04e-01    1.7e-05
    500      7.98e-01    9.00e-01      2.19e-03    2.98e-03      2.96e-01    3.34e-01    6.2e-06
    600      5.59e-01    6.10e-01      3.11e-03    3.90e-04      3.08e-01    3.38e-01    2.3e-06
    700      4.39e-01    4.43e-01      2.84e-03    9.52e-03      3.24e-01    3.14e-01    8.2e-07
    800      3.56e-01    3.42e-01      2.73e-03    2.63e-03      3.10e-01    2.99e-01    3.0e-07
    900      3.23e-01    3.29e-01      2.68e-03    6.37e-03      3.04e-01    3.02e-01    1.1e-07
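
The behavior matches a simple off-by-one in the logging condition. The following toy loop is illustrative only and is not the actual deepmd-kit training code; it assumes a zero-based step counter and a display interval of 100, matching the log above.

```python
# Toy illustration of the reported off-by-one (not the actual deepmd-kit code).
# With a zero-based step counter and a display interval of 100, the condition
# below holds for steps 0, 100, ..., 900 but never for the final step 999.
num_steps = 1000
disp_freq = 100

logged = []
for step in range(num_steps):        # PT backend iterates steps 0..999
    # ... one training step would run here ...
    if step % disp_freq == 0:
        logged.append(step)

print(logged[-1])                    # 900 -- the last step's loss is never written
```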

Steps to Reproduce

Go to the water example and set the number of steps to 1000. Train the model.

Further Information, Files, and Links

No response

@njzjz njzjz added the bug label Oct 11, 2024
@njzjz njzjz added this to the v3.0.0 milestone Oct 11, 2024
njzjz added a commit to njzjz/deepmd-kit that referenced this issue Oct 15, 2024
Fix deepmodeling#4206.
@njzjz njzjz linked a pull request Oct 15, 2024 that will close this issue
github-merge-queue bot pushed a commit that referenced this issue Oct 16, 2024
Fix #4206.

Currently, the training step index displayed in TF and PT has different
meanings:
- In TF, step 0 means no training; step 1 means a training step has been
performed. The maximum training step is equal to the number of steps.
- In PT, step 0 means a training step has been performed. The maximum
training step is the number of steps minus 1.

This PR corrects the definition of the step index in PT and makes them
consistent.

There is still a difference after this PR: TF shows step 0, but PT shows
step 1. Showing the loss of step 0 in PT needs heavy refactoring and is
thus not included in this PR.
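
As a rough illustration of the convention described above (a sketch under assumptions, not the merged implementation): the displayed step now counts completed training steps, so it runs from 1 to the requested number of steps and the final step can be logged. The explicit `step == num_steps` guard below is an assumption added to show how the final loss can always be emitted.

```python
# Sketch of the corrected convention (illustrative only, not the merged code).
# "step N" now means N training steps have been completed, so the loop runs
# 1..num_steps and the final step can be written to lcurve.out.
num_steps = 1000
disp_freq = 100

logged = []
for step in range(1, num_steps + 1):
    # ... one training step would run here ...
    if step % disp_freq == 0 or step == num_steps:   # assumed guard for the last step
        logged.append(step)

print(logged[0], logged[-1])         # 100 1000 -- the final step is now printed
```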

## Summary by CodeRabbit

- **New Features**
  - Improved logging for training progress, starting the step count from 1 for better clarity.
  - Enhanced TensorBoard logging for consistent step tracking.

- **Bug Fixes**
  - Adjusted logging conditions to ensure the first step's results are included in the output.

---------

Signed-off-by: Jinzhe Zeng <jinzhe.zeng@rutgers.edu>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
@njzjz njzjz closed this as completed Oct 16, 2024