[BUG] PT: Loss of the last step is not printed in lcurve.out (#4206)
njzjz added a commit to njzjz/deepmd-kit that referenced this issue on Oct 15, 2024:
Fix deepmodeling#4206. Currently, the training step index displayed in TF and PT has different meanings:

- In TF, step 0 means no training has been performed; step 1 means one training step has been performed. The maximum displayed step equals the number of steps.
- In PT, step 0 means one training step has been performed. The maximum displayed step is the number of steps minus 1.

This PR corrects the definition of the step index in PT and makes the two backends consistent. One difference remains: TF shows step 0, but PT shows step 1. Showing step 0 in PT would need heavy refactoring and is thus not included in this PR.

Signed-off-by: Jinzhe Zeng <jinzhe.zeng@rutgers.edu>
github-merge-queue bot pushed a commit that referenced this issue on Oct 16, 2024:
Fix #4206. Currently, the training step index displayed in TF and PT has different meanings:

- In TF, step 0 means no training has been performed; step 1 means one training step has been performed. The maximum displayed step equals the number of steps.
- In PT, step 0 means one training step has been performed. The maximum displayed step is the number of steps minus 1.

This PR corrects the definition of the step index in PT and makes the two backends consistent. One difference remains after this PR: TF shows step 0, but PT shows step 1. Showing the loss of step 0 in PT would need heavy refactoring and is thus not included in this PR.

Summary by CodeRabbit:

- New Features
  - Improved logging for training progress, starting the step count from 1 for better clarity.
  - Enhanced TensorBoard logging for consistent step tracking.
- Bug Fixes
  - Adjusted logging conditions to ensure the first step's results are included in the output.

Signed-off-by: Jinzhe Zeng <jinzhe.zeng@rutgers.edu>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
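The off-by-one described in the commit message can be sketched as follows. This is not DeePMD-kit code; `logged_steps`, `num_steps`, and `disp_freq` are illustrative names. With a 0-based step index and a `step % disp_freq == 0` display condition, the final step's loss is never printed when the total step count is a multiple of the display frequency; a 1-based index fixes this.

```python
def logged_steps(num_steps, disp_freq, one_based):
    """Return the step labels that a `shown % disp_freq == 0` condition logs."""
    logged = []
    for step in range(num_steps):  # 0-based loop counter, as in a training loop
        shown = step + 1 if one_based else step
        if shown % disp_freq == 0:
            logged.append(shown)
    return logged

# 0-based: 0, 100, ..., 900 are logged; the loss after step 1000 never appears.
print(logged_steps(1000, 100, one_based=False)[-1])  # 900
# 1-based: 100, 200, ..., 1000 are logged, including the final step.
print(logged_steps(1000, 100, one_based=True)[-1])   # 1000
```

This also matches the TF convention described above, where the maximum displayed step equals the number of steps.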
Bug summary

In the PyTorch backend, the loss of the last step is not printed in lcurve.out.

DeePMD-kit Version
3939786
Backend and its version
PT ver: v2.3.1+cu121-gd44533f9d07
How did you download the software?
Built from source
Input Files, Running Commands, Error Log, etc.
When the number of training steps is set to 1000,
Steps to Reproduce
Go to the water example and set the number of steps to 1000. Train the model.
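A minimal sketch of preparing the reproduction, assuming a training config with the standard DeePMD-kit `training`/`numb_steps` key (the stand-in `config` dict and the `input.json` path are illustrative, not the actual water example file):

```python
import json
import pathlib

cfg_path = pathlib.Path("input.json")
# Stand-in for the water example's training config.
config = {"training": {"numb_steps": 20}}
# Set the number of training steps to 1000, as in the reproduction.
config["training"]["numb_steps"] = 1000
cfg_path.write_text(json.dumps(config, indent=4))
# Then train with the PyTorch backend and inspect the tail of lcurve.out:
# before the fix, the last printed step is below 1000.
```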
Further Information, Files, and Links
No response