
Pstjohn/stop and go test non validation #476

Merged

Conversation

@pstjohn pstjohn commented Nov 26, 2024

Explicitly test that training inputs and outputs are consistent when we use PreemptionCallback to handle a training interrupt.
Currently, interrupting and resuming shifts the validation step schedule, but training is otherwise identical.
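The graceful-stop behavior visible in the log below ("Received signal 12, initiating graceful stop"; signal 12 is SIGUSR2 on Linux) can be sketched in plain Python. This is a simplified illustration, not the actual NeMo `PreemptionCallback` implementation: the `GracefulStop` class and the toy step loop are hypothetical, and a real callback would save a checkpoint before exiting.

```python
import os
import signal

class GracefulStop:
    """Turn a preemption signal into a flag checked between training steps.

    Simplified sketch only; NeMo's PreemptionCallback integrates this with
    Lightning's checkpointing instead of a bare flag.
    """

    def __init__(self, sig=signal.SIGUSR2):
        self.should_stop = False
        signal.signal(sig, self._handler)

    def _handler(self, signum, frame):
        print(f"Received signal {signum}, initiating graceful stop")
        self.should_stop = True

stopper = GracefulStop()
completed = 0
for step in range(10):
    if stopper.should_stop:
        # A real callback would checkpoint here, then exit cleanly.
        break
    completed += 1
    if step == 1:
        # Simulate the cluster sending a preemption notice mid-run.
        os.kill(os.getpid(), signal.SIGUSR2)
print(f"completed {completed} steps before graceful stop")
```

Because the handler only sets a flag, the step in flight finishes normally and the loop exits at the next step boundary, which is what makes a clean checkpoint-and-resume possible.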

STOP OUTPUT:

Sanity checking Validation: iteration 1/2
Sanity checking Validation: iteration 2/2
Training epoch 0, iteration 0/9 | lr: 0 | global_batch_size: 2 | global_step: 0 | reduced_train_loss: 4.87
Training epoch 0, iteration 1/9 | lr: 2e-06 | global_batch_size: 2 | global_step: 1 | reduced_train_loss: 4.771 | consumed_samples: 4
[NeMo I 2024-12-02 19:12:44 preemption:87] Received signal 12, initiating graceful stop
[NeMo I 2024-12-02 19:12:44 preemption:67] Preemption detected, saving checkpoint and exiting
2024-12-02 19:12:44,487 _dedup_tensors.py:46 INFO p:MainProcess t:MainThread: Duplicate keys to remove: {}

RESUME OUTPUT:

Restored all states from the checkpoint at /tmp/tmpn8luhenb/TestESM2StopAndGoCheckpointNotAtValidation/checkpoints/epoch=0-step=2-val_loss=0.00-last/weights
Training epoch 0, iteration 3/9 | lr: 6e-06 | consumed_samples: 8 | global_batch_size: 2 | global_step: 3 | reduced_train_loss: 4.783
Training epoch 0, iteration 4/9 | lr: 8e-06 | consumed_samples: 10 | global_batch_size: 2 | global_step: 4 | reduced_train_loss: 4.785
Validation: iteration 1/2
Validation: iteration 2/2
2024-12-02 19:12:50,690 _dedup_tensors.py:46 INFO p:MainProcess t:MainThread: Duplicate keys to remove: {}
Training epoch 0, iteration 5/9 | lr: 1e-05 | consumed_samples: 12 | global_batch_size: 2 | global_step: 5 | reduced_train_loss: 4.672 | val_loss: 4.711
Training epoch 0, iteration 6/9 | lr: 1.2e-05 | consumed_samples: 14 | global_batch_size: 2 | global_step: 6 | reduced_train_loss: 4.7 | val_loss: 4.711
Training epoch 0, iteration 7/9 | lr: 1.4e-05 | consumed_samples: 16 | global_batch_size: 2 | global_step: 7 | reduced_train_loss: 4.621 | val_loss: 4.711
Training epoch 0, iteration 8/9 | lr: 1.6e-05 | consumed_samples: 18 | global_batch_size: 2 | global_step: 8 | reduced_train_loss: 4.532 | val_loss: 4.711
Validation: iteration 1/2
Validation: iteration 2/2
Training epoch 0, iteration 9/9 | lr: 1.8e-05 | consumed_samples: 20 | global_batch_size: 2 | global_step: 9 | reduced_train_loss: 4.497 | val_loss: 4.544
`Trainer.fit` stopped: `max_steps=10` reached.

CONTINUOUS OUTPUT:

Sanity checking Validation: iteration 1/2
Sanity checking Validation: iteration 2/2
Training epoch 0, iteration 0/9 | lr: 0 | global_batch_size: 2 | global_step: 0 | reduced_train_loss: 4.87
Training epoch 0, iteration 1/9 | lr: 2e-06 | global_batch_size: 2 | global_step: 1 | reduced_train_loss: 4.771 | consumed_samples: 4
Training epoch 0, iteration 2/9 | lr: 4e-06 | global_batch_size: 2 | global_step: 2 | reduced_train_loss: 4.87 | consumed_samples: 6
Training epoch 0, iteration 3/9 | lr: 6e-06 | global_batch_size: 2 | global_step: 3 | reduced_train_loss: 4.783 | consumed_samples: 8
Validation: iteration 1/2
Validation: iteration 2/2
Training epoch 0, iteration 4/9 | lr: 8e-06 | global_batch_size: 2 | global_step: 4 | reduced_train_loss: 4.785 | consumed_samples: 10 | val_loss: 4.758
Training epoch 0, iteration 5/9 | lr: 1e-05 | global_batch_size: 2 | global_step: 5 | reduced_train_loss: 4.672 | consumed_samples: 12 | val_loss: 4.758
Training epoch 0, iteration 6/9 | lr: 1.2e-05 | global_batch_size: 2 | global_step: 6 | reduced_train_loss: 4.7 | consumed_samples: 14 | val_loss: 4.758
Training epoch 0, iteration 7/9 | lr: 1.4e-05 | global_batch_size: 2 | global_step: 7 | reduced_train_loss: 4.621 | consumed_samples: 16 | val_loss: 4.758
Validation: iteration 1/2
Validation: iteration 2/2
Training epoch 0, iteration 8/9 | lr: 1.6e-05 | global_batch_size: 2 | global_step: 8 | reduced_train_loss: 4.532 | consumed_samples: 18 | val_loss: 4.573
Training epoch 0, iteration 9/9 | lr: 1.8e-05 | global_batch_size: 2 | global_step: 9 | reduced_train_loss: 4.497 | consumed_samples: 20 | val_loss: 4.573
Validation: iteration 1/2
Validation: iteration 2/2
`Trainer.fit` stopped: `max_steps=10` reached.
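The consistency the test asserts can be checked directly against the log excerpts above: at every `global_step` shared by the RESUME and CONTINUOUS runs, `reduced_train_loss` matches exactly, even though `val_loss` appears at different steps. A hypothetical helper (not part of this PR) that extracts losses keyed by step:

```python
import re

def losses_by_step(log: str) -> dict:
    """Map global_step -> reduced_train_loss for each training line in a log."""
    out = {}
    for line in log.splitlines():
        step = re.search(r"global_step: (\d+)", line)
        loss = re.search(r"reduced_train_loss: ([\d.]+)", line)
        if step and loss:
            out[int(step.group(1))] = float(loss.group(1))
    return out

# Two lines from each excerpt above, verbatim (field order differs between runs).
resume_log = """\
Training epoch 0, iteration 3/9 | lr: 6e-06 | consumed_samples: 8 | global_batch_size: 2 | global_step: 3 | reduced_train_loss: 4.783
Training epoch 0, iteration 4/9 | lr: 8e-06 | consumed_samples: 10 | global_batch_size: 2 | global_step: 4 | reduced_train_loss: 4.785
"""
continuous_log = """\
Training epoch 0, iteration 3/9 | lr: 6e-06 | global_batch_size: 2 | global_step: 3 | reduced_train_loss: 4.783 | consumed_samples: 8
Training epoch 0, iteration 4/9 | lr: 8e-06 | global_batch_size: 2 | global_step: 4 | reduced_train_loss: 4.785 | consumed_samples: 10
"""

resumed = losses_by_step(resume_log)
continuous = losses_by_step(continuous_log)
shared = resumed.keys() & continuous.keys()
# Train losses agree at every shared step; only validation timing differs.
assert all(resumed[s] == continuous[s] for s in shared)
```

Keying on `global_step` rather than line position makes the comparison robust to the resumed run starting mid-epoch and to validation lines interleaving at different points.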


@pstjohn pstjohn force-pushed the pstjohn/stop-and-go-test-non-validation branch 3 times, most recently from c5deefd to bbf2bd7 Compare December 2, 2024 22:44
@pstjohn pstjohn marked this pull request as ready for review December 2, 2024 22:44

pstjohn commented Dec 2, 2024

/build-ci


@sichu2023 sichu2023 left a comment


Good to go. Only some comments to help me understand it better.

@pstjohn pstjohn force-pushed the pstjohn/stop-and-go-test-non-validation branch from e8e4853 to 364659c Compare December 4, 2024 16:05
@pstjohn pstjohn force-pushed the pstjohn/stop-and-go-test-non-validation branch from 364659c to 117e00c Compare December 4, 2024 16:59

pstjohn commented Dec 4, 2024

/build-ci

@pstjohn pstjohn enabled auto-merge (squash) December 4, 2024 17:01

pstjohn commented Dec 4, 2024

/build-ci

@pstjohn pstjohn merged commit 38be873 into NVIDIA:main Dec 4, 2024
4 checks passed