
Pstjohn/stop and go test non validation #476

Merged

Conversation

@pstjohn pstjohn commented Nov 26, 2024

Explicitly test that training inputs and outputs are consistent when we use PreemptionCallback to handle a training interrupt.
Currently, interrupting and resuming shifts the validation step schedule, but training is otherwise identical.
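The graceful-stop behavior visible in the log below ("Received signal 12, initiating graceful stop"; signal 12 is SIGUSR2 on Linux) can be sketched in plain Python. This is a simplified illustration, not the actual NeMo `PreemptionCallback` implementation: the `GracefulStop` class and the toy step loop are hypothetical, and a real callback would save a checkpoint before exiting.

```python
import os
import signal

class GracefulStop:
    """Turn a preemption signal into a flag checked between training steps.

    Simplified sketch only; NeMo's PreemptionCallback integrates this with
    Lightning's checkpointing instead of a bare flag.
    """

    def __init__(self, sig=signal.SIGUSR2):
        self.should_stop = False
        signal.signal(sig, self._handler)

    def _handler(self, signum, frame):
        print(f"Received signal {signum}, initiating graceful stop")
        self.should_stop = True

stopper = GracefulStop()
completed = 0
for step in range(10):
    if stopper.should_stop:
        # A real callback would checkpoint here, then exit cleanly.
        break
    completed += 1
    if step == 1:
        # Simulate the cluster sending a preemption notice mid-run.
        os.kill(os.getpid(), signal.SIGUSR2)
print(f"completed {completed} steps before graceful stop")
```

Because the handler only sets a flag, the step in flight finishes normally and the loop exits at the next step boundary, which is what makes a clean checkpoint-and-resume possible.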

STOP OUTPUT:

Sanity checking Validation: iteration 1/2
Sanity checking Validation: iteration 2/2
Training epoch 0, iteration 0/9 | lr: 0 | global_batch_size: 2 | global_step: 0 | reduced_train_loss: 4.87
Training epoch 0, iteration 1/9 | lr: 2e-06 | global_batch_size: 2 | global_step: 1 | reduced_train_loss: 4.771 | consumed_samples: 4
[NeMo I 2024-12-02 19:12:44 preemption:87] Received signal 12, initiating graceful stop
[NeMo I 2024-12-02 19:12:44 preemption:67] Preemption detected, saving checkpoint and exiting
2024-12-02 19:12:44,487 _dedup_tensors.py:46 INFO p:MainProcess t:MainThread: Duplicate keys to remove: {}

RESUME OUTPUT:

Restored all states from the checkpoint at /tmp/tmpn8luhenb/TestESM2StopAndGoCheckpointNotAtValidation/checkpoints/epoch=0-step=2-val_loss=0.00-last/weights
Training epoch 0, iteration 3/9 | lr: 6e-06 | consumed_samples: 8 | global_batch_size: 2 | global_step: 3 | reduced_train_loss: 4.783
Training epoch 0, iteration 4/9 | lr: 8e-06 | consumed_samples: 10 | global_batch_size: 2 | global_step: 4 | reduced_train_loss: 4.785
Validation: iteration 1/2
Validation: iteration 2/2
2024-12-02 19:12:50,690 _dedup_tensors.py:46 INFO p:MainProcess t:MainThread: Duplicate keys to remove: {}
Training epoch 0, iteration 5/9 | lr: 1e-05 | consumed_samples: 12 | global_batch_size: 2 | global_step: 5 | reduced_train_loss: 4.672 | val_loss: 4.711
Training epoch 0, iteration 6/9 | lr: 1.2e-05 | consumed_samples: 14 | global_batch_size: 2 | global_step: 6 | reduced_train_loss: 4.7 | val_loss: 4.711
Training epoch 0, iteration 7/9 | lr: 1.4e-05 | consumed_samples: 16 | global_batch_size: 2 | global_step: 7 | reduced_train_loss: 4.621 | val_loss: 4.711
Training epoch 0, iteration 8/9 | lr: 1.6e-05 | consumed_samples: 18 | global_batch_size: 2 | global_step: 8 | reduced_train_loss: 4.532 | val_loss: 4.711
Validation: iteration 1/2
Validation: iteration 2/2
Training epoch 0, iteration 9/9 | lr: 1.8e-05 | consumed_samples: 20 | global_batch_size: 2 | global_step: 9 | reduced_train_loss: 4.497 | val_loss: 4.544
`Trainer.fit` stopped: `max_steps=10` reached.

CONTINUOUS OUTPUT:

Sanity checking Validation: iteration 1/2
Sanity checking Validation: iteration 2/2
Training epoch 0, iteration 0/9 | lr: 0 | global_batch_size: 2 | global_step: 0 | reduced_train_loss: 4.87
Training epoch 0, iteration 1/9 | lr: 2e-06 | global_batch_size: 2 | global_step: 1 | reduced_train_loss: 4.771 | consumed_samples: 4
Training epoch 0, iteration 2/9 | lr: 4e-06 | global_batch_size: 2 | global_step: 2 | reduced_train_loss: 4.87 | consumed_samples: 6
Training epoch 0, iteration 3/9 | lr: 6e-06 | global_batch_size: 2 | global_step: 3 | reduced_train_loss: 4.783 | consumed_samples: 8
Validation: iteration 1/2
Validation: iteration 2/2
Training epoch 0, iteration 4/9 | lr: 8e-06 | global_batch_size: 2 | global_step: 4 | reduced_train_loss: 4.785 | consumed_samples: 10 | val_loss: 4.758
Training epoch 0, iteration 5/9 | lr: 1e-05 | global_batch_size: 2 | global_step: 5 | reduced_train_loss: 4.672 | consumed_samples: 12 | val_loss: 4.758
Training epoch 0, iteration 6/9 | lr: 1.2e-05 | global_batch_size: 2 | global_step: 6 | reduced_train_loss: 4.7 | consumed_samples: 14 | val_loss: 4.758
Training epoch 0, iteration 7/9 | lr: 1.4e-05 | global_batch_size: 2 | global_step: 7 | reduced_train_loss: 4.621 | consumed_samples: 16 | val_loss: 4.758
Validation: iteration 1/2
Validation: iteration 2/2
Training epoch 0, iteration 8/9 | lr: 1.6e-05 | global_batch_size: 2 | global_step: 8 | reduced_train_loss: 4.532 | consumed_samples: 18 | val_loss: 4.573
Training epoch 0, iteration 9/9 | lr: 1.8e-05 | global_batch_size: 2 | global_step: 9 | reduced_train_loss: 4.497 | consumed_samples: 20 | val_loss: 4.573
Validation: iteration 1/2
Validation: iteration 2/2
`Trainer.fit` stopped: `max_steps=10` reached.
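The consistency the test asserts can be checked directly against the log excerpts above: at every `global_step` shared by the RESUME and CONTINUOUS runs, `reduced_train_loss` matches exactly, even though `val_loss` appears at different steps. A hypothetical helper (not part of this PR) that extracts losses keyed by step:

```python
import re

def losses_by_step(log: str) -> dict:
    """Map global_step -> reduced_train_loss for each training line in a log."""
    out = {}
    for line in log.splitlines():
        step = re.search(r"global_step: (\d+)", line)
        loss = re.search(r"reduced_train_loss: ([\d.]+)", line)
        if step and loss:
            out[int(step.group(1))] = float(loss.group(1))
    return out

# Two lines from each excerpt above, verbatim (field order differs between runs).
resume_log = """\
Training epoch 0, iteration 3/9 | lr: 6e-06 | consumed_samples: 8 | global_batch_size: 2 | global_step: 3 | reduced_train_loss: 4.783
Training epoch 0, iteration 4/9 | lr: 8e-06 | consumed_samples: 10 | global_batch_size: 2 | global_step: 4 | reduced_train_loss: 4.785
"""
continuous_log = """\
Training epoch 0, iteration 3/9 | lr: 6e-06 | global_batch_size: 2 | global_step: 3 | reduced_train_loss: 4.783 | consumed_samples: 8
Training epoch 0, iteration 4/9 | lr: 8e-06 | global_batch_size: 2 | global_step: 4 | reduced_train_loss: 4.785 | consumed_samples: 10
"""

resumed = losses_by_step(resume_log)
continuous = losses_by_step(continuous_log)
shared = resumed.keys() & continuous.keys()
# Train losses agree at every shared step; only validation timing differs.
assert all(resumed[s] == continuous[s] for s in shared)
```

Keying on `global_step` rather than line position makes the comparison robust to the resumed run starting mid-epoch and to validation lines interleaving at different points.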


@pstjohn pstjohn force-pushed the pstjohn/stop-and-go-test-non-validation branch 3 times, most recently from c5deefd to bbf2bd7 Compare December 2, 2024 22:44
@pstjohn pstjohn marked this pull request as ready for review December 2, 2024 22:44

pstjohn commented Dec 2, 2024

/build-ci


@sichu2023 sichu2023 left a comment


Good to go. Only some comments to help me understand it better.

@pstjohn pstjohn force-pushed the pstjohn/stop-and-go-test-non-validation branch from e8e4853 to 364659c Compare December 4, 2024 16:05
@pstjohn pstjohn force-pushed the pstjohn/stop-and-go-test-non-validation branch from 364659c to 117e00c Compare December 4, 2024 16:59

pstjohn commented Dec 4, 2024

/build-ci

@pstjohn pstjohn enabled auto-merge (squash) December 4, 2024 17:01

pstjohn commented Dec 4, 2024

/build-ci

@pstjohn pstjohn merged commit 38be873 into NVIDIA:main Dec 4, 2024
4 checks passed