Commit
Optionally disable logging in the data sampler to support predict_step (NVIDIA#10127)

* Resolve merge conflicts with consumed sample logging

Signed-off-by: John St John <jstjohn@nvidia.com>

* Add test file that captures the predict step error

Signed-off-by: John St John <jstjohn@nvidia.com>

* Add fixme comment around proper checkpoint nemo2 handling

Signed-off-by: John St John <jstjohn@nvidia.com>

* Skip megatron training test on CPU nodes

Signed-off-by: John St John <jstjohn@nvidia.com>

* Move output_log to last arg for compatibility

Signed-off-by: John St John <jstjohn@nvidia.com>

* try setting the default root dir in predict to avoid writing artifacts to cwd

Signed-off-by: John St John <jstjohn@nvidia.com>

* Handle the new check for batch samplers to enable predict_step

Signed-off-by: John St John <jstjohn@nvidia.com>

* Only reset the global microbatch, not entire parallel state

Signed-off-by: John St John <jstjohn@nvidia.com>

* Destroy the right sets of state in test of lightning trainer

Signed-off-by: John St John <jstjohn@nvidia.com>

* Fix typo and rename state resetting functions

Signed-off-by: John St John <jstjohn@nvidia.com>

* Run test in a subprocess to avoid contaminating global state

Signed-off-by: John St John <jstjohn@nvidia.com>

---------

Signed-off-by: John St John <jstjohn@nvidia.com>