Optionally disable logging in the data sampler to support predict_step #10127

Merged 11 commits on Aug 21, 2024

Commits on Aug 21, 2024

  1. Resolve merge conflicts with consumed sample logging

    Signed-off-by: John St John <jstjohn@nvidia.com>
    jstjohn committed Aug 21, 2024
    9fa0364
  2. Add test file that captures the predict step error

    Signed-off-by: John St John <jstjohn@nvidia.com>
    jstjohn committed Aug 21, 2024
    6374c2e
  3. Add fixme comment around proper checkpoint nemo2 handling

    Signed-off-by: John St John <jstjohn@nvidia.com>
    jstjohn committed Aug 21, 2024
    c6c93ac
  4. Skip megatron training test on CPU nodes

    Signed-off-by: John St John <jstjohn@nvidia.com>
    jstjohn committed Aug 21, 2024
    7058723
  5. Move output_log to last arg for compatibility

    Signed-off-by: John St John <jstjohn@nvidia.com>
    jstjohn committed Aug 21, 2024
    e391c72
  6. try setting the default root dir in predict to avoid writing artifacts to cwd

    Signed-off-by: John St John <jstjohn@nvidia.com>
    jstjohn committed Aug 21, 2024
    9997393
  7. Handle the new check for batch samplers to enable predict_step

    Signed-off-by: John St John <jstjohn@nvidia.com>
    jstjohn committed Aug 21, 2024
    3720193
  8. Only reset the global microbatch, not entire parallel state

    Signed-off-by: John St John <jstjohn@nvidia.com>
    jstjohn committed Aug 21, 2024
    70fe6fa
  9. Destroy the right sets of state in test of lightning trainer

    Signed-off-by: John St John <jstjohn@nvidia.com>
    jstjohn committed Aug 21, 2024
    8c1ea86
  10. Fix typo and rename state resetting functions

    Signed-off-by: John St John <jstjohn@nvidia.com>
    jstjohn committed Aug 21, 2024
    dfdf426
  11. Run test in a subprocess to avoid contaminating global state

    Signed-off-by: John St John <jstjohn@nvidia.com>
    jstjohn committed Aug 21, 2024
    a6ff157