Allows non-strict load with distributed checkpoints #9613

Merged
mikolajblaz merged 16 commits into r2.0.0rc1 from mblaz/non-strict-load
Jul 12, 2024

Collaborator

@mikolajblaz mikolajblaz commented Jul 4, 2024

What does this PR do?

With distributed checkpoints, mismatches between the runtime model and the checkpointed model surface during dist_checkpoint.load, not during model.load_state_dict as with regular checkpoints.
This PR adds a flag for adjusting load strictness (e.g. to ignore unexpected keys).

This PR relies on an MCore feature that is not yet merged (merge ETA: 5 July): https://gitlab-master.nvidia.com/ADLR/megatron-lm/-/merge_requests/1628

Collection: NLP

Changelog

  • Add the model.dist_ckpt_load_strictness flag to control distributed checkpoint load strictness. The most useful value is log_all, which warns about all mismatches but still loads the subset of the state dict that matches the checkpoint.

Usage

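The behavior described for log_all (warn about every mismatch, then load the matching subset) can be illustrated with a small, self-contained sketch. This is a toy model over plain dictionaries, not the NeMo or MCore implementation: the function `load_matching_subset`, its return values, and the `"raise_all"` mode used for contrast are all assumptions made for illustration. Only the `log_all` value comes from this PR.

```python
import logging
from typing import Any, Dict, Set, Tuple

logger = logging.getLogger("non_strict_load_sketch")


def load_matching_subset(
    runtime_state: Dict[str, Any],
    checkpoint_state: Dict[str, Any],
    strictness: str = "log_all",
) -> Tuple[Dict[str, Any], Set[str], Set[str]]:
    """Toy model of a non-strict load (hypothetical helper, not a library API).

    Returns (loaded_state, only_in_model, only_in_checkpoint).
    """
    model_keys = set(runtime_state)
    ckpt_keys = set(checkpoint_state)
    only_in_model = model_keys - ckpt_keys       # model has them, checkpoint does not
    only_in_checkpoint = ckpt_keys - model_keys  # checkpoint has them, model does not

    if strictness == "raise_all":  # assumed name for a fully strict mode
        if only_in_model or only_in_checkpoint:
            raise KeyError(
                f"only in model: {sorted(only_in_model)}, "
                f"only in checkpoint: {sorted(only_in_checkpoint)}"
            )
    elif strictness == "log_all":
        if only_in_model:
            logger.warning("Keys absent from checkpoint: %s", sorted(only_in_model))
        if only_in_checkpoint:
            logger.warning("Keys absent from model: %s", sorted(only_in_checkpoint))
    else:
        raise ValueError(f"Unknown strictness: {strictness!r}")

    # Keep runtime values and overwrite only the keys present on both sides.
    loaded = dict(runtime_state)
    for key in model_keys & ckpt_keys:
        loaded[key] = checkpoint_state[key]
    return loaded, only_in_model, only_in_checkpoint


# Example: the model gained a new head and the checkpoint still carries an old one.
runtime = {"embedding.weight": 0.0, "decoder.weight": 0.0, "new_head.bias": 0.0}
checkpoint = {"embedding.weight": 1.5, "decoder.weight": 2.5, "old_head.bias": 9.0}
loaded, only_in_model, only_in_checkpoint = load_matching_subset(
    runtime, checkpoint, strictness="log_all"
)
```

In NeMo itself, the strictness is selected through the model.dist_ckpt_load_strictness config flag (for example, by setting it to log_all) rather than through a helper like the one above.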

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI, remove the label and add it again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs in various areas.

Additional Information

  • Related to # (issue)

@mikolajblaz mikolajblaz self-assigned this Jul 4, 2024
@github-actions github-actions bot added the NLP label Jul 4, 2024
Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>
@dimapihtar dimapihtar self-requested a review July 12, 2024 10:19
Collaborator

@dimapihtar dimapihtar left a comment

LGTM. Thank you!

@mikolajblaz mikolajblaz merged commit 4239511 into r2.0.0rc1 Jul 12, 2024
217 of 219 checks passed
@mikolajblaz mikolajblaz deleted the mblaz/non-strict-load branch July 12, 2024 13:12
github-actions bot pushed a commit that referenced this pull request Jul 12, 2024
* Allow non-strict load

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Point to non-strict load MCore branch

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Avoid module level StrictHandling

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Use MCore fork

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Update to MCore fix

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Restore backward compatibility

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Update flag defaults

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Update MCore tag

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Update PyT Dist interface

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Update to latest core_r0.8.0

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

---------

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>
mikolajblaz added a commit that referenced this pull request Jul 12, 2024
Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>
mikolajblaz added a commit that referenced this pull request Jul 12, 2024
Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>
Co-authored-by: mikolajblaz <mikolajblaz@users.noreply.github.com>
nikitaved pushed a commit to nikitaved/NeMo that referenced this pull request Jul 16, 2024
…IDIA#9715)

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>
Co-authored-by: mikolajblaz <mikolajblaz@users.noreply.github.com>
@ko3n1g ko3n1g mentioned this pull request Jul 18, 2024
2 tasks
ertkonuk pushed a commit that referenced this pull request Jul 19, 2024
Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>
Co-authored-by: mikolajblaz <mikolajblaz@users.noreply.github.com>
Signed-off-by: Tugrul Konuk <ertkonuk@gmail.com>
malay-nagda pushed a commit to malay-nagda/NeMo that referenced this pull request Jul 26, 2024
…IDIA#9715)

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>
Co-authored-by: mikolajblaz <mikolajblaz@users.noreply.github.com>
Signed-off-by: Malay Nagda <malayn@malayn-mlt.client.nvidia.com>
tonyjie pushed a commit to tonyjie/NeMo that referenced this pull request Aug 6, 2024
…IDIA#9715)

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>
Co-authored-by: mikolajblaz <mikolajblaz@users.noreply.github.com>
Signed-off-by: tonyjie <jl4257@cornell.edu>
dimapihtar pushed a commit that referenced this pull request Aug 27, 2024
Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>
monica-sekoyan pushed a commit that referenced this pull request Oct 14, 2024
Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>
Co-authored-by: mikolajblaz <mikolajblaz@users.noreply.github.com>
hainan-xv pushed a commit to hainan-xv/NeMo that referenced this pull request Nov 5, 2024
…IDIA#9715)

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>
Co-authored-by: mikolajblaz <mikolajblaz@users.noreply.github.com>
Signed-off-by: Hainan Xu <hainanx@nvidia.com>