Distributed checkpointing user guide #9494

Merged: 10 commits merged into yuya/add_checkpoints_section from mblaz/docs-dist-ckpt on Jul 11, 2024

Conversation

mikolajblaz (Collaborator)

What does this PR do?

Add a one line overview of what this PR aims to accomplish.

Collection: [Note which collection this PR will affect]

Changelog

  • Add specific line by line info of high level changes in this PR.

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this 

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI, remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • Make sure you have read and followed the Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (e.g., Numba, Pynini, Apex, etc.)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines contain specific people who can review PRs in various areas.

Additional Information

  • Related to # (issue)


# Distributed checkpoint save
sharded_state_dict = {
'weight': dist_checkpointing.ShardedTensor.from_rank_offsets('weight', local_ten, (0, rank, world_size))

Collaborator: Can you add a comment explaining `(0, rank, world_size)`?

Collaborator Author: done
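
For reference, a minimal sketch of how the resolved snippet might read once the requested comment is added (the exact wording used in the PR is not reproduced here); in `from_rank_offsets`, each offset tuple is read as (axis, this rank's chunk index along that axis, total number of chunks along that axis):

import torch
from megatron.core import dist_checkpointing

# Assumes torch.distributed is already initialized.
rank = torch.distributed.get_rank()
world_size = torch.distributed.get_world_size()
local_ten = torch.ones(3, 10)  # this rank's local chunk

sharded_state_dict = {
    # (0, rank, world_size): the global 'weight' tensor is sharded along axis 0,
    # this rank holds chunk number `rank` out of `world_size` chunks,
    # so the global shape is (3 * world_size, 10).
    'weight': dist_checkpointing.ShardedTensor.from_rank_offsets(
        'weight', local_ten, (0, rank, world_size)
    )
}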

4. All other objects are treated as "common" and saved according to a common strategy (see `Save and load strategies`_)
5. All ShardedObjects are extracted from point (3) objects and saved with a common strategy (see `Save and load strategies`_)
6. All ShardedTensors are saved.
7. `metadata.json` file with backend and version metadata is saved to the checkpoint directory.

Collaborator: maybe put a link to the source code here to show where those steps happen

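For orientation, a minimal sketch of the entrypoint that performs the steps listed above, reusing the `sharded_state_dict` from the earlier sketch (`ckpt_dir` is a hypothetical checkpoint directory; exact directory requirements should be checked against `megatron.core.dist_checkpointing`):

from megatron.core import dist_checkpointing

ckpt_dir = '/path/to/checkpoint_dir'  # hypothetical path

# `save` runs the steps above: apply factories, split off "common" objects,
# save ShardedObjects and ShardedTensors, then write metadata.json.
dist_checkpointing.save(sharded_state_dict, ckpt_dir)

# Loading mirrors the save: the sharded state dict describes which tensors
# to read and where each local shard should be placed.
state_dict = dist_checkpointing.load(sharded_state_dict, ckpt_dir)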
The sharded state dict is processed in the following way:

1. The ShardedTensorFactories are applied
2. LocalNonPersistentObjects are extracted from the sharded state dict and ignored

Collaborator: LocalNonPersistentObject wasn't explained. What's this?

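For illustration only, a hedged sketch of how such a wrapper is typically used, reusing the names from the first sketch above; the class name follows the spelling in the doc text, and the exact name and import path should be verified against `megatron.core.dist_checkpointing`:

import torch
# Hypothetical import path -- verify against megatron.core.dist_checkpointing.
from megatron.core.dist_checkpointing.mapping import LocalNonPersistentObject

local_rng_state = torch.get_rng_state()

sharded_state_dict = {
    'weight': dist_checkpointing.ShardedTensor.from_rank_offsets(
        'weight', local_ten, (0, rank, world_size)
    ),
    # Per step 2 above: kept in the local state dict, but extracted and
    # ignored during save, so it is never written to the checkpoint.
    'rng_state': LocalNonPersistentObject(local_rng_state),
}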
return dist_checkpointing.load(sharded_state_dict, ckpt_dir, fully_parallel_load_strategy)


The `dist_checkpointing` package provides default strategies for some sharded backends, so it's enough to specify a tuple `(backend, version)` as a saving strategy.

Collaborator: Explain what backends and versions are here?

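To make the `(backend, version)` idea concrete, a hedged sketch reusing `sharded_state_dict` and `ckpt_dir` from above (assuming `save` accepts the tuple form via its `sharded_strategy` argument, as the surrounding text describes):

from megatron.core import dist_checkpointing

# Backend 'torch_dist', format version 1. Both values are recorded in
# metadata.json so that a matching default strategy can be chosen on load.
dist_checkpointing.save(sharded_state_dict, ckpt_dir, sharded_strategy=('torch_dist', 1))

# No strategy passed here: the backend/version stored in metadata.json
# selects the default loading strategy automatically.
state_dict = dist_checkpointing.load(sharded_state_dict, ckpt_dir)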
from megatron.core.dist_checkpointing.strategies.torch import TorchDistLoadShardedStrategy, TorchDistSaveShardedStrategy
from megatron.core.dist_checkpointing.strategies.fully_parallel import FullyParallelLoadStrategyWrapper, FullyParallelSaveStrategyWrapper

base_save_strategy = TorchDistSaveShardedStrategy('torch_dist', 1)

Collaborator: add comments

Collaborator Author: done
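
For reference, a hedged sketch of what the commented version of this snippet might look like (the no-argument load-strategy constructor and the wrapper arguments are assumptions to verify against megatron.core):

from megatron.core.dist_checkpointing.strategies.torch import TorchDistLoadShardedStrategy, TorchDistSaveShardedStrategy
from megatron.core.dist_checkpointing.strategies.fully_parallel import FullyParallelLoadStrategyWrapper, FullyParallelSaveStrategyWrapper

# Base strategy: save ShardedTensors with the `torch_dist` backend, format version 1.
base_save_strategy = TorchDistSaveShardedStrategy('torch_dist', 1)
# Wrapping the base strategy distributes the saving work across ranks,
# so each rank writes only a subset of the shards.
fully_parallel_save_strategy = FullyParallelSaveStrategyWrapper(base_save_strategy)

# Same idea for loading: reads are parallelized across ranks and the
# loaded shards are exchanged afterwards.
base_load_strategy = TorchDistLoadShardedStrategy()
fully_parallel_load_strategy = FullyParallelLoadStrategyWrapper(base_load_strategy)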

The `dist_checkpointing` package provides default strategies for some sharded backends, so it's enough to specify a tuple `(backend, version)` as a saving strategy.
Backends and versions are stored in a `metadata.json` file inside the checkpoint so that the loading strategy can be determined automatically (provided that there exists a default loading strategy for a given backend and version).

For "sharded" strategies, currently the backends supported by default are based on `torch.distributed.checkpoint` format (`torch_dist` backend) and Zarr format (`zarr` backend).

Collaborator: add a bit more to explain the difference?

Collaborator Author: added (also changed the order of appearance in intro.rst)

Note: in order to reuse model ShardedTensors to create optimizer ShardedTensors, the model **ShardedTensors must wrap model parameters**, not just tensors
(obtaining a state dict with model parameters can be achieved by passing `keep_vars=True` to the model `state_dict` function).
Otherwise the correspondence between model ShardedTensors and optimizer states is impossible to recreate.
This is the reason for introducing ShardedTensorFactories - we have to register the original model parameter as `ShardedTensorFactory.data` and apply any subsequent transformations as a factory function in order to make sure that the same transformation can be applied to the optimizer states.

Collaborator: show an example from the mcore source code if there is any

Collaborator Author: added
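
For illustration, a small self-contained sketch of the `keep_vars=True` behaviour this note relies on (plain PyTorch, not mcore code): only with `keep_vars=True` does the state dict hold the actual parameter objects, so ShardedTensors built from it can later be matched to the optimizer states of those same parameters.

import torch

model = torch.nn.Linear(16, 16)

plain_sd = model.state_dict()                # values are detached views, not the nn.Parameter objects
param_sd = model.state_dict(keep_vars=True)  # values are the nn.Parameter objects themselves

assert plain_sd['weight'] is not model.weight   # parameter identity lost
assert param_sd['weight'] is model.weight       # parameter identity preserved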


Extra flattening comes with an efficiency challenge during checkpoint resharding.
Since flattening is applied after the global tensor is sharded into the grid of local chunks, loading after resharding requires accessing non-contiguous data fragments.
An example solution for that is implemented in the `dist_checkpointing/strategies/resharding.py` module and involves saving the flattened tensor with a different global shape than the original one.

Collaborator: please use the GitHub path

Collaborator Author: done

* - 3
- [5, 9]
* - 5
- [10, 11]

Collaborator: why does DP affect the local shards?

Collaborator Author: DistOpt shards flattened tensors by DP (added in docs)
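
To illustrate why data parallelism enters the picture, a small self-contained sketch (not mcore code) of distributed-optimizer style sharding: a parameter is flattened first and only then split into contiguous, near-equal ranges across DP ranks, which is what the [start, end] ranges in the table above correspond to.

import torch

dp_size = 4                 # number of data-parallel ranks (illustrative)
param = torch.arange(22.)   # a parameter, already flattened to 1-D

# Each DP rank owns one contiguous range of the flattened buffer.
per_rank = (param.numel() + dp_size - 1) // dp_size
for dp_rank in range(dp_size):
    start = dp_rank * per_rank
    end = min(start + per_rank, param.numel())
    print(f"DP rank {dp_rank} owns flat indices [{start}, {end - 1}]")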

@mikolajblaz changed the title from "Mblaz/docs dist ckpt" to "Distributed checkpointing user guide" on Jun 28, 2024
Collaborator Author (@mikolajblaz) left a comment:

Thanks for the reviews!

@jgerh I applied all your suggestions without changes in this commit and made one non-trivial change in this commit.

@yaoyu-33 I addressed your suggestions in this and this commit. @jgerh, those commits add some more text; can you take a look at them as well?

Note that the `key` doesn't have to (and usually doesn't) correspond to the key in the state dict.
The key in the state dict is ephemeral, while the `ShardedTensor.key` is used to identify the tensor in the checkpoint.

Example:

Collaborator Author: Done in this commit: e8e78da
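
A hedged sketch of that distinction, reusing `local_ten`, `rank`, `world_size`, and `ckpt_dir` from the sketches above: the state-dict key chosen by the application may differ between save and load; only `ShardedTensor.key` must match what is stored in the checkpoint.

import torch
from megatron.core import dist_checkpointing

# Saved under checkpoint key 'encoder.weight', even though the state-dict key is 'w'.
save_sd = {'w': dist_checkpointing.ShardedTensor.from_rank_offsets('encoder.weight', local_ten, (0, rank, world_size))}
dist_checkpointing.save(save_sd, ckpt_dir)

# On load, any state-dict key works -- the tensor is located in the checkpoint
# by ShardedTensor.key == 'encoder.weight'.
load_sd = {'renamed': dist_checkpointing.ShardedTensor.from_rank_offsets('encoder.weight', torch.empty_like(local_ten), (0, rank, world_size))}
loaded = dist_checkpointing.load(load_sd, ckpt_dir)
assert 'renamed' in loaded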

@yaoyu-33 merged commit 3003ec0 into yuya/add_checkpoints_section on Jul 11, 2024
10 checks passed
@yaoyu-33 deleted the mblaz/docs-dist-ckpt branch on July 11, 2024 at 16:01
ericharper pushed a commit that referenced this pull request Jul 17, 2024
* Add checkpoints section

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* Fix title

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* update

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* Add section on ".qnemo" checkpoints (#9503)

* Add 'Quantized Checkpoints' section

Signed-off-by: Jan Lasek <janek.lasek@gmail.com>

* Address review comments

Signed-off-by: Jan Lasek <janek.lasek@gmail.com>

---------

Signed-off-by: Jan Lasek <janek.lasek@gmail.com>

* Distributed checkpointing user guide (#9494)

* Describe shardings and entrypoints

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Strategies, optimizers, finalize entrypoints

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Transformations

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Integration

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Add link from intro

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Apply grammar suggestions

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Explain the example

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Apply review suggestions

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Add zarr and torch_dist explanation

---------

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* add subsection

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* Update docs/source/checkpoints/intro.rst

Co-authored-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: Yu Yao <54727607+yaoyu-33@users.noreply.github.com>

* address comments

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* fix

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* fix code block

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* address comments

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* formatting

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* fix

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* fix

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

---------

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Signed-off-by: Jan Lasek <janek.lasek@gmail.com>
Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>
Signed-off-by: Yu Yao <54727607+yaoyu-33@users.noreply.github.com>
Co-authored-by: Jan Lasek <janek.lasek@gmail.com>
Co-authored-by: mikolajblaz <mikolajblaz@users.noreply.github.com>
Co-authored-by: Chen Cui <chcui@nvidia.com>
ertkonuk pushed a commit that referenced this pull request Jul 19, 2024
tonyjie pushed a commit to tonyjie/NeMo that referenced this pull request Jul 24, 2024
akoumpa pushed a commit that referenced this pull request Jul 25, 2024
malay-nagda pushed a commit to malay-nagda/NeMo that referenced this pull request Jul 26, 2024
monica-sekoyan pushed a commit that referenced this pull request Oct 14, 2024
hainan-xv pushed a commit to hainan-xv/NeMo that referenced this pull request Nov 5, 2024