Distributed checkpointing user guide #9494
Conversation
# Distributed checkpoint save
sharded_state_dict = {
    'weight': dist_checkpointing.ShardedTensor.from_rank_offsets('weight', local_ten, (0, rank, world_size))
Can you add a comment explaining `(0, rank, world_size)`?
done
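For reference, a minimal commented sketch of this call (assuming MCore's `from_rank_offsets(key, data, *rank_offsets)` signature, where each offset triple is `(axis, axis_rank_offset, axis_fragmentations)`; the tensor shape is made up):

    import torch
    from megatron.core import dist_checkpointing

    # Assumes torch.distributed is already initialized.
    rank = torch.distributed.get_rank()
    world_size = torch.distributed.get_world_size()
    local_ten = torch.ones(10, 128)  # this rank's local shard (hypothetical shape)

    sharded_state_dict = {
        'weight': dist_checkpointing.ShardedTensor.from_rank_offsets(
            'weight',    # key identifying the tensor in the checkpoint
            local_ten,   # local shard held by this rank
            # (axis, axis_rank_offset, axis_fragmentations): the global tensor is split
            # into `world_size` chunks along axis 0, and this rank holds chunk number `rank`.
            (0, rank, world_size),
        )
    }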
4. All other objects are treated as "common" and saved according to a common strategy (see `Save and load strategies`_)
5. All ShardedObjects are extracted from point (3) objects and saved with a common strategy (see `Save and load strategies`_)
6. All ShardedTensors are saved with a sharded strategy.
7. A `metadata.json` file with backend and version metadata is saved to the checkpoint directory.
Maybe put a link to the source code here to show where those steps happen.
added a link to the MCore docs, does it make sense?
https://docs.nvidia.com/megatron-core/developer-guide/latest/api-guide/dist_checkpointing.html#module-core.dist_checkpointing.serialization
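For illustration, a minimal save call that triggers the steps above (a sketch assuming the `dist_checkpointing.save` entrypoint described in the linked MCore docs; the directory path is hypothetical and should point at an existing, empty directory):

    import os
    from megatron.core import dist_checkpointing

    ckpt_dir = '/tmp/dist_ckpt_example'  # hypothetical checkpoint directory
    os.makedirs(ckpt_dir, exist_ok=True)

    # `sharded_state_dict` is a state dict populated with ShardedTensors/ShardedObjects.
    # The call extracts common objects, ShardedObjects and ShardedTensors, saves them
    # with the corresponding strategies, and writes metadata.json to `ckpt_dir`.
    dist_checkpointing.save(sharded_state_dict, ckpt_dir)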
The sharded state dict is processed in the following way:

1. The ShardedTensorFactories are applied
2. LocalNonPersistentObjects are extracted from the sharded state dict and ignored
`LocalNonPersistentObject` wasn't explained. What's this?
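For context, a minimal sketch of how such a wrapper might appear in a sharded state dict (the exact class name and import path below are assumptions based on the docs text and may differ between MCore versions):

    from megatron.core import dist_checkpointing
    # Assumed import path/spelling - check against the installed MCore version.
    from megatron.core.dist_checkpointing.mapping import LocalNonPersistentObject

    sharded_state_dict = {
        # Saved to the checkpoint as a regular ShardedTensor.
        'weight': dist_checkpointing.ShardedTensor.from_rank_offsets(
            'weight', local_ten, (0, rank, world_size)
        ),
        # The wrapped value is kept only in the local state dict: it is dropped during
        # save, and its local value is used during load instead of being read from disk.
        'rng_state': LocalNonPersistentObject(some_local_value),  # `some_local_value` is hypothetical
    }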
return dist_checkpointing.load(sharded_state_dict, ckpt_dir, fully_parallel_load_strategy)

The `dist_checkpointing` package provides default strategies for some sharded backends, so it's enough to specify a tuple `(backend, version)` as a saving strategy.
- Explain what backends and versions are here?
from megatron.core.dist_checkpointing.strategies.torch import TorchDistLoadShardedStrategy, TorchDistSaveShardedStrategy
from megatron.core.dist_checkpointing.strategies.fully_parallel import FullyParallelLoadStrategyWrapper, FullyParallelSaveStrategyWrapper

base_save_strategy = TorchDistSaveShardedStrategy('torch_dist', 1)
add comments
done
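As a usage sketch, the base strategy can be wrapped for fully parallel saving and passed to the save entrypoint (assuming the wrapper takes the base strategy as its first argument; the directory path is hypothetical):

    from megatron.core import dist_checkpointing
    from megatron.core.dist_checkpointing.strategies.torch import TorchDistSaveShardedStrategy
    from megatron.core.dist_checkpointing.strategies.fully_parallel import FullyParallelSaveStrategyWrapper

    # backend='torch_dist', version=1 (recorded in metadata.json at save time)
    base_save_strategy = TorchDistSaveShardedStrategy('torch_dist', 1)

    # Distribute the shard-saving work across ranks instead of duplicating it on each rank.
    fully_parallel_save_strategy = FullyParallelSaveStrategyWrapper(base_save_strategy)

    dist_checkpointing.save(sharded_state_dict, '/tmp/ckpt', sharded_strategy=fully_parallel_save_strategy)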
The `dist_checkpointing` package provides default strategies for some sharded backends, so it's enough to specify a tuple `(backend, version)` as a saving strategy.
Backends and versions are stored in a `metadata.json` file inside the checkpoint so that the loading strategy can be determined automatically (provided that there exists a default loading strategy for a given backend and version).

For "sharded" strategies, currently the backends supported by default are based on `torch.distributed.checkpoint` format (`torch_dist` backend) and Zarr format (`zarr` backend).
add a bit more to explain the difference?
added (also changed the order of appearance in intro.rst)
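For illustration, a sketch of specifying the strategy as a `(backend, version)` tuple on save and relying on `metadata.json` to pick the loading strategy (paths are hypothetical):

    from megatron.core import dist_checkpointing

    # Save with the default strategy for the torch_dist backend, version 1.
    dist_checkpointing.save(sharded_state_dict, '/tmp/ckpt', sharded_strategy=('torch_dist', 1))

    # No strategy given on load: the backend/version recorded in metadata.json
    # determine the default loading strategy automatically.
    loaded_state_dict = dist_checkpointing.load(sharded_state_dict, '/tmp/ckpt')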
Note: in order to reuse model ShardedTensors to create optimizer ShardedTensors, the model **ShardedTensors must wrap model parameters**, not just tensors
(obtaining a state dict with model parameters can be achieved by passing `keep_vars=True` to the model `state_dict` function).
Otherwise the correspondence between model ShardedTensors and optimizer states is impossible to recreate.
This is the reason for introducing ShardedTensorFactories - we have to register the original model parameter as `ShardedTensorFactory.data` and apply any subsequent transformations as a factory function in order to make sure that the same transformation can be applied to the optimizer states.
Show an example from the MCore source code if there is any.
added
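A minimal sketch of the `keep_vars=True` point above (standard PyTorch behavior, shown on a toy module):

    import torch

    model = torch.nn.Linear(4, 4)

    detached = model.state_dict()                    # values are detached tensors
    with_params = model.state_dict(keep_vars=True)   # values are the nn.Parameter objects themselves

    print(with_params['weight'] is model.weight)     # True - parameter identity is preserved,
    # which is what allows optimizer states to be matched to the same parameter objects.
    print(detached['weight'] is model.weight)        # False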
Extra flattening comes with an efficiency challenge during checkpoint resharding.
Since flattening is applied after the global tensor is sharded into the grid of local chunks, loading after resharding requires accessing non-contiguous data fragments.
An example solution for that is implemented in the `dist_checkpointing/strategies/resharding.py` module and involves saving the flattened tensor with a different global shape than the original one.
Please use the GitHub path.
done
* - 3
  - [5, 9]
* - 5
  - [10, 11]
Why does DP affect the local shards?
DistOpt shards flattened tensors by DP (added in docs)
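As a purely illustrative sketch of why resharding flattened, DP-sharded tensors touches non-contiguous fragments (toy sizes, not the MCore implementation):

    import torch

    # Global 4x4 tensor, sharded along axis 1 into two local chunks (e.g. TP=2).
    global_ten = torch.arange(16).reshape(4, 4)
    local_chunk = global_ten[:, :2]     # TP rank 0 owns columns 0-1, shape (4, 2)

    # The distributed optimizer flattens the *local* chunk and splits it across DP ranks.
    flat = local_chunk.flatten()        # values [0, 1, 4, 5, 8, 9, 12, 13]
    dp0, dp1 = flat.chunk(2)            # DP rank 0: [0, 1, 4, 5]; DP rank 1: [8, 9, 12, 13]

    # After resharding to TP=1, the new local chunk is the whole tensor flattened
    # row-major as [0, 1, 2, ..., 15]. The saved DP-rank-0 shard corresponds to
    # global positions {0, 1, 4, 5} - two separate ranges - so loading it requires
    # reading non-contiguous fragments of the checkpointed data.
    print(dp0.tolist())  # [0, 1, 4, 5]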
Thanks for the reviews!
@jgerh I applied all your suggestions without changes in this commit and one non-trivial change in this commit.
@yaoyu-33 I addressed your suggestions in this and this commit. @jgerh those commits add some more text, can you take a look at them as well?
Note that the `key` doesn't have to (and usually doesn't) correspond to the key in the state dict.
The key in the state dict is ephemeral, while the `ShardedTensor.key` is used to identify the tensor in the checkpoint.

Example:
Done in this commit: e8e78da
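An illustrative sketch of that distinction (the dict key and tensor name below are made up):

    from megatron.core import dist_checkpointing

    sharded_state_dict = {
        # 'decoder_weight' is only the ephemeral dict key used while building the state dict;
        # 'decoder.layers.0.weight' is the ShardedTensor.key that identifies the tensor
        # inside the checkpoint and must match between save and load.
        'decoder_weight': dist_checkpointing.ShardedTensor.from_rank_offsets(
            'decoder.layers.0.weight', local_ten, (0, rank, world_size)
        ),
    }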