Add section on ".qnemo" checkpoints #9503

Merged

Conversation

janekl
Collaborator

@janekl janekl commented Jun 19, 2024

What does this PR do ?

Add section on ".qnemo" checkpoints to #9329.

Collection: [Note which collection this PR will affect]

Changelog

  • Add specific line by line info of high level changes in this PR.

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this 

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.

Additional Information

  • Related to # (issue)

Signed-off-by: Jan Lasek <janek.lasek@gmail.com>

NeMo also offers a :doc:`Post-Training Quantization <../nlp/quantization>` workflow that converts regular ``.nemo`` models into a `TensorRT-LLM checkpoint <https://nvidia.github.io/TensorRT-LLM/architecture/checkpoint.html>`_, conventionally referred to in NeMo as a ``.qnemo`` checkpoint. Such a checkpoint can be used with the `NVIDIA TensorRT-LLM library <https://nvidia.github.io/TensorRT-LLM/index.html>`_ for efficient inference.

Much like a ``.nemo`` checkpoint, a ``.qnemo`` checkpoint is a tar file. It bundles the model configuration, given in the ``config.json`` file, with ``rank{i}.safetensors`` files that store the model weights for each rank separately. Additionally, a ``tokenizer_config.yaml`` file is saved; it is simply the ``tokenizer`` section of the ``model_config.yaml`` file from the original NeMo model and defines the tokenizer for the given model.
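
As a rough illustration of the layout described above, the following sketch builds a tiny stand-in archive and groups its members with Python's standard `tarfile` module. The `summarize_qnemo` and `build_demo_archive` helpers, the `demo.qnemo` name, and all file contents are made up for illustration; only the member names (`config.json`, `rank{i}.safetensors`, `tokenizer_config.yaml`) come from the text.

```python
import io
import json
import re
import tarfile
import tempfile
from pathlib import Path


def summarize_qnemo(path):
    """Group the members of a .qnemo tar archive by role (hypothetical helper)."""
    summary = {"config": None, "rank_files": [], "tokenizer_config": None, "other": []}
    with tarfile.open(path, "r") as tar:
        for member in tar.getmembers():
            if not member.isfile():
                continue
            name = Path(member.name).name
            if name == "config.json":
                summary["config"] = json.load(tar.extractfile(member))
            elif re.fullmatch(r"rank\d+\.safetensors", name):
                summary["rank_files"].append(name)
            elif name == "tokenizer_config.yaml":
                summary["tokenizer_config"] = name
            else:
                summary["other"].append(name)
    # Sort numerically so that rank10 does not come before rank2.
    summary["rank_files"].sort(key=lambda n: int(re.search(r"\d+", n).group()))
    return summary


def _add_bytes(tar, name, data):
    """Add an in-memory file to an open tar archive."""
    info = tarfile.TarInfo(name)
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))


def build_demo_archive(path, world_size=2):
    """Write a stand-in archive with placeholder contents (not real weights)."""
    with tarfile.open(path, "w") as tar:
        _add_bytes(tar, "config.json", json.dumps({"placeholder": True}).encode())
        for rank in range(world_size):
            _add_bytes(tar, f"rank{rank}.safetensors", b"")
        _add_bytes(tar, "tokenizer_config.yaml", b"placeholder: true\n")
        _add_bytes(tar, "tokenizer.model", b"")


demo_path = Path(tempfile.mkdtemp()) / "demo.qnemo"
build_demo_archive(demo_path, world_size=2)
demo_summary = summarize_qnemo(demo_path)
print(demo_summary["rank_files"])
```

The same `summarize_qnemo` call could be pointed at a real archive to see which ranks it contains, assuming the archive really is a plain tar file as described.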
Collaborator

Would .qnemo not support the distributed checkpoint format? I.e., if you saved with world_size 2, do you have to load with world_size 2?

Collaborator Author

One thing to clarify is that this config.json + rank{i}.safetensors output is a TRT-LLM checkpoint. It should not be confused with a distributed checkpoint in the NeMo sense.

Anyway, the feature you asked for is not currently available in TRT-LLM. So to build a TRT-LLM engine with world_size=2, one needs to calibrate/quantize the model to a TRT-LLM checkpoint with world_size=2 and provide this as the input to the trtllm-build command. In other words, world_size cannot be changed at engine build time.
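
A minimal sketch of a pre-build sanity check that follows from this, assuming the number of rank{i}.safetensors files equals the world_size used at quantization time. The helper names are hypothetical and are not part of NeMo or TRT-LLM.

```python
import re


def infer_world_size(file_names):
    """Infer world size from rank{i}.safetensors names; ranks must be 0..n-1."""
    ranks = sorted(
        int(m.group(1))
        for name in file_names
        if (m := re.fullmatch(r"rank(\d+)\.safetensors", name))
    )
    if ranks != list(range(len(ranks))):
        raise ValueError(f"rank files are not contiguous from 0: {ranks}")
    return len(ranks)


def check_build_world_size(file_names, requested_world_size):
    """Raise if the requested engine world size differs from the checkpoint's."""
    found = infer_world_size(file_names)
    if found != requested_world_size:
        raise ValueError(
            f"checkpoint was exported with world_size={found}, but "
            f"world_size={requested_world_size} was requested; re-quantize instead"
        )
    return found


# Hypothetical listing of a checkpoint quantized with world_size=2.
files = ["config.json", "rank0.safetensors", "rank1.safetensors", "tokenizer_config.yaml"]
print(check_build_world_size(files, 2))  # prints 2
```

Calling `check_build_world_size(files, 4)` on the same listing raises a `ValueError`, which mirrors the constraint that the checkpoint has to be re-quantized for a different world_size.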

@@ -20,6 +20,26 @@ With sharded model weights, you can save and load the state of your training scr

NeMo supports the distributed (sharded) checkpoint format from Megatron-Core. In Megatron-Core, it supports two backends: Zarr-based and PyTorch-based.
Collaborator

@jgerh jgerh Jun 25, 2024

Edits for 1 - 21.

Checkpoints

This section presents the key functionalities of NVIDIA NeMo that pertain to checkpoint management.

Understand Checkpoint Formats

A .nemo checkpoint is essentially a tar file that combines various components of a trained model. These components include the model configurations (specified in a YAML file), the model weights, and other related artifacts such as tokenizer models or vocabulary files. This design simplifies tasks like sharing, loading, tuning, evaluating, and performing inference with the model.

On the other hand, the .ckpt file, generated during PyTorch Lightning training, contains both the model weights and the optimizer states. It is typically used to resume training from a paused state.

Sharded Model Weights

In both .nemo and .ckpt checkpoints, the model weights can be saved in either a regular format (as a single file named model_weights.ckpt within model parallelism folders) or a sharded format (where they are stored in a folder called model_weights).

Sharded model weights allow you to efficiently save and load the state of your training script across multiple GPUs or nodes. This approach avoids the necessity to modify model partitions when resuming tuning with a different model parallelism setup.

NeMo supports the distributed (sharded) checkpoint format from Megatron Core. In Megatron Core, there are two supported backends: Zarr-based and PyTorch-based.
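
The two layouts described above could be told apart with a small check like the one below. This is only a sketch under stated assumptions: it works on an already extracted checkpoint directory, the `detect_weights_format` helper is hypothetical, and the `mp_rank_00` folder name in the demo is an illustrative stand-in for a model parallelism folder.

```python
import tempfile
from pathlib import Path


def detect_weights_format(checkpoint_dir):
    """Return 'sharded', 'regular', or 'unknown' based on the directory layout."""
    root = Path(checkpoint_dir)
    # Sharded format: weights live in a folder called model_weights.
    if any(p.is_dir() for p in root.rglob("model_weights")):
        return "sharded"
    # Regular format: a single model_weights.ckpt file per model parallelism folder.
    if any(p.is_file() for p in root.rglob("model_weights.ckpt")):
        return "regular"
    return "unknown"


# Build two throwaway directories that mimic the layouts (no real weights).
regular_dir = Path(tempfile.mkdtemp())
(regular_dir / "mp_rank_00").mkdir()
(regular_dir / "mp_rank_00" / "model_weights.ckpt").write_bytes(b"")

sharded_dir = Path(tempfile.mkdtemp())
(sharded_dir / "model_weights").mkdir()

empty_dir = Path(tempfile.mkdtemp())

print(detect_weights_format(regular_dir), detect_weights_format(sharded_dir))
```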

Collaborator Author

@yaoyu-33 that would be something for you to account for in the destination branch yuya/add_checkpoints_section

├── rank1.safetensors
├── tokenizer.model
└── tokenizer_config.yaml

Community Checkpoint Converter
Collaborator

@jgerh jgerh Jun 25, 2024

Edits to 45-47

NVIDIA provides easy-to-use tools that enable users to convert community checkpoints into the NeMo format. These tools facilitate various operations, including resuming training, Supervised Fine-tuning (SFT), Parameter Efficient Fine-Tuning (PEFT), and deployment. Please consult our documentation for detailed instructions and guidelines. We provide comprehensive guides to assist both end users and developers.

Collaborator Author

@yaoyu-33 this is also something for you to address in #9329, please have a look

@jgerh jgerh mentioned this pull request Jun 26, 2024
8 tasks
Signed-off-by: Jan Lasek <janek.lasek@gmail.com>
@yaoyu-33 yaoyu-33 merged commit ae1c806 into yuya/add_checkpoints_section Jun 27, 2024
10 checks passed
@yaoyu-33 yaoyu-33 deleted the jlasek/add_checkpoints_section_qnemo branch June 27, 2024 16:32
ericharper pushed a commit that referenced this pull request Jul 17, 2024
* Add checkpoints section

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* Fix title

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* update

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* Add section on ".qnemo" checkpoints (#9503)

* Add 'Quantized Checkpoints' section

Signed-off-by: Jan Lasek <janek.lasek@gmail.com>

* Address review comments

Signed-off-by: Jan Lasek <janek.lasek@gmail.com>

---------

Signed-off-by: Jan Lasek <janek.lasek@gmail.com>

* Distributed checkpointing user guide (#9494)

* Describe shardings and entrypoints

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Strategies, optimizers, finalize entrypoints

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Transformations

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Integration

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Add link from intro

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Apply grammar suggestions

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Explain the example

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Apply review suggestions

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Add zarr and torch_dist explanation

---------

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* add subsection

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* Update docs/source/checkpoints/intro.rst

Co-authored-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: Yu Yao <54727607+yaoyu-33@users.noreply.github.com>

* address comments

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* fix

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* fix code block

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* address comments

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* formatting

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* fix

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* fix

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

---------

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Signed-off-by: Jan Lasek <janek.lasek@gmail.com>
Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>
Signed-off-by: Yu Yao <54727607+yaoyu-33@users.noreply.github.com>
Co-authored-by: Jan Lasek <janek.lasek@gmail.com>
Co-authored-by: mikolajblaz <mikolajblaz@users.noreply.github.com>
Co-authored-by: Chen Cui <chcui@nvidia.com>
ertkonuk pushed a commit that referenced this pull request Jul 19, 2024
Signed-off-by: Tugrul Konuk <ertkonuk@gmail.com>
tonyjie pushed a commit to tonyjie/NeMo that referenced this pull request Jul 24, 2024
akoumpa pushed a commit that referenced this pull request Jul 25, 2024
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
malay-nagda pushed a commit to malay-nagda/NeMo that referenced this pull request Jul 26, 2024
Signed-off-by: Malay Nagda <malayn@malayn-mlt.client.nvidia.com>
monica-sekoyan pushed a commit that referenced this pull request Oct 14, 2024
hainan-xv pushed a commit to hainan-xv/NeMo that referenced this pull request Nov 5, 2024
Signed-off-by: Hainan Xu <hainanx@nvidia.com>