Fix LoRA contiguous tensor #10611

cuichenx · 2024-09-25T00:57:32Z

What does this PR do?

Fixes a contiguous-tensor issue in LoRA that occurs when the micro batch size (mbs) is greater than 1.
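
For readers unfamiliar with this class of bug, here is a minimal pure-Python sketch of why a contiguity problem can stay hidden at mbs=1 and appear only at mbs>1. It is an illustration under stated assumptions, not the code changed in this PR: the `[seq_len, micro_batch_size, hidden]` layout, the transpose of the first two dimensions, and all helper names below are hypothetical. The sketch models the row-major contiguity rule that PyTorch's `Tensor.is_contiguous()` applies, in which dimensions of size 1 are ignored. In PyTorch the usual remedy is to call `Tensor.contiguous()` before an operation that requires a dense layout.

```python
def row_major_strides(shape):
    """Strides (in elements) of a densely packed row-major array."""
    strides = []
    step = 1
    for size in reversed(shape):
        strides.append(step)
        step *= size
    return tuple(reversed(strides))


def transpose(shape, strides, dim0, dim1):
    """Swap two dimensions the way a view-based transpose does: no data moves."""
    shape, strides = list(shape), list(strides)
    shape[dim0], shape[dim1] = shape[dim1], shape[dim0]
    strides[dim0], strides[dim1] = strides[dim1], strides[dim0]
    return tuple(shape), tuple(strides)


def is_contiguous(shape, strides):
    """True if the layout matches row-major order; size-1 dims are ignored."""
    expected = 1
    for size, stride in zip(reversed(shape), reversed(strides)):
        if size == 1:
            continue  # a size-1 dimension never affects memory order
        if stride != expected:
            return False
        expected *= size
    return True


def layout_after_transpose(seq_len, micro_batch_size, hidden):
    """Hypothetical [s, b, h] -> [b, s, h] transpose (an assumption, not from the PR)."""
    shape = (seq_len, micro_batch_size, hidden)
    return transpose(shape, row_major_strides(shape), 0, 1)


# With mbs=1 the transposed view is still contiguous, so the bug stays hidden.
mbs1 = is_contiguous(*layout_after_transpose(4, 1, 8))
# With mbs=2 the same transpose yields a non-contiguous view.
mbs2 = is_contiguous(*layout_after_transpose(4, 2, 8))
print(mbs1, mbs2)  # True False
```

Whether this exact mechanism is what triggered the failure fixed here is not stated in the PR; the sketch only shows why batch-size-1 testing can miss contiguity bugs.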

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you add or update any necessary documentation?
Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
- Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

New Feature
Bugfix
Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs in various areas.

Additional Information

Related to # (issue)

Signed-off-by: Chen Cui <chcui@nvidia.com>

Signed-off-by: cuichenx <cuichenx@users.noreply.github.com>

Signed-off-by: artbataev <artbataev@users.noreply.github.com>

tests/collections/llm/gpt_finetuning.py

stevehuang52

I'm seeing some strange warnings. Are they normal?

ModelCheckpoint is not able to find val_loss. Where does the model call self.log or self.log_dict?

[NeMo W 2024-09-25 20:12:48 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py:384: `ModelCheckpoint(monitor='val_loss')` could not find the monitored key in the returned metrics: ['lr', 'consumed_samples', 'global_batch_size', 'global_step', 'step', 'reduced_train_loss', 'grad_norm', 'epoch']. HINT: Did you call `log('val_loss', value)` in the `LightningModule`?

torch/_dynamo/convert_frame.py:

Validation: iteration 4/5
[rank0]:W0925 20:09:59.218000 140276111091520 torch/_dynamo/convert_frame.py:744] [4/8] torch._dynamo hit config.cache_size_limit (8)
[rank0]:W0925 20:09:59.218000 140276111091520 torch/_dynamo/convert_frame.py:744] [4/8]    function: 'calculate_cross_entropy_loss' (/opt/megatron-lm/megatron/core/fusions/fused_cross_entropy.py:47)
[rank0]:W0925 20:09:59.218000 140276111091520 torch/_dynamo/convert_frame.py:744] [4/8]    last reason: tensor 'L['exp_logits']' size mismatch at index 0. expected 224, actual 144
[rank0]:W0925 20:09:59.218000 140276111091520 torch/_dynamo/convert_frame.py:744] [4/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
[rank0]:W0925 20:09:59.218000 140276111091520 torch/_dynamo/convert_frame.py:744] [4/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
[NeMo W 2024-09-25 20:09:59 nemo_logging:349] /opt/megatron-lm/megatron/core/tensor_parallel/layers.py:609: UserWarning: async_grad_allreduce is deprecated, not in use anymore and will be fully removed with 0.10.0. Please use allreduce_dgrad instead.
      warnings.warn(

cuichenx · 2024-09-25T21:09:29Z

These warnings are not related to this PR, so let's discuss them elsewhere.

stevehuang52

LGTM, thanks!

cuichenx · 2024-09-26T14:37:04Z

Adding CI tests in a separate PR: #10632

Signed-off-by: Chen Cui <chcui@nvidia.com>

* contiguous
  Signed-off-by: Chen Cui <chcui@nvidia.com>
* fix load
  Signed-off-by: Chen Cui <chcui@nvidia.com>
* add test script
  Signed-off-by: Chen Cui <chcui@nvidia.com>
* Apply isort and black reformatting
  Signed-off-by: cuichenx <cuichenx@users.noreply.github.com>
* Apply isort and black reformatting
  Signed-off-by: artbataev <artbataev@users.noreply.github.com>

---------

Signed-off-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: cuichenx <cuichenx@users.noreply.github.com>
Signed-off-by: artbataev <artbataev@users.noreply.github.com>
Co-authored-by: cuichenx <cuichenx@users.noreply.github.com>
Co-authored-by: artbataev <artbataev@users.noreply.github.com>

cuichenx added 2 commits September 24, 2024 20:48

contiguous

4f67ff4

Signed-off-by: Chen Cui <chcui@nvidia.com>

fix load

41a6528

Signed-off-by: Chen Cui <chcui@nvidia.com>

cuichenx added the r2.0.0 label Sep 25, 2024

cuichenx and others added 3 commits September 25, 2024 14:25

add test script

a04e517

Signed-off-by: Chen Cui <chcui@nvidia.com>

Apply isort and black reformatting

8c743c1

Signed-off-by: cuichenx <cuichenx@users.noreply.github.com>

Apply isort and black reformatting

80bf07d

Signed-off-by: artbataev <artbataev@users.noreply.github.com>

github-advanced-security bot found potential problems Sep 25, 2024

tests/collections/llm/gpt_finetuning.py (alert dismissed)

ericharper requested a review from stevehuang52 September 25, 2024 19:17

stevehuang52 reviewed Sep 25, 2024

stevehuang52 approved these changes Sep 25, 2024

ko3n1g added the Run CICD label Sep 26, 2024

cuichenx merged commit 51f47f1 into main Sep 26, 2024
153 of 158 checks passed

cuichenx deleted the chcui/lora_contiguous branch September 26, 2024 14:33

cuichenx added a commit that referenced this pull request Sep 26, 2024

cherrypick #10466 #10611 #10632 without the ci test

9d7b39c

Signed-off-by: Chen Cui <chcui@nvidia.com>

cuichenx mentioned this pull request Sep 26, 2024

Cherrypick #10466 #10611 #10632 without the ci test #10638

Open

8 tasks
