Add MemoryProfileCallback #10166

ShriyaPalsamudram · 2024-08-15T21:51:06Z

What does this PR do?

This callback enables recording a timeline of memory allocations during training.
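As a rough illustration of what recording a memory-allocation timeline involves, the sketch below uses PyTorch's private CUDA memory-history hooks, `torch.cuda.memory._record_memory_history` and `torch.cuda.memory._dump_snapshot` (assumed here to be the PyTorch 2.1+ versions). The helper name `record_memory_timeline`, the snapshot file name, and the `max_entries` value are assumptions made for illustration, not the callback's actual interface. The helper returns `False` without recording anything when PyTorch or a CUDA device is unavailable.

```python
import os
import tempfile


def record_memory_timeline(snapshot_path, max_entries=100_000):
    """Record CUDA allocator events around a small workload and dump a snapshot.

    Returns True if a snapshot was written, False if PyTorch or CUDA is missing.
    """
    try:
        import torch  # deferred so the sketch also loads on machines without PyTorch
    except ImportError:
        return False
    if not torch.cuda.is_available():
        return False

    # Start recording allocation and free events (private API, assumed PyTorch 2.1+).
    torch.cuda.memory._record_memory_history(max_entries=max_entries)
    try:
        # Stand-in workload; during training this would be the fit loop.
        buffer = torch.empty(1024, 1024, device="cuda")
        del buffer
        torch.cuda.memory._dump_snapshot(snapshot_path)
    finally:
        # Passing enabled=None stops recording.
        torch.cuda.memory._record_memory_history(enabled=None)
    return True


snapshot_file = os.path.join(tempfile.mkdtemp(), "memory_snapshot.pickle")
wrote_snapshot = record_memory_timeline(snapshot_file)
print("snapshot written:", wrote_snapshot)
```

A snapshot produced this way is a pickle file that can be inspected with PyTorch's memory visualizer.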

Collection: [Note which collection this PR will affect]

Changelog

Add specific line by line info of high level changes in this PR.

Usage

You can potentially add a usage example below

# Add a code snippet demonstrating how to use this

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

Make sure you have read and followed the Contributor guidelines
Did you write any new necessary tests?
Did you add or update any necessary documentation?
Does the PR affect components that are optional to install? (e.g., Numba, Pynini, Apex)
- Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

New Feature
Bugfix
Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs in various areas.

Additional Information

Related to # (issue)

nemo/lightning/pytorch/callbacks/memory_profiler.py

maanug-nv · 2024-08-17T00:39:45Z

Currently, this will dump the snapshot for all ranks. Should we add an option to save a snapshot only from rank 0, or from a specifiable rank?

ShriyaPalsamudram · 2024-08-21T20:29:25Z

> Currently, this will dump the snapshot for all ranks. Should we add an option to only save a snapshot from rank 0? or a specifiable rank?

Addressed in the latest commit.
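One way the rank selection discussed above could look is sketched below. The helper names `should_profile` and `snapshot_path`, the convention that an empty `ranks` list means "all ranks", and the file-name pattern are all hypothetical; the thread does not show the interface that was actually merged.

```python
import os


def should_profile(rank, ranks=None):
    """Return True if this rank should record and dump a snapshot.

    An empty or missing `ranks` selects every rank (an assumed convention).
    """
    if not ranks:
        return True
    return rank in ranks


def snapshot_path(output_dir, rank):
    """Build a per-rank file name so ranks never overwrite each other."""
    return os.path.join(output_dir, f"memory_snapshot-rank{rank}.pickle")


# Simulate a 4-process job in which only ranks 0 and 2 are selected.
world_size = 4
selected = [r for r in range(world_size) if should_profile(r, ranks=[0, 2])]
paths = [snapshot_path("mem_profile", r) for r in selected]
print(selected, paths)
```

Keeping the rank in the file name means that selecting several ranks still yields one snapshot per rank.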

maanug-nv

Callback LGTM. I have no concerns about the memory profiling, since this is almost identical to the callback we've been using. For memory cycles, we should test by introducing an intentional cycle. I'm also curious whether this produces warnings about any cycles unrelated to our code (e.g., in PTL).
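The suggested test, introducing an intentional cycle, can be sketched with the standard library alone. This is an analogue using Python's `gc` and `weakref` modules, not the `warn_tensor_cycles` detector itself; the `Node` class and `make_cycle` helper are hypothetical stand-ins for objects that would hold tensors.

```python
import gc
import weakref


class Node:
    """Small object that can point at another Node to form a cycle."""

    def __init__(self, payload):
        self.payload = payload
        self.other = None


def make_cycle():
    """Create two objects that reference each other and return a weak reference."""
    a = Node(bytearray(1024))
    b = Node(bytearray(1024))
    a.other = b
    b.other = a  # intentional reference cycle
    return weakref.ref(a)


gc.disable()  # keep the automatic collector from running mid-test
try:
    gc.collect()  # start from a clean state
    ref = make_cycle()
    # Reference counting alone cannot free the pair, so the object is still alive.
    alive_before = ref() is not None
    # An explicit collection finds the unreachable cycle and frees it.
    collected = gc.collect()
    alive_after = ref() is not None
finally:
    gc.enable()

print(alive_before, collected, alive_after)
```

A test against the real detector would follow the same shape, with the cycle holding a CUDA tensor and the assertion checking for the emitted warning.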

maanug-nv · 2024-08-22T20:22:52Z

Also, can we allow disabling the profiling, so that users can run only the cycle detector?

hemildesai

LGTM, just one minor comment.

hemildesai · 2024-08-23T14:17:05Z

nemo/lightning/pytorch/callbacks/memory_profiler.py

+
+import torch
+from pytorch_lightning.callbacks.callback import Callback
+from torch.utils.viz._cycles import warn_tensor_cycles


Can we move this import inside the callback to be safe?

ShriyaPalsamudram · 2024-08-23

Isn't it better to fail early if there is an issue with this import than to fail inside the callback?

hemildesai · 2024-08-23T15:14:40Z

Yeah, I was just being cautious to avoid side effects of the import, but I don't think this import should have any, so it's probably fine.
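The trade-off in this exchange can be illustrated with a small standard-library sketch. The module name and the `build_callback_*` helpers are hypothetical; they only mimic the difference between importing at module level (failure when the callback is built) and importing inside the callback (failure only when the hook runs).

```python
import importlib

# Hypothetical module name, chosen because it does not exist.
MISSING_MODULE = "hypothetical_missing_cycles_module"


def build_callback_eager():
    """Mimics a module-level import: the failure happens at construction."""
    importlib.import_module(MISSING_MODULE)

    def on_setup():
        return "profiling enabled"

    return on_setup


def build_callback_lazy():
    """Mimics an import inside the callback: the failure is deferred to the hook."""

    def on_setup():
        importlib.import_module(MISSING_MODULE)
        return "profiling enabled"

    return on_setup


def raises_import_error(fn):
    """Return True if calling fn raises ImportError."""
    try:
        fn()
    except ImportError:
        return True
    return False


eager_fails_at_build = raises_import_error(build_callback_eager)
lazy_callback = build_callback_lazy()  # succeeds; nothing is imported yet
lazy_fails_at_call = raises_import_error(lazy_callback)
print(eager_fails_at_build, lazy_fails_at_call)
```

With the eager style, a broken import surfaces before any training work starts, which is the "fail early" behavior argued for in the thread.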

* Add MemoryProfileCallback
  Signed-off-by: Shriya Palsamudram <spalsamudram@nvidia.com>
* Apply isort and black reformatting
  Signed-off-by: ShriyaPalsamudram <ShriyaPalsamudram@users.noreply.github.com>
* Remove reference cycles, save snapshot on specific ranks
  Signed-off-by: Shriya Palsamudram <spalsamudram@nvidia.com>
* Remove unnecessary imports
  Signed-off-by: Shriya Palsamudram <spalsamudram@nvidia.com>
* Apply isort and black reformatting
  Signed-off-by: ShriyaPalsamudram <ShriyaPalsamudram@users.noreply.github.com>
* Update docstring
  Signed-off-by: Shriya Palsamudram <spalsamudram@nvidia.com>
---------
Signed-off-by: Shriya Palsamudram <spalsamudram@nvidia.com>
Signed-off-by: ShriyaPalsamudram <ShriyaPalsamudram@users.noreply.github.com>
Signed-off-by: Shriya Rishab <69161273+ShriyaPalsamudram@users.noreply.github.com>
Co-authored-by: ShriyaPalsamudram <ShriyaPalsamudram@users.noreply.github.com>

* Add MemoryProfileCallback
  Signed-off-by: Shriya Palsamudram <spalsamudram@nvidia.com>
* Apply isort and black reformatting
  Signed-off-by: ShriyaPalsamudram <ShriyaPalsamudram@users.noreply.github.com>
* Remove reference cycles, save snapshot on specific ranks
  Signed-off-by: Shriya Palsamudram <spalsamudram@nvidia.com>
* Remove unnecessary imports
  Signed-off-by: Shriya Palsamudram <spalsamudram@nvidia.com>
* Apply isort and black reformatting
  Signed-off-by: ShriyaPalsamudram <ShriyaPalsamudram@users.noreply.github.com>
* Update docstring
  Signed-off-by: Shriya Palsamudram <spalsamudram@nvidia.com>
---------
Signed-off-by: Shriya Palsamudram <spalsamudram@nvidia.com>
Signed-off-by: ShriyaPalsamudram <ShriyaPalsamudram@users.noreply.github.com>
Signed-off-by: Shriya Rishab <69161273+ShriyaPalsamudram@users.noreply.github.com>
Co-authored-by: ShriyaPalsamudram <ShriyaPalsamudram@users.noreply.github.com>
Signed-off-by: adityavavre <aditya.vavre@gmail.com>

ShriyaPalsamudram force-pushed the shriya/mem_profiler branch from f79a1b7 to dfdb979 Compare August 15, 2024 21:57

github-advanced-security bot found potential problems Aug 15, 2024

nemo/lightning/pytorch/callbacks/memory_profiler.py (Fixed)

nemo/lightning/pytorch/callbacks/memory_profiler.py (Fixed)

hemildesai reviewed Aug 15, 2024

nemo/lightning/pytorch/callbacks/memory_profiler.py (Outdated, resolved)

ShriyaPalsamudram and others added 2 commits August 21, 2024 09:05

Add MemoryProfileCallback

7aff422

Signed-off-by: Shriya Palsamudram <spalsamudram@nvidia.com>

Apply isort and black reformatting

34d7f52

Signed-off-by: ShriyaPalsamudram <ShriyaPalsamudram@users.noreply.github.com>

ShriyaPalsamudram force-pushed the shriya/mem_profiler branch from f4e2171 to 34d7f52 Compare August 21, 2024 16:07

ShriyaPalsamudram added 2 commits August 21, 2024 13:23

Remove reference cycles, save snapshot on specific ranks

43f7ddb

Signed-off-by: Shriya Palsamudram <spalsamudram@nvidia.com>

Remove unnecessary imports

cb2a83a

Signed-off-by: Shriya Palsamudram <spalsamudram@nvidia.com>

ShriyaPalsamudram force-pushed the shriya/mem_profiler branch from d43d69f to cb2a83a Compare August 21, 2024 20:27

Apply isort and black reformatting

84e3c25

Signed-off-by: ShriyaPalsamudram <ShriyaPalsamudram@users.noreply.github.com>

Update docstring

73d0459

Signed-off-by: Shriya Palsamudram <spalsamudram@nvidia.com>

ShriyaPalsamudram added the Run CICD label Aug 21, 2024

ShriyaPalsamudram requested review from hemildesai and maanug-nv August 21, 2024 20:33

maanug-nv reviewed Aug 22, 2024

Merge branch 'main' into shriya/mem_profiler

7b5741b

Signed-off-by: Shriya Rishab <69161273+ShriyaPalsamudram@users.noreply.github.com>

ShriyaPalsamudram added Run CICD and removed Run CICD labels Aug 23, 2024

hemildesai approved these changes Aug 23, 2024

ShriyaPalsamudram merged commit 6d1be93 into main Aug 23, 2024
131 of 132 checks passed

ShriyaPalsamudram deleted the shriya/mem_profiler branch August 23, 2024 15:20
