-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add MemoryProfileCallback #10166
Add MemoryProfileCallback #10166
Conversation
f79a1b7
to
dfdb979
Compare
Currently, this will dump the snapshot for all ranks. Should we add an option to only save a snapshot from rank 0? or a specifiable rank? |
Signed-off-by: Shriya Palsamudram <spalsamudram@nvidia.com>
Signed-off-by: ShriyaPalsamudram <ShriyaPalsamudram@users.noreply.github.com>
f4e2171
to
34d7f52
Compare
Signed-off-by: Shriya Palsamudram <spalsamudram@nvidia.com>
Signed-off-by: Shriya Palsamudram <spalsamudram@nvidia.com>
d43d69f
to
cb2a83a
Compare
Signed-off-by: ShriyaPalsamudram <ShriyaPalsamudram@users.noreply.github.com>
Addressed in latest commit |
Signed-off-by: Shriya Palsamudram <spalsamudram@nvidia.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Callback LGTM. I have no concerns about memory profiling, since this is almost identical to the one callback we've been using. For memory cycles, we should test by introducing an intentional cycle. Also curious if this produces warnings about any cycles not related to our code (eg in PTL).
also, can we allow disabling the profiling, so users can only run the cycle detector? |
Signed-off-by: Shriya Rishab <69161273+ShriyaPalsamudram@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, just one minor comment.
|
||
import torch | ||
from pytorch_lightning.callbacks.callback import Callback | ||
from torch.utils.viz._cycles import warn_tensor_cycles |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we move this import inside the callback to be safe?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't it better to fail early if there is an issue with this import than fail inside the callback?
Yeah I was just being cautious to avoid side effects of the import but I don't think this import should have any, so it's probably fine.
Get Outlook for iOS<https://aka.ms/o0ukef>
________________________________
From: Shriya Rishab ***@***.***>
Sent: Friday, August 23, 2024 10:07:29 AM
To: NVIDIA/NeMo ***@***.***>
Cc: Hemil Desai ***@***.***>; Comment ***@***.***>
Subject: Re: [NVIDIA/NeMo] Add MemoryProfileCallback (PR #10166)
@ShriyaPalsamudram commented on this pull request.
________________________________
In nemo/lightning/pytorch/callbacks/memory_profiler.py<#10166 (comment)>:
@@ -0,0 +1,78 @@
+import os
+
+import torch
+from pytorch_lightning.callbacks.callback import Callback
+from torch.utils.viz._cycles import warn_tensor_cycles
Isn't it better to fail early if there is an issue with this import than fail inside the callback?
—
Reply to this email directly, view it on GitHub<#10166 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AB6Q25H6ZIEEJZOY5WCDAWDZS5F3DAVCNFSM6AAAAABMS6X6TCVHI2DSMVQWIX3LMV43YUDVNRWFEZLROVSXG5CSMV3GSZLXHMZDENJXGQ3TONZVGY>.
You are receiving this because you commented.Message ID: ***@***.***>
|
* Add MemoryProfileCallback Signed-off-by: Shriya Palsamudram <spalsamudram@nvidia.com> * Apply isort and black reformatting Signed-off-by: ShriyaPalsamudram <ShriyaPalsamudram@users.noreply.github.com> * Remove reference cycles, save snapshot on specific ranks Signed-off-by: Shriya Palsamudram <spalsamudram@nvidia.com> * Remove unnecessary imports Signed-off-by: Shriya Palsamudram <spalsamudram@nvidia.com> * Apply isort and black reformatting Signed-off-by: ShriyaPalsamudram <ShriyaPalsamudram@users.noreply.github.com> * Update docstring Signed-off-by: Shriya Palsamudram <spalsamudram@nvidia.com> --------- Signed-off-by: Shriya Palsamudram <spalsamudram@nvidia.com> Signed-off-by: ShriyaPalsamudram <ShriyaPalsamudram@users.noreply.github.com> Signed-off-by: Shriya Rishab <69161273+ShriyaPalsamudram@users.noreply.github.com> Co-authored-by: ShriyaPalsamudram <ShriyaPalsamudram@users.noreply.github.com>
* Add MemoryProfileCallback Signed-off-by: Shriya Palsamudram <spalsamudram@nvidia.com> * Apply isort and black reformatting Signed-off-by: ShriyaPalsamudram <ShriyaPalsamudram@users.noreply.github.com> * Remove reference cycles, save snapshot on specific ranks Signed-off-by: Shriya Palsamudram <spalsamudram@nvidia.com> * Remove unnecessary imports Signed-off-by: Shriya Palsamudram <spalsamudram@nvidia.com> * Apply isort and black reformatting Signed-off-by: ShriyaPalsamudram <ShriyaPalsamudram@users.noreply.github.com> * Update docstring Signed-off-by: Shriya Palsamudram <spalsamudram@nvidia.com> --------- Signed-off-by: Shriya Palsamudram <spalsamudram@nvidia.com> Signed-off-by: ShriyaPalsamudram <ShriyaPalsamudram@users.noreply.github.com> Signed-off-by: Shriya Rishab <69161273+ShriyaPalsamudram@users.noreply.github.com> Co-authored-by: ShriyaPalsamudram <ShriyaPalsamudram@users.noreply.github.com>
* Add MemoryProfileCallback Signed-off-by: Shriya Palsamudram <spalsamudram@nvidia.com> * Apply isort and black reformatting Signed-off-by: ShriyaPalsamudram <ShriyaPalsamudram@users.noreply.github.com> * Remove reference cycles, save snapshot on specific ranks Signed-off-by: Shriya Palsamudram <spalsamudram@nvidia.com> * Remove unnecessary imports Signed-off-by: Shriya Palsamudram <spalsamudram@nvidia.com> * Apply isort and black reformatting Signed-off-by: ShriyaPalsamudram <ShriyaPalsamudram@users.noreply.github.com> * Update docstring Signed-off-by: Shriya Palsamudram <spalsamudram@nvidia.com> --------- Signed-off-by: Shriya Palsamudram <spalsamudram@nvidia.com> Signed-off-by: ShriyaPalsamudram <ShriyaPalsamudram@users.noreply.github.com> Signed-off-by: Shriya Rishab <69161273+ShriyaPalsamudram@users.noreply.github.com> Co-authored-by: ShriyaPalsamudram <ShriyaPalsamudram@users.noreply.github.com>
* Add MemoryProfileCallback Signed-off-by: Shriya Palsamudram <spalsamudram@nvidia.com> * Apply isort and black reformatting Signed-off-by: ShriyaPalsamudram <ShriyaPalsamudram@users.noreply.github.com> * Remove reference cycles, save snapshot on specific ranks Signed-off-by: Shriya Palsamudram <spalsamudram@nvidia.com> * Remove unnecessary imports Signed-off-by: Shriya Palsamudram <spalsamudram@nvidia.com> * Apply isort and black reformatting Signed-off-by: ShriyaPalsamudram <ShriyaPalsamudram@users.noreply.github.com> * Update docstring Signed-off-by: Shriya Palsamudram <spalsamudram@nvidia.com> --------- Signed-off-by: Shriya Palsamudram <spalsamudram@nvidia.com> Signed-off-by: ShriyaPalsamudram <ShriyaPalsamudram@users.noreply.github.com> Signed-off-by: Shriya Rishab <69161273+ShriyaPalsamudram@users.noreply.github.com> Co-authored-by: ShriyaPalsamudram <ShriyaPalsamudram@users.noreply.github.com> Signed-off-by: adityavavre <aditya.vavre@gmail.com>
What does this PR do ?
This callback enables recording a timeline of memory allocations during training.
Collection: [Note which collection this PR will affect]
Changelog
Usage
# Add a code snippet demonstrating how to use this
GitHub Actions CI
The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.
The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".
Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items you can still open "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.
Additional Information