[RFC] Mechanism to dereference automatically in the Loops #9385
Comments
Great idea! I really like option 2. It seems to cover all possible hooks, and the cleaning is handled automatically. Should we have 2 shared objects? `sm_run`? Created and cleaned for the entire run?
We can start implementing and see if having two would be particularly helpful.
@tchaton I don't think this approach alone would be sufficient to address #9390. Specifically, since each DataLoader uses a different set of worker processes, there is no guarantee that the old set of processes will release memory before the new set allocates new memory. This is a big deal because in my case each set of worker processes reserves 20GB of memory; for other users it might be even larger. I am looking for an explicit call to https://github.com/pytorch/pytorch/blob/e5ab0d1013072c26586b369536bccac648843958/torch/utils/data/dataloader.py#L1328 to resolve this.
@cowwoc I wouldn't want to call a private method.
@carmocca Sorry, I'm new to Python so maybe I misunderstood. Should I ask the pytorch guys to add a public method to close/dispose a `DataLoader`?
Sure, why not :)
If I'm not wrong, garbage collection should already take care of shutting the workers down once the iterator is no longer referenced.
I usually prefer the approach you mentioned, but in this specific case I think it makes sense to explicitly shut the workers down. It's possible for the main process to have little garbage (so GC will never get triggered) in spite of the idle sub-processes hogging memory. As far as I understand it, garbage held by sub-processes does not apply any pressure on the GC of the main process.
That link reads:

But I just tested this locally, and what I saw was that the reference count has to drop to zero and the variable has to be assigned to `None`. I'll ask the DataLoader guys to add an explicit cleanup method.
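A minimal sketch illustrating this behaviour (the toy dataset and names are only for the example, not code from this thread): the DataLoader worker processes only go away once the last reference to the iterator is dropped.

```python
import gc

import torch
from torch.utils.data import DataLoader, Dataset


class ToyDataset(Dataset):
    def __len__(self):
        return 8

    def __getitem__(self, idx):
        return torch.tensor(idx)


if __name__ == "__main__":
    loader = DataLoader(ToyDataset(), num_workers=2)
    iterator = iter(loader)  # spawns the worker processes
    next(iterator)

    # The workers stay alive for as long as `iterator` is referenced. Dropping
    # the last reference lets the iterator's __del__ run, which shuts them down.
    iterator = None  # or `del iterator`
    gc.collect()     # only needed if reference cycles keep the object alive
```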
Here is my request for an explicit shutdown method.

You should probably start with clearing out references for now, and if they add a cleanup method we could revisit this issue.
It looks like the PyTorch-side changes might take a while to arrive. Are you able to release the Lightning changes without them? I am still running out of memory even though I am invoking the cleanup manually.
I figured out a workaround that seems to work: clearing out the references inside a try-finally block to ensure the cleanup happens even if an exception is raised. I hope this helps others.
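A rough sketch of a workaround along these lines (the function name and structure here are illustrative assumptions, not the original snippet):

```python
import gc

from torch.utils.data import DataLoader


def run_epoch_with_cleanup(dataset, step_fn, num_workers: int = 4) -> None:
    """Iterate a DataLoader and release its worker processes afterwards,
    even if `step_fn` raises."""
    loader = DataLoader(dataset, num_workers=num_workers)
    iterator = iter(loader)  # spawns the worker processes
    try:
        for batch in iterator:
            step_fn(batch)
    finally:
        # Drop every reference to the iterator and force a collection pass so
        # its __del__ runs and the idle workers stop holding on to memory.
        del iterator
        del loader
        gc.collect()
```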
Per pytorch/pytorch#64766 (comment) and pytorch/pytorch#64766 (comment), they do not plan to add an explicit shutdown API in the near future. I'll keep an eye on this ticket and try the workaround in the meantime.
This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!
Closing, as this is a cool solution but not a natural one for most people. It also limits the flexibility in memory management.
Proposed refactoring or deprecation
The `Loop`'s `run` design (https://github.com/PyTorchLightning/pytorch-lightning/blob/41ba639859cf6c6bf319eb33e5b3394504315962/pytorch_lightning/loops/base.py#L94-L120) doesn't include a mechanism to share data between hooks during the same iteration or between iterations. This forces us to write everything to `self`, which is flexible but opens the door to easily forgetting to clear state, as the variables will not get garbage collected.

Motivation
#9386 is the perfect example, as it shows that `self.dataloader_iter` was added but the reference was not freed. It gets defined in `on_run_start`, but we only use it in a later hook (a sketch of the pattern follows below). This pattern is also seen in other places.
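To make the pattern concrete, a simplified sketch of the kind of leak being described (the hook names follow the Loop API; the body is illustrative):

```python
class LeakyLoopSketch:
    """Illustrative only: state created in one hook, used in a later one, never freed."""

    def on_run_start(self, dataloader):
        # The iterator (and any worker processes it owns) now lives on `self` ...
        self.dataloader_iter = iter(dataloader)

    def advance(self):
        # ... it is consumed in a later hook ...
        batch = next(self.dataloader_iter)

    def on_run_end(self):
        # ... but unless someone remembers to clear it here, the reference stays
        # alive for as long as the loop object itself is referenced:
        # self.dataloader_iter = None
        pass
```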
Pitch
Automatically dereference data at the end of `run`.

Option 1:

Option 2: loop writers save the temporal state with `self.shm.dataloader_iter = ...`, and the loop dereferences it automatically at the end of `run` (a rough sketch follows below).
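A rough sketch of how option 2 could look (the `shm` name comes from the snippet above; the `SimpleNamespace` scratch object and the exact `run` structure are assumptions for illustration, not a final API):

```python
from types import SimpleNamespace


class Loop:
    """Simplified sketch: per-run scratch state on `self.shm` is dropped automatically."""

    @property
    def done(self) -> bool:
        raise NotImplementedError

    def reset(self) -> None: ...
    def on_run_start(self, *args, **kwargs) -> None: ...
    def advance(self, *args, **kwargs) -> None:
        raise NotImplementedError
    def on_run_end(self): ...

    def run(self, *args, **kwargs):
        self.reset()
        self.shm = SimpleNamespace()  # fresh scratch space for this run
        try:
            self.on_run_start(*args, **kwargs)
            while not self.done:
                self.advance(*args, **kwargs)
            return self.on_run_end()
        finally:
            # Everything the hooks stored on self.shm (e.g. self.shm.dataloader_iter)
            # is dereferenced here, so loop writers cannot forget to clean up.
            self.shm = SimpleNamespace()
```

Created at the start of `run` and cleared at the end, the scratch object could also be split into a run-scoped and an iteration-scoped one if that turns out to be useful, as discussed in the comments above.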
If you enjoy Lightning, check out our other projects! ⚡
Metrics: Machine learning metrics for distributed, scalable PyTorch applications.
Flash: The fastest way to get a Lightning baseline! A collection of tasks for fast prototyping, baselining, finetuning and solving problems with deep learning
Bolts: Pretrained SOTA Deep Learning models, callbacks and more for research and production with PyTorch Lightning and PyTorch
Lightning Transformers: Flexible interface for high-performance research using SOTA Transformers, leveraging PyTorch Lightning, Transformers, and Hydra.
cc @Borda @tchaton @justusschock @awaelchli @carmocca @ananthsub @ninginthecloud @rohitgr7 @akihironitta