Gradient Checkpointing, Improved Logging, and Data Pipeline Updates #77
Conversation
Removed

```python
if isinstance(trainer.logger, LoggerCollection):
    for logger in trainer.logger:
        if isinstance(logger, WandbLogger):
            return logger
```

because pytorch_lightning does not have LoggerCollection anymore.
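In releases where `LoggerCollection` no longer exists, the trainer exposes its loggers as a plain list, so the lookup can simply iterate that list. A minimal sketch of the replacement pattern, assuming a `trainer.loggers` list attribute; the `find_logger` helper and the `logger_cls` parameter are illustrative (the real code would check against `WandbLogger`):

```python
# Replacement for the removed LoggerCollection check: newer
# pytorch_lightning keeps attached loggers in the list `trainer.loggers`,
# so we iterate it directly. `logger_cls` is parameterized here to keep
# the sketch self-contained; in practice it would be WandbLogger.

def find_logger(trainer, logger_cls):
    """Return the first attached logger of the given type, or None."""
    for logger in getattr(trainer, "loggers", []):
        if isinstance(logger, logger_cls):
            return logger
    return None
```

The duck-typed `getattr` keeps the helper safe on trainer objects that have no loggers attached at all.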
for more information, see https://pre-commit.ci
…ty with newer versions.
…i/skillful_nowcasting into dev-oom-checkpoint-fix
… updated wandb logging to prevent init of several loggers and added programmatic control over dataloader
for more information, see https://pre-commit.ci
@jacobbieker
This looks really good! Thanks for doing this!
@jacobbieker Glad to help! I believe only you can merge it into the main branch from here.
@jacobbieker would it be possible to add me to the list of contributors? I am also looking to cite this repository in my thesis, and adding the
Yes, of course! The comment above should trigger the bot. I've also added a
@all-contributors please add @rutkovskii for code |
I've put up a pull request to add @rutkovskii! 🎉
Thank you very much! |
Pull Request
Description
This pull request introduces several enhancements and fixes to the DGMR project, focusing on optimization, logging, and data processing. The key updates include:

- Memory Optimization: Wrapped `self.forward` with `torch.utils.checkpoint.checkpoint` to enable gradient checkpointing and reduce memory consumption during training. Added by colleague – @xuzhe951024
- Improved Logging: Updated wandb logging in `run.py` and restructured logger initialization to prevent initializing multiple loggers in multi-GPU environments.
- Data Loading Enhancements: Updated `TFDataset` initialization to include `trust_remote_code` for compatibility with remote dataset loading.
- Code Cleanup: Cleaned up the `__main__` block for better readability and modularity.
- Dependencies: Added `wandb`, `datasets`, and `tensorflow` to `requirements.txt` to support new functionalities.

Fixes # (Include the relevant issue ID if applicable)
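The memory optimization above can be sketched as follows. This is a minimal illustration of gradient checkpointing via `torch.utils.checkpoint.checkpoint`, assuming a recent PyTorch; `TinyGenerator` and the tensor shapes are made up for the sketch and are not the actual DGMR forward pass.

```python
# Minimal gradient-checkpointing sketch. TinyGenerator stands in for the
# real generator: instead of calling the module directly, the call is
# routed through torch.utils.checkpoint.checkpoint so intermediate
# activations are recomputed during backward rather than stored,
# trading extra compute for lower peak memory.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class TinyGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 8))

    def forward(self, x):
        return self.net(x)

model = TinyGenerator()
x = torch.randn(4, 8, requires_grad=True)

# Before the change: y = model(x) stores all activations for backward.
# After: activations inside forward are recomputed in the backward pass.
y = checkpoint(model, x, use_reentrant=False)
y.sum().backward()  # gradients flow to x and the parameters as usual
```

`use_reentrant=False` selects the newer non-reentrant implementation, which composes better with autograd features than the legacy default.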
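For the data-loading change, the functional difference is forwarding `trust_remote_code=True` to `datasets.load_dataset`, which newer `datasets` releases require before executing a dataset's custom loading script. The helper below only assembles the call's keyword arguments so the shape of the call is visible without the `datasets` package or network access; the helper name, dataset path, and `streaming` flag are illustrative assumptions, not the repository's actual code.

```python
# Hypothetical sketch of the TFDataset initialization change: the kwargs
# that would be passed to datasets.load_dataset are built explicitly so
# the trust_remote_code opt-in is visible.
def load_dataset_kwargs(dataset_path: str, split: str) -> dict:
    return {
        "path": dataset_path,
        "split": split,
        "streaming": True,          # assumption: samples are streamed
        "trust_remote_code": True,  # opt in to the dataset's loading script
    }

# The real call would then be roughly:
#   datasets.load_dataset(**load_dataset_kwargs("<dataset>", "train"))
```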
How Has This Been Tested?
The changes were tested using the following methods:

- Ran `fast_dev_run` successfully on a MacBook Pro M2.
- Trained with `generation step = 6`, `batch size = 16`, and `precision = 32` on an NVIDIA A100-80GB.

Steps to reproduce:

1. Install the dependencies from `requirements.txt`.
2. Run the `run.py` script with the default configuration.

Have you plotted any changes?
Checklist: