Gradient Checkpointing, Improved Logging, and Data Pipeline Updates #77

rutkovskii · 2024-11-19T19:54:50Z

Pull Request

Description

This pull request introduces several enhancements and fixes to the DGMR project, focusing on optimization, logging, and data processing. The key updates include:

Memory Optimization:
- Replaced direct calls to self.forward with torch.utils.checkpoint.checkpoint to enable gradient checkpointing, which reduces memory consumption during training at the cost of recomputing activations in the backward pass. Contributed by my colleague @xuzhe951024.
Improved Logging:
- Removed deprecated logger checks in run.py and simplified logger initialization so that multiple loggers are not initialized in multi-GPU environments.
Data Loading Enhancements:
- Updated TFDataset initialization to include trust_remote_code for compatibility with remote dataset loading.
- Added configurable batch size, enabling dynamic adjustments during training.
Code Cleanup:
- Consolidated the __main__ block for better readability and modularity.
- Added default values for batch size and streamlined DataLoader creation.
Dependencies:
- Added wandb, datasets, and tensorflow to requirements.txt to support new functionalities.
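
As a rough illustration of the memory optimization item above, here is a minimal sketch of routing a forward call through torch.utils.checkpoint.checkpoint. TinyModel and its predict method are hypothetical stand-ins, not the DGMR code, and use_reentrant=False is an assumption about how the call could be configured rather than what the PR necessarily does.

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint


class TinyModel(nn.Module):
    """Hypothetical stand-in for the DGMR module; not the project's code."""

    def __init__(self, dim: int = 8):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

    def predict(self, x: torch.Tensor, use_checkpoint: bool = True) -> torch.Tensor:
        if use_checkpoint:
            # Activations inside forward are not kept; they are recomputed
            # during backward, trading extra compute for lower peak memory.
            return checkpoint(self.forward, x, use_reentrant=False)
        # The direct call that checkpointing replaces.
        return self.forward(x)


torch.manual_seed(0)
model = TinyModel()
x = torch.randn(4, 8)

# Gradients from the plain path.
model.predict(x, use_checkpoint=False).sum().backward()
grads_plain = [p.grad.clone() for p in model.parameters()]

# Gradients from the checkpointed path.
model.zero_grad()
model.predict(x, use_checkpoint=True).sum().backward()
grads_ckpt = [p.grad.clone() for p in model.parameters()]

# Checkpointing should change memory use, not the computed gradients.
same = all(torch.allclose(a, b) for a, b in zip(grads_plain, grads_ckpt))
print(same)
```

The comparison at the end is only a sanity check that both paths give matching gradients; the memory savings themselves would show up only on a model large enough for stored activations to dominate.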

Fixes # (Include the relevant issue ID if applicable)

How Has This Been Tested?

The changes were tested using the following methods:

Training Runs:
- Ran fast_dev_run successfully on MacBook Pro M2.
- Ran full training for 4 days on 2 nodes with 4 GPUs per node (8 total, NVIDIA A100-80GB) using the DDP strategy with generation steps = 6, batch size = 16, and precision = 32.
Data Pipeline:
- Confirmed that the modified DataLoader processes batches correctly and handles remote datasets without errors.
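
For readers sizing a similar run, the arithmetic behind the multi-node setup can be sketched as below. This assumes the reported batch size of 16 is per process, which is how a DataLoader batch size usually behaves under DDP; the PR does not state this explicitly, and DDPRun is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DDPRun:
    """Hypothetical helper describing a DDP run; not part of the project."""

    nodes: int
    gpus_per_node: int
    batch_size_per_process: int  # assumption: the reported 16 is per process

    @property
    def world_size(self) -> int:
        # One training process per GPU.
        return self.nodes * self.gpus_per_node

    @property
    def global_batch_size(self) -> int:
        # Samples consumed per optimizer step across all processes.
        return self.world_size * self.batch_size_per_process


run = DDPRun(nodes=2, gpus_per_node=4, batch_size_per_process=16)
print(run.world_size)         # 8
print(run.global_batch_size)  # 128
```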

Steps to reproduce:

1. Set up the environment using the updated requirements.txt.
2. Run the run.py script with the default configuration.
3. Monitor Wandb logs for training metrics and validate output consistency.
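
Running with the default configuration relies on the default batch size mentioned in the description. A minimal sketch of how an entry point could expose that default follows; the flag names --batch-size and --fast-dev-run, the main function, and the default value are all hypothetical, since the actual interface of run.py is not shown in this PR.

```python
import argparse

# Hypothetical default; 16 is simply the value used in the reported training run.
DEFAULT_BATCH_SIZE = 16


def parse_args(argv=None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Hypothetical training entry point")
    parser.add_argument("--batch-size", type=int, default=DEFAULT_BATCH_SIZE)
    parser.add_argument("--fast-dev-run", action="store_true")
    return parser.parse_args(argv)


def main(argv=None) -> dict:
    args = parse_args(argv)
    if args.batch_size < 1:
        raise ValueError("batch size must be a positive integer")
    # A real script would build the DataLoader and Trainer from these values;
    # here they are only collected so the control flow is visible.
    return {"batch_size": args.batch_size, "fast_dev_run": args.fast_dev_run}


# A real script would call main() under `if __name__ == "__main__":`.
default_config = main([])
override_config = main(["--batch-size", "8", "--fast-dev-run"])
print(default_config)
print(override_config)
```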

Have you plotted any changes?

Yes

Checklist:

My code follows OCF's coding style guidelines
I have performed a self-review of my own code
I have made corresponding changes to the documentation
I have added tests that prove my fix is effective or that my feature works
I have checked my code and corrected any misspellings


rutkovskii · 2024-11-19T20:02:35Z

@jacobbieker
Hi Jacob,
Here is the comment on this PR.
#59 (comment)

jacobbieker

This looks really good! Thanks for doing this!

rutkovskii · 2024-11-19T20:22:07Z

@jacobbieker Glad to help! I believe only you can merge it into the main branch from here.

rutkovskii · 2024-11-23T18:41:16Z

@jacobbieker would it be possible to add me to the list of contributors?

I am also looking to cite this repository in my thesis, and adding a CITATION.cff file could be useful for others who cite your work in the future.
https://citation-file-format.github.io/
https://citation-file-format.github.io/cff-initializer-javascript/#/

jacobbieker · 2024-11-23T19:07:02Z

> @jacobbieker would it be possible to add me to the list of contributors?
>
> I am also looking to cite this repository in my thesis, and adding a CITATION.cff file could be useful for others who cite your work in the future.
> https://citation-file-format.github.io/
> https://citation-file-format.github.io/cff-initializer-javascript/#/

Yes, of course! The comment above should trigger the bot. I've also added a CITATION.cff file now too, so hopefully that helps!

jacobbieker · 2024-11-23T19:08:57Z

@all-contributors please add @rutkovskii for code

allcontributors · 2024-11-23T19:09:06Z

@jacobbieker

I've put up a pull request to add @rutkovskii! 🎉

rutkovskii · 2024-11-23T19:22:34Z

Thank you very much!

rutkovskii and others added 14 commits December 30, 2023 23:35

Update run.py

343ef55

Removed
```
if isinstance(trainer.logger, LoggerCollection):
    for logger in trainer.logger:
        if isinstance(logger, WandbLogger):
            return logger
```
because pytorch_lightning does not have LoggerCollection anymore.
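
For context, one way to look up a specific logger without LoggerCollection is to iterate over the trainer's list of loggers. The sketch below is an assumption about how such a lookup could be written, not the code in run.py; find_logger and the Fake* classes are hypothetical stand-ins so the example runs without pytorch_lightning installed.

```python
from typing import Optional, Type, TypeVar

T = TypeVar("T")


def find_logger(trainer, logger_type: Type[T]) -> Optional[T]:
    """Return the first logger of the given type attached to the trainer, if any."""
    # Recent Lightning versions expose attached loggers as a plain list
    # on `trainer.loggers`; fall back to the single `trainer.logger`.
    loggers = getattr(trainer, "loggers", None)
    if loggers is None:
        single = getattr(trainer, "logger", None)
        loggers = [single] if single is not None else []
    for logger in loggers:
        if isinstance(logger, logger_type):
            return logger
    return None


# Hypothetical stand-ins for WandbLogger, another logger, and the Trainer.
class FakeWandbLogger:
    pass


class FakeCSVLogger:
    pass


class FakeTrainer:
    def __init__(self, loggers):
        self.loggers = loggers


wandb_logger = FakeWandbLogger()
trainer = FakeTrainer([FakeCSVLogger(), wandb_logger])
found = find_logger(trainer, FakeWandbLogger)
missing = find_logger(FakeTrainer([]), FakeWandbLogger)
print(found is wandb_logger, missing)
```

With the real library, the same function would be called with the actual Trainer and WandbLogger class in place of the stand-ins.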

[pre-commit.ci] auto fixes from pre-commit.com hooks

1787368

for more information, see https://pre-commit.ci

[MODIFY] dgmr.py to use checkpointing

b799a66

fix token issue

72e19a0

Pin huggingface-hub to version 0.21.4 to work around an incompatibility with newer versions.

139f1c0

docs: update README.md [skip ci]

87e25df

docs: update .all-contributorsrc [skip ci]

79ecead

Run tests on ubuntu only

3c31301

docs: update README.md [skip ci]

1fce918

docs: update .all-contributorsrc [skip ci]

9412431

[MODIFY] dgmr.py to use checkpointing

6e2f176

Merge branch 'dev-oom-checkpoint-fix' of https://github.com/rutkovskii/skillful_nowcasting into dev-oom-checkpoint-fix

3dfd9f6

[MODIFY] requirements.txt to include missing libraries; in run.py updated wandb logging to prevent init of several loggers and added programmatic control over dataloader

872256a

[pre-commit.ci] auto fixes from pre-commit.com hooks

ef8b2b1

for more information, see https://pre-commit.ci

rutkovskii mentioned this pull request Nov 19, 2024

Problems with running ./train/run.py and Concerns with dependency versions #59

Closed

jacobbieker approved these changes Nov 19, 2024


jacobbieker merged commit 615ca91 into openclimatefix:main Nov 19, 2024
1 check failed

allcontributors bot mentioned this pull request Nov 23, 2024

docs: add rutkovskii as a contributor for code #78

Merged
