Gradient Checkpointing, Improved Logging, and Data Pipeline Updates #77

Merged (14 commits) on Nov 19, 2024


rutkovskii
Contributor

Pull Request

Description

This pull request introduces several enhancements and fixes to the DGMR project, focusing on optimization, logging, and data processing. The key updates include:

  1. Memory Optimization:

    • Replaced direct calls to self.forward with torch.utils.checkpoint.checkpoint to enable gradient checkpointing, which reduces memory consumption during training at the cost of recomputing activations in the backward pass. Contributed by colleague @xuzhe951024.
  2. Improved Logging:

    • Removed deprecated logger checks in run.py and simplified logger initialization so that multiple loggers are no longer created in a multi-GPU environment.
  3. Data Loading Enhancements:

    • Updated TFDataset initialization to include trust_remote_code for compatibility with remote dataset loading.
    • Added configurable batch size, enabling dynamic adjustments during training.
  4. Code Cleanup:

    • Consolidated the __main__ block for better readability and modularity.
    • Added default values for batch size and streamlined DataLoader creation.
  5. Dependencies:

    • Added wandb, datasets, and tensorflow to requirements.txt to support new functionalities.
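
To illustrate item 1, here is a minimal sketch of routing a forward pass through torch.utils.checkpoint.checkpoint. It is not the code from this PR: TinyModel and its forward_checkpointed method are hypothetical stand-ins, and use_reentrant=False is my assumption about the preferred mode.

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint


class TinyModel(nn.Module):
    """Hypothetical stand-in for a memory-hungry network (not the DGMR model)."""

    def __init__(self, width: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(width, width), nn.ReLU(), nn.Linear(width, width)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

    def forward_checkpointed(self, x: torch.Tensor) -> torch.Tensor:
        # Instead of calling self.forward(x) directly, wrap it in checkpoint():
        # intermediate activations are not kept after the forward pass and are
        # recomputed during backward, trading extra compute for lower memory.
        return checkpoint(self.forward, x, use_reentrant=False)


torch.manual_seed(0)
model = TinyModel()
x = torch.randn(4, 32)

out_plain = model(x)                      # regular forward pass
out_ckpt = model.forward_checkpointed(x)  # checkpointed forward pass
out_ckpt.sum().backward()                 # triggers the recomputation
```

Both paths should produce the same output; only the memory/compute trade-off during the backward pass differs.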

Fixes # (Include the relevant issue ID if applicable)
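
As a rough illustration of items 2 to 4 (single logger in multi-GPU runs, configurable batch size with a default, consolidated entry point), an entry point could be sketched as below. This is not the actual run.py: the --batch-size flag, DEFAULT_BATCH_SIZE, and the RANK-based check are assumptions made for the example (RANK is the environment variable exported by torchrun; other launchers may use different variables).

```python
import argparse
import os

# Hypothetical default; 16 is the batch size used in the test run described in this PR.
DEFAULT_BATCH_SIZE = 16


def parse_args(argv=None):
    """Parse command-line options for the training script."""
    parser = argparse.ArgumentParser(description="Training entry point (sketch)")
    parser.add_argument(
        "--batch-size",
        type=int,
        default=DEFAULT_BATCH_SIZE,
        help="number of samples per DataLoader batch",
    )
    return parser.parse_args(argv)


def is_global_zero(env=None):
    """Return True on the process with global rank 0.

    A missing RANK variable is treated as a single-process run.
    """
    env = os.environ if env is None else env
    return int(env.get("RANK", "0")) == 0


def main(argv=None):
    args = parse_args(argv)
    # Only the rank-0 process should create the experiment logger, so a
    # multi-GPU job does not start one logging run per process.
    create_logger = is_global_zero()
    return args.batch_size, create_logger


if __name__ == "__main__":
    batch_size, create_logger = main()
    print(f"batch_size={batch_size}, create_logger={create_logger}")
```

Passing an explicit argument list to main makes the entry point easy to exercise in tests without touching sys.argv.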

How Has This Been Tested?

The changes were tested using the following methods:

  • Training Runs:
    • Ran fast_dev_run successfully on a MacBook Pro M2.
    • Ran full training for 4 days on 2 nodes with 4 GPUs per node (8 NVIDIA A100-80GB GPUs in total) using the DDP strategy, with generation step = 6, batch size = 16, and precision = 32.
  • Data Pipeline: Confirmed that the modified DataLoader processes batches correctly and handles remote datasets without errors.
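
For reference, the full training run above can be written down as a settings sketch. The key names follow PyTorch Lightning's Trainer arguments, which I assume is what run.py uses; generation_steps is an assumed parameter name, and treating batch size = 16 as the per-process DataLoader batch size is also an assumption, since the PR does not say.

```python
# Settings reported for the full training run, written as keyword arguments.
# With PyTorch Lightning these would typically be passed as Trainer(**trainer_kwargs).
trainer_kwargs = {
    "accelerator": "gpu",
    "num_nodes": 2,     # 2 nodes
    "devices": 4,       # 4 GPUs per node
    "strategy": "ddp",  # DistributedDataParallel
    "precision": 32,    # full 32-bit precision
}

model_kwargs = {"generation_steps": 6}  # assumed parameter name
batch_size = 16  # assumed to be per process, not global

# Under DDP every process loads its own batches, so the effective global
# batch is the per-process batch size times the number of processes.
world_size = trainer_kwargs["num_nodes"] * trainer_kwargs["devices"]
global_batch_size = batch_size * world_size
```

If 16 was instead the global batch size, each of the 8 processes would see 2 samples per step.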

Steps to reproduce:

  1. Set up the environment using the updated requirements.txt.
  2. Run the run.py script with the default configuration.
  3. Monitor Wandb logs for training metrics and validate output consistency.

Have you plotted any changes?

  • Yes

Checklist:

  • My code follows OCF's coding style guidelines
  • I have performed a self-review of my own code
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • I have checked my code and corrected any misspellings

@rutkovskii
Contributor Author

@jacobbieker
Hi Jacob,
Here is the comment on this PR.
#59 (comment)

Member

@jacobbieker jacobbieker left a comment

This looks really good! Thanks for doing this!

@rutkovskii
Contributor Author

@jacobbieker Glad to help! I believe only you can merge it into the main branch from here.

@jacobbieker jacobbieker merged commit 615ca91 into openclimatefix:main Nov 19, 2024
1 check failed
@rutkovskii
Contributor Author

rutkovskii commented Nov 23, 2024

@jacobbieker would it be possible to add me to the list of contributors?

I am also looking to cite this repository in my thesis, and adding a CITATION.cff file could be useful for others who want to cite your work in the future.
https://citation-file-format.github.io/
https://citation-file-format.github.io/cff-initializer-javascript/#/

@jacobbieker
Member

@jacobbieker would it be possible to add me to the list of contributors?

I am also looking to cite this repository in my thesis, and adding a CITATION.cff file could be useful for others who want to cite your work in the future. https://citation-file-format.github.io/ https://citation-file-format.github.io/cff-initializer-javascript/#/

Yes, of course! The comment above should trigger the bot. I've also added a CITATION.cff file now too, so hopefully that helps!

@jacobbieker
Member

@all-contributors please add @rutkovskii for code

Contributor

@jacobbieker

I've put up a pull request to add @rutkovskii! 🎉

@rutkovskii
Contributor Author

Thank you very much!

5 participants