
fix(config): passing gradient_checkpoint_kwargs #1412

Merged
merged 4 commits into main on Mar 19, 2024

Conversation

NanoCode012
Collaborator

@NanoCode012 NanoCode012 commented Mar 16, 2024

According to huggingface/transformers#28339, setting use_reentrant to False increases VRAM. My quick testing shows a ~1GB increase at the lowest settings.

Furthermore, the default in transformers and torch is going to be True (huggingface/transformers#29638 (comment)).

Finally, this removes the False default in trainer_builder to clean up old configs. I see that this kwarg is now set in https://github.com/OpenAccess-AI-Collective/axolotl/blob/a914cb37dc455a3fd0368e3a0898867f25b3a6c9/src/axolotl/utils/config/__init__.py#L170-L176
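
For context, a minimal sketch of how gradient_checkpointing_kwargs reaches the Hugging Face Trainer; this is illustrative only (the model name and values are placeholders, not axolotl code):

```python
# Sketch: passing gradient_checkpointing_kwargs through TrainingArguments.
# Placeholder model/values; not the axolotl implementation.
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

model = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder model

args = TrainingArguments(
    output_dir="out",
    gradient_checkpointing=True,
    # The kwarg discussed in this PR: transformers forwards this dict to
    # model.gradient_checkpointing_enable(), which passes use_reentrant
    # down to torch.utils.checkpoint.checkpoint().
    gradient_checkpointing_kwargs={"use_reentrant": True},
)

trainer = Trainer(model=model, args=args)
```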


@NanoCode012 NanoCode012 changed the title from "fix(config): change default use_reentrant to true" to "fix(config): passing gradient_checkpoint_kwargs" on Mar 16, 2024
@NanoCode012 NanoCode012 marked this pull request as ready for review March 16, 2024 05:25
@winglian
Collaborator

In my experience, use_reentrant: False used less VRAM and prevented OOM in some situations where use_reentrant: True would OOM, hence the current default.

@winglian
Collaborator

see my comment in #1167

@NanoCode012
Collaborator Author

NanoCode012 commented Mar 16, 2024

In the code block linked above, it seems that validate_config, which runs first, already sets it to True by default.

Then trainer_builder, which runs second, redundantly checks for it again and tries to set it to False.

Either way, I think one of the defaults should be removed to prevent future confusion.

Edit: your linked PR sets it to true, despite the comment saying it's false.
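
To make the ordering concern concrete, here is a simplified sketch of the two competing defaults being described (approximate names, not the actual axolotl source):

```python
# Simplified illustration of the two defaulting paths discussed above.
# Function and field names are approximations, not the real axolotl code.

def validate_config(cfg: dict) -> dict:
    # Runs first: fills in use_reentrant=True when the user did not set
    # gradient_checkpointing_kwargs in their YAML config.
    if cfg.get("gradient_checkpointing") and not cfg.get("gradient_checkpointing_kwargs"):
        cfg["gradient_checkpointing_kwargs"] = {"use_reentrant": True}
    return cfg

def build_training_args(cfg: dict) -> dict:
    # Runs second: before this PR it re-checked the same field and fell back
    # to use_reentrant=False -- a default that can never take effect once
    # validate_config has already populated the kwargs, hence the confusion.
    kwargs = cfg.get("gradient_checkpointing_kwargs") or {"use_reentrant": False}
    return {
        "gradient_checkpointing": cfg.get("gradient_checkpointing", False),
        "gradient_checkpointing_kwargs": kwargs,
    }
```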

@winglian
Collaborator

> In the code block linked above, it seems that validate_config, which runs first, already sets it to True by default.
>
> Then trainer_builder, which runs second, redundantly checks for it again and tries to set it to False.
>
> Either way, I think one of the defaults should be removed to prevent future confusion.
>
> Edit: your linked PR sets it to true, despite the comment saying it's false.

Thanks for digging into this. Good to go!

@NanoCode012 NanoCode012 merged commit b1e3e1b into main Mar 19, 2024
6 checks passed
@NanoCode012 NanoCode012 deleted the NanoCode012-checkpointing branch March 19, 2024 03:57
seungduk-yanolja pushed a commit to Y-IAB/axolotl that referenced this pull request Mar 19, 2024
* fix(config): change default use_reentrant to true

* Update trainer_builder.py

* fix: make sure to pass kwargs to enable checkpoint

* chore: lint