
Fix align_module_device, ensure only cpu tensors for get_state_dict_offloaded_model #3217

Merged

Conversation

kylesayrs
Contributor

@kylesayrs kylesayrs commented Nov 4, 2024

Background

  • Previous PR [Utils] align_module_device #3204 introduced an unintended behavior change where aligning a module would also attempt to align parameters belonging to its submodules. This is a problem for functions like get_state_dict_offloaded_model, which call align_module_device on non-leaf modules.
tests/test_modeling_utils.py:808
    state_dict = get_state_dict_offloaded_model(model)
src/accelerate/utils/modeling.py:1532: in get_state_dict_offloaded_model
    with align_module_device(module, "cpu"):
/usr/lib/python3.10/contextlib.py:135: in __enter__
    return next(self.gen)
src/accelerate/utils/modeling.py:1929: in align_module_device
    set_module_tensor_to_device(module, name, execution_device)
ValueError: weight is on the meta device, we need a `value` to put in on cpu.
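For context on why the traceback ends this way: meta tensors are shape/dtype placeholders with no storage behind them, so they cannot be copied to a real device without supplying a concrete value. A minimal torch-only sketch of this underlying failure mode (illustrative, not code from the PR):

```python
import torch

# A meta tensor has metadata but no data, so materializing it on a
# real device fails unless a concrete value is provided.
meta_tensor = torch.empty(2, 2, device="meta")
try:
    meta_tensor.to("cpu")
except (NotImplementedError, RuntimeError) as err:
    print(f"cannot materialize meta tensor: {err}")
```

This is why align_module_device must skip tensors it does not have offloaded values for, rather than calling set_module_tensor_to_device on them.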

Purpose

  • Fix align_module_device bug where the function attempts to align meta tensors belonging to submodule parameters
  • Fix get_state_dict_offloaded_model(model) behavior to match model.state_dict()
  • Introduce tests for get_state_dict_offloaded_model

Changes

  • align_module_device now only aligns parameters attached directly to the given module, not parameters belonging to its submodules
  • get_state_dict_offloaded_model now moves all tensors in the module state dict to cpu before returning, including both parameters and buffers
  • Add usage tests for get_state_dict_offloaded_model
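The direct-vs-recursive distinction behind the first change can be sketched with plain torch; the nn.Sequential wrapper and names below are illustrative, not code from this PR:

```python
import torch.nn as nn

# A non-leaf module: the Sequential itself owns no parameters directly;
# they all belong to its Linear submodule.
model = nn.Sequential(nn.Linear(4, 4))

# recurse=True (the default) walks submodules; recurse=False yields only
# parameters registered directly on the module itself.
recursive = [name for name, _ in model.named_parameters()]
direct = [name for name, _ in model.named_parameters(recurse=False)]

print(recursive)  # ['0.weight', '0.bias']
print(direct)     # []
```

Iterating with recurse=False is what keeps align_module_device from touching submodule parameters that get_state_dict_offloaded_model will align separately.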

Testing

  • The added tests fail without the changes and pass with them
  • Use the script below to test end-to-end
test_e2e.py
from transformers import AutoModelForCausalLM
from accelerate import cpu_offload
from accelerate.utils.modeling import get_state_dict_offloaded_model

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
cpu_offload(model)

state_dict = get_state_dict_offloaded_model(model)

@kylesayrs kylesayrs marked this pull request as draft November 4, 2024 21:18
@kylesayrs kylesayrs marked this pull request as ready for review November 4, 2024 21:19
Collaborator

@muellerzr muellerzr left a comment


Thanks for fixing! cc @SunMarc :)

@muellerzr muellerzr requested a review from SunMarc November 5, 2024 02:26
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Member

@SunMarc SunMarc left a comment


LGTM ! Thanks for fixing this !

@SunMarc SunMarc merged commit c0552c9 into huggingface:main Nov 5, 2024
25 checks passed
@kylesayrs kylesayrs deleted the kylesayrs/align-module-device-fix branch November 5, 2024 15:28
muellerzr pushed a commit that referenced this pull request Nov 6, 2024
…t_offloaded_model` (#3217)

* only onload direct parameter descendants, move buffers to cpu, add tests

* remove no longer applicable comment
4 participants