How to modify _runtime_utils.py #3

Open
YuMJie opened this issue Jan 11, 2025 · 1 comment

YuMJie commented Jan 11, 2025

I noticed that we should modify `_pre_forward` as follows.

`_pre_forward` is to be modified; the original file path is `software/miniconda3/lib/python3.11/site-packages/torch/distributed/fsdp/_runtime_utils.py`.

But when I use the container you provided, it seems that the torch version does not match.

For example:

```python
from torch.distributed.fsdp._utils import (
    _apply_to_tensors,
    _no_dispatch_record_stream,
    p_assert,
)
```

However, the module `_utils` does not exist in `torch.distributed.fsdp`.

So how should I modify the `_pre_forward` function?
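
A version-tolerant import shim might be one way around this. The sketch below is only a guess: it assumes that in newer torch releases these helpers moved to `torch.distributed.utils` and that `p_assert` was renamed to `_p_assert`, which would need to be checked against the torch version inside the container.

```python
# Sketch of a fallback import for the FSDP helpers used in _pre_forward.
# Assumption (not verified against the container's torch): newer releases moved
# these helpers from torch.distributed.fsdp._utils to torch.distributed.utils
# and renamed p_assert to _p_assert.
try:
    # Layout used by the torch version the import above was written for.
    from torch.distributed.fsdp._utils import (
        _apply_to_tensors,
        _no_dispatch_record_stream,
        p_assert,
    )
except ImportError:
    # Assumed newer layout; adjust if the container's torch keeps them elsewhere.
    from torch.distributed.utils import (
        _apply_to_tensors,
        _no_dispatch_record_stream,
        _p_assert as p_assert,
    )
```

If something like this works, the rest of the modified `_pre_forward` could presumably stay unchanged, but I am not sure this is the intended fix.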

Thanks.

YuMJie commented Jan 11, 2025

What is more, when using 1F1B, the following error occurs:

```
Traceback (most recent call last):
  File "/workspace/flashflex/llama_train.py", line 241, in <module>
    train(model, loss_func, optimizer, args)
  File "/workspace/flashflex/llama_train.py", line 229, in train
    train_step(model, loss_func, optimizer, trainloader)
  File "/workspace/flashflex/llama_train.py", line 181, in train_step
    optimizer.step()
  File "/usr/local/lib/python3.10/dist-packages/torch/optim/optimizer.py", line 391, in wrapper
    out = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/optim/optimizer.py", line 76, in _use_grad
    ret = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/optim/adam.py", line 165, in step
    adam(
  File "/usr/local/lib/python3.10/dist-packages/torch/optim/adam.py", line 314, in adam
    func(params,
  File "/usr/local/lib/python3.10/dist-packages/torch/optim/adam.py", line 520, in _multi_tensor_adam
    device_grads = torch._foreach_add(device_grads, device_params, alpha=weight_decay)
RuntimeError: The size of tensor a (147849216) must match the size of tensor b (73924608) at non-singleton dimension 0
```
