```
Traceback (most recent call last):
  File "/workspace/flashflex/llama_train.py", line 241, in <module>
    train(model, loss_func, optimizer, args)
  File "/workspace/flashflex/llama_train.py", line 229, in train
    train_step(model, loss_func, optimizer, trainloader)
  File "/workspace/flashflex/llama_train.py", line 181, in train_step
    optimizer.step()
  File "/usr/local/lib/python3.10/dist-packages/torch/optim/optimizer.py", line 391, in wrapper
    out = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/optim/optimizer.py", line 76, in _use_grad
    ret = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/optim/adam.py", line 165, in step
    adam(
  File "/usr/local/lib/python3.10/dist-packages/torch/optim/adam.py", line 314, in adam
    func(params,
  File "/usr/local/lib/python3.10/dist-packages/torch/optim/adam.py", line 520, in _multi_tensor_adam
    device_grads = torch._foreach_add(device_grads, device_params, alpha=weight_decay)
RuntimeError: The size of tensor a (147849216) must match the size of tensor b (73924608) at non-singleton dimension 0
```
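Note that tensor a is exactly twice the size of tensor b (147849216 = 2 × 73924608), so one gradient and its parameter disagree in flattened size at the moment Adam applies weight decay. A minimal diagnostic sketch (my own, not part of FlashFlex; `check_grad_shapes` is a hypothetical helper name) that can be called right before `optimizer.step()` to locate the offending pair:

```python
# Hypothetical helper, not from the FlashFlex codebase: report every
# parameter whose gradient shape disagrees with its own shape.
def check_grad_shapes(model):
    for name, p in model.named_parameters():
        if p.grad is not None and p.grad.shape != p.shape:
            print(f"mismatch in {name}: param {tuple(p.shape)} "
                  f"vs grad {tuple(p.grad.shape)}")

# Usage: check_grad_shapes(model) just before optimizer.step().
```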
I noticed that we should modify `_pre_forward` to fix this. But when I used the container you provided, the torch version seems to be mismatched.
For example, the module `_utils` does not exist in `torch.distributed.fsdp`.
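As an illustration (my own sketch, not an official API: these FSDP modules are private and their layout moves between torch releases), one can probe which layout the installed torch actually ships before attempting any patch:

```python
# Sketch, assuming only that FSDP internals differ across torch releases:
# check which private submodules this torch build exposes.
import importlib
import torch

print("torch version:", torch.__version__)
for mod_name in ("torch.distributed.fsdp._utils",
                 "torch.distributed.fsdp._runtime_utils"):
    try:
        mod = importlib.import_module(mod_name)
        print(f"{mod_name}: available "
              f"(has _pre_forward: {hasattr(mod, '_pre_forward')})")
    except ImportError:
        print(f"{mod_name}: not present in this torch build")
```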
So how could I modify the `_pre_forward` function?
Thanks.