
32 bit optimizer update error despite gradients being the same #1185

Closed
Edenzzzz opened this issue Apr 19, 2024 · 4 comments · Fixed by #1187
Comments

@Edenzzzz
Contributor

System Info

A100 GPU, torch 2.1, CUDA 12.1, bitsandbytes 0.43.1

Reproduction

The tensors to be loaded are zipped here:
grads.zip

import torch
import bitsandbytes.functional as F

# Tensors from grads.zip: the full low-rank gradient and the corresponding
# tensor-parallel chunk of the same gradient.
low_rank_grad = torch.load("low_rank_grad.pt")
dist_low_rank_grad = torch.load("dist_low_rank_grad.pt")

def update(low_rank_grad):
    # Fresh parameter and Adam state, so the update depends only on the gradient.
    p = torch.zeros_like(low_rank_grad, dtype=torch.float32, device="cuda")
    p.grad = low_rank_grad
    lr = 1e-2
    state = {}
    state["state1"] = torch.zeros_like(low_rank_grad, dtype=torch.float32, device="cuda")  # first moment
    state["state2"] = torch.zeros_like(low_rank_grad, dtype=torch.float32, device="cuda")  # second moment
    beta1 = 0.9
    beta2 = 0.999
    step = 1
    eps = 1e-8
    weight_decay = 1e-2

    # Single fused 32-bit Adam step.
    F.optimizer_update_32bit(
        "adam",
        p.grad,
        p,
        state["state1"],
        beta1,
        eps,
        step,
        lr,
        state["state2"],
        beta2,
        weight_decay,
    )
    return p

print(low_rank_grad.shape, dist_low_rank_grad.shape)
# The distributed chunk matches the corresponding slice of the full gradient exactly.
assert (low_rank_grad[:, :32] == dist_low_rank_grad).all()
# Adam update step on both tensors.
low_rank_grad = update(low_rank_grad)
dist_low_rank_grad = update(dist_low_rank_grad)
# Elementwise comparison of the updated slices; these should all be True but mostly are not.
low_rank_grad[:, :32] == dist_low_rank_grad

My result, showing that most updates on the same grad chunk diverged:
[screenshot of the elementwise comparison output]

Expected behavior

This comes from adapting the GaLore optimizer for tensor parallelism, while testing the precision of the distributed optimizer against the original one.
Here the gradient is sharded along dim 1 by tensor parallelism, and the corresponding grad chunk clearly matches before the update. However, after the optimizer step the chunks are not exactly the same. I first suspected this was due to quantization statistics, but the bug consistently appears even with the 32-bit optimizer and quantization disabled.

Edenzzzz changed the title from "(Urgent) 32 bit optimizer update error despite gradients being the same" to "32 bit optimizer update error despite gradients being the same" on Apr 19, 2024
@Edenzzzz
Contributor Author

@matthewdouglas @TimDettmers any insights? Thanks!

@matthewdouglas
Member

Hi @Edenzzzz,

Make sure that this chunk is contiguous, as F.optimizer_update_32bit ultimately treats it as 1D.

dist_low_rank_grad = torch.load("dist_low_rank_grad.pt").contiguous()

I was able to reproduce your results, and after this change I believe I'm seeing the desired result.

[screenshot of the comparison output after the fix]

RTX 3060, CUDA 12.4, torch==2.2.2+cu121, bitsandbytes==0.43.1
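For reference, a minimal rerun of the repro with that one-line fix applied might look like the following (my own sketch, reusing the update helper defined in the reproduction above):

low_rank_grad = torch.load("low_rank_grad.pt")
dist_low_rank_grad = torch.load("dist_low_rank_grad.pt").contiguous()  # the suggested fix

p_full = update(low_rank_grad)     # update() as defined in the reproduction
p_chunk = update(dist_low_rank_grad)

# With contiguous inputs, the updated chunk should match the corresponding
# slice of the full update exactly.
print((p_full[:, :32] == p_chunk).all())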

@Edenzzzz
Contributor Author

Edenzzzz commented Apr 20, 2024

Thanks a lot! This worked. The non-contiguous tensor came from torch.chunk and torch.distributed.all_gather.
I wonder if that's due to the C++ kernel not considering the reshaped strides and assuming row-major format? I can file a PR to make it contiguous if you feel that's helpful.
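To illustrate the point (my own sketch, not from the original report): chunking along dim 1 returns a strided view of the original storage, so the result is non-contiguous until it is explicitly copied.

import torch

full = torch.randn(4, 64)
chunk = torch.chunk(full, 2, dim=1)[0]     # first shard along dim 1
print(chunk.is_contiguous())               # False: shares full's storage with stride (64, 1)
print(chunk.contiguous().is_contiguous())  # True: .contiguous() copies into dense row-major memory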

@matthewdouglas
Member

I wonder if that's due to the C++ kernel not considering the reshaped strides and assuming row-major format?

Yes, exactly. The C++ kernel assumes it's row-major and only knows the total number of elements.
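A small illustration of why that matters (my own sketch): for a non-contiguous view, the first numel elements of the underlying storage are not the elements of the view, so a kernel that walks the data pointer linearly reads the wrong values.

import torch

x = torch.arange(12, dtype=torch.float32).reshape(3, 4)
view = x[:, :2]                          # non-contiguous view: [[0, 1], [4, 5], [8, 9]]
print(view.data_ptr() == x.data_ptr())   # True: the view shares x's storage
print(x.flatten()[:view.numel()])        # what a flat row-major walk sees: [0., 1., 2., 3., 4., 5.]
print(view.contiguous().flatten())       # what the view actually contains: [0., 1., 4., 5., 8., 9.]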

I can file a PR to make it contiguous if you feel that's helpful.

That seems like a reasonable check to me, so a PR to add that sounds good!
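Such a check might look roughly like the sketch below (hypothetical illustration only; not necessarily what the linked PR #1187 does).

def _assert_contiguous(**tensors):
    # Reject non-contiguous inputs before their raw data pointers reach the CUDA kernel.
    for name, t in tensors.items():
        if t is not None and not t.is_contiguous():
            raise ValueError(f"{name} must be contiguous; call .contiguous() before the optimizer update")

# e.g. called at the start of the optimizer update path:
# _assert_contiguous(g=g, p=p, state1=state1, state2=state2)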
