How to do JAX mixed precision properly? #25434
Unanswered
FirstQuadrantSam asked this question in Q&A

Hi all!
I am trying to use mixed precision (say, BF16 on a V100) to accelerate my training. I did something like this:
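In outline, something like the following, where loss_fn, params, and batch are placeholders for the real model and data (a minimal sketch, not the actual training code):

```python
import jax
import jax.numpy as jnp

def to_bf16(tree):
    # Cast every array in a pytree down to bfloat16.
    return jax.tree_util.tree_map(lambda x: x.astype(jnp.bfloat16), tree)

def loss_fn(params, batch):
    # Placeholder for the real model's loss.
    preds = batch["x"] @ params["w"] + params["b"]
    return jnp.mean((preds - batch["y"]) ** 2)

@jax.jit
def grad_step(params, batch):
    # Differentiate the loss with everything cast to bfloat16.
    return jax.grad(loss_fn)(to_bf16(params), to_bf16(batch))
```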
and then passed the computed gradients on to the optimizer. It turns out that this actually gives a slowdown compared to the full-precision version. Note that I used jax.block_until_ready and timed only the gradient calculation. Then I tried the following:
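Roughly this, as a standalone microbenchmark on random inputs (the shapes and the function f here are illustrative, not the real model):

```python
import time
import jax
import jax.numpy as jnp

key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (4096, 4096), dtype=jnp.bfloat16)

def f(a):
    return jnp.sum(a @ a)  # a bfloat16 matmul reduced to a scalar

grad_f = jax.jit(jax.grad(f))
grad_f(x).block_until_ready()  # warm-up call so compilation is not timed

t0 = time.perf_counter()
grad_f(x).block_until_ready()  # time only the gradient computation
print(f"grad time: {time.perf_counter() - t0:.4f} s")
```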
and it works well on random data, giving the desired 2x speedup. Any ideas or suggestions about what could be wrong?

Replies: 1 comment
-
Not a full answer, but this depends on a number of choices / conventions. Some operations need to be kept in float32, such as the attention softmax.
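A sketch of that pattern (illustrative; the attention_probs helper is made up for the example): keep the matmul in bfloat16 but upcast the logits to float32 for the softmax itself.

```python
import jax
import jax.numpy as jnp

def attention_probs(q, k):
    # q and k stay in bfloat16; only the numerically sensitive
    # softmax runs in float32.
    logits = (q @ k.T) * (q.shape[-1] ** -0.5)
    probs = jax.nn.softmax(logits.astype(jnp.float32), axis=-1)
    return probs.astype(q.dtype)  # cast back for the bfloat16 value matmul
```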