Does updating inside of jitted batch data parallelism work? #24882
Unanswered
logan-dunbar asked this question in Q&A
Under the "8-way batch data parallelism" example, the data is sharded along the batch dimension while the model parameters are replicated across all devices.
The params update happens outside of any jitted function, and I'm assuming the sharding takes care of replicating the changes to all devices.
Now suppose the update to params is done inside a jitted function instead, like in the sketch below. Does the update happen correctly, or do the params get overwritten in some unpredictable way, since essentially 8 GPUs are each trying to update the replicated data with their chunk of the computation?
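Something along these lines (a minimal sketch, not my exact code; `loss_fn`, the shapes, and the plain SGD rule are just placeholders):

```python
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# 8-way batch data parallelism: data sharded on the leading (batch) axis,
# params fully replicated across the mesh.
mesh = Mesh(jax.devices(), axis_names=('batch',))
data_sharding = NamedSharding(mesh, P('batch'))
replicated_sharding = NamedSharding(mesh, P())

def loss_fn(params, x, y):
    pred = x @ params['w'] + params['b']
    return jnp.mean((pred - y) ** 2)

@jax.jit
def update(params, x, y):
    lr = 1e-3
    grads = jax.grad(loss_fn)(params, x, y)
    # The params update now happens *inside* the jitted function.
    return jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)

# Placeholder shapes: batch of 64, feature dim 16.
key = jax.random.PRNGKey(0)
x = jax.device_put(jax.random.normal(key, (64, 16)), data_sharding)
y = jax.device_put(jax.random.normal(key, (64, 1)), data_sharding)
params = jax.device_put({'w': jnp.zeros((16, 1)), 'b': jnp.zeros((1,))},
                        replicated_sharding)

params = update(params, x, y)  # is `params` still correctly replicated after this?
```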
Replies: 1 comment
-
So I just ran the code in a Colab and it produces the same loss at the end of training, so can I assume it is working correctly? Does JAX know that, when updating the replicate-sharded params object, it needs to wait for all of the gradfun() returns and accumulate them into the replicated object? I'm really surprised that worked, to be honest; I thought for sure it would complain about race conditions on the update, or produce bogus values by overwriting. If it really works as advertised, then I'm stunned by how easy it has been to parallelize my code across multiple GPUs, and I'm even more stoked that I chose to invest in JAX :)
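One way to double-check (a hypothetical sketch reusing the `update`, `params`, `x`, and `y` names from the sketch in the question above): confirm that the jitted update leaves params fully replicated, and that it matches an update computed without any sharding.

```python
new_params = update(params, x, y)

# The updated params should still be fully replicated across the mesh.
print(new_params['w'].sharding.is_fully_replicated)   # expect True
jax.debug.visualize_array_sharding(new_params['w'])

# The numbers should match an update computed from unsharded (host) copies.
ref_params = update(jax.device_get(params),
                    jax.device_get(x),
                    jax.device_get(y))
print(jax.tree_util.tree_all(
    jax.tree_util.tree_map(lambda a, b: jnp.allclose(a, b),
                           new_params, ref_params)))   # expect True
```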