
Add model parallel distribution. #797

Merged
merged 2 commits into keras-team:main from the true_model_parallel branch on Aug 27, 2023

Conversation

qlzh727
Member

@qlzh727 qlzh727 commented Aug 26, 2023

No description provided.

@qlzh727 qlzh727 requested a review from fchollet August 26, 2023 20:05
Contributor

@fchollet fchollet left a comment


Thanks for the PR!

distribution = ModelParallel(device_mesh=device_mesh,
                             layout_map=layout_map,
                             batch_dim_name='batch')
with distribution.scope():
Contributor


If the primary usage is via the global setter, we should show that in the code example

Member Author


Done.
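
For readers following along, here is a minimal sketch of the global-setter usage the review asks for. The module path `keras.distribution` and the names `set_distribution`, `DeviceMesh`, `LayoutMap`, and `list_devices` are assumptions taken from the current Keras distribution API, not quoted from this PR's diff.

```python
# Sketch only: assumes the keras.distribution module and the names below exist
# as in the current Keras 3 API; the docstring in this PR is authoritative.
from keras.distribution import (
    DeviceMesh,
    LayoutMap,
    ModelParallel,
    list_devices,
    set_distribution,
)

devices = list_devices()  # e.g. 8 accelerators

# 2D mesh: a data-parallel "batch" axis and a model-parallel "model" axis.
device_mesh = DeviceMesh(
    shape=(2, 4), axis_names=("batch", "model"), devices=devices
)

# Regex keys on variable paths -> sharding spec over the mesh axes.
layout_map = LayoutMap(device_mesh)
layout_map["dense.*kernel"] = (None, "model")  # shard last dim over "model"
layout_map["dense.*bias"] = ("model",)

distribution = ModelParallel(
    device_mesh=device_mesh, layout_map=layout_map, batch_dim_name="batch"
)

# Global setter: models and variables created afterwards pick this up.
set_distribution(distribution)
```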

corresponding `TensorLayout`.

Example:
```
Contributor


Add: python

Member Author


Done.

devices=devices)
```

To figure out a proper layout mapping rule for all the model weights, you
Contributor


Do variables other than weights ever need to be sharded? e.g. optimizer variables, metrics. I assume optimizer variables will need to be sharded.

Member Author


In the DTensor implementation, all the weights need to be either replicated or sharded; a plain tf.Variable doesn't work alongside DTensor variables within one tf.function.

For JAX, it might not be explicitly required, but it is likely handled (replicated by default) under the hood.

For optimizer variables, since they have a similar name/path to the weight names, they will probably get the same layout as the corresponding weight. Metric variables are replicated by default.

Contributor


> In the DTensor implementation, all the weights need to be either replicated or sharded

In this case, we should show the code example with `model.variables`, which is more complete. Or we could move `get_variable_map()` to the public API and recommend that. It's probably the best move since it includes optimizer variables as well.

> For optimizer variables, since they have a similar name/path to the weight names

This isn't the case today -- try printing some optimizer variables for a given model to see examples. We should figure out the recommended best practices for specifying layouts for optimizer variables.
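
As a concrete way to follow the suggestion above, a hedged sketch of printing the variable paths for both the model and its optimizer; the `.path` attribute on variables and `optimizer.build()` are assumptions based on the Keras 3 API and may differ from this PR's code.

```python
# Hypothetical inspection snippet (assumes Keras 3-style `Variable.path` and
# `optimizer.build()`): print paths to see how model weights and optimizer
# slot variables are named before writing LayoutMap regex rules.
import keras

model = keras.Sequential(
    [
        keras.layers.Dense(64, name="dense_1"),
        keras.layers.Dense(8, name="dense_2"),
    ]
)
model.build(input_shape=(None, 16))

optimizer = keras.optimizers.Adam()
optimizer.build(model.trainable_variables)  # creates momentum/velocity slots

for v in model.variables:
    print("model var:     ", v.path, tuple(v.shape))
for v in optimizer.variables:
    print("optimizer var: ", v.path, tuple(v.shape))
```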

corresponding `TensorLayout`. The axis names of the
`TensorLayout`s should match the axis names in the
device_mesh, or an exception will be raised.
batch_dim_name: optional string, the axis name in the device_mesh
Contributor


Is this argument necessary? Can it not be inferred every time?

Member Author


I'd prefer not to infer it, since that could lead to some bizarre behavior; with the JAX backend, it might silently run without raising an error.
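
To make the point concrete, a small sketch (reusing the assumed `keras.distribution` names from the earlier example) of a mesh whose data-parallel axis is not named `batch`; passing `batch_dim_name` explicitly avoids having to guess which mesh axis the batch dimension maps to.

```python
# Hypothetical sketch: the data-parallel axis is called "data" here, so any
# inference keyed on a default axis name or position could silently pick the
# wrong mesh axis. Naming it explicitly keeps the behavior unambiguous.
from keras.distribution import DeviceMesh, LayoutMap, ModelParallel, list_devices

devices = list_devices()
device_mesh = DeviceMesh(
    shape=(4, 2), axis_names=("model", "data"), devices=devices
)

layout_map = LayoutMap(device_mesh)
layout_map["dense.*kernel"] = ("model", None)

distribution = ModelParallel(
    device_mesh=device_mesh,
    layout_map=layout_map,
    batch_dim_name="data",  # explicit, not inferred
)
```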

@qlzh727 qlzh727 requested a review from fchollet August 27, 2023 00:00
Contributor

@fchollet fchollet left a comment


LGTM, thanks!

@fchollet fchollet merged commit a051b5c into keras-team:main Aug 27, 2023
@qlzh727 qlzh727 deleted the true_model_parallel branch August 29, 2023 17:03