
[Feature] Support TPU backend for ShardParallel #764

Merged
ZYHowell merged 21 commits into main from tpu-support on Nov 8, 2022

Conversation

ZYHowell
Collaborator

ZYHowell commented Nov 3, 2022

This PR:

  • Add has_cuda and backend to global_config;
  • Add IS_CUDA to setup.py;
  • Change the semantics of run_auto_sharding_pass and run_backend_compilation: the output of run_auto_sharding_pass (and the input of run_backend_compilation) is no longer the SPMD-partitioned module but the sharding-annotated module, because the TPU backend cannot compile an SPMD-partitioned module with sharding configurations (see the sketch after this list);
  • Write test cases specifically for TPUs. The pass that rewrites AllReduce into ReduceScatter and AllGather runs after the SPMD partitioner, so it cannot be applied on the TPU backend.
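
To make the backend split above concrete, here is a minimal, self-contained sketch of the control flow. It is illustrative only: GlobalConfig, the stub pass functions, and the use_reduce_scatter flag are simplified stand-ins for this discussion, not Alpa's actual signatures.

from dataclasses import dataclass

@dataclass
class GlobalConfig:
    backend: str = "gpu"             # "gpu" or "tpu", as added by this PR
    has_cuda: bool = True            # set from IS_CUDA at build time
    use_reduce_scatter: bool = True  # hypothetical stand-in flag

def run_auto_sharding_pass(hlo):
    # New contract: returns a sharding-ANNOTATED module, not an
    # SPMD-partitioned one, so both backends can consume it.
    return f"annotated({hlo})"

def run_spmd_partitioner_pass(annotated):
    return f"partitioned({annotated})"

def rewrite_all_reduce(partitioned):
    # AllReduce -> ReduceScatter + AllGather; only possible after the
    # SPMD partitioner, hence unavailable on the TPU path.
    return f"reduce_scatter({partitioned})"

def run_backend_compilation(module, backend):
    return f"compiled[{backend}]({module})"

def compile_shard_parallel(hlo, config: GlobalConfig):
    annotated = run_auto_sharding_pass(hlo)
    if config.backend == "tpu":
        # TPU: hand the sharding-annotated module to the compiler,
        # which runs the SPMD partitioner internally.
        return run_backend_compilation(annotated, "tpu")
    # GPU: partition explicitly, then optionally rewrite AllReduce.
    partitioned = run_spmd_partitioner_pass(annotated)
    if config.use_reduce_scatter:
        partitioned = rewrite_all_reduce(partitioned)
    return run_backend_compilation(partitioned, "gpu")

print(compile_shard_parallel("hlo", GlobalConfig(backend="tpu")))
print(compile_shard_parallel("hlo", GlobalConfig(backend="gpu")))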

@zhisbug
Copy link
Member

zhisbug commented Nov 3, 2022

One quick Q: how do we run TPU tests in CI?

@ZYHowell
Copy link
Collaborator Author

ZYHowell commented Nov 3, 2022

One quick Q: how do we run TPU tests in CI?

Google provides free TPU access for several days, but I've already extended the time twice (the first extension gave me 90 days and the second 180 days). Can we use that machine for CI?

@ZYHowell ZYHowell changed the title [WIP][Feature] Support TPU backend for ShardParallel [Feature] Support TPU backend for ShardParallel Nov 7, 2022
@ZYHowell ZYHowell merged commit 4f14955 into main Nov 8, 2022
@ZYHowell ZYHowell deleted the tpu-support branch November 8, 2022 17:39
@merrymercy merrymercy mentioned this pull request Nov 9, 2022
@OhadRubin

@merrymercy hey, it seems all the TPU-specific tests are being skipped (on mobile, will provide a link soon), so it is not clear to me what the current working TPU capabilities are.

@ZYHowell
Collaborator Author

ZYHowell commented Nov 9, 2022

@merrymercy hey, it seems all the TPU-specific tests are being skipped (on mobile, will provide a link soon), so it is not clear to me what the current working TPU capabilities are.

Currently we skip all tests other than those for the ShardParallel/reduce-scatter optimization. Tests such as test_n_layer_mlp_data_parallel are kept; for the full list of kept tests, please refer to the if __name__ == "__main__": part.

@merrymercy
Member

merrymercy commented Nov 9, 2022

Not all test cases are skipped. The current code can pass the following test cases on TPU (a sketch of how such a suite is assembled follows the list):

add_mlp("test_n_layer_mlp_data_parallel")
add_mlp("test_n_layer_mlp_model_parallel")
add_mlp("test_n_layer_mlp_2d_mesh")
add_mlp("test_n_layer_mlp_force_data_parallel")
add_mlp("test_n_layer_mlp_force_batch_dim_mapping")
add_mlp("test_weight_init")
add_moe("test_moe_layer")
add_moe("test_moe_layer_2d")
add_moe("test_moe_lm")
add_moe("test_moe_lm_2d")
add_moe("test_moe_lm_data_parallel")
