AutoSharded_Transformer_Based_on_PyTorch

4-GPU deployment (single node):

    CUDA_VISIBLE_DEVICES=4,5,6,7 python Transformer_AutoShard_Test.py

Result (maximum memory set to 10000): 50%+ speedup relative to FSDP.

8-GPU deployment (single node):

    CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python Transformer_AutoShard_Test.py

Result (maximum memory set to 10000): 50%+ speedup relative to FSDP.
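The project name and the maximum-memory setting suggest that transformer layers are partitioned across the visible GPUs automatically, subject to a per-device memory budget. As a rough, hypothetical illustration only (the function names `estimate_layer_mem` and `auto_shard` are invented here and are not this repo's API), a greedy in-order partition under a memory budget might be sketched as:

```python
import torch.nn as nn

def estimate_layer_mem(layer: nn.Module) -> int:
    """Rough per-layer memory estimate: parameter bytes only.

    A real planner would also account for activations and optimizer state.
    """
    return sum(p.numel() * p.element_size() for p in layer.parameters())

def auto_shard(layers, num_devices: int, max_mem_bytes: int):
    """Greedily fill each device up to max_mem_bytes, then move to the next.

    Returns a list mapping each layer index to a device index; the last
    device absorbs any overflow once all earlier devices are full.
    """
    placement, device, used = [], 0, 0
    for layer in layers:
        need = estimate_layer_mem(layer)
        if used and used + need > max_mem_bytes and device < num_devices - 1:
            device += 1
            used = 0
        placement.append(device)
        used += need
    return placement

# Tiny demo model: 8 small encoder layers spread over 4 devices.
layers = [nn.TransformerEncoderLayer(d_model=64, nhead=4, dim_feedforward=128)
          for _ in range(8)]
print(auto_shard(layers, num_devices=4, max_mem_bytes=300_000))
```

This is only a sketch of the general idea; the actual sharding strategy, memory units, and API in `Transformer_AutoShard_Test.py` may differ.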