Question: why does GPT2FusedLinearConv1D_Col in shardformer do two allreduces in backward? #4961
lichenlu started this conversation in Community | General
As shown in the figure, the forward function of GPT2FusedLinearConv1D_Col uses two functions, reduce_backward and matmul_with_async_comm, and both of them perform an allreduce during backward. Is there redundancy here?
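(The screenshot from the original post is not available. Below is a rough, simplified paraphrase of the forward path being asked about, based only on the description in this thread; the attribute names such as `process_group`, `bias`, and `async_communication` are assumptions for illustration, and the exact argument lists in ColossalAI may differ.)

```python
# Simplified paraphrase of the forward being asked about -- not the exact
# ColossalAI code; names and argument lists are illustrative assumptions.
def forward(self, input_):
    # reduce_backward: identity in forward, all-reduces the gradient in backward.
    input_parallel = reduce_backward(input_, self.process_group)
    # matmul_with_async_comm: matmul in forward; its backward *can* also
    # all-reduce the input gradient, depending on its flags (see the last
    # reply below).
    output = matmul_with_async_comm(
        input_parallel, self.weight, self.bias,
        self.process_group, self.async_communication,
    )
    return output
```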
-
@lichenlu With tensor parallelism, in a column-parallel layer the input is left untouched in forward and its gradient is all-reduced in backward; conversely, in a row-parallel layer the output is all-reduced in forward and nothing is done in backward. In other words, …
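A minimal sketch of that rule using torch.autograd.Function (illustrative only, not the shardformer implementation; the process-group argument is omitted for brevity):

```python
import torch
import torch.distributed as dist


class CopyToTensorParallelRegion(torch.autograd.Function):
    """Column-parallel input: identity in forward, all-reduce the grad in backward."""

    @staticmethod
    def forward(ctx, x):
        return x

    @staticmethod
    def backward(ctx, grad_output):
        # Each rank holds only a partial gradient w.r.t. the replicated input,
        # so the gradients must be summed across the tensor-parallel group.
        dist.all_reduce(grad_output)
        return grad_output


class ReduceFromTensorParallelRegion(torch.autograd.Function):
    """Row-parallel output: all-reduce in forward, identity in backward."""

    @staticmethod
    def forward(ctx, x):
        # Each rank holds a partial sum of the output, so sum it here.
        dist.all_reduce(x)
        return x

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output
```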
-
@FrankLeeeee Could you help explain this detail a bit more?
-
In GPT2FusedLinearConv1D_Col, ctx.async_grad_allreduce defaults to False, so the allreduce in matmul_with_async_comm is not executed; only one allreduce is performed in total.
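For reference, a hedged sketch of how such a flag typically gates the extra all-reduce in the matmul function's backward. This is a paraphrase under the assumption that matmul_with_async_comm follows the usual Megatron-style pattern; the real ColossalAI function takes more parameters (bias, process group, fused Conv1D handling) than shown here.

```python
import torch
import torch.distributed as dist


class MatmulWithGradAllreduce(torch.autograd.Function):
    """Sketch: the input-gradient all-reduce only runs when the flag is True."""

    @staticmethod
    def forward(ctx, x, weight, async_grad_allreduce):
        ctx.save_for_backward(x, weight)
        ctx.async_grad_allreduce = async_grad_allreduce
        # GPT-2 Conv1D layout: weight has shape (in_features, out_features).
        return torch.matmul(x, weight)

    @staticmethod
    def backward(ctx, grad_output):
        x, weight = ctx.saved_tensors
        grad_input = grad_output.matmul(weight.t())

        handle = None
        if ctx.async_grad_allreduce:
            # The "second" all-reduce: launched asynchronously so it can
            # overlap with the weight-gradient matmul below.
            handle = dist.all_reduce(grad_input, async_op=True)

        grad_weight = x.reshape(-1, x.shape[-1]).t().matmul(
            grad_output.reshape(-1, grad_output.shape[-1]))

        if handle is not None:
            handle.wait()
        # With async_grad_allreduce=False (the default discussed above),
        # grad_input is returned without any all-reduce here, so the single
        # all-reduce of the backward pass is the one in reduce_backward.
        return grad_input, grad_weight, None
```

With the flag at its default False, the only all-reduce in the whole backward pass is the one issued by reduce_backward, which matches the reply above; enabling it would instead let the all-reduce overlap with the weight-gradient computation.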