[Misc]: Memory Order in Custom Allreduce #8404
Comments
This is a great observation! @kanghui0204 also found this problem. It seems adding @HydraQYH do you have any ideas on how to solve it? Also cc @hanzhi713 if you still have bandwidth to investigate. |
I don't think this is the issue. Step 3 will be executed after step 1 and 2 due to Even if step 3 got the wrong data, it won't cause a hang. If it hangs it must occur in one of the while loops. |
@hanzhi713 adding vllm/csrc/custom_all_reduce.cuh Line 137 in 0af3abe
seems to work, a solution found by @kanghui0204 I don't know if we can use some weaker sync op here, |
@youkaichao Can you try what I proposed in the second bullet point of #8410? I think the rationale behind this (I'm thinking about this too) is that the vllm/csrc/custom_all_reduce.cuh Line 135 in 0af3abe
vllm/csrc/custom_all_reduce.cuh Line 162 in 0af3abe
causing an indefinite wait. If this is indeed the case, it should be fixed by changing vllm/csrc/custom_all_reduce.cuh Line 156 in 0af3abe
|
@hanzhi713 Thanks for the reply. I have also been thinking about __syncthreads(). I'm not sure whether __syncthreads() has memory fence semantics. The CUDA programming guide just says: |
@hanzhi713 which will be more efficient? adding vllm/csrc/custom_all_reduce.cuh Line 137 in 0af3abe
or unconditionally use vllm/csrc/custom_all_reduce.cuh Line 156 in 0af3abe
|
"... are visible to all threads in the block": this is an even stronger guarantee than a memory fence. A memory fence only guarantees ordering. This also guarantees visibility. |
@youkaichao I have tried this. On my A100, it adds about 6us of latency. I also tried changing the code to use a weaker memory fence, just like TensorRT-LLM; that seems to add about 1~3us of latency. It is better than I can make a code review for my plan. |
Second. It will add some latency to one stage allreduce, but two stage allreduce already has it, so overall impact is smaller. |
TensorRT-LLM uses both a fence (acquire-release) and __syncthreads: @hanzhi713 |
I saw @youkaichao's comment, I think the problem is I don't think switching end and start between 0 and 1 is a good approach. I think the solution below should be better, and it doesn't need a fence in one shot. What do you think? @hanzhi713 @HydraQYH |
@kanghui0204 Your solution seems reasonable. It's worth a shot to see the performance. Using increments removes both the need to reset flags and the race condition. I like the idea. |
OK, I'll try it sometime later |
Very interesting! I guess that in the two-card scenario it looks really good. How about 4 cards or 8 cards? I'm excited to see the performance results. |
I think this works for any number of GPUs, because you can prepare a pair of flags for each peer GPU. |
@kanghui0204 I think you only need one local flag regardless of gpus, but global flags increase as the number of gpus?
every gpu has a flag array
every gpu concurrently execute the following:
    const int N = 4; // Number of GPUs
    int i = 0; // GPU index
    // Assuming all_flags is an N x N 2D array
    int all_flags[N][N] = {0}; // Initialize all elements to 0, and this array is shared across all gpus
    // Update flags for the current GPU
    all_flags[i][i] += 1;
    // Update flags for peer GPUs
    for (int j = 0; j < N; ++j) {
        if (j != i) {
            all_flags[j][i] += 1;
        }
    }
    // Wait until synchronization is achieved
    bool synced = false;
    while (!synced) {
        synced = true;
        for (int j = 0; j < N; ++j) {
            if (all_flags[i][j] != all_flags[i][i]) {
                synced = false;
                break; // No need to check further, already out of sync
            }
        }
    }
the diagram:
this essentially act as a barrier for all gpus. |
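The incrementing-flag barrier described above can be tried out on the host. The following is a minimal sketch, assuming CPU threads stand in for GPUs and `std::atomic` release/acquire operations stand in for whatever fences the real kernel would need; `FlagBarrier`, `arrive_and_wait`, `run_rounds`, and `kRanks` are hypothetical names for illustration, not vLLM code. One deliberate difference from the snippet above: the wait compares with `>=` instead of equality, so a fast peer that has already entered the next round cannot stall a slower one.

```cpp
#include <atomic>
#include <thread>
#include <vector>

constexpr int kRanks = 4;  // hypothetical number of simulated GPUs

// flags[i][j] is the counter that lives on rank i and is bumped only by rank j.
struct FlagBarrier {
    std::atomic<int> flags[kRanks][kRanks];

    FlagBarrier() {
        for (auto& row : flags)
            for (auto& f : row) f.store(0, std::memory_order_relaxed);
    }

    // Called by rank `self` on its `round`-th barrier (1-based).
    void arrive_and_wait(int self, int round) {
        // Announce arrival on every rank (including self). The release
        // ordering publishes all writes made before the barrier.
        for (int peer = 0; peer < kRanks; ++peer)
            flags[peer][self].fetch_add(1, std::memory_order_release);
        // Wait until every peer has announced this round on our row.
        for (int peer = 0; peer < kRanks; ++peer)
            while (flags[self][peer].load(std::memory_order_acquire) < round)
                std::this_thread::yield();
    }
};

// Each rank writes its slot, crosses the barrier, then checks that every
// peer's write for this round is visible. Returns the number of stale reads.
int run_rounds(int rounds) {
    FlagBarrier barrier;
    std::atomic<int> data[kRanks];
    for (auto& d : data) d.store(0, std::memory_order_relaxed);
    std::atomic<int> stale_reads{0};

    std::vector<std::thread> ranks;
    for (int self = 0; self < kRanks; ++self) {
        ranks.emplace_back([&, self] {
            for (int round = 1; round <= rounds; ++round) {
                data[self].store(round, std::memory_order_relaxed);
                barrier.arrive_and_wait(self, round);
                for (int peer = 0; peer < kRanks; ++peer)
                    if (data[peer].load(std::memory_order_relaxed) < round)
                        stale_reads.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& t : ranks) t.join();
    return stale_reads.load();
}
```

Calling `run_rounds(1000)` should report zero stale reads, because the counters only grow and never need to be reset between rounds. Whether the same ordering can be obtained cheaply on the GPU is exactly the performance question raised in this thread.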
cc@youkaichao #8410 (comment) |
Move to #8457 |
yes I agree with you. |
@kanghui0204 I can take a stab at this idea if you haven't started. I happen to have some time this week. |
@hanzhi713 Sorry, I haven't started it because of the Mid-Autumn Festival. If you have time, you can give it a try. Thanks, and happy Mid-Autumn Festival. |
@kanghui0204 Sure. I will get started today. Happy holiday! |
Memory Order in Custom Allreduce
In custom allreduce, I notice that Signal* has a volatile qualifier, and there is no memory fence in the start_sync function. I want to know: can volatile alone guarantee the right memory order?
The start_sync program order is:
In my opinion, without a memory fence, step 3 may become visible before step 2 or 1.
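The ordering concern in this question can be illustrated with a host-side analogue. This is only a sketch, assuming two CPU threads in place of two GPUs and C++ `std::atomic` in place of CUDA primitives; `Mailbox`, `produce`, `consume`, and `run_message_passing` are hypothetical names, not part of vLLM. It shows the usual pattern in which a release store on the flag keeps the earlier data write from becoming visible after the flag, and an acquire load on the reader side pairs with it. A `volatile` qualifier by itself does not provide this pairing.

```cpp
#include <atomic>
#include <thread>

// Hypothetical stand-in for a signal slot shared between two ranks.
struct Mailbox {
    int payload = 0;           // plain data, written before the flag
    std::atomic<int> flag{0};  // flag that publishes the payload
};

// Writer: store the data, then publish it. The release ordering forbids
// the payload write from being reordered after the flag store.
void produce(Mailbox& box, int value) {
    box.payload = value;
    box.flag.store(1, std::memory_order_release);
}

// Reader: spin until the flag is observed with acquire ordering. After
// that, the payload read is guaranteed to see the writer's value.
int consume(Mailbox& box) {
    while (box.flag.load(std::memory_order_acquire) != 1) {
        std::this_thread::yield();
    }
    return box.payload;
}

// Runs one writer/reader pair and returns what the reader observed.
int run_message_passing(int value) {
    Mailbox box;
    int seen = -1;
    std::thread reader([&] { seen = consume(box); });
    std::thread writer([&] { produce(box, value); });
    writer.join();
    reader.join();
    return seen;
}
```

In a CUDA kernel the rough counterparts would be a fence such as `__threadfence_system()` around the flag accesses, or acquire/release operations at system scope; which of these is cheapest for this kernel is what the discussion in the comments tries to settle.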