Dear Edgar,

The following code snippet leads to an MPI deadlock when running on 72 MPI ranks.
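The original snippet is not reproduced in this thread; as a stand-in, here is a hypothetical minimal sketch of the kind of slice call described below. The tensor lengths, offsets, and names (e, f) are illustrative assumptions, not the reporter's actual code: e is the large, fully distributed tensor, and f is the small vector that CTF maps onto only a few of the 72 ranks.

```cpp
// Hypothetical minimal reproducer, reconstructed from the description in this
// issue; lengths, offsets, and names are illustrative only.
#include <mpi.h>
#include <ctf.hpp>
using namespace CTF;

int main(int argc, char ** argv){
  MPI_Init(&argc, &argv);
  {
    World dw(argc, argv);

    int sym[1]   = {NS};
    int len_e[1] = {7200}; // large vector: CTF distributes it over all 72 ranks
    int len_f[1] = {2};    // small vector: CTF maps it onto only a few ranks
    Tensor<double> e(1, len_e, sym, dw);
    Tensor<double> f(1, len_f, sym, dw);

    // Copy a 2-element window of e into f. With the mismatched mappings
    // described below, a slice call of this form hangs on 72 ranks.
    int off_f[1] = {0},   end_f[1] = {2};
    int off_e[1] = {100}, end_e[1] = {102};
    f.slice(off_f, end_f, 0.0, e, off_e, end_e, 1.0);
  }
  MPI_Finalize();
  return 0;
}
```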
The reason for the deadlock seems to be the following: the two tensors have different mappings. The tensor e has the mapping

e[p72(0)c0]

whereas f has

f topo (, 2, 2, 2, 3, 3); CTF: Tensor mapping is ETXS01[p2(0)c0]
This leads to incorrect values in src/redistribution/slice.cxx:190, which in turn produces a mismatched MPI_Sendrecv_replace operation in src/redistribution/slice.cxx:207
(only ranks 2-71 reach the MPI_Sendrecv_replace call with send/recv destination 0 or 1, so ranks 0 and 1 never post matching calls).
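For illustration, here is a small self-contained MPI program, unrelated to the CTF source, that reproduces the generic hang pattern described above: some ranks block inside MPI_Sendrecv_replace waiting on partner ranks that never post a matching call.

```cpp
// Illustration of the deadlock pattern: ranks >= 2 exchange with ranks 0/1,
// but ranks 0 and 1 never enter a matching send/recv, so the program hangs.
#include <mpi.h>
#include <cstdio>

int main(int argc, char ** argv){
  MPI_Init(&argc, &argv);
  int rank, np;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &np);

  double buf = (double)rank;
  if (rank >= 2){
    int partner = rank % 2; // send to / receive from rank 0 or 1
    // Blocks forever: the partner rank never posts a matching call.
    MPI_Sendrecv_replace(&buf, 1, MPI_DOUBLE, partner, 0, partner, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
  }
  // Ranks 0 and 1 wait here while the other ranks are stuck above,
  // so this barrier never completes when run on more than 2 ranks.
  MPI_Barrier(MPI_COMM_WORLD);
  if (rank == 0) std::printf("not reached on more than 2 ranks\n");
  MPI_Finalize();
  return 0;
}
```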
The crudest workaround would be to change the if statement in src/tensor/untyped_tensor.cxx:1043. Is there another solution, or am I missing something here?
Best,
Andreas
@airmler I was able to reproduce this, thanks for reporting. Please try out the fix on branch fix_slice. The problem is that the new optimized slice code seems to be erroneous when a tensor is not distributed over all processors, as is the case with the small vector here. I simply turned off fast slice for such cases, which should hopefully be a robust fix. If the performance of slice becomes an issue as a result, please let me know; it should be possible to extend the optimized code to handle such cases as well, it will just require more work.
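Schematically, the guard described here might look like the following sketch. MappedTensor, covers_all_procs, slice_fast, and slice_general are hypothetical names used only for illustration and do not appear in the CTF source.

```cpp
// Schematic sketch of the described fix, not the actual CTF code:
// take the optimized slice path only when both tensors are
// distributed over all processors, otherwise fall back to the
// general redistribution-based slice.
#include <cstdio>

struct MappedTensor { int num_mapped_procs; }; // hypothetical stand-in

static bool covers_all_procs(const MappedTensor & t, int np){
  return t.num_mapped_procs == np; // tensor is distributed over every rank
}

static void slice_fast(const MappedTensor &, const MappedTensor &){
  std::puts("optimized slice path");
}
static void slice_general(const MappedTensor &, const MappedTensor &){
  std::puts("general slice path");
}

void slice_dispatch(const MappedTensor & dst, const MappedTensor & src, int np){
  if (covers_all_procs(dst, np) && covers_all_procs(src, np))
    slice_fast(dst, src);
  else
    slice_general(dst, src);
}

int main(){
  MappedTensor e{72}, f{2};
  slice_dispatch(f, e, 72); // small vector mapped on 2 of 72 ranks -> general path
  return 0;
}
```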
Thanks for the fast fix. This works perfectly for us.
In our calculations the large tensors are fully distributed, which is why this change is satisfactory.
I cannot see a case where the extension would really be necessary.