# Numba-dpex generating incorrect results for sliding window matmul kernel #892
@fcharras Here is some initial analysis: if I change the matrix to 6x6, so that none of the work items perform no-ops, then the result is correct. If I add a print between the two barriers, the result is correct even for the 5x5 matrix multiplication. Based on my preliminary analysis, the problem seems to be a synchronization issue triggered when some of the work items are no-ops. I will next look at the generated LLVM and SPIR-V code to analyze further.
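For context, the no-op work items come from rounding the ND-range up to a multiple of the work-group size. A small sketch (plain Python; the names are illustrative, not numba-dpex API) counts them for this reproducer:

```python
import math

# Illustrative sketch: with a (5, 5) matrix and a (2, 2) work group, the
# ND-range is rounded up to the next multiple of the work-group size, so
# some work items fall outside the matrix and perform no-ops -- yet they
# must still reach every barrier for the kernel to be well-formed.
matrix_shape = (5, 5)
work_group_size = (2, 2)

global_range = tuple(
    math.ceil(n / wg) * wg for n, wg in zip(matrix_shape, work_group_size)
)

noop_items = [
    (i, j)
    for i in range(global_range[0])
    for j in range(global_range[1])
    if i >= matrix_shape[0] or j >= matrix_shape[1]
]

print(global_range)     # (6, 6)
print(len(noop_items))  # 11
```

With a 6x6 matrix the global range needs no padding and every work item does useful work, which matches the observation that the bug disappears in that case.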
TY for looking at this! I also noticed that the issue is hard to pin down: changing minor things in the example can lead to different behavior, or even no bug at all, and I couldn't spot a pattern. I hope you can find better information in the LLVM and SPIR-V code 🤞
@fcharras I was trying to write a serial equivalent of the matmul code, but my results are different.

```python
import numpy as np

square_block_side = 2
work_group_size = (square_block_side, square_block_side)
dtype = np.float32

result_idx = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [0, 2], [1, 2],
                       [0, 3], [1, 3], [0, 4], [1, 4], [0, 5], [1, 5],
                       [4, 2], [5, 2], [4, 3], [5, 3], [4, 0], [5, 0],
                       [4, 1], [5, 1], [4, 4], [5, 4], [4, 5], [5, 5],
                       [2, 0], [3, 0], [2, 1], [3, 1], [2, 2], [3, 2],
                       [2, 3], [3, 3], [2, 4], [3, 4], [2, 5], [3, 5]])

local_idx = np.array([[0, 0], [1, 0], [0, 1], [1, 1],
                      [0, 0], [1, 0], [0, 1], [1, 1],
                      [0, 0], [1, 0], [0, 1], [1, 1],
                      [0, 0], [1, 0], [0, 1], [1, 1],
                      [0, 0], [1, 0], [0, 1], [1, 1],
                      [0, 0], [1, 0], [0, 1], [1, 1],
                      [0, 0], [1, 0], [0, 1], [1, 1],
                      [0, 0], [1, 0], [0, 1], [1, 1],
                      [0, 0], [1, 0], [0, 1], [1, 1]])


def matmul_no_kernel(X, Y):
    X_n_rows = X.shape[0]
    Y_n_cols = Y.shape[1]
    n_cols = X.shape[1]
    n_blocks_for_cols = n_cols // square_block_side
    if (n_cols % square_block_side) > 0:
        n_blocks_for_cols += 1
    X_sliding_window = np.zeros(shape=work_group_size, dtype=dtype)
    Y_sliding_window = np.zeros(shape=work_group_size, dtype=dtype)
    result = np.zeros((X_n_rows, Y_n_cols), dtype=dtype)
    for i in range(result_idx.shape[0]):
        local_row_idx, local_col_idx = local_idx[i]
        result_row_idx, result_col_idx = result_idx[i]
        output = 0.0
        for block_idx in range(n_blocks_for_cols):
            if (result_row_idx < X_n_rows) and (
                (local_col_idx + (square_block_side * block_idx)) < n_cols
            ):
                X_sliding_window[local_row_idx, local_col_idx] = X[
                    result_row_idx, local_col_idx + (square_block_side * block_idx)
                ]
            if (result_col_idx < Y_n_cols) and (
                (local_row_idx + (square_block_side * block_idx)) < n_cols
            ):
                Y_sliding_window[local_row_idx, local_col_idx] = Y[
                    local_row_idx + (square_block_side * block_idx), result_col_idx
                ]
            for idx in range(square_block_side):
                output += (
                    X_sliding_window[local_row_idx, idx]
                    * Y_sliding_window[idx, local_col_idx]
                )
        if (result_row_idx < X_n_rows) and (result_col_idx < Y_n_cols):
            result[result_row_idx, result_col_idx] = output
    return result


def _arange_reshaped(shape, dtype):
    n_items = shape[0] * shape[1]
    return np.arange(n_items, dtype=dtype).reshape(shape)


if __name__ == "__main__":
    X = _arange_reshaped((5, 5), dtype)
    Y = _arange_reshaped((5, 5), dtype)
    print(np.matmul(X, Y))
    print(result_idx.shape)
    print(local_idx.shape)
    print(matmul_no_kernel(X, Y))
```

But I am getting a different result:

```
[[ 253.  258.  112.  156.  259.]
 [ 700.  859.  462.  957.  880.]
 [1711. 1128. 1057. 1138. 1207.]
 [1300. 1869. 1860. 2087. 2070.]
 [ 871. 1864. 1867. 2008. 2137.]]
```

Is my code correct?
@chudur-budur Fully serialized, this can't work: it really requires the barrier steps for the synchronization of work items within the same work group. So the fix is not trivial here.
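To make that concrete, a serial emulation has to treat each barrier as a phase boundary: every work item of a group finishes the load phase before any item starts accumulating. Here is a sketch of that idea in plain NumPy (all names are mine; this is an illustration of the execution model, not numba-dpex API):

```python
import numpy as np

square_block_side = 2


def matmul_serial_with_barriers(X, Y):
    # Serial emulation of the work-group execution model: each barrier is
    # modeled by completing a phase for *all* work items of a group before
    # any item moves on to the next phase.
    X_n_rows, n_cols = X.shape
    Y_n_cols = Y.shape[1]
    bs = square_block_side
    n_row_groups = -(-X_n_rows // bs)  # ceil division
    n_col_groups = -(-Y_n_cols // bs)
    n_blocks_for_cols = -(-n_cols // bs)
    result = np.zeros((X_n_rows, Y_n_cols), dtype=X.dtype)

    for g_row in range(n_row_groups):
        for g_col in range(n_col_groups):
            # one pair of shared-memory tiles per work group
            X_slm = np.zeros((bs, bs), dtype=X.dtype)
            Y_slm = np.zeros((bs, bs), dtype=X.dtype)
            acc = np.zeros((bs, bs), dtype=X.dtype)
            for block_idx in range(n_blocks_for_cols):
                # phase 1: every work item loads, then "barrier"
                for lr in range(bs):
                    for lc in range(bs):
                        row = g_row * bs + lr
                        col = lc + bs * block_idx
                        X_slm[lr, lc] = (
                            X[row, col] if row < X_n_rows and col < n_cols else 0
                        )
                        row2 = lr + bs * block_idx
                        col2 = g_col * bs + lc
                        Y_slm[lr, lc] = (
                            Y[row2, col2] if row2 < n_cols and col2 < Y_n_cols else 0
                        )
                # phase 2: every work item accumulates, then "barrier"
                for lr in range(bs):
                    for lc in range(bs):
                        acc[lr, lc] += X_slm[lr, :] @ Y_slm[:, lc]
            # write back the in-range results
            for lr in range(bs):
                for lc in range(bs):
                    row, col = g_row * bs + lr, g_col * bs + lc
                    if row < X_n_rows and col < Y_n_cols:
                        result[row, col] = acc[lr, lc]
    return result
```

Running the whole load loop before the whole accumulate loop is what the two `group_barrier` calls enforce on the device; interleaving them per work item, as the fully serialized version does, reads tiles that other "work items" have not filled yet.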
See #906, which I think is a simpler instance of this bug.
@fcharras I am finally able to devote time to the issue. My initial intuition after going over the generated LLVM IR is that with To test my hypothesis, I ran the reproducer with My next steps are:

a) I will extract the optimization flags that dpcpp uses and use just those when we compile a kernel.

Will update with the latest findings soon.
Thank you! I will experiment with
I found that this particular reproducer works even with
@fcharras @ZzEeKkAa We finally have a full solution to the issue. The following script is a modified version of your original code.

There were several issues that resulted in the erroneous results you reported:

a) Incorrect code generation for barriers. We fixed that issue by implementing a new

b) The indexing functions were getting optimized in a way that was leading to wrong SPIR-V code generation. We are still investigating with the graphics compiler team whether the bug is in the SPIR-V compiler, but as a detour we modified what LLVM IR we generate from the indexing functions (fedd4d5) and now generate the same code as dpcpp.

With these fixes, which were done as part of the near-full rewrite of the kernel API (currently still in

I will close the issue once you confirm, or after we merge #1331 and move all changes from

```python
from numba_dpex import kernel_api as kapi
import numba_dpex.experimental as dpex_exp
import numpy as np
import dpctl.tensor as dpt

square_block_side = 2
work_group_size = (square_block_side, square_block_side)
dtype = np.float32


@dpex_exp.kernel
def matmul(
    nditem: kapi.NdItem,
    X,       # IN READ-ONLY (X_n_rows, n_cols)
    y,       # IN READ-ONLY (n_cols, y_n_rows),
    X_slm,   # SLM to store a sliding window over X
    Y_slm,   # SLM to store a sliding window over Y
    result,  # OUT (X_n_rows, y_n_rows)
):
    X_n_rows = X.shape[0]
    Y_n_cols = y.shape[1]
    n_cols = X.shape[1]

    result_row_idx = nditem.get_global_id(0)
    result_col_idx = nditem.get_global_id(1)

    local_row_idx = nditem.get_local_id(0)
    local_col_idx = nditem.get_local_id(1)

    n_blocks_for_cols = n_cols // square_block_side
    if (n_cols % square_block_side) > 0:
        n_blocks_for_cols += 1

    output = dtype(0)
    gr = nditem.get_group()

    for block_idx in range(n_blocks_for_cols):
        X_slm[local_row_idx, local_col_idx] = dtype(0)
        Y_slm[local_row_idx, local_col_idx] = dtype(0)
        if (result_row_idx < X_n_rows) and (
            (local_col_idx + (square_block_side * block_idx)) < n_cols
        ):
            X_slm[local_row_idx, local_col_idx] = X[
                result_row_idx, local_col_idx + (square_block_side * block_idx)
            ]
        if (result_col_idx < Y_n_cols) and (
            (local_row_idx + (square_block_side * block_idx)) < n_cols
        ):
            Y_slm[local_row_idx, local_col_idx] = y[
                local_row_idx + (square_block_side * block_idx), result_col_idx
            ]

        kapi.group_barrier(gr)

        for idx in range(square_block_side):
            output += X_slm[local_row_idx, idx] * Y_slm[idx, local_col_idx]

        kapi.group_barrier(gr)

    if (result_row_idx < X_n_rows) and (result_col_idx < Y_n_cols):
        result[result_row_idx, result_col_idx] = output


def _arange_reshaped(shape, dtype):
    n_items = shape[0] * shape[1]
    return np.arange(n_items, dtype=dtype).reshape(shape)


X = _arange_reshaped((5, 5), dtype)
Y = _arange_reshaped((5, 5), dtype)
X = dpt.asarray(X)
Y = dpt.asarray(Y)
device = X.device.sycl_device
result = dpt.zeros((5, 5), dtype, device=device)
X_slm = kapi.LocalAccessor(shape=work_group_size, dtype=dtype)
Y_slm = kapi.LocalAccessor(shape=work_group_size, dtype=dtype)

dpex_exp.call_kernel(matmul, kapi.NdRange((6, 6), (2, 2)), X, Y, X_slm, Y_slm, result)
print(np.matmul(X, Y))
print(result)
```
Here's an example that I can't get to work: it's basically a port of this numba.cuda "fast matmul" example, where threads cooperatively load two arrays of shared memory.
I've tried many variations of it with no success; the last row of the output always has wrong values. Note that this seems to be deterministic: the values in the last row are always the same. Maybe there's an error in my snippet, but I've already questioned and inspected each line of it, enough to start thinking the compiled code might be wrong even if the kernel is written right. WDYT?
Originally posted by @fcharras in #871 (comment)