
Numba-dpex generating incorrect results for sliding window matmul kernel #892

Closed
diptorupd opened this issue Jan 29, 2023 · 10 comments

@diptorupd
Contributor

Here's an example that I can't get to work: it's basically a port of the numba.cuda "fast matmul" example, where threads cooperatively load two tiles into shared memory arrays.

import sklearn_numba_dpex
import numba_dpex as dpex
import numpy as np
import dpctl.tensor as dpt

square_block_side = 2
work_group_size = (square_block_side, square_block_side)
dtype = np.float32


@dpex.kernel
def matmul(
    X,         # IN READ-ONLY    (X_n_rows, n_cols)
    y,         # IN READ-ONLY    (n_cols, y_n_rows),
    result,    # OUT             (X_n_rows, y_n_rows)
):

    X_n_rows = X.shape[0]
    Y_n_cols = y.shape[1]
    n_cols = X.shape[1]

    result_row_idx = dpex.get_global_id(0)
    result_col_idx = dpex.get_global_id(1)

    local_row_idx = dpex.get_local_id(0)
    local_col_idx = dpex.get_local_id(1)

    n_blocks_for_cols = n_cols // square_block_side
    if (n_cols % square_block_side) > 0:
        n_blocks_for_cols += 1

    X_sliding_window = dpex.local.array(shape=work_group_size, dtype=dtype)
    Y_sliding_window = dpex.local.array(shape=work_group_size, dtype=dtype)

    output = dtype(0)

    for block_idx in range(n_blocks_for_cols):
        X_sliding_window[local_row_idx, local_col_idx] = dtype(0)
        Y_sliding_window[local_row_idx, local_col_idx] = dtype(0)
        if (result_row_idx < X_n_rows) and (
            (local_col_idx + (square_block_side * block_idx)) < n_cols
        ):
            X_sliding_window[local_row_idx, local_col_idx] = X[
                result_row_idx, local_col_idx + (square_block_side * block_idx)
            ]

        if (result_col_idx < Y_n_cols) and (
            (local_row_idx + (square_block_side * block_idx)) < n_cols
        ):
            Y_sliding_window[local_row_idx, local_col_idx] = y[
                local_row_idx + (square_block_side * block_idx), result_col_idx
            ]

        dpex.barrier(dpex.CLK_LOCAL_MEM_FENCE)

        for idx in range(square_block_side):
            output += (
                X_sliding_window[local_row_idx, idx]
                * Y_sliding_window[idx, local_col_idx]
            )

        dpex.barrier(dpex.CLK_LOCAL_MEM_FENCE)

    if (result_row_idx < X_n_rows) and (result_col_idx < Y_n_cols):
        result[result_row_idx, result_col_idx] = output


def _arange_reshaped(shape, dtype):
    n_items = shape[0] * shape[1]
    return np.arange(n_items, dtype=dtype).reshape(shape)


X = _arange_reshaped((5, 5), dtype)
Y = _arange_reshaped((5, 5), dtype)

print(np.matmul(X, Y))

X = dpt.asarray(X)
Y = dpt.asarray(Y)

device = X.device.sycl_device
result = dpt.zeros((5, 5), dtype, device=device)

matmul[(6,6), (2,2)](X, Y, result)

print(result)

Output:

# expected output
[[ 150.  160.  170.  180.  190.]
 [ 400.  435.  470.  505.  540.]
 [ 650.  710.  770.  830.  890.]
 [ 900.  985. 1070. 1155. 1240.]
 [1150. 1260. 1370. 1480. 1590.]]

# kernel output
[[ 150.  160.  170.  180.  190.]
 [ 400.  435.  470.  505.  540.]
 [ 650.  710.  770.  830.  890.]
 [ 900.  985. 1070. 1155. 1240.]
 [ 700.  766.  832.  898.  964.]]

I've tried many variations of it with no success; the last row of the output always has wrong values. Note that this seems to be deterministic: the values in the last row are always the same. Maybe there's an error in my snippet, but I've already scrutinized and inspected every line of it, enough to start thinking that the compiled code might be wrong even though the kernel is written correctly. WDYT?

Originally posted by @fcharras in #871 (comment)

@diptorupd diptorupd changed the title Numba-dpex generating incorrect results for matmul kernel Numba-dpex generating incorrect results for sliding window matmul kernel Jan 30, 2023
@diptorupd
Contributor Author

@fcharras Here is some initial analysis: if I change the matrix to 6x6, so that none of the work items perform no-ops, then the result is correct. If I add a print between the two barriers, the result is correct even for the 5x5 matrix multiplication.

Based on my preliminary analysis, the problem seems to be a synchronization issue that arises when some of the work items are no-ops. I will next look at the generated LLVM and SPIR-V code to analyze further.
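For reference, a minimal sketch of the 6x6 check described above (hypothetical test code that reuses the _arange_reshaped helper, the dtype, and the matmul kernel from the reproducer; with a (6, 6) global range over a 6x6 matrix, no work item takes the early-exit path):

import numpy as np
import dpctl.tensor as dpt

X = _arange_reshaped((6, 6), dtype)
Y = _arange_reshaped((6, 6), dtype)
expected = np.matmul(X, Y)

X = dpt.asarray(X)
Y = dpt.asarray(Y)
result = dpt.zeros((6, 6), dtype, device=X.device.sycl_device)

# Same launch syntax as the reproducer: global range (6, 6), local range (2, 2).
matmul[(6, 6), (2, 2)](X, Y, result)
assert np.allclose(dpt.asnumpy(result), expected)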

@fcharras

TY for looking at this! I also noticed that the issue is hard to pin down: changing minor things in the example can lead to different behavior, or even no bug at all, and I couldn't spot a pattern. I hope you can find better information in the LLVM and SPIR-V code 🤞

@chudur-budur
Contributor

@fcharras I was trying to write a serial equivalent of the matmul code, but my results are different.

import numpy as np

square_block_side = 2
work_group_size = (square_block_side, square_block_side)
dtype = np.float32

result_idx = np.array([[0, 0],[1, 0],[0, 1],[1, 1],[0, 2],[1, 2],
                       [0, 3],[1, 3],[0, 4],[1, 4],[0, 5],[1, 5],
                       [4, 2],[5, 2],[4, 3],[5, 3],[4, 0],[5, 0],
                       [4, 1],[5, 1],[4, 4],[5, 4],[4, 5],[5, 5],
                       [2, 0],[3, 0],[2, 1],[3, 1],[2, 2],[3, 2],
                       [2, 3],[3, 3],[2, 4],[3, 4],[2, 5],[3, 5]])

local_idx = np.array([[0, 0],[1, 0],[0, 1],[1, 1],
                      [0, 0],[1, 0],[0, 1],[1, 1],
                      [0, 0],[1, 0],[0, 1],[1, 1],
                      [0, 0],[1, 0],[0, 1],[1, 1],
                      [0, 0],[1, 0],[0, 1],[1, 1],
                      [0, 0],[1, 0],[0, 1],[1, 1],
                      [0, 0],[1, 0],[0, 1],[1, 1],
                      [0, 0],[1, 0],[0, 1],[1, 1],
                      [0, 0],[1, 0],[0, 1],[1, 1]])

def matmul_no_kernel(X, Y):
    X_n_rows = X.shape[0]
    Y_n_cols = Y.shape[1]
    n_cols = X.shape[1]

    n_blocks_for_cols = n_cols // square_block_side
    if (n_cols % square_block_side) > 0:
        n_blocks_for_cols += 1

    X_sliding_window = np.zeros(shape=work_group_size, dtype=dtype)     
    Y_sliding_window = np.zeros(shape=work_group_size, dtype=dtype) 
    
    result = np.zeros((X_n_rows,Y_n_cols), dtype=dtype)
    
    for i in range(result_idx.shape[0]):
        local_row_idx, local_col_idx = local_idx[i]
        result_row_idx, result_col_idx = result_idx[i]
        
        output = 0.0 
        for block_idx in range(n_blocks_for_cols):
            if (result_row_idx < X_n_rows) and (
                (local_col_idx + (square_block_side * block_idx)) < n_cols
            ):
                X_sliding_window[local_row_idx, local_col_idx] = X[
                    result_row_idx, local_col_idx + (square_block_side * block_idx)
                ]

            if (result_col_idx < Y_n_cols) and (
                (local_row_idx + (square_block_side * block_idx)) < n_cols
            ):
                Y_sliding_window[local_row_idx, local_col_idx] = Y[
                    local_row_idx + (square_block_side * block_idx), result_col_idx
                ]

            for idx in range(square_block_side):
                output += (
                    X_sliding_window[local_row_idx, idx]
                    * Y_sliding_window[idx, local_col_idx]
                )

        if (result_row_idx < X_n_rows) and (result_col_idx < Y_n_cols):
            result[result_row_idx, result_col_idx] = output
    return result    

def _arange_reshaped(shape, dtype):
    n_items = shape[0] * shape[1]
    return np.arange(n_items, dtype=dtype).reshape(shape)

if __name__ == "__main__":
    X = _arange_reshaped((5, 5), dtype)
    Y = _arange_reshaped((5, 5), dtype)

    print(np.matmul(X, Y))

    print(result_idx.shape)
    print(local_idx.shape)
    print(matmul_no_kernel(X, Y))

But I am getting a different result.

[[ 253.  258.  112.  156.  259.]
 [ 700.  859.  462.  957.  880.]
 [1711. 1128. 1057. 1138. 1207.]
 [1300. 1869. 1860. 2087. 2070.]
 [ 871. 1864. 1867. 2008. 2137.]]

Is my code correct?

@fcharras

fcharras commented Feb 1, 2023

Fully serialized, this can't work, @chudur-budur: it really requires the barrier steps so that the work items of the same work group stay synchronized. So the fix is not trivial here.
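To illustrate the point, here is a minimal NumPy-only sketch (not part of the original discussion; the names are made up) of what a faithful serial emulation would have to do: within each work group, every work item must finish loading the sliding windows (the phase before the first barrier) before any work item starts accumulating (the phase after it), and the windows must be re-filled for every block.

import numpy as np

square_block_side = 2
dtype = np.float32


def matmul_emulated_barriers(X, Y):
    X_n_rows, n_cols = X.shape
    Y_n_cols = Y.shape[1]
    bs = square_block_side
    n_blocks_for_cols = -(-n_cols // bs)  # ceiling division

    result = np.zeros((X_n_rows, Y_n_cols), dtype=dtype)

    # One pass per work group; `output` holds the private accumulator of each
    # of the group's bs x bs work items.
    for group_row in range(-(-X_n_rows // bs)):
        for group_col in range(-(-Y_n_cols // bs)):
            output = np.zeros((bs, bs), dtype=dtype)
            for block_idx in range(n_blocks_for_cols):
                X_win = np.zeros((bs, bs), dtype=dtype)
                Y_win = np.zeros((bs, bs), dtype=dtype)
                # Phase 1 (before the barrier): every work item loads its element.
                for li in range(bs):
                    for lj in range(bs):
                        gi = group_row * bs + li
                        gj = group_col * bs + lj
                        if gi < X_n_rows and lj + bs * block_idx < n_cols:
                            X_win[li, lj] = X[gi, lj + bs * block_idx]
                        if gj < Y_n_cols and li + bs * block_idx < n_cols:
                            Y_win[li, lj] = Y[li + bs * block_idx, gj]
                # Phase 2 (after the barrier): every work item accumulates.
                for li in range(bs):
                    for lj in range(bs):
                        for k in range(bs):
                            output[li, lj] += X_win[li, k] * Y_win[k, lj]
            # Write back the valid results.
            for li in range(bs):
                for lj in range(bs):
                    gi = group_row * bs + li
                    gj = group_col * bs + lj
                    if gi < X_n_rows and gj < Y_n_cols:
                        result[gi, gj] = output[li, lj]
    return result

For 5x5 inputs this matches np.matmul(X, Y), which illustrates that the discrepancy in the serial version above comes from not emulating the cooperative loads and barriers, not from the kernel's logic.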

@fcharras

See #906, which I think is a simpler instance of this bug.

@diptorupd diptorupd self-assigned this Mar 7, 2023
@diptorupd diptorupd added the bug Something isn't working label Mar 11, 2023
@diptorupd
Contributor Author

diptorupd commented Oct 4, 2023

@fcharras I am finally able to devote time to the issue. My initial intuition after going over the generated LLVM IR is that, with the -O3 optimization flag, LLVM applies some optimizations that lead to incorrect indexing when looking up values from the local array. This seems to happen because LLVM does not know the semantics of get_global_id and the family of SPIR-V calls that we generate.

To test my hypothesis, I ran the reproducer with the NUMBA_OPT=0 environment variable set. With all optimizations turned off, at least this reproducer works for me on all supported devices (opencl:cpu, level_zero:gpu, opencl:gpu).
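For anyone reproducing this, one way to control the optimization level is to set the variable before Numba is imported (exporting NUMBA_OPT=0 in the shell before launching Python works as well; this snippet is only an illustration, not part of the original comment):

import os

# Must run before numba/numba_dpex is imported so the configuration picks it up.
os.environ["NUMBA_OPT"] = "0"

import numba_dpex as dpex  # noqa: E402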

My next steps are:

a) I will extract the optimization flags that dpcpp uses and apply just those when we compile a kernel.
b) Replace the SPIR-V instructions we use with those dpcpp currently generates. (I have already done this for barrier and tested it, without success.)

Will update with the latest findings soon.

@fcharras

fcharras commented Oct 4, 2023

Thank you! I will experiment with NUMBA_OPT and report the findings too.

@diptorupd
Contributor Author

> Thank you! I will experiment with NUMBA_OPT and report the findings too.

I found that this particular reproducer works even with NUMBA_OPT=2; setting the optimization level to 3 causes the problems.

@diptorupd diptorupd added the user User submitted issue label Feb 29, 2024
@diptorupd
Contributor Author

@fcharras @ZzEeKkAa We finally have a full solution to the issue. The following script is a modified version of your original code. The LocalAccessor changes are not yet merged, and I used #1331 to evaluate the script.

There were several issues behind the erroneous results you reported:

a) Incorrect code generation for barriers. We fixed that issue by implementing a new group_barrier function that generates the same code as dpcpp. Note that to support group_barrier, the way indexing is done inside a kernel was completely overhauled; that is why you see the new NdItem object being passed to the kernel.

b) The indexing functions were getting optimized in a way that led to wrong SPIR-V code generation. We are still investigating with the graphics compiler team whether the bug is in the SPIR-V compiler, but as a workaround we changed the LLVM IR we generate for the indexing functions (fedd4d5) and now generate the same code as dpcpp.

With these fixes, done as part of the near-full rewrite of the kernel API (currently still experimental), the issue is finally resolved, and I could successfully run the script with the maximum level of optimization (NUMBA_DPEX_OPT=3 INLINE_THRESHOLD=3) on opencl:cpu, opencl:gpu, and level_zero:gpu devices.

I will close the issue once you confirm or after we merge #1331 and move all changes from experimental to stable, whichever happens first 😄

from numba_dpex import kernel_api as kapi
import numba_dpex.experimental as dpex_exp
import numpy as np
import dpctl.tensor as dpt

square_block_side = 2
work_group_size = (square_block_side, square_block_side)
dtype = np.float32

@dpex_exp.kernel
def matmul(
    nditem: kapi.NdItem,
    X,  # IN READ-ONLY    (X_n_rows, n_cols)
    y,  # IN READ-ONLY    (n_cols, y_n_rows),
    X_slm,  # SLM to store a sliding window over X
    Y_slm,  # SLM to store a sliding window over Y
    result,  # OUT        (X_n_rows, y_n_rows)
):

    X_n_rows = X.shape[0]
    Y_n_cols = y.shape[1]
    n_cols = X.shape[1]
    result_row_idx = nditem.get_global_id(0)
    result_col_idx = nditem.get_global_id(1)
    local_row_idx = nditem.get_local_id(0)
    local_col_idx = nditem.get_local_id(1)

    n_blocks_for_cols = n_cols // square_block_side
    if (n_cols % square_block_side) > 0:
        n_blocks_for_cols += 1
    output = dtype(0)

    gr = nditem.get_group()

    for block_idx in range(n_blocks_for_cols):
        X_slm[local_row_idx, local_col_idx] = dtype(0)
        Y_slm[local_row_idx, local_col_idx] = dtype(0)
        if (result_row_idx < X_n_rows) and (
            (local_col_idx + (square_block_side * block_idx)) < n_cols
        ):
            X_slm[local_row_idx, local_col_idx] = X[
                result_row_idx, local_col_idx + (square_block_side * block_idx)
            ]

        if (result_col_idx < Y_n_cols) and (
            (local_row_idx + (square_block_side * block_idx)) < n_cols
        ):
            Y_slm[local_row_idx, local_col_idx] = y[
                local_row_idx + (square_block_side * block_idx), result_col_idx
            ]

        kapi.group_barrier(gr)

        for idx in range(square_block_side):
            output += X_slm[local_row_idx, idx] * Y_slm[idx, local_col_idx]

        kapi.group_barrier(gr)

    if (result_row_idx < X_n_rows) and (result_col_idx < Y_n_cols):
        result[result_row_idx, result_col_idx] = output

def _arange_reshaped(shape, dtype):
    n_items = shape[0] * shape[1]
    return np.arange(n_items, dtype=dtype).reshape(shape)

X = _arange_reshaped((5, 5), dtype)
Y = _arange_reshaped((5, 5), dtype)

# Compute the reference result before moving the data to the device.
print(np.matmul(X, Y))

X = dpt.asarray(X)
Y = dpt.asarray(Y)

device = X.device.sycl_device
result = dpt.zeros((5, 5), dtype, device=device)
X_slm = kapi.LocalAccessor(shape=work_group_size, dtype=dtype)
Y_slm = kapi.LocalAccessor(shape=work_group_size, dtype=dtype)

dpex_exp.call_kernel(matmul, kapi.NdRange((6, 6), (2, 2)), X, Y, X_slm, Y_slm, result)
print(result)
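A small usage note (an assumption based on standard dpctl usage, not something stated in the thread): numba-dpex follows compute-follows-data, so the device the kernel runs on (opencl:cpu, opencl:gpu, or level_zero:gpu) is selected by allocating the input arrays on it, for example:

# Allocate the inputs and the output on a specific device; call_kernel then
# launches the kernel on the queue associated with these arrays.
X = dpt.asarray(_arange_reshaped((5, 5), dtype), device="level_zero:gpu")
Y = dpt.asarray(_arange_reshaped((5, 5), dtype), device="level_zero:gpu")
result = dpt.zeros((5, 5), dtype, device="level_zero:gpu")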

@diptorupd diptorupd added this to the 0.23 milestone Mar 15, 2024
@diptorupd
Contributor Author

Closing, as the example works fine in main after the fixes implemented in #1377 and #1357 and the overall changes to the group barrier code generation and API.
