
Numba-dpex generating incorrect results for sliding window matmul kernel #892

Closed
diptorupd opened this issue Jan 29, 2023 · 10 comments

@diptorupd
Contributor

Here's an example that I can't get to work: it's basically a port of the numba.cuda "fast matmul" example, where threads cooperatively load two tiles into shared memory arrays.

import sklearn_numba_dpex
import numba_dpex as dpex
import numpy as np
import dpctl.tensor as dpt

square_block_side = 2
work_group_size = (square_block_side, square_block_side)
dtype = np.float32


@dpex.kernel
def matmul(
    X,         # IN READ-ONLY    (X_n_rows, n_cols)
    y,         # IN READ-ONLY    (n_cols, y_n_rows),
    result,    # OUT             (X_n_rows, y_n_rows)
):

    X_n_rows = X.shape[0]
    Y_n_cols = y.shape[1]
    n_cols = X.shape[1]

    result_row_idx = dpex.get_global_id(0)
    result_col_idx = dpex.get_global_id(1)

    local_row_idx = dpex.get_local_id(0)
    local_col_idx = dpex.get_local_id(1)

    n_blocks_for_cols = n_cols // square_block_side
    if (n_cols % square_block_side) > 0:
        n_blocks_for_cols += 1

    X_sliding_window = dpex.local.array(shape=work_group_size, dtype=dtype)
    Y_sliding_window = dpex.local.array(shape=work_group_size, dtype=dtype)

    output = dtype(0)

    for block_idx in range(n_blocks_for_cols):
        X_sliding_window[local_row_idx, local_col_idx] = dtype(0)
        Y_sliding_window[local_row_idx, local_col_idx] = dtype(0)
        if (result_row_idx < X_n_rows) and (
            (local_col_idx + (square_block_side * block_idx)) < n_cols
        ):
            X_sliding_window[local_row_idx, local_col_idx] = X[
                result_row_idx, local_col_idx + (square_block_side * block_idx)
            ]

        if (result_col_idx < Y_n_cols) and (
            (local_row_idx + (square_block_side * block_idx)) < n_cols
        ):
            Y_sliding_window[local_row_idx, local_col_idx] = y[
                local_row_idx + (square_block_side * block_idx), result_col_idx
            ]

        dpex.barrier(dpex.CLK_LOCAL_MEM_FENCE)

        for idx in range(square_block_side):
            output += (
                X_sliding_window[local_row_idx, idx]
                * Y_sliding_window[idx, local_col_idx]
            )

        dpex.barrier(dpex.CLK_LOCAL_MEM_FENCE)

    if (result_row_idx < X_n_rows) and (result_col_idx < Y_n_cols):
        result[result_row_idx, result_col_idx] = output


def _arange_reshaped(shape, dtype):
    n_items = shape[0] * shape[1]
    return np.arange(n_items, dtype=dtype).reshape(shape)


X = _arange_reshaped((5, 5), dtype)
Y = _arange_reshaped((5, 5), dtype)

print(np.matmul(X, Y))

X = dpt.asarray(X)
Y = dpt.asarray(Y)

device = X.device.sycl_device
result = dpt.zeros((5, 5), dtype, device=device)

matmul[(6,6), (2,2)](X, Y, result)

print(result)

Output:

# expected output
[[ 150.  160.  170.  180.  190.]
 [ 400.  435.  470.  505.  540.]
 [ 650.  710.  770.  830.  890.]
 [ 900.  985. 1070. 1155. 1240.]
 [1150. 1260. 1370. 1480. 1590.]]

# kernel output
[[ 150.  160.  170.  180.  190.]
 [ 400.  435.  470.  505.  540.]
 [ 650.  710.  770.  830.  890.]
 [ 900.  985. 1070. 1155. 1240.]
 [ 700.  766.  832.  898.  964.]]

I've tried many variations of it with no success; the last row of the output always has wrong values. Note that this seems to be deterministic: the values in the last row are always the same. Maybe there's an error in my snippet, but I've already scrutinized and inspected every line of it, enough to start thinking that the compiled code might be wrong even though the kernel is written correctly. WDYT?

Originally posted by @fcharras in #871 (comment)

@diptorupd diptorupd changed the title Numba-dpex generating incorrect results for matmul kernel Numba-dpex generating incorrect results for sliding window matmul kernel Jan 30, 2023
@diptorupd
Contributor Author

@fcharras Here is some initial analysis: if I change the matrix to 6x6, so that none of the work items perform no-ops, then the result is correct. If I add a print between the two barriers, the result is correct even for the 5x5 matrix multiplication.

Based on my preliminary analysis, the problem seems to be a synchronization issue that arises when some of the work items are no-ops. I will next look at the generated LLVM and SPIR-V code to analyze further.
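For reference, a minimal sketch of the 6x6 check described above (hypothetical test code that reuses the _arange_reshaped helper, the dtype, and the matmul kernel from the reproducer; with a (6, 6) global range over a 6x6 matrix, no work item takes the early-exit path):

import numpy as np
import dpctl.tensor as dpt

X = _arange_reshaped((6, 6), dtype)
Y = _arange_reshaped((6, 6), dtype)
expected = np.matmul(X, Y)

X = dpt.asarray(X)
Y = dpt.asarray(Y)
result = dpt.zeros((6, 6), dtype, device=X.device.sycl_device)

# Same launch syntax as the reproducer: global range (6, 6), local range (2, 2).
matmul[(6, 6), (2, 2)](X, Y, result)
assert np.allclose(dpt.asnumpy(result), expected)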

@fcharras

TY for looking at this! I also noticed that the issue is hard to pin down: changing minor things in the example can lead to different behavior, or even no bug at all, and I couldn't spot a pattern. I hope you can find better information in the LLVM and SPIR-V code 🤞

@chudur-budur
Contributor

@fcharras I was trying to write a serial equivalent of the matmul code, but my results are different.

import numpy as np

square_block_side = 2
work_group_size = (square_block_side, square_block_side)
dtype = np.float32

result_idx = np.array([[0, 0],[1, 0],[0, 1],[1, 1],[0, 2],[1, 2],
                       [0, 3],[1, 3],[0, 4],[1, 4],[0, 5],[1, 5],
                       [4, 2],[5, 2],[4, 3],[5, 3],[4, 0],[5, 0],
                       [4, 1],[5, 1],[4, 4],[5, 4],[4, 5],[5, 5],
                       [2, 0],[3, 0],[2, 1],[3, 1],[2, 2],[3, 2],
                       [2, 3],[3, 3],[2, 4],[3, 4],[2, 5],[3, 5]])

local_idx = np.array([[0, 0],[1, 0],[0, 1],[1, 1],
                      [0, 0],[1, 0],[0, 1],[1, 1],
                      [0, 0],[1, 0],[0, 1],[1, 1],
                      [0, 0],[1, 0],[0, 1],[1, 1],
                      [0, 0],[1, 0],[0, 1],[1, 1],
                      [0, 0],[1, 0],[0, 1],[1, 1],
                      [0, 0],[1, 0],[0, 1],[1, 1],
                      [0, 0],[1, 0],[0, 1],[1, 1],
                      [0, 0],[1, 0],[0, 1],[1, 1]])

def matmul_no_kernel(X, Y):
    X_n_rows = X.shape[0]
    Y_n_cols = Y.shape[1]
    n_cols = X.shape[1]

    n_blocks_for_cols = n_cols // square_block_side
    if (n_cols % square_block_side) > 0:
        n_blocks_for_cols += 1

    X_sliding_window = np.zeros(shape=work_group_size, dtype=dtype)     
    Y_sliding_window = np.zeros(shape=work_group_size, dtype=dtype) 
    
    result = np.zeros((X_n_rows,Y_n_cols), dtype=dtype)
    
    for i in range(result_idx.shape[0]):
        local_row_idx, local_col_idx = local_idx[i]
        result_row_idx, result_col_idx = result_idx[i]
        
        output = 0.0 
        for block_idx in range(n_blocks_for_cols):
            if (result_row_idx < X_n_rows) and (
                (local_col_idx + (square_block_side * block_idx)) < n_cols
            ):
                X_sliding_window[local_row_idx, local_col_idx] = X[
                    result_row_idx, local_col_idx + (square_block_side * block_idx)
                ]

            if (result_col_idx < Y_n_cols) and (
                (local_row_idx + (square_block_side * block_idx)) < n_cols
            ):
                Y_sliding_window[local_row_idx, local_col_idx] = Y[
                    local_row_idx + (square_block_side * block_idx), result_col_idx
                ]

            for idx in range(square_block_side):
                output += (
                    X_sliding_window[local_row_idx, idx]
                    * Y_sliding_window[idx, local_col_idx]
                )

        if (result_row_idx < X_n_rows) and (result_col_idx < Y_n_cols):
            result[result_row_idx, result_col_idx] = output
    return result    

def _arange_reshaped(shape, dtype):
    n_items = shape[0] * shape[1]
    return np.arange(n_items, dtype=dtype).reshape(shape)

if __name__ == "__main__":
    X = _arange_reshaped((5, 5), dtype)
    Y = _arange_reshaped((5, 5), dtype)

    print(np.matmul(X, Y))

    print(result_idx.shape)
    print(local_idx.shape)
    print(matmul_no_kernel(X, Y))

But I am getting a different result.

[[ 253.  258.  112.  156.  259.]
 [ 700.  859.  462.  957.  880.]
 [1711. 1128. 1057. 1138. 1207.]
 [1300. 1869. 1860. 2087. 2070.]
 [ 871. 1864. 1867. 2008. 2137.]]

Is my code correct?

@fcharras

fcharras commented Feb 1, 2023

Fully serialized, this can't work, @chudur-budur: it really requires the barrier steps so that the work items of the same work group stay synchronized. So the fix is not trivial here.
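To illustrate the point, here is a minimal NumPy-only sketch (not part of the original discussion; the names are made up) of what a faithful serial emulation would have to do: within each work group, every work item must finish loading the sliding windows (the phase before the first barrier) before any work item starts accumulating (the phase after it), and the windows must be re-filled for every block.

import numpy as np

square_block_side = 2
dtype = np.float32


def matmul_emulated_barriers(X, Y):
    X_n_rows, n_cols = X.shape
    Y_n_cols = Y.shape[1]
    bs = square_block_side
    n_blocks_for_cols = -(-n_cols // bs)  # ceiling division

    result = np.zeros((X_n_rows, Y_n_cols), dtype=dtype)

    # One pass per work group; `output` holds the private accumulator of each
    # of the group's bs x bs work items.
    for group_row in range(-(-X_n_rows // bs)):
        for group_col in range(-(-Y_n_cols // bs)):
            output = np.zeros((bs, bs), dtype=dtype)
            for block_idx in range(n_blocks_for_cols):
                X_win = np.zeros((bs, bs), dtype=dtype)
                Y_win = np.zeros((bs, bs), dtype=dtype)
                # Phase 1 (before the barrier): every work item loads its element.
                for li in range(bs):
                    for lj in range(bs):
                        gi = group_row * bs + li
                        gj = group_col * bs + lj
                        if gi < X_n_rows and lj + bs * block_idx < n_cols:
                            X_win[li, lj] = X[gi, lj + bs * block_idx]
                        if gj < Y_n_cols and li + bs * block_idx < n_cols:
                            Y_win[li, lj] = Y[li + bs * block_idx, gj]
                # Phase 2 (after the barrier): every work item accumulates.
                for li in range(bs):
                    for lj in range(bs):
                        for k in range(bs):
                            output[li, lj] += X_win[li, k] * Y_win[k, lj]
            # Write back the valid results.
            for li in range(bs):
                for lj in range(bs):
                    gi = group_row * bs + li
                    gj = group_col * bs + lj
                    if gi < X_n_rows and gj < Y_n_cols:
                        result[gi, gj] = output[li, lj]
    return result

For 5x5 inputs this matches np.matmul(X, Y), which illustrates that the discrepancy in the serial version above comes from not emulating the cooperative loads and barriers, not from the kernel's logic.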

@fcharras

See #906, which I think is a simpler instance of this bug.

@diptorupd diptorupd self-assigned this Mar 7, 2023
@diptorupd diptorupd added the bug Something isn't working label Mar 11, 2023
@diptorupd
Contributor Author

diptorupd commented Oct 4, 2023

@fcharras I am finally able to devote time to the issue. My initial intuition after going over the generated LLVM IR is that, with the -O3 optimization flag, LLVM applies some optimizations that lead to incorrect indexing when looking up values from the local array. This seems to happen because LLVM does not know the semantics of get_global_id and the family of SPIR-V calls that we generate.

To test my hypothesis, I ran the reproducer with the NUMBA_OPT=0 environment variable set. With all optimizations turned off, at least this reproducer works for me on all supported devices (opencl:cpu, level_zero:gpu, opencl:gpu).
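For anyone reproducing this, one way to control the optimization level is to set the variable before Numba is imported (exporting NUMBA_OPT=0 in the shell before launching Python works as well; this snippet is only an illustration, not part of the original comment):

import os

# Must run before numba/numba_dpex is imported so the configuration picks it up.
os.environ["NUMBA_OPT"] = "0"

import numba_dpex as dpex  # noqa: E402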

My next steps are:

a) I will extract the optimization flags that dpcpp uses and apply just those when we compile a kernel.
b) Replace the SPIR-V instructions we use with those dpcpp currently generates. (I have already done this for barrier and tested it, without success.)

Will update with the latest findings soon.

@fcharras

fcharras commented Oct 4, 2023

Thank you! I will experiment with NUMBA_OPT and report the findings too.

@diptorupd
Contributor Author

> Thank you! I will experiment with NUMBA_OPT and report the findings too.

I found that this particular reproducer works even with NUMBA_OPT=2; setting the optimization level to 3 causes the problems.

@diptorupd diptorupd added the user User submitted issue label Feb 29, 2024
@diptorupd
Contributor Author

@fcharras @ZzEeKkAa We finally have a full solution to the issue. The following script is a modified version of your original code. The LocalAccessor changes are not yet merged, and I used #1331 to evaluate the script.

There were several issues behind the erroneous results you reported:

a) Incorrect code generation for barriers. We fixed that issue by implementing a new group_barrier function that generates the same code as dpcpp. Note that to support group_barrier, the way indexing is done inside a kernel was completely overhauled; that is why you see the new NdItem object being passed to the kernel.

b) The indexing functions were getting optimized in a way that led to wrong SPIR-V code generation. We are still investigating with the graphics compiler team whether the bug is in the SPIR-V compiler, but as a workaround we changed the LLVM IR we generate for the indexing functions (fedd4d5) and now generate the same code as dpcpp.

With these fixes, done as part of the near-full rewrite of the kernel API (currently still experimental), the issue is finally resolved, and I could successfully run the script with the maximum level of optimization (NUMBA_DPEX_OPT=3 INLINE_THRESHOLD=3) on opencl:cpu, opencl:gpu, and level_zero:gpu devices.

I will close the issue once you confirm or after we merge #1331 and move all changes from experimental to stable, whichever happens first 😄

from numba_dpex import kernel_api as kapi
import numba_dpex.experimental as dpex_exp
import numpy as np
import dpctl.tensor as dpt

square_block_side = 2
work_group_size = (square_block_side, square_block_side)
dtype = np.float32

@dpex_exp.kernel
def matmul(
    nditem: kapi.NdItem,
    X,  # IN READ-ONLY    (X_n_rows, n_cols)
    y,  # IN READ-ONLY    (n_cols, y_n_rows),
    X_slm,  # SLM to store a sliding window over X
    Y_slm,  # SLM to store a sliding window over Y
    result,  # OUT        (X_n_rows, y_n_rows)
):

    X_n_rows = X.shape[0]
    Y_n_cols = y.shape[1]
    n_cols = X.shape[1]
    result_row_idx = nditem.get_global_id(0)
    result_col_idx = nditem.get_global_id(1)
    local_row_idx = nditem.get_local_id(0)
    local_col_idx = nditem.get_local_id(1)

    n_blocks_for_cols = n_cols // square_block_side
    if (n_cols % square_block_side) > 0:
        n_blocks_for_cols += 1
    output = dtype(0)

    gr = nditem.get_group()

    for block_idx in range(n_blocks_for_cols):
        X_slm[local_row_idx, local_col_idx] = dtype(0)
        Y_slm[local_row_idx, local_col_idx] = dtype(0)
        if (result_row_idx < X_n_rows) and (
            (local_col_idx + (square_block_side * block_idx)) < n_cols
        ):
            X_slm[local_row_idx, local_col_idx] = X[
                result_row_idx, local_col_idx + (square_block_side * block_idx)
            ]

        if (result_col_idx < Y_n_cols) and (
            (local_row_idx + (square_block_side * block_idx)) < n_cols
        ):
            Y_slm[local_row_idx, local_col_idx] = y[
                local_row_idx + (square_block_side * block_idx), result_col_idx
            ]

        kapi.group_barrier(gr)

        for idx in range(square_block_side):
            output += X_slm[local_row_idx, idx] * Y_slm[idx, local_col_idx]

        kapi.group_barrier(gr)

    if (result_row_idx < X_n_rows) and (result_col_idx < Y_n_cols):
        result[result_row_idx, result_col_idx] = output

def _arange_reshaped(shape, dtype):
    n_items = shape[0] * shape[1]
    return np.arange(n_items, dtype=dtype).reshape(shape)

X = _arange_reshaped((5, 5), dtype)
Y = _arange_reshaped((5, 5), dtype)

# Compute the reference result before moving the data to the device.
print(np.matmul(X, Y))

X = dpt.asarray(X)
Y = dpt.asarray(Y)

device = X.device.sycl_device
result = dpt.zeros((5, 5), dtype, device=device)
X_slm = kapi.LocalAccessor(shape=work_group_size, dtype=dtype)
Y_slm = kapi.LocalAccessor(shape=work_group_size, dtype=dtype)

dpex_exp.call_kernel(matmul, kapi.NdRange((6, 6), (2, 2)), X, Y, X_slm, Y_slm, result)
print(result)
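A small usage note (an assumption based on standard dpctl usage, not something stated in the thread): numba-dpex follows compute-follows-data, so the device the kernel runs on (opencl:cpu, opencl:gpu, or level_zero:gpu) is selected by allocating the input arrays on it, for example:

# Allocate the inputs and the output on a specific device; call_kernel then
# launches the kernel on the queue associated with these arrays.
X = dpt.asarray(_arange_reshaped((5, 5), dtype), device="level_zero:gpu")
Y = dpt.asarray(_arange_reshaped((5, 5), dtype), device="level_zero:gpu")
result = dpt.zeros((5, 5), dtype, device="level_zero:gpu")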

@diptorupd diptorupd added this to the 0.23 milestone Mar 15, 2024
@diptorupd
Contributor Author

Closing, as the example works fine in main after the fixes implemented in #1377 and #1357 and the overall changes to the group barrier code generation and API.
