
get_global_id() indexing order doesn't follow that of SYCL indexing #889

Closed
chudur-budur opened this issue Jan 27, 2023 · 3 comments

@chudur-budur
Contributor

Let's say we want to multiply two matrices (of dimensions |a| = X*Y and |b| = Y*Z, i.e. a x b = c, |c| = X*Z) using this SYCL code:

#include <iostream>
#include <vector>
#include <string>
#include <CL/sycl.hpp>

int matmul(std::vector<int> &a,
            std::vector<int> &b,
            size_t row_a, size_t col_a,
            size_t row_b, size_t col_b) {

    size_t X = row_a, Y = col_a, Z = col_b;

    std::vector<int> c(X * Z, 0);

    sycl::queue queue;
    std::cout << "Selected device: "
            << queue.get_device().get_info<sycl::info::device::name>()
            << "\n";

    sycl::buffer Ab{a};
    sycl::buffer Bb{b};
    sycl::buffer Cb{c};

    queue.submit([&](sycl::handler& h) {

        sycl::accessor Ad{Ab,h,sycl::read_only};
        sycl::accessor Bd{Bb,h,sycl::read_only};
        sycl::accessor Cd{Cb,h,sycl::read_write};

        sycl::range global{X, Z};

        sycl::stream out(1024, 256, h);
        h.parallel_for(sycl::range{global}, [=](sycl::id<2> index) {
            auto i = index[0];
            auto j = index[1];
            out << "i = " << i << ", j = " << j << sycl::endl;
            for (auto k = 0; k < Y; ++k) {
              Cd[i*Z+j] += Ad[i*Y+k] * Bd[k*Z+j];
            }
        });
    });

    sycl::host_accessor Ch{Cb, sycl::read_only};
    for(auto i = 0 ; i < X ; i++) {
        for(auto j = 0 ; j < Z ; j++)
            std::cout << Ch[i*Z+j] << " ";
        std::cout << std::endl;
    }
    return 0;
}

int main () {
    std::vector<int> a(20,1);
    std::vector<int> b(24,1);

    matmul(a, b, 5, 4, 4, 6);
    return 0;
}

What we are essentially doing is parallelizing the two outer loops of this loop nest (using the same flattened indexing as the kernel):

for (auto i = 0; i < X; i++)
    for (auto j = 0; j < Z; j++)
        for (auto k = 0; k < Y; k++)
            c[i*Z+j] += a[i*Y+k] * b[k*Z+j];
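
For reference, a plain-Python sketch of that serial loop nest (no SYCL or numba_dpex involved), using the same dimensions as the example below (X=5, Y=4, Z=6) with all-ones inputs; every entry of c should come out to Y = 4:

```python
# Plain-Python sketch of the serial loop nest being parallelized.
# Dimensions match the example: |a| = X*Y, |b| = Y*Z, |c| = X*Z.
X, Y, Z = 5, 4, 6

a = [[1] * Y for _ in range(X)]   # X*Y matrix of ones
b = [[1] * Z for _ in range(Y)]   # Y*Z matrix of ones
c = [[0] * Z for _ in range(X)]   # X*Z result, zero-initialized

for i in range(X):
    for j in range(Z):
        for k in range(Y):
            c[i][j] += a[i][k] * b[k][j]

# With all-ones inputs, every dot product sums Y ones.
assert all(c[i][j] == Y for i in range(X) for j in range(Z))
```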

The parallel_for receives the two outermost loop indices through the sycl::id<2> kernel argument (the analogue of get_global_id()). If we print the indices, we see the expected row-major order:

i = 0, j = 0
i = 0, j = 1
i = 0, j = 2
i = 0, j = 3
i = 0, j = 4
i = 0, j = 5
i = 1, j = 0
i = 1, j = 1
i = 1, j = 2
i = 1, j = 3
i = 1, j = 4
i = 1, j = 5
i = 2, j = 0
i = 2, j = 1
i = 2, j = 2
i = 2, j = 3
i = 2, j = 4
i = 2, j = 5
i = 3, j = 0
i = 3, j = 1
i = 3, j = 2
i = 3, j = 3
i = 3, j = 4
i = 3, j = 5
i = 4, j = 0
i = 4, j = 1
i = 4, j = 2
i = 4, j = 3
i = 4, j = 4
i = 4, j = 5

Now, if we look at the equivalent matrix multiplication code in numba_dpex:

import dpctl
import dpctl.tensor as dpt
import numba_dpex as dpex

# Array dimensions
X = 5
Y = 4
Z = 6
global_size = X, Z

@dpex.kernel
def matmul(a, b, c):
    i = dpex.get_global_id(0)
    j = dpex.get_global_id(1)
    print("i =", i, "j =", j)
    for k in range(Y):
        c[i, j] += a[i, k] * b[k, j]


def driver(a, b, c):
    # Invoke the kernel
    matmul[global_size, dpex.DEFAULT_LOCAL_SIZE](a,b,c)


if __name__ == "__main__":
    device = dpctl.select_default_device()
    a = dpt.ones(X * Y, dtype=dpt.int32, device=device)
    a = dpt.reshape(a, (X, Y))
    
    b = dpt.ones(Y * Z, dtype=dpt.int32, device=device)
    b = dpt.reshape(b, (Y, Z))
    
    c = dpt.zeros(X * Z, dtype=dpt.int32, device=device)
    c = dpt.reshape(c, (X, Z))

    print("Using device ...")
    device.print_device_info()

    driver(a, b, c)

The printed index order is flipped (column-major):

i = 0 j = 0
i = 1 j = 0
i = 2 j = 0
i = 3 j = 0
i = 4 j = 0
i = 0 j = 1
i = 1 j = 1
i = 2 j = 1
i = 3 j = 1
i = 4 j = 1
i = 0 j = 2
i = 1 j = 2
i = 2 j = 2
i = 3 j = 2
i = 4 j = 2
i = 0 j = 3
i = 1 j = 3
i = 2 j = 3
i = 3 j = 3
i = 4 j = 3
i = 0 j = 4
i = 1 j = 4
i = 2 j = 4
i = 3 j = 4
i = 4 j = 4
i = 0 j = 5
i = 1 j = 5
i = 2 j = 5
i = 3 j = 5
i = 4 j = 5

Why is that, given that all the parameter settings in the Python code are identical to those in the C++ code?
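
One way to see where the two printouts could come from (a hypothetical reconstruction, not a claim about how either runtime is implemented): enumerating the same X*Z index space in row-major versus column-major order reproduces the two printed sequences exactly:

```python
# Hypothetical illustration: the same 2D index space enumerated two ways.
X, Z = 5, 6

row_major = [(i, j) for i in range(X) for j in range(Z)]   # matches the SYCL printout
col_major = [(i, j) for j in range(Z) for i in range(X)]   # matches the numba_dpex printout

print(row_major[:3])  # [(0, 0), (0, 1), (0, 2)]
print(col_major[:3])  # [(0, 0), (1, 0), (2, 0)]

# Both enumerations cover exactly the same set of work items.
assert set(row_major) == set(col_major)
```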

@diptorupd
Contributor

diptorupd commented Jan 29, 2023

@chudur-budur I do not understand the issue. In both SYCL and numba-dpex, i (i.e., get_global_id(0), the first index) runs over [0, 5) and j (i.e., get_global_id(1)) runs over [0, 6).

So what is the discrepancy?

The concept of iteration order is irrelevant in this case. There is no iteration, as we are not executing any loop; you are executing a data parallel kernel where X*Z work items execute in parallel. Both dpcpp and numba-dpex use some internal serialization mechanism to generate the prints, and from the look of it those mechanisms differ, but that in itself proves nothing about the order in which the work items execute. In fact, by definition, the order of execution of work items in a work group is undefined.
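
A small illustration of that point (plain Python threads standing in for work items, not numba_dpex): the order in which parallel tasks complete and report is not specified and may vary from run to run, yet the set of computed results is unaffected:

```python
# Sketch: observation order of parallel tasks is unspecified,
# but every work item still runs exactly once.
from concurrent.futures import ThreadPoolExecutor, as_completed

X, Z = 5, 6

def work(i, j):
    # Stand-in for one work item; just report its index pair.
    return (i, j)

with ThreadPoolExecutor(max_workers=4) as ex:
    futures = [ex.submit(work, i, j) for i in range(X) for j in range(Z)]
    # as_completed yields futures in completion order, which may differ run to run.
    seen = [f.result() for f in as_completed(futures)]

# Regardless of the observed order, each index pair appears exactly once.
assert sorted(seen) == [(i, j) for i in range(X) for j in range(Z)]
```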

The real issue will be if you are getting different results from seemingly identical kernels written in dpcpp and dpex. Is that what you are seeing?

@chudur-budur
Contributor Author

> @chudur-budur I do not understand the issue. In both SYCL and numba-dpex, i (i.e., get_global_id(0), the first index) runs over [0, 5) and j (i.e., get_global_id(1)) runs over [0, 6).
>
> So what is the discrepancy?

Yea, it's not a discrepancy per se. Since the printed order is not tied to any loop iteration, I think we should be fine as long as the indices access the correct values in the arrays.

@fcharras

fcharras commented Jan 30, 2023

While there are more important sources of concern at the moment, couldn't this actually matter for optimizing execution? I wonder whether knowing the order in which work items are dispatched could be leveraged for a better cache hit rate in some kernels. In that regard I would expect work items to be dispatched in row-major order, so this example might in fact show concerning behavior from numba_dpex: the SYCL kernel dispatches work in row-major order, while numba_dpex appears to use column-major order.
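
As a concrete data point on layout (a NumPy sketch; the dpctl.tensor arrays above are assumed C-ordered like NumPy's default): in a row-major array, consecutive j values are adjacent in memory, while consecutive i values are a full row apart, which is why row-major dispatch is the cache-friendly order for these arrays:

```python
import numpy as np

# C-ordered (row-major) 5x6 int32 array, mirroring the c matrix in the example.
c = np.zeros((5, 6), dtype=np.int32)

# Strides in bytes: stepping j moves 4 bytes (adjacent elements),
# stepping i moves Z * 4 = 24 bytes (a whole row).
print(c.strides)  # (24, 4)
```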
