oneAPI with nested integrals #398
Replies: 6 comments 3 replies
-
There is no standard procedure to compute an integral using Julia's GPU packages, i.e., there is no pre-existing parallel abstraction to do so. Maybe it's possible to re-use some of the existing abstractions we have. I did find this on the Julia Discourse, https://discourse.julialang.org/t/julia-integral-calculation-community-module-or-own-module/24278, which seems relevant.
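For illustration (a sketch not taken from the thread), a simple quadrature rule can be phrased purely in terms of broadcasting and a sum reduction, the kind of existing array abstraction oneAPI.jl already supports; with the sample points stored in a oneArray, the same code would run on the GPU:

```julia
# Midpoint-rule quadrature written with broadcasting and `sum` only.
# Because it uses nothing but array abstractions, the same function
# would also accept a GPU array of midpoints (e.g. a oneArray) --
# that GPU usage is an assumption of this sketch, not tested here.
function midpoint_integral(f, a, b, n)
    h = (b - a) / n
    xs = a .+ h .* ((0:n-1) .+ 0.5)  # midpoints of the n subintervals
    return h * sum(f.(xs))           # broadcast + reduction
end

midpoint_integral(x -> x^2, 0.0, 1.0, 10_000)  # ≈ 1/3
```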
It is not clear what you are referring to here. What 2D kernel are you expecting to be implemented? I'll also convert this to a discussion, as this is not a bug. I would suggest moving this discussion to the Julia Discourse instead, where other people familiar with (GPU programming in) Julia, or domain experts, can chime in. Finally, with questions like this it is always recommended to include an MWE. Nothing real, i.e. not using GaPSE.jl, but a simple demonstration of the concept and how you currently perform the calculation on the CPU.
-
Spline1D

We had an interesting discussion about this with @kballeda. Actually, most of the integral computations fall back on spline evaluations; we are using Dierckx.Spline1D, which, as we show below, is not supported. An equivalent object in oneAPI.jl (for instance the oneMKL splines for C/C++/Fortran, https://www.intel.com/content/www/us/en/docs/onemkl/developer-reference-dpcpp/2023-2/splines.html) would solve the problem at its origin. We see no other workaround:

using oneAPI, Dierckx
struct lims
a::Float64
b::Float64
spline::Dierckx.Spline1D
end
function kernel_2(res, xs, lims)
i = get_global_id()
res[i] = lims[1].spline(xs[i])
return
end
xs = oneArray(rand(5))
res = similar(xs)
as = 1:1:10
bs = as .^ 2
spline = Spline1D(as, bs; bc="error")
lms = lims(1.0, 2.0, spline)
lms_oneapi = oneArray([lms])
@oneapi items = 10 kernel_2(res, xs, lms_oneapi)

Output:

ERROR: LoadError: InvalidIRError: compiling MethodInstance for kernel_2(::oneDeviceVector{Float64, 1}, ::oneDeviceVector{Float64, 1}, ::oneDeviceVector{lims, 1}) resulted in invalid LLVM IR
Reason: unsupported call to an unknown function (call to julia.new_gc_frame)
Stacktrace:
[1] evaluate
@ ~/.julia/packages/Dierckx/TDOyl/src/Dierckx.jl:296
[2] Spline1D
@ ~/.julia/packages/Dierckx/TDOyl/src/Dierckx.jl:1112
[3] kernel_2
@ ~/julia_experiments/t_spline_oneapi.jl:13
Reason: unsupported call to an unknown function (call to julia.push_gc_frame)
Stacktrace:
[1] evaluate
@ ~/.julia/packages/Dierckx/TDOyl/src/Dierckx.jl:296
[2] Spline1D
@ ~/.julia/packages/Dierckx/TDOyl/src/Dierckx.jl:1112
[3] kernel_2
@ ~/julia_experiments/t_spline_oneapi.jl:13
Reason: unsupported call to an unknown function (call to julia.pop_gc_frame)
Stacktrace:
...

(Incidentally, we are encapsulating our struct in a oneArray, as shown above.)

Kwargs with @oneapi

We would like to know what the options are for the kernel definition: can we pass keyword arguments? We tried, but it seems that the @oneapi macro doesn't allow that:

using oneAPI
f(x; b=1.0) = b*x^2
function kernel_1(res, xs, lims; b=1.0)
i = get_global_id()
res[i] = f(xs[i]; b=b)
return
end
xs = oneArray(rand(5))
res = similar(xs)
@oneapi items=10 kernel_1(res, xs; b=2.0)

Output:

julia> include("test.jl")
ERROR: LoadError: syntax: invalid syntax ; b = 2
Stacktrace:
[1] top-level scope
@ ~/julia_experiments/test.jl:15
[2] include(fname::String)
@ Base.MainInclude ./client.jl:489
[3] top-level scope
@ REPL[1]:1
in expression starting at /dss/dsshome1/08/di75tom/julia_experiments/test.jl:15

2D kernels

Our reference for this is SYCL (https://github.com/oneapi-src/DPCPP_Reference/blob/727e42af27f9a7c7237462cbaa02d7753b4e02e6/reference/headers/item.h#L18), on top of which oneAPI.jl is based. In SYCL one can use a 1D, 2D or even 3D kernel; is something similar available in oneAPI.jl?
-
Dierckx.jl is just a wrapper for the DIERCKX CPU library, so it's not expected to be GPU compatible. GPU compatibility of that package would be up to that package to provide, e.g., by using an extension package building on the oneMKL functionality you linked.
There is no need to wrap structs in arrays; you can just pass arbitrary structs (as long as they're GPU compatible, i.e., do not contain CPU pointers).
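As a sketch of that point (a hypothetical kernel; the struct is assumed to hold only Float64 fields so that it is isbits, unlike the Spline1D-carrying struct from the question), the bounds can be passed to the kernel directly:

```julia
using oneAPI

# A struct containing only plain bits (no Spline1D, no CPU pointers)
# is isbits, so it can be passed to a kernel as-is, without oneArray([...]).
struct Bounds
    a::Float64
    b::Float64
end

function kernel_bounds(res, xs, lims::Bounds)
    i = get_global_id()
    res[i] = clamp(xs[i], lims.a, lims.b)  # use the struct's fields directly
    return
end

xs = oneArray(rand(5))
res = similar(xs)
@oneapi items=5 kernel_bounds(res, xs, Bounds(0.25, 0.75))
```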
Keyword arguments to kernels are currently not implemented. It would be possible, but we've seen almost no requests for it, so nobody has spent the time developing the capability.
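A possible workaround (our sketch, not an official recommendation from the package) is to pass the parameter positionally instead, adapting the kernel_1 example from the question:

```julia
using oneAPI

f(x, b) = b * x^2

# @oneapi rejects keyword arguments, so `b` is passed positionally:
function kernel_1(res, xs, b)
    i = get_global_id()
    res[i] = f(xs[i], b)
    return
end

xs = oneArray(rand(5))
res = similar(xs)
@oneapi items=5 kernel_1(res, xs, 2.0)
```

Capturing the parameter in a closure defined outside the kernel would work similarly, as long as the captured values are isbits.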
ND kernels are supported. Adapting the vadd example to 2D:

using oneAPI, Test
function vadd(a, b, c)
i, j = get_global_id(0), get_global_id(1)
@inbounds c[i,j] = a[i,j] + b[i,j]
return
end
dims = (2,2)
a = round.(rand(Float32, dims) * 100)
b = round.(rand(Float32, dims) * 100)
c = similar(a)
d_a = oneArray(a)
d_b = oneArray(b)
d_c = oneArray(c)
@oneapi items=(2,2) vadd(d_a, d_b, d_c)
c = Array(d_c)
@test a+b ≈ c
-
Alternatively, you could look into pure-Julia packages, like Interpolations.jl, some of which have (limited, but more easily extended) GPU support: http://juliamath.github.io/Interpolations.jl/latest/devdocs/#GPU-Support
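To illustrate why a pure-Julia interpolant can work inside a kernel (this is our own sketch with a hypothetical helper name, not Interpolations.jl's API): linear interpolation over a uniform grid needs only scalar arithmetic and array indexing, both of which GPU kernels support.

```julia
# Linear interpolation on the uniform grid x0, x0+h, ..., x0+(n-1)*h.
# `lerp_uniform` is a hypothetical helper, written with only scalar
# arithmetic and indexing so it could be called from a GPU kernel.
function lerp_uniform(ys::AbstractVector, x0, h, x)
    t = (x - x0) / h                                 # position in grid units
    i = clamp(floor(Int, t) + 1, 1, length(ys) - 1)  # left grid index
    w = t - (i - 1)                                  # fractional offset in [0, 1]
    return (1 - w) * ys[i] + w * ys[i + 1]
end

ys = [1.0, 4.0, 9.0, 16.0]       # samples of x^2 at x = 1, 2, 3, 4
lerp_uniform(ys, 1.0, 1.0, 2.5)  # → 6.5, halfway between 4.0 and 9.0
```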
-
Hi @maleadt, I'm working with @foglienimatteo on this code. In the meantime I have an additional related question about the kernel launch configuration.

The exact minimal item size for which this error occurs logically depends on the device. Once again, this is probably due to a lack of documentation. Sorry if the post was long... the underlying question is easy: is there a syntax to process larger arrays? We wouldn't really want to do manual data blocking...
-
That's because you're using an invalid launch configuration, with groups that are too large for your device:
Either clamp those group sizes against the device limits, or use the groupsize suggestion API from Level Zero (wrapped as suggest_groupsize):

using oneAPI, oneAPI.oneL0, Test
function vadd(a, b, c)
i, j = get_global_id(0), get_global_id(1)
if i > size(a, 1) || j > size(a, 2)
return
end
@inbounds c[i,j] = a[i,j] + b[i,j]
return
end
dims = (40,40)
a = round.(rand(Float32, dims) * 100)
b = round.(rand(Float32, dims) * 100)
c = similar(a)
d_a = oneArray(a)
d_b = oneArray(b)
d_c = oneArray(c)
kernel = @oneapi launch=false vadd(d_a, d_b, d_c)
groupsize = suggest_groupsize(kernel.fun, prod(dims))
items = min.((groupsize.x, groupsize.y), dims)
groups = cld.(dims, items)
kernel(d_a, d_b, d_c; items, groups)
c = Array(d_c)
@test a+b ≈ c
The documentation is indeed very sparse. However, oneAPI.jl's low-level kernel API (which you are using) doesn't aim to provide much abstraction over Level Zero, so you should just refer to the Level Zero documentation.
Similarly to the above, that's because oneAPI.jl's kernel programming model sits at the Level Zero abstraction level. If you want something higher level, you could consider e.g. KernelAbstractions.jl, which does provide a more user-friendly range abstraction. An added benefit is portability to our other GPU backends.
-
We are working on the open-source Julia code GaPSE.jl, and for some evaluations we have to perform three nested integrals:

where cosmo is a Julia struct that contains relevant data (Dierckx.Spline1D and Float64 mostly) needed for the computation of f.

Our idea was to create a 2D kernel for computing the integrand function f (defined here) over the two inner integrals (performed here), and to pass the cosmo struct as a parameter. However, it seems that the 2D kernel has not been implemented, and there isn't a oneStruct wrapper for this kind of task. We can use something like oneArray([cosmo]), but it seems more like a workaround than a solution.

What is the standard procedure to use oneAPI.jl for this scenario, where an integrand function must be evaluated over a (2D) vector with an input struct and parameters?

(In principle, working with oneAPI splines could solve this particular problem, but they are not implemented, as far as we understood.)