CUFFT: support for arbitrary dims #119

Open
whymusticode opened this issue Sep 20, 2019 · 18 comments
Labels: bug, cuda libraries, help wanted

Comments

@whymusticode

A 1D FFT along dimension n of a CuArray with more than n dimensions (where n is not 1) is not enabled by the wrapper (ERROR: ArgumentError: batching dims must be sequential, from CuArrays>wXQp8>src>fft>wrappers.jl).

To reproduce:
using CuArrays, CUDAnative, CUDAdrv, FFTW
dim = 2
data = CuArrays.rand(ComplexF32, 512, 512, 512);
myFft = plan_fft!(data, dim);

The FFTW wrapper for regular CPU arrays allows this, and if dim is 1 or 3 the CUDA version also works as expected.

@btmit

btmit commented May 19, 2020

I'm running into this issue as well.

using CuArrays, FFTW
a = cu(rand(ComplexF32, 16, 8, 4, 32))
p = plan_fft!(a, 2)

The error seems to apply to plan_fft as well.

More generally, it appears that the dims argument has to (1) include the first or last dimension and (2) be sequential. For example, with the previous 4D array, the following fail:

p = plan_fft!(a, 2)
p = plan_fft!(a, 3)
p = plan_fft!(a, [2,3])
p = plan_fft!(a, [1,3])

and the following succeed:

p = plan_fft!(a, 1)
p = plan_fft!(a, [1,2])
p = plan_fft!(a, 4)
p = plan_fft!(a, [2,3,4])

This is a problem because an in-place permutedims!(src, dims) doesn't exist, so changing the order of dimensions requires an expensive out-of-place copy.

@maleadt maleadt transferred this issue from JuliaGPU/CuArrays.jl May 27, 2020
@maleadt maleadt added bug Something isn't working cuda libraries Stuff about CUDA library wrappers. labels May 27, 2020
@ali-ramadhan

ali-ramadhan commented Oct 25, 2020

Is this a limitation of the CUFFT library?

From looking at the CUFFT documentation (https://docs.nvidia.com/cuda/cufft/index.html) it wasn't clear to me whether, e.g., an FFT along dimension 2 of a 3D CuArray is possible. It would probably involve a call to cufftPlanMany(), whose documentation says:

With this function, batched plans of 1, 2, or 3 dimensions may be created.

but maybe it's possible with clever use of the advanced data layout parameters? @kburns pointed out that this is the equivalent of the FFTW "advanced" interface, not the full "guru" interface which supports arbitrary loops of strided FFTs. So indeed it seems to be a limitation of the CUFFT library (and of the MKL FFT library).

If FFTs along non-batched dimensions are not supported by CUFFT then shouldn't this issue be closed as "can't fix"?

I'm happy to help work on this issue as it would help resolve a number of important Oceananigans.jl issues (CliMA/Oceananigans.jl#586, CliMA/Oceananigans.jl#1007) and speed up channel models (an important class of simulations) by a factor of ~2x due to the 2D DCT I had to implement in CliMA/Oceananigans.jl#290.

If CUFFT cannot be used along non-batched dimensions, it seems like it should be possible to implement with permutedims!(dest::CuArray, src::CuArray, dims)? Although I'm not sure whether this functionality should enter CUDA.jl.

I think @btmit was thinking of a permutedims!(src::CuArray, dims) function but I don't think such a function is possible if you want efficiency for all possible permutations (due to overlapping memory between input and output, similar to Base.permutedims!)?


For reference, permutedims! on a 256^3 CuArray{Complex{Float64}} takes ~1.8 ms on a V100. So it's a relatively fast operation, but it could introduce a performance hit, especially if you use a lot of FFTs, and could require allocating at least one extra array.

julia> @show typeof(src), typeof(dest);
(typeof(src), typeof(dest)) = (CuArray{Complex{Float64},3}, CuArray{Complex{Float64},3})

julia> @show size(src), size(dest);
(size(src), size(dest)) = ((256, 256, 256), (256, 256, 256))

julia> @benchmark CUDA.@sync permutedims!(dest, src, (2, 1, 3))
BenchmarkTools.Trial: 
  memory estimate:  800 bytes
  allocs estimate:  34
  --------------
  minimum time:     1.828 ms (0.00% GC)
  median time:      1.971 ms (0.00% GC)
  mean time:        2.103 ms (0.00% GC)
  maximum time:     37.842 ms (0.00% GC)
  --------------
  samples:          2371
  evals/sample:     1

@ali-ramadhan

Since this is a limitation of the CUFFT library, it seems that the way to go for an FFT along a non-batched dimension is to permute the array, perform the FFT, and permute back.

E.g. to compute an FFT along dimension 2 of an array A with 3 dimensions, you would permutedims!(B, A, (2, 1, 3)) then compute an FFT along dimension 1 of B, then permutedims!(A, B, (2, 1, 3)).
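
For concreteness, here is a minimal sketch of that workaround (array names and sizes are illustrative; dims 1 and 2 are taken equal here so the scratch array can reuse A's shape):

using CUDA, FFTW
A = CUDA.rand(ComplexF32, 256, 256, 256)
B = similar(A)                 # scratch array; same shape because size(A, 1) == size(A, 2)
permutedims!(B, A, (2, 1, 3))  # bring dimension 2 to the front
fft!(B, 1)                     # batched in-place FFT along the (now leading) dimension
permutedims!(A, B, (2, 1, 3))  # permute back: A now holds the FFT of the original A along dim 2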

Should this issue be closed since it can't really be fixed? It's not really a bug either.

@maleadt
Member

maleadt commented Nov 30, 2020

Should this issue be closed since it can't really be fixed? It's not really a bug either.

I guess; I'm not familiar with CUFFT or the fft API.

Seeing permutedims mentioned here, I had a quick look at optimizing its implementation. It's far from optimal, but some quick improvement is up at JuliaGPU/GPUArrays.jl#338.

@btmit

btmit commented Nov 30, 2020

I believe you can do this with the various options in cufftPlanMany() in CUFFT.

The problem I found with permutedims is that it can't be done in place, so this approach isn't an option for large arrays.

@ali-ramadhan

I believe you can do this with the various options in cufftPlanMany() in CUFFT.

Ah, I'm definitely not an expert in FFTs, but can you see how to do an FFT along dimension 2 of a 3D array with cufftPlanMany()?

@btmit

btmit commented Nov 30, 2020

I will try and figure it out as soon as I have a minute. I think I might be able to track down some old code.

As another data point, MATLAB's CUDA support can do this, though there is a chance that they are doing a permutedims copy under the hood as well.

@btmit

btmit commented Dec 18, 2020

I apologize for disappearing on this. It's a busy time of the year.

From: https://docs.nvidia.com/cuda/cufft/index.html#advanced-data-layout

istride: Indicates the distance between two successive input elements in the least significant (i.e., innermost) dimension

idist: Indicates the distance between the first element of two consecutive signals in a batch of the input data

Let's take the example of an FFT along only the 2nd dimension of a 3D array. In this case istride is size[1] and idist is size[1]*size[2].

Generalizing this, it seems like we could support strided FFTs so long as the dimensions are sequential (no skips).
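
To make that concrete, here is the index arithmetic in Julia for the 3D example (column-major storage; variable names mirror the cufftPlanMany() parameters and are only illustrative, and, as the next comment points out, a single idist only batches over one trailing dimension):

a = rand(ComplexF32, 8, 16, 4)   # n1 x n2 x n3
n1, n2, n3 = size(a)
rank    = 1          # 1D transforms
n       = [n2]       # transform length: the size of dimension 2
istride = n1         # successive dim-2 elements are n1 apart in memory
idist   = n1 * n2    # distance from one signal to the next (one dim-3 slab)
batch   = n3         # signals covered by this single plan, at a fixed dim-1 index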

@kburns

kburns commented Dec 18, 2020

My understanding (if this behaves similarly to FFTW) is that this would only do FFTs along the 2nd dimension in the plane corresponding to index 1 in the 1st dimension: the istride here skips over the other elements along the 1st dimension, and idist essentially loops over indices in the 3rd dimension. To apply an FFT along the 2nd dimension of the full array, you would have to do an outer loop applying this plan at each index in the 1st dimension (and if the plan can be reused, like it can on the CPU, this may be faster than doing a permutation).

On the FFTW side, this is why the guru interface exists -- to do multidimensional loops of transforms. But from https://docs.nvidia.com/cuda/cufft/index.html#fftw-supported-interface, it seems that this is not supported in cuFFT (the entry for "Guru Complex DFTs" lists multidimensional transform batches as "Unsupported").

@RainerHeintzmann
Contributor

Here is some code that can be used to transform along a single (inner) dimension:

using CUDA, FFTW   # plan_fft! comes from AbstractFFTs, re-exported by FFTW

function fft!(arr::CuArray, d::Int)
    if d > 1 && d < ndims(arr)
        # Plan an in-place FFT along dim d of a single slab (trailing dims pinned to index 1)...
        front_ids = ntuple(dd -> Colon(), d)
        ids = ntuple(dd -> ifelse(dd <= d, Colon(), 1), ndims(arr))
        p = plan_fft!((@view arr[ids...]), d)
        # ...then reuse the plan on every combination of trailing indices.
        for c in CartesianIndices(size(arr)[d+1:end])
            p * @view arr[front_ids..., Tuple(c)...]
        end
    else
        # Leading or trailing dimension: CUFFT handles this directly.
        CUDA.CUFFT.fft!(arr, d)
    end
end

It is not ideal, but at least it does not do any extra GPU allocations. On my computer it is about 2x slower than an FFT along the outer dimensions. Ideally the loop over the Cartesian indices should be replaced by a call to cufftPlanMany() (or the PencilFFTs mentioned above?). This code can also be extended to work with multiple inner dimensions given as a tuple, by simply transforming those dimensions sequentially.
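
Illustrative usage of the helper above (sizes are arbitrary):

arr = CUDA.rand(ComplexF32, 64, 64, 64)
fft!(arr, 2)   # in-place FFT along inner dimension 2, without extra GPU allocations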

@maleadt maleadt changed the title CUDA fft wrapper problem CUFFT: support for arbitrary dims May 3, 2023
@maleadt maleadt added the help wanted Extra attention is needed label May 3, 2023
@RainerHeintzmann
Contributor

... turns out it is even slightly worse: FFTing data with more than 3 dims crashes:

julia> fca = fft(CuArray(rand(4,4,4,4)));
ERROR: CUFFTError: user specified an invalid pointer or parameter (code 4, CUFFT_INVALID_VALUE)
Stacktrace:
  [1] throw_api_error(res::CUDA.CUFFT.cufftResult_t)
    @ CUDA.CUFFT C:\Users\pi96doc\Nextcloud-Uni\Julia\Tests\CUDA.jl\lib\cufft\libcufft.jl:11
  [2] macro expansion
    @ C:\Users\pi96doc\Nextcloud-Uni\Julia\Tests\CUDA.jl\lib\cufft\libcufft.jl:24 [inlined]
  [3] cufftMakePlanMany(plan::Int32, rank::Int64, n::Vector{Int32}, inembed::Ptr{Nothing}, istride::Int64, idist::Int64, onembed::Ptr{Nothing}, ostride::Int64, odist::Int64, type::CUDA.CUFFT.cufftType_t, batch::Int64, workSize::Base.RefValue{UInt64})
    @ CUDA.CUFFT C:\Users\pi96doc\Nextcloud-Uni\Julia\Tests\CUDA.jl\lib\utils\call.jl:26
  [4] cufftMakePlan(xtype::CUDA.CUFFT.cufftType_t, xdims::NTuple{4, Int64}, region::UnitRange{Int64})
    @ CUDA.CUFFT C:\Users\pi96doc\Nextcloud-Uni\Julia\Tests\CUDA.jl\lib\cufft\wrappers.jl:37
  [5] #133
    @ C:\Users\pi96doc\Nextcloud-Uni\Julia\Tests\CUDA.jl\lib\cufft\wrappers.jl:145 [inlined]
  [6] (::CUDA.APIUtils.var"#8#11"{CUDA.CUFFT.var"#133#134"{Tuple{CUDA.CUFFT.cufftType_t, NTuple{4, Int64}, UnitRange{Int64}}}, CUDA.APIUtils.HandleCache{Tuple{CuContext, CUDA.CUFFT.cufftType_t, Tuple{Vararg{Int64, N}} where N, Any}, Int32}, Tuple{CuContext, CUDA.CUFFT.cufftType_t, NTuple{4, Int64}, UnitRange{Int64}}})()
    @ CUDA.APIUtils C:\Users\pi96doc\Nextcloud-Uni\Julia\Tests\CUDA.jl\lib\utils\cache.jl:28
  [7] lock(f::CUDA.APIUtils.var"#8#11"{CUDA.CUFFT.var"#133#134"{Tuple{CUDA.CUFFT.cufftType_t, NTuple{4, Int64}, UnitRange{Int64}}}, CUDA.APIUtils.HandleCache{Tuple{CuContext, CUDA.CUFFT.cufftType_t, Tuple{Vararg{Int64, N}} where N, Any}, Int32}, Tuple{CuContext, CUDA.CUFFT.cufftType_t, NTuple{4, Int64}, UnitRange{Int64}}}, l::ReentrantLock)
    @ Base .\lock.jl:185
  [8] (::CUDA.APIUtils.var"#check_cache#9"{CUDA.APIUtils.HandleCache{Tuple{CuContext, CUDA.CUFFT.cufftType_t, Tuple{Vararg{Int64, N}} where N, Any}, Int32}, Tuple{CuContext, CUDA.CUFFT.cufftType_t, NTuple{4, Int64}, UnitRange{Int64}}})(f::CUDA.CUFFT.var"#133#134"{Tuple{CUDA.CUFFT.cufftType_t, NTuple{4, Int64}, UnitRange{Int64}}})
    @ CUDA.APIUtils C:\Users\pi96doc\Nextcloud-Uni\Julia\Tests\CUDA.jl\lib\utils\cache.jl:26
  [9] pop!(f::Function, cache::CUDA.APIUtils.HandleCache{Tuple{CuContext, CUDA.CUFFT.cufftType_t, Tuple{Vararg{Int64, N}} where N, Any}, Int32}, key::Tuple{CuContext, CUDA.CUFFT.cufftType_t, NTuple{4, Int64}, UnitRange{Int64}})
    @ CUDA.APIUtils C:\Users\pi96doc\Nextcloud-Uni\Julia\Tests\CUDA.jl\lib\utils\cache.jl:47
 [10] cufftGetPlan(::CUDA.CUFFT.cufftType_t, ::Vararg{Any})
    @ CUDA.CUFFT C:\Users\pi96doc\Nextcloud-Uni\Julia\Tests\CUDA.jl\lib\cufft\wrappers.jl:143
 [11] plan_fft(X::CuArray{ComplexF64, 4, CUDA.Mem.DeviceBuffer}, region::UnitRange{Int64})
    @ CUDA.CUFFT C:\Users\pi96doc\Nextcloud-Uni\Julia\Tests\CUDA.jl\lib\cufft\fft.jl:163
 [12] fft
    @ C:\Users\pi96doc\.julia\packages\AbstractFFTs\0uOAT\src\definitions.jl:63 [inlined]
 [13] fft (repeats 2 times)
    @ C:\Users\pi96doc\Nextcloud-Uni\Julia\Tests\CUDA.jl\lib\cufft\fft.jl:124 [inlined]
 [14] top-level scope
    @ REPL[12]:1
 [15] top-level scope
    @ C:\Users\pi96doc\Nextcloud-Uni\Julia\Tests\CUDA.jl\src\initialization.jl:171

Might try to give it a shot to fix these things...

@RainerHeintzmann
Contributor

This should be mostly solved now by this pull request:
#1903

@vpuri3
Contributor

vpuri3 commented Jul 10, 2023

@ali-ramadhan @whymusticode what is the consensus here? Is permutedims + fft the way to go?

BTW, I would be grateful if somebody could provide a CUDA-compatible DCT implementation. :D

@ancorso

ancorso commented Mar 26, 2024

this should be mostly solved now by this pull request: #1903

I still receive the same error when attempting a 4D fft (even on the latest CUDA.jl release). Did your comment mean that the 4D fft should be working or just that some machinery is now in place to make this possible? I would be happy to help contribute to getting this functioning if someone can provide some initial guidance on what is needed. Thanks!

@RainerHeintzmann
Contributor

Thanks for bringing this up. I will try to look into this.

@RainerHeintzmann
Contributor

Yes, currently only a total of up to 3 transformed directions is supported. E.g.
fca = fft(CuArray(rand(4,4,4,4,4)), (1,3,4));
transforming directions 1, 3, 4 of the 5D array does work. Do you have a use case where a 4D FFT is needed? I guess one could try to support it by sequentially calling transforms for the extra trailing directions (e.g. in-place on the result; see the sketch below), but it would presumably mean more programming work. @maleadt, any opinion on this? Should the work be invested?
Apparently there is also a new FFT library, https://github.com/PaulVirally/VkFFTCUDA.jl, which might also be worth supporting?
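
A minimal sketch of that sequential decomposition, assuming the 3-direction limit above (the DFT factorizes across dimensions, so a 4D transform can be split into a 3D pass followed by a 1D pass):

using CUDA, FFTW
a = CUDA.rand(ComplexF32, 4, 4, 4, 4)
b = fft(a, (1, 2, 3))   # supported: three sequential leading directions
c = fft(b, 4)           # remaining trailing direction, batched over dims 1:3
# c should match the full 4D fft(Array(a)) computed with FFTW on the CPU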

@ancorso

ancorso commented Apr 1, 2024

In support of investing the work (which I'm happy to help with if pointed in the right direction): I am using 4D FFTs in conjunction with the Fourier Neural Operator (https://github.com/SciML/NeuralOperators.jl) to model 4D PDE solutions (3 spatial dimensions plus time). This feature is supported in PyTorch (https://pytorch.org/docs/stable/generated/torch.fft.fftn.html#torch.fft.fftn), so supporting an n-dimensional FFT would help improve feature parity between the Julia and Python ML toolkits.

@maleadt
Member

maleadt commented Apr 5, 2024

@maleadt, any opinion on this?

Not really, sorry. I'm myself not a user of the CUDA FFT functionality.
