Small 1D FFT and large 1D FFT algorithm
We group problems as small or large, based on whether the input problem size can fit in shared memory (LDS in OpenCL terms, currently 64 KB on AMD Fiji or Vega GPUs) and achieve reasonable occupancy on the GPU. This switch point is 4096 for single precision and 2048 for double precision. So here we define a problem size less than or equal to 4096 as small, and bigger than 4096 as large.
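As an illustrative sketch only (not the actual rocFFT dispatch code), the small/large classification above amounts to a simple threshold check:

```python
def is_small_1d(length, precision="single"):
    # 4096 complex elements fit in LDS for single precision, 2048 for double
    threshold = 4096 if precision == "single" else 2048
    return length <= threshold

print(is_small_1d(4096))            # True  -> small 1D algorithm, one kernel
print(is_small_1d(8192))            # False -> large 1D algorithm, decomposed
print(is_small_1d(4096, "double"))  # False
```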
A small 1D problem can be solved by one HIP kernel. We use a 16-point FFT (vector size 16) as the example.
On the GPU, a 1D 16-point FFT is done by 4 threads (work items in OpenCL terms). Usually, a thread block has 64 threads (a group of 64 threads is called a wavefront), which means a thread block is able to perform up to 16 (= 64/4) 1D 16-point FFT transforms in parallel.
If there are 3 16-point vectors (the batch number is 3), they can be done in one thread block. But if there are 32 16-point vectors, 2 thread blocks are needed.
Practically, the number of transforms allowed in one thread block is determined by shared memory size and occupancy, and is usually obtained by tuning.
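For illustration, the wavefront arithmetic above can be written as a small helper (the real limit also depends on LDS usage and tuning, as just noted):

```python
import math

def blocks_needed(batch, threads_per_transform=4, block_size=64):
    # 64 threads per block / 4 threads per 16-point FFT = 16 transforms per block
    transforms_per_block = block_size // threads_per_transform
    return math.ceil(batch / transforms_per_block)

print(blocks_needed(3))   # 1: a batch of 3 fits in one thread block
print(blocks_needed(32))  # 2: a batch of 32 needs two thread blocks
```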
For sizes of 8192 and beyond, a problem can be broken into smaller kernels, where each sub-problem is solved by the small 1D algorithm. For example, 8192 can be broken into a 64*128 problem. The large 1D vector problem is thus viewed as a 2D matrix problem.
If we view this 2D matrix in row-major format (viewing it as column-major is similar, but row-major is the common practice), the first row's 128 elements are stored consecutively in memory, while the column elements have a stride of 128. The large 1D problem is algorithmically solved by: step 1) perform 128 64-point FFTs; each FFT is performed per column, with data read and written per column (called C-c); step 2) transpose into a 128*64 matrix (T); step 3) perform 64 128-point FFTs; again, these 128-point FFTs are performed per column, read and written per column (C-c). Here, for clarity, we use uppercase C for the column read and lowercase c for the column write.
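The decomposition can be checked on the host with a small numpy sketch. Note that the standard Cooley-Tukey decomposition also needs a twiddle-factor multiplication between the two FFT stages, which the step description above does not spell out; it is included here so the result matches a direct 8192-point FFT:

```python
import numpy as np

N1, N2 = 64, 128                      # 8192 = 64 * 128
rng = np.random.default_rng(0)
x = rng.standard_normal(N1 * N2) + 1j * rng.standard_normal(N1 * N2)

A = x.reshape(N1, N2)                 # view the vector as a 64 x 128 row-major matrix

# Step 1: 128 column FFTs of length 64 (C-c).
Y = np.fft.fft(A, axis=0)

# Twiddle multiplication between the two stages (part of the standard
# Cooley-Tukey decomposition, not spelled out in the prose above).
k1 = np.arange(N1).reshape(N1, 1)
n2 = np.arange(N2).reshape(1, N2)
Y = Y * np.exp(-2j * np.pi * k1 * n2 / (N1 * N2))

# Step 2: transpose into a 128 x 64 matrix (T).
B = Y.T

# Step 3: 64 column FFTs of length 128 (C-c).
C = np.fft.fft(B, axis=0)

# Flattening the 128 x 64 result row-major yields the full 8192-point FFT.
assert np.allclose(C.reshape(-1), np.fft.fft(x))
```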
However, C-c is not good on a GPU, as the column data is not read in a coalesced manner in the row-major layout. A row FFT (R-r) is actually preferred, because a column FFT sandwiched between two transposes (T) is a row FFT, and vice versa: C = T-R-T, c = T-r-T, or equivalently R = T-C-T, r = T-c-T.
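A quick numpy check of the C = T-R-T identity (a column FFT equals transpose, row FFT, transpose):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((64, 128)) + 1j * rng.standard_normal((64, 128))

col_fft = np.fft.fft(A, axis=0)           # C: FFT down each column
t_r_t   = np.fft.fft(A.T, axis=1).T       # T-R-T: transpose, row FFT, transpose

assert np.allclose(col_fft, t_r_t)
```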
If we use these identities to replace C and c everywhere, the 3 steps C-c-T-C-c become: T-R-T-T-r-T -T- T-R-T-T-r-T = T-R-r-T -T- T-R-r-T = T-R-r-T-R-r-T, since adjacent T-T pairs cancel each other.
One observation is that T-R (or r-T) does not necessarily split into two operations on the GPU. It can be a single operation, as a column-major matrix can be read into row-major order by the threads, so
T-R-r-T-R-r-T = (T-R)-r-(T-R)-(r-T), which means: transpose-read in rows, row FFT, write in rows, transpose-read in rows, row FFT, and transpose-write in rows, where each (x-x) is a single fused operation.
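As a host-side analogy of the fused (T-R) operation: in numpy, A.T is only a strided view of the same buffer, so a row FFT on A.T reads the original columns directly without materializing a transposed copy, much like GPU threads feeding column data straight into a row-FFT kernel:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((64, 128)) + 1j * rng.standard_normal((64, 128))

At = A.T                                   # a strided view: no data is moved
assert At.base is A                        # same underlying storage
assert not At.flags['C_CONTIGUOUS']        # only the strides are swapped

fused = np.fft.fft(At, axis=1)             # (T-R): strided read feeding a row FFT
explicit = np.fft.fft(np.ascontiguousarray(A.T), axis=1)  # explicit transpose, then row FFT
assert np.allclose(fused, explicit)
```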