Update fft.tex
try again (pesky \gls{fft}
rck289 authored Nov 12, 2024
1 parent 3abf8e8 commit 0d7d0ae
Showing 1 changed file with 5 additions and 5 deletions.
fft.tex: 10 changes (5 additions & 5 deletions)
@@ -314,15 +314,15 @@ \section{Background}
In a 32-point \gls{fft}, index 1 is swapped with which index? Which index is index 2 swapped with?
\end{exercise}

- This completes our mathematical treatment of the \gls{fft}. There are plenty of more details about the \gls{fft} and how to optimize it. We may have spent too much time already discussing the finer details of the \gls}; this is a book on parallel programming for FPGAs, not digital signal processing. This highlights an integral part of creating an optimum hardware implementation -- the designer must understand the algorithm under development. Without that, it isn't easy to create an optimized implementation. The following section deals with creating a good \gls{fft} implementation.
+ This completes our mathematical treatment of the \gls{fft}. There are plenty more details about the \gls{fft} and how to optimize it. We may have spent too much time already discussing the finer details of the \gls{fft}; this is a book on parallel programming for FPGAs, not digital signal processing. This highlights an integral part of creating an optimal hardware implementation -- the designer must understand the algorithm under development. Without that understanding, it is difficult to create an optimized implementation. The following section deals with creating a good \gls{fft} implementation.

\section{Baseline Implementation}

In the remainder of this chapter, we discuss different methods to implement the Cooley-Tukey \gls{fft} \cite{cooley65} algorithm using the \VHLS tool. This is the same algorithm that we described in the previous section. We start with a standard version of the code and then describe how to restructure it to achieve a better hardware design.

When performed sequentially, the $\mathcal{O}(n \log n)$ operations in the \gls{fft} require $\mathcal{O}(n \log n)$ time steps. A parallel implementation will perform some portion of the \gls{fft} in parallel. One common way of parallelizing the \gls{fft} is to organize the computation into $\log n$ stages, as shown in Figure \ref{fig:fftstages}. The operations in each stage depend on the previous stage's operations, naturally leading to pipelining across the tasks. Such an architecture allows $\log n$ \glspl{fft} to be computed simultaneously with a task interval determined by the architecture of each stage. We discuss task pipelining using the \lstinline|dataflow| directive in Section \ref{sec:fft_task_pipelining}.

- Each stage in the \gls{fft} also contains significant parallelism since each butterfly computation is independent of other butterfly computations in the same stage. In the limit, performing $n/2$ butterfly computations every clock cycle with a Task Interval of 1 can allow the entire stage to be computed with a Task Interval of 1. When combined with a dataflow architecture, the parallelism in the \gls{fft} algorithm can be exploited. Note, however, that although such an architecture can be constructed, it is seldom used except for very small signals, since an entire new block of \lstinline|SIZE| samples must be provided every clock cycle to keep the pipeline fully utilized. For instance, a 1024-point \gls{fft} of complex 32-bit floating point values, running at 250 MHz would require 1024 \text{points}*(8 {bytes}/{point})*250*$10^9$ Hz = 1Terabyte/second of data into the FPGA. In practice, a designer must match the computation architecture to the system's required data rate.
+ Each stage in the \gls{fft} also contains significant parallelism since each butterfly computation is independent of other butterfly computations in the same stage. In the limit, performing $n/2$ butterfly computations every clock cycle with a Task Interval of 1 can allow the entire stage to be computed with a Task Interval of 1. The parallelism in the \gls{fft} algorithm can be exploited when combined with a dataflow architecture. Note, however, that although such an architecture can be constructed, it is seldom used except for very small signals since an entire new block of \lstinline|SIZE| samples must be provided every clock cycle to keep the pipeline fully utilized. For instance, a 1024-point \gls{fft} of complex 32-bit floating-point values running at 250 MHz would require $1024 \text{ points} \times 8 \text{ bytes/point} \times 250 \times 10^{6} \text{ transforms/second} \approx 2$ terabytes/second of data into the FPGA. A designer must match the computation architecture to the system's required data rate.

\begin{exercise}
Assuming a clock rate of 250 MHz and one sample received every clock cycle, approximately how many butterfly computations must be implemented to process every sample with a 1024-point \gls{fft}? What about for a 16384-point \gls{fft}?
@@ -331,10 +331,10 @@ \section{Baseline Implementation}
In the remainder of this section, we describe the optimization of an \gls{fft} with the function prototype \lstinline|void fft(DTYPE X_R[SIZE], DTYPE X_I[SIZE])| where \lstinline|DTYPE| is a user-customizable data type for the representation of the input data. This may be \lstinline|int|, \lstinline|float|, or a fixed-point type. For example, \lstinline|#define DTYPE int| defines \lstinline|DTYPE| as an \lstinline|int|. Note that we choose to implement the real and imaginary parts of the complex numbers in two separate arrays. The \lstinline|X_R| array holds the real input values, and the \lstinline|X_I| array holds the imaginary values. \lstinline|X_R[i]| and \lstinline|X_I[i]| hold the $i$th complex number in separate real and imaginary parts.

\begin{aside}
- We describe one change in the \gls{fft} implementation in this section. Here, we perform an \gls{fft} on complex numbers. The previous section uses only real numbers. While this may seem like a major change, the core ideas remain unchanged. The only differences are that the data has two values (corresponding to the real and imaginary part of the complex number), and the operations (add, multiply, etc.) are complex operations.
+ In this section, we describe one change in the \gls{fft} implementation. Here, we perform an \gls{fft} on complex numbers. The previous section uses only real numbers. While this may seem like a major change, the core ideas remain unchanged. The only differences are that the data has two values (corresponding to the real and imaginary part of the complex number), and the operations (add, multiply, etc.) are complex operations.
\end{aside}

- This function prototype forces an in-place implementation. The output data is stored in the same array as the input data. This eliminates the need for additional arrays for the output data, reducing the amount of memory required for the implementation. However, this may limit the performance since we must read the input data and write the output data to the same arrays. Using separate arrays for the output data is reasonable if it can increase the performance. There is always a tradeoff between resource usage and performance; the same is true here. The best implementation depends upon the application requirements (e.g., high throughput, low power, size of FPGA, size of the \gls{fft}, etc.).
+ This function prototype forces an in-place implementation. The output data is stored in the same array as the input data. This eliminates the need for additional arrays for the output data, reducing the memory required for the implementation. However, this may limit the performance since we must read the input data and write the output data to the same arrays. Using separate arrays for the output data is reasonable if it can increase the performance. There is always a tradeoff between resource usage and performance; the same is true here. The best implementation depends upon the application requirements (e.g., high throughput, low power, size of FPGA, size of the \gls{fft}, etc.).

%\section{Initial ``Software'' \gls{fft} Implementation}

@@ -356,7 +356,7 @@ \section{Baseline Implementation}

The remaining operations in \lstinline|dft_loop| perform multiplication by the twiddle factor and an addition or subtraction operation. The variables \lstinline|temp_R| and \lstinline|temp_I| hold the real and imaginary portions of the data after multiplication by the twiddle factor $W$. The variables \lstinline|c| and \lstinline|s| are $W$'s real and imaginary parts, calculated using the \lstinline|sin()| and \lstinline|cos()| built-in functions. We could also use a CORDIC, such as the one developed in Chapter \ref{chapter:cordic}, to gain more control over the implementation. Twiddle factors are also commonly precomputed and stored in on-chip memory for moderate array sizes. Lastly, elements of the \lstinline|X_R[]| and \lstinline|X_I[]| arrays are updated with the result of the butterfly computation.

- \lstinline|dft_loop| and \lstinline|butterfly_loop| each executes a different number of times depending upon the stage. However, the total number of times the body of \lstinline|dft_loop| is executed in one stage is constant. The number of iterations for the \lstinline|butterfly for| loop depends upon the number of unique $W$ twiddle factors in that stage. Referring again to Figure \ref{fig:8ptFFT}, we can see that Stage 1 uses only one twiddle factor, in this case $W_8^0$. Stage 2 uses two unique twiddle factors, and Stage 3 uses four different $W$ values. Thus, \lstinline|butterfly_loop| has only one iteration in Stage 1, 2 iterations in Stage 2, and four iterations in Stage 3. Similarly, the number of iterations of \lstinline|dft_loop| changes. It iterates four times for an 8-point \gls{fft} in Stage 1, two times in Stage 2, and only once in Stage 3. However, in every stage, the body of \lstinline|dft_loop| is executed the same number of times in total, executing four butterfly operations for each stage an 8-point \gls{fft}.
+ \lstinline|dft_loop| and \lstinline|butterfly_loop| each executes a different number of times depending upon the stage. However, the total number of times the body of \lstinline|dft_loop| is executed in one stage is constant. The number of iterations for the \lstinline|butterfly for| loop depends upon the number of unique $W$ twiddle factors in that stage. Referring again to Figure \ref{fig:8ptFFT}, we can see that Stage 1 uses only one twiddle factor, in this case $W_8^0$. Stage 2 uses two unique twiddle factors, and Stage 3 uses four different $W$ values. Thus, \lstinline|butterfly_loop| has only one iteration in Stage 1, 2 iterations in Stage 2, and four iterations in Stage 3. Similarly, the number of iterations of \lstinline|dft_loop| changes. It iterates four times for an 8-point \gls{fft} in Stage 1, two times in Stage 2, and only once in Stage 3. However, in every stage, the body of \lstinline|dft_loop| is executed the same number of times, performing four butterfly operations for each stage of an 8-point \gls{fft}.

\begin{aside}
\VHLS performs significant static analysis on each synthesized function, including computing bounds on the number of times each loop can execute. This information comes from many sources, including variable bitwidths, ranges, and \lstinline|assert()| functions in the code. When combined with the loop II, \VHLS can compute bounds on the latency or interval of the \gls{fft} function. In some cases (usually when loop bounds are variable or contain conditional constructs), the tool cannot compute the latency or interval of the code and returns `?'. When synthesizing the code in Figure \ref{fig:fft_sw}, \VHLS may not be able to determine the number of times that \lstinline|butterfly_loop| and \lstinline|dft_loop| iterate because these loops have variable bounds.

