A machine model is fundamental to reasoning about algorithms without considering the underlying hardware. For algorithms that target parallel computers we use PRAM (Parallel Random Access Machine) models.
PRAMs are classified based on their read/write abilities:
- exclusive read: all processors can simultaneously read from distinct memory locations
- exclusive write: all processors can simultaneously write to distinct memory locations
- concurrent read: all processors can simultaneously read from any memory location
- concurrent write: all processors can simultaneously write to any memory location. Different policies can resolve conflicting writes in this case:
  - priority CW: processors have different priorities, and the write of the highest-priority processor succeeds
  - common CW: the writes complete iff the written values are all equal
  - random CW: one of the conflicting writes, chosen arbitrarily, succeeds
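As an illustration (not part of any real PRAM API; the struct and function names are invented here), the following C sketch shows how the three write policies could resolve a set of conflicting writes to the same memory cell:

```c
#include <stdio.h>

/* A pending write: which processor issued it and what value it wants to store. */
typedef struct { int proc; int value; } write_req;

/* Priority CW: the lowest processor id (highest priority) wins. */
int resolve_priority(const write_req *w, int n) {
    int best = 0;
    for (int i = 1; i < n; i++)
        if (w[i].proc < w[best].proc) best = i;
    return w[best].value;
}

/* Common CW: the write succeeds only if every processor writes the same value. */
int resolve_common(const write_req *w, int n, int *ok) {
    *ok = 1;
    for (int i = 1; i < n; i++)
        if (w[i].value != w[0].value) { *ok = 0; return 0; }
    return w[0].value;
}

/* Random/arbitrary CW: any one of the conflicting writes may win. */
int resolve_arbitrary(const write_req *w, int n) {
    return w[n / 2].value;   /* pick one arbitrarily */
}

int main(void) {
    write_req w[] = { {2, 7}, {0, 7}, {5, 7} };
    int ok;
    int common = resolve_common(w, 3, &ok);
    printf("priority:  %d\n", resolve_priority(w, 3));
    printf("common:    %d (valid=%d)\n", common, ok);
    printf("arbitrary: %d\n", resolve_arbitrary(w, 3));
    return 0;
}
```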
Some definitions:
- $T^*(n)$ is the time to solve the problem using the best sequential algorithm
- $T_p(n)$ is the time to solve the problem using $p$ processors
- $S_p(n)=\frac{T^*(n)}{T_p(n)}$ is the speedup on $p$ processors
- $E_p(n)=\frac{T_1(n)}{p\,T_p(n)}$ is the efficiency
- $T_{\infty}(n)$ is the shortest run time with any number of processors $p$
- $C(n)=P(n) \cdot T(n)$ is the cost, which depends on the number of processors and the time
- $W(n)$ is the work, i.e. the total number of operations
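As a quick worked example with made-up numbers: suppose the best sequential algorithm takes $T^*(n)=100$ time units, and the parallel algorithm takes $T_4(n)=30$ time units on $p=4$ processors (with $T_1(n)\approx T^*(n)$). Then

$$S_4(n)=\frac{100}{30}\approx 3.3,\qquad E_4(n)=\frac{100}{4\cdot 30}\approx 0.83,\qquad C(n)=4\cdot 30=120 .$$

The cost (120) exceeding $T^*(n)=100$ means the parallel algorithm consumes more total processor time than the sequential one.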
There are also some variants of the PRAM model, for example:
- bounded number of shared memory cells
- bounded number of processors
- bounded size of a machine word
- handling conflicts over shared memory cells
Any problem that can be solved by a
Parallelization of a sum of vector elements with the naïve algorithm can be performed with
It is embarrassingly parallel because there is no cross-dependence (just the concurrent read over the vector).
The matrix
This algorithm is characterized by concurrent reads but only exclusive writes, so it runs on a CREW PRAM. Let
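A minimal C sketch of this CREW pattern, assuming the example is square matrix multiplication with one conceptual processor per output element; rows and columns of the inputs are read by several "processors" at once, while each output cell has exactly one writer:

```c
#include <stdio.h>

#define N 3

/* C = A * B with one conceptual PRAM processor per (i, j) pair:
 * rows of A and columns of B are read by several processors at once
 * (concurrent read), but each C[i][j] has a single writer (exclusive write). */
void matmul(const double A[N][N], const double B[N][N], double C[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {       /* "processor (i, j)" */
            double acc = 0.0;
            for (int k = 0; k < N; k++)
                acc += A[i][k] * B[k][j];
            C[i][j] = acc;                  /* the only write to this cell */
        }
}

int main(void) {
    double A[N][N] = {{1,2,3},{4,5,6},{7,8,9}};
    double I[N][N] = {{1,0,0},{0,1,0},{0,0,1}};
    double C[N][N];
    matmul(A, I, C);
    printf("%.0f %.0f %.0f\n", C[0][0], C[1][1], C[2][2]);  /* 1 5 9 */
    return 0;
}
```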
The previous PRAM algorithms perform the same amount of work as a single processor would, just faster thanks to parallelization. The prefix sum problem is basically the same as the sum of the vector elements, but it exploits processors that would otherwise sit idle.
The idea is to do more work in the same time by taking advantage of the processors that are idle during the sum: basically all the processors are used all the time. Efficiency is
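A minimal C sketch of this idea, assuming the prefix-sum algorithm meant here is the classic doubling scheme in which, at step $d$, every element adds the value $2^{d-1}$ positions to its left (each iteration of the inner loop stands for one conceptual processor):

```c
#include <stdio.h>
#include <string.h>

#define N 8

/* Inclusive prefix sum by repeated doubling: O(log N) parallel steps,
 * with one conceptual processor per element at every step. */
void prefix_sum(int x[N]) {
    int tmp[N];
    for (int d = 1; d < N; d *= 2) {           /* one PRAM time step per iteration */
        memcpy(tmp, x, sizeof(tmp));           /* read phase: everyone reads old values */
        for (int i = d; i < N; i++)            /* "processor i" */
            x[i] = tmp[i] + tmp[i - d];        /* write phase: exclusive writes */
    }
}

int main(void) {
    int x[N] = {3, 1, 4, 1, 5, 9, 2, 6};
    prefix_sum(x);
    for (int i = 0; i < N; i++) printf("%d ", x[i]);  /* 3 4 8 9 14 23 25 31 */
    printf("\n");
    return 0;
}
```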
Example of CRCW where each processor is assigned to a product.
Gene Amdahl objected to parallelism, arguing that a computation consists of interleaved segments of 2 types:
- serial segments that cannot be parallelized
- parallelizable segments
The law is 'pessimistic' since, if the parallelizable part is a fixed fraction of the computation, the achievable speedup stays bounded no matter how many processors are used.
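In formula form (writing $f$ for the parallelizable fraction and $p$ for the number of processors; these symbols are introduced here for illustration):

$$S_p=\frac{1}{(1-f)+\frac{f}{p}}\le\frac{1}{1-f}.$$

For instance, with $f=0.9$ the speedup can never exceed $10$, no matter how many processors are added.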
"We feel that it is important for the computing research community to overcome the "mental block" against massive parallelism imposed by a misuse of amdahl's speedup formula."
The key points of Gustafson are that the serial portion of a program does not grow with the problem size: when larger problems are run on more processors, the parallelizable work grows while the serial part stays roughly constant, so the achievable speedup scales with the number of processors.
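In formula form (with $s$ the serial fraction of the execution time measured on the parallel machine and $p$ the number of processors; symbols again introduced here for illustration), Gustafson's scaled speedup is

$$S_{\text{scaled}}(p)=s+p\,(1-s)=p-(p-1)\,s,$$

which grows linearly with $p$ instead of saturating at a constant.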
ISPC is a compiler for a variant of the C language that focuses on accelerating applications according to the SPMD (single program, multiple data) paradigm. It parallelizes by mapping program instances onto the lanes of vector units (SSE and AVX on x86), with targets for ARM and GPUs as well.
The documentation for ISPC can be found here: https://ispc.github.io/ispc.html .
When a C/C++ function calls an ISPC function, the execution model instantly switches from a serial model to a parallel one, where a set of program instances called a gang runs in parallel. The parallelization is transparent to the OS and is managed entirely inside the program. Unless otherwise specified, variables are local to each program instance inside a gang. This is memory-inefficient, so whenever possible variables should be given the uniform attribute to signal that they are shared among all instances of the gang. This also opens the door to issues arising from concurrent accesses to the same uniform variable.
Each program instance in a gang knows the gang's size and its own index within the gang: the gang's size is stored in the programCount variable, while the instance's index is stored in the programIndex variable. They can be used to distribute the computation over the gang members by manually assigning the data each one should work on.
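As a sketch (the function name and signature are invented for illustration, not taken from the ISPC documentation), an ISPC function that spreads the elements of an array over the gang using programIndex and programCount could look like this:

```c
// scale.ispc -- hypothetical example
export void scale(uniform float vin[], uniform float vout[],
                  uniform int count, uniform float factor) {
    // Each program instance starts at its own offset (programIndex) and
    // strides by the gang size (programCount), so the gang covers the
    // whole array without two instances touching the same element.
    for (uniform int base = 0; base < count; base += programCount) {
        int i = base + programIndex;   // varying: different in each instance
        if (i < count)
            vout[i] = vin[i] * factor;
    }
}
```

A C/C++ caller compiled against the header that ispc generates can then invoke scale() like an ordinary function; inside it, the gang of program instances executes the loop in lockstep on the SIMD lanes.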