- An easy, portable, and scalable way to parallelize applications for many cores
- A multi-threaded, shared-memory model (like pthreads)
- A standard API
- OpenMP pragmas are supported by the major C/C++ and Fortran compilers (gcc, icc, etc.)
Many good tutorials are available on-line:
Naive implementation
#define N (100)

int main(int argc, char *argv[])
{
    int idx;
    float a[N], b[N], c[N];

    for (idx = 0; idx < N; ++idx)
    {
        a[idx] = b[idx] = 1.0;
    }

    for (idx = 0; idx < N; ++idx)
    {
        c[idx] = a[idx] + b[idx];
    }
    return 0;
}
OpenMP implementation
#include <omp.h>

#define N (100)

int main(int argc, char *argv[])
{
    int idx;
    float a[N], b[N], c[N];

    #pragma omp parallel for
    for (idx = 0; idx < N; ++idx)
    {
        a[idx] = b[idx] = 1.0;
    }

    #pragma omp parallel for
    for (idx = 0; idx < N; ++idx)
    {
        c[idx] = a[idx] + b[idx];
    }
    return 0;
}
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N (100)

int main(int argc, char *argv[])
{
    int nthreads, tid, idx;
    float a[N], b[N], c[N];

    /* omp_get_num_threads() returns 1 outside a parallel region;
       use omp_get_max_threads() to see how many threads will be used. */
    nthreads = omp_get_max_threads();
    printf("Number of threads = %d\n", nthreads);

    #pragma omp parallel for
    for (idx = 0; idx < N; ++idx)
    {
        a[idx] = b[idx] = 1.0;
    }

    /* tid must be private, or all threads race on the same variable. */
    #pragma omp parallel for private(tid)
    for (idx = 0; idx < N; ++idx)
    {
        c[idx] = a[idx] + b[idx];
        tid = omp_get_thread_num();
        printf("Thread %d: c[%d]=%f\n", tid, idx, c[idx]);
    }
    return 0;
}
You need to add the -fopenmp flag:
# compile using gcc
gcc -fopenmp omp_vecadd.c -o vecadd
# compile using icc (newer versions use -qopenmp instead of the deprecated -openmp)
icc -qopenmp omp_vecadd.c -o vecadd
Control the number of threads by setting an environment variable on the command line:
export OMP_NUM_THREADS=8
- Implement:
- vector dot-product: c = <x,y>
- matrix-matrix multiply
- 2D matrix convolution
- Add OpenMP support to the ReLU and max-pooling layers
Synchronization and critical sections:
- Accumulate into a private partial result and combine it in a critical section - this avoids races and the false sharing caused by threads updating adjacent elements of a shared array
- BUT don't put critical sections inside tight loops - doing so serializes execution
Improve performance for deep learning frameworks on CPU
Tim Mattson's (Intel) Introduction to OpenMP video tutorial is now available.
Outline:
- Module 1: Introduction to parallel programming
- Module 2: The boring bits: Using an OpenMP compiler (hello world)
- Discussion 1: Hello world and how threads work
- Module 3: Creating Threads (the Pi program)
- Discussion 2: The simple Pi program and why it sucks
- Module 4: Synchronization (Pi program revisited)
- Discussion 3: Synchronization overhead and eliminating false sharing
- Module 5: Parallel Loops (making the Pi program simple)
- Discussion 4: Pi program wrap-up
- Module 6: Synchronize single masters and stuff
- Module 7: Data environment
- Discussion 5: Debugging OpenMP programs
- Module 8: Skills practice … linked lists and OpenMP
- Discussion 6: Different ways to traverse linked lists
- Module 9: Tasks (linked lists the easy way)
- Discussion 7: Understanding Tasks
- Module 10: The scary stuff … Memory model, atomics, and flush (pairwise synch).
- Discussion 8: The pitfalls of pairwise synchronization
- Module 11: Threadprivate Data and how to support libraries (Pi again)
- Discussion 9: Random number generators
Thanks go to the University Program Office at Intel for making this tutorial available.
Author: Blaise Barney, Lawrence Livermore National Laboratory
https://goulassoup.wordpress.com/2011/10/28/openmp-tutorial/