The first three assignments use the AVX-512 SIMD instruction set for vectorization, OpenMP for multithreading, prefetching, locality of reference, thread pinning, loop unrolling, fused multiply-add, blocking, and mmap.
The last assignment was done using CUDA.
Run: ./runner_script.sh
Matrix-vector multiplication: optimizations are described in the report.
2D image convolution: optimized down to 1.7 s. Roofline analysis (Intel Advisor) gave 1.5 s.
Matrix-matrix multiplication: optimized from 10 s (naive: A * B transpose) to 52 ms (the two best versions are compared in main.cpp: 170 ms vs 52 ms).
CUDA convolution: 6777 ms on a 4096x4096 matrix with a 3x3 kernel.
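
To illustrate the blocking and fused multiply-add ideas mentioned for the matrix-matrix multiplication, here is a minimal sketch in portable C++. It is not the code from main.cpp: the function names `matmul_naive_bt` and `matmul_blocked_bt`, the `block` parameter, and the square row-major layout are assumptions made for this example. It also leaves out AVX-512 intrinsics, OpenMP, and prefetching, so it shows only the loop structure, not the measured speedup.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Naive reference: C = A * B, where B is passed already transposed (Bt).
// All matrices are n x n and row-major, so both operands are read row-wise.
void matmul_naive_bt(const std::vector<double>& A, const std::vector<double>& Bt,
                     std::vector<double>& C, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (std::size_t k = 0; k < n; ++k) {
                sum += A[i * n + k] * Bt[j * n + k];
            }
            C[i * n + j] = sum;
        }
    }
}

// Blocked (tiled) variant: the same product, computed tile by tile so that
// the parts of A and Bt being reused are more likely to stay in cache.
// `block` is a hypothetical tuning parameter; a good value depends on the
// cache sizes of the target machine.
void matmul_blocked_bt(const std::vector<double>& A, const std::vector<double>& Bt,
                       std::vector<double>& C, std::size_t n, std::size_t block) {
    std::fill(C.begin(), C.end(), 0.0);
    for (std::size_t ii = 0; ii < n; ii += block) {
        const std::size_t i_end = std::min(ii + block, n);
        for (std::size_t jj = 0; jj < n; jj += block) {
            const std::size_t j_end = std::min(jj + block, n);
            for (std::size_t kk = 0; kk < n; kk += block) {
                const std::size_t k_end = std::min(kk + block, n);
                for (std::size_t i = ii; i < i_end; ++i) {
                    for (std::size_t j = jj; j < j_end; ++j) {
                        // Continue the partial sum left by earlier k-tiles.
                        double sum = C[i * n + j];
                        for (std::size_t k = kk; k < k_end; ++k) {
                            // std::fma computes a * b + c with a single rounding.
                            sum = std::fma(A[i * n + k], Bt[j * n + k], sum);
                        }
                        C[i * n + j] = sum;
                    }
                }
            }
        }
    }
}
```

Both functions expect `C` to be pre-sized to `n * n` elements. In a sketch like this, the outermost `ii` loop is the natural candidate for an OpenMP parallel loop, and the innermost `k` loop for SIMD vectorization, since both rows are read contiguously.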