The first three assignments use the AVX-512 SIMD instruction set for vectorization, OpenMP for multithreading, prefetching, locality of reference, thread pinning, loop unrolling, fused multiply-add, blocking, and mmap.
The last assignment was done using CUDA.
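Two of the techniques listed above, loop unrolling and fused multiply-add, can be shown on a plain dot product. This is a minimal sketch and not code from this repository: the function name `dot_unrolled_fma` is hypothetical, and it uses portable `std::fma` rather than AVX-512 intrinsics, so it runs on any machine. Whether the compiler maps each call to a hardware FMA instruction depends on the target flags.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the repository): dot product unrolled by 4.
// Four independent accumulators break the dependency chain between
// iterations, and std::fma computes a*b + acc with a single rounding.
double dot_unrolled_fma(const std::vector<double>& a,
                        const std::vector<double>& b) {
    const std::size_t n = a.size() < b.size() ? a.size() : b.size();
    double acc0 = 0.0, acc1 = 0.0, acc2 = 0.0, acc3 = 0.0;

    std::size_t i = 0;
    // Unrolled main loop: four elements per iteration.
    for (; i + 4 <= n; i += 4) {
        acc0 = std::fma(a[i],     b[i],     acc0);
        acc1 = std::fma(a[i + 1], b[i + 1], acc1);
        acc2 = std::fma(a[i + 2], b[i + 2], acc2);
        acc3 = std::fma(a[i + 3], b[i + 3], acc3);
    }

    double sum = (acc0 + acc1) + (acc2 + acc3);
    // Remainder loop for the last n % 4 elements.
    for (; i < n; ++i) {
        sum = std::fma(a[i], b[i], sum);
    }
    return sum;
}
```

Calling `dot_unrolled_fma(x, y)` on two vectors returns their dot product over the shorter length. The separate accumulators change the order of the floating-point additions, so results can differ in the last bits from a simple sequential loop.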
Run: ./runner_script.sh
Matrix-vector: optimizations are described in the report.
2D image convolution: optimized down to 1.7 s. Roofline analysis (Intel Advisor) gave 1.5 s.
Matrix-matrix multiplication: optimized from 10 s (naive: A * B transpose) to 52 ms (main.cpp compares the two best versions: 170 ms vs 52 ms).
CUDA convolution: 6777 ms on a 4096x4096 matrix with a 3x3 kernel.
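For the matrix-matrix multiplication entry above, the sketch below contrasts a transpose-based baseline with a blocked (tiled) version. It is an illustration under assumptions, not the code in main.cpp: the names `Matrix`, `matmul_transposed`, and `matmul_blocked`, the square row-major layout, and the default block size of 64 are all hypothetical, and it leaves out the AVX-512 and OpenMP parts.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Assumption: square n x n matrices stored row-major in a flat vector.
using Matrix = std::vector<float>;

// Baseline: C = A * B computed against an explicit transpose of B,
// so the inner loop reads both operands with unit stride.
Matrix matmul_transposed(const Matrix& A, const Matrix& B, std::size_t n) {
    Matrix Bt(n * n);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            Bt[j * n + i] = B[i * n + j];

    Matrix C(n * n, 0.0f);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j) {
            float sum = 0.0f;
            for (std::size_t k = 0; k < n; ++k)
                sum += A[i * n + k] * Bt[j * n + k];
            C[i * n + j] = sum;
        }
    return C;
}

// Blocked version: works on block x block tiles so the tiles of A, B and C
// touched by the inner loops are reused while they are still in cache.
Matrix matmul_blocked(const Matrix& A, const Matrix& B, std::size_t n,
                      std::size_t block = 64) {
    if (block == 0) block = 1;  // guard against a non-advancing loop
    Matrix C(n * n, 0.0f);
    for (std::size_t ii = 0; ii < n; ii += block)
        for (std::size_t kk = 0; kk < n; kk += block)
            for (std::size_t jj = 0; jj < n; jj += block) {
                const std::size_t i_end = std::min(ii + block, n);
                const std::size_t k_end = std::min(kk + block, n);
                const std::size_t j_end = std::min(jj + block, n);
                for (std::size_t i = ii; i < i_end; ++i)
                    for (std::size_t k = kk; k < k_end; ++k) {
                        const float a = A[i * n + k];
                        // Unit-stride inner loop over a row of the B tile.
                        for (std::size_t j = jj; j < j_end; ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
            }
    return C;
}
```

Both functions take the same inputs, so `matmul_blocked(A, B, n)` can be checked against `matmul_transposed(A, B, n)` on small matrices before timing it. A suitable block size depends on the cache sizes of the target machine and is worth tuning by measurement.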
codepk37/SPP
Software Programming for Performance is about optimizing code to its fullest potential.