Skip to content
/ cuda Public

An implementation of parallel exclusive scan in CUDA

Notifications You must be signed in to change notification settings

mattdean1/cuda

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

26 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Parallel Prefix Sum (Scan) with CUDA

My implementation of parallel exclusive scan in CUDA, following this NVIDIA paper.

Parallel prefix sum, also known as parallel Scan, is a useful building block for many parallel algorithms including sorting and building data structures. In this document we introduce Scan and describe step-by-step how it can be implemented efficiently in NVIDIA CUDA. We start with a basic naïve algorithm and proceed through more advanced techniques to obtain best performance. We then explain how to scan arrays of arbitrary size that cannot be processed with a single block of threads.

This implementation can handle very large arbitrary length vectors thanks to the recursively defined scan function.

Performance is increased with a memory-bank conflict avoidance optimization (BCAO).


See the timings for a performance comparison between:

  1. Sequential scan run on the CPU
  2. Parallel scan run on the GPU
  3. Parallel scan with BCAO

For a vector of 10 million entries:

  CPU      : 20749 ms
  GPU      : 7.860768 ms
  GPU BCAO : 4.304064 ms

Intel Core i5-4670k @ 3.4GHz, NVIDIA GeForce GTX 760

About

An implementation of parallel exclusive scan in CUDA

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published