Parallel Prefix Sum (Scan) with CUDA

My implementation of parallel exclusive scan in CUDA, following this NVIDIA paper.

Parallel prefix sum, also known as parallel Scan, is a useful building block for many parallel algorithms including sorting and building data structures. In this document we introduce Scan and describe step-by-step how it can be implemented efficiently in NVIDIA CUDA. We start with a basic naïve algorithm and proceed through more advanced techniques to obtain best performance. We then explain how to scan arrays of arbitrary size that cannot be processed with a single block of threads.

This implementation can handle very large arbitrary length vectors thanks to the recursively defined scan function.

Performance is increased with a memory-bank conflict avoidance optimization (BCAO).

See the timings for a performance comparison between:

Sequential scan run on the CPU
Parallel scan run on the GPU
Parallel scan with BCAO

For a vector of 10 million entries:

  CPU      : 20749 ms
  GPU      : 7.860768 ms
  GPU BCAO : 4.304064 ms

Intel Core i5-4670k @ 3.4GHz, NVIDIA GeForce GTX 760

Name		Name	Last commit message	Last commit date
Latest commit History 26 Commits
parallel-scan		parallel-scan
.gitattributes		.gitattributes
.gitignore		.gitignore
README.md		README.md
cuda.sln		cuda.sln

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Parallel Prefix Sum (Scan) with CUDA

About

Releases

Packages

Languages

mattdean1/cuda

Folders and files

Latest commit

History

Repository files navigation

Parallel Prefix Sum (Scan) with CUDA

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages