LU decomposition using CUDA

A parallel implementation of LU decomposition
There are two versions: 1) using global memory alone, and 2) using shared memory for the pivot row.
In both implementations, a kernel launched with a single thread scales the pivot row.
Global memory: Blocks with one thread each are launched for the reduction.
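
The global-memory scheme described above might look roughly like the sketch below. It is not the code from this repository. It assumes a row-major `n x n` matrix of `float`, no row pivoting (so every pivot must be non-zero), and that each one-thread block reduces one row below the pivot. The names `scalePivotRow`, `reduceRowsGlobal`, and `luDecomposeGlobal` are hypothetical.

```cuda
#include <cuda_runtime.h>

// Scale the part of pivot row k that lies right of the diagonal.
// Launched as <<<1, 1>>>, so a single thread walks the row serially.
__global__ void scalePivotRow(float *a, int n, int k) {
    float pivot = a[k * n + k];  // assumed non-zero (no row pivoting)
    for (int j = k + 1; j < n; ++j)
        a[k * n + j] /= pivot;
}

// Reduce the rows below the pivot. Assumption: one block per row and
// one thread per block, reading the pivot row from global memory.
__global__ void reduceRowsGlobal(float *a, int n, int k) {
    int i = k + 1 + blockIdx.x;  // row handled by this block
    if (i >= n) return;
    float factor = a[i * n + k];
    for (int j = k + 1; j < n; ++j)
        a[i * n + j] -= factor * a[k * n + j];
}

// Hypothetical host driver: factors hostA in place, returns false on
// a CUDA error. Launches on the default stream run in issue order.
bool luDecomposeGlobal(float *hostA, int n) {
    float *devA = nullptr;
    size_t bytes = sizeof(float) * n * n;
    if (cudaMalloc(&devA, bytes) != cudaSuccess) return false;
    cudaMemcpy(devA, hostA, bytes, cudaMemcpyHostToDevice);
    for (int k = 0; k < n; ++k) {
        scalePivotRow<<<1, 1>>>(devA, n, k);
        int rows = n - k - 1;  // rows still below the pivot
        if (rows > 0)
            reduceRowsGlobal<<<rows, 1>>>(devA, n, k);
    }
    bool ok = cudaMemcpy(hostA, devA, bytes,
                         cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(devA);
    return ok;
}
```

Under these assumptions the result is stored in place: L occupies the diagonal and everything below it, and U occupies the entries above the diagonal with an implied unit diagonal, because the pivot row is the one being scaled.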
Shared memory: Blocks of a static size are used. The thread with id == 0 copies the pivot row into shared memory, and after that the rest of the threads in the block start reducing.
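
A minimal sketch of the shared-memory scheme follows. It is an illustration under stated assumptions, not the repository's code: the matrix is row-major `float`, there is no row pivoting, each thread reduces one row, and the statically sized shared buffer limits the matrix to `MAX_N` columns. `BLOCK_SIZE`, `MAX_N`, and all function names are hypothetical.

```cuda
#include <cuda_runtime.h>

#define BLOCK_SIZE 64   // assumed static block size
#define MAX_N      1024 // assumed upper bound on n (4 KB of shared memory)

// Single-thread kernel that scales pivot row k right of the diagonal.
__global__ void scalePivotRow(float *a, int n, int k) {
    float pivot = a[k * n + k];  // assumed non-zero (no row pivoting)
    for (int j = k + 1; j < n; ++j)
        a[k * n + j] /= pivot;
}

// Thread 0 of each block stages the pivot row in shared memory; after
// the barrier every thread reduces one row using the cached copy.
__global__ void reduceRowsShared(float *a, int n, int k) {
    __shared__ float pivotRow[MAX_N];
    if (threadIdx.x == 0) {
        for (int j = k + 1; j < n; ++j)
            pivotRow[j] = a[k * n + j];
    }
    __syncthreads();  // reached by all threads before any of them exits

    int i = k + 1 + blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float factor = a[i * n + k];
        for (int j = k + 1; j < n; ++j)
            a[i * n + j] -= factor * pivotRow[j];
    }
}

// Hypothetical host driver: factors hostA in place, returns false if
// n exceeds the shared buffer or a CUDA call fails.
bool luDecomposeShared(float *hostA, int n) {
    if (n > MAX_N) return false;
    float *devA = nullptr;
    size_t bytes = sizeof(float) * n * n;
    if (cudaMalloc(&devA, bytes) != cudaSuccess) return false;
    cudaMemcpy(devA, hostA, bytes, cudaMemcpyHostToDevice);
    for (int k = 0; k < n; ++k) {
        scalePivotRow<<<1, 1>>>(devA, n, k);
        int rows = n - k - 1;  // rows still below the pivot
        if (rows > 0) {
            int blocks = (rows + BLOCK_SIZE - 1) / BLOCK_SIZE;
            reduceRowsShared<<<blocks, BLOCK_SIZE>>>(devA, n, k);
        }
    }
    bool ok = cudaMemcpy(hostA, devA, bytes,
                         cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(devA);
    return ok;
}
```

The barrier sits before the bounds check on purpose: if out-of-range threads returned early, not every thread in the block would reach `__syncthreads()`. Having only thread 0 copy the row keeps the sketch close to the description above; a cooperative copy by all threads in the block would be a natural variation.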

