Gauss-Jordan elimination is a way of computing the inverse of a matrix and solving systems of linear equations. This project implements matrix inversion using Gauss-Jordan elimination in CUDA.
First, generate the build files with CMake:
cmake CMakeLists.txt
Once the Makefile has been generated, compile the project with the following command:
make
After building the project, run the ./GJE binary with the following flags:
./GJE -n <edge_length> [-f <input_matrix_file> | -r <random_uniform_matrix>] -o <calculated_inverse_matrix_path> [-c <execute_on_cpu> | -g <execute_on_gpu>]
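For instance, a GPU run on a random 1024×1024 matrix that writes the result to inverse.txt might look like the line below. This is illustrative only: the placeholders in the usage line read like descriptions of each switch rather than values, so check the argument parser for the exact forms expected by -r, -c, and -g.
./GJE -n 1024 -r -o inverse.txt -g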
The program writes the calculation runtime to stdout (in milliseconds). For example:
calculation time: 120.31(ms)
We use the plain Gauss-Jordan algorithm and port it to the GPU using CUDA.
The algorithm goes as follows (a sketch of these kernels is given after this list):
- One kernel creates the identity matrix next to the input matrix, forming the augmented matrix [A | I].
  1.1. Each block is responsible for setting the necessary 1s in COL_PER_BLK columns (one thread per column), so COL_PER_BLK must not exceed the maximum block size supported by the GPU hardware.
- For each row:
  2.1. One kernel calculates, for each row, the factor by which the current (pivot) row will be scaled before being subtracted from it (parallelism level: each thread is responsible for n/blockDim.x rows and thus calculates n/blockDim.x factors).
  2.2. One kernel subtracts the scaled current row from every other row in the matrix and normalizes the current row.
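Below is a minimal sketch of what these three kernels could look like, written from the description above rather than taken from the project's code. The row-major n×2n augmented-matrix layout, the kernel names, the value of COL_PER_BLK, the single-block launch of kernel 2, and the absence of row pivoting (the sketch assumes no pivot is zero) are all assumptions here.

```cuda
#include <cstddef>

#define COL_PER_BLK 256  // assumed tile width; must not exceed the max block size

// aug is an n x 2n row-major augmented matrix [A | I]; its right half is
// assumed to be zero-initialized (e.g. via cudaMemset) beforehand.
// Kernel 1: each block writes the diagonal 1s of COL_PER_BLK identity
// columns, one thread per column. Launch with ceil(n / COL_PER_BLK) blocks.
__global__ void init_identity(float *aug, int n) {
    int col = blockIdx.x * COL_PER_BLK + threadIdx.x;
    if (threadIdx.x < COL_PER_BLK && col < n)
        aug[(size_t)col * (2 * n) + n + col] = 1.0f;  // element (col, n + col)
}

// Kernel 2: for pivot row `piv`, compute the factor by which every other
// row will be reduced. Launched as a single block; each thread strides
// over the rows, so each handles n/blockDim.x of them. factor[piv] stashes
// the pivot value itself so kernel 3 can read it without a cross-block race.
__global__ void compute_factors(const float *aug, float *factor, int n, int piv) {
    float pivot = aug[(size_t)piv * (2 * n) + piv];
    for (int r = threadIdx.x; r < n; r += blockDim.x)
        factor[r] = (r == piv) ? pivot
                               : aug[(size_t)r * (2 * n) + piv] / pivot;
}

// Kernel 3: subtract factor[r] * (pivot row) from every other row, then
// normalize the pivot row. Each block owns COL_PER_BLK consecutive columns
// of the 2n-wide matrix; each thread strides over n/blockDim.x rows and
// updates the block's columns in each. Launch with ceil(2n / COL_PER_BLK)
// blocks and blockDim.x >= COL_PER_BLK.
__global__ void reduce_rows(float *aug, const float *factor, int n, int piv) {
    int c0 = blockIdx.x * COL_PER_BLK;        // first column this block owns
    for (int r = threadIdx.x; r < n; r += blockDim.x) {
        if (r == piv) continue;               // the pivot row is handled below
        for (int c = c0; c < c0 + COL_PER_BLK && c < 2 * n; ++c)
            aug[(size_t)r * (2 * n) + c] -= factor[r] * aug[(size_t)piv * (2 * n) + c];
    }
    __syncthreads();                          // all pivot-row reads must finish
    int c = c0 + threadIdx.x;                 // normalize, one thread per column
    if (threadIdx.x < COL_PER_BLK && c < 2 * n)
        aug[(size_t)piv * (2 * n) + c] /= factor[piv];
}
```

On the host, init_identity would run once, and compute_factors plus reduce_rows would be launched inside a loop over the n pivot rows; that per-row loop is where the factor of n in the complexity analysis below comes from.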
Given the methodology above, our time complexity works out as follows.
Since the time order of the first kernel on the CPU is $O(n)$, and since we break the execution into $n/COL\_PER\_BLK$ blocks of which only $NUM\_SM$ can run in parallel, its execution time is:
$O\left(\frac{n}{COL\_PER\_BLK \cdot NUM\_SM}\right)$
Kernel 2 has similar characteristics; each of its threads computes $n/blockDim.x$ factors, so its execution time is:
$O\left(\frac{n}{blockDim.x}\right)$
In kernel 3, on the other hand, each thread within a block is responsible for $n/blockDim.x$ rows, and each row takes $COL\_PER\_BLK$ iterations, so the total execution time for each thread is:
$O\left(\frac{n \cdot COL\_PER\_BLK}{blockDim.x}\right)$
There are a total of $n/COL\_PER\_BLK$ blocks, and only $NUM\_SM$ of them can run in parallel, so the total execution time over all block executions is:
$O\left(\frac{n}{COL\_PER\_BLK \cdot NUM\_SM}\right)$
So the total execution time of kernel 3 is:
$O(\mathit{threads}) \cdot O(\mathit{blocks}) = O\left(\frac{n \cdot COL\_PER\_BLK}{blockDim.x}\right) \cdot O\left(\frac{n}{COL\_PER\_BLK \cdot NUM\_SM}\right) = O\left(\frac{n^2}{blockDim.x \cdot NUM\_SM}\right)$
Since kernels 2 and 3 are launched once per row (n times in total), the total execution time is:
$n \cdot O\left(\frac{n^2}{blockDim.x \cdot NUM\_SM}\right)$
which is:
$O\left(\frac{n^3}{blockDim.x \cdot NUM\_SM}\right)$
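To make this bound concrete (illustrative numbers only): with $n = 4096$, $blockDim.x = 1024$, and $NUM\_SM = 46$ (the SM count of the RTX 2080 used in the benchmark below), the parallel bound works out to roughly $4096^3/(1024 \cdot 46) \approx 1.5 \times 10^6$ steps along the critical path, versus $n^3 \approx 6.9 \times 10^{10}$ steps for a sequential Gauss-Jordan run.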
You can run the Python benchmark program with the following command:
sudo python test.py
In this benchmark, we execute the program with random matrices.
After each execution, on both the CPU and the GPU, we capture the computation runtime and the error.
We calculate the matrix error with the 2-norm (Frobenius) method: assuming we have calculated the inverse of matrix A, we have:
$\mathit{error} = \left\| A \cdot A^{-1}_{\mathit{calculated}} - I \right\|_F$
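A minimal host-side sketch of this metric, assuming dense row-major n×n matrices (it mirrors the formula above and is not the benchmark's actual code):

```cuda
#include <cmath>
#include <cstddef>

// Frobenius-norm error of a computed inverse X of A: ||A * X - I||_F.
// A and X are dense n x n row-major matrices in host memory.
double frobenius_error(const float *A, const float *X, int n) {
    double sum = 0.0;
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            double acc = 0.0;                         // (A * X)[i][j]
            for (int k = 0; k < n; ++k)
                acc += (double)A[(size_t)i * n + k] * X[(size_t)k * n + j];
            double r = acc - (i == j ? 1.0 : 0.0);    // subtract the identity
            sum += r * r;
        }
    return std::sqrt(sum);
}
```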
We executed the benchmark on an NVIDIA RTX 2080 and an Intel Core i9-9900K.