Author: Jack Robbins
This project provides 3 C programs that are designed to generate, print and convert binary files containing sequences of 4-byte integers. The binary files are treated as sparse matrices, that is, matrices who contain at least 50% zeroes, making them truly sparse. A full description of each of these utilities, and of sparse matrices in general, is given below.
A sparse matrix is a matrix whose contents are at least 50% 0's. Here is a small 5 X 5 example:
0 | 1 | 0 | 0 | 0 |
---|---|---|---|---|
0 | 0 | 2 | 0 | 1 |
0 | 4 | 0 | 0 | 0 |
12 | 0 | 0 | 0 | 0 |
3 | 0 | 0 | 0 | 0 |
Notice how we only have 6 nonzero values(1, 2, 3, 4 and 12) out of 25 total values, making this matrix truly sparse.
Unlike this small 5 X 5 example, these matrices are often quite large, and therefore storing them efficiently in memory/on disk provides an interesting challenge to solve. Consider a sparse matrix of unsigned integers that is 1000 rows by 1000 columns. If we know that this matrix is sparse, we know that out of the
Sparse matrices appear in several distinct areas of science and mathematics. Machine learning, combinatorics and numerical analysis are a few examples of where the concept of sparse matrices appear.
The C program generate_sparse_mat.c allows users to generate a sparse matrix of any dimension that they choose. The number of rows and columns is provided as user-input to this program upon runtime(see more about running these programs below). Once a user enters the desired dimensions, the program creates a binary file, matrix.bin
, in the current directory. This program guarantees that all binary files generated will be at least 50% zeroes, so that when they are interpreted as sparse matrices, they are truly sparse.
Binary files generated by generate_sparse_mat.c follow these conventions:
- The first 4 bytes of the file are the number of rows the user has entered, and the second 4 bytes are the number of columns, making the first 8 bytes of the file the dimensions of the sparse matrix, and not actual values in the sparse matrix.
- The program will write row * column randomly generated unsigned integers to the file after writing the dimensions in the first 8 bytes. The program was written with unsigned integers in mind, but users could easily choose to interpret each of these 4-byte values as signed integers, floats, etc.
- The file created will always be entitled
matrix.bin
. If no file titledmatrix.bin
in the current directory exists, one will be created. Otherwise, the currentmatrix.bin
file will be overwritten.
Additionally, it is possible to modify the source code to change the density of the sparse matrices that are generated. The constant RATIO
at the top of the file may be changed to any number between 0 and 1 to adjust how dense/sparse the matrix is. The closer RATIO
is to 1, the more dense the matrix will be. RATIO
is set by default to 0.2, which is a safe defualt if users would like a truly sparse matrix.
The C program display_contents.c provides a way for users to "dump out" the contents of the binary files that generate_sparse_mat.c creates. This simple program reads in a binary file that the user gives it, and will go through the file, 4 bytes at a time, and display the unsigned integer value, hexadecimal value, and the char value of each byte for every 4 byte chunk. For example, if this program were to read in the hexadecimal value: 0x33c368e1
, it would display the following:
Integer: Hex: 0x33c368e1 Dec: 868444385
Byte 1: Hex: 0xe1 Char: \225
Byte 2: Hex: 0x68 Char: h
Byte 3: Hex: 0xc3 Char: \195
Byte 4: Hex: 0x33 Char: 3
The program diplays a message in the above format for every 4 bytes in the binary file provided, in sequential order. This program is useful for sanity checks/debugging when working with the sparse matrix binary files generated by create_sparse_mat.c.
Note
If you are more interested in signed integer values, or float values, the source code is very well documented and incredibly simple to modify.
Compressed Sparse Row(CSR) format is an efficient way of storing sparse matrices. As mentioned above, since sparse matrices contain have at least half of their values set to 0, they have the potential to waste kilobytes or megabytes of space in memory or on disk. CSR format provides a way of efficiently storing sparse matrices, while also preserving all of the important information about them.
CSR format works by storing three arrays:
- A
values
array that stores all of the nonzero values - A
column_indices
array that stores the column index of each corresponding value in their respective rows - A
row_start
array that stores the index in the values array at which each row starts. The last value in the row_start array also happens to be the number of nonzero elements that the sparse matrix has
This way of storing sparse matrices is very efficient, as only meaningful data(values and their positions) is stored. It is also very easy to use these three arrays to reconstruct a sparse matrix should the need arise, as all necessary information is provided to do so.
The description of the row_start array in particular can be a little abstract without a real example. To make it more concrete, let's take our example from above and convert it into CSR format:
0 | 1 | 0 | 0 | 0 |
---|---|---|---|---|
0 | 0 | 2 | 0 | 1 |
0 | 4 | 0 | 0 | 0 |
1 | 0 | 0 | 0 | 0 |
3 | 0 | 0 | 0 | 0 |
In compressed row format, this sparse matrix would be stored as:
dimensions 5 5
values = [1, 2, 1, 4, 1, 3]
column_indices = [1, 2, 4, 1, 0, 0]
row_start = [0, 1, 3, 4, 5, 6]
Properly reading CSR format would go as follows: index 0 in row start has a value of 0, meaning that row 0 starts at index 0 in values, and goes up to but does not include index 1 in values. values[0]
is 1, and the corresponding integer in column_indices
at index 0 is 1. This tells
us that the value 1 is the only value in row 0, and it is located at column index 1 in row 0. Knowing this we can reconstruct row 0 as 0 1 0 0 0
. This can be done for each value in the values array to reconstruct the entire sparse matrix.
The C program convert_to_csr.c takes in a binary file as a command line argument, and programmatically converts that file into compressed sparse row format, saving the result into a text file matrix.txt
. It is important
to note that convert_to_csr.c relies on binary sparse matrix files following the convention of having the first 8 bytes in the binary file represent the number of rows and columns in the sparse matrix. Binary files that violate this convention will still be interpreted, just not correctly. The
program itself is very well documented, so I will not reiterate its functionality here. I would encourage anyone interested to view the source code for themselves.
For convenience, a runner script run.sh has been provided that will compile, take input, and run all of the programs in this project. Some examples of how to use it are below.
Note
After downloading run.sh, be sure to grant executable permissions by running chmod +x run.sh
To generate a 5X5 sparse matrix:
example@bash:~$ ./run.sh create_sparse_mat.c
If your program requires input, enter it here: 5 5
To display the contents of the file matrix.bin
in the current directory
example@bash:~$ ./run.sh display_contents.c
If your program requires input, enter it here: matrix.bin
To convert the binary file in matrix.bin
into compressed sparse row format
example@bash:~$ ./run.sh convert_to_csr.c
If your program requires input, enter it here: matrix.bin