Matrix Caching
Memory on a GPU is a much scarcer resource than on a CPU (currently 3-12GB is typical). Not only that, but at least in CUDA, GPU memory allocation through the C API is very expensive: it involves interaction between the CPU and GPU, and has a high, almost-fixed overhead. e.g. a regression algorithm on the GPU (without caching) which allocated 1 MB blocks spent 98% of its time in malloc, and only 2% of its time computing. So even with a custom garbage collector, fully-automatic storage management on the GPU seems very difficult.
Fortunately, in machine learning applications data is often consumed in fixed-size minibatches. The minibatches, and all the intermediate results derived from them, have fixed size from one minibatch update to the next. This suggests that caching could be a very effective strategy for GPU storage management. Caching is enabled globally by setting Mat.useCache=true.
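For example, caching can be switched on at the start of a script or interactive session:
> Mat.useCache = true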
To support caching, every matrix has a unique, 64-bit GUID field which is created when the matrix is created. Whenever an operator or function is applied to matrix arguments, a cache key is created based on the GUIDs of the arguments and the function or operator to be applied. e.g.
> val c = a * b
with FMat arguments a and b will cause a check of a matrix cache with key
(a.GUID, b.GUID, "*".##)
where the third argument is a hash of the string "*" representing the operator. If the cache entry is non-null, it should be a container of appropriate size to hold the result c. If the cache entry is null, then a new matrix of appropriate size is created, and then saved in the cache. This works because the sizes (but not necessarily the contents) of matrices are immutable, and because the size of the result of an operator like * depends only on the sizes of its arguments. Thus the size of the result c must always be the same.
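A minimal sketch of this lookup logic, using a plain Scala HashMap as a stand-in for BIDMat's internal cache (the cache map and the FMat(nrows, ncols) allocation here are illustrative, not the library's actual internals):
> // stand-in for the global cache, keyed by argument GUIDs and an operator hash
> val cache = scala.collection.mutable.HashMap.empty[(Long, Long, Int), FMat]
> val key = (a.GUID, b.GUID, "*".##)
> val c = cache.getOrElseUpdate(key, FMat(a.nrows, b.ncols))  // reuse the container on a hit, allocate and save it on a miss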
As a final postscript, we have found that using caching improves the performance of CPU code as well as GPU code, sometimes significantly. e.g. on laptops with limited memory, caching sometimes doubles or triples performance.
Caching is designed to support algorithms that work with fixed size blocks of (matrix) data. With that constraint, it is quite easy to use but has some hazards which we cover here.
In a functional language, data objects are immutable and computation involves deriving new objects from old using operators and functions. Functional languages are referentially transparent, meaning that <math>A+B</math> always means the same thing in the scope of a particular pair of declarations for <math>A</math> and <math>B</math>. BIDMat, like its parent language Scala, encourages functional programming but is not "pure". Some objects, matrices in particular, are mutable for very good reasons (performance being high among them). Most of the functions and operators in BIDMat can be used functionally, but also support mutation for efficiency. e.g.
> val b = sin(a)
creates a new container for b, while
> sin(a,b)
stores its result into an existing b, mutating b as a result. You can use either approach, but functional style is almost always easier to read and less error-prone. This is true in general, but especially when using caching, because there are subtle interactions that can occur with cached data. e.g.
> val c = a * b        // creates cached c with key (a.GUID, b.GUID, "*".##)
> c(ii) = vv           // a mutating operation on c, bad idea
> val d = (a * b) - x  // (a*b) container in cache, gets rewritten modifying c
That is, there is only one container for a*b, and any changes to it will be overwritten by any later occurrence of that expression. Here c is reset to a*b, erasing its update.
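If you really do need to mutate such a result, one workaround is to take an explicit copy so that the cached container itself is never touched. This is only a sketch, using the copy method listed below:
> val c = (a * b).copy  // c is a private container, independent of the cached (a*b) result
> c(ii) = vv            // safe: later evaluations of (a * b) will not overwrite c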
These problems can only occur when mutating operations are used. With functional steps only, no containers are modified, and results will always be correct. However, there will always be a need to mutate matrix contents. We quickly review mutating operations, and then discuss how to structure code to use mutation safely.
The mutating operations in BIDMat are:
- The copy operator C <-- B
- The direct assignment operator C ~ A * B (see the sketch after these lists)
- Functions with destination arguments sin(A,C)
- Sliced assignment A(II,JJ) = VV for matrices II and JJ
- Element assignment A(i,j) = v, or element/index assignment like A(i,JJ) = VV.
The corresponding non-mutating (functional) operations are:
- val C = B.copy
- val C = A * B
- val C = sin(A)
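As a sketch of the difference (the zeros(...) pre-allocation below is illustrative): the functional form obtains its result container from the cache or allocator, while the direct assignment form writes into a container you already hold:
> val C1 = A * B                    // functional: result container comes from the cache, or is newly allocated
> val C2 = zeros(A.nrows, B.ncols)  // pre-allocate a destination once
> C2 ~ A * B                        // direct assignment: the product is written into C2, no new allocation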
In a piece of code which is executed repeatedly, e.g. a Scala loop, local variables can be reused from one iteration to the next. It would be wasteful to use only functional operations, because each variable in each iteration would require separate storage, even though only the current generation of variables is ever accessed. Such an iteration needs to use both mutating and non-mutating operations, but this can always be done in a way that avoids aliasing. The safe iteration pattern looks like this:
> while (condition) {
>   val newA = fn(A,B,C,...)
>   val newB = fn(A,B,C,...,newA)
>   val newC = fn(A,B,C,...,newA,newB)
>   ....
>   A <-- newA
>   B <-- newB
>   C <-- newC
> }
This requires at most two copies of each local var. With care, you can often reduce the number of copies further by using the direct update operator ~, but this form is always safe and you can use it as a reference when optimizing.
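For instance, here is a minimal sketch of the pattern for a gradient-style update; the names data, target, model, lrate and niter are illustrative and not part of BIDMat:
> var i = 0
> while (i < niter) {
>   val pred     = data * model                       // functional steps: results land in cached containers
>   val resid    = pred - target
>   val newModel = model - (data.t * resid) * lrate   // lrate is a Float scalar in this sketch
>   model <-- newModel                                // copy back into the persistent container
>   i += 1
> }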
Caching relies on matrix variables having the same size from one iteration to the next. Size really means storage. Standard matrix operations fix the size of the output, i.e. operands A and B with given sizes always produce a C of the same size. This applies to unary and binary operators, and even slicing operators. The exceptions are the find, find2 and find3 functions. Their results have different sizes from one iteration to the next, and will not be cached.
This is an inconvenience, but there are usually workarounds. A common reason for using find is to compute a subset of elements on which to perform a sequence of operations. An alternative is to use conditionals to create 0-1 valued flag arrays which are used to select the results of the operation. e.g.
> val (ii, jj, vv) = find3(a)
> a(ii + a.nrows * jj) = fn(vv)
is implemented instead with:
> val iflag = (a != 0)
> val b = fn(a)
> a ~ ((1 - iflag) *@ a) + (iflag *@ b)
For dense matrices, storage is proportional to nrows x ncols. For sparse matrices, storage is proportional to the number of non-zero elements. Caching sparse matrices is still possible, but is a bit more work. We assume the size (storage) of sparse matrices is roughly fixed (e.g. sampled from some fixed probability distribution) from one mini-batch to the next. The caching system makes an initial guess at the storage needed for each sparse matrix (e.g. by reading the first mini-batch), and then allocates storage which is slightly larger. It uses this container until it is found to be too small, and then creates a new one which is larger by a global growth factor Mat.recycleGrow. You shouldn't normally have to be aware of this, but there may be performance issues if you have an unusual datasource, e.g. one containing instances of gradually increasing size.
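If your datasource does produce steadily growing instances, one knob to experiment with is this growth factor. The assignment below is purely illustrative; the factor's default value and exact type are not documented here:
> Mat.recycleGrow = 1.5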