Replies: 9 comments
-
Interesting idea. I think it would make sense for the GPU providers, but not so much for the "CPU" providers - I don't think there is much overhead to pinning a pointer to an array in the managed heap and passing it to the native code. Any thoughts on how the API might look? How would one access the array elements from managed code?
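For reference, the pinning path I mean is essentially the following. This is only a sketch: the library name and entry point are made up for illustration, not the real native provider.

```csharp
using System;
using System.Runtime.InteropServices;

static class PinningSketch
{
    // Illustrative import only; "ExampleNativeProvider" and "example_daxpy" are not real names.
    [DllImport("ExampleNativeProvider", EntryPoint = "example_daxpy")]
    private static extern void ExampleDaxpy(int n, double alpha, IntPtr x, IntPtr y);

    public static void Daxpy(double alpha, double[] x, double[] y)
    {
        // Pin the managed arrays so the GC cannot move them, then hand the raw
        // pointers to native code. No copy is made; the cost is just the pin/unpin.
        var hx = GCHandle.Alloc(x, GCHandleType.Pinned);
        var hy = GCHandle.Alloc(y, GCHandleType.Pinned);
        try
        {
            ExampleDaxpy(x.Length, alpha, hx.AddrOfPinnedObject(), hy.AddrOfPinnedObject());
        }
        finally
        {
            hx.Free();
            hy.Free();
        }
    }
}
```

(For blittable arrays like double[] the P/Invoke marshaller will pin automatically if the parameters are declared as double[], so even the explicit GCHandle is optional.)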
-
There have been discussions before for new additional matrix/vector storage implementations, based on unmanaged or GPU memory (where MatrixStorage essentially holds a pointer instead of some arrays). My conclusion back then was that we'd have to change the native provider interface to operate on MatrixStorage instead of arrays.
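In sketch form, such a storage class would hold a raw pointer rather than a double[]. Everything below is illustrative (names, layout, and it does not derive from the real MatrixStorage base class); it only shows the shape of the idea.

```csharp
using System;
using System.Runtime.InteropServices;

// Requires <AllowUnsafeBlocks>true</AllowUnsafeBlocks> because of the pointer indexer.
public sealed unsafe class UnmanagedDenseStorageSketch : IDisposable
{
    public int RowCount { get; }
    public int ColumnCount { get; }
    public IntPtr Data { get; private set; }   // a pointer instead of a managed array

    public UnmanagedDenseStorageSketch(int rows, int columns)
    {
        RowCount = rows;
        ColumnCount = columns;
        Data = Marshal.AllocHGlobal(sizeof(double) * rows * columns);
    }

    // Column-major layout, matching what LAPACK-style native providers expect.
    public double this[int row, int column]
    {
        get => ((double*)Data)[column * RowCount + row];
        set => ((double*)Data)[column * RowCount + row] = value;
    }

    public void Dispose()
    {
        if (Data != IntPtr.Zero)
        {
            Marshal.FreeHGlobal(Data);
            Data = IntPtr.Zero;
        }
    }
}
```

A native provider operating on such a storage object could take Data directly, with no pinning or copying at the managed/unmanaged boundary.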
-
@cuda I think you might be correct about the CPU providers. For some reason I thought arrays might be getting copied, but it is as you say: the arrays are pinned and then worked on directly by the native code. For the CPU provider I did a bit of testing using

For the GPU I think we'd need some sort of manager class which basically maintains a collection of references to the managed vectors / matrices stored in CPU memory and the corresponding pointers to the vectors / matrices in GPU memory. Then you could have functions which copy the memory back and forth and free the GPU memory when it's not needed. So the basic workflow would be: copy to the GPU, run the operation(s), copy the results back, and free the GPU memory once it's no longer needed.

Most of that could be hidden inside CUDA-specific storage classes and the CUDA provider and / or wrapper, so that the memory is automatically copied back and forth, using "dirty" flags to indicate when to copy. Depending on how it's implemented, either the manager class or just the device pointers would be passed to the native function. A rough sketch of the manager idea is below.

Any thoughts?
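Something along these lines (purely a sketch: the cudaMalloc / cudaMemcpy / cudaFree imports mirror the real CUDA runtime signatures, but the DLL name, the class and everything else is illustrative):

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.InteropServices;
using System.Security;

[SuppressUnmanagedCodeSecurity]
public sealed class GpuMemoryManagerSketch : IDisposable
{
    // The runtime DLL name varies with platform and CUDA version; "cudart64_110" is just an example.
    [DllImport("cudart64_110")] private static extern int cudaMalloc(out IntPtr devPtr, ulong size);
    [DllImport("cudart64_110")] private static extern int cudaFree(IntPtr devPtr);
    [DllImport("cudart64_110")] private static extern int cudaMemcpy(IntPtr dst, IntPtr src, ulong count, int kind);
    private const int HostToDevice = 1, DeviceToHost = 2;

    private sealed class Entry
    {
        public IntPtr Device;
        public bool HostDirty = true;   // managed copy changed since the last upload
        public bool DeviceDirty;        // device copy changed since the last download
    }

    private readonly Dictionary<double[], Entry> _map = new Dictionary<double[], Entry>();

    // Returns the device pointer for a managed array, uploading only when the host copy is dirty.
    public IntPtr ToDevice(double[] host)
    {
        if (!_map.TryGetValue(host, out var e))
        {
            e = new Entry();
            Check(cudaMalloc(out e.Device, (ulong)host.Length * sizeof(double)));
            _map[host] = e;
        }
        if (e.HostDirty)
        {
            var h = GCHandle.Alloc(host, GCHandleType.Pinned);
            try { Check(cudaMemcpy(e.Device, h.AddrOfPinnedObject(), (ulong)host.Length * sizeof(double), HostToDevice)); }
            finally { h.Free(); }
            e.HostDirty = false;
        }
        return e.Device;
    }

    // Copies results back only if a provider call marked the device copy dirty.
    public void ToHost(double[] host)
    {
        if (_map.TryGetValue(host, out var e) && e.DeviceDirty)
        {
            var h = GCHandle.Alloc(host, GCHandleType.Pinned);
            try { Check(cudaMemcpy(h.AddrOfPinnedObject(), e.Device, (ulong)host.Length * sizeof(double), DeviceToHost)); }
            finally { h.Free(); }
            e.DeviceDirty = false;
        }
    }

    public void MarkHostDirty(double[] host)   { if (_map.TryGetValue(host, out var e)) e.HostDirty = true; }
    public void MarkDeviceDirty(double[] host) { if (_map.TryGetValue(host, out var e)) e.DeviceDirty = true; }

    public void Dispose()
    {
        foreach (var e in _map.Values) cudaFree(e.Device);
        _map.Clear();
    }

    private static void Check(int status)
    {
        if (status != 0) throw new InvalidOperationException("CUDA runtime error " + status);
    }
}
```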
-
Sounds good to me, but I've yet to do any GPU programming so I cannot provide real feedback.
-
@kjbartel Stick to the basics when dealing with C++ (unless you are writing C++/CLI). Marshalling is one of the biggest costs. I have done a fair amount of optimization work at the managed/unmanaged boundary and I learned to be very careful about the marshalling tax. For some things, even for very big operations, it makes more sense to work in managed code than to pay the tax.

While achieving transparency in dealing with GPU memory is great, the performance you pay for it tends to be big. Avoid memory copies over the GPU-CPU boundary as much as you can. Sometimes (depending on the size of the vector) it may be faster to do the operation on the CPU, so plan accordingly when writing the storage classes: for those thresholds, if the data is already on the GPU, do it on the GPU; if not, find out whether the data is big enough to make it worth copying to the GPU and back.

The last time I did something like that I added a .Lock() that ensures the data is on the CPU (no matter the cost) and returns a disposable which .Unlock()s the memory. That is basically a signal that you are going to change more than a single byte and are requesting a CPU shadow copy. Changing a byte is not comparable to moving the entire data back to the CPU and then to the GPU again. The performance improvements of such an approach are huge.

Federico

PS: Don't forget to put [SuppressUnmanagedCodeSecurity] on your DllImports when dealing with cuBLAS and the CUDA runtime :)
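A rough sketch of what I mean, with the [SuppressUnmanagedCodeSecurity] bit at the end. The class, fields and Lock() semantics are illustrative; only the cuBLAS entry point (cublasDaxpy_v2) and its signature are real.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Security;

// Hypothetical GPU-backed vector illustrating the Lock()/Unlock() pattern:
// Lock() guarantees an up-to-date CPU shadow copy and returns a disposable;
// disposing it marks the host data dirty so the upload happens once, lazily.
public sealed class GpuVectorSketch
{
    private readonly double[] _shadow;   // CPU shadow copy
    private bool _deviceIsNewer;         // set by GPU operations

    public GpuVectorSketch(int length) { _shadow = new double[length]; }

    public IDisposable Lock()
    {
        if (_deviceIsNewer)
        {
            DownloadFromDevice(_shadow);  // pay the device->host copy once, up front
            _deviceIsNewer = false;
        }
        return new Unlocker(this);
    }

    public double[] Data => _shadow;      // intended to be touched only while locked

    public void MarkDeviceUpdated() => _deviceIsNewer = true;   // a GPU op would call this

    private sealed class Unlocker : IDisposable
    {
        private readonly GpuVectorSketch _owner;
        public Unlocker(GpuVectorSketch owner) { _owner = owner; }
        public void Dispose() { _owner.UploadToDeviceLazily(); }
    }

    private void DownloadFromDevice(double[] target) { /* cudaMemcpy device -> host */ }
    private void UploadToDeviceLazily() { /* mark host dirty; upload before the next GPU op */ }
}

// The DllImport point: suppress the unmanaged-code security demand on hot paths.
// The library name is illustrative; the entry point follows the cuBLAS v2 API.
[SuppressUnmanagedCodeSecurity]
internal static class CuBlasSketch
{
    [DllImport("cublas64_11", EntryPoint = "cublasDaxpy_v2")]
    public static extern int Daxpy(IntPtr handle, int n, ref double alpha,
                                   IntPtr x, int incx, IntPtr y, int incy);
}
```

Usage would look like:

```csharp
using (gpuVector.Lock())
{
    var data = gpuVector.Data;
    for (int i = 0; i < data.Length; i++) data[i] *= 2.0;   // bulk edit on the CPU shadow copy
}
// one lazy upload afterwards, instead of a round trip per element
```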
-
@redknightlois As the CUDA provider is currently implemented, it is purely 100% CUDA: every call to the provider will copy arrays from the CPU to the GPU, make call(s) to cuBLAS or cuSOLVER, and then copy back to the CPU. It isn't a hybrid CPU / GPU approach where faster operations are done on the CPU, or on the GPU if the array is already in GPU memory. So it currently requires the user of the MathNet libraries to decide when to use a CPU based provider and when to use the CUDA provider. And the CUDA provider is only performant when doing an operation which is a single call to the provider.

I definitely do see your point about the difference between changing just a few bytes and changing a larger portion of an array. Changing elements through indexers would be the main problem case, as the way I was thinking of it the whole array would be copied back to the CPU for changing any part of it, and then back to the GPU. It may be much better to make copying between the CPU and GPU explicit. In that case, rather than storage classes, it would probably be better to make derived Vector and Matrix classes with methods for modifying multiple elements in GPU memory and for copying between CPU and GPU memory (see the sketch below).

Which comes to your point about the cost of marshalling and the transition between managed and unmanaged code. Adding GPU memory management would most definitely increase the number of native calls if there are additional calls to create / copy / modify / read arrays in GPU memory. Would have to play around and see which is the best approach I think.

Which is actually a bit of a problem for me, as my home computer doesn't have an nVidia GPU and I'm pretty busy at work at the moment, so I can't spend any time on this for a few months.
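To be clear about what "explicit" could mean, here is an entirely hypothetical usage sketch - CudaMatrix, FromMatrix, ToMatrix and the device-side Solve don't exist anywhere, this is just the shape of the API I'm imagining (only the Matrix/Vector Build calls are the real MathNet API):

```csharp
using MathNet.Numerics.LinearAlgebra;

var denseA = Matrix<double>.Build.Random(2000, 2000);
var denseB = Matrix<double>.Build.Random(2000, 10);

using var a = CudaMatrix.FromMatrix(denseA);   // hypothetical: allocate device memory and upload once
using var b = CudaMatrix.FromMatrix(denseB);

using var x = a.Solve(b);                      // hypothetical: factorise and solve entirely in GPU memory

var newColumn = Vector<double>.Build.Dense(2000, 1.0);
x.SetColumn(0, newColumn);                     // hypothetical: bulk update pushed to the device in one call

Matrix<double> result = x.ToMatrix();          // hypothetical: one explicit download at the end
```

Each matrix would cross the CPU/GPU boundary only when the user asks it to, so there are no hidden per-element transfers.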
-
@borfudin Any thoughts?
-
I couldn't agree more re: the memory issue. I built this mostly as a first swag at getting it working; in practice it makes far more sense to have all of the data allocated and stored on the GPU and to only copy it over when absolutely necessary.

Something I have used in other circumstances is to provide an abstraction of the actual physical memory used by a matrix or vector object, which already kind of exists in MathNet.Numerics in the form of the MatrixStorage abstractions. If the native provider implementation were extended to allow the provider to optionally supply its own MatrixStorage classes, then you could conceivably create CUDA versions of all the MatrixStorage classes which keep both CPU and GPU pointers and copy between them as necessary. What do others think?
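In sketch form, such a storage class could look roughly like this. All names are illustrative, it doesn't derive from the real MatrixStorage<T> hierarchy, and the CUDA calls are left as comments - it is only meant to show the "both sides plus dirty flags" shape:

```csharp
using System;

public sealed class CudaDenseStorageSketch
{
    private readonly int _rows, _columns;
    private readonly double[] _host;   // CPU copy, used by indexers and managed code
    private IntPtr _device;            // GPU copy, used by the CUDA provider
    private bool _hostDirty = true;    // host changed since the last upload
    private bool _deviceDirty;         // device changed since the last download

    public CudaDenseStorageSketch(int rows, int columns)
    {
        _rows = rows;
        _columns = columns;
        _host = new double[rows * columns];
        _device = IntPtr.Zero;         // cudaMalloc would happen here, or lazily on first use
    }

    // The CUDA provider would call this right before a cuBLAS/cuSOLVER call.
    internal IntPtr DevicePointer()
    {
        if (_hostDirty) { /* cudaMemcpy host -> device */ _hostDirty = false; }
        return _device;
    }

    // And this right after, to record that the device now holds the newest data.
    internal void MarkDeviceDirty() => _deviceDirty = true;

    // Element access goes through the CPU copy, downloading first if a GPU op ran.
    public double At(int row, int column)
    {
        if (_deviceDirty) { /* cudaMemcpy device -> host */ _deviceDirty = false; }
        return _host[column * _rows + row];
    }

    public void At(int row, int column, double value)
    {
        if (_deviceDirty) { /* cudaMemcpy device -> host */ _deviceDirty = false; }
        _host[column * _rows + row] = value;
        _hostDirty = true;
    }
}
```

The appeal is that existing code keeps using Vector/Matrix as before and the copies become an implementation detail of the storage; the drawback, as noted elsewhere in the thread, is that indexer-heavy code can still trigger a lot of hidden transfers.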
-
This is also what I initially thought; however, it hides what type of storage class is being used and could cause problems when matrices or vectors are created from different providers (for example, creating a matrix prior to switching to the CUDA provider).

For that reason I think just a manager class would be better (or just inside the CUDA LAP), which maps between the CPU and GPU pointers / references. Make the memory management explicit for the user, and then you don't have to worry about modifying / extending any of the storage classes.
-
For the CUDA LA provider it would be desirable to have matrix and vector classes which use GPU memory, so that we are not copying data back and forth between the GPU and CPU all the time. For example, when solving a system with Matrix.Solve, the LU decomposition is calculated using the GPU, then copied back to the CPU's memory, then copied again to the GPU to solve, and back again to the CPU to provide the result. This is very inefficient. Recent benchmarking I did found that the CUDA provider was the slowest provider (at least on my hardware), mainly due to this memory issue.
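For context, the factorise-then-solve pair could in principle stay entirely in device memory. The cuSOLVER entry points below (cusolverDnDgetrf / cusolverDnDgetrs) and their signatures are real; the library name and the idea of chaining them on persistent device pointers is the sketch:

```csharp
using System;
using System.Runtime.InteropServices;
using System.Security;

[SuppressUnmanagedCodeSecurity]
internal static class CuSolverSketch
{
    private const string Lib = "cusolver64_11";   // DLL name varies by CUDA version / platform

    // LU factorisation in place on device memory (dA); pivots and info also stay on the device.
    [DllImport(Lib)]
    public static extern int cusolverDnDgetrf(IntPtr handle, int m, int n, IntPtr dA, int lda,
                                              IntPtr dWorkspace, IntPtr dIpiv, IntPtr dInfo);

    // Solve using the factors already sitting in device memory - no host round trip needed.
    [DllImport(Lib)]
    public static extern int cusolverDnDgetrs(IntPtr handle, int trans, int n, int nrhs, IntPtr dA, int lda,
                                              IntPtr dIpiv, IntPtr dB, int ldb, IntPtr dInfo);
}
```

If the factors and pivots simply stayed on the device between those two calls, the intermediate download and re-upload in the current Solve path would disappear, leaving one upload of A and B and one download of the result.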