Replies: 9 comments
-
Interesting idea. I think it would make sense for the GPU providers, but not so much for the "CPU" providers - I don't think there is much overhead to pinning a pointer to an array in the managed heap and passing it to the native code. Any thoughts on how the API might look? How would one access the array elements from managed code?
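For reference, the pinning path I mean is essentially the following. This is only a sketch: the library name and entry point are made up for illustration, not the real native provider.

```csharp
using System;
using System.Runtime.InteropServices;

static class PinningSketch
{
    // Illustrative import only; "ExampleNativeProvider" and "example_daxpy" are not real names.
    [DllImport("ExampleNativeProvider", EntryPoint = "example_daxpy")]
    private static extern void ExampleDaxpy(int n, double alpha, IntPtr x, IntPtr y);

    public static void Daxpy(double alpha, double[] x, double[] y)
    {
        // Pin the managed arrays so the GC cannot move them, then hand the raw
        // pointers to native code. No copy is made; the cost is just the pin/unpin.
        var hx = GCHandle.Alloc(x, GCHandleType.Pinned);
        var hy = GCHandle.Alloc(y, GCHandleType.Pinned);
        try
        {
            ExampleDaxpy(x.Length, alpha, hx.AddrOfPinnedObject(), hy.AddrOfPinnedObject());
        }
        finally
        {
            hx.Free();
            hy.Free();
        }
    }
}
```

(For blittable arrays like double[] the P/Invoke marshaller will pin automatically if the parameters are declared as double[], so even the explicit GCHandle is optional.)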
-
There have been discussions before for new additional matrix/vector storage implementations, based on unmanaged or GPU memory (where MatrixStorage essentially holds a pointer instead of some arrays). My conclusion back then was that we'd have to change the native provider interface to operate on MatrixStorage instead of arrays.
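In sketch form, such a storage class would hold a raw pointer rather than a double[]. Everything below is illustrative (names, layout, and it does not derive from the real MatrixStorage base class); it only shows the shape of the idea.

```csharp
using System;
using System.Runtime.InteropServices;

// Requires <AllowUnsafeBlocks>true</AllowUnsafeBlocks> because of the pointer indexer.
public sealed unsafe class UnmanagedDenseStorageSketch : IDisposable
{
    public int RowCount { get; }
    public int ColumnCount { get; }
    public IntPtr Data { get; private set; }   // a pointer instead of a managed array

    public UnmanagedDenseStorageSketch(int rows, int columns)
    {
        RowCount = rows;
        ColumnCount = columns;
        Data = Marshal.AllocHGlobal(sizeof(double) * rows * columns);
    }

    // Column-major layout, matching what LAPACK-style native providers expect.
    public double this[int row, int column]
    {
        get => ((double*)Data)[column * RowCount + row];
        set => ((double*)Data)[column * RowCount + row] = value;
    }

    public void Dispose()
    {
        if (Data != IntPtr.Zero)
        {
            Marshal.FreeHGlobal(Data);
            Data = IntPtr.Zero;
        }
    }
}
```

A native provider operating on such a storage object could take Data directly, with no pinning or copying at the managed/unmanaged boundary.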
-
@cuda I think you might be correct about the CPU providers. For some reason I thought arrays might be getting copied, but it is as you say: the arrays are pinned and then worked on directly by the native code. For the CPU provider I did a bit of testing using

For the GPU I think we'd need some sort of manager class which basically maintains a collection of references to the managed vectors / matrices stored in CPU memory and the corresponding pointers to the vectors / matrices in GPU memory. Then you could have functions which copy the memory back and forth and free the GPU memory when it's not needed. So the basic workflow would be: copy to the GPU, run the operation(s), copy the results back, and free the GPU memory once it's no longer needed.

Most of that could be hidden inside CUDA-specific storage classes and the CUDA provider and / or wrapper, so that the memory is automatically copied back and forth, using "dirty" flags to indicate when to copy. Depending on how it's implemented, either the manager class or just the device pointers would be passed to the native function. A rough sketch of the manager idea is below.

Any thoughts?
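Something along these lines (purely a sketch: the cudaMalloc / cudaMemcpy / cudaFree imports mirror the real CUDA runtime signatures, but the DLL name, the class and everything else is illustrative):

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.InteropServices;
using System.Security;

[SuppressUnmanagedCodeSecurity]
public sealed class GpuMemoryManagerSketch : IDisposable
{
    // The runtime DLL name varies with platform and CUDA version; "cudart64_110" is just an example.
    [DllImport("cudart64_110")] private static extern int cudaMalloc(out IntPtr devPtr, ulong size);
    [DllImport("cudart64_110")] private static extern int cudaFree(IntPtr devPtr);
    [DllImport("cudart64_110")] private static extern int cudaMemcpy(IntPtr dst, IntPtr src, ulong count, int kind);
    private const int HostToDevice = 1, DeviceToHost = 2;

    private sealed class Entry
    {
        public IntPtr Device;
        public bool HostDirty = true;   // managed copy changed since the last upload
        public bool DeviceDirty;        // device copy changed since the last download
    }

    private readonly Dictionary<double[], Entry> _map = new Dictionary<double[], Entry>();

    // Returns the device pointer for a managed array, uploading only when the host copy is dirty.
    public IntPtr ToDevice(double[] host)
    {
        if (!_map.TryGetValue(host, out var e))
        {
            e = new Entry();
            Check(cudaMalloc(out e.Device, (ulong)host.Length * sizeof(double)));
            _map[host] = e;
        }
        if (e.HostDirty)
        {
            var h = GCHandle.Alloc(host, GCHandleType.Pinned);
            try { Check(cudaMemcpy(e.Device, h.AddrOfPinnedObject(), (ulong)host.Length * sizeof(double), HostToDevice)); }
            finally { h.Free(); }
            e.HostDirty = false;
        }
        return e.Device;
    }

    // Copies results back only if a provider call marked the device copy dirty.
    public void ToHost(double[] host)
    {
        if (_map.TryGetValue(host, out var e) && e.DeviceDirty)
        {
            var h = GCHandle.Alloc(host, GCHandleType.Pinned);
            try { Check(cudaMemcpy(h.AddrOfPinnedObject(), e.Device, (ulong)host.Length * sizeof(double), DeviceToHost)); }
            finally { h.Free(); }
            e.DeviceDirty = false;
        }
    }

    public void MarkHostDirty(double[] host)   { if (_map.TryGetValue(host, out var e)) e.HostDirty = true; }
    public void MarkDeviceDirty(double[] host) { if (_map.TryGetValue(host, out var e)) e.DeviceDirty = true; }

    public void Dispose()
    {
        foreach (var e in _map.Values) cudaFree(e.Device);
        _map.Clear();
    }

    private static void Check(int status)
    {
        if (status != 0) throw new InvalidOperationException("CUDA runtime error " + status);
    }
}
```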
-
Sounds good to me, but I've yet to do any GPU programming so I cannot provide real feedback.
-
@kjbartel Stick to the basics when dealing with C++ (unless you are writing C++/CLI). Marshalling is one of the biggest costs. I have done a fair amount of optimization work at the managed/unmanaged boundary and I learned to be very careful about the marshalling tax. For some things, even for very big operations, it makes more sense to work in managed code than to pay the tax.

While achieving transparency in dealing with GPU memory is great, the performance you pay for it tends to be big. Avoid memory copies over the GPU-CPU boundary as much as you can. Sometimes (depending on the size of the vector) it may be faster to do the operation on the CPU, so plan accordingly when writing the storage classes: for those thresholds, if the data is already on the GPU, do it on the GPU; if not, find out whether the data is big enough to make it worth copying to the GPU and back.

The last time I did something like that I added a .Lock() that ensures the data is on the CPU (no matter the cost) and returns a disposable which .Unlock()s the memory. That is basically a signal that you are going to change more than a single byte and are requesting a CPU shadow copy. Changing a byte is not comparable to moving the entire data back to the CPU and then to the GPU again. The performance improvements of such an approach are huge.

Federico

PS: Don't forget to put [SuppressUnmanagedCodeSecurity] on your DllImports when dealing with cuBLAS and the CUDA runtime :)
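A rough sketch of what I mean, with the [SuppressUnmanagedCodeSecurity] bit at the end. The class, fields and Lock() semantics are illustrative; only the cuBLAS entry point (cublasDaxpy_v2) and its signature are real.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Security;

// Hypothetical GPU-backed vector illustrating the Lock()/Unlock() pattern:
// Lock() guarantees an up-to-date CPU shadow copy and returns a disposable;
// disposing it marks the host data dirty so the upload happens once, lazily.
public sealed class GpuVectorSketch
{
    private readonly double[] _shadow;   // CPU shadow copy
    private bool _deviceIsNewer;         // set by GPU operations

    public GpuVectorSketch(int length) { _shadow = new double[length]; }

    public IDisposable Lock()
    {
        if (_deviceIsNewer)
        {
            DownloadFromDevice(_shadow);  // pay the device->host copy once, up front
            _deviceIsNewer = false;
        }
        return new Unlocker(this);
    }

    public double[] Data => _shadow;      // intended to be touched only while locked

    public void MarkDeviceUpdated() => _deviceIsNewer = true;   // a GPU op would call this

    private sealed class Unlocker : IDisposable
    {
        private readonly GpuVectorSketch _owner;
        public Unlocker(GpuVectorSketch owner) { _owner = owner; }
        public void Dispose() { _owner.UploadToDeviceLazily(); }
    }

    private void DownloadFromDevice(double[] target) { /* cudaMemcpy device -> host */ }
    private void UploadToDeviceLazily() { /* mark host dirty; upload before the next GPU op */ }
}

// The DllImport point: suppress the unmanaged-code security demand on hot paths.
// The library name is illustrative; the entry point follows the cuBLAS v2 API.
[SuppressUnmanagedCodeSecurity]
internal static class CuBlasSketch
{
    [DllImport("cublas64_11", EntryPoint = "cublasDaxpy_v2")]
    public static extern int Daxpy(IntPtr handle, int n, ref double alpha,
                                   IntPtr x, int incx, IntPtr y, int incy);
}
```

Usage would look like:

```csharp
using (gpuVector.Lock())
{
    var data = gpuVector.Data;
    for (int i = 0; i < data.Length; i++) data[i] *= 2.0;   // bulk edit on the CPU shadow copy
}
// one lazy upload afterwards, instead of a round trip per element
```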
-
@redknightlois As the CUDA provider is currently implemented, it is purely 100% CUDA: every call to the provider will copy arrays from the CPU to the GPU, make call(s) to cuBLAS or cuSOLVER, and then copy back to the CPU. It isn't a hybrid CPU / GPU approach where faster operations are done on the CPU, or on the GPU if the array is already in GPU memory. So it currently requires the user of the MathNet libraries to decide when to use a CPU based provider and when to use the CUDA provider. And the CUDA provider is only performant when doing an operation which is a single call to the provider.

I definitely do see your point about the difference between changing just a few bytes and changing a larger portion of an array. Changing elements through indexers would be the main problem case, as the way I was thinking of it the whole array would be copied back to the CPU for changing any part of it, and then back to the GPU. It may be much better to make copying between the CPU and GPU explicit. In that case, rather than storage classes, it would probably be better to make derived Vector and Matrix classes with methods for modifying multiple elements in GPU memory and for copying between CPU and GPU memory (see the sketch below).

Which comes to your point about the cost of marshalling and the transition between managed and unmanaged code. Adding GPU memory management would most definitely increase the number of native calls if there are additional calls to create / copy / modify / read arrays in GPU memory. Would have to play around and see which is the best approach I think.

Which is actually a bit of a problem for me, as my home computer doesn't have an nVidia GPU and I'm pretty busy at work at the moment, so I can't spend any time on this for a few months.
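To be clear about what "explicit" could mean, here is an entirely hypothetical usage sketch - CudaMatrix, FromMatrix, ToMatrix and the device-side Solve don't exist anywhere, this is just the shape of the API I'm imagining (only the Matrix/Vector Build calls are the real MathNet API):

```csharp
using MathNet.Numerics.LinearAlgebra;

var denseA = Matrix<double>.Build.Random(2000, 2000);
var denseB = Matrix<double>.Build.Random(2000, 10);

using var a = CudaMatrix.FromMatrix(denseA);   // hypothetical: allocate device memory and upload once
using var b = CudaMatrix.FromMatrix(denseB);

using var x = a.Solve(b);                      // hypothetical: factorise and solve entirely in GPU memory

var newColumn = Vector<double>.Build.Dense(2000, 1.0);
x.SetColumn(0, newColumn);                     // hypothetical: bulk update pushed to the device in one call

Matrix<double> result = x.ToMatrix();          // hypothetical: one explicit download at the end
```

Each matrix would cross the CPU/GPU boundary only when the user asks it to, so there are no hidden per-element transfers.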
-
@borfudin Any thoughts?
-
I couldn't agree more re: the memory issue. I built this mostly as a first swag at getting it working; in practice it makes far more sense to have all of the data allocated and stored on the GPU and to only copy it over when absolutely necessary.

Something I have used in other circumstances is to provide an abstraction of the actual physical memory used by a matrix or vector object, which already kind of exists in MathNet.Numerics in the form of the MatrixStorage abstractions. If the native provider implementation were extended to allow the provider to optionally supply its own MatrixStorage classes, then you could conceivably create CUDA versions of all the MatrixStorage classes which keep both CPU and GPU pointers and copy between them as necessary. What do others think?
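In sketch form, such a storage class could look roughly like this. All names are illustrative, it doesn't derive from the real MatrixStorage<T> hierarchy, and the CUDA calls are left as comments - it is only meant to show the "both sides plus dirty flags" shape:

```csharp
using System;

public sealed class CudaDenseStorageSketch
{
    private readonly int _rows, _columns;
    private readonly double[] _host;   // CPU copy, used by indexers and managed code
    private IntPtr _device;            // GPU copy, used by the CUDA provider
    private bool _hostDirty = true;    // host changed since the last upload
    private bool _deviceDirty;         // device changed since the last download

    public CudaDenseStorageSketch(int rows, int columns)
    {
        _rows = rows;
        _columns = columns;
        _host = new double[rows * columns];
        _device = IntPtr.Zero;         // cudaMalloc would happen here, or lazily on first use
    }

    // The CUDA provider would call this right before a cuBLAS/cuSOLVER call.
    internal IntPtr DevicePointer()
    {
        if (_hostDirty) { /* cudaMemcpy host -> device */ _hostDirty = false; }
        return _device;
    }

    // And this right after, to record that the device now holds the newest data.
    internal void MarkDeviceDirty() => _deviceDirty = true;

    // Element access goes through the CPU copy, downloading first if a GPU op ran.
    public double At(int row, int column)
    {
        if (_deviceDirty) { /* cudaMemcpy device -> host */ _deviceDirty = false; }
        return _host[column * _rows + row];
    }

    public void At(int row, int column, double value)
    {
        if (_deviceDirty) { /* cudaMemcpy device -> host */ _deviceDirty = false; }
        _host[column * _rows + row] = value;
        _hostDirty = true;
    }
}
```

The appeal is that existing code keeps using Vector/Matrix as before and the copies become an implementation detail of the storage; the drawback, as noted elsewhere in the thread, is that indexer-heavy code can still trigger a lot of hidden transfers.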
-
This is also what I initially thought; however, it hides what type of storage class is being used and could cause problems when matrices or vectors are created from different providers (for example, creating a matrix prior to switching to the CUDA provider).

For that reason I think just a manager class would be better (or just inside the CUDA LAP), which maps between the CPU and GPU pointers / references. Make the memory management explicit for the user, and then you don't have to worry about modifying / extending any of the storage classes.
-
For the CUDA LA provider it would be desirable to have matrix and vector classes which use GPU memory, so that we are not copying data back and forth between the GPU and CPU all the time. For example, when solving a system with Matrix.Solve, the LU decomposition is calculated using the GPU, then copied back to the CPU's memory, then copied again to the GPU to solve, and back again to the CPU to provide the result. This is very inefficient. Recent benchmarking I did found that the CUDA provider was the slowest provider (at least on my hardware), mainly due to this memory issue.
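For context, the factorise-then-solve pair could in principle stay entirely in device memory. The cuSOLVER entry points below (cusolverDnDgetrf / cusolverDnDgetrs) and their signatures are real; the library name and the idea of chaining them on persistent device pointers is the sketch:

```csharp
using System;
using System.Runtime.InteropServices;
using System.Security;

[SuppressUnmanagedCodeSecurity]
internal static class CuSolverSketch
{
    private const string Lib = "cusolver64_11";   // DLL name varies by CUDA version / platform

    // LU factorisation in place on device memory (dA); pivots and info also stay on the device.
    [DllImport(Lib)]
    public static extern int cusolverDnDgetrf(IntPtr handle, int m, int n, IntPtr dA, int lda,
                                              IntPtr dWorkspace, IntPtr dIpiv, IntPtr dInfo);

    // Solve using the factors already sitting in device memory - no host round trip needed.
    [DllImport(Lib)]
    public static extern int cusolverDnDgetrs(IntPtr handle, int trans, int n, int nrhs, IntPtr dA, int lda,
                                              IntPtr dIpiv, IntPtr dB, int ldb, IntPtr dInfo);
}
```

If the factors and pivots simply stayed on the device between those two calls, the intermediate download and re-upload in the current Solve path would disappear, leaving one upload of A and B and one download of the result.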