Communication with Cupy #100
The location of the input/output data can only be specified in the C++ interface, not in the C interface, so I guess that will not work as-is. It could work, though, with some changes to the code.
Cupy has an option to run custom CUDA kernels; would this feature be useful to get it working?
Hey, I ended up figuring this out. Is there a way to update the master branch? Basically, this lets you do all of your data processing on the GPU and then immediately push it to Gpufit to do the fits, without having to transfer the data back to the CPU. The next thing I want to figure out is how to JIT-compile fit models, so that you can construct a fit model based on various parameters and then have it run through Gpufit; I don't know if this is possible.
Sounds great. The best way to do this is to fork the repository, include your changes, and submit a pull request.
OK, well, the cupy approach works. I actually don't think I care about the JIT stuff; it looks like y'all tried that and there were speed hits. How about fits that involve complex numbers? Would that be a difficult addition?
@SBresler I don't see any changes to your fork. Can you share with us how you implemented cupy interfacing? I (and I imagine others) would find this very interesting/useful! Cheers.
Hey, I don't think I ever uploaded this, and I wasn't actually 100% sure whether what I did worked. I basically just had the Python interface point to the cupy object instead of whatever numpy array it was pointing to, and the program still worked, and at least I didn't have to explicitly call cupy.asnumpy() (I am doing preprocessing for fits with RAPIDS and cupy, so I didn't see the point in shipping data back and forth).

So I'm not exactly sure whether this was *really* doing what I thought it was doing. I'm not sure I have the skills to figure this out, really. Maybe you could see what it does.

I will try to find the version of the program where I did this.
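For reference, the "point the interface at the cupy object" trick ultimately comes down to extracting a raw device pointer from the array. A minimal sketch of a helper that does this via the standard `__cuda_array_interface__` protocol (which cupy arrays implement); the helper name is made up, and no cupy import is needed to define it:

```python
def device_ptr(arr) -> int:
    """Return the raw device address of a GPU array.

    Any object implementing the __cuda_array_interface__ protocol
    (cupy.ndarray, numba device arrays, ...) exposes its device
    pointer as interface['data'][0].
    """
    iface = getattr(arr, "__cuda_array_interface__", None)
    if iface is None:
        raise TypeError("not a device array (no __cuda_array_interface__)")
    ptr, _read_only = iface["data"]
    return ptr

# With cupy this becomes, e.g.:
#   import cupy as cp
#   d = cp.zeros((100, 32), dtype=cp.float32)
#   addr = device_ptr(d)  # device address, usable where a GPU pointer is expected
```

Whether a given entry point actually treats that address as device memory (rather than copying it as if it were host data) is exactly the question raised in this thread.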
Alright, I have come back to this. What I did was cast a pointer to the cupy object in the Python interface. This allowed me to pass a cupy object as an argument to the fit function call. Looking at the traces through nvtx, there is still a lot of copying happening during that block.

I was looking at JaxFit, which I could more realistically modify since I have a lot more Python knowledge than C++ at this point, but to me that program is focused on extremely large, complex fits, whereas Gpufit is all about doing a ton of small fits at once. This might be personal bias because it's exactly my use case, but my feeling is that if you have a ton of small datasets like this that you want to fit, the most obvious improvement for Gpufit at the moment is to allow access to data that is already in global memory on the device.

At the moment I am doing ~3 GB/s transfers to the GPU for FFTs and then some reduction operations, and it's working relatively well, but the bottlenecks are always transfer times. I just thought of this: maybe it's easier to go the other way and put all of my preprocessing into Gpufit instead. I am streaming a LOT of data through a digitizer at the moment (3 GB/s), and have gotten fits continuously for about 10 seconds, and I am fairly certain that eliminating one or both of these copies blows the problem apart (RDMA for the digitizer takes away one transfer, accessing global memory for the Gpufit calls takes away two transfers, and my reduction is about a factor of 2.5).
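To make the transfer-time argument concrete, here is a back-of-the-envelope sketch. The 3 GB/s stream rate and the ~2.5x reduction factor come from the discussion above; the effective PCIe copy bandwidth and the copy counts are illustrative assumptions:

```python
# Rough transfer-cost model for the streaming pipeline described above.
# Only STREAM_RATE_GBPS and REDUCTION_FACTOR come from the thread;
# PCIE_RATE_GBPS and the copy counts are assumptions for illustration.

STREAM_RATE_GBPS = 3.0    # digitizer output rate, from the discussion
PCIE_RATE_GBPS = 12.0     # assumed effective host<->device copy bandwidth
REDUCTION_FACTOR = 2.5    # data shrinks ~2.5x before fitting, from the discussion

def copy_seconds_per_second(n_copies_raw: int, n_copies_reduced: int) -> float:
    """Seconds spent on PCIe copies per second of acquisition.

    n_copies_raw:     copies of the full-rate raw stream
    n_copies_reduced: copies of the post-reduction data
    """
    raw = n_copies_raw * STREAM_RATE_GBPS / PCIE_RATE_GBPS
    reduced = n_copies_reduced * (STREAM_RATE_GBPS / REDUCTION_FACTOR) / PCIE_RATE_GBPS
    return raw + reduced

# Hypothetical current pipeline: upload the raw stream once, then move the
# reduced data back to the host and up again for the fit (two extra copies).
current = copy_seconds_per_second(1, 2)
# With RDMA into the GPU and Gpufit reading device memory directly: no copies.
direct = copy_seconds_per_second(0, 0)
```

Under these assumptions the copy overhead drops from a substantial fraction of each acquisition second to zero, which is the "blows the problem apart" intuition in numbers.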
Another thought: what if you want to use RDMA to get the data to the GPU faster, bypassing the whole sequence of reading the data into CPU RAM over the PCIe bus (from a hard drive or otherwise), pinning the address, transferring, et cetera? If Gpufit only accepts host data, that would mean you fundamentally cannot use Gpufit and RDMA in the same application.
Another idea: add a preprocessing hook that lets you supply your own kernel to run before the fit. This could work as a stopgap.
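In the meantime, the kind of preprocessing reduction described above can be done on the array side before the fit. A sketch of block-averaging raw samples by a factor, written against the numpy API so the identical code runs on cupy arrays (where the result stays in GPU memory); the function name and the factor are illustrative:

```python
import numpy as np

def block_reduce(raw, factor):
    """Average consecutive groups of `factor` samples along the last axis.

    Works unchanged for numpy and cupy arrays, since cupy mirrors the
    numpy API; with a cupy input the reduced result stays on the device,
    ready to be handed to the fit without a round trip to the host.
    """
    n_kept = (raw.shape[-1] // factor) * factor  # drop any ragged tail
    trimmed = raw[..., :n_kept]
    return trimmed.reshape(*trimmed.shape[:-1], -1, factor).mean(axis=-1)

# Example on the host; with `import cupy as cp` the same call works on cp arrays.
data = np.arange(12, dtype=np.float32).reshape(2, 6)
reduced = block_reduce(data, 3)  # shape (2, 2)
```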
Hi. Fitting data that is already stored in GPU memory is already implemented in Gpufit. The docs are here: https://gpufit.readthedocs.io/en/latest/gpufit_api.html#gpufit-cuda-interface . As you found out, when working with Python you need to obtain a pointer to a GPU memory location in order to use the gpufit_cuda_interface call; Gpufit knows nothing about Python, numpy arrays, etc. The preprocessing you're talking about could be implemented as a separate routine: you can do anything you want with the data stored on the GPU before and after calling Gpufit, which is simply meant to handle the fit step. Finally, we tried real-time compilation of fit model functions, and this caused major performance bottlenecks. It would clearly be a great feature to have, and the topic may be revisited in the future.
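For anyone following along, a rough sketch of what binding that CUDA interface from Python via ctypes could look like. The parameter list below is transcribed from the linked documentation page and should be verified against the shipped gpufit.h header; the library path in the usage comment is an assumption:

```python
import ctypes as ct

# Argument types for gpufit_cuda_interface, transcribed from the Gpufit docs.
# This is a sketch, not authoritative -- check against gpufit.h.
GPUFIT_CUDA_ARGTYPES = [
    ct.c_size_t,           # n_fits
    ct.c_size_t,           # n_points
    ct.c_void_p,           # gpu_data (device pointer)
    ct.c_void_p,           # gpu_weights (device pointer, may be NULL)
    ct.c_int,              # model_id
    ct.c_float,            # tolerance
    ct.c_int,              # max_n_iterations
    ct.POINTER(ct.c_int),  # parameters_to_fit (host array)
    ct.c_int,              # estimator_id
    ct.c_size_t,           # user_info_size
    ct.c_char_p,           # user_info
    ct.c_void_p,           # gpu_fit_parameters (device pointer, in/out)
    ct.c_void_p,           # output_states (device pointer)
    ct.c_void_p,           # output_chi_squares (device pointer)
    ct.c_void_p,           # output_n_iterations (device pointer)
]

def bind_cuda_interface(lib: ct.CDLL):
    """Attach argtypes/restype to the gpufit_cuda_interface symbol."""
    fn = lib.gpufit_cuda_interface
    fn.argtypes = GPUFIT_CUDA_ARGTYPES
    fn.restype = ct.c_int  # return state; 0 indicates success per the docs
    return fn

# Usage (requires the Gpufit shared library and a CUDA device):
#   lib = ct.CDLL("libGpufit.so")  # library name/path is an assumption
#   fit = bind_cuda_interface(lib)
#   status = fit(n_fits, n_points, data_ptr, None, model_id, 1e-4, 25,
#                params_to_fit, estimator_id, 0, None, param_ptr,
#                states_ptr, chisq_ptr, iters_ptr)
```

The device pointers here would come from whatever GPU array library holds the data, e.g. a cupy array's device address.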
Wow, this is why you have to be persistent and keep asking! So either this is new, or I was just going off information found in other posts that wasn't entirely accurate. I don't see a way to look at old versions of the docs, but that would be interesting to find out. Thanks so much for the information; I can work with this. It was blowing my mind that this wasn't a feature, and it totally is.
Interesting. When you say "major performance bottlenecks", are you talking about more than an order-of-magnitude speed decrease? I think scientists are generally hungry for faster fitting routines, and almost anything beats the speed of LMfit.
I have a version now which does the following:
So I think this is a lot closer to what I want. I will do a PR for this at some point. An idea I have been toying with is to expose all of the functions via pybind11 rather than ctypes; this seems to be the tool of choice for a lot of people. It would also give you access to pytest for unit testing in Gpufit. I think the Python interface is by far the most important aspect of this for any sort of widespread adoption.
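As a sketch of what pytest-based unit tests for the Python interface could look like: the synthetic-data generator below is self-contained numpy, while the actual fit call is guarded with importorskip since it needs pygpufit and a CUDA device (the exact `gf.fit` signature and `ModelID.GAUSS_1D` usage should be checked against the pygpufit examples):

```python
import numpy as np

def make_gaussian_1d(n_fits, n_points, amp, center, width, offset,
                     noise=0.0, seed=0):
    """Synthetic 1D Gaussian curves, one row per fit -- test fixture data."""
    rng = np.random.default_rng(seed)
    x = np.arange(n_points, dtype=np.float32)
    clean = amp * np.exp(-((x - center) ** 2) / (2.0 * width ** 2)) + offset
    data = np.tile(clean, (n_fits, 1)).astype(np.float32)
    if noise:
        data += rng.normal(0.0, noise, data.shape).astype(np.float32)
    return data

def test_gaussian_fit():
    import pytest
    data = make_gaussian_1d(n_fits=100, n_points=32, amp=5.0,
                            center=15.0, width=3.0, offset=1.0, noise=0.05)
    assert data.shape == (100, 32)
    # The fit itself needs pygpufit and a CUDA device; skip cleanly otherwise.
    # Call details are a sketch -- verify against the pygpufit examples.
    gf = pytest.importorskip("pygpufit.gpufit")
    init = np.tile(np.array([4.0, 14.0, 2.5, 0.5], np.float32), (100, 1))
    params, states, chi2, n_iter, t = gf.fit(data, None,
                                             gf.ModelID.GAUSS_1D, init)
    assert np.all(states == 0)  # 0 == converged
```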
Is it possible to have arrays already stored on the GPU as cupy.ndarray objects used in the fitting routines? This is with the Python wheel.