Communication with Cupy #100
The location of the input/output data can only be specified in the C++ interface, not in the C interface, so I guess that will not work as-is. It could work, though, with some changes to the code.
Cupy has an option to run custom CUDA kernels; would this feature be useful to get it working?
Hey, I ended up figuring this out. Is there a way to update the master branch? Basically, this lets you do all of your data processing on the GPU and then immediately push it to Gpufit to do the fits, without having to transfer the data back to the CPU. The next thing I want to figure out is how to JIT-compile fit models, so that you can construct a fit model based on various parameters and then have it run through Gpufit; I don't know if this is possible.
Sounds great. The best way to do this is to fork the repository, include your changes, and submit a pull request.
OK, well, the cupy approach works. I actually don't think I care about the JIT stuff; it looks like y'all tried that and there were speed hits. How about fits that involve complex numbers? Would that be a difficult addition?
@SBresler I don't see any changes to your fork. Can you share with us how you implemented cupy interfacing? I (and I imagine others) would find this very interesting/useful! Cheers.
Hey, I don't think I ever uploaded this, and I wasn't actually 100% sure whether what I did worked. I basically just had the Python interface point to the cupy object instead of whatever numpy array it was pointing to, and the program still worked, and at least I didn't have to explicitly call cupy.asnumpy() (I am doing preprocessing for fits with RAPIDS and cupy, so I didn't see the point in shipping data back and forth).

So I'm not exactly sure whether this was *really* doing what I thought it was doing. I'm not sure I have the skills to figure this out, really. Maybe you could see what it does.

I will try to find the version of the program where I did this.
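For reference, the "point the interface at the cupy object" trick ultimately comes down to extracting a raw device pointer from the array. A minimal sketch of a helper that does this via the standard `__cuda_array_interface__` protocol (which cupy arrays implement); the helper name is made up, and no cupy import is needed to define it:

```python
def device_ptr(arr) -> int:
    """Return the raw device address of a GPU array.

    Any object implementing the __cuda_array_interface__ protocol
    (cupy.ndarray, numba device arrays, ...) exposes its device
    pointer as interface['data'][0].
    """
    iface = getattr(arr, "__cuda_array_interface__", None)
    if iface is None:
        raise TypeError("not a device array (no __cuda_array_interface__)")
    ptr, _read_only = iface["data"]
    return ptr

# With cupy this becomes, e.g.:
#   import cupy as cp
#   d = cp.zeros((100, 32), dtype=cp.float32)
#   addr = device_ptr(d)  # device address, usable where a GPU pointer is expected
```

Whether a given entry point actually treats that address as device memory (rather than copying it as if it were host data) is exactly the question raised in this thread.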
Alright, I have come back to this. What I did was cast a pointer to the cupy object in the Python interface. This allowed me to pass a cupy object as an argument to the fit function call. Looking at the traces through nvtx, there is still a lot of copying happening during that block.

I was looking at JaxFit, which I could more realistically modify since I have a lot more Python knowledge than C++ at this point, but to me that program is focused on extremely large, complex fits, whereas Gpufit is all about doing a ton of small fits at once. This might be personal bias because it's exactly my use case, but my feeling is that if you have a ton of small datasets like this that you want to fit, the most obvious improvement for Gpufit at the moment is to allow access to data that is already in global memory on the device.

At the moment I am doing ~3 GB/s transfers to the GPU for FFTs and then some reduction operations, and it's working relatively well, but the bottlenecks are always transfer times. I just thought of this: maybe it's easier to go the other way and put all of my preprocessing into Gpufit instead. I am streaming a LOT of data through a digitizer at the moment (3 GB/s), and have gotten fits continuously for about 10 seconds, and I am fairly certain that eliminating one or both of these copies blows the problem apart (RDMA for the digitizer takes away one transfer, accessing global memory for the Gpufit calls takes away two transfers, and my reduction is about a factor of 2.5).
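To make the transfer-time argument concrete, here is a back-of-the-envelope sketch. The 3 GB/s stream rate and the ~2.5x reduction factor come from the discussion above; the effective PCIe copy bandwidth and the copy counts are illustrative assumptions:

```python
# Rough transfer-cost model for the streaming pipeline described above.
# Only STREAM_RATE_GBPS and REDUCTION_FACTOR come from the thread;
# PCIE_RATE_GBPS and the copy counts are assumptions for illustration.

STREAM_RATE_GBPS = 3.0    # digitizer output rate, from the discussion
PCIE_RATE_GBPS = 12.0     # assumed effective host<->device copy bandwidth
REDUCTION_FACTOR = 2.5    # data shrinks ~2.5x before fitting, from the discussion

def copy_seconds_per_second(n_copies_raw: int, n_copies_reduced: int) -> float:
    """Seconds spent on PCIe copies per second of acquisition.

    n_copies_raw:     copies of the full-rate raw stream
    n_copies_reduced: copies of the post-reduction data
    """
    raw = n_copies_raw * STREAM_RATE_GBPS / PCIE_RATE_GBPS
    reduced = n_copies_reduced * (STREAM_RATE_GBPS / REDUCTION_FACTOR) / PCIE_RATE_GBPS
    return raw + reduced

# Hypothetical current pipeline: upload the raw stream once, then move the
# reduced data back to the host and up again for the fit (two extra copies).
current = copy_seconds_per_second(1, 2)
# With RDMA into the GPU and Gpufit reading device memory directly: no copies.
direct = copy_seconds_per_second(0, 0)
```

Under these assumptions the copy overhead drops from a substantial fraction of each acquisition second to zero, which is the "blows the problem apart" intuition in numbers.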
Another thought: what if you want to use RDMA to get the data to the GPU faster, bypassing the whole sequence of reading the data into CPU RAM over the PCIe bus (from a hard drive or otherwise), pinning the address, transferring, et cetera? If Gpufit only accepts host data, that would mean you fundamentally cannot use Gpufit and RDMA in the same application.
Another idea: add a preprocessing hook that lets you supply your own kernel to run before the fit. This could work as a stopgap.
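In the meantime, the kind of preprocessing reduction described above can be done on the array side before the fit. A sketch of block-averaging raw samples by a factor, written against the numpy API so the identical code runs on cupy arrays (where the result stays in GPU memory); the function name and the factor are illustrative:

```python
import numpy as np

def block_reduce(raw, factor):
    """Average consecutive groups of `factor` samples along the last axis.

    Works unchanged for numpy and cupy arrays, since cupy mirrors the
    numpy API; with a cupy input the reduced result stays on the device,
    ready to be handed to the fit without a round trip to the host.
    """
    n_kept = (raw.shape[-1] // factor) * factor  # drop any ragged tail
    trimmed = raw[..., :n_kept]
    return trimmed.reshape(*trimmed.shape[:-1], -1, factor).mean(axis=-1)

# Example on the host; with `import cupy as cp` the same call works on cp arrays.
data = np.arange(12, dtype=np.float32).reshape(2, 6)
reduced = block_reduce(data, 3)  # shape (2, 2)
```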
Hi. Fitting data that is already stored in GPU memory is already implemented in Gpufit. The docs are here: https://gpufit.readthedocs.io/en/latest/gpufit_api.html#gpufit-cuda-interface . As you found out, when working with Python you need to obtain a pointer to a GPU memory location in order to use the gpufit_cuda_interface call; Gpufit knows nothing about Python, numpy arrays, etc. The preprocessing you're talking about could be implemented as a separate routine: you can do anything you want with the data stored on the GPU before and after calling Gpufit, which is simply meant to handle the fit step. Finally, we tried real-time compilation of fit model functions, and this caused major performance bottlenecks. It would clearly be a great feature to have, and the topic may be revisited in the future.
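For anyone following along, a rough sketch of what binding that CUDA interface from Python via ctypes could look like. The parameter list below is transcribed from the linked documentation page and should be verified against the shipped gpufit.h header; the library path in the usage comment is an assumption:

```python
import ctypes as ct

# Argument types for gpufit_cuda_interface, transcribed from the Gpufit docs.
# This is a sketch, not authoritative -- check against gpufit.h.
GPUFIT_CUDA_ARGTYPES = [
    ct.c_size_t,           # n_fits
    ct.c_size_t,           # n_points
    ct.c_void_p,           # gpu_data (device pointer)
    ct.c_void_p,           # gpu_weights (device pointer, may be NULL)
    ct.c_int,              # model_id
    ct.c_float,            # tolerance
    ct.c_int,              # max_n_iterations
    ct.POINTER(ct.c_int),  # parameters_to_fit (host array)
    ct.c_int,              # estimator_id
    ct.c_size_t,           # user_info_size
    ct.c_char_p,           # user_info
    ct.c_void_p,           # gpu_fit_parameters (device pointer, in/out)
    ct.c_void_p,           # output_states (device pointer)
    ct.c_void_p,           # output_chi_squares (device pointer)
    ct.c_void_p,           # output_n_iterations (device pointer)
]

def bind_cuda_interface(lib: ct.CDLL):
    """Attach argtypes/restype to the gpufit_cuda_interface symbol."""
    fn = lib.gpufit_cuda_interface
    fn.argtypes = GPUFIT_CUDA_ARGTYPES
    fn.restype = ct.c_int  # return state; 0 indicates success per the docs
    return fn

# Usage (requires the Gpufit shared library and a CUDA device):
#   lib = ct.CDLL("libGpufit.so")  # library name/path is an assumption
#   fit = bind_cuda_interface(lib)
#   status = fit(n_fits, n_points, data_ptr, None, model_id, 1e-4, 25,
#                params_to_fit, estimator_id, 0, None, param_ptr,
#                states_ptr, chisq_ptr, iters_ptr)
```

The device pointers here would come from whatever GPU array library holds the data, e.g. a cupy array's device address.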
Wow, this is why you have to be persistent and keep asking! So either this is new, or I was just going off information found in other posts that wasn't entirely accurate. I don't see a way to look at old versions of the docs, but that would be interesting to find out. Thanks so much for the information; I can work with this. It was blowing my mind that this wasn't a feature, and it totally is.
Interesting. When you say "major performance bottlenecks", are you talking about more than an order-of-magnitude speed decrease? I think scientists are generally hungry for faster fitting routines, and almost anything beats the speed of LMfit.
I have a version now which does the following:
So I think this is a lot closer to what I want. I will do a PR for this at some point. An idea I have been toying with is to expose all of the functions via pybind11 rather than ctypes; this seems to be the tool of choice for a lot of people. It would also give you access to pytest for unit testing in Gpufit. I think the Python interface is by far the most important aspect of this for any sort of widespread adoption.
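As a sketch of what pytest-based unit tests for the Python interface could look like: the synthetic-data generator below is self-contained numpy, while the actual fit call is guarded with importorskip since it needs pygpufit and a CUDA device (the exact `gf.fit` signature and `ModelID.GAUSS_1D` usage should be checked against the pygpufit examples):

```python
import numpy as np

def make_gaussian_1d(n_fits, n_points, amp, center, width, offset,
                     noise=0.0, seed=0):
    """Synthetic 1D Gaussian curves, one row per fit -- test fixture data."""
    rng = np.random.default_rng(seed)
    x = np.arange(n_points, dtype=np.float32)
    clean = amp * np.exp(-((x - center) ** 2) / (2.0 * width ** 2)) + offset
    data = np.tile(clean, (n_fits, 1)).astype(np.float32)
    if noise:
        data += rng.normal(0.0, noise, data.shape).astype(np.float32)
    return data

def test_gaussian_fit():
    import pytest
    data = make_gaussian_1d(n_fits=100, n_points=32, amp=5.0,
                            center=15.0, width=3.0, offset=1.0, noise=0.05)
    assert data.shape == (100, 32)
    # The fit itself needs pygpufit and a CUDA device; skip cleanly otherwise.
    # Call details are a sketch -- verify against the pygpufit examples.
    gf = pytest.importorskip("pygpufit.gpufit")
    init = np.tile(np.array([4.0, 14.0, 2.5, 0.5], np.float32), (100, 1))
    params, states, chi2, n_iter, t = gf.fit(data, None,
                                             gf.ModelID.GAUSS_1D, init)
    assert np.all(states == 0)  # 0 == converged
```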
Is it possible to have arrays already stored on the GPU as cupy.ndarray objects used in the fitting routines? This is with the Python wheel.