PencilArrays + CUDA #21
So far in the shallow water model we do have pencil arrays working; we have tested them on CPUs and they seem to work well. More tests need to be done, of course. We also have the functionality of distributing the solution across different GPUs. We have a proof of concept in the shallow water model but have not done any real testing. This got us looking at how the model does on GPUs vs CPUs, and we are running some tests in the shallow water model. So far GPUs are performing very well, maybe too well? Since GPUs can do operations faster, I don't think you would need to distribute across as many of them as you do with CPUs, but even a few would be helpful when the model size gets big enough. I think this would be a great thing to do and I would like to play with this in the shallow water model. I would guess that this could extend to other models, but the only issue would be the pressure solve, as usual. Maybe @ali-ramadhan would have more to add from his perspective?
Thank you all for your interest! I'm happy to know that PencilArrays and PencilFFTs are already being used in Oceananigans for parallelisation on CPUs, and I'd be glad to help make them work on GPUs. To answer @tomchor's questions:
In principle it's not in my immediate plans, since for now I haven't personally played much with GPUs (let alone multi-GPU systems), and my main motivation is on CPUs. But, as discussed in jipolanco/PencilFFTs.jl#3, I would be very happy to add support for GPU arrays, and in particular for CuArrays.
From the PencilArrays side, the first thing to do is to determine what works and what doesn't on GPUs. PencilFFTs will also need support for CUDA FFT plans, which I guess should be easy. If I understand correctly, to build a CUDA FFT plan one just needs to call `plan_fft` on a `CuArray`.
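As a minimal illustration of that point (a sketch assuming CUDA.jl's cuFFT wrappers and the AbstractFFTs interface; not code from this thread):

```julia
# CUDA.jl plugs into the AbstractFFTs interface, so the same plan_fft call
# that builds an FFTW plan for an Array builds a cuFFT plan for a CuArray,
# purely by dispatch on the array type.
using AbstractFFTs   # provides plan_fft
using CUDA           # extends plan_fft for CuArray via cuFFT

u = CuArray(rand(ComplexF32, 64, 64, 64))
p = plan_fft(u)      # a cuFFT plan
û = p * u            # transform executed on the GPU
```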
@christophernhill That's really good news! I would love to hear more details on this. Have you tested transpositions in particular?
@jipolanco we did some very simple transpose tests and they seemed to work, which was awesome. We took the example in https://github.com/jipolanco/PencilArrays.jl#quick-start, switched the underlying arrays to GPU arrays, and things ran OK on a single rank. So not a very comprehensive test, but a good start. I can try a couple more things tomorrow.
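The kind of modification described above might look roughly like the following; the exact snippet is not shown in the thread, and the `Pencil`/`PencilArray`/`size_local` signatures are taken from the package README, so treat them as assumptions:

```julia
# Rough sketch: the README quick-start with the backing storage switched
# from CPU Arrays to CuArrays (signatures assumed, not verified here).
using MPI
using CUDA
using PencilArrays

MPI.Init()
comm = MPI.COMM_WORLD

pen = Pencil((16, 16, 16), comm)  # decompose a 16³ global array over comm
u = PencilArray(pen, CUDA.zeros(Float64, size_local(pen)...))
fill!(u, 1.0)  # note: this hits the generic path discussed later in the thread
```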
I agree, that's a very good start! A few tests that I can suggest right now: trying different decomposition dimensions, and testing `gather`.
By the way, @christophernhill's example makes me think that it would be convenient to have a higher-level constructor, `PencilArray{T}([array_type = Array], undef, pencil::Pencil)`, which would be consistent with the current `PencilArray{T}(undef, pencil)` constructor.
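Such a constructor might look roughly like this sketch (hypothetical, not yet in the package; the use of `size_local` and `MemoryOrder` to get the local block dimensions is an assumption):

```julia
using PencilArrays

# Sketch of the proposed constructor: an optional array type selects the
# backing storage, defaulting to Array.
function PencilArray{T}(::Type{A}, ::UndefInitializer, p::Pencil) where {T, A <: AbstractArray}
    data = A{T}(undef, size_local(p, MemoryOrder()))  # local block dimensions
    return PencilArray(p, data)
end

# Usage (with CUDA.jl loaded): u = PencilArray{Float64}(CuArray, undef, pen)
```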
@jipolanco I can try some dims variations and gather. I put some WIP pieces here.

I will chat with @ali-ramadhan about configuring some automated testing on a GPU machine in case this gets real. So far it still looks good. It looks like `fill!` (https://github.com/christophernhill/PencilArrays.jl/blob/b12a127275f9932dbc825574e49f1de6a85d736a/gpu-experimenting/p1.jl#L24) does some element-by-element things, e.g. see these messages that come from `fill!`: https://github.com/christophernhill/PencilArrays.jl/blob/master/out/1/rank.00/stderr. That is probably inevitable for some initializations, but for some it should be possible to have an interface that is GPU-optimized.
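A GPU-friendly path for cases like `fill!` could look something like this sketch (a hypothetical helper, not package code; a real implementation would presumably hook into `Base.fill!` by dispatching on the wrapped array type):

```julia
using CUDA
using PencilArrays

# Hypothetical helper: forward fill! to the wrapped array, so a CuArray gets
# CUDA.jl's single-kernel fill! instead of the element-by-element fallback.
function gpu_fill!(A::PencilArray, x)
    fill!(parent(A), x)  # parent(A) is the underlying Array or CuArray
    return A
end
```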
Thanks @christophernhill for the tests!
That would be great!
You seem to be right about this. Since there is no specific `fill!` method for `PencilArray`, the generic fallback is used, which works element by element.

Otherwise, from your initial tests, I'm just a bit worried about the last line in out/1/rank.00/stdout, which suggests that something went wrong in the transposition of GPU data. It would be interesting to determine where things fail exactly.
@jipolanco oops - I think I pushed the output from a test where I deliberately commented out the AyC transpose to check that it would error! Sorry about that 🔥!! With the transpose in place the results do match. I will tidy up and update a bit later when I do the gather etc. tests.
That's great then!! I'm happy to know that it just works!
@jipolanco it looks like a dimension permutation test works (updated code here: https://github.com/christophernhill/PencilArrays.jl/blob/02a648481fdd5273b4f97d18a848690599dbc3b5/gpu-experimenting/p1.jl#L38). gather() seems to have problems: it looks like a receive problem that deadlocks with a bad address. I am not sure whether it is PencilArrays or some lower level that has the problem; looking at the gather code, it's not clear to me why it should fail. I don't think gather is critical/needed for our FFT use case, so generally things still look good. Also, broadcasting to a local CPU copy seems to work (https://github.com/christophernhill/PencilArrays.jl/blob/02a648481fdd5273b4f97d18a848690599dbc3b5/gpu-experimenting/p1.jl#L76), so a hack could be to dispatch gather on CuArray so that it first does an intermediate local-to-rank CPU broadcast copy and then dispatches to a CPU gather. The current tests are on a 4-Titan-V system with a PCI backplane. The Titan V has unified virtual memory, so I thought it might work. I am going to try on a higher-end system with NVLink etc. The error messages come from gather on a GPU-memory Pencil (https://github.com/christophernhill/PencilArrays.jl/blob/02a648481fdd5273b4f97d18a848690599dbc3b5/gpu-experimenting/p1.jl#L81) when it is active.
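The staging hack proposed above could look roughly like this (a hypothetical helper, not package code; `pencil`, `parent`, and `gather` are used as in the package's documented API):

```julia
using CUDA
using PencilArrays

# Hypothetical workaround: copy GPU data to the host, rewrap it on the same
# Pencil, and reuse the existing CPU gather path over MPI.
function gather_via_cpu(A::PencilArray, root::Integer = 0)
    host = Array(parent(A))               # device -> host copy on each rank
    A_cpu = PencilArray(pencil(A), host)  # same decomposition, CPU storage
    return gather(A_cpu, root)            # existing CPU code path
end
```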
@christophernhill Awesome, apparently there's not much work to do after all.
Your latest commit seems to confirm that it's a lower-level issue.
I agree, that would be a good solution. The only thing is, I would prefer not having CUDA.jl as a hard dependency. But, taking inspiration from Oceananigans, I'm thinking that such a dispatch can be done more generically, by defining a function that maps the wrapped array type to its CPU counterpart instead of dispatching on `CuArray` directly.
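That more generic dispatch might look something like this (all names are hypothetical; the point is that the generic code never mentions `CuArray`):

```julia
using PencilArrays

# Map any wrapped array to a host copy; no GPU package is needed here, since
# the generic Array(x) constructor already copies GPU arrays to the host.
to_cpu(data::Array) = data                  # already host memory: no copy
to_cpu(data::AbstractArray) = Array(data)   # e.g. CuArray: copy to the host

function gather_generic(A::PencilArray, root::Integer = 0)
    A_cpu = PencilArray(pencil(A), to_cpu(parent(A)))
    return gather(A_cpu, root)
end
```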
@jipolanco it does look like the gather() issue is not present with MVAPICH2, only with Open MPI. So it's most likely a lower-level stack issue. Requires sounds worth a try; we can take a quick look if you like. @jipolanco it may also make sense for you to register something with https://github.com/JuliaGPU/buildkite at some point. That would give a way to potentially do some GPU CI against main PencilArrays using Buildkite. There are registration instructions at https://github.com/JuliaGPU/buildkite.
@christophernhill Of course, feel free to look into this. Note that I'm already using Requires to optionally load HDF5.jl as a parallel I/O backend, so I guess you could start from there.
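Following the existing HDF5.jl pattern, the Requires.jl hook might look like this sketch (the included file name is hypothetical; the UUID is CUDA.jl's registered package UUID):

```julia
using Requires

# In PencilArrays' top-level module: load CUDA-specific methods only when
# the user has loaded CUDA.jl, avoiding a hard dependency.
function __init__()
    @require CUDA = "052768ef-5323-5732-b1bb-66c8b64840ba" begin
        include("pencil_arrays_cuda.jl")  # GPU-specific specializations
    end
end
```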
Yes, that would make sense. If I understand correctly, for this I should transfer PencilArrays to a GitHub organisation? I guess I could transfer it to JuliaParallel, or I could create my own organisation. |
@jipolanco I believe you can register with Buildkite as an individual too; the "organization" wording may be fuzzy language. @maleadt and/or @vchuravy may be able to clarify.
It's possible to add BuildKite CI to personal repos, but it's much easier when you use one of the already set-up organizations. |
Maybe JuliaArrays? |
Thanks for the suggestion. That would definitely make sense for PencilArrays, but I'm not sure if PencilFFTs (which should also be tested on GPUs at some point) would be a good fit there. |
Hello! Let me start by thanking you for this project! It's super useful, and it's impressive that its benchmarks are better than P3DFFT's!
I'm currently working with and contributing to Oceananigans, which has started using your packages (both this one and PencilFFTs.jl) to implement some distributed functionality. @ali-ramadhan implemented that and it's been working well, but really the end goal for us is to parallelize our computations across multiple GPUs, and I think the first step is to make this package play nicely with CUDA.
So here are a couple of questions, if you don't mind:

- Is support for GPU arrays (and multi-GPU systems) in your plans for this package?
- If so, what do you think would be needed to make it work?
Thanks!
CC: @francispoulin @ali-ramadhan