Non-atomic pairwise force summation kernels. #133
I have changed the implementation so that it uses two kernels: one with a neighbor list and one without.
|
Codecov Report
Patch coverage:
Additional details and impacted files
@@ Coverage Diff @@
## master #133 +/- ##
==========================================
- Coverage 73.12% 72.25% -0.87%
==========================================
Files 35 35
Lines 5142 5212 +70
==========================================
+ Hits 3760 3766 +6
- Misses 1382 1446 +64
☔ View full report in Codecov by Sentry. |
Looks like a good start. With regards to benchmarking I would use a system of 1000-5000 atoms and use
This stuff can be hard to debug. I would use It might be worth asking on the #gpu Slack channel about sparse matrices. CUDA has sparse matrices but I guess the issue is that these don't work in kernels? I would also be interested to see the speed of using the OpenMM atom block strategy even with using the dense adjacency matrix. Another idea is to use a dense adjacency matrix, but create a sparse data structure with the first thread inside the kernel itself. You could have two shared memory int vectors for the interacting indices. Threads would then loop over those pairs. |
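The idea of building a sparse structure from the dense adjacency matrix can be sketched on the CPU as follows. This is a minimal illustration, not the PR's code: `compact_pairs` and `count_visits` are hypothetical names, and on a GPU the two index vectors would live in shared memory and be filled by the first thread of the block.

```julia
# Hypothetical helper: turn a dense Boolean neighbour matrix into two index
# vectors, keeping only the upper triangle (i < j) so each pair appears once.
function compact_pairs(eligible::AbstractMatrix{Bool})
    n = size(eligible, 1)
    idx_i, idx_j = Int32[], Int32[]
    for j in 2:n, i in 1:(j - 1)
        if eligible[i, j]
            push!(idx_i, i)
            push!(idx_j, j)
        end
    end
    return idx_i, idx_j
end

# Stand-in for the work the threads would then do: loop over the compacted
# pairs. Here we only count how often each atom is visited.
function count_visits(idx_i, idx_j, n)
    visits = zeros(Int, n)
    for k in eachindex(idx_i)
        visits[idx_i[k]] += 1
        visits[idx_j[k]] += 1
    end
    return visits
end

eligible = falses(4, 4)
eligible[1, 2] = eligible[2, 1] = true
eligible[2, 4] = eligible[4, 2] = true
idx_i, idx_j = compact_pairs(eligible)
visits = count_visits(idx_i, idx_j, 4)
```

The loop over compacted pairs does no work for empty matrix entries, which is the point of the suggestion; whether the serial compaction step pays for itself inside a kernel would need benchmarking.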
I'm not the best CUDA programmer, but I thought I'd throw this out there. I'm sure I'm missing some intricacy, as I have literally spent 0 minutes thinking about force kernels on a GPU, but the CUDA part of Molly is something I'd like to understand more. What are the downsides of using a kernel like this, besides not manually choosing the block sizes etc.? Too many kernel invocations?
function kernel(atom_idx, interacting_atoms)
    return mapreduce(j -> force(atom_idx, j), +, interacting_atoms)
end |
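For readers who want to try the suggestion, here is a small CPU-only version of the per-atom `mapreduce` idea. It is a sketch under assumptions: `toy_pair_force` is a made-up 1D force (not a Molly interaction), and `atom_force` and `partners` are illustrative names.

```julia
# Toy 1D pair force, chosen only for illustration.
toy_pair_force(coords, i, j) = coords[i] - coords[j]

# One call per atom: sum the force from every interacting partner.
atom_force(coords, i, partner_idxs) =
    mapreduce(j -> toy_pair_force(coords, i, j), +, partner_idxs; init=0.0)

coords_1d = [0.0, 1.0, 3.0]
partners = [[2, 3], [1, 3], [1, 2]]
forces_mr = [atom_force(coords_1d, i, partners[i]) for i in eachindex(coords_1d)]
```

Each pair is evaluated twice here (once from each atom), and nothing is shared between the per-atom reductions, so the net force summing to zero comes from the antisymmetric toy force rather than from any reuse of work.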
@ejmeitz Right now the kernel is essentially doing what you describe. Maybe I will try and see how mapreduce compares to the current implementation. But I think the problem with using mapreduce alone is that it is best suited to reductions where the reduced indices are mostly independent and no further optimizations can be made anyway. In our case we are dealing with N^2 interactions, where we can certainly benefit from data reuse and operation ordering. That is only possible if the kernel works at the warp level, so that we can shuffle data around while going through the force calculation and the reduction together. I am working on this right now and trying to straighten out a few bugs. I also have little experience with CUDA apart from examples; as far as I understand, that is the main motivation for starting with this barebones implementation. I am not sure how much these optimizations will help, but it will surely be worth something 😄 (unless the neighbor data is so sparse that the extra loop overhead takes over, but that can be dealt with as a separate case?) |
Benchmark results for the device: Tesla V100-PCIE-16GB
|
Btw I have access to a cluster of A40s and another of 2080s if you want to test on distributed systems or just on a different single GPU. |
Tiled Kernel
Remove inner loop over tiles from warp
Benchmark results on the
|
Simulation | Atoms | Number of Pairs | Min time (ms) | Median time (ms) | Mean time (ms) | Mean time per 10000 pairs (μs) |
---|---|---|---|---|---|---|
Approach 0 | 3072 | 4717056 | 5.193 | 5.331 | 5.339 | 11.31 |
Approach 0 f32 | 3072 | 4717056 | 5.306 | 5.466 | 5.484 | 11.63 |
Approach 1 | 3072 | 4717056 | 1.402 | 1.572 | 1.568 | 3.32 |
Approach 1 f32 | 3072 | 4717056 | 0.870 | 1.065 | 1.061 | 2.24 |
Approach 2 | 3072 | 4717056 | 2.015 | 2.213 | 2.220 | 4.71 |
Approach 2 f32 | 3072 | 4717056 | 0.502 | 0.512 | 0.513 | 1.08 |
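The pair count and the last column of the table can be checked with a few lines of arithmetic. The pair-count formula is the one used in the benchmark script in this thread; the function names are illustrative.

```julia
# Number of unique pairs among n_atoms atoms when no neighbor list is used.
n_pairs(n_atoms) = n_atoms * (n_atoms - 1) ÷ 2

# Convert a mean time in ms into μs per 10,000 pairs.
per_10k_pairs_us(mean_ms, n) = mean_ms * 1000 / n * 10_000

total_pairs = n_pairs(3072)
t_approach2_f32 = per_10k_pairs_us(0.513, total_pairs)
```

With the rounded mean of 0.513 ms this gives about 1.09 μs per 10,000 pairs, which agrees with the table's 1.08 to within rounding of the reported mean.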
Takeaway: getting rid of the internal loop gives a 2x improvement for f32, but we pay for it with worse f64 performance. I think this may be because approach 2 increases the number of blocks by a factor of n_threads / WARPSIZE, which slows down dispatching the f64 calculations to the SMs on the GPU? If so, this could be improved, if we want, by having each warp calculate 2 or 4 tiles at a time (without a loop, just the same code repeated).
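To make the tile idea concrete, here is a CPU emulation of one common tiling scheme, in which each warp handles one tile of atom pairs and the j atoms are rotated across lanes. It is a sketch under assumptions and not the kernel from this PR: the tile width `w` stands in for WARPSIZE, the 1D force is a toy, and `tiled_forces` is a hypothetical name.

```julia
toy_force_1d(xi, xj) = xi - xj  # toy 1D force, for illustration only

function tiled_forces(x::Vector{Float64}; w::Int=4)
    n = length(x)
    @assert n % w == 0 "this sketch needs the atom count to be a multiple of w"
    f = zeros(n)
    n_tiles = n ÷ w
    for ti in 1:n_tiles, tj in ti:n_tiles       # one warp per tile, upper triangle only
        for s in 0:(w - 1), lane in 1:w          # lanes would run in lockstep on a GPU
            i = (ti - 1) * w + lane
            # Rotated j index: what a warp shuffle would hand to this lane at step s.
            j = (tj - 1) * w + mod(lane - 1 + s, w) + 1
            if ti != tj || i < j                 # count each pair exactly once
                fij = toy_force_1d(x[i], x[j])
                f[i] += fij
                f[j] -= fij                      # Newton's third law
            end
        end
    end
    return f
end

x_tiled = collect(1.0:8.0) .^ 2
f_tiled = tiled_forces(x_tiled)
# Reference: plain double loop over all ordered pairs.
f_ref = [sum(x_tiled[i] - x_tiled[j] for j in eachindex(x_tiled) if j != i)
         for i in eachindex(x_tiled)]
```

In this scheme the number of tiles is `n_tiles * (n_tiles + 1) ÷ 2`, so with a tile width of 32 and 3072 atoms there would be 4656 tiles. Whether that matches the block count of approach 2 depends on launch details not shown in this thread.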
Cool, I would focus on f32 performance since force calculation is usually safe to run in f32. I know other software has severe slowdown with f64, it might be unavoidable. Am I right in interpreting that the f32 median time before this PR was 1.098 ms for no NL / 0.210 ms for NL and now is 0.512 ms for no NL? That's getting somewhere. I would aim to move to the NL version soon, particularly if there are things you have learned with the no NL case that can be used there. Can approach 2 above be applied in a performant way to the case where interactions are skipped if they are not in a dense neighbour matrix? |
In its current state the kernel is quite minimal, and the only major optimization opportunity I can see is to shuffle atom and coordinate data as well, which will certainly require some preprocessing that I haven't yet worked out how to integrate with the whole interface. The NONL version gives good motivation for a similar NL implementation, if we can think of a way to construct dense tiles out of the sparse adjacency matrix? I am not sure about this. But it would work quite naturally if we were dealing with cell lists! @jgreener64 No, the 1.098 ms is the median time for the kernel in which we simply have a for loop over all j's for a particular i (using a block stride, then a binary reduction over shared memory). For NL I haven't touched anything, so yes, that is the same. Looking at the benchmarks of the atomic version (before the PR) that I have just included, you can see this kernel is 10x faster than that. If we can achieve similar performance with a NL it would certainly be worth pursuing. EDIT: Also, the current kernel only works if the number of atoms is a multiple of |
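The earlier kernel described above (a block-stride loop over j followed by a binary reduction over shared memory) can be emulated on the CPU like this. It is a sketch, assuming a power-of-two thread count and a toy 1D force; `block_force` is a hypothetical name and the array `shared` merely stands in for GPU shared memory.

```julia
# Emulates one thread block computing the total force on atom i.
# n_threads is assumed to be a power of two so the tree reduction is exact.
function block_force(x, i; n_threads=8)
    shared = zeros(n_threads)                # stand-in for shared memory
    for tid in 1:n_threads                   # threads run concurrently on a GPU
        for j in tid:n_threads:length(x)     # block-stride loop over j
            j == i && continue
            shared[tid] += x[i] - x[j]       # toy 1D pair force
        end
    end
    stride = n_threads ÷ 2
    while stride >= 1                        # binary (tree) reduction
        for tid in 1:stride
            shared[tid] += shared[tid + stride]
        end
        stride ÷= 2
    end
    return shared[1]
end

x_block = collect(1.0:10.0)
f3 = block_force(x_block, 3)
```

On a real GPU each reduction step would need a block-level synchronization between strides; the sequential emulation hides that requirement.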
Great, even better. Yes I think the NL kernel would be based on this new no NL one, the current NL kernel is probably too simple to speed up a lot. How easy would it be to have the above approach 2 but have another input
I would try adding the |
Okay, I will try both and benchmark them 👍 |
Yes, the current benchmarks are for just the
using Molly
using BenchmarkTools
using CUDA
# Build a Lennard-Jones system on the GPU.
# nl: use a neighbor list; f32: use Float32 instead of Float64.
function setup_sim(nl::Bool, f32::Bool)
    local n_atoms = 3219
    atom_mass = f32 ? 10.0f0u"u" : 10.0u"u"
    boundary = f32 ? CubicBoundary(6.0f0u"nm") : CubicBoundary(6.0u"nm")
    starting_coords = place_atoms(n_atoms, boundary; min_dist=0.2u"nm")
    starting_velocities = [random_velocity(atom_mass, 1.0u"K") for i in 1:n_atoms]
    starting_coords_f32 = [Float32.(c) for c in starting_coords]
    starting_velocities_f32 = [Float32.(c) for c in starting_velocities]
    simulator = VelocityVerlet(dt=f32 ? 0.02f0u"ps" : 0.02u"ps")

    # Defaults for the case without a neighbor list
    neighbor_finder = NoNeighborFinder()
    cutoff = DistanceCutoff(f32 ? 1.0f0u"nm" : 1.0u"nm")
    pairwise_inters = (LennardJones(use_neighbors=false, cutoff=cutoff),)
    if nl
        neighbor_finder = DistanceNeighborFinder(
            eligible=CuArray(trues(n_atoms, n_atoms)),
            n_steps=10,
            dist_cutoff=f32 ? 1.5f0u"nm" : 1.5u"nm",
        )
        pairwise_inters = (LennardJones(use_neighbors=true, cutoff=cutoff),)
    end

    # Move coordinates, velocities and atoms to the GPU
    coords = CuArray(f32 ? starting_coords_f32 : starting_coords)
    velocities = CuArray(f32 ? starting_velocities_f32 : starting_velocities)
    atoms = CuArray([Atom(charge=f32 ? 0.0f0 : 0.0, mass=atom_mass, σ=f32 ? 0.2f0u"nm" : 0.2u"nm",
                          ϵ=f32 ? 0.2f0u"kJ * mol^-1" : 0.2u"kJ * mol^-1") for i in 1:n_atoms])
    sys = System(
        atoms=atoms,
        coords=coords,
        boundary=boundary,
        velocities=velocities,
        pairwise_inters=pairwise_inters,
        neighbor_finder=neighbor_finder,
    )
    return sys, simulator
end
runs = [
("GPU NONL" , [false, false]),
("GPU NONL f32" , [false, true]),
("GPU NL" , [true , false]),
("GPU NL f32", [true , true]),
]
for (name, args) in runs
    println("*************** Run: $name ************************")
    sys, sim = setup_sim(args...)
    n_atoms = length(sys)
    neighbors = find_neighbors(sys)
    # Without a neighbor list, every unique pair is evaluated
    nbrs = isnothing(neighbors) ? n_atoms * (n_atoms - 1) ÷ 2 : length(neighbors)
    println("> Total Pairs = $nbrs")
    f = forces(sys, neighbors)  # run once before benchmarking
    b = @benchmark CUDA.@sync forces($sys, $neighbors)
    display(b)
end
I will also try to set up Nsight Compute, but if you could take a quick look at the profiles for the neighbor list kernel as well, that would be great. |
Here's the GPU NL F32 output. A lot more kernels were generated than I expected. The first 3 were generated by If you unzip the file below and double click on one of the kernels it will dump all sorts of information and things that could be optimized. I looked quickly at the pairwise_force_kernel and it's only using maybe 30% of my GPU's resources. It looks like there are a lot of un-coalesced accesses and warp divergence. |
Thanks a lot @ejmeitz. I will go through these and try to see if fixing these problems can improve things. |
So... out of all the different approaches for the NL kernel, everything has almost the same performance, even the atomic approach. I think this suggests that the computation makes little difference and that performance depends almost entirely on memory reads and writes. The profiles show the same thing. I am not really sure how to deal with this memory bottleneck. Even using shared memory to store the atoms during the computation does not help. I had a look at the generated PTX code, which shows that many intermediate results are stored in local memory instead of in registers, and that using shared memory does not get rid of the majority of memory stalls. Maybe there is some way to deal with this, but I will have to look into it more. For now, it would be great if the NONL kernel could be merged at least, and I will work on the NL kernel from scratch in a separate PR with my newly gained experience :) According to the benchmarks for different system sizes, the NONL kernel is 20-30 times faster than the atomic approach, so that's quite an improvement, and any further improvement faces the same issue: memory reads and writes are so slow that computation time is negligible in comparison. |
Sounds good, I can try and review this PR next week. You will need to rebase/merge. The no NL kernel is definitely an improvement. One thing that would be nice to add is a comment by the kernel with a description/diagram of the approach used. I don't know how the kernel will play with Enzyme, but I can look into any problems there myself. Well done for doggedly pursuing the NL kernel, I know how frustrating it can be. Something to consider is starting from the other direction, completely ablating everything, just loop over the neighbouring pairs and return zero forces. I guess that should be fast? Then see what minimal addition causes the slowdown, make a self-contained example and discuss it here, on Slack or on the CUDA.jl issues. |
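The ablation strategy suggested above could start from something like the following CPU sketch, where the pair function is a plug-in so that parts of the real computation can be switched back on one at a time. All names here are hypothetical, and a GPU version would be timed with `CUDA.@sync` as in the benchmark script earlier in this thread.

```julia
# Loop over neighbour pairs with a pluggable pair function.
function sum_pair_forces(pair_fn, x, pair_list)
    f = zeros(length(x))
    for (i, j) in pair_list
        fij = pair_fn(x[i], x[j])
        f[i] += fij
        f[j] -= fij
    end
    return f
end

zero_force(xi, xj) = 0.0        # fully ablated baseline: only the loop and memory traffic remain
toy_force(xi, xj) = xi - xj     # one step up; illustrative only

xs_abl = [0.0, 1.0, 3.0]
pair_list = [(1, 2), (1, 3), (2, 3)]
f_zero = sum_pair_forces(zero_force, xs_abl, pair_list)
f_toy = sum_pair_forces(toy_force, xs_abl, pair_list)
```

Timing the zero-force variant first gives a floor for the cost of the loop and the memory accesses alone, which is the quantity the discussion above suspects is dominating.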
Thanks @jgreener64, I will rebase the incoming branch. Should I also make similar changes to the pairwise potential energy kernel? |
If it's easy to change the potential energy kernel then do. If it will take up anything more than a short time then focus on the NL kernel and I will implement the potential energy kernel later as a way to learn what is going on. |
I am not sure why the CI runs are failing. Everything works for me locally. |
This looks good bar the failure when CUDA is not available. That will be due to Do the GPU tests pass for you? I can run them locally tomorrow too. Long term we should get set up on the Julia GPU CI infrastructure to test this kind of thing. |
Yes the problem seems to lie with The GPU tests also pass for me locally. |
I am also seeing an occasional failure on the Monte Carlo anisotropic barostat GPU tests but I'm not sure it is due to this change, do you see that?
May as well remove these lines if they are not needed. |
The barostat tests fail intermittently on my local machine on the CPU. I'm not sure that is related to the changes made in this PR. |
Benchmark results before any changes
Job Properties
JULIA_NUM_THREADS => 16
Results
Below is a table of this job's results, obtained by running the benchmarks.
The values listed in the ID column have the structure [parent_group, child_group, ..., key], and can be used to index into the BaseBenchmarks suite to retrieve the corresponding benchmarks.
The percentages accompanying time and memory values in the below table are noise tolerances. The "true"
time/memory value for a given benchmark is expected to fall within this percentage of the reported value.
An empty cell means that the value was zero.
["interactions", "Coulomb energy"]
["interactions", "Coulomb force"]
["interactions", "HarmonicBond energy"]
["interactions", "HarmonicBond force"]
["interactions", "LennardJones energy"]
["interactions", "LennardJones force"]
["protein", "CPU parallel NL"]
["simulation", "CPU NL"]
["simulation", "CPU f32 NL"]
["simulation", "CPU f32"]
["simulation", "CPU parallel NL"]
["simulation", "CPU parallel f32 NL"]
["simulation", "CPU parallel f32"]
["simulation", "CPU parallel"]
["simulation", "CPU"]
["simulation", "GPU NL"]
["simulation", "GPU f32 NL"]
["simulation", "GPU f32"]
["simulation", "GPU"]
["spatial", "vector"]
["spatial", "vector_1D"]
Benchmark Group List
Here's a list of all the benchmark groups executed by this job:
["interactions"]
["protein"]
["simulation"]
["spatial"]
Julia versioninfo