
KernelAbstractions support #147
Open: wants to merge 2 commits into base: master
Conversation

@leios (Contributor) commented Sep 7, 2023

The KernelAbstractions branch now compiles, so I thought I would put forward a quick draft PR while I figure out all the runtime bugs.

Notes:

  1. This builds off of adding preliminary AMDGPU support #99 and should replace it entirely
  2. CUDA has been removed and replaced with KernelAbstractions and GPUArrays. As an important note here, GPUArrays is not strictly necessary except to replicate the behavior of the boolean GPU flag (ie isa(a, AbstractGPUArray)); see the sketch after this list.
  3. If this is merged, other GPU types (Metal, AMD, Intel) will also be supported, but I can only test on AMD (and maybe Metal if I can get someone to try it with a mac).
  4. I need to add in the changes from Non-atomic pairwise force summation kernels. #133. If there is something we are missing on the KernelAbstractions side, I can try to add it in, but I think we are good to go.
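To make item 2 concrete, here is a minimal sketch of the flag replacement, assuming only that GPUArrays is available; on_gpu is an illustrative name, not a function from this PR:

using GPUArrays: AbstractGPUArray

# Replaces the old boolean `gpu` flag: ask the array itself whether it lives on a GPU.
on_gpu(a::AbstractArray) = isa(a, AbstractGPUArray)

on_gpu(rand(Float32, 4))          # false for a plain Array
# on_gpu(CUDA.rand(Float32, 4))   # true for a CuArray, ROCArray, etc. (all <: AbstractGPUArray)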

@leios (Contributor Author) commented Sep 7, 2023

Ah, I guess while I'm here, I'll briefly explain the differences with CUDA syntactically:

  1. Indexing is easier: @index(Global / Group / Local, Linear / NTuple / CartesianIndex) vs (blockIdx().x - 1) * blockDim().x + threadIdx().x for CUDA
  2. Kernels run off of an ndrange for the range of elements (OpenCL inspired syntax)
  3. Launching kernels requires configuration with a backend, see: https://github.com/leios/Molly.jl/blob/KA_support/src/kernels.jl#L21
  4. Certain functions now execute on the backend CUDA.zeros(...) -> zeros(backend, args...)

The tricky thing about this PR was removing the CUDA dependency outside of the kernels. There is still one call in zygote.jl I gotta figure out: https://github.com/leios/Molly.jl/blob/KA_support/src/zygote.jl#L698
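For anyone skimming, here is a minimal sketch of the pattern described above (illustrative only, not code from this PR; the kernel name and the workgroup size of 256 are made up):

using KernelAbstractions

@kernel function scale_kernel!(y, x, a)
    i = @index(Global, Linear)          # replaces (blockIdx().x - 1) * blockDim().x + threadIdx().x
    @inbounds y[i] = a * x[i]
end

function scale!(y, x, a)
    backend = get_backend(x)                # CPU(), CUDABackend(), ROCBackend(), ...
    kernel! = scale_kernel!(backend, 256)   # configure the kernel with a backend and workgroup size
    kernel!(y, x, a; ndrange=length(x))     # ndrange gives the range of elements to cover
    KernelAbstractions.synchronize(backend)
    return y
end

# Backend-generic allocation in the same spirit as CUDA.zeros(...) -> zeros(backend, args...):
# KernelAbstractions.zeros(backend, Float32, 1024)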

@jgreener64 (Collaborator)

Great work so far. Making the code compatible with generic array types is a nice change, and having the kernels work on different devices would be a selling point of the package.

I would be interested to see the performance of the kernels compared to the CUDA versions. Also whether it plays nicely with Enzyme. Good luck with the runtime errors.

@leios (Contributor Author) commented Sep 8, 2023

I think I can finish this up today or else early next week (emphasis on think), but to quickly answer the questions:

  1. KA essentially just writes vendor-specific code (ie CUDA) from the generic code input, so if we don't have identical performance to CUDA, then that's a bug. I'll do the performance testing similar to Non-atomic pairwise force summation kernels. #133 once the code is cleaned up.
  2. Enzyme should also not be an issue; however, there are some reports of error handling being an issue: Enzyme + KA Stalls on Error instead of reporting it EnzymeAD/Enzyme.jl#365

@jgreener64 (Collaborator)

Great. Not urgent, but how well does KernelAbstractions.jl deal with warp-level code, e.g. warpsize() and sync_warp()?

@leios (Contributor Author) commented Sep 8, 2023

That's a good question. We can probably expose the APIs available from CUDA, but I am not sure how AMDGPU deals with these. We would also just need to figure out what that corresponds to on parallel CPU.

I think these are the tools we need: https://rocm.docs.amd.com/projects/rocPRIM/en/latest/warp_ops/index.html
So they are available; it's just a matter of exposing them in KA and figuring out what they correspond to on different backends.

Ah, as an important note (that I somehow failed to mention before), an advantage of KA is that it also provides a parallel CPU implementation, so the kernels can be written once and executed everywhere. I didn't do that in this PR because that brings up design questions related to Molly internals.
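As a usage note on the write-once point, and under the same assumptions as the scale! sketch further up the thread, swapping the array type is all it takes to run the identical kernel on the CPU:

x = rand(Float32, 1_000)
y = similar(x)
scale!(y, x, 2.0f0)   # get_backend(x) is CPU() for a plain Array, so the same kernel runs on the CPU (multithreaded when Julia has threads)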

@jgreener64 (Collaborator)

I didn't do that in this PR because that brings up design questions related to Molly internals.

Yeah we can discuss that after this PR. I would be okay with switching if there was no performance hit.

Judging from discussion on the linked PR, there is currently no warp support in KA. It may be necessary to leave that CUDA kernel in and have a separate KA kernel for other backends until warp support comes to KA.

@leios (Contributor Author) commented Sep 9, 2023

Ok, so a couple of quick notes here:

  1. There are a few host calls that are not yet supported by AMDGPU (such as findall). My understanding was that such calls would eventually be ported to GPUArrays, but I don't think that has happened yet. Note that some of the stalling here is because we are waiting to get KA into GPUArrays (Transition GPUArrays to KernelAbstractions JuliaGPU/GPUArrays.jl#451). At least for findall, the kernel is not that complex: https://github.com/JuliaGPU/CUDA.jl/blob/master/src/indexing.jl#L23, so we could put it into AMDGPU or something for now; however, we are stuck on an older version of AMDGPU due to some package conflicts. The quick fix would be to do it the ol' fashioned way and just stick the necessary kernels in Molly under a file like kernel_hacks.jl or something. Such issues were also what stalled adding preliminary AMDGPU support #99.
  2. Non-atomic pairwise force summation kernels. #133 seems to only use warpsize and warp_sync for warp-level semantics. The KA kernel would probably get the warpsize on the host and then pass it in as a parameter (a rough sketch follows this list). warp_sync is a bit more interesting because, well, at least in the old days warps didn't need any synchronizing. It seems that things changed in Volta and most people missed the memo. Because of this, the easiest thing to do would be to keep the CUDA dependency for that one kernel. We could also add warp-level semantics to KA, but that would take some time to propagate to all the independent GPU APIs and (as mentioned in 1) we are kinda stuck on older versions of AMDGPU and CUDA because of compatibility with other packages.
  3. I am realizing that there is a greater conflict with this PR. Namely, I don't know if I have the bandwidth to do any sort of maintenance on Molly after this PR is in. I don't know if it's fair to ask you to merge 1000 lines of code with a new API and then leave. On the other hand, getting this to work on AMD would be great and really useful. Let me think on that.
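A rough sketch of what item 2 could look like (hedged; tile_kernel! is an illustrative name and its body is a stand-in, not the force kernel from #133):

using CUDA, KernelAbstractions

@kernel function tile_kernel!(out, x, warp_sz)
    i = @index(Global, Linear)
    @inbounds out[i] = x[i] + warp_sz   # stand-in body; a real kernel would use warp_sz for tiling decisions
end

backend = CUDABackend()
warp_sz = Float32(CUDA.warpsize(CUDA.device()))   # queried on the host, 32 on current NVIDIA GPUs
x = CUDA.rand(Float32, 1024)
out = similar(x)
tile_kernel!(backend, 256)(out, x, warp_sz; ndrange=length(x))
KernelAbstractions.synchronize(backend)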

@jgreener64 (Collaborator)

Because of this, the easiest thing to do would be to keep the CUDA dependency for that one kernel.

That is okay.

I don't know if it's fair to ask you to merge 1000 lines of code with a new API and then leave.

I wouldn't worry about this. Currently I only merge stuff that I am able to maintain, or where I think I can skill up to the point of maintaining it. The changes here seem reasonable and worth merging once any errors and performance regressions are fixed. There is a wider question about whether KernelAbstractions.jl will continue to be maintained compared to CUDA.jl, but it seems to have decent traction now.

@leios (Contributor Author) commented Sep 10, 2023

Yeah, the plan is for KA to be used even within GPUArrays, so it's not going anywhere anytime soon. Speaking of which, the "correct" course of action for KA in Molly would be to get KA into GPUArrays first and then use that to implement any missing features at the GPUArrays level.

Would it be better for me to separate this PR then? Maybe one doing the generic Array stuff and then another with the KA support?

@jgreener64 (Collaborator)

I would try and get this PR working as is. Only if that becomes difficult would it be worth splitting out and merging the generic array support.

If KA is here for the long haul then there is a benefit to switching the kernels even if only CUDA works currently. Because then when changes happen elsewhere, AMDGPU will work without any changes required in Molly.

codecov bot commented Jun 28, 2024

Codecov Report

Attention: Patch coverage is 14.31981% with 359 lines in your changes missing coverage. Please review.

Project coverage is 74.99%. Comparing base (795a2bc) to head (28e7381).
Report is 11 commits behind head on master.

Files with missing lines               Patch %   Lines
src/kernels.jl                          1.04%    190 Missing ⚠️
ext/MollyCUDAExt.jl                     4.16%     92 Missing ⚠️
src/interactions/implicit_solvent.jl    6.12%     46 Missing ⚠️
src/neighbors.jl                       28.57%     10 Missing ⚠️
src/spatial.jl                         27.27%      8 Missing ⚠️
src/energy.jl                          25.00%      3 Missing ⚠️
src/force.jl                           25.00%      3 Missing ⚠️
ext/MollyCUDAEnzymeExt.jl               0.00%      2 Missing ⚠️
ext/MollyPythonCallExt.jl               0.00%      2 Missing ⚠️
src/types.jl                           93.54%      2 Missing ⚠️
... and 1 more
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #147      +/-   ##
==========================================
+ Coverage   74.32%   74.99%   +0.66%     
==========================================
  Files          35       38       +3     
  Lines        5028     4998      -30     
==========================================
+ Hits         3737     3748      +11     
+ Misses       1291     1250      -41     


src/chain_rules.jl (outdated review thread, resolved)
@leios (Contributor Author) commented Sep 30, 2024

Getting around to this and noticed a bunch of segfaults in the CPU tests. I then found that there's a strange conflict between AMDGPU and Molly. Even on the master branch, this script will create a segfault:

using Molly

n_atoms = 100
atom_mass = 10.0f0u"g/mol"
boundary = CubicBoundary(2.0f0u"nm")
temp = 100.0f0u"K"
cpu_coords = place_atoms(n_atoms, boundary; min_dist=0.3u"nm")
cpu_atoms = Array([Atom(mass=atom_mass, σ=0.3f0u"nm", ϵ=0.2f0u"kJ * mol^-1") for i in 1:n_atoms])
cpu_velocities = Array([random_velocity(atom_mass, temp) for i in 1:n_atoms])
cpu_simulator = VelocityVerlet(dt=0.002f0u"ps")

cpu_sys = System(
    atoms=cpu_atoms,
    coords=cpu_coords,
    boundary=boundary,
    velocities=cpu_velocities,
    pairwise_inters=(LennardJones(),),
    loggers=(
        temp=TemperatureLogger(typeof(1.0f0u"K"), 10),
        coords=CoordinateLogger(typeof(1.0f0u"nm"), 10),
    ),
)

simulate!(deepcopy(cpu_sys), cpu_simulator, 20) # Compile function
simulate!(cpu_sys, cpu_simulator, 2000)

But only if AMDGPU is loaded before include("cpu.jl"). Not sure how to go about debugging this one, but writing it down so it is documented somewhere. The segfault:


julia> include("tmp/cpu.jl")
System with 100 atoms, boundary CubicBoundary{Quantity{Float32, 𝐋, Unitful.FreeUnits{(nm,), 𝐋, nothing}}}(Quantity{Float32, 𝐋, Unitful.FreeUnits{(nm,), 𝐋, nothing}}[2.0f0 nm, 2.0f0 nm, 2.0f0 nm])

julia> 
[leios@noema Molly.jl]$ julia --project -t 12
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.10.2 (2024-03-01)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

julia> using AMDGPU

julia> include("tmp/cpu.jl")
[1727708527.809644] [noema:37885:0]        spinlock.c:29   UCX  WARN  ucs_recursive_spinlock_destroy() failed: busy
[1727708527.809644] [noema:37885:1]        spinlock.c:29   UCX  WARN  ucs_recursive_spinlock_destroy() failed: busy
[noema:37885:0:37894] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x756d3781c008)
[noema:37885:1:37897] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x756d3781c008)
[1727708527.809644] [noema:37885:3]        spinlock.c:29   UCX  WARN  ucs_recursive_spinlock_destroy() failed: busy
[noema:37885:3:37893] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x756d3781c008)
[1727708527.809649] [noema:37885:2]           debug.c:1297 UCX  WARN  ucs_debug_disable_signal: signal 1 was not set in ucs
[noema:37885:2:37892] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x756d3781c008)
[1727708527.809644] [noema:37885:4]        spinlock.c:29   UCX  WARN  ucs_recursive_spinlock_destroy() failed: busy
[noema:37885:4:37889] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x756d3781c008)
[noema:37885:6:37890] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x756d3781c008)
[1727708527.809728] [noema:37885:0]        spinlock.c:29   UCX  WARN  ucs_recursive_spinlock_destroy() failed: busy
[noema:37885:7:37888] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x756d3781c008)
[noema:37885:8:37891] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x756d3781c008)
[noema:37885:9:37895] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x756d3781c008)
[1727708527.809730] [noema:37885:1]        spinlock.c:29   UCX  WARN  ucs_recursive_spinlock_destroy() failed: busy
[1727708527.809735] [noema:37885:5]        spinlock.c:29   UCX  WARN  ucs_recursive_spinlock_destroy() failed: busy
[noema:37885:5:37898] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x756d3781c008)
[1727708527.809741] [noema:37885:3]        spinlock.c:29   UCX  WARN  ucs_recursive_spinlock_destroy() failed: busy
[noema:37885:10:37896] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x756d3781c008)
==== backtrace (tid:  37894) ====
 0 0x000000000004d212 ucs_event_set_fd_get()  ???:0
 1 0x000000000004d3dd ucs_event_set_fd_get()  ???:0
 2 0x000000000003d1d0 __sigaction()  ???:0
 3 0x00000000000845d4 ijl_process_events()  /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/jl_uv.c:277
 4 0x0000000000097f8d ijl_task_get_next()  /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/partr.c:524
 5 0x0000000001cb0bd8 julia_poptask_75383()  ./task.jl:985
 6 0x0000000001cb0bd8 julia_poptask_75383()  ./task.jl:987
 7 0x0000000000997f72 julia_wait_74665()  ./task.jl:994
 8 0x0000000000962c1c julia_task_done_hook_75296()  ./task.jl:675
 9 0x0000000001443a97 jfptr_task_done_hook_75297.1()  :0
10 0x0000000000046a0e _jl_invoke()  /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gf.c:2894
11 0x0000000000069c17 jl_apply()  /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/julia.h:1982
12 0x0000000000069d9e start_task()  /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/task.c:1249
=================================

[37885] signal (11.-6): Segmentation fault
in expression starting at /home/leios/projects/CESMIX/Molly.jl/tmp/cpu.jl:25
ijl_process_events at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/jl_uv.c:277
ijl_task_get_next at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/partr.c:524
poptask at ./task.jl:985
wait at ./task.jl:994
task_done_hook at ./task.jl:675
jfptr_task_done_hook_75297.1 at /home/leios/builds/julia-1.10.2/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gf.c:3076
jl_apply at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/julia.h:1982 [inlined]
jl_finish_task at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/task.c:320
start_task at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/task.c:1249
Allocations: 39536048 (Pool: 39464158; Big: 71890); GC: 46
Segmentation fault (core dumped)

st:

(Molly) pkg> st
Project Molly v0.21.1
Status `~/projects/CESMIX/Molly.jl/Project.toml`
  [a9b6321e] Atomix v0.1.0
⌅ [a963bdd2] AtomsBase v0.3.5
  [a3e0e189] AtomsCalculators v0.2.2
  [de9282ab] BioStructures v4.2.0
⌃ [052768ef] CUDA v5.4.3
  [69e1c6dd] CellListMap v0.9.6
  [082447d4] ChainRules v1.71.0
  [d360d2e6] ChainRulesCore v1.25.0
  [46823bd8] Chemfiles v0.10.41
  [861a8166] Combinatorics v1.0.2
  [864edb3b] DataStructures v0.18.20
  [b4f34e82] Distances v0.10.11
  [31c24e10] Distributions v0.25.112
⌅ [7da242da] Enzyme v0.12.36
  [8f5d6c58] EzXML v1.2.0
  [cc61a311] FLoops v0.2.2
  [f6369f11] ForwardDiff v0.10.36
  [86223c79] Graphs v1.12.0
  [5ab0869b] KernelDensity v0.6.9
  [b8a86587] NearestNeighbors v0.4.20
  [7b2266bf] PeriodicTable v1.2.1
  [189a3867] Reexport v1.2.2
⌅ [64031d72] SimpleCrystals v0.2.0
  [90137ffa] StaticArrays v1.9.7
  [1986cc42] Unitful v1.21.0
  [a7773ee8] UnitfulAtomic v1.0.0
  [f31437dd] UnitfulChainRules v0.1.2
  [d80eeb9a] UnsafeAtomicsLLVM v0.2.1
  [e88e6eb3] Zygote v0.6.71
  [37e2e46d] LinearAlgebra
  [9a3f8284] Random
  [2f01184e] SparseArrays v1.10.0
  [10745b16] Statistics v1.10.0
Info Packages marked with ⌃ and ⌅ have new versions available. Those with ⌃ may be upgradable, but those with ⌅ are restricted by compatibility constraints from upgrading. To see why use `status --outdated`

Note that using a single thread "fixes" the issue. It seems to be a UCX / MPI issue, but I am not loading them and neither is in the Manifest.

@vchuravy commented Sep 30, 2024

@leios (Contributor Author) commented Sep 30, 2024

The fix mentioned there seems to work:

[leios@noema Molly.jl]$ export UCX_ERROR_SIGNALS="SIGILL,SIGBUS,SIGFPE"
[leios@noema Molly.jl]$ julia --project -t 12
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.10.2 (2024-03-01)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

julia> using AMDGPU

julia> include("tmp/cpu.jl")
System with 100 atoms, boundary CubicBoundary{Quantity{Float32, 𝐋, Unitful.FreeUnits{(nm,), 𝐋, nothing}}}(Quantity{Float32, 𝐋, Unitful.FreeUnits{(nm,), 𝐋, nothing}}[2.0f0 nm, 2.0f0 nm, 2.0f0 nm])

julia> 

Note that st -m shows no MPI or UCX:

(Molly) pkg> st -m
Project Molly v0.21.1
Status `~/projects/CESMIX/Molly.jl/Manifest.toml`
  [621f4979] AbstractFFTs v1.5.0
  [7d9f7c33] Accessors v0.1.38
  [79e6a3ab] Adapt v4.0.4
  [66dad0bd] AliasTables v1.1.3
  [dce04be8] ArgCheck v2.3.0
  [ec485272] ArnoldiMethod v0.4.0
  [a9b6321e] Atomix v0.1.0
⌅ [a963bdd2] AtomsBase v0.3.5
  [a3e0e189] AtomsCalculators v0.2.2
  [13072b0f] AxisAlgorithms v1.1.0
  [ab4f0b2a] BFloat16s v0.5.0
  [198e06fe] BangBang v0.4.3
  [9718e550] Baselet v0.1.1
  [47718e42] BioGenerics v0.1.5
  [de9282ab] BioStructures v4.2.0
  [3c28c6f8] BioSymbols v5.1.3
  [fa961155] CEnum v0.5.0
⌃ [052768ef] CUDA v5.4.3
  [1af6417a] CUDA_Runtime_Discovery v0.3.5
  [69e1c6dd] CellListMap v0.9.6
  [082447d4] ChainRules v1.71.0
  [d360d2e6] ChainRulesCore v1.25.0
  [46823bd8] Chemfiles v0.10.41
  [944b1d66] CodecZlib v0.7.6
  [3da002f7] ColorTypes v0.11.5
  [5ae59095] Colors v0.12.11
  [861a8166] Combinatorics v1.0.2
  [bbf7d656] CommonSubexpressions v0.3.1
  [34da2185] Compat v4.16.0
  [a33af91c] CompositionsBase v0.1.2
  [187b0558] ConstructionBase v1.5.8
  [6add18c4] ContextVariablesX v0.1.3
  [a8cc5b0e] Crayons v4.1.1
  [9a962f9c] DataAPI v1.16.0
  [a93c6f00] DataFrames v1.7.0
  [864edb3b] DataStructures v0.18.20
  [e2d170a0] DataValueInterfaces v1.0.0
  [244e2a9f] DefineSingletons v0.1.2
  [163ba53b] DiffResults v1.1.0
  [b552c78f] DiffRules v1.15.1
  [b4f34e82] Distances v0.10.11
  [31c24e10] Distributions v0.25.112
  [ffbed154] DocStringExtensions v0.9.3
⌅ [7da242da] Enzyme v0.12.36
⌅ [f151be2c] EnzymeCore v0.7.8
  [e2ba6199] ExprTools v0.1.10
  [8f5d6c58] EzXML v1.2.0
  [7a1cc6ca] FFTW v1.8.0
  [cc61a311] FLoops v0.2.2
  [b9860ae5] FLoopsBase v0.1.1
  [1a297f60] FillArrays v1.13.0
  [53c48c17] FixedPointNumbers v0.8.5
  [1fa38f19] Format v1.3.7
  [f6369f11] ForwardDiff v0.10.36
  [0c68f7d7] GPUArrays v10.3.1
  [46192b85] GPUArraysCore v0.1.6
⌅ [61eb1bfa] GPUCompiler v0.26.7
  [86223c79] Graphs v1.12.0
  [34004b35] HypergeometricFunctions v0.3.24
  [7869d1d1] IRTools v0.4.14
  [d25df0c9] Inflate v0.1.5
  [22cec73e] InitialValues v0.3.1
  [842dd82b] InlineStrings v1.4.2
  [a98d9a8b] Interpolations v0.15.1
  [3587e190] InverseFunctions v0.1.17
  [41ab1584] InvertedIndices v1.3.0
  [92d709cd] IrrationalConstants v0.2.2
  [82899510] IteratorInterfaceExtensions v1.0.0
  [692b3bcd] JLLWrappers v1.6.0
  [b14d175d] JuliaVariables v0.2.4
⌃ [63c18a36] KernelAbstractions v0.9.26
  [5ab0869b] KernelDensity v0.6.9
⌅ [929cbde3] LLVM v8.1.0
  [8b046642] LLVMLoopInfo v1.0.0
  [b964fa9f] LaTeXStrings v1.3.1
  [2ab3a3ac] LogExpFunctions v0.3.28
  [d8e11817] MLStyle v0.4.17
  [1914dd2f] MacroTools v0.5.13
  [128add7d] MicroCollections v0.2.0
  [e1d29d7a] Missings v1.2.0
  [5da4648a] NVTX v0.3.4
  [77ba4419] NaNMath v1.0.2
  [71a1bf82] NameResolution v0.1.5
  [b8a86587] NearestNeighbors v0.4.20
  [d8793406] ObjectFile v0.4.2
  [6fe1bfb0] OffsetArrays v1.14.1
  [bac558e1] OrderedCollections v1.6.3
  [90014a1f] PDMats v0.11.31
  [d96e819e] Parameters v0.12.3
  [7b2266bf] PeriodicTable v1.2.1
  [2dfb63ee] PooledArrays v1.4.3
  [aea7be01] PrecompileTools v1.2.1
  [21216c6a] Preferences v1.4.3
  [8162dcfd] PrettyPrint v0.2.0
  [08abe8d2] PrettyTables v2.4.0
  [92933f4c] ProgressMeter v1.10.2
  [43287f4e] PtrArrays v1.2.1
  [1fd47b50] QuadGK v2.11.1
  [74087812] Random123 v1.7.0
  [e6cf234a] RandomNumbers v1.6.0
  [c84ed2f1] Ratios v0.4.5
  [c1ae055f] RealDot v0.1.0
  [3cdcf5f2] RecipesBase v1.3.4
  [189a3867] Reexport v1.2.2
  [ae029012] Requires v1.3.0
  [79098fc4] Rmath v0.8.0
  [6c6a2e73] Scratch v1.2.1
  [91c51154] SentinelArrays v1.4.5
  [efcf1570] Setfield v1.1.1
⌅ [64031d72] SimpleCrystals v0.2.0
  [699a6c99] SimpleTraits v0.9.4
  [a2af1166] SortingAlgorithms v1.2.1
  [dc90abb0] SparseInverseSubset v0.1.2
  [276daf66] SpecialFunctions v2.4.0
  [171d559e] SplittablesBase v0.1.15
  [90137ffa] StaticArrays v1.9.7
  [1e83bf80] StaticArraysCore v1.4.3
  [82ae8749] StatsAPI v1.7.0
  [2913bbd2] StatsBase v0.34.3
  [4c63d2b9] StatsFuns v1.3.2
  [892a3eda] StringManipulation v0.4.0
  [09ab397b] StructArrays v0.6.18
  [53d494c1] StructIO v0.3.1
  [3783bdb8] TableTraits v1.0.1
  [bd369af6] Tables v1.12.0
  [1c621080] TestItems v1.0.0
  [a759f4b9] TimerOutputs v0.5.24
  [3bb67fe8] TranscodingStreams v0.11.2
  [28d57a85] Transducers v0.4.82
  [3a884ed6] UnPack v1.0.2
  [1986cc42] Unitful v1.21.0
  [a7773ee8] UnitfulAtomic v1.0.0
  [f31437dd] UnitfulChainRules v0.1.2
  [013be700] UnsafeAtomics v0.2.1
  [d80eeb9a] UnsafeAtomicsLLVM v0.2.1
  [efce3f68] WoodburyMatrices v1.0.0
  [e88e6eb3] Zygote v0.6.71
  [700de1a5] ZygoteRules v0.2.5
⌅ [4ee394cb] CUDA_Driver_jll v0.9.2+0
⌅ [76a88914] CUDA_Runtime_jll v0.14.1+0
  [78a364fa] Chemfiles_jll v0.10.4+0
⌅ [7cc45869] Enzyme_jll v0.0.148+0
  [f5851436] FFTW_jll v3.3.10+1
  [1d5cc7b8] IntelOpenMP_jll v2024.2.1+0
  [9c1d0b0a] JuliaNVTXCallbacks_jll v0.2.1+0
⌅ [dad2f222] LLVMExtra_jll v0.0.31+0
  [94ce4f54] Libiconv_jll v1.17.0+0
  [856f044c] MKL_jll v2024.2.0+0
  [e98f9f5b] NVTX_jll v3.1.0+2
  [efe28fd5] OpenSpecFun_jll v0.5.5+0
  [f50d1b31] Rmath_jll v0.5.1+0
  [02c8fc9c] XML2_jll v2.13.3+0
  [1317d2d5] oneTBB_jll v2021.12.0+0
  [0dad84c5] ArgTools v1.1.1
  [56f22d72] Artifacts
  [2a0f44e3] Base64
  [ade2ca70] Dates
  [8ba89e20] Distributed
  [f43a241f] Downloads v1.6.0
  [7b1f6079] FileWatching
  [9fa8497b] Future
  [b77e0a4c] InteractiveUtils
  [4af54fe1] LazyArtifacts
  [b27032c2] LibCURL v0.6.4
  [76f85450] LibGit2
  [8f399da3] Libdl
  [37e2e46d] LinearAlgebra
  [56ddb016] Logging
  [d6f4376e] Markdown
  [a63ad114] Mmap
  [ca575930] NetworkOptions v1.2.0
  [44cfe95a] Pkg v1.10.0
  [de0858da] Printf
  [3fa0cd96] REPL
  [9a3f8284] Random
  [ea8e919c] SHA v0.7.0
  [9e88b42a] Serialization
  [1a1011a3] SharedArrays
  [6462fe0b] Sockets
  [2f01184e] SparseArrays v1.10.0
  [10745b16] Statistics v1.10.0
  [4607b0f0] SuiteSparse
  [fa267f1f] TOML v1.0.3
  [a4e569a6] Tar v1.10.0
  [8dfed614] Test
  [cf7118a7] UUIDs
  [4ec0a83e] Unicode
  [e66e0078] CompilerSupportLibraries_jll v1.1.0+0
  [deac9b47] LibCURL_jll v8.4.0+0
  [e37daf67] LibGit2_jll v1.6.4+0
  [29816b5a] LibSSH2_jll v1.11.0+1
  [c8ffd9c3] MbedTLS_jll v2.28.2+1
  [14a3606d] MozillaCACerts_jll v2023.1.10
  [4536629a] OpenBLAS_jll v0.3.23+4
  [05823500] OpenLibm_jll v0.8.1+2
  [bea87d4a] SuiteSparse_jll v7.2.1+1
  [83775a58] Zlib_jll v1.2.13+1
  [8e850b90] libblastrampoline_jll v5.8.0+1
  [8e850ede] nghttp2_jll v1.52.0+1
  [3f19e933] p7zip_jll v17.4.0+2
Info Packages marked with ⌃ and ⌅ have new versions available. Those with ⌃ may be upgradable, but those with ⌅ are restricted by compatibility constraints from upgrading. To see why use `status --outdated -m`

@vchuravy commented Oct 1, 2024

Wild... What is Libc.dllist()? Who loads this darn library?

@leios (Contributor Author) commented Oct 1, 2024

julia> Libc.Libdl.dllist()
32-element Vector{String}:
 "linux-vdso.so.1"
 "/usr/lib/libdl.so.2"
 "/usr/lib/libpthread.so.0"
 "/usr/lib/libc.so.6"
 "/home/leios/builds/julia-1.10.2/bin/../lib/libjulia.so.1.10"
 "/lib64/ld-linux-x86-64.so.2"
 "/home/leios/builds/julia-1.10.2/bin/../lib/julia/libgcc_s.so.1"
 "/home/leios/builds/julia-1.10.2/bin/../lib/julia/libopenlibm.so"
 "/usr/lib/libstdc++.so.6"
 "/usr/lib/libm.so.6"
 "/home/leios/builds/julia-1.10.2/bin/../lib/julia/libjulia-internal.so.1.10"
 "/home/leios/builds/julia-1.10.2/bin/../lib/julia/libunwind.so.8"
 "/usr/lib/librt.so.1"
 "/home/leios/builds/julia-1.10.2/bin/../lib/julia/libz.so.1"
 "/home/leios/builds/julia-1.10.2/bin/../lib/julia/libatomic.so.1"
 "/home/leios/builds/julia-1.10.2/bin/../lib/julia/libjulia-codegen.so.1.10"
 "/home/leios/builds/julia-1.10.2/bin/../lib/julia/libLLVM-15jl.so"
 "/home/leios/builds/julia-1.10.2/lib/julia/sys.so"
 "/home/leios/builds/julia-1.10.2/bin/../lib/julia/libpcre2-8.so"
 "/home/leios/builds/julia-1.10.2/bin/../lib/julia/libgmp.so.10"
 "/home/leios/builds/julia-1.10.2/bin/../lib/julia/libmpfr.so.6"
 "/home/leios/builds/julia-1.10.2/bin/../lib/julia/libgfortran.so.5"
 "/home/leios/builds/julia-1.10.2/bin/../lib/julia/libquadmath.so.0"
 "/home/leios/builds/julia-1.10.2/bin/../lib/julia/libopenblas64_.so"
 "/home/leios/builds/julia-1.10.2/bin/../lib/julia/libblastrampoline.so.5"
 "/home/leios/builds/julia-1.10.2/bin/../lib/julia/libmbedcrypto.so.7"
 "/home/leios/builds/julia-1.10.2/bin/../lib/julia/libmbedtls.so.14"
 "/home/leios/builds/julia-1.10.2/bin/../lib/julia/libmbedx509.so.1"
 "/home/leios/builds/julia-1.10.2/bin/../lib/julia/libssh2.so.1"
 "/home/leios/builds/julia-1.10.2/bin/../lib/julia/libgit2.so.1.6"
 "/home/leios/builds/julia-1.10.2/bin/../lib/julia/libnghttp2.so.14"
 "/home/leios/builds/julia-1.10.2/bin/../lib/julia/libcurl.so.4"

Is it a Linux thing like libpthread?

# This triggers an error but it isn't printed
# See https://discourse.julialang.org/t/error-handling-in-cuda-kernels/79692
# for how to throw a more meaningful error
error("wrong force unit returned, was expecting $F but got $(unit(f[1]))")


The interpolation here is particularly tricky. I would avoid that if at all possible.

@leios (Contributor Author) commented Oct 21, 2024

To be clear: are you referring to the error(...) call in an inlined function within an @kernel? Or carrying the units through to this stage in the first place?


error("Oops") is fine, error("Oops, $F") is sadly not, since string interpolation is really tough on the GPU compiler.
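A minimal sketch of the pattern being suggested here (assumption: the constant-string error stays on the device, while the informative interpolated message moves to a host-side check; the function names are illustrative):

# Device-side: keep the message a compile-time constant so the GPU compiler is happy.
@inline function check_units_device(ok::Bool)
    ok || error("wrong force unit returned")
    return nothing
end

# Host-side: interpolation is cheap here, so the detailed message can stay.
function check_units_host(f_unit, F)
    f_unit == F || error("wrong force unit returned, was expecting $F but got $f_unit")
    return nothing
end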

Collaborator

We can remove that, no problem.

@leios (Contributor Author) commented Oct 24, 2024

Right, having an issue right now with the Zygote tests. Not really sure how to go about debugging it, so I'll paste it here and then think about it.

Differentiable simulation: Error During Test at /home/leios/projects/CESMIX/Molly.jl/test/zygote.jl:30
  Got exception outside of a @test
  MethodError: _pullback(::Zygote.Context{false}, ::typeof(Base.Broadcast.broadcasted), ::CUDA.CuArrayStyle{1, CUDA.DeviceMemory}, ::typeof(mass), ::CuArray{Atom{Float64, Float64, Float64, Float64}, 1, CUDA.DeviceMemory}) is ambiguous.
  
  Candidates:
    _pullback(__context__::ZygoteRules.AContext, var"639"::typeof(Base.Broadcast.broadcasted), var"640"::AbstractGPUArrayStyle, f, args...)
      @ Zygote ~/.julia/packages/ZygoteRules/M4xmc/src/adjoint.jl:66
    _pullback(__context__::ZygoteRules.AContext, var"387"::typeof(Base.Broadcast.broadcasted), var"388"::Base.Broadcast.AbstractArrayStyle, f::typeof(mass), args...)
      @ Molly ~/.julia/packages/ZygoteRules/M4xmc/src/adjoint.jl:66
  
  Possible fix, define
    _pullback(::ZygoteRules.AContext, ::typeof(Base.Broadcast.broadcasted), ::AbstractGPUArrayStyle, ::typeof(mass), ::Vararg{Any})
  
  Stacktrace:
    [1] _apply(::Function, ::Vararg{Any})
      @ Core ./boot.jl:838
    [2] adjoint
      @ ~/.julia/packages/Zygote/NRp5C/src/lib/lib.jl:203 [inlined]
    [3] _pullback
      @ ~/.julia/packages/ZygoteRules/M4xmc/src/adjoint.jl:67 [inlined]
    [4] broadcasted
      @ ./broadcast.jl:1341 [inlined]
    [5] #System#3
      @ ~/projects/CESMIX/Molly.jl/src/types.jl:565 [inlined]
    [6] _pullback(::Zygote.Context{false}, ::Molly.var"##System#3", ::CuArray{Atom{Float64, Float64, Float64, Float64}, 1, CUDA.DeviceMemory}, ::CuArray{SVector{3, Float64}, 1, CUDA.DeviceMemory}, ::CubicBoundary{Float64}, ::CuArray{SVector{3, Float64}, 1, CUDA.DeviceMemory}, ::Vector{Any}, ::Nothing, ::Tuple{LennardJones{false, DistanceCutoff{Float64, Float64, Float64}, Int64, Int64, Unitful.FreeUnits{(), NoDims, nothing}, Unitful.FreeUnits{(), NoDims, nothing}}, CoulombReactionField{Float64, Float64, Int64, Float64, Unitful.FreeUnits{(), NoDims, nothing}, Unitful.FreeUnits{(), NoDims, nothing}}}, ::Tuple{InteractionList2Atoms{CuArray{Int32, 1, CUDA.DeviceMemory}, CuArray{HarmonicBond{Float32, Float64}, 1, CUDA.DeviceMemory}}, InteractionList3Atoms{CuArray{Int32, 1, CUDA.DeviceMemory}, CuArray{HarmonicAngle{Float64, Float64}, 1, CUDA.DeviceMemory}}, InteractionList4Atoms{CuArray{Int32, 1, CUDA.DeviceMemory}, CuArray{PeriodicTorsion{6, Float64, Float64}, 1, CUDA.DeviceMemory}}}, ::Tuple{}, ::Tuple{}, ::DistanceNeighborFinder{CuArray{Bool, 2, CUDA.DeviceMemory}, Float64}, ::Tuple{}, ::Unitful.FreeUnits{(), NoDims, nothing}, ::Unitful.FreeUnits{(), NoDims, nothing}, ::Float64, ::Nothing, ::Type{System})
      @ Zygote ~/.julia/packages/Zygote/NRp5C/src/compiler/interface2.jl:0
    [7] System
      @ ~/projects/CESMIX/Molly.jl/src/types.jl:485 [inlined]
    [8] _pullback(::Zygote.Context{false}, ::typeof(Core.kwcall), ::@NamedTuple{atoms::CuArray{Atom{Float64, Float64, Float64, Float64}, 1, CUDA.DeviceMemory}, coords::CuArray{SVector{3, Float64}, 1, CUDA.DeviceMemory}, boundary::CubicBoundary{Float64}, velocities::CuArray{SVector{3, Float64}, 1, CUDA.DeviceMemory}, pairwise_inters::Tuple{LennardJones{false, DistanceCutoff{Float64, Float64, Float64}, Int64, Int64, Unitful.FreeUnits{(), NoDims, nothing}, Unitful.FreeUnits{(), NoDims, nothing}}, CoulombReactionField{Float64, Float64, Int64, Float64, Unitful.FreeUnits{(), NoDims, nothing}, Unitful.FreeUnits{(), NoDims, nothing}}}, specific_inter_lists::Tuple{InteractionList2Atoms{CuArray{Int32, 1, CUDA.DeviceMemory}, CuArray{HarmonicBond{Float32, Float64}, 1, CUDA.DeviceMemory}}, InteractionList3Atoms{CuArray{Int32, 1, CUDA.DeviceMemory}, CuArray{HarmonicAngle{Float64, Float64}, 1, CUDA.DeviceMemory}}, InteractionList4Atoms{CuArray{Int32, 1, CUDA.DeviceMemory}, CuArray{PeriodicTorsion{6, Float64, Float64}, 1, CUDA.DeviceMemory}}}, general_inters::Tuple{}, neighbor_finder::DistanceNeighborFinder{CuArray{Bool, 2, CUDA.DeviceMemory}, Float64}, force_units::Unitful.FreeUnits{(), NoDims, nothing}, energy_units::Unitful.FreeUnits{(), NoDims, nothing}}, ::Type{System})
      @ Zygote ~/.julia/packages/Zygote/NRp5C/src/compiler/interface2.jl:0
    [9] loss
      @ ~/projects/CESMIX/Molly.jl/test/zygote.jl:140 [inlined]
   [10] _pullback(::Zygote.Context{false}, ::var"#loss#39"{UnionAll, Bool, Bool, Bool, Bool, DistanceNeighborFinder{CuArray{Bool, 2, CUDA.DeviceMemory}, Float64}, InteractionList4Atoms{CuArray{Int32, 1, CUDA.DeviceMemory}, CuArray{PeriodicTorsion{6, Float64, Float64}, 1, CUDA.DeviceMemory}}, InteractionList3Atoms{CuArray{Int32, 1, CUDA.DeviceMemory}, CuArray{HarmonicAngle{Float64, Float64}, 1, CUDA.DeviceMemory}}, Vector{Float64}, CuArray{Int32, 1, CUDA.DeviceMemory}, CuArray{Int32, 1, CUDA.DeviceMemory}, Tuple{LennardJones{false, DistanceCutoff{Float64, Float64, Float64}, Int64, Int64, Unitful.FreeUnits{(), NoDims, nothing}, Unitful.FreeUnits{(), NoDims, nothing}}, CoulombReactionField{Float64, Float64, Int64, Float64, Unitful.FreeUnits{(), NoDims, nothing}, Unitful.FreeUnits{(), NoDims, nothing}}}, Vector{SVector{3, ForwardDiff.Dual{Nothing, Float64, 1}}}, Vector{SVector{3, ForwardDiff.Dual{Nothing, Float64, 1}}}, Vector{SVector{3, Float64}}, Vector{SVector{3, Float64}}, VelocityVerlet{Float64, RescaleThermostat{Float64}}, CubicBoundary{Float64}, Float64, Int64, Int64, var"#mean_min_separation#28"{var"#abs2_vec#27"}}, ::Float64, ::Float64)
      @ Zygote ~/.julia/packages/Zygote/NRp5C/src/compiler/interface2.jl:0
   [11] pullback(::Function, ::Zygote.Context{false}, ::Float64, ::Vararg{Float64})
      @ Zygote ~/.julia/packages/Zygote/NRp5C/src/compiler/interface.jl:90
   [12] pullback(::Function, ::Float64, ::Float64)
      @ Zygote ~/.julia/packages/Zygote/NRp5C/src/compiler/interface.jl:88
   [13] gradient(::Function, ::Float64, ::Vararg{Float64})
      @ Zygote ~/.julia/packages/Zygote/NRp5C/src/compiler/interface.jl:147
   [14] macro expansion
      @ ~/projects/CESMIX/Molly.jl/test/zygote.jl:202 [inlined]
   [15] macro expansion
      @ ~/builds/julia-1.10.2/share/julia/stdlib/v1.10/Test/src/Test.jl:1577 [inlined]
   [16] top-level scope
      @ ~/projects/CESMIX/Molly.jl/test/zygote.jl:31
   [17] include(fname::String)
      @ Base.MainInclude ./client.jl:489
   [18] top-level scope
      @ ~/projects/CESMIX/Molly.jl/test/runtests.jl:110
   [19] include(fname::String)
      @ Base.MainInclude ./client.jl:489
   [20] top-level scope
      @ none:6
   [21] eval
      @ ./boot.jl:385 [inlined]
   [22] exec_options(opts::Base.JLOptions)
      @ Base ./client.jl:291
   [23] _start()
      @ Base ./client.jl:552

@leios (Contributor Author) commented Oct 24, 2024

As a note, it looks like #182 is close-ish to being done? If so, it might be best to rework this PR once those changes are in.

As it stands, this PR can work as a branch for any non-CUDA GPU (as long as the user does not need differentiable sims).

@jgreener64 (Collaborator)

Thanks for keeping going on this. Yes, #182 will get merged soon. I was waiting on full Enzyme GPU broadcasting support to avoid GPU differentiable simulations breaking, but I might just merge it and wait for that to arrive later. Consequently I wouldn't worry about the Zygote error.

I'm happy to help update this PR after #182, I realise that the ground has been moving under you.

@leios (Contributor Author) commented Oct 24, 2024

Yeah, no worries. I should go ahead and close the #99 PR since this one supersedes it.

On my end, I am happy to wait a little longer and rebase up when you are happy with #182. I have other projects to work on in the meantime.

Could you link the Enzyme issue? Is it this one? EnzymeAD/Enzyme.jl#1672

@jgreener64 (Collaborator)

Okay, great. The example I posted on JuliaGPU/CUDA.jl#2471 doesn't work as it stands; that and similar issues around reduction and broadcasting are the problem.

@vchuravy

@jgreener64 if there isn't an open issue on Enzyme.jl associated with it, it is very likely that Billy and I will lose track of it.

@jgreener64 (Collaborator) commented Oct 24, 2024

Billy has favoured moving CUDA-specific issues to CUDA.jl, e.g. EnzymeAD/Enzyme.jl#1454 (comment). In either case I can find some MWEs in the next few days of what is erroring and make sure they are tracked in issues.

@jgreener64 (Collaborator)

#182 is now merged.

@wsmoses commented Oct 30, 2024

The particular issue is this one: JuliaGPU/CUDA.jl#2455

Essentially, we haven't yet implemented the derivative of the CUDA.jl reduction of a broadcast (the derivative of the reduction of an array was just added in CUDA.jl).

@leios (Contributor Author) commented Dec 18, 2024

Coming back to this. AMD tests pass up to the end of the simulation tests. Jotting this down for later because it's a precision issue that might not be easy to solve:

Different implementations: Test Failed at /home/leios/projects/CESMIX/Molly.jl/test/simulation.jl:1269
  Expression: coord_diff < 0.0001 * u"nm"
   Evaluated: 0.18373905452396624 nm < 0.0001 nm

Stacktrace:
 [1] macro expansion
   @ ~/builds/julia-1.10.2/share/julia/stdlib/v1.10/Test/src/Test.jl:672 [inlined]
 [2] macro expansion
   @ ~/projects/CESMIX/Molly.jl/test/simulation.jl:1269 [inlined]
 [3] macro expansion
   @ ~/builds/julia-1.10.2/share/julia/stdlib/v1.10/Test/src/Test.jl:1577 [inlined]
 [4] top-level scope
   @ ~/projects/CESMIX/Molly.jl/test/simulation.jl:1172
[ Info: GPU f32             - difference per coordinate 0.222242465130291 nm - potential energy difference 5.492531155670832e-5 kJ mol^-1
Different implementations: Test Failed at /home/leios/projects/CESMIX/Molly.jl/test/simulation.jl:1269
  Expression: coord_diff < 0.0001 * u"nm"
   Evaluated: 0.222242465130291 nm < 0.0001 nm

Stacktrace:
 [1] macro expansion
   @ ~/builds/julia-1.10.2/share/julia/stdlib/v1.10/Test/src/Test.jl:672 [inlined]
 [2] macro expansion
   @ ~/projects/CESMIX/Molly.jl/test/simulation.jl:1269 [inlined]
 [3] macro expansion
   @ ~/builds/julia-1.10.2/share/julia/stdlib/v1.10/Test/src/Test.jl:1577 [inlined]
 [4] top-level scope
   @ ~/projects/CESMIX/Molly.jl/test/simulation.jl:1172
[ Info: GPU NL              - difference per coordinate 1.9973717839752905 nm - potential energy difference 7.993605777301127e-15 kJ mol^-1
Different implementations: Test Failed at /home/leios/projects/CESMIX/Molly.jl/test/simulation.jl:1269
  Expression: coord_diff < 0.0001 * u"nm"
   Evaluated: 1.9973717839752905 nm < 0.0001 nm

Stacktrace:
 [1] macro expansion
   @ ~/builds/julia-1.10.2/share/julia/stdlib/v1.10/Test/src/Test.jl:672 [inlined]
 [2] macro expansion
   @ ~/projects/CESMIX/Molly.jl/test/simulation.jl:1269 [inlined]
 [3] macro expansion
   @ ~/builds/julia-1.10.2/share/julia/stdlib/v1.10/Test/src/Test.jl:1577 [inlined]
 [4] top-level scope
   @ ~/projects/CESMIX/Molly.jl/test/simulation.jl:1172
[ Info: GPU f32 NL          - difference per coordinate NaN nm - potential energy difference 6.20778689297552e-5 kJ mol^-1
Different implementations: Test Failed at /home/leios/projects/CESMIX/Molly.jl/test/simulation.jl:1269
  Expression: coord_diff < 0.0001 * u"nm"
   Evaluated: NaN nm < 0.0001 nm

Stacktrace:
 [1] macro expansion
   @ ~/builds/julia-1.10.2/share/julia/stdlib/v1.10/Test/src/Test.jl:672 [inlined]
 [2] macro expansion
   @ ~/projects/CESMIX/Molly.jl/test/simulation.jl:1269 [inlined]
 [3] macro expansion
   @ ~/builds/julia-1.10.2/share/julia/stdlib/v1.10/Test/src/Test.jl:1577 [inlined]
 [4] top-level scope
   @ ~/projects/CESMIX/Molly.jl/test/simulation.jl:1172
Test Summary:             | Pass  Fail  Total   Time
Different implementations |   46     4     50  21.0s

@jgreener64 (Collaborator)

Great, let me know when this is ready for review. Don't worry about the gradient tests for now; I am currently dealing with some issues there.

Do you know what effect these changes have on performance, e.g. for the protein system at https://juliamolsim.github.io/Molly.jl/dev/documentation/#Simulating-a-protein on GPU with no barostat?

@leios (Contributor Author) commented Dec 18, 2024

I can take a look at the difference in performance as well, but it should be the same for CUDA devices because all of that is now in a separate extension.

@jgreener64 (Collaborator)

Ah I see. Would it cause slowdown to remove the CUDA code entirely and just use the KA code?

@leios (Contributor Author) commented Dec 18, 2024

The answer should be "no," but there are certain features that are supported by CUDA, but not by KA, such as the warp-level semantics used for pairwise_force_kernel_nonl.

When we swapped over to KA for GPUArrays, there was not a noticeable difference in performance for the kernels there.

@jgreener64 (Collaborator)

Great, in that case maybe remove all the CUDA code from the extension except the code that uses the warps.

@leios (Contributor Author) commented Dec 18, 2024

Ok, will do. Still the following left:

  • Figure out the last few errors
  • Check performance before and after
  • Fix docs to mention allowed array types (CuArray, ROCArray, oneArray, MtlArray, Array...)
  • Prune the CUDA extension and check performance again
  • Double-check test coverage
  • Rebase up again

@leios (Contributor Author) commented Dec 20, 2024

Ok, the CUDA tests were all passing locally just a second ago. Note that I had to leave both the nonl and nl kernels in the CUDA extension.

@leios leios marked this pull request as ready for review December 20, 2024 15:04
@leios (Contributor Author) commented Dec 20, 2024

Ok, I still need to do the performance checks, but the PR looks to finally be out of the draft stage. A few important notes before merging:

  1. I would like to re-run all the tests locally on both CUDA and AMD devices just to be sure
  2. If someone has a mac and is willing to test this for a Metal Array, I can modify the tests to make that happen (the biggest change is that I need to make sure the arrays are all 32 bit). This isn't necessary for the PR though.
  3. I was getting some somewhat large precision issues on AMD that vanish on NVIDIA devices. I am not sure what that is about tbh

@leios (Contributor Author) commented Dec 20, 2024

The GPU error from the gradient tests:

 Differentiable simulation: Error During Test at /home/leios/projects/CESMIX/Molly.jl/test/gradients.jl:37
  Got exception outside of a @test
  ArgumentError: the coordinates are on the GPU but the atoms are not
  Stacktrace:
    [1] #System#4
      @ ~/projects/CESMIX/Molly.jl/src/types.jl:557
    [2] System
      @ ~/projects/CESMIX/Molly.jl/src/types.jl:489
    [3] loss
      @ ~/projects/CESMIX/Molly.jl/test/gradients.jl:95 [inlined]
    [4] loss
      @ ~/projects/CESMIX/Molly.jl/test/gradients.jl:0
    [5] macro expansion
      @ ~/.julia/packages/Enzyme/ydGh2/src/compiler.jl:5218 [inlined]
    [6] enzyme_call
      @ ~/.julia/packages/Enzyme/ydGh2/src/compiler.jl:4764 [inlined]
    [7] PrimalErrorThunk
      @ ~/.julia/packages/Enzyme/ydGh2/src/compiler.jl:4620 [inlined]
    [8] autodiff
      @ ~/.julia/packages/Enzyme/ydGh2/src/Enzyme.jl:503 [inlined]
    [9] autodiff(::ReverseMode{false, true, FFIABI, false, false}, ::var"#loss#15"{var"#mean_min_separation#14"}, ::Type{Active}, ::Active{Float64}, ::Active{Float64}, ::Duplicated{Vector{SVector{3, Float64}}}, ::Duplicated{Vector{SVector{3, Float64}}}, ::Const{CubicBoundary{Float64}}, ::Const{Tuple{LennardJones{DistanceCutoff{Float64, Float64, Float64}, typeof(Molly.lj_zero_shortcut), typeof(Molly.lorentz_σ_mixing), typeof(Molly.geometric_ϵ_mixing), Int64}, CoulombReactionField{Float64, Float64, Int64, Float64}}}, ::Const{Tuple{}}, ::Const{DistanceNeighborFinder{CuArray{Bool, 2, CUDA.DeviceMemory}, Float64}}, ::Const{VelocityVerlet{Float64, RescaleThermostat{Float64}}}, ::Const{Int64}, ::Const{Int64}, ::Const{Int64}, ::Const{Float64}, ::Const{Vector{Float64}}, ::Const{CuArray{Int32, 1, CUDA.DeviceMemory}}, ::Const{CuArray{Int32, 1, CUDA.DeviceMemory}}, ::Const{InteractionList3Atoms{CuArray{Int32, 1, CUDA.DeviceMemory}, CuArray{HarmonicAngle{Float64, Float64}, 1, CUDA.DeviceMemory}}}, ::Const{InteractionList4Atoms{CuArray{Int32, 1, CUDA.DeviceMemory}, CuArray{PeriodicTorsion{6, Float64, Float64}, 1, CUDA.DeviceMemory}}}, ::Const{Val{Float64}}, ::Const{Val{CuArray}})
      @ Enzyme ~/.julia/packages/Enzyme/ydGh2/src/Enzyme.jl:524
   [10] macro expansion
      @ ~/projects/CESMIX/Molly.jl/test/gradients.jl:211 [inlined]
   [11] macro expansion
      @ ~/builds/julia-1.10.2/share/julia/stdlib/v1.10/Test/src/Test.jl:1577 [inlined]
   [12] top-level scope
      @ ~/projects/CESMIX/Molly.jl/test/gradients.jl:38
   [13] include(fname::String)
      @ Base.MainInclude ./client.jl:489
   [14] top-level scope
      @ ~/projects/CESMIX/Molly.jl/test/runtests.jl:106
   [15] include(fname::String)
      @ Base.MainInclude ./client.jl:489
   [16] top-level scope
      @ none:6
   [17] eval
      @ ./boot.jl:385 [inlined]
   [18] exec_options(opts::Base.JLOptions)
      @ Base ./client.jl:291
   [19] _start()
      @ Base ./client.jl:552
Test Summary:             | Pass  Error  Total      Time
Differentiable simulation |   18      1     19  28m29.3s
ERROR: LoadError: Some tests did not pass: 18 passed, 0 failed, 1 errored, 0 broken.
in expression starting at /home/leios/projects/CESMIX/Molly.jl/test/gradients.jl:37
in expression starting at /home/leios/projects/CESMIX/Molly.jl/test/runtests.jl:105
ERROR: Package Molly errored during testing

@leios (Contributor Author) commented Dec 22, 2024

Right, so the tests pass locally, but running the KA version of the nl kernel on CUDA specifically has an issue (pasted below).

Note that this does not affect AMD (my testing machine) and doesn't affect CUDA execution because the nl kernel is in the CUDA extension.

Here is the error:

Energy minimization: Error During Test at /home/leios/projects/CESMIX/Molly.jl/test/minimization.jl:1
  Got exception outside of a @test
  LLVM error: Cannot select: 0x13da74c8: ch = AtomicStore<(store seq_cst (s64) into %ir.188, addrspace 1)> 0x1df59610, 0xacb3ef0, 0x27f64268, /home/leios/.julia/packages/LLVM/wMjUU/src/interop/base.jl:39 @[ /home/leios/.julia/packages/UnsafeAtomicsLLVM/cIROJ/src/atomics.jl:230 @[ /home/leios/.julia/packages/UnsafeAtomicsLLVM/cIROJ/src/atomics.jl:186 @[ /home/leios/.julia/packages/UnsafeAtomicsLLVM/cIROJ/src/internal.jl:13 @[ /home/leios/.julia/packages/Atomix/F9VIX/src/core.jl:14 @[ /home/leios/projects/CESMIX/Molly.jl/src/kernels.jl:82 @[ /home/leios/.julia/packages/KernelAbstractions/0r40T/src/macros.jl:97 @[ none:0 ] ] ] ] ] ] ]
    0xacb3ef0: i64,ch = CopyFromReg 0x10b584c8, Register:i64 %73, tuple.jl:31 @[ /home/leios/.julia/packages/StaticArrays/SA4kA/src/SArray.jl:65 @[ /home/leios/projects/CESMIX/Molly.jl/src/kernels.jl:81 @[ /home/leios/.julia/packages/KernelAbstractions/0r40T/src/macros.jl:97 @[ none:0 ] ] ] ]
      0x27f88df0: i64 = Register %73
    0x27f64268: f64 = fsub 0xacb4640, 0x13da7d50, float.jl:410 @[ /home/leios/projects/CESMIX/Molly.jl/src/kernels.jl:82 @[ /home/leios/.julia/packages/KernelAbstractions/0r40T/src/macros.jl:97 @[ none:0 ] ] ]
      0xacb4640: f64,ch = load<(load (s64) from %ir.188, !tbaa !604, addrspace 1)> 0x10b584c8, 0xacb3ef0, undef:i64, /home/leios/.julia/packages/LLVM/wMjUU/src/interop/base.jl:39 @[ none:0 @[ none:0 @[ /home/leios/.julia/packages/LLVM/wMjUU/src/interop/pointer.jl:85 @[ /home/leios/.julia/packages/CUDA/2kjXI/src/device/array.jl:91 @[ /home/leios/.julia/packages/CUDA/2kjXI/src/device/array.jl:85 @[ /home/leios/.julia/packages/CUDA/2kjXI/src/device/array.jl:164 @[ /home/leios/.julia/packages/CUDA/2kjXI/src/device/array.jl:175 @[ /home/leios/projects/CESMIX/Molly.jl/src/kernels.jl:82 @[ /home/leios/.julia/packages/KernelAbstractions/0r40T/src/macros.jl:97 @[ none:0 ] ] ] ] ] ] ] ] ] ]
        0xacb3ef0: i64,ch = CopyFromReg 0x10b584c8, Register:i64 %73, tuple.jl:31 @[ /home/leios/.julia/packages/StaticArrays/SA4kA/src/SArray.jl:65 @[ /home/leios/projects/CESMIX/Molly.jl/src/kernels.jl:81 @[ /home/leios/.julia/packages/KernelAbstractions/0r40T/src/macros.jl:97 @[ none:0 ] ] ] ]
          0x27f88df0: i64 = Register %73
        0x27f63978: i64 = undef
      0x13da7d50: f64,ch = load<(load (s64) from %ir.189, !tbaa !647, !alias.scope !649, !noalias !650, addrspace 5)> 0x10b584c8, 0x27f88cb8, undef:i64, float.jl:410 @[ /home/leios/projects/CESMIX/Molly.jl/src/kernels.jl:82 @[ /home/leios/.julia/packages/KernelAbstractions/0r40T/src/macros.jl:97 @[ none:0 ] ] ]
        0x27f88cb8: i64,ch = CopyFromReg 0x10b584c8, Register:i64 %76, tuple.jl:31 @[ /home/leios/.julia/packages/StaticArrays/SA4kA/src/SArray.jl:65 @[ /home/leios/projects/CESMIX/Molly.jl/src/kernels.jl:81 @[ /home/leios/.julia/packages/KernelAbstractions/0r40T/src/macros.jl:97 @[ none:0 ] ] ] ]
          0x27f643a0: i64 = Register %76
        0x27f63978: i64 = undef
  In function: _Z31gpu_pairwise_force_kernel_nonl_16CompilerMetadataI11DynamicSize12DynamicCheckv16CartesianIndicesILl1E5TupleI5OneToI5Int64EEE7NDRangeILl1ES0_10StaticSizeI6_512__ES2_ILl1ES3_IS4_IS5_EEEvEE13CuDeviceArrayI7Float64Ll2ELl1EES8_I6SArrayIS3_ILl3EE8QuantityIS9_1_9FreeUnitsI5_nm__1_7nothingEELl1ELl3EELl1ELl1EES8_IS10_IS3_ILl3EES11_IS9_6_____1S12_I11_nm__ps__1_6_____17nothingEELl1ELl3EELl1ELl1EES8_I4AtomIS5_S11_IS9_6_____1S12_I11_g__mol__1_6_____17nothingEES9_S11_IS9_1_S12_I5_nm__1_7nothingEES11_IS9_15__2______1____2S12_I12_kJ__mol__1_15__2______1____27nothingEEELl1ELl1EE13CubicBoundaryIS11_IS9_1_S12_I5_nm__1_7nothingEEES3_I12LennardJonesI8NoCutoff16lj_zero_shortcut16lorentz___mixing18geometric___mixingS5_EES5_3ValILl3EES20_I15kJ_nm__1_mol__1E
  Stacktrace:
    [1] handle_error(reason::Cstring)
      @ LLVM ~/.julia/packages/LLVM/wMjUU/src/core/context.jl:194
    [2] LLVMTargetMachineEmitToMemoryBuffer(T::LLVM.TargetMachine, M::LLVM.Module, codegen::LLVM.API.LLVMCodeGenFileType, ErrorMessage::Base.RefValue{Cstring}, OutMemBuf::Base.RefValue{Ptr{LLVM.API.LLVMOpaqueMemoryBuffer}})
      @ LLVM.API ~/.julia/packages/LLVM/wMjUU/lib/15/libLLVM.jl:11138
    [3] emit(tm::LLVM.TargetMachine, mod::LLVM.Module, filetype::LLVM.API.LLVMCodeGenFileType)
      @ LLVM ~/.julia/packages/LLVM/wMjUU/src/targetmachine.jl:118
    [4] mcgen
      @ ~/.julia/packages/GPUCompiler/2CW9L/src/mcgen.jl:75 [inlined]
    [5] mcgen(job::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, mod::LLVM.Module, format::LLVM.API.LLVMCodeGenFileType)
      @ CUDA ~/.julia/packages/CUDA/2kjXI/src/compiler/compilation.jl:127
    [6] macro expansion
      @ ~/.julia/packages/TimerOutputs/6KVfH/src/TimerOutput.jl:253 [inlined]
    [7] macro expansion
      @ ~/.julia/packages/GPUCompiler/2CW9L/src/driver.jl:403 [inlined]
    [8] macro expansion
      @ ~/.julia/packages/TimerOutputs/6KVfH/src/TimerOutput.jl:253 [inlined]
    [9] macro expansion
      @ ~/.julia/packages/GPUCompiler/2CW9L/src/driver.jl:400 [inlined]
   [10] emit_asm(job::GPUCompiler.CompilerJob, ir::LLVM.Module; strip::Bool, validate::Bool, format::LLVM.API.LLVMCodeGenFileType)
      @ GPUCompiler ~/.julia/packages/GPUCompiler/2CW9L/src/utils.jl:108
   [11] emit_asm
      @ ~/.julia/packages/GPUCompiler/2CW9L/src/utils.jl:106 [inlined]
   [12] codegen(output::Symbol, job::GPUCompiler.CompilerJob; toplevel::Bool, libraries::Bool, optimize::Bool, cleanup::Bool, validate::Bool, strip::Bool, only_entry::Bool, parent_job::Nothing)
      @ GPUCompiler ~/.julia/packages/GPUCompiler/2CW9L/src/driver.jl:120
   [13] codegen(output::Symbol, job::GPUCompiler.CompilerJob)
      @ GPUCompiler ~/.julia/packages/GPUCompiler/2CW9L/src/driver.jl:82
   [14] compile(target::Symbol, job::GPUCompiler.CompilerJob; kwargs::@Kwargs{})
      @ GPUCompiler ~/.julia/packages/GPUCompiler/2CW9L/src/driver.jl:79
   [15] compile
      @ ~/.julia/packages/GPUCompiler/2CW9L/src/driver.jl:74 [inlined]
   [16] #1145
      @ ~/.julia/packages/CUDA/2kjXI/src/compiler/compilation.jl:250 [inlined]
   [17] JuliaContext(f::CUDA.var"#1145#1148"{GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}}; kwargs::@Kwargs{})
      @ GPUCompiler ~/.julia/packages/GPUCompiler/2CW9L/src/driver.jl:34
   [18] JuliaContext(f::Function)
      @ GPUCompiler ~/.julia/packages/GPUCompiler/2CW9L/src/driver.jl:25
   [19] compile(job::GPUCompiler.CompilerJob)
      @ CUDA ~/.julia/packages/CUDA/2kjXI/src/compiler/compilation.jl:249
   [20] actual_compilation(cache::Dict{Any, CuFunction}, src::Core.MethodInstance, world::UInt64, cfg::GPUCompiler.CompilerConfig{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, compiler::typeof(CUDA.compile), linker::typeof(CUDA.link))
      @ GPUCompiler ~/.julia/packages/GPUCompiler/2CW9L/src/execution.jl:237
   [21] cached_compilation(cache::Dict{Any, CuFunction}, src::Core.MethodInstance, cfg::GPUCompiler.CompilerConfig{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, compiler::Function, linker::Function)
      @ GPUCompiler ~/.julia/packages/GPUCompiler/2CW9L/src/execution.jl:151
   [22] macro expansion
      @ ~/.julia/packages/CUDA/2kjXI/src/compiler/execution.jl:380 [inlined]
   [23] macro expansion
      @ ./lock.jl:267 [inlined]
   [24] cufunction(f::typeof(Molly.gpu_pairwise_force_kernel_nonl!), tt::Type{Tuple{KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.StaticSize{(512,)}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, Nothing}}, CuDeviceMatrix{Float64, 1}, CuDeviceVector{SVector{3, Quantity{Float64, 𝐋, Unitful.FreeUnits{(nm,), 𝐋, nothing}}}, 1}, CuDeviceVector{SVector{3, Quantity{Float64, 𝐋 𝐓^-1, Unitful.FreeUnits{(nm, ps^-1), 𝐋 𝐓^-1, nothing}}}, 1}, CuDeviceVector{Atom{Int64, Quantity{Float64, 𝐌 𝐍^-1, Unitful.FreeUnits{(g, mol^-1), 𝐌 𝐍^-1, nothing}}, Float64, Quantity{Float64, 𝐋, Unitful.FreeUnits{(nm,), 𝐋, nothing}}, Quantity{Float64, 𝐋^2 𝐌 𝐍^-1 𝐓^-2, Unitful.FreeUnits{(kJ, mol^-1), 𝐋^2 𝐌 𝐍^-1 𝐓^-2, nothing}}}, 1}, CubicBoundary{Quantity{Float64, 𝐋, Unitful.FreeUnits{(nm,), 𝐋, nothing}}}, Tuple{LennardJones{NoCutoff, typeof(Molly.lj_zero_shortcut), typeof(Molly.lorentz_σ_mixing), typeof(Molly.geometric_ϵ_mixing), Int64}}, Int64, Val{3}, Val{kJ nm^-1 mol^-1}}}; kwargs::@Kwargs{always_inline::Bool, maxthreads::Int64})
      @ CUDA ~/.julia/packages/CUDA/2kjXI/src/compiler/execution.jl:375
   [25] macro expansion
      @ ~/.julia/packages/CUDA/2kjXI/src/compiler/execution.jl:112 [inlined]
   [26] (::KernelAbstractions.Kernel{CUDABackend, KernelAbstractions.NDIteration.StaticSize{(512,)}, KernelAbstractions.NDIteration.DynamicSize, typeof(Molly.gpu_pairwise_force_kernel_nonl!)})(::CuArray{Float64, 2, CUDA.DeviceMemory}, ::Vararg{Any}; ndrange::Int64, workgroupsize::Nothing)
      @ CUDA.CUDAKernels ~/.julia/packages/CUDA/2kjXI/src/CUDAKernels.jl:103
   [27] pairwise_force_gpu!(fs_mat::CuArray{Float64, 2, CUDA.DeviceMemory}, coords::CuArray{SVector{3, Quantity{Float64, 𝐋, Unitful.FreeUnits{(nm,), 𝐋, nothing}}}, 1, CUDA.DeviceMemory}, velocities::CuArray{SVector{3, Quantity{Float64, 𝐋 𝐓^-1, Unitful.FreeUnits{(nm, ps^-1), 𝐋 𝐓^-1, nothing}}}, 1, CUDA.DeviceMemory}, atoms::CuArray{Atom{Int64, Quantity{Float64, 𝐌 𝐍^-1, Unitful.FreeUnits{(g, mol^-1), 𝐌 𝐍^-1, nothing}}, Float64, Quantity{Float64, 𝐋, Unitful.FreeUnits{(nm,), 𝐋, nothing}}, Quantity{Float64, 𝐋^2 𝐌 𝐍^-1 𝐓^-2, Unitful.FreeUnits{(kJ, mol^-1), 𝐋^2 𝐌 𝐍^-1 𝐓^-2, nothing}}}, 1, CUDA.DeviceMemory}, boundary::CubicBoundary{Quantity{Float64, 𝐋, Unitful.FreeUnits{(nm,), 𝐋, nothing}}}, pairwise_inters::Tuple{LennardJones{NoCutoff, typeof(Molly.lj_zero_shortcut), typeof(Molly.lorentz_σ_mixing), typeof(Molly.geometric_ϵ_mixing), Int64}}, nbs::NoNeighborList, step_n::Int64, force_units::Unitful.FreeUnits{(kJ, nm^-1, mol^-1), 𝐋 𝐌 𝐍^-1 𝐓^-2, nothing}, ::Val{Float64})
      @ Molly ~/projects/CESMIX/Molly.jl/src/kernels.jl:40
   [28] forces_nounits!(fs_nounits::CuArray{SVector{3, Float64}, 1, CUDA.DeviceMemory}, sys::System{3, CuArray, Float64, CuArray{Atom{Int64, Quantity{Float64, 𝐌 𝐍^-1, Unitful.FreeUnits{(g, mol^-1), 𝐌 𝐍^-1, nothing}}, Float64, Quantity{Float64, 𝐋, Unitful.FreeUnits{(nm,), 𝐋, nothing}}, Quantity{Float64, 𝐋^2 𝐌 𝐍^-1 𝐓^-2, Unitful.FreeUnits{(kJ, mol^-1), 𝐋^2 𝐌 𝐍^-1 𝐓^-2, nothing}}}, 1, CUDA.DeviceMemory}, CuArray{SVector{3, Quantity{Float64, 𝐋, Unitful.FreeUnits{(nm,), 𝐋, nothing}}}, 1, CUDA.DeviceMemory}, CubicBoundary{Quantity{Float64, 𝐋, Unitful.FreeUnits{(nm,), 𝐋, nothing}}}, CuArray{SVector{3, Quantity{Float64, 𝐋 𝐓^-1, Unitful.FreeUnits{(nm, ps^-1), 𝐋 𝐓^-1, nothing}}}, 1, CUDA.DeviceMemory}, Vector{Any}, Nothing, Tuple{LennardJones{NoCutoff, typeof(Molly.lj_zero_shortcut), typeof(Molly.lorentz_σ_mixing), typeof(Molly.geometric_ϵ_mixing), Int64}}, Tuple{}, Tuple{}, Tuple{}, NoNeighborFinder, Tuple{}, Unitful.FreeUnits{(kJ, nm^-1, mol^-1), 𝐋 𝐌 𝐍^-1 𝐓^-2, nothing}, Unitful.FreeUnits{(kJ, mol^-1), 𝐋^2 𝐌 𝐍^-1 𝐓^-2, nothing}, Quantity{Float64, 𝐋^2 𝐌 𝐍^-1 𝚯^-1 𝐓^-2, Unitful.FreeUnits{(kJ, K^-1, mol^-1), 𝐋^2 𝐌 𝐍^-1 𝚯^-1 𝐓^-2, nothing}}, CuArray{Quantity{Float64, 𝐌 𝐍^-1, Unitful.FreeUnits{(g, mol^-1), 𝐌 𝐍^-1, nothing}}, 1, CUDA.DeviceMemory}, Nothing}, neighbors::Nothing, fs_mat::CuArray{Float64, 2, CUDA.DeviceMemory}, step_n::Int64; n_threads::Int64)
      @ Molly ~/projects/CESMIX/Molly.jl/src/force.jl:359
   [29] forces_nounits!
      @ ~/projects/CESMIX/Molly.jl/src/force.jl:350 [inlined]
   [30] #simulate!#248
      @ ~/projects/CESMIX/Molly.jl/src/simulators.jl:78 [inlined]
   [31] simulate!(sys::System{3, CuArray, Float64, CuArray{Atom{Int64, Quantity{Float64, 𝐌 𝐍^-1, Unitful.FreeUnits{(g, mol^-1), 𝐌 𝐍^-1, nothing}}, Float64, Quantity{Float64, 𝐋, Unitful.FreeUnits{(nm,), 𝐋, nothing}}, Quantity{Float64, 𝐋^2 𝐌 𝐍^-1 𝐓^-2, Unitful.FreeUnits{(kJ, mol^-1), 𝐋^2 𝐌 𝐍^-1 𝐓^-2, nothing}}}, 1, CUDA.DeviceMemory}, CuArray{SVector{3, Quantity{Float64, 𝐋, Unitful.FreeUnits{(nm,), 𝐋, nothing}}}, 1, CUDA.DeviceMemory}, CubicBoundary{Quantity{Float64, 𝐋, Unitful.FreeUnits{(nm,), 𝐋, nothing}}}, CuArray{SVector{3, Quantity{Float64, 𝐋 𝐓^-1, Unitful.FreeUnits{(nm, ps^-1), 𝐋 𝐓^-1, nothing}}}, 1, CUDA.DeviceMemory}, Vector{Any}, Nothing, Tuple{LennardJones{NoCutoff, typeof(Molly.lj_zero_shortcut), typeof(Molly.lorentz_σ_mixing), typeof(Molly.geometric_ϵ_mixing), Int64}}, Tuple{}, Tuple{}, Tuple{}, NoNeighborFinder, Tuple{}, Unitful.FreeUnits{(kJ, nm^-1, mol^-1), 𝐋 𝐌 𝐍^-1 𝐓^-2, nothing}, Unitful.FreeUnits{(kJ, mol^-1), 𝐋^2 𝐌 𝐍^-1 𝐓^-2, nothing}, Quantity{Float64, 𝐋^2 𝐌 𝐍^-1 𝚯^-1 𝐓^-2, Unitful.FreeUnits{(kJ, K^-1, mol^-1), 𝐋^2 𝐌 𝐍^-1 𝚯^-1 𝐓^-2, nothing}}, CuArray{Quantity{Float64, 𝐌 𝐍^-1, Unitful.FreeUnits{(g, mol^-1), 𝐌 𝐍^-1, nothing}}, 1, CUDA.DeviceMemory}, Nothing}, sim::SteepestDescentMinimizer{Quantity{Float64, 𝐋, Unitful.FreeUnits{(nm,), 𝐋, nothing}}, Quantity{Float64, 𝐋 𝐌 𝐍^-1 𝐓^-2, Unitful.FreeUnits{(kJ, nm^-1, mol^-1), 𝐋 𝐌 𝐍^-1 𝐓^-2, nothing}}, Base.DevNull})
      @ Molly ~/projects/CESMIX/Molly.jl/src/simulators.jl:60
   [32] macro expansion
      @ ~/projects/CESMIX/Molly.jl/test/minimization.jl:59 [inlined]
   [33] macro expansion
      @ ~/builds/julia-1.10.2/share/julia/stdlib/v1.10/Test/src/Test.jl:1577 [inlined]
   [34] top-level scope
      @ ~/projects/CESMIX/Molly.jl/test/minimization.jl:2
   [35] include(fname::String)
      @ Base.MainInclude ./client.jl:489
   [36] top-level scope
      @ ~/projects/CESMIX/Molly.jl/test/runtests.jl:96
   [37] include(fname::String)
      @ Base.MainInclude ./client.jl:489
   [38] top-level scope
      @ none:6
   [39] eval
      @ ./boot.jl:385 [inlined]
   [40] exec_options(opts::Base.JLOptions)
      @ Base ./client.jl:291
   [41] _start()
      @ Base ./client.jl:552

@jgreener64 (Collaborator)

That's strange, but I wouldn't worry as we are working to change that kernel anyway.

@leios force-pushed the KA_support branch 3 times, most recently from 98134c8 to 2a26c03 on December 23, 2024 12:00
@leios (Contributor Author) commented Dec 23, 2024

Not sure why I am getting precision issues on the CPU tests after rebasing up.

Also, I have some bad news about performance and would actually recommend against merging this PR at this time because this branch appears to be 30% slower. I am not sure why that is the case at this time, but this branch seems to have a larger number of allocations and a larger memory footprint. Details below:

Master:

julia> CUDA.@time include("tmp/perf_test.jl")
110.211809 seconds (31.62 M CPU allocations: 6.957 GiB, 1.38% gc time) (13.50 k GPU allocations: 1.049 TiB, 0.18% memmgmt time)

This branch:

julia> CUDA.@time include("tmp/perf_test.jl")
  129.916144 seconds (33.52 M CPU allocations: 9.388 GiB, 1.40% gc time) (18.50 k GPU allocations: 1.051 TiB, 0.27% memmgmt time)

Average of 10 runs

This branch           master
130.66646728515624    100.83366224500868

Script:

using Molly
using CUDA

data_dir = joinpath(dirname(pathof(Molly)), "..", "data")
ff = MolecularForceField(
    joinpath(data_dir, "force_fields", "ff99SBildn.xml"),
    joinpath(data_dir, "force_fields", "tip3p_standard.xml"),
    joinpath(data_dir, "force_fields", "his.xml"),
)

sys = System(
    joinpath(data_dir, "6mrr_equil.pdb"),
    ff;
    loggers=(
        energy=TotalEnergyLogger(10),
        writer=StructureWriter(10, "traj_6mrr_1ps.pdb", ["HOH"]),
    ),
    array_type=CuArray,
)

minimizer = SteepestDescentMinimizer()
simulate!(sys, minimizer)

temp = 298.0u"K"
random_velocities!(sys, temp)
simulator = Langevin(
    dt=0.001u"ps",
    temperature=temp,
    friction=1.0u"ps^-1",
    coupling=MonteCarloBarostat(1.0u"bar", temp, sys.boundary),
)

simulate!(sys, simulator, 5_000)

@leios (Contributor Author) commented Dec 23, 2024

Good news. I'm an idiot. I seem to have run the tests on two different GPUs before. When running on the same GPU, the performance is the same:

All runs:

This branch           Master
129.29177856445312    123.61217498779297
118.34098815917969    111.21739959716797
107.20030975341797    124.71260070800781
126.77005004882812    123.9267349243164
127.49639892578125    109.17316436767578
129.89144897460938    108.7107925415039
96.02093505859375     92.04306030273438
96.4058837890625      109.63255310058594
95.76754760742188     91.27706909179688
95.88066864013672     109.9665298461914

Averages:

This branch           Master
112.30660095214844    110.42720794677734

So now we just need to:

  • Figure out the precision issue on the CI tests (I am also getting them locally)
  • Double-check AMD tests and check precision there as well
  • Change all gpu = true in docs to array_type = CuArray (sketched below)
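A sketch of the doc change in the last item, reusing data_dir and ff from the script above (hedged; the keyword names follow this thread, not released docs):

# Old docs (boolean flag):
# sys = System(joinpath(data_dir, "6mrr_equil.pdb"), ff; gpu=true)

# New docs (generic array type; any supported backend's array type works):
sys = System(joinpath(data_dir, "6mrr_equil.pdb"), ff; array_type=CuArray)
# sys = System(joinpath(data_dir, "6mrr_equil.pdb"), ff; array_type=ROCArray)   # with AMDGPU loaded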
