Sorry, I still don't have access to a 4090 card. Obviously the 4090 was not running at full speed. If you can take a look before one of our colleagues investigates, the things I would check are: (1) Is the bottleneck in the synchronization? If I remember correctly, there are at least two full stops for an NVE step with the Bussi thermostat: one for the temperature and one for the neighbor list detection. There are more synchronizations in the induced dipole iterations. It's unlikely to be the bottleneck here, but I cannot simply rule it out. (2) Does cuFFT scale well with more cores? (3) If other kernels are not fully utilizing the cores, is it easy to identify where the bound is? This can possibly be tuned with block or grid sizes. It is hard to predict where the problem is without more detailed profiling information.
-
Are you using a 2 fs time step (RESPA)? This is the default to use with AMOEBA. Then you should get 60 ns/d for DHFR.
Tinker-HP can use an even larger time step, so it will be faster.
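The benchmark command quoted in this thread passes a 1.0 fs time step, so the 60 ns/d estimate can be sanity-checked with simple scaling. The sketch below is only a rough projection: it assumes the wall time per step stays the same when the step size changes, which may not hold exactly for a RESPA integrator. The function name `project_ns_per_day` is hypothetical, and the input 31.4891 ns/day is the RTX 4090 figure reported later in the thread.

```python
def project_ns_per_day(measured_ns_per_day, measured_dt_fs, new_dt_fs):
    """Scale a measured rate to a different time step.

    Assumption (not from the thread): the cost of one MD step does not
    change with the step size, so ns/day scales linearly with dt.
    """
    return measured_ns_per_day * new_dt_fs / measured_dt_fs


# RTX 4090 result reported in this thread at a 1.0 fs time step.
projected = project_ns_per_day(31.4891, measured_dt_fs=1.0, new_dt_fs=2.0)
print(f"Projected rate at 2 fs: {projected:.1f} ns/day")
```

Under that assumption the projected rate lands close to the 60 ns/d mentioned above, which suggests a large part of the apparent gap may come from the time step rather than from the GPU.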
From: PinkFoxK ***@***.***>
Sent: Wednesday, June 21, 2023 4:27 PM
To: TinkerTools/tinker9 ***@***.***>
Cc: Subscribed ***@***.***>
Subject: Re: [TinkerTools/tinker9] Simulation performance (rate) on the RTX 4090 probably can be twice faster? (Discussion #225)
Dear zhi-wang,
First of all, whenever we discuss the Tinker9 and Tinker-HP projects with the community, people are very grateful for the opportunity to get reasonable MD performance for polarizable force fields with a single GPU (Tinker9) or multiple GPUs (Tinker-HP). A lot of recent effort has gone into speeding up polarizable force fields through software, mathematics https://www.researchgate.net/publication/370492080_ANKH_A_Generalized_ON_Interpolated_Ewald_Strategy_for_Molecular_Dynamics_Simulations and use of the most powerful GPUs. Last year the RTX 4090 appeared, raising Tinker users' hopes of using AMOEBAbio, and especially the AMOEBApro force field, as one of the main instruments in MD. We tested MD performance (protein dhfr2, AMOEBAbio09) with Tinker9 on a single RTX 4090 and got about 31 ns/d (with GPU load of 94-97%), similar to the result reported by "keebborg". Moreover, for the same test protein in water, we compared the performance of the nonpolarizable classical Amber ff99SB force field: Tinker9/Gromacs = 490/1200 ns/d, so possibly some adaptation of Tinker9 for the RTX 4090 is needed. We have great hopes for Tinker-HP support of 2 or 4 RTX 4090 cards to speed up the calculations. Is it available? The final question is: what is the best choice for a single RTX 4090, Tinker9 or the latest Tinker-HP?
Thanks in advance for your response and your time!
-
Dear zhi-wang and pren,
-
Hello!
We have tested the simulation performance of Tinker9 on two different RTX 4090 cards and one RTX 3060. For the DHFR2 test, the RTX 4090 reached 30.5-31.5 ns/day and the RTX 3060 reached 9.82 ns/day.
RTX 4090/RTX 3060 performance ratio: ~3.2
For all simulation runs the command "tinker9 dynamic -k dhfr2.key dhfr2.xyz 10000 1.0 1.0 4 298 1.0" was used. The ratio of peak TFLOPS for RTX 4090/RTX 3060 is 82.58/12.74 = 6.48. Thus, the ratio of the theoretical maximum to the measured result is ~2 (6.48/3.2). It looks like the RTX 4090 loses a factor of 2 due to a bug or a lack of optimization for the Ada Lovelace architecture.
Has anyone come across a similar situation? Is it possible to fix the Tinker9 code?
GPU load in all cases is above 93%. The runs end at similar potential energies.
Our calculation parameters are listed below:
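The ratio argument above can be reproduced in a few lines. This is only a sketch of the arithmetic: the TFLOPS figures are the peak values quoted in this post, the ns/day values are the ones from the run logs below, and the variable names are illustrative. The peak-throughput ratio is treated here as an upper bound, since a real MD step may also be limited by synchronization or memory traffic.

```python
# Peak throughput quoted in the post (TFLOPS).
tflops_4090 = 82.58
tflops_3060 = 12.74

# Measured Tinker9 rates from the run logs (ns/day).
ns_day_4090 = 31.4891
ns_day_3060 = 9.8248

theoretical_ratio = tflops_4090 / tflops_3060   # peak-throughput ratio
measured_ratio = ns_day_4090 / ns_day_3060      # observed speedup
shortfall = theoretical_ratio / measured_ratio  # how far below the peak ratio

print(f"theoretical ratio: {theoretical_ratio:.2f}")
print(f"measured ratio:    {measured_ratio:.2f}")
print(f"shortfall factor:  {shortfall:.2f}")
```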
=======================================================================
GeForce RTX 3060 (NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1)
GPU utilization 97%
Graphics Clock - 1980 MHz
dhfr2.key
echo ################################################################
echo Joint Amber-CHARMM Benchmark on Dihydrofolate Reductase in Water
echo 23558 Atoms, 62.23 Ang Cube, 9 Ang Nonbond Cutoffs, 64x64x64 PME
echo ################################################################
parameters ../params/amoebabio09.prm
neighbor-list
a-axis 62.23
vdw-cutoff 12.0
ewald
ewald-cutoff 7.0
pme-grid 64 64 64
pme-order 5
polarization MUTUAL
integrator respa
thermostat bussi
barostat montecarlo
vdw-correction
fft-package FFTW
polarization OPT3 #OPT4 is more accurate but OPT3 is faster
polar-eps 0.001 # the induced dipole convergence threshold
polar-predict
tinker9 dynamic -k dhfr2.key dhfr2.xyz 10000 1.0 1.0 4 298 1.0
...
Current Time 10.0000 Picosecond
Current Potential -67552.9105 Kcal/mole
Current Kinetic 20880.8222 Kcal/mole
Lattice Lengths 61.931305 61.931305 61.931305
Lattice Angles 90.000000 90.000000 90.000000
Frame Number 10
Coordinate File dhfr2.arc
Performance: ns/day 9.8248
Wall Time 87.9409
Steps 10000
Updates 10
Time Step 1.0000
Atoms 23558
GeForce RTX 4090 Gaming OC (NVIDIA-SMI 525.60.13/530.30.02 Driver Version: 525.60.13/530.30.02 CUDA Version: 12.1)
GPU utilization 93%
Graphics Clock - 2775 MHz
echo Joint Amber-CHARMM Benchmark on Dihydrofolate Reductase in Water
echo 23558 Atoms, 62.23 Ang Cube, 9 Ang Nonbond Cutoffs, 64x64x64 PME
echo ################################################################
parameters ../params/amoebabio09.prm
neighbor-list
a-axis 62.23
vdw-cutoff 12.0
ewald
ewald-cutoff 7.0
pme-grid 64 64 64
pme-order 5
polarization MUTUAL
integrator respa
thermostat bussi
barostat montecarlo
vdw-correction
fft-package FFTW
polarization OPT3 #OPT4 is more accurate but OPT3 is faster
polar-eps 0.001 # the induced dipole convergence threshold
polar-predict
tinker9 dynamic -k dhfr2.key dhfr2.xyz 10000 1.0 1.0 4 298 1.0
…
Current Time 10.0000 Picosecond
Current Potential -67802.3152 Kcal/mole
Current Kinetic 21203.2177 Kcal/mole
Lattice Lengths 61.427605 61.427605 61.427605
Lattice Angles 90.000000 90.000000 90.000000
Frame Number 10
Coordinate File dhfr2.arc
Performance: ns/day 31.4891
Wall Time 27.4380
Steps 10000
Updates 10
Time Step 1.0000
Atoms 23558
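As a cross-check, the reported ns/day values can be recomputed from the step count, time step, and wall time printed in the two logs. The sketch below assumes the "Wall Time" field is in seconds and the "Time Step" field is in femtoseconds; neither unit is stated in the output, and the helper name `ns_per_day` is hypothetical.

```python
SECONDS_PER_DAY = 86400.0


def ns_per_day(steps, dt_fs, wall_s):
    """Simulated nanoseconds per day of wall-clock time.

    Assumes dt_fs is in femtoseconds and wall_s is in seconds.
    """
    simulated_ns = steps * dt_fs * 1e-6  # 1 fs = 1e-6 ns
    return simulated_ns / wall_s * SECONDS_PER_DAY


runs = {
    "RTX 3060": (10000, 1.0, 87.9409),
    "RTX 4090": (10000, 1.0, 27.4380),
}

for gpu, (steps, dt_fs, wall_s) in runs.items():
    rate = ns_per_day(steps, dt_fs, wall_s)
    ms_per_step = wall_s / steps * 1000.0
    print(f"{gpu}: {rate:.2f} ns/day, {ms_per_step:.2f} ms per step")
```

Both recomputed rates agree with the "Performance: ns/day" lines in the logs, so the reported numbers are consistent with 10 ps of simulated time per run.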
Thanks in advance!