Optimize grouping indexes #1044

GiovanniBussi · 2024-03-13T14:33:46Z

I am again working with this input file:

WHOLEMOLECULES ENTITY0=1-100000
c: CENTER ATOMS=1-100000
pos: POSITION ATOM=c
RESTRAINT ARG=pos.x AT=0.0 KAPPA=1

In this PR I tried to optimize loops like this:

 for(const auto & a : atom_value_ind) {
   plumed_dbg_massert( ind<forcesToApply.size(), "problem setting forces in " + getLabel() );
   std::size_t nn = a.first, kk = a.second;
   xpos[nn]->inputForce[kk] += forcesToApply[ind]; ind++;
   ypos[nn]->inputForce[kk] += forcesToApply[ind]; ind++;
   zpos[nn]->inputForce[kk] += forcesToApply[ind]; ind++;
 }

Into loops like this:

  for(const auto & a : atom_value_ind_grouped) {
    const auto nn=a.first;
    plumed_dbg_assert(ind<forcesToApply.size()) << "problem setting forces in " << getLabel();
    auto & xp=xpos[nn]->inputForce;
    auto & yp=ypos[nn]->inputForce;
    auto & zp=zpos[nn]->inputForce;
    for(const auto & kk : a.second) {
      xp[kk] += forcesToApply[ind]; ind++;
      yp[kk] += forcesToApply[ind]; ind++;
      zp[kk] += forcesToApply[ind]; ind++;
    }
  }

Here atom_value_ind_grouped is constructed when atom_value_ind is updated, and basically stores the same information in a different way, exployting the fact that for most of the iterations in the first implementation nn is constant.

@gtribello can you have a look and check if this makes sense? Is this a reasonable assumption on the memory access pattern?

The improvement is quite measurable. Below the comparison is:

v2.9,
the commit where I optimized using blas, (Replaced some bottleneck loop with blas #1043)
this change

The overhead decreases from 18% (using blas) to 12% (this commit).

BENCH:  Kernel:      /Users/bussi/plumed2/tmp/v2.9-mpi-ref/lib/libplumedKernel.dylib
BENCH:  Input:       plumed.dat
BENCH:  Comparative: 1.000 +- 0.000
BENCH:                                                Cycles        Total      Average      Minimum      Maximum
BENCH:  A Initialization                                   1     0.162712     0.162712     0.162712     0.162712
BENCH:  B0 First step                                      1     0.008648     0.008648     0.008648     0.008648
BENCH:  B1 Warm-up                                        99     0.450701     0.004553     0.004138     0.004917
BENCH:  B2 Calculation part 1                            200     0.914300     0.004572     0.004184     0.005419
BENCH:  B3 Calculation part 2                            200     0.907306     0.004537     0.004116     0.005461
PLUMED:                                               Cycles        Total      Average      Minimum      Maximum
PLUMED:                                                    1     2.440795     2.440795     2.440795     2.440795
PLUMED: 1 Prepare dependencies                           500     0.001112     0.000002     0.000001     0.000016
PLUMED: 2 Sharing data                                   500     0.143430     0.000287     0.000232     0.001424
PLUMED: 3 Waiting for data                               500     0.000257     0.000001     0.000000     0.000007
PLUMED: 4 Calculating (forward loop)                     500     1.663443     0.003327     0.002997     0.006315
PLUMED: 5 Applying (backward loop)                       500     0.461751     0.000924     0.000834     0.001461
PLUMED: 6 Update                                         500     0.000515     0.000001     0.000001     0.000003
BENCH:  
BENCH:  Kernel:      /Users/bussi/plumed2/tmp/reference/lib/libplumedKernel.dylib
BENCH:  Input:       plumed.dat
BENCH:  Comparative: 1.176 +- 0.004
BENCH:                                                Cycles        Total      Average      Minimum      Maximum
BENCH:  A Initialization                                   1     0.170923     0.170923     0.170923     0.170923
BENCH:  B0 First step                                      1     0.011203     0.011203     0.011203     0.011203
BENCH:  B1 Warm-up                                        99     0.530119     0.005355     0.004808     0.005859
BENCH:  B2 Calculation part 1                            200     1.073778     0.005369     0.004811     0.007811
BENCH:  B3 Calculation part 2                            200     1.068996     0.005345     0.004818     0.006343
PLUMED:                                               Cycles        Total      Average      Minimum      Maximum
PLUMED:                                                    1     2.852410     2.852410     2.852410     2.852410
PLUMED: 1 Prepare dependencies                           500     0.001712     0.000003     0.000001     0.000021
PLUMED: 2 Sharing data                                   500     0.173275     0.000347     0.000287     0.000940
PLUMED: 3 Waiting for data                               500     0.001615     0.000003     0.000001     0.000023
PLUMED: 4 Calculating (forward loop)                     500     1.899739     0.003799     0.003378     0.007691
PLUMED: 5 Applying (backward loop)                       500     0.595992     0.001192     0.001059     0.002648
PLUMED: 6 Update                                         500     0.002300     0.000005     0.000003     0.000042
BENCH:  
BENCH:  Kernel:      /Users/bussi/plumed2/src/lib/libplumedKernel.dylib
BENCH:  Input:       plumed.dat
BENCH:  Comparative: 1.125 +- 0.003
BENCH:                                                Cycles        Total      Average      Minimum      Maximum
BENCH:  A Initialization                                   1     0.176497     0.176497     0.176497     0.176497
BENCH:  B0 First step                                      1     0.009901     0.009901     0.009901     0.009901
BENCH:  B1 Warm-up                                        99     0.513381     0.005186     0.004732     0.005876
BENCH:  B2 Calculation part 1                            200     1.028328     0.005142     0.004676     0.005917
BENCH:  B3 Calculation part 2                            200     1.021688     0.005108     0.004611     0.005853
PLUMED:                                               Cycles        Total      Average      Minimum      Maximum
PLUMED:                                                    1     2.747601     2.747601     2.747601     2.747601
PLUMED: 1 Prepare dependencies                           500     0.001586     0.000003     0.000001     0.000005
PLUMED: 2 Sharing data                                   500     0.168293     0.000337     0.000287     0.000723
PLUMED: 3 Waiting for data                               500     0.001344     0.000003     0.000001     0.000016
PLUMED: 4 Calculating (forward loop)                     500     1.850951     0.003702     0.003321     0.007262
PLUMED: 5 Applying (backward loop)                       500     0.540332     0.001081     0.000951     0.001962
PLUMED: 6 Update                                         500     0.002144     0.000004     0.000003     0.000026

src/core/ActionWithValue.cpp

+      // for(unsigned j=0; j<forcesForApply.size(); ++j) {
+      // forcesForApply[j]+=omp_f[j];
+      // }


src/core/ActionWithValue.cpp

+      // if( nt>1 ) for(unsigned j=0; j<nder; ++j) omp_f[j] += ff*thisderiv[1+j];
+      //else for(unsigned j=0; j<nder; ++j) forcesForApply[j] += ff*thisderiv[1+j];


src/core/ActionAtomistic.cpp

+// for(const auto & a : atom_value_ind) {
+//   plumed_dbg_massert( ind<forcesToApply.size(), "problem setting forces in " + getLabel() );
+//   std::size_t nn = a.first, kk = a.second;
+//   xpos[nn]->inputForce[kk] += forcesToApply[ind]; ind++;
+//   ypos[nn]->inputForce[kk] += forcesToApply[ind]; ind++;
+//   zpos[nn]->inputForce[kk] += forcesToApply[ind]; ind++;
+// }


src/core/ActionAtomistic.cpp

+// for(const auto & a : atom_value_ind) {
+//   std::size_t nn = a.first, kk = a.second;
+//   positions[j][0] = xpos[nn]->data[kk];
+//   positions[j][1] = ypos[nn]->data[kk];
+//   positions[j][2] = zpos[nn]->data[kk];
+//   charges[j] = chargev[nn]->data[kk];
+//   masses[j] = masv[nn]->data[kk];
+//   j++;
+// }


src/core/ActionWithVirtualAtom.cpp

+// for(const auto & a : atom_value_ind) {
+//   std::size_t nn = a.first, kk = a.second;
+//   xpos[nn]->inputForce[kk] += xf*xval->data[1+k] + yf*yval->data[1+k] + zf*zval->data[1+k]; k++;
+//   ypos[nn]->inputForce[kk] += xf*xval->data[1+k] + yf*yval->data[1+k] + zf*zval->data[1+k]; k++;
+//   zpos[nn]->inputForce[kk] += xf*xval->data[1+k] + yf*yval->data[1+k] + zf*zval->data[1+k]; k++;
+// }


src/core/ActionAtomistic.cpp

+    atom_value_ind_grouped.back().second.push_back(kk);
+    auto prev_nn=nn;
+    for(unsigned i=1; i<atom_value_ind.size(); i++) {
+      auto nn = atom_value_ind[i].first;


src/core/ActionAtomistic.cpp

+    auto prev_nn=nn;
+    for(unsigned i=1; i<atom_value_ind.size(); i++) {
+      auto nn = atom_value_ind[i].first;
+      auto kk = atom_value_ind[i].second;


GiovanniBussi · 2024-03-14T08:36:36Z

The improvement with intel compiler on my workstation is less (29% to 27% overhead), but still measurable.

BENCH:  Kernel:      /scratch/bussi/plumed2/tmp/intel-v2.9/lib/libplumedKernel.so
BENCH:  Input:       plumed.dat
BENCH:  Comparative: 1.000 +- 0.000
BENCH:                                                Cycles        Total      Average      Minimum      Maximum
BENCH:  A Initialization                                   1     0.424078     0.424078     0.424078     0.424078
BENCH:  B0 First step                                      1     0.017435     0.017435     0.017435     0.017435
BENCH:  B1 Warm-up                                        99     0.899508     0.009086     0.008296     0.012992
BENCH:  B2 Calculation part 1                            200     1.743061     0.008715     0.008302     0.011589
BENCH:  B3 Calculation part 2                            200     1.745426     0.008727     0.008314     0.011583
PLUMED:                                               Cycles        Total      Average      Minimum      Maximum
PLUMED:                                                    1     4.821678     4.821678     4.821678     4.821678
PLUMED: 1 Prepare dependencies                           500     0.001949     0.000004     0.000002     0.000015
PLUMED: 2 Sharing data                                   500     0.404638     0.000809     0.000504     0.002490
PLUMED: 3 Waiting for data                               500     0.000810     0.000002     0.000001     0.000013
PLUMED: 4 Calculating (forward loop)                     500     3.145875     0.006292     0.006110     0.012230
PLUMED: 5 Applying (backward loop)                       500     0.824214     0.001648     0.001609     0.002594
PLUMED: 6 Update                                         500     0.001113     0.000002     0.000001     0.000006
BENCH:  
BENCH:  Kernel:      /scratch/bussi/plumed2/tmp/intel-reference/lib/libplumedKernel.so
BENCH:  Input:       plumed.dat
BENCH:  Comparative: 1.287 +- 0.003
BENCH:                                                Cycles        Total      Average      Minimum      Maximum
BENCH:  A Initialization                                   1     0.440200     0.440200     0.440200     0.440200
BENCH:  B0 First step                                      1     0.022010     0.022010     0.022010     0.022010
BENCH:  B1 Warm-up                                        99     1.163765     0.011755     0.010770     0.017656
BENCH:  B2 Calculation part 1                            200     2.243936     0.011220     0.010745     0.017048
BENCH:  B3 Calculation part 2                            200     2.243593     0.011218     0.010752     0.017610
PLUMED:                                               Cycles        Total      Average      Minimum      Maximum
PLUMED:                                                    1     6.106817     6.106817     6.106817     6.106817
PLUMED: 1 Prepare dependencies                           500     0.002200     0.000004     0.000002     0.000014
PLUMED: 2 Sharing data                                   500     0.343587     0.000687     0.000527     0.001934
PLUMED: 3 Waiting for data                               500     0.002487     0.000005     0.000004     0.000024
PLUMED: 4 Calculating (forward loop)                     500     4.122769     0.008246     0.007840     0.015409
PLUMED: 5 Applying (backward loop)                       500     1.176878     0.002354     0.002314     0.004537
PLUMED: 6 Update                                         500     0.004813     0.000010     0.000009     0.000021
BENCH:  
BENCH:  Kernel:      /scratch/bussi/plumed2/src/lib/libplumedKernel.so
BENCH:  Input:       plumed.dat
BENCH:  Comparative: 1.267 +- 0.002
BENCH:                                                Cycles        Total      Average      Minimum      Maximum
BENCH:  A Initialization                                   1     0.445317     0.445317     0.445317     0.445317
BENCH:  B0 First step                                      1     0.027140     0.027140     0.027140     0.027140
BENCH:  B1 Warm-up                                        99     1.140178     0.011517     0.010588     0.016933
BENCH:  B2 Calculation part 1                            200     2.206530     0.011033     0.010579     0.015644
BENCH:  B3 Calculation part 2                            200     2.213058     0.011065     0.010590     0.015673
PLUMED:                                               Cycles        Total      Average      Minimum      Maximum
PLUMED:                                                    1     6.026409     6.026409     6.026409     6.026409
PLUMED: 1 Prepare dependencies                           500     0.002142     0.000004     0.000003     0.000014
PLUMED: 2 Sharing data                                   500     0.344054     0.000688     0.000530     0.003223
PLUMED: 3 Waiting for data                               500     0.002326     0.000005     0.000004     0.000026
PLUMED: 4 Calculating (forward loop)                     500     4.091521     0.008183     0.007775     0.019535
PLUMED: 5 Applying (backward loop)                       500     1.123146     0.002246     0.002201     0.004265
PLUMED: 6 Update                                         500     0.004250     0.000009     0.000008     0.000018

gtribello · 2024-03-14T09:49:14Z

Hi @GiovanniBussi

This looks fine to me. As far as I am concerned you can go ahead and merge.

Thanks for taking the time to optimise the code.

GiovanniBussi added 2 commits March 13, 2024 12:52

Replaced some bottleneck loop with blas

8701a7d

Group indexes to make memory access faster

b2a66a5

GiovanniBussi mentioned this pull request Mar 13, 2024

Optimize wholemolecules #1045

Open

github-advanced-security bot found potential problems Mar 13, 2024

View reviewed changes

GiovanniBussi merged commit 7b7bedf into master Mar 14, 2024
34 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimize grouping indexes #1044

Optimize grouping indexes #1044

GiovanniBussi commented Mar 13, 2024

GiovanniBussi commented Mar 14, 2024

gtribello commented Mar 14, 2024

		// if( nt>1 ) for(unsigned j=0; j<nder; ++j) omp_f[j] += ff*thisderiv[1+j];
		//else for(unsigned j=0; j<nder; ++j) forcesForApply[j] += ff*thisderiv[1+j];

Optimize grouping indexes #1044

Optimize grouping indexes #1044

Conversation

GiovanniBussi commented Mar 13, 2024

GiovanniBussi commented Mar 14, 2024

gtribello commented Mar 14, 2024