[Feature Request] Compute `prod_env_mat` OP in parallel in the dimension of the frame #2618

njzjz · 2023-06-16T05:12:45Z

Summary

Currently, the `prod_env_mat` OP and its kernel are parallelized only over atoms, not over frames. This is not a problem when the training batch size is small or when running MD simulations, but it degrades performance when the training batch size is large or during inference (`dp test` and `dp model-devi`) on modern GPUs with large memory.

In #2600 and #2601, I refactored prod_force and prod_force_grad. A similar thing should be applied to prod_env_mat.

Detailed Description

The current code is:

deepmd-kit/source/op/prod_env_mat_multi_device.cc

Lines 1150 to 1151 in 92ca097

    
    // loop over samples
    for (int_64 ff = 0; ff < nsamples; ++ff) {

This loop should be avoided, at least on GPUs.
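As a rough illustration of what removing the frame loop could look like, the sketch below contrasts the nested frame/atom loop with a single flattened index space over `nframes * nloc`. It is not the deepmd-kit implementation: `per_atom_work`, `per_frame_loop`, and `flattened_loop` are hypothetical names, the per-atom computation is a placeholder, and the sketch ignores the per-frame neighbor list construction that the real OP also has to handle.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the per-atom body of the real kernel.
inline double per_atom_work(double x) { return 2.0 * x; }

// Pattern described in the issue: a serial outer loop over frames, where only
// the inner atom loop would be parallelized (e.g. one launch per frame).
std::vector<double> per_frame_loop(const std::vector<double>& in,
                                   std::size_t nframes, std::size_t nloc) {
  assert(in.size() == nframes * nloc);
  std::vector<double> out(in.size());
  for (std::size_t ff = 0; ff < nframes; ++ff) {
    for (std::size_t ii = 0; ii < nloc; ++ii) {
      out[ff * nloc + ii] = per_atom_work(in[ff * nloc + ii]);
    }
  }
  return out;
}

// Flattened alternative: one index space over (frame, atom) pairs. Every
// iteration is independent, so on a GPU each idx could map to one thread.
std::vector<double> flattened_loop(const std::vector<double>& in,
                                   std::size_t nframes, std::size_t nloc) {
  assert(in.size() == nframes * nloc);
  std::vector<double> out(in.size());
  for (std::size_t idx = 0; idx < nframes * nloc; ++idx) {
    const std::size_t ff = idx / nloc;  // frame index
    const std::size_t ii = idx % nloc;  // atom index within the frame
    out[ff * nloc + ii] = per_atom_work(in[ff * nloc + ii]);
  }
  return out;
}
```

Both functions produce the same output; the difference is only in how much independent work is exposed at once, which is what matters for GPU utilization with large batches.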

Further Information, Files, and Links

No response


This PR uses a trick to speed up the pairwise DPRc model. Since #2618 is not ready and is quite difficult to implement, this PR merges multiple frames into one frame before feeding them to the `prod_env_mat` OP, and fakes the mesh so that the OP behaves as it would for multiple frames. A new `mesh` shape is proposed. The first element stores `nloc`, and the following 15 elements store nothing, which distinguishes it from other meshes. Elements `(16 : 16 + nloc)` store `ilist`, elements `(16 + nloc : 16 + nloc * 2)` store `numneigh`, and the remaining elements (`sum(numneigh)` in total) store the neighbors. The `nei_mode` is 4 in this case. The `prod_env_mat` OP is no longer a bottleneck, as shown below. ![image](https://github.com/deepmodeling/deepmd-kit/assets/9496702/eea64b99-d630-4ea1-99f4-e7d49c126c33) --------- Signed-off-by: Jinzhe Zeng <jinzhe.zeng@rutgers.edu>
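The mesh layout described above can be sketched as a flat integer array. The helpers `pack_mesh` and `neighbors_of` below are hypothetical and not part of deepmd-kit; filling the 15 unused header slots with zeros is an assumption, since the description only says they store nothing.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper: packs a neighbor list into the flat layout
// [nloc | 15 unused | ilist (nloc) | numneigh (nloc) | neighbors (sum(numneigh))].
std::vector<int> pack_mesh(const std::vector<int>& ilist,
                           const std::vector<int>& numneigh,
                           const std::vector<int>& neighbors) {
  const std::size_t nloc = ilist.size();
  assert(numneigh.size() == nloc);
  assert(neighbors.size() ==
         static_cast<std::size_t>(
             std::accumulate(numneigh.begin(), numneigh.end(), 0)));
  std::vector<int> mesh(16, 0);  // slots 1..15 unused; zero is an assumption
  mesh[0] = static_cast<int>(nloc);
  mesh.insert(mesh.end(), ilist.begin(), ilist.end());
  mesh.insert(mesh.end(), numneigh.begin(), numneigh.end());
  mesh.insert(mesh.end(), neighbors.begin(), neighbors.end());
  return mesh;
}

// Hypothetical helper: reads back the neighbors of the i-th entry of ilist.
std::vector<int> neighbors_of(const std::vector<int>& mesh, std::size_t i) {
  const std::size_t nloc = static_cast<std::size_t>(mesh[0]);
  assert(i < nloc);
  const int* numneigh = mesh.data() + 16 + nloc;
  const int* nbr = mesh.data() + 16 + 2 * nloc;
  std::size_t offset = 0;  // start of atom i's neighbors in the packed tail
  for (std::size_t k = 0; k < i; ++k) {
    offset += static_cast<std::size_t>(numneigh[k]);
  }
  return std::vector<int>(nbr + offset, nbr + offset + numneigh[i]);
}
```

For three local atoms with `numneigh = {2, 1, 2}`, the packed array has `16 + 3 + 3 + 5` elements, and the neighbors of each atom are recovered by walking the `numneigh` prefix sums.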

Fix #2619. The GPU implementation in this PR is usually faster than the CPU implementation running on one thread (i.e., not using the feature implemented in #1624). Still, it needs parallelism in the batch dimension for building the neighbor list, which is blocked by #2618. The GPU utilization is less than 10% for the water system. It should improve when #2618 makes progress. --------- Signed-off-by: Jinzhe Zeng <jinzhe.zeng@rutgers.edu>

njzjz added the enhancement label Jun 16, 2023

njzjz mentioned this issue Sep 17, 2023

make the pairwise DPRc model 2x faster #2833

Merged

njzjz mentioned this issue Oct 4, 2023

support neighbor stat on GPUs #2897

Merged
