[Feature Request] Compute prod_env_mat OP in parallel in the dimension of the frame #2618

Open
njzjz opened this issue Jun 16, 2023 · 0 comments

njzjz commented Jun 16, 2023

Summary

Currently, the prod_env_mat OP and its kernel are parallelized only over atoms, not over frames. This is not a problem when the training batch size is small or when running MD simulations, but it degrades performance when the training batch size is large or during inference (dp test and dp model-devi) on modern GPUs with large memory.

In #2600 and #2601, I refactored prod_force and prod_force_grad. A similar refactoring should be applied to prod_env_mat.

Detailed Description

The current code is:

// loop over samples
for (int_64 ff = 0; ff < nsamples; ++ff) {

This loop should be avoided, at least on GPUs.
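As a rough illustration of what removing the frame loop could look like, the sketch below replaces the nested frame/atom loops with a single flattened index over (frame, atom) pairs, which is the shape that maps onto one GPU thread per pair. It is not the actual kernel: `per_atom_work` and `env_rows_flattened` are hypothetical names, and the per-atom work is a stand-in that simply sums an atom's coordinates.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the real per-atom environment-matrix work:
// it only sums the x, y, z coordinates of atom `ii` in one frame.
inline double per_atom_work(const double* coord_frame, std::int64_t ii) {
  return coord_frame[3 * ii] + coord_frame[3 * ii + 1] + coord_frame[3 * ii + 2];
}

// One flat index over all (frame, atom) pairs instead of an outer loop
// over frames. On a GPU, each `idx` would correspond to one thread.
std::vector<double> env_rows_flattened(const std::vector<double>& coord,
                                       std::int64_t nsamples,
                                       std::int64_t nloc) {
  assert(coord.size() == static_cast<std::size_t>(nsamples * nloc * 3));
  std::vector<double> out(static_cast<std::size_t>(nsamples * nloc));
  for (std::int64_t idx = 0; idx < nsamples * nloc; ++idx) {
    const std::int64_t ff = idx / nloc;  // frame index
    const std::int64_t ii = idx % nloc;  // atom index within the frame
    out[static_cast<std::size_t>(idx)] =
        per_atom_work(coord.data() + ff * nloc * 3, ii);
  }
  return out;
}
```

The sketch assumes every frame has the same `nloc` and that per-atom work in one frame does not depend on results from another frame; the real OP also has to build or read neighbor lists per frame, which this example leaves out.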

Further Information, Files, and Links

No response

wanghan-iapcm pushed a commit that referenced this issue Sep 19, 2023
This PR uses a trick to speed up the pairwise DPRc model. Because
#2618 is not ready and is quite difficult to implement, this PR merges
multiple frames into one frame before feeding them to the `prod_env_mat`
OP, and fakes the mesh so that the OP behaves as it would for the
separate frames.
A new `mesh` layout is proposed. The first element stores `nloc`, and the
following 15 elements are left unused to distinguish this layout from
other meshes. Elements `(16 : 16 + nloc)` store `ilist`, elements
`(16 + nloc : 16 + nloc * 2)` store `numneigh`, and the remaining
`sum(numneigh)` elements store the neighbors. The `nei_mode` is 4 in
this case.
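
The layout described above can be sketched as a small packing routine. This is only an illustration of the element order, not code from the PR: `pack_mesh` is a hypothetical helper, the `int` element type is an assumption, and filling the 15 unused elements with zeros is also an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: packs a neighbor list into the flat mesh layout
// [nloc | 15 unused | ilist | numneigh | concatenated neighbors].
std::vector<int> pack_mesh(const std::vector<int>& ilist,
                           const std::vector<std::vector<int>>& neighbors) {
  assert(ilist.size() == neighbors.size());
  const std::size_t nloc = ilist.size();
  // Header (16 elements) plus ilist and numneigh; unused slots are zero
  // here by assumption.
  std::vector<int> mesh(16 + 2 * nloc, 0);
  mesh[0] = static_cast<int>(nloc);
  for (std::size_t i = 0; i < nloc; ++i) {
    mesh[16 + i] = ilist[i];                                    // ilist
    mesh[16 + nloc + i] = static_cast<int>(neighbors[i].size());  // numneigh
  }
  // The remaining sum(numneigh) elements hold each atom's neighbors in order.
  for (const auto& nb : neighbors) {
    mesh.insert(mesh.end(), nb.begin(), nb.end());
  }
  return mesh;
}
```

For two local atoms with neighbor lists `{1}` and `{0, 2}`, the packed mesh has `16 + 2 * 2 + 3 = 23` elements.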

`prod_env_mat` OP is not a bottleneck anymore, as shown below.

![image](https://github.com/deepmodeling/deepmd-kit/assets/9496702/eea64b99-d630-4ea1-99f4-e7d49c126c33)

---------

Signed-off-by: Jinzhe Zeng <jinzhe.zeng@rutgers.edu>
wanghan-iapcm pushed a commit that referenced this issue Oct 7, 2023
Fix #2619.

The GPU implementation in this PR is usually faster than the CPU
implementation running in one thread (i.e., not using the feature
implemented in #1624). Still, it needs parallelism in the batch
dimension when building the neighbor list, which is blocked by #2618.
The GPU utilization is less than 10% for the water system. It should
improve once #2618 makes progress.

---------

Signed-off-by: Jinzhe Zeng <jinzhe.zeng@rutgers.edu>