-
Notifications
You must be signed in to change notification settings - Fork 527
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(cc): fix message passing when nghost is 0 and send list is empty #4237
Conversation
for more information, see https://pre-commit.ci
📝 WalkthroughWalkthroughThe changes in this pull request primarily focus on the Changes
Possibly related PRs
Suggested reviewers
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (1)
source/op/pt/comm.cc (1)
255-268
: Efficient optimization for backward passThe changes effectively optimize the backward pass by conditionally allocating tensors only when
nswap != 0
. This improvement aligns with the PR objective and enhances memory efficiency.Consider adding a brief comment explaining the purpose of
max_recvnum
for improved code clarity:// Allocate receive buffer large enough for the maximum number of atoms received in any swap int max_recvnum = sendnum_tensor.max().item<int>();
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (3)
- source/api_cc/src/DeepPotPT.cc (2 hunks)
- source/lmp/tests/test_lammps_dpa_pt_nopbc.py (0 hunks)
- source/op/pt/comm.cc (2 hunks)
💤 Files with no reviewable changes (1)
- source/lmp/tests/test_lammps_dpa_pt_nopbc.py
🧰 Additional context used
🔇 Additional comments (4)
source/op/pt/comm.cc (2)
123-131
: Excellent optimization for empty send listsThe changes effectively address the PR objective by introducing a conditional check for
nsend != 0
. This optimization prevents unnecessary tensor operations when there's nothing to send, improving efficiency and avoiding potential issues with empty tensors.
Line range hint
1-431
: Summary: Effective optimization of message passingThe changes in this file successfully address the PR objective of fixing message passing when the send list is empty. The optimizations in both the forward and backward passes improve efficiency by avoiding unnecessary operations and tensor allocations. These changes enhance the overall performance and robustness of the Border class, particularly in edge cases with empty send lists or no swaps.
source/api_cc/src/DeepPotPT.cc (2)
Line range hint
307-332
: Verify all call sites for the updatedcompute
method signatureThe
compute
method signature has been modified to remove thenghost
andlmp_list
parameters. This change suggests that the method no longer relies on external neighbor list information. While this simplification might improve the interface, it's crucial to ensure that all call sites in the codebase have been updated accordingly.Please run the following verification to check for any remaining calls with the old signature:
#!/bin/bash # Search for calls to compute with the old signature rg "compute\s*\([^)]*nghost" source/ rg "compute\s*\([^)]*lmp_list" source/If any results are found, they will need to be updated to match the new signature. Additionally, consider updating the documentation to reflect this change in the method's interface.
Also applies to: 431-456
Line range hint
171-221
: Verify the impact of simplified message passing logicThe changes to the message passing logic have simplified the conditions, which aligns with the PR objective. However, this simplification might have some implications:
- The removal of the
nghost > 0
check could lead to unnecessary operations when there are no ghost atoms.- The behavior in edge cases (e.g., when
nghost
is 0 and the send list is empty) might have changed.To ensure these changes don't introduce any regressions, please run the following verification:
Additionally, consider adding more test cases to cover these edge scenarios if they don't already exist.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (1)
source/api_cc/src/DeepPotPT.cc (1)
Line range hint
171-189
: LGTM! Consider adding a comment for clarity.The changes effectively address the PR objective by simplifying the message passing condition and handling communication data more robustly. This should resolve issues when
nghost
is 0 and the send list is empty.Consider adding a brief comment explaining the purpose of the
communicator_tensor
initialization, especially the reason for the empty tensor whenlmp_list.world == 0
. This would improve code readability and maintainability.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
- source/api_cc/src/DeepPotPT.cc (3 hunks)
🧰 Additional context used
🔇 Additional comments (2)
source/api_cc/src/DeepPotPT.cc (2)
Line range hint
1-577
: Overall changes align well with PR objectives.The modifications in this file are focused on the
compute
method, particularly on simplifying the message passing logic and tensor operations. These changes directly address the PR objective of fixing message passing whennghost
is 0 and the send list is empty.The targeted nature of the changes, along with the unchanged error handling and type mapping, suggests a careful and focused approach to resolving the specific issues mentioned in the PR description.
225-225
: Verify the initialization ofmapping_tensor
.The simplification of the condition for calling different
run_method
variants is consistent with the earlier changes. However,mapping_tensor
is being used here without a visible initialization in the provided code segment.Please ensure that
mapping_tensor
is properly initialized before this line. If it's initialized elsewhere in the file, consider moving the initialization closer to its usage or adding a comment to clarify where it's set up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (2)
source/api_cc/src/DeepPotPT.cc (2)
Confirmed:
const_cast
Usage in DeepPotPT.ccThe use of
const_cast
insource/api_cc/src/DeepPotPT.cc
lines 180-189 is confirmed and may lead to undefined behavior. Please refactor to eliminate the need forconst_cast
.🔗 Analysis chain
Line range hint
171-186
: Review the new tensor operations for MPI communicationThe new tensor operations for MPI communication data look good and should improve performance by using zero-copy tensor creation with
torch::from_blob
. However, there's a potential issue with the use ofconst_cast
forlmp_list.world
.The use of
const_cast
in line 184-185 could lead to undefined behavior if the underlying data is modified. Consider refactoring to avoid the need forconst_cast
. For example:- communicator_tensor = torch::from_blob(const_cast<void*>(lmp_list.world), - {1}, torch::kInt64); + communicator_tensor = torch::from_blob(const_cast<std::int64_t*>(static_cast<const std::int64_t*>(lmp_list.world)), + {1}, torch::kInt64);This change makes the cast more explicit and type-safe. However, it's crucial to ensure that
lmp_list.world
is indeed of typestd::int64_t*
.To ensure that this change doesn't introduce any issues, please run the following script to check for any other instances of
const_cast
in the codebase:This will help identify any other potential issues related to const-correctness in the codebase.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Search for other instances of const_cast rg "const_cast" --type cppLength of output: 570
Remaining
nghost
References DetectedThe search found multiple instances of
nghost
across various files in the codebase. These remnants indicate that the removal of ghost atom checks was incomplete and may lead to inconsistencies.
source/op/pt/comm.cc
source/op/tf/pairwise.cc
source/lmp/pair_deepmd.cpp
source/lmp/fix_dplr.cpp
source/lmp/pppm_dplr.cpp
source/lmp/compute_deeptensor_atom.cpp
source/api_c/tests/test_select_by_type.cc
source/api_c/tests/test_dipolecharge.cc
source/api_cc/src/common.cc
source/api_cc/src/DeepTensorTF.cc
source/api_cc/tests/test_select_by_type.cc
source/api_cc/src/DeepPot.cc
source/api_cc/src/DataModifierTF.cc
source/api_cc/src/DataModifier.cc
source/api_cc/src/DeepTensor.cc
source/api_c/include/deepmd.hpp
source/api_c/src/c_api.cc
source/api_cc/include/commonTF.h
source/api_cc/include/common.h
source/api_cc/include/DeepTensorTF.h
source/api_cc/include/DeepPotTF.h
source/api_cc/include/DeepPotPT.h
source/api_cc/include/DeepPot.h
source/api_cc/include/DataModifierTF.h
source/api_cc/include/DeepTensor.h
source/api_cc/include/DataModifier.h
Please address these remaining references to ensure the removal is thorough and the system behaves as expected.
🔗 Analysis chain
Line range hint
171-186
: Verify the impact of removing ghost atom checksThe removal of
nghost
checks simplifies the code, but it may affect the system's behavior, especially in multi-node simulations or when dealing with boundary conditions.Please confirm that this change doesn't introduce any issues in the following scenarios:
- Single-node simulations with periodic boundary conditions
- Multi-node simulations with domain decomposition
To assist in verification, you can run the following script to check for any remaining references to
nghost
in the codebase:This will help ensure that the removal of ghost atom checks is consistent across the codebase.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Search for remaining references to nghost rg "nghost" --type cppLength of output: 43957
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (2)
- source/api_cc/src/DeepPotPT.cc (3 hunks)
- source/api_cc/tests/test_deeppot_dpa_pt.cc (0 hunks)
💤 Files with no reviewable changes (1)
- source/api_cc/tests/test_deeppot_dpa_pt.cc
🧰 Additional context used
🔇 Additional comments (1)
source/api_cc/src/DeepPotPT.cc (1)
Line range hint
221-230
: Review changes in the module's forward method callThe modification to the forward method call, which now includes a check for
do_message_passing
, is a good improvement. It allows for conditional execution of message passing in the neural network model, which can potentially improve performance when message passing is not required.The change aligns well with the removal of ghost atom checks, as it moves the communication handling to a higher level within the model.
To ensure that this change is consistently applied and doesn't introduce any issues, please run the following script:
#!/bin/bash # Search for other instances of forward_lower method calls rg "forward_lower" --type cpp # Check for any remaining references to ghost atoms in forward calls rg "forward.*ghost" --type cppThis will help verify that the changes are consistent across the codebase and that there are no remaining references to ghost atoms in forward method calls that might have been missed.
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## devel #4237 +/- ##
==========================================
- Coverage 84.55% 84.55% -0.01%
==========================================
Files 537 537
Lines 51237 51244 +7
Branches 3047 3051 +4
==========================================
+ Hits 43324 43327 +3
- Misses 6965 6967 +2
- Partials 948 950 +2 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR seems to remove #4144, the minimal DPA-2 support for the old interface? I do need it work for AmberTools.
for more information, see https://pre-commit.ci
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (1)
source/api_cc/src/DeepPotPT.cc (1)
Line range hint
225-231
: Refactor the ternary operator for improved readabilityUsing a ternary operator with a complex multi-line function call can reduce code readability and maintainability. It can make the code harder to understand and debug.
Consider refactoring this code segment to use an
if-else
statement for better clarity.Apply this diff to refactor the code:
-c10::Dict<c10::IValue, c10::IValue> outputs = - (do_message_passing == 1) - ? module - .run_method("forward_lower", coord_wrapped_Tensor, atype_Tensor, - firstneigh_tensor, mapping_tensor, fparam_tensor, - aparam_tensor, do_atom_virial_tensor, comm_dict) - .toGenericDict() - : module - .run_method("forward_lower", coord_wrapped_Tensor, atype_Tensor, - firstneigh_tensor, mapping_tensor, fparam_tensor, - aparam_tensor, do_atom_virial_tensor) - .toGenericDict(); +c10::Dict<c10::IValue, c10::IValue> outputs; +if (do_message_passing == 1) { + outputs = module + .run_method("forward_lower", coord_wrapped_Tensor, atype_Tensor, + firstneigh_tensor, mapping_tensor, fparam_tensor, + aparam_tensor, do_atom_virial_tensor, comm_dict) + .toGenericDict(); +} else { + outputs = module + .run_method("forward_lower", coord_wrapped_Tensor, atype_Tensor, + firstneigh_tensor, mapping_tensor, fparam_tensor, + aparam_tensor, do_atom_virial_tensor) + .toGenericDict(); +}
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (2)
- source/api_cc/src/DeepPotPT.cc (3 hunks)
- source/op/pt/comm.cc (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- source/op/pt/comm.cc
🧰 Additional context used
🔇 Additional comments (1)
source/api_cc/src/DeepPotPT.cc (1)
Line range hint
171-190
: Validate the safety of castinglmp_list.world
The use of
const_cast<void*>(lmp_list.world)
when creatingcommunicator_tensor
may lead to undefined behavior iflmp_list.world
is not a valid pointer or if it isnullptr
. This could potentially cause segmentation faults or other runtime errors.Please ensure that
lmp_list.world
is a valid and appropriately typed pointer before casting. Consider adding a check to verify its validity or using safer casting mechanisms.
fix errors mentioned in following pr:
#4220
#4209
#4144
Summary by CodeRabbit
New Features
Bug Fixes
Refactor