fix(pt): invalid type_map when multitask training #4031
Conversation
It seems that in 3.0.0b3, executing a multitask training or finetune task runs into a `RuntimeError` complaining about an inconsistent type map, even though the `type_map` in multitask mode should be a shared dict. Diving into the source code, we can see a `type_map` lookup [here](https://github.com/deepmodeling/deepmd-kit/blob/0e0fc1a63e478d3e56285b520b34a9c58488d659/deepmd/pt/entrypoints/main.py#L300). In multitask training it yields an empty `type_map`, because no `type_map` is found at that location. Signed-off-by: Futaki Haduki <812556867@qq.com>
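For context, a rough sketch of why that lookup comes back empty: in a single-task config the `type_map` sits directly under the model section, whereas in a multitask config each branch keeps its own `type_map` under `model_dict`. The configs below are hypothetical and only for illustration; the branch and element names are made up.

```python
# Hypothetical configs for illustration only; branch/element names are made up.
single_task_config = {
    "model": {
        "type_map": ["Ag", "Cu"],  # found by config["model"].get("type_map")
        # ... descriptor, fitting_net, ...
    }
}

multitask_config = {
    "model": {
        "model_dict": {
            "branch_1": {"type_map": ["Ag", "Cu"]},  # per-branch type_map
            "branch_2": {"type_map": ["Ag", "Cu"]},
        }
        # no top-level type_map here, so the old lookup finds nothing
    }
}

assert multitask_config["model"].get("type_map") is None
```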
Walkthrough

Sequence diagram(s):

```mermaid
sequenceDiagram
    participant User
    participant TrainFunction
    participant Config
    User->>TrainFunction: Start Training
    TrainFunction->>Config: Check multi_task flag
    alt multi_task == false
        TrainFunction->>Config: Retrieve type_map from config["model"]
    else multi_task == true
        TrainFunction->>Config: Retrieve type_map from config["model"]["model_dict"]
    end
    TrainFunction->>User: Training Initialized with type_map
```
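Read as code, the branching in the diagram amounts to a small conditional lookup. The sketch below only illustrates that logic, under the assumption that each multitask branch carries its own `type_map` entry under `model_dict`; the function name and return shape are made up and this is not the actual deepmd-kit implementation.

```python
def get_type_map(config: dict, multi_task: bool):
    """Illustrative sketch of the type_map retrieval the diagram describes.

    Single-task: read config["model"]["type_map"].
    Multitask: collect the per-branch type_map entries under
    config["model"]["model_dict"], keyed by branch name.
    """
    if not multi_task:
        return config["model"].get("type_map")
    return {
        name: branch.get("type_map")
        for name, branch in config["model"]["model_dict"].items()
    }
```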
Codecov Report

Additional details and impacted files:

```
@@            Coverage Diff            @@
##            devel    #4031    +/-   ##
=========================================
  Coverage   82.93%   82.93%
=========================================
  Files         522      522
  Lines       51036    51037       +1
  Branches     3028     3028
=========================================
+ Hits        42325    42329       +4
- Misses       7762     7763       +1
+ Partials      949      945       -4
```

☔ View full report in Codecov by Sentry.
It seems that multitask training is not covered by the unit tests.
Indeed, multitask training only has a unit test for …
@Cloudac7 Thank you for identifying and fixing the bug in the neighbor statistics calculation. You have correctly pinpointed the issue that occurs under specific conditions: when the user does not set … However, the error log you encountered likely arises from a difference between the …
via #4034
8394101
It seems that in 3.0.0b3, executing a multitask training or finetune task runs into a `RuntimeError` complaining about an inconsistent type map. The error log is shown below. However, the `type_map` in multitask mode should be a shared dict. Diving into the source code, we can see a `type_map` lookup [here](https://github.com/deepmodeling/deepmd-kit/blob/0e0fc1a63e478d3e56285b520b34a9c58488d659/deepmd/pt/entrypoints/main.py#L300). It yields an empty `type_map` in multitask training because no `type_map` is found at that location. After applying the modification in this PR, everything seems to work well.

```
Traceback (most recent call last):
  File "/public/home/ypliucat/.conda/envs/deepmd-kit-3.0.0b3/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/public/home/ypliucat/.conda/envs/deepmd-kit-3.0.0b3/lib/python3.10/site-packages/deepmd/pt/entrypoints/main.py", line 562, in main
    train(FLAGS)
  File "/public/home/ypliucat/.conda/envs/deepmd-kit-3.0.0b3/lib/python3.10/site-packages/deepmd/pt/entrypoints/main.py", line 311, in train
    train_data = get_data(
  File "/public/home/ypliucat/.conda/envs/deepmd-kit-3.0.0b3/lib/python3.10/site-packages/deepmd/utils/data_system.py", line 802, in get_data
    data = DeepmdDataSystem(
  File "/public/home/ypliucat/.conda/envs/deepmd-kit-3.0.0b3/lib/python3.10/site-packages/deepmd/utils/data_system.py", line 184, in __init__
    self.type_map = self._check_type_map_consistency(type_map_list)
  File "/public/home/ypliucat/.conda/envs/deepmd-kit-3.0.0b3/lib/python3.10/site-packages/deepmd/utils/data_system.py", line 616, in _check_type_map_consistency
    raise RuntimeError(f"inconsistent type map: {ret!s} {ii!s}")
RuntimeError: inconsistent type map: ['Ag', 'Cu'] ['Ag', 'Ni']
```

## Summary by CodeRabbit

- **New Features**
  - Enhanced the training process to ensure consistent handling of model type configurations, improving clarity and availability based on multi-task settings.

Signed-off-by: Futaki Haduki <812556867@qq.com>
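For context on the error itself: the traceback ends in a consistency check over the per-system type maps. The sketch below is an illustrative reimplementation of such a check, assuming that when no shared `type_map` is passed down (the multitask bug), each data system reports the type map inferred from its own data. It is not the actual `_check_type_map_consistency` code.

```python
def check_type_map_consistency(type_map_list):
    """Illustrative sketch (not the deepmd-kit implementation): successive
    per-system type maps must agree on their common prefix; the longer one
    becomes the running shared map, otherwise a RuntimeError is raised."""
    ret = None
    for tm in type_map_list:
        if tm is None:
            continue
        if ret is None:
            ret = tm
            continue
        n = min(len(ret), len(tm))
        if ret[:n] != tm[:n]:
            raise RuntimeError(f"inconsistent type map: {ret!s} {tm!s}")
        ret = ret if len(ret) >= len(tm) else tm
    return ret

# With no shared type_map passed down, every data system reports the map
# inferred from its own data, and unrelated maps cannot be reconciled:
try:
    check_type_map_consistency([["Ag", "Cu"], ["Ag", "Ni"]])
except RuntimeError as err:
    print(err)  # inconsistent type map: ['Ag', 'Cu'] ['Ag', 'Ni']
```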