feat(pt): support CPU parallel training with PT (#4224)
Fixes #4132.

## Summary by CodeRabbit

- **New Features**
  - Enhanced backend selection for distributed training, allowing for flexible use of NCCL or Gloo based on availability.

- **Bug Fixes**
  - Corrected indentation for improved code clarity.

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
iProzd and pre-commit-ci[bot] authored Oct 23, 2024
1 parent 18026eb commit a74d963
Showing 1 changed file with 1 addition and 2 deletions.
3 changes: 1 addition & 2 deletions in deepmd/pt/entrypoints/main.py

```diff
@@ -105,8 +105,7 @@ def get_trainer(
     local_rank = os.environ.get("LOCAL_RANK")
     if local_rank is not None:
         local_rank = int(local_rank)
-        assert dist.is_nccl_available()
-        dist.init_process_group(backend="nccl")
+        dist.init_process_group(backend="cuda:nccl,cpu:gloo")

     def prepare_trainer_input_single(
         model_params_single, data_dict_single, rank=0, seed=None
```
