feat(pt): support CPU parallel training with PT (deepmodeling#4224)
Fix deepmodeling#4132.

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
  - Enhanced backend selection for distributed training, allowing for flexible use of NCCL or Gloo based on availability.

- **Bug Fixes**
  - Corrected indentation for improved code clarity.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
iProzd and pre-commit-ci[bot] authored Oct 23, 2024
1 parent 18026eb commit a74d963
Showing 1 changed file with 1 addition and 2 deletions.
3 changes: 1 addition & 2 deletions deepmd/pt/entrypoints/main.py
```diff
@@ -105,8 +105,7 @@ def get_trainer(
     local_rank = os.environ.get("LOCAL_RANK")
     if local_rank is not None:
         local_rank = int(local_rank)
-        assert dist.is_nccl_available()
-        dist.init_process_group(backend="nccl")
+        dist.init_process_group(backend="cuda:nccl,cpu:gloo")
 
     def prepare_trainer_input_single(
         model_params_single, data_dict_single, rank=0, seed=None
```
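For context: the new backend argument uses PyTorch's `device:backend` mapping syntax, so collectives on CUDA tensors are routed through NCCL while collectives on CPU tensors go through Gloo. On a CPU-only machine this removes the hard NCCL requirement that the deleted `assert` imposed. Below is a minimal, self-contained sketch of the same initialization pattern under a `torchrun` launch; it is illustrative only, not DeePMD-kit code, and the script and the `init_distributed` helper are made up for this example.

```python
# Minimal sketch of the backend-selection pattern from the patch.
# Illustrative only; `init_distributed` is not part of DeePMD-kit.
import os

import torch
import torch.distributed as dist


def init_distributed() -> None:
    # torchrun exports LOCAL_RANK (plus RANK/WORLD_SIZE); in a plain
    # single-process run it is unset, so we skip initialization.
    local_rank = os.environ.get("LOCAL_RANK")
    if local_rank is None:
        return
    local_rank = int(local_rank)
    # "cuda:nccl,cpu:gloo" maps device types to backends: CUDA tensors
    # use NCCL, CPU tensors use Gloo. On a CPU-only node only the Gloo
    # half is exercised, so NCCL no longer has to be available.
    dist.init_process_group(backend="cuda:nccl,cpu:gloo")
    if torch.cuda.is_available():
        torch.cuda.set_device(local_rank)


if __name__ == "__main__":
    init_distributed()
    if dist.is_initialized():
        # The all-reduce is dispatched to whichever backend matches the
        # tensor's device; here a CPU tensor exercises the Gloo path.
        t = torch.ones(1)
        dist.all_reduce(t)
        print(f"rank {dist.get_rank()}: sum = {t.item()}")
        dist.destroy_process_group()
```

Launching with, e.g., `torchrun --nproc_per_node=2 sketch.py` should print `sum = 2.0` on each rank, with or without a GPU present.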
