
Issue connecting to nodes that are not within the same cluster #658

Open
yxusnapchat opened this issue Oct 11, 2024 · 2 comments
@yxusnapchat

Hi team, I have an example based on the latest NVIDIA image nvcr.io/nvidia/tensorflow:24.07-tf2-py3, but I run the MPI job on different nodes. However, the launcher complains that it could not identify the worker. Is it supported to have the launcher and workers running on separate nodes?

apiVersion: kubeflow.org/v2beta1  # assumed; the required apiVersion field was missing
kind: MPIJob
metadata:
  name: xxx
spec:
  slotsPerWorker: 1
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - image: xxx
            # env:
            #   - name: TF_USE_LEGACY_KERAS
            #     value: "1" 
            # resources:
            #   limits:
            #     nvidia.com/gpu: 1  # Request 1 GPU
            #   requests:
            #     nvidia.com/gpu: 1  # Optionally set requests equal to limits
            name: mpi-launcher
            command:
            - mpirun
            args:
            - -np
            - "2"
            - --allow-run-as-root
            - -bind-to
            - none
            - -map-by
            - slot
            - -x
            - LD_LIBRARY_PATH
            - -x
            - PATH
            - -mca
            - pml
            - ob1
            - -mca
            - btl
            - ^openib
            - python
            - /nvidia-examples/movielens-1m-keras-with-horovod.py
            - --mode=train
            - --model_dir=./model_dir   # no shell interprets args here, so literal quotes would end up in the path
            - --export_dir=./export_dir
    Worker:
      replicas: 2
      template:
        spec:
          containers:
          - image: xxx
            name: mpi-worker
            # env:
            #   - name: TF_USE_LEGACY_KERAS
            #     value: "1" 
            resources:
              limits:
                nvidia.com/gpu: 1  # Request 1 GPU
              requests:
                nvidia.com/gpu: 1  # Optionally set requests equal to limits
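One thing worth sanity-checking in a spec like the one above: mpirun's `-np` value must not exceed the total slots the workers provide (worker replicas × slotsPerWorker), or the launcher cannot place all ranks. A minimal sketch of that arithmetic (the helper name is hypothetical, not part of the operator):

```python
def max_ranks(worker_replicas: int, slots_per_worker: int) -> int:
    """Total MPI slots an MPIJob's workers expose (hypothetical helper)."""
    return worker_replicas * slots_per_worker

# Values from the spec above: 2 worker replicas, slotsPerWorker: 1.
capacity = max_ranks(worker_replicas=2, slots_per_worker=1)
requested = 2  # the "-np 2" argument passed to mpirun

# -np 2 fits exactly into the 2 available slots.
assert requested <= capacity
print(capacity)
```

So the spec's `-np 2` is consistent with its worker count; a mismatch here is a separate failure mode from the connectivity error described above.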

Also, I am curious where the code pointer is that starts the worker. Thanks!

@alculquicondor (Collaborator)

Yes, it is supported.

You can find examples here: https://github.com/kubeflow/mpi-operator/tree/master/examples/v2beta1

@alculquicondor (Collaborator)

Please share more details about the error messages in both the launcher and worker pods.
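For reference, a typical way to collect those details is sketched below. The job name `xxx` matches the spec above; the pod names follow the operator's usual `<job-name>-launcher` / `<job-name>-worker-<n>` pattern, and the label selector is an assumption to adjust for your operator version:

```shell
# List the pods the MPIJob created (assumed label; adjust to your setup)
kubectl get pods -l training.kubeflow.org/job-name=xxx

# Launcher log: usually shows mpirun's error about unreachable workers
kubectl logs xxx-launcher

# Worker logs: check that sshd started and the pod is healthy
kubectl logs xxx-worker-0
kubectl logs xxx-worker-1
```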
