pytorch elastic scheduler error #1504

Closed · qiankunli opened this issue Dec 8, 2021 · 9 comments · Fixed by #1733
@qiankunli
Contributor

qiankunli commented Dec 8, 2021

v2-pytorch-1208202754389-worker-0 stays in the Running state, but the other workers have already Completed.

v2-pytorch-1208202754389-worker-0         1/1     Running     0          7m22s
v2-pytorch-1208202754389-worker-1         0/1     Completed   0          7m21s
v2-pytorch-1208202754389-worker-2         0/1     Completed   0          7m21s
v2-pytorch-1208202754389-worker-3         0/1     Completed   0          7m21s

v2-pytorch-1208202754389-worker-0 log

[INFO] 2021-12-08 12:28:26,050 run: Running torch.distributed.run with args: ['/opt/conda/lib/python3.7/site-packages/torch/distributed/run.py', '/xdl/private/bert.li/train.py']
[INFO] 2021-12-08 12:28:26,052 run: Using nproc_per_node=auto.
[INFO] 2021-12-08 12:28:26,097 run: Using nproc_per_node=auto, seting to 1 since the instance has 96 gpu
[INFO] 2021-12-08 12:28:26,098 api: Starting elastic_operator with launch configs:
  entrypoint       : /xdl/private/bert.li/train.py
  min_nodes        : 2
  max_nodes        : 4
  nproc_per_node   : 1
  run_id           : v2-pytorch-1208202754389
  rdzv_backend     : c10d
  rdzv_endpoint    : v2-pytorch-1208202754389-worker-0:23456
  rdzv_configs     : {'timeout': 900}
  max_restarts     : 100
  monitor_interval : 5
  log_dir          : None
  metrics_cfg      : {}

[INFO] 2021-12-08 12:28:26,099 c10d_rendezvous_backend: Process 8 hosts the TCP store for the C10d rendezvous backend.
[INFO] 2021-12-08 12:28:26,101 local_elastic_agent: log directory set to: /tmp/torchelastic_ggneqcf4/v2-pytorch-1208202754389_otp4jco3
[INFO] 2021-12-08 12:28:26,101 api: [default] starting workers for entrypoint: python
[INFO] 2021-12-08 12:28:26,101 api: [default] Rendezvous'ing worker group
[INFO] 2021-12-08 12:28:26,102 dynamic_rendezvous: The node 'v2-pytorch-1208202754389-worker-0_8_0' attempts to join the next round of the rendezvous 'v2-pytorch-1208202754389'.
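For context, the rendezvous step that worker-0 is waiting on works roughly like this: the node named in rdzv_endpoint hosts a C10d TCPStore, and every other node connects to it by resolving that hostname. A minimal sketch of that behaviour (illustrative only; the real logic lives in torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py, and the hostname check is simplified here):

```python
# Sketch of what the c10d rendezvous backend does on each node (simplified).
import socket
from datetime import timedelta

from torch.distributed import TCPStore

host, port = "v2-pytorch-1208202754389-worker-0", 23456  # rdzv_endpoint from the logs
is_server = socket.gethostname() == host  # worker-0 hosts the store, the others connect

# On the non-zero workers this call has to resolve the worker-0 hostname first.
# If that DNS record is not available yet, it raises
# "ValueError: host not found: Name or service not known", as seen in worker-1's log.
store = TCPStore(host, port, is_master=is_server, timeout=timedelta(seconds=60))
```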

v2-pytorch-1208202754389-worker-1 log

[INFO] 2021-12-08 12:28:26,303 run: Running torch.distributed.run with args: ['/opt/conda/lib/python3.7/site-packages/torch/distributed/run.py', '/xdl/private/bert.li/train.py']
[INFO] 2021-12-08 12:28:26,306 run: Using nproc_per_node=auto.
[INFO] 2021-12-08 12:28:26,354 run: Using nproc_per_node=auto, seting to 1 since the instance has 96 gpu
[INFO] 2021-12-08 12:28:26,354 api: Starting elastic_operator with launch configs:
  entrypoint       : /xdl/private/bert.li/train.py
  min_nodes        : 2
  max_nodes        : 4
  nproc_per_node   : 1
  run_id           : v2-pytorch-1208202754389
  rdzv_backend     : c10d
  rdzv_endpoint    : v2-pytorch-1208202754389-worker-0:23456
  rdzv_configs     : {'timeout': 900}
  max_restarts     : 100
  monitor_interval : 5
  log_dir          : None
  metrics_cfg      : {}

[ERROR] 2021-12-08 12:28:26,423 error_handler: {
  "message": {
    "message": "RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details.",
    "extraInfo": {
      "py_callstack": "Traceback (most recent call last):\n  File \"/opt/conda/lib/python3.7/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py\", line 146, in _create_tcp_store\n    host, port, is_master=is_server, timeout=timedelta(seconds=read_timeout)\nValueError: host not found: Name or service not known\n\nThe above exception was the direct cause of the following exception:\n\nTraceback (most recent call last):\n  File \"/opt/conda/lib/python3.7/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py\", line 348, in wrapper\n    return f(*args, **kwargs)\n  File \"/opt/conda/lib/python3.7/site-packages/torch/distributed/launcher/api.py\", line 214, in launch_agent\n    rdzv_handler = rdzv_registry.get_rendezvous_handler(rdzv_parameters)\n  File \"/opt/conda/lib/python3.7/site-packages/torch/distributed/elastic/rendezvous/registry.py\", line 64, in get_rendezvous_handler\n    return handler_registry.create_handler(params)\n  File \"/opt/conda/lib/python3.7/site-packages/torch/distributed/elastic/rendezvous/api.py\", line 253, in create_handler\n    handler = creator(params)\n  File \"/opt/conda/lib/python3.7/site-packages/torch/distributed/elastic/rendezvous/registry.py\", line 35, in _create_c10d_handler\n    backend, store = create_backend(params)\n  File \"/opt/conda/lib/python3.7/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py\", line 204, in create_backend\n    store = _create_tcp_store(params)\n  File \"/opt/conda/lib/python3.7/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py\", line 165, in _create_tcp_store\n    ) from exc\ntorch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details.\n",
      "timestamp": "1638966506"
    }
  }
}
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 146, in _create_tcp_store
    host, port, is_master=is_server, timeout=timedelta(seconds=read_timeout)
ValueError: host not found: Name or service not known

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/run.py", line 637, in <module>
    main()
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/run.py", line 629, in main
    run(args)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/run.py", line 624, in run
    )(*cmd_args)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 116, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 214, in launch_agent
    rdzv_handler = rdzv_registry.get_rendezvous_handler(rdzv_parameters)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/elastic/rendezvous/registry.py", line 64, in get_rendezvous_handler
    return handler_registry.create_handler(params)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/elastic/rendezvous/api.py", line 253, in create_handler
    handler = creator(params)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/elastic/rendezvous/registry.py", line 35, in _create_c10d_handler
    backend, store = create_backend(params)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 204, in create_backend
    store = _create_tcp_store(params)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 165, in _create_tcp_store
    ) from exc
torch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details.
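The failing call on worker-1 is name resolution of the rdzv_endpoint host: the pod apparently starts before the DNS record for v2-pytorch-1208202754389-worker-0 is resolvable, and the c10d backend does not retry resolution, so the agent exits immediately. A hedged workaround sketch (the wrapper, function name, and timeout values are illustrative, not part of the operator or of torch.distributed):

```python
# Hypothetical pre-launch wait: retry DNS resolution of the rendezvous host
# until it succeeds, then hand off to torch.distributed.run as usual.
import socket
import time

def wait_for_host(host: str, timeout_s: float = 300.0, interval_s: float = 5.0) -> None:
    """Block until `host` resolves, or re-raise the last socket.gaierror on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            socket.getaddrinfo(host, None)
            return
        except socket.gaierror:
            if time.monotonic() >= deadline:
                raise
            time.sleep(interval_s)

wait_for_host("v2-pytorch-1208202754389-worker-0")
```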

14 minutes later:

v2-pytorch-1208202754389-worker-0         0/1     Completed   0          14m
v2-pytorch-1208202754389-worker-1         0/1     Completed   0          14m
v2-pytorch-1208202754389-worker-2         0/1     Completed   0          14m
v2-pytorch-1208202754389-worker-3         0/1     Completed   0          14m

v2-pytorch-1208202754389-worker-0 log

[INFO] 2021-12-08 12:28:26,050 run: Running torch.distributed.run with args: ['/opt/conda/lib/python3.7/site-packages/torch/distributed/run.py', '/xdl/private/bert.li/train.py']
[INFO] 2021-12-08 12:28:26,052 run: Using nproc_per_node=auto.
[INFO] 2021-12-08 12:28:26,097 run: Using nproc_per_node=auto, seting to 1 since the instance has 96 gpu
[INFO] 2021-12-08 12:28:26,098 api: Starting elastic_operator with launch configs:
  entrypoint       : /xdl/private/bert.li/train.py
  min_nodes        : 2
  max_nodes        : 4
  nproc_per_node   : 1
  run_id           : v2-pytorch-1208202754389
  rdzv_backend     : c10d
  rdzv_endpoint    : v2-pytorch-1208202754389-worker-0:23456
  rdzv_configs     : {'timeout': 900}
  max_restarts     : 100
  monitor_interval : 5
  log_dir          : None
  metrics_cfg      : {}

[INFO] 2021-12-08 12:28:26,099 c10d_rendezvous_backend: Process 8 hosts the TCP store for the C10d rendezvous backend.
[INFO] 2021-12-08 12:28:26,101 local_elastic_agent: log directory set to: /tmp/torchelastic_ggneqcf4/v2-pytorch-1208202754389_otp4jco3
[INFO] 2021-12-08 12:28:26,101 api: [default] starting workers for entrypoint: python
[INFO] 2021-12-08 12:28:26,101 api: [default] Rendezvous'ing worker group
[INFO] 2021-12-08 12:28:26,102 dynamic_rendezvous: The node 'v2-pytorch-1208202754389-worker-0_8_0' attempts to join the next round of the rendezvous 'v2-pytorch-1208202754389'.
{"name": "torchelastic.worker.status.FAILED", "source": "AGENT", "timestamp": 0, "metadata": {"run_id": "v2-pytorch-1208202754389", "global_rank": null, "group_rank": null, "worker_id": null, "role": "default", "hostname": "v2-pytorch-1208202754389-worker-0", "state": "FAILED", "total_run_time": 600, "rdzv_backend": "c10d", "raw_error": "Traceback (most recent call last):\n  File \"/opt/conda/lib/python3.7/site-packages/torch/distributed/launcher/api.py\", line 238, in launch_agent\n    result = agent.run()\n  File \"/opt/conda/lib/python3.7/site-packages/torch/distributed/elastic/metrics/api.py\", line 125, in wrapper\n    result = f(*args, **kwargs)\n  File \"/opt/conda/lib/python3.7/site-packages/torch/distributed/elastic/agent/server/api.py\", line 700, in run\n    result = self._invoke_run(role)\n  File \"/opt/conda/lib/python3.7/site-packages/torch/distributed/elastic/agent/server/api.py\", line 822, in _invoke_run\n    self._initialize_workers(self._worker_group)\n  File \"/opt/conda/lib/python3.7/site-packages/torch/distributed/elastic/metrics/api.py\", line 125, in wrapper\n    result = f(*args, **kwargs)\n  File \"/opt/conda/lib/python3.7/site-packages/torch/distributed/elastic/agent/server/api.py\", line 670, in _initialize_workers\n    self._rendezvous(worker_group)\n  File \"/opt/conda/lib/python3.7/site-packages/torch/distributed/elastic/metrics/api.py\", line 125, in wrapper\n    result = f(*args, **kwargs)\n  File \"/opt/conda/lib/python3.7/site-packages/torch/distributed/elastic/agent/server/api.py\", line 530, in _rendezvous\n    store, group_rank, group_world_size = spec.rdzv_handler.next_rendezvous()\n  File \"/opt/conda/lib/python3.7/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py\", line 933, in next_rendezvous\n    self._op_executor.run(join_op, deadline)\n  File \"/opt/conda/lib/python3.7/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py\", line 574, in run\n    raise RendezvousTimeoutError()\ntorch.distributed.elastic.rendezvous.api.RendezvousTimeoutError\n", "metadata": "{\"group_world_size\": null, \"entry_point\": \"python\"}", "agent_restarts": 0}}
[INFO] 2021-12-08 12:38:26,948 dynamic_rendezvous: The node 'v2-pytorch-1208202754389-worker-0_8_0' has closed the rendezvous 'v2-pytorch-1208202754389'.
[ERROR] 2021-12-08 12:38:26,948 error_handler: {
  "message": {
    "message": "RendezvousTimeoutError: ",
    "extraInfo": {
      "py_callstack": "Traceback (most recent call last):\n  File \"/opt/conda/lib/python3.7/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py\", line 348, in wrapper\n    return f(*args, **kwargs)\n  File \"/opt/conda/lib/python3.7/site-packages/torch/distributed/launcher/api.py\", line 238, in launch_agent\n    result = agent.run()\n  File \"/opt/conda/lib/python3.7/site-packages/torch/distributed/elastic/metrics/api.py\", line 125, in wrapper\n    result = f(*args, **kwargs)\n  File \"/opt/conda/lib/python3.7/site-packages/torch/distributed/elastic/agent/server/api.py\", line 700, in run\n    result = self._invoke_run(role)\n  File \"/opt/conda/lib/python3.7/site-packages/torch/distributed/elastic/agent/server/api.py\", line 822, in _invoke_run\n    self._initialize_workers(self._worker_group)\n  File \"/opt/conda/lib/python3.7/site-packages/torch/distributed/elastic/metrics/api.py\", line 125, in wrapper\n    result = f(*args, **kwargs)\n  File \"/opt/conda/lib/python3.7/site-packages/torch/distributed/elastic/agent/server/api.py\", line 670, in _initialize_workers\n    self._rendezvous(worker_group)\n  File \"/opt/conda/lib/python3.7/site-packages/torch/distributed/elastic/metrics/api.py\", line 125, in wrapper\n    result = f(*args, **kwargs)\n  File \"/opt/conda/lib/python3.7/site-packages/torch/distributed/elastic/agent/server/api.py\", line 530, in _rendezvous\n    store, group_rank, group_world_size = spec.rdzv_handler.next_rendezvous()\n  File \"/opt/conda/lib/python3.7/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py\", line 933, in next_rendezvous\n    self._op_executor.run(join_op, deadline)\n  File \"/opt/conda/lib/python3.7/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py\", line 574, in run\n    raise RendezvousTimeoutError()\ntorch.distributed.elastic.rendezvous.api.RendezvousTimeoutError\n",
      "timestamp": "1638967106"
    }
  }
}
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/run.py", line 637, in <module>
    main()
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/run.py", line 629, in main
    run(args)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/run.py", line 624, in run
    )(*cmd_args)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 116, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 238, in launch_agent
    result = agent.run()
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/elastic/metrics/api.py", line 125, in wrapper
    result = f(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/elastic/agent/server/api.py", line 700, in run
    result = self._invoke_run(role)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/elastic/agent/server/api.py", line 822, in _invoke_run
    self._initialize_workers(self._worker_group)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/elastic/metrics/api.py", line 125, in wrapper
    result = f(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/elastic/agent/server/api.py", line 670, in _initialize_workers
    self._rendezvous(worker_group)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/elastic/metrics/api.py", line 125, in wrapper
    result = f(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/elastic/agent/server/api.py", line 530, in _rendezvous
    store, group_rank, group_world_size = spec.rdzv_handler.next_rendezvous()
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 933, in next_rendezvous
    self._op_executor.run(join_op, deadline)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 574, in run
    raise RendezvousTimeoutError()
torch.distributed.elastic.rendezvous.api.RendezvousTimeoutError
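So worker-0 itself comes up fine, hosts the store, and then waits for at least min_nodes=2 nodes to join; because the other workers have already crashed and are not restarted, the join operation hits its deadline (total_run_time: 600 in the event above) and the agent fails with RendezvousTimeoutError. For reference, a programmatic sketch of the same launch configuration built with torch.distributed.launcher.api, showing where the rendezvous join deadline could be tuned (the values and the join_timeout override are illustrative, and raising the timeout alone would not help here unless the other workers come back):

```python
# Illustrative equivalent of the launch configs printed in the logs,
# built directly with torch.distributed.launcher.api instead of the operator.
from torch.distributed.launcher.api import LaunchConfig, elastic_launch

config = LaunchConfig(
    min_nodes=2,
    max_nodes=4,
    nproc_per_node=1,
    run_id="v2-pytorch-1208202754389",
    rdzv_backend="c10d",
    rdzv_endpoint="v2-pytorch-1208202754389-worker-0:23456",
    rdzv_configs={"join_timeout": 1800},  # dynamic-rendezvous join deadline; the default is 600s
    max_restarts=100,
    monitor_interval=5,
)

def train():
    ...  # the user entrypoint, /xdl/private/bert.li/train.py in this job

elastic_launch(config, train)()
```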
@gaocegege
Member

Ref pytorch/pytorch#67742

@stale

stale bot commented Apr 16, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot closed this as completed Apr 30, 2022
@tenzen-y
Member

/reopen
/assign

@google-oss-prow

@tenzen-y: Reopened this issue.

In response to this:

/reopen
/assign


@tenzen-y
Member

My misunderstanding. This issue has already been fixed.

/close

@google-oss-prow

@tenzen-y: Closing this issue.

In response to this:

My misunderstanding. This issue has already been fixed.

/close


@tenzen-y
Member

/unassign

@tenzen-y
Member

I faced this issue again when I deployed https://github.com/kubeflow/training-operator/tree/master/examples/pytorch/elastic/echo.
/assign
/reopen

@google-oss-prow

@tenzen-y: Reopened this issue.

In response to this:

I faced this issue again when I deployed https://github.com/kubeflow/training-operator/tree/master/examples/pytorch/elastic/echo.
/assign
/reopen

