
[Bug] RLlib training gets stuck when GPU rollout workers are used #21758

Open
kawshik8 opened this issue Jan 20, 2022 · 2 comments
Labels: bug, P2, rllib, rllib-system



kawshik8 commented Jan 20, 2022


Ray Component

RLlib

What happened + What you expected to happen

  1. The Apex DQN trainer gets stuck randomly during training when GPUs are used for rollout workers. I couldn't tell why from the Ray logs in any logging mode. This behaviour doesn't happen when I use CPU workers. Changing the number of GPU workers doesn't seem to matter; the more 8-GPU nodes I use, the farther training gets before it hangs.

  2. Expected: RLlib runs to completion with GPU rollout workers, just as it does with CPU workers.

  3. This stack trace is not produced every time (a debugging sketch follows the traceback):

    2022-01-26 19:03:42,470 WARNING trainer.py:975 -- Worker crashed during call to step_attempt(). To try to continue training without the failed worker, set ignore_worker_failures=True.
    Traceback (most recent call last):
    File "src/rllib.py", line 1417, in
    main(args)
    File "src/rllib.py", line 969, in main
    result = trainer.train()
    File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/tune/trainable.py", line 319, in train
    result = self.step()
    File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/rllib/agents/trainer.py", line 979, in step
    raise e
    File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/rllib/agents/trainer.py", line 965, in step
    step_attempt_results = self.step_attempt()
    File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/rllib/agents/trainer.py", line 1044, in step_attempt
    step_results = self._exec_plan_or_training_iteration_fn()
    File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/rllib/agents/trainer.py", line 2032, in _exec_plan_or_training_iteration_fn
    results = next(self.train_exec_impl)
    File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/util/iter.py", line 756, in next
    return next(self.built_iterator)
    File "/home/ubuntuvenv/lib/python3.8/site-packages/ray/util/iter.py", line 783, in apply_foreach
    for item in it:
    File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/util/iter.py", line 783, in apply_foreach
    for item in it:
    File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/util/iter.py", line 843, in apply_filter
    for item in it:
    File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/util/iter.py", line 843, in apply_filter
    for item in it:
    File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/util/iter.py", line 783, in apply_foreach
    for item in it:
    File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/util/iter.py", line 843, in apply_filter
    for item in it:
    File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/util/iter.py", line 1075, in build_union
    item = next(it)
    File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/util/iter.py", line 756, in next
    return next(self.built_iterator)
    File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/util/iter.py", line 783, in apply_foreach
    for item in it:
    File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/util/iter.py", line 783, in apply_foreach
    for item in it:
    File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/util/iter.py", line 783, in apply_foreach
    for item in it:
    [Previous line repeated 2 more times]
    File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/util/iter.py", line 551, in base_iterator
    batch = ray.get(obj_ref)
    File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
    File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/worker.py", line 1763, in get
    raise value.as_instanceof_cause()
    ray.exceptions.RayTaskError(RuntimeError): ray::RolloutWorker.par_iter_next_batch() (pid=14155, ip=172.31.64.130, repr=<ray.rllib.evaluation.rollout_worker.RolloutWorker object at 0x7f8c9f7a21c0>)
    File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/util/iter.py", line 1157, in par_iter_next_batch
    batch.append(self.par_iter_next())
    File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/util/iter.py", line 1151, in par_iter_next
    return next(self.local_it)
    File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/rllib/evaluation/rollout_worker.py", line 380, in gen_rollouts
    yield self.sample()
    File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/rllib/evaluation/rollout_worker.py", line 759, in sample
    batches = [self.input_reader.next()]
    File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/rllib/evaluation/sampler.py", line 104, in next
    batches = [self.get_data()]
    File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/rllib/evaluation/sampler.py", line 266, in get_data
    item = next(self._env_runner)
    File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/rllib/evaluation/sampler.py", line 634, in _env_runner
    _process_observations(
    File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/rllib/evaluation/sampler.py", line 1026, in _process_observations
    sample_collector.try_build_truncated_episode_multi_agent_batch()
    File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/rllib/evaluation/collectors/simple_list_collector.py", line 886, in try_build_truncated_episode_multi_agent_batch
    self.postprocess_episode(episode, is_done=False)
    File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/rllib/evaluation/collectors/simple_list_collector.py", line 781, in postprocess_episode
    post_batches[agent_id] = policy.postprocess_trajectory(
    File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/rllib/policy/policy_template.py", line 304, in postprocess_trajectory
    return postprocess_fn(self, sample_batch,
    File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/rllib/agents/dqn/dqn_tf_policy.py", line 402, in postprocess_nstep_and_prio
    td_errors = policy.compute_td_error(
    File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/rllib/agents/dqn/dqn_torch_policy.py", line 122, in compute_td_error
    build_q_losses(self, self.model, None, input_dict)
    File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/rllib/agents/dqn/dqn_torch_policy.py", line 238, in build_q_losses
    model, {"obs": train_batch[SampleBatch.CUR_OBS]},
    File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/rllib/policy/sample_batch.py", line 725, in getitem
    self.intercepted_values[key] = self.get_interceptor(value)
    File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/rllib/utils/torch_utils.py", line 158, in convert_to_torch_tensor
    return tree.map_structure(mapping, x)
    File "/home/ubuntu/venv/lib/python3.8/site-packages/tree/init.py", line 510, in map_structure
    [func(*args) for args in zip(*map(flatten, structures))])
    File "/home/ubuntu/venv/lib/python3.8/site-packages/tree/init.py", line 510, in
    [func(*args) for args in zip(*map(flatten, structures))])
    File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/rllib/utils/torch_utils.py", line 156, in mapping
    return tensor if device is None else tensor.to(device)
    RuntimeError: CUDA error: unspecified launch failure
    CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
    For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
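
One way to act on the traceback's own hint is to export CUDA_LAUNCH_BLOCKING=1 into the rollout-worker processes so CUDA errors are reported at the actual kernel launch. A minimal sketch, not part of the original report, assuming this Ray version supports env_vars in runtime_env:

import ray

# Hypothetical debugging setup: propagate CUDA_LAUNCH_BLOCKING=1 to all Ray
# workers so the CUDA error surfaces synchronously at the real launch site.
ray.init(runtime_env={"env_vars": {"CUDA_LAUNCH_BLOCKING": "1"}})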

Versions / Dependencies

ray == 1.9.2
python == 3.8.12
torch == 1.11.1

Reproduction script

import ray.rllib.agents.dqn as dqn

dqn_config = dqn.APEX_DEFAULT_CONFIG.copy()  # default Ape-X config
dqn_config["num_gpus"] = 1
dqn_config["num_gpus_per_worker"] = 0.2
# total number of available GPUs = 8 (each with 15109 MiB of memory)
trainer = dqn.ApexTrainer(config=dqn_config, env=custom_env)  # custom_env: user-defined env

for i in range(10000):
    result = trainer.train()
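
To check whether the failure is specific to RLlib or reproducible with plain fractional-GPU actors, here is a minimal sketch independent of the trainer (the 0.2 GPU share mirrors num_gpus_per_worker above; GpuProbe is a hypothetical helper, not from the report):

import ray
import torch

@ray.remote(num_gpus=0.2)
class GpuProbe:
    def step(self):
        # The same kind of host -> device copy that fails in the traceback.
        return torch.ones(1024, 1024).to("cuda").sum().item()

ray.init()
probes = [GpuProbe.remote() for _ in range(5)]  # 5 x 0.2 = one shared GPU
for _ in range(1000):
    ray.get([p.step() for p in probes])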

Anything else

  1. Couldn't find any relevant logs that explain why this is happening.

  2. The .out files in /tmp/ray/session_latest/logs did mention a garbage-collection issue (it seems the worker died):
    "The GCS actor metadata garbage collector timer failed to fire. This could old actor metadata not being properly cleaned
    up." This was followed by a number of worker-failed messages and, finally, "Raylet
    b682a0f9013ee0737c068f9d54731fae0e95d0eccc01a1eb29c07287 is drained. Status IOError: . The information will be
    published to the cluster."

  3. Not sure whether the above log messages are relevant.

  4. The script seems to fail more often with a large replay buffer size (1000000); see the config sketch after this list.
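
One way to probe the replay-buffer observation above is to vary the buffer size in the same config. A sketch, assuming the legacy buffer_size key used by RLlib in Ray 1.9.x (the value is illustrative, not from the report):

# Hypothetical probe: check whether a smaller replay buffer makes the hang rarer.
dqn_config["buffer_size"] = 100000  # vs. the 1000000 that seems to fail more often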

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@kawshik8 kawshik8 added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Jan 20, 2022
@kawshik8 kawshik8 changed the title [Bug] [Bug] RLlib training gets stuck when GPU rollout workers are used Jan 21, 2022
@krfricke krfricke added the rllib RLlib related issues label Apr 4, 2022

gjoliver commented Apr 9, 2022

Looks like a bug with using partial (fractional) GPUs on rollout workers.

@gjoliver gjoliver added P2 Important issue, but not time-critical rllib-system system issues, runtime env, oom, etc and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Apr 9, 2022

shixianc commented Mar 1, 2024

Do we have a fix for this?

We ran into a similar error, RuntimeError: CUDA error: unspecified launch failure, when using partial-GPU workers. It also happens when we move tensors to the GPU with tensors.to(cuda).
