[Bug] RLlib training gets stuck when GPU rollout workers are used #21758
Labels
bug — Something that is supposed to be working, but isn't
P2 — Important issue, but not time-critical
rllib — RLlib related issues
rllib-system — System issues, runtime env, OOM, etc.
Search before asking
Ray Component
RLlib
What happened + What you expected to happen
The Ape-X DQN trainer gets stuck at a random point during training whenever GPUs are assigned to the rollout workers. I couldn't tell from the Ray logs, at any logging level, why this happens. The problem does not occur with CPU-only workers, and changing the number of GPU workers doesn't seem to matter. The more 8-GPU nodes I use, the further training progresses before it gets stuck.
Expected behavior: RLlib runs to completion with GPU rollout workers.
The following stack trace is not produced every time:
2022-01-26 19:03:42,470 WARNING trainer.py:975 -- Worker crashed during call to step_attempt(). To try to continue training without the failed worker, set ignore_worker_failures=True.
Traceback (most recent call last):
File "src/rllib.py", line 1417, in
main(args)
File "src/rllib.py", line 969, in main
result = trainer.train()
File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/tune/trainable.py", line 319, in train
result = self.step()
File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/rllib/agents/trainer.py", line 979, in step
raise e
File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/rllib/agents/trainer.py", line 965, in step
step_attempt_results = self.step_attempt()
File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/rllib/agents/trainer.py", line 1044, in step_attempt
step_results = self._exec_plan_or_training_iteration_fn()
File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/rllib/agents/trainer.py", line 2032, in _exec_plan_or_training_iteration_fn
results = next(self.train_exec_impl)
File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/util/iter.py", line 756, in next
return next(self.built_iterator)
File "/home/ubuntuvenv/lib/python3.8/site-packages/ray/util/iter.py", line 783, in apply_foreach
for item in it:
File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/util/iter.py", line 783, in apply_foreach
for item in it:
File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/util/iter.py", line 843, in apply_filter
for item in it:
File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/util/iter.py", line 843, in apply_filter
for item in it:
File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/util/iter.py", line 783, in apply_foreach
for item in it:
File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/util/iter.py", line 843, in apply_filter
for item in it:
File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/util/iter.py", line 1075, in build_union
item = next(it)
File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/util/iter.py", line 756, in next
return next(self.built_iterator)
File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/util/iter.py", line 783, in apply_foreach
for item in it:
File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/util/iter.py", line 783, in apply_foreach
for item in it:
File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/util/iter.py", line 783, in apply_foreach
for item in it:
[Previous line repeated 2 more times]
File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/util/iter.py", line 551, in base_iterator
batch = ray.get(obj_ref)
File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
return func(*args, **kwargs)
File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/worker.py", line 1763, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RuntimeError): ray::RolloutWorker.par_iter_next_batch() (pid=14155, ip=172.31.64.130, repr=<ray.rllib.evaluation.rollout_worker.RolloutWorker object at 0x7f8c9f7a21c0>)
File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/util/iter.py", line 1157, in par_iter_next_batch
batch.append(self.par_iter_next())
File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/util/iter.py", line 1151, in par_iter_next
return next(self.local_it)
File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/rllib/evaluation/rollout_worker.py", line 380, in gen_rollouts
yield self.sample()
File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/rllib/evaluation/rollout_worker.py", line 759, in sample
batches = [self.input_reader.next()]
File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/rllib/evaluation/sampler.py", line 104, in next
batches = [self.get_data()]
File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/rllib/evaluation/sampler.py", line 266, in get_data
item = next(self._env_runner)
File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/rllib/evaluation/sampler.py", line 634, in _env_runner
_process_observations(
File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/rllib/evaluation/sampler.py", line 1026, in _process_observations
sample_collector.try_build_truncated_episode_multi_agent_batch()
File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/rllib/evaluation/collectors/simple_list_collector.py", line 886, in try_build_truncated_episode_multi_agent_batch
self.postprocess_episode(episode, is_done=False)
File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/rllib/evaluation/collectors/simple_list_collector.py", line 781, in postprocess_episode
post_batches[agent_id] = policy.postprocess_trajectory(
File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/rllib/policy/policy_template.py", line 304, in postprocess_trajectory
return postprocess_fn(self, sample_batch,
File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/rllib/agents/dqn/dqn_tf_policy.py", line 402, in postprocess_nstep_and_prio
td_errors = policy.compute_td_error(
File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/rllib/agents/dqn/dqn_torch_policy.py", line 122, in compute_td_error
build_q_losses(self, self.model, None, input_dict)
File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/rllib/agents/dqn/dqn_torch_policy.py", line 238, in build_q_losses
model, {"obs": train_batch[SampleBatch.CUR_OBS]},
File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/rllib/policy/sample_batch.py", line 725, in getitem
self.intercepted_values[key] = self.get_interceptor(value)
File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/rllib/utils/torch_utils.py", line 158, in convert_to_torch_tensor
return tree.map_structure(mapping, x)
File "/home/ubuntu/venv/lib/python3.8/site-packages/tree/init.py", line 510, in map_structure
[func(*args) for args in zip(*map(flatten, structures))])
File "/home/ubuntu/venv/lib/python3.8/site-packages/tree/init.py", line 510, in
[func(*args) for args in zip(*map(flatten, structures))])
File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/rllib/utils/torch_utils.py", line 156, in mapping
return tensor if device is None else tensor.to(device)
RuntimeError: CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
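As the error message suggests, the failing kernel can sometimes be localized more precisely by forcing synchronous CUDA launches. A minimal sketch of propagating that setting to the rollout workers through a Ray runtime environment (assuming runtime_env env_vars are supported on this Ray version; the variable could also simply be exported before launching the cluster):

import ray

# Make CUDA kernel launches synchronous on every Ray worker so that the
# CUDA error is raised at the call that actually failed.
ray.init(runtime_env={"env_vars": {"CUDA_LAUNCH_BLOCKING": "1"}})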
Versions / Dependencies
ray == 1.9.2
python == 3.8.12
torch == 1.11.1
Reproduction script
import ray
from ray.rllib.agents import dqn

ray.init()

# Default Ape-X DQN config with a GPU share for every rollout worker.
dqn_config = dqn.apex.APEX_DEFAULT_CONFIG.copy()
dqn_config["num_gpus_per_worker"] = 0.2
dqn_config["num_gpus"] = 1
# Total number of available GPUs = 8 (each with 15109 MB memory).
trainer = dqn.ApexTrainer(config=dqn_config, env=custom_env)  # custom_env: the custom environment
for i in range(10000):
    result = trainer.train()
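As a side note, the workaround mentioned in the warning above can be enabled in the same config dict; it only keeps training alive after a worker crash and does not address the underlying CUDA error:

# Continue training without a crashed rollout worker (RLlib common config option).
dqn_config["ignore_worker_failures"] = True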
Anything else
I couldn't find any relevant logs that explain why this is happening. The .out files in /tmp/ray/session_latest/logs did mention a garbage-collection issue (it looks like the worker died):
"The GCS actor metadata garbage collector timer failed to fire. This could old actor metadata not being properly cleaned up."
followed by a number of worker-failed messages, and finally:
"Raylet b682a0f9013ee0737c068f9d54731fae0e95d0eccc01a1eb29c07287 is drained. Status IOError: . The information will be published to the cluster."
I'm not sure whether these log messages are relevant.
The script seems to fail more often with a large replay buffer size (1,000,000).
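If it helps triage, the replay-buffer observation could be checked by shrinking the buffer in the same config and comparing how often the hang occurs (assuming the buffer_size key used by DQN/Ape-X in this Ray version):

# Hypothetical experiment: use a smaller replay buffer and compare hang frequency.
dqn_config["buffer_size"] = 100000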
Are you willing to submit a PR?