[Bug] RLlib training gets stuck when GPU rollout workers are used #21758
Labels
bug — Something that is supposed to be working, but isn't
P2 — Important issue, but not time-critical
rllib — RLlib related issues
rllib-system — System issues, runtime env, OOM, etc.
Search before asking
Ray Component
RLlib
What happened + What you expected to happen
The Ape-X DQN trainer gets stuck at a random point during training whenever GPUs are assigned to the rollout workers. I couldn't tell from the Ray logs, at any logging level, why this happens. The problem does not occur with CPU-only workers, and changing the number of GPU workers doesn't seem to matter. The more 8-GPU nodes I use, the further training progresses before it gets stuck.
Expected behavior: RLlib runs to completion with GPU rollout workers.
The following stack trace is not produced every time:
2022-01-26 19:03:42,470 WARNING trainer.py:975 -- Worker crashed during call to step_attempt(). To try to continue training without the failed worker, set ignore_worker_failures=True.
Traceback (most recent call last):
File "src/rllib.py", line 1417, in
main(args)
File "src/rllib.py", line 969, in main
result = trainer.train()
File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/tune/trainable.py", line 319, in train
result = self.step()
File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/rllib/agents/trainer.py", line 979, in step
raise e
File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/rllib/agents/trainer.py", line 965, in step
step_attempt_results = self.step_attempt()
File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/rllib/agents/trainer.py", line 1044, in step_attempt
step_results = self._exec_plan_or_training_iteration_fn()
File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/rllib/agents/trainer.py", line 2032, in _exec_plan_or_training_iteration_fn
results = next(self.train_exec_impl)
File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/util/iter.py", line 756, in next
return next(self.built_iterator)
File "/home/ubuntuvenv/lib/python3.8/site-packages/ray/util/iter.py", line 783, in apply_foreach
for item in it:
File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/util/iter.py", line 783, in apply_foreach
for item in it:
File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/util/iter.py", line 843, in apply_filter
for item in it:
File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/util/iter.py", line 843, in apply_filter
for item in it:
File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/util/iter.py", line 783, in apply_foreach
for item in it:
File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/util/iter.py", line 843, in apply_filter
for item in it:
File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/util/iter.py", line 1075, in build_union
item = next(it)
File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/util/iter.py", line 756, in next
return next(self.built_iterator)
File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/util/iter.py", line 783, in apply_foreach
for item in it:
File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/util/iter.py", line 783, in apply_foreach
for item in it:
File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/util/iter.py", line 783, in apply_foreach
for item in it:
[Previous line repeated 2 more times]
File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/util/iter.py", line 551, in base_iterator
batch = ray.get(obj_ref)
File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
return func(*args, **kwargs)
File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/worker.py", line 1763, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RuntimeError): ray::RolloutWorker.par_iter_next_batch() (pid=14155, ip=172.31.64.130, repr=<ray.rllib.evaluation.rollout_worker.RolloutWorker object at 0x7f8c9f7a21c0>)
File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/util/iter.py", line 1157, in par_iter_next_batch
batch.append(self.par_iter_next())
File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/util/iter.py", line 1151, in par_iter_next
return next(self.local_it)
File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/rllib/evaluation/rollout_worker.py", line 380, in gen_rollouts
yield self.sample()
File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/rllib/evaluation/rollout_worker.py", line 759, in sample
batches = [self.input_reader.next()]
File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/rllib/evaluation/sampler.py", line 104, in next
batches = [self.get_data()]
File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/rllib/evaluation/sampler.py", line 266, in get_data
item = next(self._env_runner)
File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/rllib/evaluation/sampler.py", line 634, in _env_runner
_process_observations(
File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/rllib/evaluation/sampler.py", line 1026, in _process_observations
sample_collector.try_build_truncated_episode_multi_agent_batch()
File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/rllib/evaluation/collectors/simple_list_collector.py", line 886, in try_build_truncated_episode_multi_agent_batch
self.postprocess_episode(episode, is_done=False)
File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/rllib/evaluation/collectors/simple_list_collector.py", line 781, in postprocess_episode
post_batches[agent_id] = policy.postprocess_trajectory(
File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/rllib/policy/policy_template.py", line 304, in postprocess_trajectory
return postprocess_fn(self, sample_batch,
File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/rllib/agents/dqn/dqn_tf_policy.py", line 402, in postprocess_nstep_and_prio
td_errors = policy.compute_td_error(
File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/rllib/agents/dqn/dqn_torch_policy.py", line 122, in compute_td_error
build_q_losses(self, self.model, None, input_dict)
File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/rllib/agents/dqn/dqn_torch_policy.py", line 238, in build_q_losses
model, {"obs": train_batch[SampleBatch.CUR_OBS]},
File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/rllib/policy/sample_batch.py", line 725, in getitem
self.intercepted_values[key] = self.get_interceptor(value)
File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/rllib/utils/torch_utils.py", line 158, in convert_to_torch_tensor
return tree.map_structure(mapping, x)
File "/home/ubuntu/venv/lib/python3.8/site-packages/tree/init.py", line 510, in map_structure
[func(*args) for args in zip(*map(flatten, structures))])
File "/home/ubuntu/venv/lib/python3.8/site-packages/tree/init.py", line 510, in
[func(*args) for args in zip(*map(flatten, structures))])
File "/home/ubuntu/venv/lib/python3.8/site-packages/ray/rllib/utils/torch_utils.py", line 156, in mapping
return tensor if device is None else tensor.to(device)
RuntimeError: CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
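As the error message suggests, the failing kernel can sometimes be localized more precisely by forcing synchronous CUDA launches. A minimal sketch of propagating that setting to the rollout workers through a Ray runtime environment (assuming runtime_env env_vars are supported on this Ray version; the variable could also simply be exported before launching the cluster):

import ray

# Make CUDA kernel launches synchronous on every Ray worker so that the
# CUDA error is raised at the call that actually failed.
ray.init(runtime_env={"env_vars": {"CUDA_LAUNCH_BLOCKING": "1"}})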
Versions / Dependencies
ray == 1.9.2
python == 3.8.12
torch == 1.11.1
Reproduction script
import ray
from ray.rllib.agents import dqn

ray.init()

# Default Ape-X DQN config with a GPU share for every rollout worker.
dqn_config = dqn.apex.APEX_DEFAULT_CONFIG.copy()
dqn_config["num_gpus_per_worker"] = 0.2
dqn_config["num_gpus"] = 1
# Total number of available GPUs = 8 (each with 15109 MB memory).
trainer = dqn.ApexTrainer(config=dqn_config, env=custom_env)  # custom_env: the custom environment
for i in range(10000):
    result = trainer.train()
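As a side note, the workaround mentioned in the warning above can be enabled in the same config dict; it only keeps training alive after a worker crash and does not address the underlying CUDA error:

# Continue training without a crashed rollout worker (RLlib common config option).
dqn_config["ignore_worker_failures"] = True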
Anything else
I couldn't find any relevant logs that explain why this is happening. The .out files in /tmp/ray/session_latest/logs did mention a garbage-collection issue (it looks like the worker died):
"The GCS actor metadata garbage collector timer failed to fire. This could old actor metadata not being properly cleaned up."
followed by a number of worker-failed messages, and finally:
"Raylet b682a0f9013ee0737c068f9d54731fae0e95d0eccc01a1eb29c07287 is drained. Status IOError: . The information will be published to the cluster."
I'm not sure whether these log messages are relevant.
The script seems to fail more often with a large replay buffer size (1,000,000).
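If it helps triage, the replay-buffer observation could be checked by shrinking the buffer in the same config and comparing how often the hang occurs (assuming the buffer_size key used by DQN/Ape-X in this Ray version):

# Hypothetical experiment: use a smaller replay buffer and compare hang frequency.
dqn_config["buffer_size"] = 100000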
Are you willing to submit a PR?