[rllib][tune] Training stuck in "Pending" status #16425

Closed
floepfl opened this issue Jun 15, 2021 · 12 comments · Fixed by #17533
Assignees: sven1977
Labels: enhancement (Request for new feature and/or capability), P1 (Issue that should be fixed within a few weeks), rllib (RLlib related issues), tune (Tune-related issues)

Comments

@floepfl commented Jun 15, 2021

Hey everyone,

I'm trying to run Ape-X with tune.run() on Ray 1.3.0 and the trial status remains "PENDING". I get the same status output indefinitely:

== Status ==
Memory usage on this node: 7.5/19.4 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/4 CPUs, 0/0 GPUs, 0.0/8.24 GiB heap, 0.0/4.12 GiB objects
Result logdir: /home/flo/ray_results/APEX
Number of trials: 1/1 (1 PENDING)
+---------------------------+----------+-------+
| Trial name | status | loc |
|---------------------------+----------+-------|
| APEX_PFCAsset_985a1_00000 | PENDING | |
+---------------------------+----------+-------+

With the debug flag enabled, it also outputs the following (many times over):
2021-06-10 09:22:33,760 DEBUG trial_runner.py:621 -- Running trial APEX_PFCAsset_985a1_00000
2021-06-10 09:22:33,760 DEBUG trial_executor.py:43 -- Trial APEX_PFCAsset_985a1_00000: Status PENDING unchanged.
2021-06-10 09:22:33,761 DEBUG trial_executor.py:62 -- Trial APEX_PFCAsset_985a1_00000: Saving trial metadata.

Downgrading to 1.2.0 solves the problem. I see this on both Linux and Windows. I also tried the latest 2.0.0 nightly wheel from the website as well as version 1.4.0 and got the same issue. I experienced it with A3C too, and another user on Slack reports hitting it with PPO; according to him, the problem could lie in the resource allocation.

@richardliaw (Contributor) commented:

@floepfl can you try Ray==1.4?

richardliaw changed the title from "Training stuck in 'Pending' status" to "[rllib][tune] Training stuck in 'Pending' status" on Jun 15, 2021
@sven1977 (Contributor) commented:

@floepfl, could you also try lowering your num_workers by 1 or 2?
It's probably because, via the placement groups, we now dedicate a CPU to the learner (local worker), which we didn't do in <= 1.2. We also now make sure that each replay buffer shard gets its own CPU (something we didn't guarantee prior to 1.3 either).
This may push the required number of CPUs beyond what your machine has, so old configs that used to run fine now leave the trial pending.

@sven1977 (Contributor) commented Jun 15, 2021

You can print out the required resources with:

print(APEXTrainer.default_resource_request(config=config)._bundles)
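
For context, a more complete sketch of that call, assuming the Ray 1.x RLlib API (where the Ape-X trainer class is typically named ApexTrainer and its default config APEX_DEFAULT_CONFIG; exact names may differ in your version):

from ray.rllib.agents.dqn.apex import ApexTrainer, APEX_DEFAULT_CONFIG

config = APEX_DEFAULT_CONFIG.copy()
config["num_workers"] = 32   # rollout workers, 1 CPU each by default
config["num_gpus"] = 1       # GPU for the learner

# default_resource_request() returns a PlacementGroupFactory; its _bundles
# attribute lists one bundle for the learner/driver plus one per rollout worker.
print(ApexTrainer.default_resource_request(config=config)._bundles)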

@sven1977
Copy link
Contributor

The default behavior is this:

[
  {'CPU': 5, 'GPU': 1},  # <- learner: 1 CPU + 1 GPU, plus 4 CPUs for the replay shards (see config.optimizer.num_replay_shards)
  # 32 workers, 1 CPU each:
  {'CPU': 1, 'GPU': 0}, {'CPU': 1, 'GPU': 0}, {'CPU': 1, 'GPU': 0}, ...  # (32 such entries in total)
]
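
(In total that comes to 5 + 32 = 37 CPUs and 1 GPU for the default Ape-X settings, so a trial will stay PENDING on any machine with fewer CPUs than that.)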

sven1977 self-assigned this on Jun 15, 2021
sven1977 added the enhancement, rllib, and tune labels on Jun 15, 2021
@floepfl (Author) commented Jun 15, 2021

Thanks for your answers. Indeed, accounting for the replay buffer shards in the number of CPUs I make available in ray.init resolved the issue.
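
As an illustration of that accounting (hypothetical numbers; the shard count normally lives under the optimizer config as num_replay_buffer_shards, and the exact key may vary by version):

import ray

num_workers = 8
num_replay_shards = 4  # e.g. config["optimizer"]["num_replay_buffer_shards"]

# 1 CPU for the learner/driver + 1 per replay shard + 1 per rollout worker
ray.init(num_cpus=1 + num_replay_shards + num_workers)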

@mickelliu (Contributor) commented Aug 4, 2021

Hey everyone,

I encountered a similar issue today, but with QMIX. I have tried Ray 2.0.0.dev0, 1.5.0, and 1.4.0 with no success; the trial is always stuck in the "PENDING" status, no matter how I adjust the number of workers and CPUs. I ran it overnight and it is still pending.

[screenshots of the Tune status output showing the trial stuck in PENDING]

Then I saw this issue thread and downgraded my Ray version to 1.2.0; however, a new error popped up:

2021-08-04 13:21:32,756 INFO services.py:1174 -- View the Ray dashboard at http://127.0.0.1:8265
Traceback (most recent call last):
  File "/data/[USER]/code/mate-nips/rllib_train.py", line 87, in <module>
    verbose=3)
  File "/home/[USER]/anaconda3/envs/nips-mate/lib/python3.7/site-packages/ray/tune/tune.py", line 421, in run
    runner.step()
  File "/home/[USER]/anaconda3/envs/nips-mate/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 404, in step
    self.trial_executor.on_no_available_trials(self)
  File "/home/[USER]/anaconda3/envs/nips-mate/lib/python3.7/site-packages/ray/tune/trial_executor.py", line 186, in on_no_available_trials
    "Insufficient cluster resources to launch trial: "
ray.tune.error.TuneError: Insufficient cluster resources to launch trial: trial requested 17.0 CPUs, 0.0 GPUs, but the cluster has only 16 CPUs, 0 GPUs, 77.78 GiB heap, 25.73 GiB objects (1.0 node:162.105.162.205, 1.0 accelerator_type:X). 

You can adjust the resource requests of RLlib agents by setting num_workers, num_gpus, and other configs. See the DEFAULT_CONFIG defined by each agent for more info.

The config of this agent is: {'env': 'mate_centralized', 'log_level': 'WARN', 'num_workers': 16, 'num_cpus_per_worker': 1.0, 'num_gpus': 0, 'num_gpus_per_worker': 0.0, 'framework': 'torch', 'exploration_config': {'type': 'EpsilonGreedy', 'initial_epsilon': 1.0, 'final_epsilon': 0.02, 'epsilon_timesteps': 10000}, 'horizon': None, 'callbacks': <class 'module.AuxiliaryRewardCallbacks'>} 

Interestingly, the requested resources are always 1 CPU more than the amount I made available to Ray: if I set the limit to 4 CPUs, it asks for 5; if I set it to 1 CPU, it asks for 2. I only saw this error once I downgraded all the way to 1.2.0; higher versions never raised it.

I haven't tested whether this is specific to the QMIX algorithm. I'm using a custom Gym environment, Python 3.7, PyTorch 1.4, and Gym 0.18.3.

@mickelliu (Contributor) commented:

(quoting the previous comment)

Later I figured out it was my own mistake: I thought I could split the number of CPUs and GPUs across the workers by dividing, like this:

        "num_cpus_per_worker": args.num_cpus / args.num_workers,
        "num_gpus_per_worker": args.num_gpus / args.num_workers,

I'm not sure the division itself is the cause, because regular division (/) and integer division (//) produce the same error. The config actually shows the correct number (num_cpus_per_worker = 1), but Ray apparently rounds it up to 2 per worker, which produces the 6-CPU requirement shown in the screenshot. Other than that I don't have a clear explanation. Deleting those two lines from my config makes everything work just fine.
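
For reference, a sketch of a sizing that avoids the problem (illustrative only; args.* comes from my own script): leave the per-worker settings at their RLlib defaults (num_cpus_per_worker=1, num_gpus_per_worker=0) and size num_workers so that one CPU is left for the learner/driver:

config = {
    "num_workers": args.num_cpus - 1,  # leave 1 CPU for the learner/driver
    "num_gpus": args.num_gpus,
    # do not set num_cpus_per_worker / num_gpus_per_worker; the defaults are fine
}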

@xwjiang2010 (Contributor) commented:

Hi @mickelliu, is this still an issue for you? If it is, can you provide a minimal setup for me to try? Thanks!

@mickelliu (Contributor) commented Aug 5, 2021

Hi @mickelliu, is this still an issue for you? If it is, can you provide a minimal set up for me to try? Thanks!

Hi @xwjiang2010, although I resolved this on my side by setting up my config correctly, I do believe the underlying problem (trials stuck in PENDING) still persists in Ray versions > 1.2.
To reproduce the issue I had, you could downgrade Ray to 1.2 and add the lines below to the RLlib algorithm config dict that gets passed into tune.run(). I ran my code with a closed-source custom Gym environment registered via ray.tune.

config = {
        "num_workers": args.num_workers,
        "num_gpus": args.num_gpus,
        # The two lines below caused the bug:
        # "num_cpus_per_worker": args.num_cpus // args.num_workers,
        # "num_gpus_per_worker": args.num_gpus // args.num_workers,
}

In my case, the divisions don't seem to be what causes the error; specifying the parameters num_cpus_per_worker and num_gpus_per_worker at all does.
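
For completeness, a minimal sketch of how such a config would be launched (the env name and stop criterion are placeholders, not from my actual script):

from ray import tune

config["env"] = "my_custom_env"  # hypothetical name, registered via tune.register_env(...)

tune.run(
    "QMIX",                           # RLlib's registered trainable name
    config=config,
    stop={"training_iteration": 10},  # placeholder stop criterion
)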

@jamesliu commented:

(quoting the exchange above)

Yes. After removing num_gpus_per_worker, the pending issue is fixed.

@JulesVerny commented:

I have a similar issue: even with a basic Policy Gradient configuration, RLlib fails to start the job.

from ray.rllib.agents.pg.pg import (
    DEFAULT_CONFIG,
    PGTrainer as trainer)

Even though I have 12 CPU cores, I have tried setting:

config_update = {
    "env": args.env,
    "num_gpus": 1,
    "num_workers": 10,
    "evaluation_num_workers": 4,
    "evaluation_interval": 1,
}

Resources requested: 0/12 CPUs, 0/1 GPUs, 0.0/31.34 GiB heap, 0.0/15.67 GiB objects

Still no joy, stuck at PENDING!
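
With 1 CPU per worker plus 1 for the driver/learner, that config likely requests 10 + 4 + 1 = 15 CPUs, which exceeds the 12 available. A sketch of a sizing that should fit (illustrative numbers only):

config_update = {
    "env": args.env,
    "num_gpus": 1,
    "num_workers": 7,             # rollout workers
    "evaluation_num_workers": 4,  # evaluation workers
    "evaluation_interval": 1,
    # 7 + 4 + 1 (driver/learner) = 12 CPUs, matching the machine
}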

@richardliaw (Contributor) commented Sep 27, 2021 via email

@JulesVerny commented:

Hello, I was using Ray 1.6.0 on Microsoft Windows 10 with Anaconda Python 3.7.1 when I had this problem, but I cannot remember in which order I ran pip install ray[rllib].

I did get a PPO Ray job working with "framework": "torch" under a different Anaconda Python 3.8.1 conda environment.

@cheadrian commented:

I think there should be an argument to set evaluation_num_workers on ScalingConfig and count it toward the total number of workers, for clarity, or the total number of workers should be capped at the value set in ScalingConfig; as it stands, any additional workers set in config={} push the count past that value.

@gurdipk commented Mar 28, 2024

You can print out the required resources with:

print(APEXTrainer.default_resource_request(config=config)._bundles)

Do we add this print command within the function 'main'?
