[rllib][tune] Training stuck in "Pending" status #16425

Closed
floepfl opened this issue Jun 15, 2021 · 12 comments · Fixed by #17533
Assignees: sven1977
Labels: enhancement (Request for new feature and/or capability), P1 (Issue that should be fixed within a few weeks), rllib (RLlib related issues), tune (Tune-related issues)

Comments

@floepfl commented Jun 15, 2021

Hey everyone,

I'm trying to run Ape-X with tune.run() on Ray 1.3.0 and the trial status remains "PENDING". I get the same status output indefinitely:

== Status ==
Memory usage on this node: 7.5/19.4 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/4 CPUs, 0/0 GPUs, 0.0/8.24 GiB heap, 0.0/4.12 GiB objects
Result logdir: /home/flo/ray_results/APEX
Number of trials: 1/1 (1 PENDING)
+---------------------------+----------+-------+
| Trial name | status | loc |
|---------------------------+----------+-------|
| APEX_PFCAsset_985a1_00000 | PENDING | |
+---------------------------+----------+-------+

With the debug flag enabled, it also outputs the following (many times over):
2021-06-10 09:22:33,760 DEBUG trial_runner.py:621 -- Running trial APEX_PFCAsset_985a1_00000
2021-06-10 09:22:33,760 DEBUG trial_executor.py:43 -- Trial APEX_PFCAsset_985a1_00000: Status PENDING unchanged.
2021-06-10 09:22:33,761 DEBUG trial_executor.py:62 -- Trial APEX_PFCAsset_985a1_00000: Saving trial metadata.

Downgrading to 1.2.0 solves the problem. I see this on both Linux and Windows. I also tried the latest 2.0.0 nightly wheel from the website as well as version 1.4.0 and got the same issue. I experienced it with A3C too, and another user on Slack reports hitting it with PPO; according to him, the problem could lie in the resource allocation.

@richardliaw (Contributor) commented:

@floepfl can you try Ray==1.4?

richardliaw changed the title from "Training stuck in 'Pending' status" to "[rllib][tune] Training stuck in 'Pending' status" on Jun 15, 2021
@sven1977 (Contributor) commented:

@floepfl, could you also try lowering your num_workers by 1 or 2?
It's probably because, via the placement groups, we now dedicate a CPU to the learner (local worker), which we didn't do in <= 1.2. We also now make sure that each replay buffer shard gets its own CPU (something we didn't guarantee prior to 1.3 either).
This may push the required number of CPUs beyond what your machine has, so old configs that used to run fine now leave the trial pending.

@sven1977 (Contributor) commented Jun 15, 2021

You can print out the required resources with:

print(APEXTrainer.default_resource_request(config=config)._bundles)
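
For context, a more complete sketch of that call, assuming the Ray 1.x RLlib API (where the Ape-X trainer class is typically named ApexTrainer and its default config APEX_DEFAULT_CONFIG; exact names may differ in your version):

from ray.rllib.agents.dqn.apex import ApexTrainer, APEX_DEFAULT_CONFIG

config = APEX_DEFAULT_CONFIG.copy()
config["num_workers"] = 32   # rollout workers, 1 CPU each by default
config["num_gpus"] = 1       # GPU for the learner

# default_resource_request() returns a PlacementGroupFactory; its _bundles
# attribute lists one bundle for the learner/driver plus one per rollout worker.
print(ApexTrainer.default_resource_request(config=config)._bundles)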

@sven1977
Copy link
Contributor

The default behavior is this:

[
  {'CPU': 5, 'GPU': 1},  # <- learner: 1 CPU + 1 GPU, plus 4 CPUs for the replay shards (see config.optimizer.num_replay_shards)
  # 32 workers, 1 CPU each:
  {'CPU': 1, 'GPU': 0}, {'CPU': 1, 'GPU': 0}, {'CPU': 1, 'GPU': 0}, ...  # (32 such entries in total)
]
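
(In total that comes to 5 + 32 = 37 CPUs and 1 GPU for the default Ape-X settings, so a trial will stay PENDING on any machine with fewer CPUs than that.)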

sven1977 self-assigned this on Jun 15, 2021
sven1977 added the enhancement, rllib, and tune labels on Jun 15, 2021
@floepfl (Author) commented Jun 15, 2021

Thanks for your answers. Indeed, accounting for the replay buffer shards in the number of CPUs I make available in ray.init resolved the issue.
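
As an illustration of that accounting (hypothetical numbers; the shard count normally lives under the optimizer config as num_replay_buffer_shards, and the exact key may vary by version):

import ray

num_workers = 8
num_replay_shards = 4  # e.g. config["optimizer"]["num_replay_buffer_shards"]

# 1 CPU for the learner/driver + 1 per replay shard + 1 per rollout worker
ray.init(num_cpus=1 + num_replay_shards + num_workers)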

@mickelliu (Contributor) commented Aug 4, 2021

Hey everyone,

I encountered a similar issue today, but with QMIX. I have tried Ray 2.0.0.dev0, 1.5.0, and 1.4.0 with no success; the trial is always stuck in the "PENDING" status, no matter how I adjust the number of workers and CPUs. I ran it overnight and it is still pending.

[screenshots of the Tune status output showing the trial stuck in PENDING]

Then I saw this issue thread and downgraded my Ray version to 1.2.0; however, a new error popped up:

2021-08-04 13:21:32,756 INFO services.py:1174 -- View the Ray dashboard at http://127.0.0.1:8265
Traceback (most recent call last):
  File "/data/[USER]/code/mate-nips/rllib_train.py", line 87, in <module>
    verbose=3)
  File "/home/[USER]/anaconda3/envs/nips-mate/lib/python3.7/site-packages/ray/tune/tune.py", line 421, in run
    runner.step()
  File "/home/[USER]/anaconda3/envs/nips-mate/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 404, in step
    self.trial_executor.on_no_available_trials(self)
  File "/home/[USER]/anaconda3/envs/nips-mate/lib/python3.7/site-packages/ray/tune/trial_executor.py", line 186, in on_no_available_trials
    "Insufficient cluster resources to launch trial: "
ray.tune.error.TuneError: Insufficient cluster resources to launch trial: trial requested 17.0 CPUs, 0.0 GPUs, but the cluster has only 16 CPUs, 0 GPUs, 77.78 GiB heap, 25.73 GiB objects (1.0 node:162.105.162.205, 1.0 accelerator_type:X). 

You can adjust the resource requests of RLlib agents by setting num_workers, num_gpus, and other configs. See the DEFAULT_CONFIG defined by each agent for more info.

The config of this agent is: {'env': 'mate_centralized', 'log_level': 'WARN', 'num_workers': 16, 'num_cpus_per_worker': 1.0, 'num_gpus': 0, 'num_gpus_per_worker': 0.0, 'framework': 'torch', 'exploration_config': {'type': 'EpsilonGreedy', 'initial_epsilon': 1.0, 'final_epsilon': 0.02, 'epsilon_timesteps': 10000}, 'horizon': None, 'callbacks': <class 'module.AuxiliaryRewardCallbacks'>} 

Interestingly, the requested resources are always 1 CPU more than the amount I made available to Ray: if I set the limit to 4 CPUs, it asks for 5; if I set it to 1 CPU, it asks for 2. I only saw this error once I downgraded all the way to 1.2.0; higher versions never raised it.

I haven't tested whether this is specific to the QMIX algorithm. I'm using a custom Gym environment, Python 3.7, PyTorch 1.4, and Gym 0.18.3.

@mickelliu (Contributor) commented:

(quoting the previous comment)

Later I figured out it was my own mistake: I thought I could split the number of CPUs and GPUs across the workers by dividing, like this:

        "num_cpus_per_worker": args.num_cpus / args.num_workers,
        "num_gpus_per_worker": args.num_gpus / args.num_workers,

I'm not sure the division itself is the cause, because regular division (/) and integer division (//) produce the same error. The config actually shows the correct number (num_cpus_per_worker = 1), but Ray apparently rounds it up to 2 per worker, which produces the 6-CPU requirement shown in the screenshot. Other than that I don't have a clear explanation. Deleting those two lines from my config makes everything work just fine.
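
For reference, a sketch of a sizing that avoids the problem (illustrative only; args.* comes from my own script): leave the per-worker settings at their RLlib defaults (num_cpus_per_worker=1, num_gpus_per_worker=0) and size num_workers so that one CPU is left for the learner/driver:

config = {
    "num_workers": args.num_cpus - 1,  # leave 1 CPU for the learner/driver
    "num_gpus": args.num_gpus,
    # do not set num_cpus_per_worker / num_gpus_per_worker; the defaults are fine
}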

@xwjiang2010 (Contributor) commented:

Hi @mickelliu, is this still an issue for you? If it is, can you provide a minimal setup for me to try? Thanks!

@mickelliu (Contributor) commented Aug 5, 2021

Hi @mickelliu, is this still an issue for you? If it is, can you provide a minimal set up for me to try? Thanks!

Hi @xwjiang2010, although I resolved this on my side by setting up my config correctly, I do believe the underlying problem (trials stuck in PENDING) still persists in Ray versions > 1.2.
To reproduce the issue I had, you could downgrade Ray to 1.2 and add the lines below to the RLlib algorithm config dict that gets passed into tune.run(). I ran my code with a closed-source custom Gym environment registered via ray.tune.

config = {
        "num_workers": args.num_workers,
        "num_gpus": args.num_gpus,
        # The two lines below caused the bug:
        # "num_cpus_per_worker": args.num_cpus // args.num_workers,
        # "num_gpus_per_worker": args.num_gpus // args.num_workers,
}

In my case, the divisions don't seem to be what causes the error; specifying the parameters num_cpus_per_worker and num_gpus_per_worker at all does.
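
For completeness, a minimal sketch of how such a config would be launched (the env name and stop criterion are placeholders, not from my actual script):

from ray import tune

config["env"] = "my_custom_env"  # hypothetical name, registered via tune.register_env(...)

tune.run(
    "QMIX",                           # RLlib's registered trainable name
    config=config,
    stop={"training_iteration": 10},  # placeholder stop criterion
)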

@jamesliu commented:

(quoting the exchange above)

Yes. After removing num_gpus_per_worker, the pending issue is fixed.

@JulesVerny commented:

I have a similar issue: even with a basic Policy Gradient configuration, RLlib fails to start the job.

from ray.rllib.agents.pg.pg import (
    DEFAULT_CONFIG,
    PGTrainer as trainer)

Even though I have 12 CPU cores, I have tried setting:

config_update = {
    "env": args.env,
    "num_gpus": 1,
    "num_workers": 10,
    "evaluation_num_workers": 4,
    "evaluation_interval": 1,
}

Resources requested: 0/12 CPUs, 0/1 GPUs, 0.0/31.34 GiB heap, 0.0/15.67 GiB objects

Still no joy, stuck at PENDING!
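
With 1 CPU per worker plus 1 for the driver/learner, that config likely requests 10 + 4 + 1 = 15 CPUs, which exceeds the 12 available. A sketch of a sizing that should fit (illustrative numbers only):

config_update = {
    "env": args.env,
    "num_gpus": 1,
    "num_workers": 7,             # rollout workers
    "evaluation_num_workers": 4,  # evaluation workers
    "evaluation_interval": 1,
    # 7 + 4 + 1 (driver/learner) = 12 CPUs, matching the machine
}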

@richardliaw (Contributor) commented Sep 27, 2021 via email

@JulesVerny commented:

Hello, I was using Ray 1.6.0 on Microsoft Windows 10 with Anaconda Python 3.7.1 when I had this problem, but I cannot remember in which order I ran pip install ray[rllib].

I did get a PPO Ray job working with "framework": "torch" under a different Anaconda Python 3.8.1 conda environment.

@cheadrian commented:

I think there should be an argument to set evaluation_num_workers on ScalingConfig and count it toward the total number of workers, for clarity, or the total number of workers should be capped at the value set in ScalingConfig; as it stands, any additional workers set in config={} push the count past that value.

@gurdipk commented Mar 28, 2024

You can print out the required resources with:

print(APEXTrainer.default_resource_request(config=config)._bundles)

Do we add this print command within the function 'main'?
