[rllib][tune] Training stuck in "Pending" status #16425
Comments
@floepfl can you try Ray==1.4?
@floepfl, could you also try lowering your num_workers by 1 or 2?
You can print out the needed resources by doing the following:
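The snippet that originally followed is missing here; a minimal sketch of how to print a trainer's resource request (assuming Ray ~1.3/1.4, where RLlib trainers expose Tune's default_resource_request classmethod, and using a placeholder env):
from ray.rllib.agents.dqn.apex import APEX_DEFAULT_CONFIG, ApexTrainer

# Merge the settings you actually use into APEX's defaults ("CartPole-v0" is a placeholder env).
config = dict(APEX_DEFAULT_CONFIG, env="CartPole-v0", num_workers=4)

# Prints the CPUs/GPUs the trial will ask Tune for before it can leave PENDING.
print(ApexTrainer.default_resource_request(config))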
The default behavior is this:
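The snippet that followed is also missing; roughly, the resource arithmetic looks like this (a sketch, assuming APEX's defaults in Ray ~1.3, where each replay buffer shard runs in its own actor):
# Assumed shape of APEX's resource request, not the exact internal code:
num_workers = 4                 # rollout workers in your config
num_replay_buffer_shards = 4    # APEX default, under the "optimizer" sub-config
requested_cpus = 1 + num_workers + num_replay_buffer_shards  # +1 for the driver/learner
print(requested_cpus)  # 9 -- the trial stays PENDING until this many CPUs are free at once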
Thanks for your answers. Indeed, taking the replay buffer shards into account in the number of CPUs I made available in ray.init resolved the issue.
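In other words, something like the following sizing of ray.init covers the full request (numbers are illustrative, not the commenter's exact values):
import ray

num_workers = 4
num_replay_buffer_shards = 4  # APEX "optimizer" default
# Reserve CPUs for the driver, every rollout worker, and every replay shard.
ray.init(num_cpus=1 + num_workers + num_replay_buffer_shards)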
Hey everyone, I encountered a similar issue today, but with QMIX. I have tried Ray 2.0.0.dev0, 1.5.0, and 1.4.0 with no success; they were all stuck in the "Pending" status no matter how I adjusted the number of workers and CPUs. I ran the trial overnight and it was still stuck at pending. Then I saw this issue thread and lowered my Ray version to 1.2.0; however, a new error popped up:
Interestingly, the resources requested by the trial are always 1 CPU more than the amount I gave Ray: if I set the resources to 4 CPUs, it asks for 5; if I set them to 1 CPU, it asks for 2. I only saw this error once I downgraded Ray all the way to 1.2.0; higher versions didn't even raise it. I haven't tested whether this is specific to the QMIX algorithm. I used a custom Gym environment, Python 3.7, PyTorch 1.4, and Gym 0.18.3.
Later I figured out it was my own mistake: I thought I could fractionally divide the number of CPUs or GPUs by the number of workers.
I am not sure the division itself is the cause, because both true division (/) and integer division (//) produce the same error. The config actually shows the correct number (num_of_cpus_per_agent = 1), but Ray automatically rounds it up to 2 per agent, which causes the 6-CPU requirement shown in the screenshot; other than that I don't have a clear explanation. After deleting those two lines from my config file, everything works just fine.
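For context, a hypothetical reconstruction of the two problematic config lines (the exact keys and numbers aren't shown in the thread; everything below is illustrative only):
num_cpus = 4
num_workers = 3
config = {
    # Fractional per-worker resources; per the report above, Ray ended up
    # requesting more than intended, so the trial never scheduled.
    "num_cpus_per_worker": num_cpus / num_workers,
    "num_gpus_per_worker": 1 / num_workers,
}
# Dropping these two keys (i.e., falling back to the defaults) let the trial start.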
Hi @mickelliu, is this still an issue for you? If it is, can you provide a minimal setup for me to try? Thanks!
Hi @xwjiang2010, although I resolved this issue by setting up my config correctly, I do believe the problem (being stuck in pending) still persists in Ray versions > 1.2.
In my case the divisions don't seem to cause the error; rather, specifying the per-worker resource parameters does.
Yes. After removing num_gpus_per_worker, the pending issue is fixed.
I have a similar issue: on a basic Policy Gradient configuration, RLlib fails to start the job. Resources requested: 0/12 CPUs, 0/1 GPUs, 0.0/31.34 GiB heap, 0.0/15.67 GiB objects. Still no joy, stuck at Pending!
Hey Jules, what version are you using?
On Mon, Sep 27, 2021 at 4:15 AM Jules ***@***.***> wrote:
I have a similar issue: on a basic Policy Gradient configuration, RLlib fails to start the job.
from ray.rllib.agents.pg.pg import (
    DEFAULT_CONFIG,
    PGTrainer as trainer)
Even though I have 12 CPU cores, I have tried setting:
config_update = {
    "env": args.env,
    "num_gpus": 1,
    "num_workers": 10,
    "evaluation_num_workers": 4,
    "evaluation_interval": 1,
}
Resources requested: 0/12 CPUs, 0/1 GPUs, 0.0/31.34 GiB heap, 0.0/15.67 GiB objects
Still no joy, stuck at Pending!
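For reference, a rough sanity check of the CPU arithmetic for the config quoted above (assuming one CPU for the driver and one per rollout/evaluation worker, which is PG's usual default):
num_workers = 10
evaluation_num_workers = 4
driver_cpus = 1
requested_cpus = driver_cpus + num_workers + evaluation_num_workers
print(requested_cpus)  # 15 -- more than the 12 CPUs available, so the trial can sit in PENDING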
Hello, I am using Ray version 1.6.0, on Microsoft Windows 10 with Anaconda Python 3.7.1, when I had this problem. I cannot remember in which order I ran pip install ray[rllib]. I actually got a PPO Ray job working with "framework": "torch" under another Anaconda Python 3.8.1 conda environment.
I think there should be an argument to set the
Do we add this print command within the function 'main'?
Hey everyone,
I'm trying to run Ape-X with tune.run() on Ray 1.3.0 and the status remains "Pending". I get the same message indefinitely:
== Status ==
Memory usage on this node: 7.5/19.4 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/4 CPUs, 0/0 GPUs, 0.0/8.24 GiB heap, 0.0/4.12 GiB objects
Result logdir: /home/flo/ray_results/APEX
Number of trials: 1/1 (1 PENDING)
+---------------------------+----------+-------+
| Trial name | status | loc |
|---------------------------+----------+-------|
| APEX_PFCAsset_985a1_00000 | PENDING | |
+---------------------------+----------+-------+
If I use the debug flag, it also outputs the following (a lot of times):
2021-06-10 09:22:33,760 DEBUG trial_runner.py:621 -- Running trial APEX_PFCAsset_985a1_00000
2021-06-10 09:22:33,760 DEBUG trial_executor.py:43 -- Trial APEX_PFCAsset_985a1_00000: Status PENDING unchanged.
2021-06-10 09:22:33,761 DEBUG trial_executor.py:62 -- Trial APEX_PFCAsset_985a1_00000: Saving trial metadata.
Downgrading to 1.2.0 solves the problem. I'm using Linux and Windows. I also tried the latest 2.0.0 nightly wheel from the website and version 1.4.0 and get the same issue. I also experienced it with A3C, and another user on Slack reports having experienced it with PPO as well; according to him, the problem could lie in the resource allocation.
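A minimal sketch of the kind of run described (environment name and resource numbers are placeholders, not the reporter's script):
import ray
from ray import tune

ray.init(num_cpus=4)
tune.run(
    "APEX",
    config={
        "env": "CartPole-v0",  # stand-in for the custom PFCAsset environment
        "num_workers": 2,
        "num_gpus": 0,
    },
    stop={"training_iteration": 1},
)
With APEX's default replay buffer shards, a small num_cpus like this can fall short of the full resource request, which matches the diagnosis in the replies earlier in the thread.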