
[Feature request] Adding multiprocessing support for off policy algorithms #179

Closed
yonkshi opened this issue Oct 6, 2020 · 13 comments · Fixed by #439

@yonkshi

yonkshi commented Oct 6, 2020

I am in the process of adding multiprocessing (vectorized envs) support for off-policy algorithms (TD3, SAC, DDPG, etc.). I've added support for sampling multiple actions and updated the timestep counting to account for the number of vectorized environments. The modified code runs without throwing an error, but the algorithms don't really converge anymore.
I tried OpenAI Gym's Pendulum-v0, where a single-instance env made from make_vec_env('Pendulum-v0', n_envs=1, vec_env_cls=DummyVecEnv) trains fine. If I specify multiple instances, such as make_vec_env('Pendulum-v0', n_envs=2, vec_env_cls=DummyVecEnv) or make_vec_env('Pendulum-v0', n_envs=2, vec_env_cls=SubprocVecEnv), the algorithms don't converge at all.
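
For reference, the setup above corresponds roughly to the following minimal script (a sketch on the modified branch; everything else is left at the defaults):

```python
from stable_baselines3 import SAC
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import DummyVecEnv

# n_envs=1 converges; n_envs=2 (DummyVecEnv or SubprocVecEnv) does not
env = make_vec_env("Pendulum-v0", n_envs=2, vec_env_cls=DummyVecEnv)
model = SAC("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=50_000)
```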
Here's a warning message that I get, which I suspect is closely related to the non-convergence.

/home/me/code/stable-baselines3/stable_baselines3/sac/sac.py:237: UserWarning: Using a target size (torch.Size([256, 2])) that is different to the input size (torch.Size([256, 1])). This will likely lead to incorrect results due to broadcasting. Please ensure they have the same size.
  critic_loss = 0.5 * sum([F.mse_loss(current_q, q_backup) for current_q in current_q_estimates])

It appears to me that the replay buffer isn't handling the n_envs dimension correctly when retrieving samples, so the loss target had to rely on broadcasting.
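
For illustration, the warning above can be reproduced in isolation with the shapes from the message (a standalone snippet, not SB3 code):

```python
import torch
import torch.nn.functional as F

current_q = torch.randn(256, 1)  # critic estimate: one value per sampled transition
q_backup = torch.randn(256, 2)   # target that accidentally kept the n_envs=2 dimension

# Emits the same UserWarning: the tensors are broadcast to (256, 2) before the mean,
# so each Q estimate is silently compared against both environments' targets.
critic_loss = 0.5 * F.mse_loss(current_q, q_backup)
print(critic_loss)
```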

Some pointers on modifying the replay buffer so that it supports multiprocessing would be much appreciated! If the authors would like, I can create a PR.

yonkshi@1579713

@Miffyli
Collaborator

Miffyli commented Oct 6, 2020

On a quick glance, I think you need to loop over the experiences from each environment here and store each experience one by one. I am not sure how the current code works and manages to store experiences for all environments at once. Also, taking "all" over dones is not right, because environment episodes can end at different times.
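
Something along these lines (a toy sketch with plain numpy arrays, not the actual SB3 buffer code):

```python
import numpy as np

n_envs = 2
buffer = []  # stands in for the real replay buffer in this sketch

# Fake VecEnv step output: one row per environment
obs = np.zeros((n_envs, 3))
new_obs = np.ones((n_envs, 3))
actions = np.zeros((n_envs, 1))
rewards = np.array([1.0, -1.0])
dones = np.array([False, True])

# Store each environment's transition separately, keeping its own done flag
# instead of collapsing them with all(dones)
for env_idx in range(n_envs):
    buffer.append(
        (obs[env_idx], new_obs[env_idx], actions[env_idx], rewards[env_idx], dones[env_idx])
    )
```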

This could be added to the contrib repository, which we will make available soon.

@Miffyli Miffyli added enhancement New feature or request experimental Experimental Feature labels Oct 6, 2020
@araffin
Member

araffin commented Oct 9, 2020

This could be added to the contrib repository, which we will make available soon.

In fact, multiprocessing for all algorithms is a feature that I would like to have for v1.1+, not only in the contrib repo.
Most of the work will be in changing the replay buffer and adding checks (because some features are disabled for SAC/TD3 when n_envs > 1, like training every n episodes).
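
For example, one of those checks would look something like this (illustrative sketch, not the final code):

```python
def check_episodic_training(n_envs: int, train_freq_unit: str) -> None:
    """Episodic training (train_freq counted in episodes) is only well defined for a single env."""
    assert not (train_freq_unit == "episode" and n_envs > 1), (
        "You must use only one env when doing episodic training."
    )
```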

@nick-harder

I would be glad to also work on this feature. Is there any work going on in this direction? Whom can I talk to?

@araffin
Member

araffin commented Apr 19, 2021

I would be glad to also work on this feature. Is there any work going on in this direction? Whom can I talk to?

There was apparently some work already done, but it is far from what would be needed.
Anyway, if you work on that feature, please start with only one algorithm (for instance SAC) and create a draft PR so we can discuss the details there.

If you have any questions beforehand, you can ask them here ;)

@yonkshi
Author

yonkshi commented Apr 19, 2021

@nick-harder My own research has pivoted away from RL / SB3, so I haven't made much progress since the previous commit. Please feel free to continue working on this feature. I agree with what Antonin said: start with one algorithm and try to verify that it converges properly.

@araffin araffin self-assigned this May 17, 2021
@araffin
Member

araffin commented May 17, 2021

I'll be working on that in the coming weeks (I need to implement it for a personal project).

@prathameshpck

Hey, I'm actually implementing a project and this feature would help me immensely.
Any progress with it / any estimate on its completion?

@araffin
Member

araffin commented Jun 26, 2021

Hey, I'm actually implementing a project and this feature would help me immensely.
Any progress with it / any estimate on its completion?

You can find a minimal working branch (still unpolished) there: #439

I activated the feature for SAC only for now, but it should work with the other algorithms.
It also only works in the basic case for now (no dict obs, no HER replay buffer).
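
Usage would look roughly like this once the branch is installed (a sketch; everything else is left at the defaults):

```python
from stable_baselines3 import SAC
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import SubprocVecEnv

if __name__ == "__main__":
    # Basic case only for now: flat observations, default replay buffer
    vec_env = make_vec_env("Pendulum-v0", n_envs=4, vec_env_cls=SubprocVecEnv)
    model = SAC("MlpPolicy", vec_env, train_freq=1, verbose=1)
    model.learn(total_timesteps=20_000)
```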

@prathameshpck

Thanks a lot!
I'll try to look into it.
I'm fairly new to SB3 and it's a pretty big learning curve to contribute immediately.
Understanding the codebase could take a while.

@araffin
Member

araffin commented Nov 4, 2021

I updated my current implementation, removed some for loops, and added support for dict obs.
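
For reference, "removing for loops" here means writing a whole batch of env transitions with a single array assignment instead of a Python loop over env indices (a toy sketch, not the actual buffer code):

```python
import numpy as np

buffer_size, n_envs, obs_dim = 100, 2, 3
# The buffer keeps an explicit env dimension: (buffer_size, n_envs, obs_dim)
observations = np.zeros((buffer_size, n_envs, obs_dim), dtype=np.float32)
pos = 0

# New step from the VecEnv: one row per environment
new_obs = np.random.randn(n_envs, obs_dim).astype(np.float32)

# Vectorized write: all environments are stored at index `pos` in one assignment
observations[pos] = new_obs
pos = (pos + 1) % buffer_size
```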

@araffin araffin changed the title [Feature request / WIP] Adding multiprocessing support for off policy algorithms [Feature request] Adding multiprocessing support for off policy algorithms Nov 4, 2021
@araffin
Member

araffin commented Nov 6, 2021

I added experimental support for multiple envs with the HER replay buffer in #654
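
Usage is roughly the same as before, just with the HER replay buffer plugged in (a sketch; "MyGoalEnv-v0" is a hypothetical goal-conditioned env with dict observations, and the exact kwargs depend on the SB3 version):

```python
from stable_baselines3 import SAC, HerReplayBuffer
from stable_baselines3.common.env_util import make_vec_env

# "MyGoalEnv-v0" is a placeholder for a registered goal-conditioned env
vec_env = make_vec_env("MyGoalEnv-v0", n_envs=2)
model = SAC(
    "MultiInputPolicy",
    vec_env,
    replay_buffer_class=HerReplayBuffer,
    replay_buffer_kwargs=dict(n_sampled_goal=4, goal_selection_strategy="future"),
)
model.learn(total_timesteps=10_000)
```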

@araffin araffin unpinned this issue Dec 2, 2021
@Kittiwin-Kumlungmak

Hello,

Does DDPG support multiprocessing in the latest version of SB3 (v1.4.0)?

I'm just curious because I tried multiprocessing on my custom env and got this assertion:

"AssertionError: You must use only one env when doing episodic training."

Also, I found out that new features such as EvalCallback with StopTrainingOnNoModelImprovement don't exist in the library installed in my environment, even though I could find the code in the repo.

I tried pip uninstall and pip install again. The code was still not updated.

Was that because I used pip to install? Is there a solution for this problem?

Thank you very much in advance.
I'm new to RL but your work makes RL much simpler than I thought. Still not easy though.

@araffin
Member

araffin commented Mar 3, 2022

"AssertionError: You must use only one env when doing episodic training."

Please read the documentation: you are using train_freq=(1, "episode") (episodic training); to use multiple envs, you must use "step" as the unit (or train_freq=1 for short).
We recommend using TD3/SAC anyway (they are improved versions of DDPG).
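
For example (an illustrative sketch; the same applies to DDPG/TD3/SAC):

```python
from stable_baselines3 import TD3
from stable_baselines3.common.env_util import make_vec_env

vec_env = make_vec_env("Pendulum-v1", n_envs=4)
# train_freq=1 means "train every step" (step unit), which works with multiple envs,
# unlike the episodic setting train_freq=(1, "episode")
model = TD3("MlpPolicy", vec_env, train_freq=1)
model.learn(total_timesteps=10_000)
```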

Also, I found out that new features such as EvalCallback with StopTrainingOnNoModelImprovement don't exist in the library installed in my environment. Even though I could find the code in the repo.

You need to install the master version (cf. the doc), as it is not yet released.
