Have you ever confirmed controlling one drone with "rpm" using learn.py? #180
Comments
Hi @paehal, I trained stable-baselines3 PPO to hover with just RPMs (in the plus/minus 5% range of the hover value) back in 2020, without yaw control (as it wasn't penalized in the reward). I agree it's a more difficult RL problem, and that's why the base RL aviary class includes simplified action spaces for the 1D and the velocity control cases.

video-10.28.2020_09.45.37.mp4

This was a 4-layer architecture [256, 256, 256, 128; 2 shared, 2 separate for qf and pol], with a 12-dimensional input vector [position, orientation, velocity, angular velocity] mapped to 4 motor velocities (in the ±5% range around the hover RPMs), after 8 hours and ~5M time steps (48 Hz control).
Thanks for the reply and for sharing the video. Glad to hear that rpm control has worked well in the past. I would like to run a study under the same conditions as yours in the latest repository; is that possible? Here is what I am wondering: do I just run "python learn.py" with the action type set to rpm?
yes
no, the action will be a vector of size 4 with the desired RPMs of each motor (in fact, a plus/minus 5% range centered on the hover RPMs)
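For illustration, a minimal sketch of how the RPM action type shows up in the environment; it assumes the current `HoverAviary` constructor still accepts the `obs`/`act` enums and the Gymnasium-style `reset`/`step` API:

```python
from gym_pybullet_drones.envs.HoverAviary import HoverAviary
from gym_pybullet_drones.utils.enums import ObservationType, ActionType

# Sketch only: with ActionType.RPM the agent outputs one normalized value per
# motor, which the environment maps to RPMs in a small band around hover RPM.
env = HoverAviary(obs=ObservationType.KIN, act=ActionType.RPM)
print(env.action_space)  # expected: a Box with 4 values per drone

obs, info = env.reset()
action = env.action_space.sample()
obs, reward, terminated, truncated, info = env.step(action)
env.close()
```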
What is mainly different in the current HoverAviary is that the reward is always positive (instead of including negative penalties), it is only based on position (the result above also included a reward component based on the velocity), and the environment does not terminate early if the quadrotor flips or flies out of bounds. It might be necessary to reintroduce some of those details.
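If some of those details were reintroduced, a hypothetical variant of the reward could mix the position term with a velocity term and a penalty for flipping. This is only a sketch: the names `_computeReward`, `_getDroneStateVector`, and `TARGET_POS` are assumed from the repo's aviary classes, and the weights are made up, not tuned.

```python
import numpy as np

# Hypothetical HoverAviary._computeReward() variant: position term plus a
# velocity penalty and a flip penalty (weights are illustrative only).
def _computeReward(self):
    state = self._getDroneStateVector(0)      # pos[0:3], rpy[7:10], vel[10:13]
    pos_err = np.linalg.norm(self.TARGET_POS - state[0:3])
    vel_pen = 0.1 * np.linalg.norm(state[10:13])
    flip_pen = 10.0 if abs(state[7]) > 0.4 or abs(state[8]) > 0.4 else 0.0
    return max(0.0, 2.0 - pos_err**4) - vel_pen - flip_pen
```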
Let me confirm: in the latest repository, does the environment terminate if the quadrotor flips or flies out of bounds?
No, you can add that to the truncation method (FYI, the reward achieved by a "successful" one-dimensional hover is ~470, in ~3 minutes on my machine; I just tried training the 3D hover, as is, for ~30 minutes and it stopped at a reward of ~250).
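For concreteness, a hedged sketch of such a truncation condition, written as an override in the style of the repo's `_computeTruncated()` (the attribute names are assumed from the aviary base classes, and the thresholds are illustrative):

```python
import numpy as np

# Illustrative truncation: end the episode if the drone flies out of bounds,
# tilts past ~23 degrees in roll/pitch, or the episode length is exceeded.
def _computeTruncated(self):
    state = self._getDroneStateVector(0)
    too_far = np.linalg.norm(state[0:3] - self.TARGET_POS) > 2.0
    flipped = abs(state[7]) > 0.4 or abs(state[8]) > 0.4    # roll/pitch in rad
    timeout = self.step_counter / self.PYB_FREQ > self.EPISODE_LEN_SEC
    return too_far or flipped or timeout
```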
Hi @paehal, I added back the truncation condition and trained this in ~10 minutes (this is now the current code in the repository).

RL.mp4
@JacopoPan Related to this, I have a question: how can I load a trained model in a different job and save a video of its performance? Even with --record_video set to True, the video is not being saved. Also, when I tried to load a different trained model with the following settings, targeting a model in a specified folder, an error occurred. Since I'm not familiar with stable-baselines3, I would appreciate it if you could help me identify the cause.
In a previous version, there was something like test_learning.py, which, when executed, allowed me to verify the behavior in a video. |
The current version of the learn.py script runs an evaluation of the trained model right after training.
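For anyone reading along, here is a minimal sketch of reloading a saved policy without retraining. It assumes the model was saved as a `.zip` by SB3 (e.g. a `best_model.zip` written by an `EvalCallback`) and that the `HoverAviary` constructor still takes `gui`/`record` flags; the path is a placeholder.

```python
from stable_baselines3 import PPO
from gym_pybullet_drones.envs.HoverAviary import HoverAviary
from gym_pybullet_drones.utils.enums import ObservationType, ActionType

# Placeholder path: point this at your own results folder.
model = PPO.load("results/your_run_folder/best_model.zip")

# record=True asks the aviary to capture video frames of the rollout.
env = HoverAviary(gui=True, record=True,
                  obs=ObservationType.KIN, act=ActionType.RPM)
obs, info = env.reset()
for _ in range(int(10 * env.CTRL_FREQ)):         # roughly 10 seconds of flight
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        obs, info = env.reset()
env.close()
```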
Thank you for the quick response. I was able to understand what you were saying by carefully reading the code, and I confirmed that the evaluation runs right after training. Since I wanted to run a pretrained model without retraining it, I made some changes to the code to achieve this. Also, a different question (please let me know if it's better to create a separate issue): I believe that increasing the ctrl_freq generally improves control (e.g., hovering), so I have a few questions about how ctrl_freq is used in the environment.
Ctrl freq is both the frequency at which observations are produced and the frequency at which actions are taken by the environment. The main thing to note is that the observation contains the actions of the last 0.5 seconds, so increasing the ctrl freq will increase the size of the observation space.
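To make that concrete, a small back-of-the-envelope sketch (assuming a 12-value kinematic observation per drone and an action buffer holding `ctrl_freq // 2` past 4-motor commands, i.e. 0.5 seconds' worth, as described above):

```python
# Rough per-drone observation-size estimate: kinematic state plus the buffered
# actions of the last 0.5 s (the 12 and 4 are assumptions for the KIN
# observation and RPM action types).
def obs_dim(ctrl_freq_hz, kin_dim=12, action_dim=4):
    buffer_size = ctrl_freq_hz // 2          # 0.5 s worth of past actions
    return kin_dim + buffer_size * action_dim

print(obs_dim(30))   # 12 + 15 * 4 = 72
print(obs_dim(60))   # 12 + 30 * 4 = 132 -> higher ctrl freq, bigger obs space
```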
Thank you for your reply.
My understanding aligns with this, which is great. Is it also correct to say that this PyBullet step is responsible for the actual physics simulation?
This corresponds to the following part in the code, right? Out of curiosity, where did the idea of using the actions from the last 0.5 seconds as observations come from? Was it from a paper or some other source? Additionally, if I want to change the MLP network model when increasing ctrl_freq, because the action buffer becomes too large, would the following setup be appropriate? Have you had any experience with changing the MLP network structure in a similar situation?
The sim/PyBullet frequency is the actual physics integration frequency, yes. The idea of the action buffer is that the policy might be better guided by knowing what the controller did just before; tying the buffer length to the control frequency makes it depend only on wall-clock time, not on the type of controller (but it might be appropriate to change that, depending on the application). For custom SB3 policies, I can only refer you to the relevant documentation: https://stable-baselines3.readthedocs.io/en/master/guide/custom_policy.html. I used different critic/actor network sizes in past SB3 versions, but the current focus of this repo is having very few dependencies and compatibility with the simplest/most stock versions of them.
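As an illustration of what the linked SB3 page describes, a hedged sketch of changing the actor/critic MLP sizes through `policy_kwargs` (layer sizes are arbitrary; only the `net_arch` keyword itself comes from stable-baselines3, and the `HoverAviary` kwargs are assumed as in the earlier sketches):

```python
from stable_baselines3 import PPO
from gym_pybullet_drones.envs.HoverAviary import HoverAviary
from gym_pybullet_drones.utils.enums import ObservationType, ActionType

env = HoverAviary(obs=ObservationType.KIN, act=ActionType.RPM)

# Separate, larger actor (pi) and critic (vf) networks, e.g. to cope with a
# bigger observation space at a higher ctrl_freq (sizes are illustrative).
model = PPO(
    "MlpPolicy",
    env,
    policy_kwargs=dict(net_arch=dict(pi=[256, 256], vf=[256, 256])),
    verbose=1,
)
model.learn(total_timesteps=100_000)
```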
@JacopoPan
@JacopoPan how did you arrive at the ~470 reward value for a "successful" hover?
Hello, it's been a long while.
I haven't touched this repository much lately, but I'm glad to see that there has been a lot of progress.
I have one question, and it is something I had trouble with in a previous version.
Have you ever seen a configuration where the drone can learn its own policy with the "rpm" action type instead of "one_d_rpm"?
I think "rpm" is still more difficult to learn. However, I believe that rpm control is a necessary setting for a drone flying around in 3D space.
Best regards,