Port from Stable Baselines 2 to 3 (test code only) #40
Conversation
Codecov Report
```
@@           Coverage Diff            @@
##            master      #40   +/-  ##
=========================================
  Coverage   100.00%  100.00%
=========================================
  Files           24       24
  Lines          684      683     -1
=========================================
- Hits           684      683     -1
```
LGTM
The tests are skipped by default, so I ran them on my machine, and half of them failed. I'll try running them before this change to see whether they were already flaky. I'm wondering if we should just delete these tests?
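For context, a minimal sketch of the skip-by-default pattern, assuming the expensive RL tests are gated behind a custom pytest marker (the marker name, test name, and threshold here are illustrative, not the repo's actual setup):

```python
import pytest


# Hypothetical setup: expensive RL tests carry a custom marker and are
# deselected by default (e.g. addopts = -m "not expensive" in setup.cfg).
# Run them explicitly with:  pytest -m expensive
@pytest.mark.expensive
def test_ppo_reaches_reward_threshold():
    mean_reward = run_training()
    assert mean_reward >= 1000.0  # illustrative threshold


def run_training() -> float:
    # Placeholder so the sketch is self-contained; the real test would
    # train PPO on a seals environment and return the mean episode reward.
    return 1500.0
```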
Three failures on SB2 as well. HalfCheetah and Humanoid failed in both runs; Hopper failed in SB3, and Walker2D failed in SB2. These may well vary by seed. Maybe this suggests something is broken with the environments? I don't have the bandwidth to investigate further. I suspect a combination of variation across seeds and the mean reward genuinely being lower in seals than in Gym, because fixed-horizon environments are less sample-efficient for RL and we're only training for 200k timesteps.
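For reference, roughly the kind of check under discussion, as a Stable Baselines 3 sketch; the environment ID, seed, and evaluation settings are assumptions, not the repo's exact test:

```python
import gym
import seals  # noqa: F401  -- importing registers the seals/* environment IDs
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy

# Train for the 200k-timestep budget mentioned above, then measure mean
# episode reward; with fixed-horizon environments this budget may simply
# be too small for a stable pass threshold.
env = gym.make("seals/HalfCheetah-v0")
model = PPO("MlpPolicy", env, seed=0)
model.learn(total_timesteps=200_000)

mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=10)
print(f"mean reward: {mean_reward:.1f} +/- {std_reward:.1f}")
```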
Interesting to note that Humanoid and HalfCheetah are doing much worse on the SB2 run (large negative return) than on the SB3 run (positive return).
I vaguely remember that when we introduced the MuJoCo envs we decided to skip these tests because they never passed consistently. Here's the relevant comment thread:
My vote is to leave the tests in (continuing to skip them on CI), maybe leaving a TODO, and then inspect them more thoroughly when someone has the bandwidth to investigate what's going on across multiple seeds. If we suddenly have a ton of extra time on our hands, this seems like a good place to try tracking average performance via airspeed velocity.
Airspeed Velocity seems a much better fit for this, since we care about changes over time rather than having a particular pass threshold in mind. I'm OK leaving the tests in with a TODO.
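A rough sketch of what that could look like with airspeed velocity (asv): `track_*` benchmark methods record an arbitrary numeric value per commit rather than asserting pass/fail, which suits "watch for regressions over time". The environment ID, budget, and class layout here are assumptions:

```python
import gym
import seals  # noqa: F401  -- registers the seals/* environment IDs
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy


class PPORewardSuite:
    # RL training far exceeds asv's default per-benchmark timeout.
    timeout = 3600

    def track_half_cheetah_mean_reward(self):
        # asv plots whatever a track_* method returns across commits,
        # so we return mean reward instead of asserting a threshold.
        env = gym.make("seals/HalfCheetah-v0")
        model = PPO("MlpPolicy", env, seed=0)
        model.learn(total_timesteps=200_000)
        mean_reward, _ = evaluate_policy(model, env, n_eval_episodes=10)
        return mean_reward
```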
We have a stub that does some sanity checking of our environments via RL training, using Stable Baselines 2 PPO. This PR moves it to Stable Baselines 3 so we can drop the TensorFlow dependency.
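For illustration, the core of the port at the API level, as a sketch rather than the repo's actual diff (the environment and hyperparameters are placeholders):

```python
# Before (Stable Baselines 2, TensorFlow-based):
#   from stable_baselines import PPO2
#   model = PPO2("MlpPolicy", env)
#
# After (Stable Baselines 3, PyTorch-based), dropping the TF dependency:
import gym
from stable_baselines3 import PPO

env = gym.make("CartPole-v1")
model = PPO("MlpPolicy", env)
model.learn(total_timesteps=10_000)
```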