Use data acquired by users #701
Comments
Yeah, I have the same question: wondering how to apply this imitation library to human demonstrations or real robot demonstrations.
@jas-ho can you take a look at this? My understanding is that this should be possible in principle; however, we don't currently have any out-of-the-box demo. Collecting arbitrary human demos is out of scope (think of how different, e.g., demos from motion capture of a person performing a task, demos from a person teleoperating a robot, or demos of humans writing text to solve tasks are). But being able to do this in some simple RL environments seems valuable.

Moreover, DAgger -- part of the focus of this issue -- learns not from offline demonstrations but from online actions provided by a human expert. Right now, we only support "synthetic" experts, i.e. RL policies. This is a limitation.

I think we can tackle both of these problems by writing an "interactive policy": a policy, compatible with the Stable Baselines3 policy API, that polls a human for an action whenever one is requested. This interactive policy will need to be specialized to the kind of task -- the right UI for teleoperating a robot is different from that for controlling a video game character, which in turn is different from that for a language-modeling task. I'd suggest starting with something like Atari: if we give an example, people can adapt it to their own use case fairly readily. But without an example, it's easy to get lost. WDYT @ernestum?
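For concreteness, here is a minimal sketch (not part of imitation) of what such an interactive policy could look like. It only mirrors the shape of SB3's `predict()` interface; the `InteractiveHumanPolicy` name and the `ask_human_for_action` callback are hypothetical placeholders for whatever UI fits the task:

```python
import numpy as np


class InteractiveHumanPolicy:
    """Sketch of a policy that polls a human for actions instead of a network.

    Mirrors the shape of Stable Baselines3's ``predict()`` so it can be dropped
    in wherever a policy is queried for actions; a real implementation might
    need to subclass ``stable_baselines3.common.policies.BasePolicy``.
    """

    def __init__(self, action_space, ask_human_for_action):
        self.action_space = action_space
        # Hypothetical callback that blocks until the human supplies an action,
        # e.g. a keypress handler for Atari or a joystick reader for a robot.
        self._ask = ask_human_for_action

    def predict(self, observation, state=None, episode_start=None, deterministic=True):
        # SB3's predict() receives a batch of observations (one per vectorized
        # env) and returns (actions, state); here we poll the human once per env.
        actions = np.array([self._ask(obs) for obs in observation])
        return actions, state
```

The per-task specialization would then live entirely in `ask_human_for_action`, which is what would differ between Atari, robot teleoperation, and a language task.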
I think the idea of an interactive policy is worth exploring. Maybe the "polling" mechanism won't work for all scenarios: if the polling is interleaved with some learning process that is slower than typical environment execution, it might become infeasible. Playing an Atari game in slow motion is probably still feasible, but as soon as we have real physical systems in the loop, we can't slow them down easily. In that case we might prefer a pushing architecture, where the expert somehow feeds transitions to the learner during a trajectory, but the learner can query for specific initial states that it is interested in.
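To make the "pushing" idea concrete, here is a rough, purely illustrative sketch (not part of imitation) of an interface where the expert streams transitions to the learner instead of being polled for each action; all names (`Transition`, `TransitionInbox`, `push`, `drain`) are hypothetical:

```python
import queue
from dataclasses import dataclass
from typing import Any, List


@dataclass
class Transition:
    """One step of experience pushed by the human expert."""
    obs: Any
    act: Any
    next_obs: Any
    done: bool


class TransitionInbox:
    """Buffer the expert pushes transitions into and the learner drains.

    The expert side runs at the real system's own rate (e.g. a physical robot),
    so nothing in the learning loop slows it down; the learner consumes
    whatever has accumulated whenever it is ready.
    """

    def __init__(self) -> None:
        self._queue: queue.Queue = queue.Queue()

    def push(self, transition: Transition) -> None:
        # Called from the teleoperation/expert side as the trajectory unfolds.
        self._queue.put(transition)

    def drain(self) -> List[Transition]:
        # Called from the learner side; returns everything pushed so far.
        items = []
        while True:
            try:
                items.append(self._queue.get_nowait())
            except queue.Empty:
                break
        return items
```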
Could we start with an environment that works with the (probably simpler) polling mechanism and move on to a pushing version later, if desired?
Yeah, starting with an MVP seems good. There are some parallels here with #716, moving our DRLHP implementation from sync to async. Although that required a significant refactor, I think it was probably still best to start with the sync version to get the basic algorithm working.
Sounds good. I am excited to see what comes out of this exploration!
For Atari games, it should be possible to just adapt the implementation of interactive.py in the retro-gym repo. I hesitate to make retro-gym a dependency since it seems to be entirely unmaintained. So this would probably mean copy-pasting the parts of the source we care about (with a comment pointing to the original source) and adapting them to our needs. Any thoughts @AdamGleave @ernestum?
Copying it across with attribution seems OK -- it's MIT licensed like us. From a skim their code feels a bit bloated; I'd hope we can simplify it, but perhaps GUI code just needs to be like this.
We got a lot of the way there in #776. The main feature missing is being able to show the user what the environment would render while they choose their actions. This is a little tricky. After discussion with @zajaczajac, the least-bad solution we currently see hinges on dictionary observation spaces being supported.
I thought about this earlier this week and agree with the plan as written above. It should be an easy fix after dictionary observation spaces are supported (see #681).
I am testing imitation with proprietary environments for the robot in our Mujoco-based lab. For testing, I am generating Pygame environments based on Gym and creating user-generated data. When working with BC, it is clear that those demos have to be in Trajectory format (obs, acts, infos, terminal) and passed to bc.BC as demonstrations (something like demonstrations=expert_trajs). When I work with DAgger, the documentation does not make the process very clear. The bc.BC part is the same as a normal BC setup, but for SimpleDAggerTrainer it is not clear which expert_policy should be provided. I have tried using data generated by PPO, and there it is clear that expert_policy = expert (where expert = PPO(...), expert.learn(...)), but if I want to use human data I am not sure what I should do. Do I need to use an expert that is, for example, a PPO? Do I need to run BC training (bc_trainer.train(...)) and use the resulting policy as expert_policy inside SimpleDAggerTrainer? Or do I need to get the expert_policy from the acquired human data, and if so, how can I generate a policy from this human data?
I have uploaded an example of my code, where Arm2d is my own environment (a robotic arm in two dimensions), expert_trajs are 30 action-state pairs collected using Pygame, and expert_policy in SimpleDAggerTrainer has in this case been set to expert (a PPO trained for 10000 iterations). I don't know if this configuration is correct, or whether, as I commented above, it is necessary to do a BC training and use its policy as expert_policy, or whether I need to get the policy from the user data (and if that is the option, how it would be done).
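For reference, here is a hedged sketch of the configuration described above, assuming `venv` is a vectorized Arm2d environment, `expert_trajs` holds the 30 human trajectories in imitation's Trajectory format, and `expert` is the trained PPO. Argument names follow the public bc.BC and SimpleDAggerTrainer interfaces; check them against your installed imitation version.

```python
import tempfile

import numpy as np
from imitation.algorithms import bc
from imitation.algorithms.dagger import SimpleDAggerTrainer

# venv, expert_trajs and expert are assumed to be defined as described above.
rng = np.random.default_rng(0)

# BC alone can consume the human-collected trajectories directly.
bc_trainer = bc.BC(
    observation_space=venv.observation_space,
    action_space=venv.action_space,
    demonstrations=expert_trajs,  # human demos in Trajectory format
    rng=rng,
)

with tempfile.TemporaryDirectory(prefix="dagger_") as scratch_dir:
    dagger_trainer = SimpleDAggerTrainer(
        venv=venv,
        scratch_dir=scratch_dir,
        # DAgger queries this policy online for corrective actions. With a
        # trained PPO expert this is just its SB3 policy; to use a human
        # instead, this is where an interactive human policy (like the sketch
        # earlier in this thread) would go -- a policy cannot be extracted
        # from the offline trajectories alone.
        expert_policy=expert.policy,
        bc_trainer=bc_trainer,
        rng=rng,
    )
    dagger_trainer.train(2000)
```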