Use data acquired by users #701

Open · AdrianPrados opened this issue Apr 25, 2023 · 10 comments
Labels: enhancement (New feature or request)

@AdrianPrados

I am testing imitation with proprietary environments for the robot in our MuJoCo-based lab. For testing, I am generating Pygame environments based on Gym and collecting user-generated data. When working with BC, it is clear that those demos have to be in Trajectory format (acts, obs, infos, terminal) and passed to bc.BC as demonstrations (something like demonstrations=expert_trajs). When I work with DAgger, the documentation is not very clear about this process. The bc.BC part is the same as a normal BC setup, but for SimpleDAggerTrainer it is not clear which expert_policy should be supplied. I have tried using data generated by PPO, and there it is clear that expert_policy = expert (where expert = PPO(...), expert.learn(...)), but if I want to use human data I am not sure what I should do. Do I need to use an expert such as a PPO? Do I need to run bc_trainer.learn(...) and use that result as the expert_policy inside SimpleDAggerTrainer? Or do I need to derive the expert_policy from the acquired human data, and if so, how can I generate a policy from that human data?

[Image attachment: DaggerQuestion]

I have uploaded an example of the code, where Arm2d is my own environment (a robotic arm in two dimensions), expert_trajs are 30 action-state-pair demonstrations collected using Pygame, and expert_policy in SimpleDAggerTrainer has in this case been tested with expert (a PPO trained for 10000 iterations). I don't know if this configuration is correct, or whether, as I commented previously, it is necessary to do a BC training and use its policy as expert_policy, or whether I need to obtain the policy from the user data (and if that is the option, how it would be done).
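
For reference, a minimal sketch of the offline-BC part described above, assuming the human demonstrations are already available as NumPy arrays; `env` and `human_demos` are placeholders (the custom Arm2d environment instance and the recorded Pygame data), and the exact bc.BC constructor arguments (e.g. rng) may differ between imitation versions:

```python
import numpy as np
from imitation.algorithms import bc
from imitation.data.types import Trajectory

# Assumptions: `env` is the custom Arm2d environment instance, and
# `human_demos` is a list of (obs_array, acts_array) pairs recorded in Pygame,
# where obs_array has one more row than acts_array (the final observation).
expert_trajs = [
    Trajectory(obs=obs_array, acts=acts_array, infos=None, terminal=True)
    for obs_array, acts_array in human_demos
]

bc_trainer = bc.BC(
    observation_space=env.observation_space,
    action_space=env.action_space,
    demonstrations=expert_trajs,
    rng=np.random.default_rng(0),
)
bc_trainer.train(n_epochs=20)
```

The open question in this issue is what to use as expert_policy for SimpleDAggerTrainer when the expert is a human rather than a trained RL policy; the discussion below converges on an "interactive policy" for that role.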

@AdrianPrados AdrianPrados added the enhancement New feature or request label Apr 25, 2023
@ernestum ernestum added this to the Release v1.x milestone May 26, 2023
@Neo-manchester

Yeah, I have the same question; I'm wondering how to apply this imitation library to human demonstrations or real-robot demonstrations.

@AdamGleave
Member

@jas-ho can you take a look at this?

My understanding is that imitation should work with human-generated demonstrations just fine, so long as they're of type imitation.algorithms.base.AnyTransitions (e.g. a sequence of trajectories).

However, we don't currently have any out-of-the-box demo. Collecting arbitrary human demos is out of scope (consider how different, say, demos from motion capture of a person performing a task, demos from a person teleoperating a robot, and demos of humans writing text to solve tasks are). But being able to do this in some simple RL environments seems valuable.

Moreover, DAgger -- part of the focus of this issue -- learns not from offline demonstrations but from online actions provided by a human expert. Right now, we only support "synthetic" experts -- i.e. RL policies. This is a limitation.

I think we can tackle both of these problems by writing an "interactive policy": a policy, compatible with the Stable Baselines3 policy API that imitation is built on, that queries a human for the action to take. We can then generate demonstrations using imitation.scripts.eval_policy with that interactive policy, and use DAgger by specifying the expert as that interactive policy.

This interactive policy will need to be specialized to the kind of task: the right UI for teleoperating a robot is different from that for controlling a video game character, which is different again from that for a language modeling task. I'd suggest starting with something like Atari: if we give an example, people can adapt it to their own use case fairly readily. But without an example, it's easy to get lost.
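
As a rough illustration of the interactive-policy idea (not the implementation that was eventually merged): a class that satisfies the Stable Baselines3 BasePolicy interface but asks a human on stdin for each discrete action. The class name HumanKeyboardPolicy is made up for this sketch.

```python
import numpy as np
import torch as th
from stable_baselines3.common.policies import BasePolicy


class HumanKeyboardPolicy(BasePolicy):
    """Queries a human for the action instead of evaluating a neural network."""

    def __init__(self, observation_space, action_space):
        super().__init__(observation_space, action_space)

    def _predict(self, observation: th.Tensor, deterministic: bool = False) -> th.Tensor:
        # BasePolicy.predict() calls this with a batched tensor observation;
        # ask the human once per environment in the batch.
        # (A real implementation would also display the current frame to the
        # human; see the discussion of env.render() further down the thread.)
        n_envs = observation.shape[0]
        actions = [
            int(input(f"Enter action in [0, {self.action_space.n - 1}]: "))
            for _ in range(n_envs)
        ]
        return th.as_tensor(np.array(actions), device=self.device)
```

Because `_predict` is implemented, the standard `policy.predict(obs)` entry point works, so an object like this could in principle be passed as the expert_policy of SimpleDAggerTrainer, or used to record demonstrations as suggested above.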

WDYT @ernestum?

@ernestum
Collaborator

ernestum commented Aug 4, 2023

I think the idea of an interactive policy is worth exploring. Maybe the "polling" mechanism won't work for all scenarios: if the polling is interleaved with some learning process that is slower than the typical environment execution, it might become infeasible. Playing an Atari game in slow motion is probably still feasible, but as soon as we have real physical systems in the loop, we can't easily slow them down. In that case we might prefer a pushing architecture, where the expert somehow feeds transitions to a learner during a trajectory, but the learner can query for specific initial states that it is interested in.
Maybe there are parallels between this polling/pushing dynamic and the problems we have when collecting human preferences?

@jas-ho
Contributor

jas-ho commented Aug 4, 2023

"Maybe the 'polling' mechanism won't work for all scenarios"

Do we need one mechanism that works for all scenarios?

Could we start with an environment that works with the (probably simpler) polling mechanism, and move on to a pushing version later if desired?

@AdamGleave
Member

"Could we start with an environment that works with the (probably simpler) polling mechanism, and move on to a pushing version later if desired?"

Yeah, starting with an MVP seems good.

There are some parallels here with #716, moving our DRLHP implementation from sync to async. Although that required a significant refactor, I think it was probably still best to start with the sync version to get the basic algorithm working.

@ernestum
Collaborator

ernestum commented Aug 7, 2023

Sounds good. I am excited to see what comes out of this exploration!

@jas-ho
Contributor

jas-ho commented Aug 8, 2023

For atari games, it should be possible to just adapt the implementation of interactive.py in the retro-gym repo. I hesitate to make retro-gym a dependency since it seems to be entirely unmaintained. So probably this would mean just copy-pasting the parts of the source we care about (with a comment pointing to the original source) and adapting it to our needs. Any thoughts @AdamGleave @ernestum ?

@AdamGleave
Member

"For atari games, it should be possible to just adapt the implementation of interactive.py in the retro-gym repo. I hesitate to make retro-gym a dependency since it seems to be entirely unmaintained. So probably this would mean just copy-pasting the parts of the source we care about (with a comment pointing to the original source) and adapting it to our needs. Any thoughts @AdamGleave @ernestum?"

Copying it across with attribution seems OK -- it's MIT-licensed like imitation, which makes things straightforward IP-wise.

From a skim, their code feels a bit bloated; I'd hope we can simplify it, but perhaps GUI code just needs to be like that.

@AdamGleave
Member

We got a lot of the way there in #776. The main feature missing is being able to show to the user what would be displayed by env.render(), even when the observations are quite different (e.g. low-level joint positions for a physics simulator; or preprocessed observations for Atari).

This is a little tricky since (a) when policy.predict(obs) gets called, the observation obs may no longer correspond to the current environment state, so we can't just call env.render() then to retrieve it -- we have to have already saved it somewhere; (b) usually we'd store such auxiliary data in the info dict returned by env.step, but policy.predict only sees obs not info.

After discussion with @zajaczajac the least-bad solution we currently see is:

  • Have an environment wrapper that replaces the observation with a dict (or tuple) containing the original observation and the return value of env.render(mode="rgb_array").
  • Wrap the actual policy (the one RL is being trained with) with something that just extracts the original observation.
  • Wrap the interactive policy with something that extracts the rendered observation.

With this, the existing code should Just Work (TM), albeit with an annoying amount of orchestration on the part of the user of the API (so we should at least provide some clear examples of this; see the sketch below).
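
A minimal sketch of that plan, assuming a Gymnasium-style env created with render_mode="rgb_array"; the class names RenderInObsWrapper and ExtractKeyPolicy are made up for illustration and are not part of imitation's API:

```python
import gymnasium as gym
import numpy as np
from gymnasium import spaces


class RenderInObsWrapper(gym.ObservationWrapper):
    """Replace each observation with a dict holding the original observation
    plus the rendered RGB frame, so it can be shown to a human later."""

    def __init__(self, env: gym.Env, frame_shape: tuple):
        super().__init__(env)
        self.observation_space = spaces.Dict({
            "obs": env.observation_space,
            "rgb": spaces.Box(0, 255, shape=frame_shape, dtype=np.uint8),
        })

    def observation(self, obs):
        # Requires the env to have been created with render_mode="rgb_array".
        return {"obs": obs, "rgb": self.env.render()}


class ExtractKeyPolicy:
    """Wrap a policy so it only sees one key of the dict observation:
    "obs" for the learner being trained, "rgb" for the interactive expert."""

    def __init__(self, inner_policy, key: str):
        self.inner_policy = inner_policy
        self.key = key

    def predict(self, observation, state=None, episode_start=None, deterministic=False):
        return self.inner_policy.predict(
            observation[self.key], state, episode_start, deterministic
        )
```

The env wrapper pays the cost of calling env.render() on every step, but it guarantees that the frame shown to the human corresponds to the observation the policy is being asked about, which is exactly the mismatch described in point (a) above.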

@NixGD
Copy link
Contributor

NixGD commented Sep 15, 2023

I thought about this earlier this week and agree with the plan as written above. It should be an easy fix once dictionary observation spaces are supported (see #681).
