Use data acquired by users #701

Open · AdrianPrados opened this issue Apr 25, 2023 · 10 comments
Labels: enhancement (New feature or request)

@AdrianPrados

I am testing imitation with proprietary environments for the robot in our MuJoCo-based lab. For testing, I am generating Pygame environments based on Gym and collecting user-generated data. When working with BC, it is clear that those demos have to be in Trajectory format (acts, obs, infos, terminal) and passed to bc.BC as demonstrations (something like demonstrations=expert_trajs). When I work with DAgger, the documentation is not very clear about this process. The bc.BC part is the same as a normal BC setup, but for SimpleDAggerTrainer it is not clear which expert_policy should be supplied. I have tried using data generated by PPO, and there it is clear that expert_policy = expert (where expert = PPO(...), expert.learn(...)), but if I want to use human data I am not sure what I should do. Do I need to use an expert such as a PPO? Do I need to run bc_trainer.learn(...) and use that result as the expert_policy inside SimpleDAggerTrainer? Or do I need to derive the expert_policy from the acquired human data, and if so, how can I generate a policy from that human data?

[Image attachment: DaggerQuestion]

I have uploaded an example of the code, where Arm2d is my own environment (a robotic arm in two dimensions), expert_trajs are 30 action-state-pair demonstrations collected using Pygame, and expert_policy in SimpleDAggerTrainer has in this case been tested with expert (a PPO trained for 10000 iterations). I don't know if this configuration is correct, or whether, as I commented previously, it is necessary to do a BC training and use its policy as expert_policy, or whether I need to obtain the policy from the user data (and if that is the option, how it would be done).
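
For reference, a minimal sketch of the offline-BC part described above, assuming the human demonstrations are already available as NumPy arrays; `env` and `human_demos` are placeholders (the custom Arm2d environment instance and the recorded Pygame data), and the exact bc.BC constructor arguments (e.g. rng) may differ between imitation versions:

```python
import numpy as np
from imitation.algorithms import bc
from imitation.data.types import Trajectory

# Assumptions: `env` is the custom Arm2d environment instance, and
# `human_demos` is a list of (obs_array, acts_array) pairs recorded in Pygame,
# where obs_array has one more row than acts_array (the final observation).
expert_trajs = [
    Trajectory(obs=obs_array, acts=acts_array, infos=None, terminal=True)
    for obs_array, acts_array in human_demos
]

bc_trainer = bc.BC(
    observation_space=env.observation_space,
    action_space=env.action_space,
    demonstrations=expert_trajs,
    rng=np.random.default_rng(0),
)
bc_trainer.train(n_epochs=20)
```

The open question in this issue is what to use as expert_policy for SimpleDAggerTrainer when the expert is a human rather than a trained RL policy; the discussion below converges on an "interactive policy" for that role.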

@AdrianPrados AdrianPrados added the enhancement New feature or request label Apr 25, 2023
@ernestum ernestum added this to the Release v1.x milestone May 26, 2023
@Neo-manchester

Yeah, I have the same question; I'm wondering how to apply this imitation library to human demonstrations or real-robot demonstrations.

@AdamGleave
Member

@jas-ho can you take a look at this?

My understanding is that imitation should work with human-generated demonstrations just fine, so long as they're of type imitation.algorithms.base.AnyTransitions (e.g. a sequence of trajectories).

However, we don't currently have any out-of-the-box demo. Collecting arbitrary human demos is out of scope (consider how different, say, demos from motion capture of a person performing a task, demos from a person teleoperating a robot, and demos of humans writing text to solve tasks are). But being able to do this in some simple RL environments seems valuable.

Moreover, DAgger -- part of the focus of this issue -- learns not from offline demonstrations but from online actions provided by a human expert. Right now, we only support "synthetic" experts -- i.e. RL policies. This is a limitation.

I think we can tackle both of these problems by writing an "interactive policy": a policy, compatible with the Stable Baselines3 policy API that imitation is built on, that queries a human for the action to take. We can then generate demonstrations using imitation.scripts.eval_policy with that interactive policy, and use DAgger by specifying the expert as that interactive policy.

This interactive policy will need to be specialized to the kind of task: the right UI for teleoperating a robot is different from that for controlling a video game character, which is different again from that for a language modeling task. I'd suggest starting with something like Atari: if we give an example, people can adapt it to their own use case fairly readily. But without an example, it's easy to get lost.
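
As a rough illustration of the interactive-policy idea (not the implementation that was eventually merged): a class that satisfies the Stable Baselines3 BasePolicy interface but asks a human on stdin for each discrete action. The class name HumanKeyboardPolicy is made up for this sketch.

```python
import numpy as np
import torch as th
from stable_baselines3.common.policies import BasePolicy


class HumanKeyboardPolicy(BasePolicy):
    """Queries a human for the action instead of evaluating a neural network."""

    def __init__(self, observation_space, action_space):
        super().__init__(observation_space, action_space)

    def _predict(self, observation: th.Tensor, deterministic: bool = False) -> th.Tensor:
        # BasePolicy.predict() calls this with a batched tensor observation;
        # ask the human once per environment in the batch.
        # (A real implementation would also display the current frame to the
        # human; see the discussion of env.render() further down the thread.)
        n_envs = observation.shape[0]
        actions = [
            int(input(f"Enter action in [0, {self.action_space.n - 1}]: "))
            for _ in range(n_envs)
        ]
        return th.as_tensor(np.array(actions), device=self.device)
```

Because `_predict` is implemented, the standard `policy.predict(obs)` entry point works, so an object like this could in principle be passed as the expert_policy of SimpleDAggerTrainer, or used to record demonstrations as suggested above.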

WDYT @ernestum?

@ernestum
Collaborator

ernestum commented Aug 4, 2023

I think the idea of an interactive policy is worth exploring. Maybe the "polling" mechanism won't work for all scenarios: if the polling is interleaved with some learning process that is slower than the typical environment execution, it might become infeasible. Playing an Atari game in slow motion is probably still feasible, but as soon as we have real physical systems in the loop, we can't easily slow them down. In that case we might prefer a pushing architecture, where the expert somehow feeds transitions to a learner during a trajectory, but the learner can query for specific initial states that it is interested in.
Maybe there are parallels between this polling/pushing dynamic and the problems we have when collecting human preferences?

@jas-ho
Contributor

jas-ho commented Aug 4, 2023

"Maybe the 'polling' mechanism won't work for all scenarios"

Do we need one mechanism that works for all scenarios?

Could we start with an environment that works with the (probably simpler) polling mechanism, and move on to a pushing version later if desired?

@AdamGleave
Member

"Could we start with an environment that works with the (probably simpler) polling mechanism, and move on to a pushing version later if desired?"

Yeah, starting with an MVP seems good.

There are some parallels here with #716, moving our DRLHP implementation from sync to async. Although that required a significant refactor, I think it was probably still best to start with the sync version to get the basic algorithm working.

@ernestum
Collaborator

ernestum commented Aug 7, 2023

Sounds good. I am excited to see what comes out of this exploration!

@jas-ho
Contributor

jas-ho commented Aug 8, 2023

For atari games, it should be possible to just adapt the implementation of interactive.py in the retro-gym repo. I hesitate to make retro-gym a dependency since it seems to be entirely unmaintained. So probably this would mean just copy-pasting the parts of the source we care about (with a comment pointing to the original source) and adapting it to our needs. Any thoughts @AdamGleave @ernestum ?

@AdamGleave
Member

"For atari games, it should be possible to just adapt the implementation of interactive.py in the retro-gym repo. I hesitate to make retro-gym a dependency since it seems to be entirely unmaintained. So probably this would mean just copy-pasting the parts of the source we care about (with a comment pointing to the original source) and adapting it to our needs. Any thoughts @AdamGleave @ernestum?"

Copying it across with attribution seems OK -- it's MIT-licensed like imitation, which makes things straightforward IP-wise.

From a skim, their code feels a bit bloated; I'd hope we can simplify it, but perhaps GUI code just needs to be like that.

@AdamGleave
Member

We got a lot of the way there in #776. The main feature missing is being able to show to the user what would be displayed by env.render(), even when the observations are quite different (e.g. low-level joint positions for a physics simulator; or preprocessed observations for Atari).

This is a little tricky since (a) when policy.predict(obs) gets called, the observation obs may no longer correspond to the current environment state, so we can't just call env.render() then to retrieve it -- we have to have already saved it somewhere; (b) usually we'd store such auxiliary data in the info dict returned by env.step, but policy.predict only sees obs not info.

After discussion with @zajaczajac the least-bad solution we currently see is:

  • Have an environment wrapper that replaces the observation with a dict (or tuple) containing the original observation and the return value of env.render(mode="rgb_array").
  • Wrap the actual policy (the one RL is being trained with) with something that just extracts the original observation.
  • Wrap the interactive policy with something that extracts the rendered observation.

With this, the existing code should Just Work (TM), albeit with an annoying amount of orchestration on the part of the user of the API (so we should at least provide some clear examples of this; see the sketch below).
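
A minimal sketch of that plan, assuming a Gymnasium-style env created with render_mode="rgb_array"; the class names RenderInObsWrapper and ExtractKeyPolicy are made up for illustration and are not part of imitation's API:

```python
import gymnasium as gym
import numpy as np
from gymnasium import spaces


class RenderInObsWrapper(gym.ObservationWrapper):
    """Replace each observation with a dict holding the original observation
    plus the rendered RGB frame, so it can be shown to a human later."""

    def __init__(self, env: gym.Env, frame_shape: tuple):
        super().__init__(env)
        self.observation_space = spaces.Dict({
            "obs": env.observation_space,
            "rgb": spaces.Box(0, 255, shape=frame_shape, dtype=np.uint8),
        })

    def observation(self, obs):
        # Requires the env to have been created with render_mode="rgb_array".
        return {"obs": obs, "rgb": self.env.render()}


class ExtractKeyPolicy:
    """Wrap a policy so it only sees one key of the dict observation:
    "obs" for the learner being trained, "rgb" for the interactive expert."""

    def __init__(self, inner_policy, key: str):
        self.inner_policy = inner_policy
        self.key = key

    def predict(self, observation, state=None, episode_start=None, deterministic=False):
        return self.inner_policy.predict(
            observation[self.key], state, episode_start, deterministic
        )
```

The env wrapper pays the cost of calling env.render() on every step, but it guarantees that the frame shown to the human corresponds to the observation the policy is being asked about, which is exactly the mismatch described in point (a) above.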

@NixGD
Copy link
Contributor

NixGD commented Sep 15, 2023

I thought about this earlier this week and agree with the plan as written above. It should be an easy fix once dictionary observation spaces are supported (see #681).
