Support for non-static data for reinforcement learning #713

Closed
ghost opened this issue Jan 19, 2020 · 20 comments · Fixed by #1232
Labels
feature (Is an improvement or enhancement) · help wanted (Open to be worked on)

Comments

@ghost

ghost commented Jan 19, 2020

What would be the best approach for reinforcement learning problems, where you need to interact with the environment to generate data? Maybe DataLoader is too restrictive?

@williamFalcon
Contributor

Could you post a snippet?

@irustandi
Contributor

Along these lines, I think it would be good to have the PyTorch Lightning equivalent of the reinforcement learning examples in PyTorch or PyTorch Ignite:

https://github.com/pytorch/examples/tree/master/reinforcement_learning
https://github.com/pytorch/ignite/tree/master/examples/reinforcement_learning

Is this possible?

@colllin

colllin commented Jan 22, 2020

I'm interested in this too. I'm thinking about trying to make it work using PyTorch's new IterableDataset for feeding data from a (prioritized) replay buffer.

Edit: Then I would roll out episodes (across a cluster) before each "epoch", which is just a fixed number of training steps between rollouts.
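
A minimal sketch of what I mean, assuming a hypothetical buffer object with a sample() method (not a real API):

from torch.utils.data import IterableDataset, DataLoader

class ReplayDataset(IterableDataset):
    """Streams transitions sampled from a (prioritized) replay buffer."""

    def __init__(self, buffer, samples_per_epoch):
        self.buffer = buffer                    # hypothetical: exposes .sample()
        self.samples_per_epoch = samples_per_epoch

    def __iter__(self):
        for _ in range(self.samples_per_epoch):
            # each item is e.g. (state, action, reward, next_state, done)
            yield self.buffer.sample()

# loader = DataLoader(ReplayDataset(buffer, samples_per_epoch=5000), batch_size=32)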

@Borda
Member

Borda commented Jan 24, 2020

@colllin would you consider creating a PR?

@djbyrne
Contributor

djbyrne commented Feb 11, 2020

Hey guys, I'm also really interested in using PyTorch Lightning for reinforcement learning. I'm not sure that the DataLoader is the best structure for RL, though. Has anyone found a good way of incorporating DataLoaders for things like gym environments?

@ghost
Author

ghost commented Feb 11, 2020

I've been looking at PyTorch's built-in map-style and iterable-style datasets, and I think there might be a way of getting RL to work with them. Map-style might work for replay buffers; otherwise, iterable-style would provide more flexibility in feeding data. I'll post code if I get something to work.

@djbyrne
Contributor

djbyrne commented Feb 11, 2020

I was trying to see if there was a good way to incorporate the DataLoader into the RL environment, but it doesn't seem to fit. Using it for a replay buffer sounds like a good idea. But what should you do if you are using Lightning for an RL agent that doesn't use a replay buffer? Should you just use a dummy DataLoader that isn't utilized?

@ghost
Author

ghost commented Feb 11, 2020

PyTorch's IterableDataset lets you use a Python iterator as your dataloader. That should work as a sort of dummy dataloader: you can just ask for a sample and run the environment in next(). This is what I'm thinking might work.
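
Something like this minimal sketch, where env (a Gym-style environment) and policy (a callable mapping observation to action) are placeholders:

from torch.utils.data import IterableDataset

class EnvIterator(IterableDataset):
    """Runs the environment inside the iterator: each item is one transition."""

    def __init__(self, env, policy):
        self.env = env        # assumed: Gym-style env with reset()/step()
        self.policy = policy  # assumed: maps observation -> action
        self.obs = self.env.reset()

    def __iter__(self):
        while True:
            action = self.policy(self.obs)
            next_obs, reward, done, _ = self.env.step(action)
            yield self.obs, action, reward, next_obs, done
            self.obs = self.env.reset() if done else next_obs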

@djbyrne
Contributor

djbyrne commented Feb 12, 2020

I was looking at something like this, using the DataLoader simply to retrieve the current state:

from torch.utils.data import IterableDataset

class Environment(IterableDataset):
    """Basic Gym Environment Dataset."""

    def __init__(self, env):
        super().__init__()  # was super(EnvDataset).__init__(), a leftover class name
        self.env = env
        self.obs = self.env.reset()

    def __iter__(self):
        # serve only the current observation -- a "dummy" dataset for Lightning
        return iter([self.obs])  # was self.env.state, which gym envs don't expose

This would provide a "dummy" dataloader, providing Lightning with everything it needs. However, this solution feels like trying to fit the project to the framework.

Would it be possible to change the hard requirement of providing a dataloader to Lightning for systems like RL agents?

@AwokeKnowing

Doesn't RL usually involve 'rollouts with the existing network', then 'evaluation of the data' for learning? It seems odd, even for RL, to have the 'next()' of the environment in the 'inner loop' of the learning.

There does need to be a hook to switch back and forth between learning and 'rollouts', but it might be counterproductive to put the learning on a 'per frame' basis, where each pull of a sample from the dataloader 'runs' the environment. So, in the design of this, it's not about pulling one sample from 'the environment'; it's about pulling a 'batch of data' from the environment. But there would be a benefit to having a standard way to connect the dataloader to the environment to pull batches (theoretically as small as single frames/samples).

@williamFalcon
Contributor

@AwokeKnowing I wish I were more up to speed on RL, but I haven't been doing much of it. I'd love to make sure Lightning supports it. Mind suggesting what needs to change to do that?

thanks

@djbyrne
Contributor

djbyrne commented Feb 12, 2020

@AwokeKnowing are you saying that the dataset would have a reference to both the agent and the env? Then the iter/getitem function inside the dataset would collect a batch of transitions?

@colllin

colllin commented Feb 12, 2020 via email

@AwokeKnowing

AwokeKnowing commented Feb 12, 2020

@colllin I don't think any hardcoded value (5000) is appropriate, because in some tasks the samples ("frames") are a few floats (many gym tasks) or a tiny matrix (chess/go), and sometimes they are 1024x1024 images. And some tasks (meta-learning) may require different amounts of samples per 'model update' step.

So from Lightning's perspective, it cannot know how many rollouts are needed; this has to be configured for each task.

What I am suggesting is something in between the DataLoader and the Environment; call it EnvDataManager. The EnvDataManager is configured with information about how to collect rollouts and feed them to the DataLoader. When the DataLoader requests data from the EnvDataManager (EDM) and the EDM decides it's time for a 'brain' update, the EDM updates the model used by the Environment, collects more samples (async), and begins feeding the DataLoader. The EDM would also know when to 'add' to the existing data vs 'replace' it with new data (rough skeleton below).

[diagram: the proposed DataLoader ↔ EnvDataManager ↔ Environment flow]

Note that by 'new model weights' I just mean access to the agent, to run inference to select an action to pass to the environment to get a new observation. However, typically in RL you don't run the latest 'agent' but a 'checkpoint', which you also pass to multiple 'rollout servers/processes'. You might even have a couple of different versions of the agent. Hence 'weights': the EDM will keep copies of them as needed.
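
A rough skeleton of the EDM idea (every name here is illustrative; nothing like this exists in Lightning):

class EnvDataManager:
    """Sits between the DataLoader and the Environment (illustrative sketch)."""

    def __init__(self, env, rollout_fn, steps_per_update, replace_old_data=False):
        self.env = env
        self.rollout_fn = rollout_fn            # user code: (weights, env, n) -> list of samples
        self.steps_per_update = steps_per_update
        self.replace_old_data = replace_old_data
        self.weights = None                     # agent checkpoint(s) kept by the EDM
        self.samples = []
        self.served = 0

    def update_weights(self, weights):
        # the training loop pushes checkpointed agent weights in here
        self.weights = weights

    def __iter__(self):
        while True:
            if self.served % self.steps_per_update == 0:
                # time for a 'brain' update: roll out with the current checkpoint
                new = self.rollout_fn(self.weights, self.env, self.steps_per_update)
                self.samples = new if self.replace_old_data else self.samples + new
            yield self.samples[self.served % len(self.samples)]
            self.served += 1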

@djbyrne I think so, if I understood you correctly.

@williamFalcon I think Lightning is flexible enough to work with RL, but since it wraps common scenarios, I think the RL case, where there is no 'static data set', has good potential for wrapping, so people don't write the same or similar custom code in all their RL projects to make them work with Lightning.

@colllin

colllin commented Feb 13, 2020

The 5000 was in reference to either the number of time steps of experience you want to collect by rolling out the current version of the agent, or the number of updates to the model you want to make; often these numbers are similar. I think your diagram is sensible.

I was only suggesting that the rollouts probably make sense to occur in an on_epoch_start hook, so that you can collect some experience, after which Lightning will perform a training loop over each batch of your dataloader. I was proposing to "fake" the length of your dataset by returning 5000*batch_size, so that Lightning will call your training_step 5000 times. Then all you need to do is get the experience collected in on_epoch_start into whatever data structure is being sampled by your dataloader.

Actually, it looks like on_epoch_start is called before train_dataloader (which is called before every epoch rather than keeping the dataloader around across epochs), so it should be pretty easy to roll out in on_epoch_start, save a dataset (self.rollout_data = whatever), and then return a new dataloader in train_dataloader.

That's how I see it, but maybe we're coming from different ends of the RL/Lightning universe and this doesn't make sense to you. Either way, I'm excited to see what you come up with.
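
A sketch of that flow; collect_rollouts and the exact numbers are placeholders:

import torch
from torch.utils.data import Dataset, DataLoader
import pytorch_lightning as pl

def collect_rollouts(model):
    """Placeholder: roll out the current policy, return a list of transitions."""
    raise NotImplementedError

class FakeLengthDataset(Dataset):
    """Fakes its length so Lightning runs a fixed number of steps per epoch."""

    def __init__(self, transitions, n_batches, batch_size):
        self.transitions = transitions
        self.fake_len = n_batches * batch_size  # e.g. 5000 * batch_size

    def __len__(self):
        return self.fake_len

    def __getitem__(self, idx):
        # sample uniformly; idx only controls how many steps get run
        return self.transitions[torch.randint(len(self.transitions), ()).item()]

class RLAgent(pl.LightningModule):
    # training_step / configure_optimizers omitted for brevity

    def on_epoch_start(self):
        # collect experience with the current policy before each epoch
        self.rollout_data = collect_rollouts(self)

    def train_dataloader(self):
        # re-called each epoch (per the observation above), so the fresh
        # rollout data ends up in a fresh dataloader
        dataset = FakeLengthDataset(self.rollout_data, n_batches=5000, batch_size=32)
        return DataLoader(dataset, batch_size=32)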

@djbyrne
Contributor

djbyrne commented Feb 13, 2020

I came across the Ptan RL library, which uses a class called ExperienceSource. This is essentially an iterator that keeps track of the environment and the weights of the current policy, and rolls out batches of trajectory data. I think this is aligned with what you were describing, @AwokeKnowing.

@AwokeKnowing

AwokeKnowing commented Feb 13, 2020

@djbyrne yes, that's the general idea, though the ExperienceSource there seems to include the parts about working with gym, DQN-specific concepts, etc.

I think for PyTorch Lightning it would make more sense to have the ExperienceDataManager not know how to work with gym, specific buffers, etc., but rather be focused on interfacing with the Lightning agent and Lightning dataset. Maybe a better name is DynamicDataset.

So concretely, on the Lightning side, we need to provide a class that 'looks like a dataset' (to the dataloader) but can also receive 'model checkpoints'.

Then users could use a library like Ptan, or their own, or just a couple of simple hand-coded methods to launch/run the 'gym'. But Lightning would give them an automatic flow of 'updated agents' and a clear place to feed the data.

It seems a bit odd that a 'dataset' should have this functionality, but in RL the 'dataset' is very much 'alive', and changes in the model directly affect the data that is passed to the dataloader. There may be something we can learn from Unity ML-Agents about where to separate the concerns.

So think of the simplest possible environment: it provides an observation of a number from 1 to 10, the action is 1 or 0 to say whether it's over 5 or not, and the reward is 1 or 0. We need Lightning to think of the series of observations and rewards as a DynamicDataset, and we need Lightning to provide the agent checkpoints to the DynamicDataset so that it can continue to generate (unlimited) data.
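
That toy example, spelled out (all names illustrative):

import random
from torch.utils.data import IterableDataset

class OverFiveEnv:
    """Toy environment: observe a number 1-10; action 1 means 'over 5'."""

    def reset(self):
        self.number = random.randint(1, 10)
        return self.number

    def step(self, action):
        reward = 1 if action == (1 if self.number > 5 else 0) else 0
        return self.reset(), reward  # next observation, reward

class DynamicDataset(IterableDataset):
    """'Alive' dataset: the current agent checkpoint drives data generation."""

    def __init__(self, env):
        self.env = env
        self.agent = None  # set via receive_checkpoint before iterating

    def receive_checkpoint(self, agent):
        self.agent = agent  # agent: callable mapping observation -> action (0 or 1)

    def __iter__(self):
        obs = self.env.reset()
        while True:  # unlimited data, as long as checkpoints keep arriving
            action = self.agent(obs)
            next_obs, reward = self.env.step(action)
            yield obs, action, reward
            obs = next_obs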

@djbyrne
Contributor

djbyrne commented Feb 13, 2020

@AwokeKnowing yeah, I agree that the EDM should not need to know about the specific env or buffer, and should really just be an interface.

Suppose the Lightning model contained an env_step() function, where the user provides the logic for carrying out a single step of their specific environment. The EDM would hold a reference to the PL model, which gives it access to the weights, forward(), and env_step(). The EDM could then handle the rollout agnostic to the type of environment being used, and provide the dataset interface for the dataloader.
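
Roughly like this, where env_step() and the class name are just the suggestion above, not a real API:

from torch.utils.data import IterableDataset

class ExperienceDataManager(IterableDataset):
    """Rolls out transitions through the model's own env_step() hook."""

    def __init__(self, model, steps_per_epoch):
        self.model = model                  # a LightningModule exposing env_step()
        self.steps_per_epoch = steps_per_epoch

    def __iter__(self):
        for _ in range(self.steps_per_epoch):
            # env_step() is the user-defined hook proposed above: it performs
            # one interaction with the environment and returns a transition
            yield self.model.env_step()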

I wonder: is this actually a problem that Lightning should be trying to solve, or should this be solely in the domain of the user?

@AwokeKnowing

AwokeKnowing commented Feb 13, 2020

@djbyrne for the question of whether 'Lightning should solve it', the question is: is there some 'repetitive' code that all RL projects will be writing to wire these together? If so, I think yes, because the point of PyTorch Lightning is that I want to just write the logic (the model, and the code to interact with "minecraft" or my own env using the model), and I don't want to write the code to manage checkpointed agents and transform/pool my observations into a 'dataset'.

What would help is to actually use PL to do 10 RL projects, some similar and some totally different, see what the repetitive code specific to RL 'data' is, and try to put that part in PL as a DynamicDataset.

My expectation is that a common thread is managing the agent checkpoints and batching observations together into randomized buffers of sequences. It would be good to have something that you can hook an environment up to and start 'filling up' a dataset. The difference between RL and other dynamic datasets (e.g. a webcam) is that the 'agent' affects the data. So: standardize the way to plug the PL agent (checkpoints) into a dataset and save the samples to disk or a DB as buffers, leaving the RL practitioner to write the code of the model and the code that directly pulls samples from the environment given a particular agent. And there would need to be a clear place to inject the logic of how to select data from the buffers to feed to the dataloader.

@djbyrne
Contributor

djbyrne commented Feb 13, 2020

I'd certainly be up for building some varied examples of RL projects with Lightning, to get a better idea of what works across the board.

@Borda added the "feature" and "help wanted" labels on Feb 22, 2020