This is a repo where I implement the algorithms from the paper *Fast reinforcement learning with generalized policy updates* and extend them to non-linear tasks. Thanks to Shaobo Hou and Andre Barreto for helping me out and answering some questions I had about the paper!
- Background
- Approach
- Traditional Reinforcement Learning Framework
- Environment
- Fast Reinforcement Learning Framework
- Algorithms
- Results
- Expanding the Research Paper for Non-Linear Tasks
- Future Work
- File Descriptions
- Additional Resources
Reinforcement learning algorithms are powerful, especially paired with neural networks. They've allowed us to create AI agents that can solve complex games like Go and perform at a high level on visually rich domains like Atari games.
However, current reinforcement learning methods are very sample inefficient; they require a huge amount of training samples in order to learn to play at a reasonable level. To quote the paper: "in order to learn how to play one game of Atari, a simple video game console from the 1980s, an agent typically consumes an amount of data corresponding to several weeks of uninterrupted playing; it has been shown that, in some cases, humans are able to reach the same performance level in around 15 min".
Thus, there is motivation to find more sample efficient algorithms that require fewer training samples to reach a reasonable level of performance. This paper proposes some algorithms to address this problem.
One hypothesis for why humans are able to learn so much faster than today's AI algorithms is that we can draw from our past experiences to "jump-start" the learning process on a new task we haven't seen before, therefore requiring less experience and time to achieve a reasonable level of performance. This assumes that some of the tasks we've learned in the past share characteristics and features with the new task at hand, so we don't have to start from scratch when learning it.
In our early childhood years, we are faced with the task of developing motor control: learning how to crawl, stand, balance, walk and run. Learning these skills helps us navigate and interact with the world we live in. When learning how to play a new sport like badminton, for example, we can use the motor control skills we learned from these past tasks to help learn related, but more specialized, footwork tasks in badminton; i.e. we don't have to re-learn how to walk to learn how to play a new sport. Additionally, if we've previously played another racket sport like tennis, we've most likely learned some hand-eye coordination skills, which can also be applied to badminton and may speed up the learning process even more.
In contrast, AI algorithms generally don't have the capability of re-using skills learned from past tasks. With every new task, the AI agent starts with a clean slate and has to learn from scratch. In the case of developing a robot that could play badminton, that would mean the AI agent would have to learn how to stand and walk before even learning how to play the sport itself. If we were to create an algorithm that allowed the AI agent to draw from past experiences and leverage the skills it learned from past related tasks, perhaps it would help the AI agent be more sample efficient and learn faster, like humans.
The traditional framework used for reinforcement learning contains two key concepts that are defined as such:
- an environment that encapsulates the domain of the problem we're trying to solve
- an agent that can observe the environment, perform actions to interact with it, and receive rewards from it
(diagram can be found in Sutton and Barto's book as Figure 3.1 on page 48)
At every time step, the agent observes the environment state $s$, performs an action $a$, and in return receives a transition reward $r$ as the environment transitions to a new state $s^{'}$.
The reward the agent receives and the state that results from the agent's actions are determined by the dynamics function, defined as $p(s^{'},r \mid s,a)$: the probability of transitioning to state $s^{'}$ and receiving reward $r$ when the agent takes action $a$ in state $s$.
Thus, the dynamics function determines the "dynamics" of the environment; essentially the rules or laws that govern how the environment behaves and how the agent is rewarded.
In the traditional reinforcement learning framework described above, the agent's goal is to maximize the reward it receives. The reward is defined by the dynamics of the environment, and thus you could say that the reward dynamics determine the task the agent must optimize for.
For example, playing a zero-sum board game would have reward dynamics structured such that winning the game results in a positive reward, while losing the game results in a negative reward. Therefore, given the reward dynamics of this environment, training an agent to maximize the reward received would essentially train the agent to optimize for the task of developing a strategy that allows it to win as much as possible on this board game.
As mentioned before, agents typically train from scratch for every new task at hand. But in this paper, the authors develop a framework where an agent is able to utilize skills it developed from past tasks to help it learn reasonable strategies for new tasks faster.
To do so, we must first formalize the definition of a task. A task is specified by its reward dynamics: each task defines its own reward function, while the transition dynamics of the environment are shared across tasks.
Therefore, the concept of tasks splits the dynamics function into two parts: the transition dynamics, which govern how the environment's state changes in response to the agent's actions, and the reward dynamics, which govern what rewards the agent receives.
To give some intuition why this might be helpful, we can go back to the badminton analogy. In badminton, different court lines are used to determine the boundaries of the court, depending on whether you are playing a singles or doubles game:
(image taken from the thumbnail of this video)
The highlighted region shows the area of the court that's considered "in", while everything else is considered "out". Whether you're playing a singles game or doubles game, you are still subjected to the same laws of physics that govern our world; e.g. the way the birdie moves and the way that you interact with the physical world does not change based on which game format you're playing. Therefore, the transition dynamics for both the singles and doubles game format are the same.
What is different, however, are the reward dynamics. A birdie landing between the singles sideline and the doubles sideline would have a different result depending on the game format you're playing. If you're playing singles, it would be considered "out", whereas if you were playing doubles, it would be considered "in". Therefore even though the birdie landed in the same spot, the particular game format you're playing would determine whether you or your opponent scored the point. More generally, the same sequence of events (arriving at state $s^{'}$ after taking action $a$ in state $s$) can result in different rewards depending on the task at hand.
With badminton, it's not difficult to see that having past experience with one game format would help you learn the other game format much quicker compared to someone with no experience with badminton at all. The transition dynamics are the same in both formats, which means that the way the birdie moves and the way you interact with the world is the same. In addition, the two game formats are very similar, with only a slight difference in court boundaries. As a result, a lot of skills learned while playing one game format will be relevant and useful when learning to play the other game format.
Therefore given the transition dynamics are the same, we can create a general algorithm that emulates this "transfer of skills" phenomenon. We can train an agent on a set of tasks, and allow it to utilize the skills it learned previously to help achieve a reasonable level of performance on a new task, in a relatively shorter amount of time than if it were to start from scratch.
(image taken from Fig. S1 a) on page 5 of the supplementary information of the paper)
The environment used in the paper is a 10 by 10 grid with 2 different types of objects (represented by the red squares and blue triangles) occupying a total of 10 cells at a given time. The agent (represented by the yellow circle) can move up, down, left or right, and will automatically pick up objects if it moves to a cell occupied by one. Once an object is picked up by the agent, another object of a random type is randomly generated in one of the unoccupied cells in the grid.
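To make the setup concrete, here's a minimal sketch of such a grid world in Python. The class and method names are illustrative only; the repo's actual environment implementation lives in `classes.py`.

```python
import random

# Minimal sketch of the 10x10 grid world described above; names and details
# are illustrative (the repo's actual implementation lives in classes.py).
GRID_SIZE = 10
NUM_OBJECTS = 10
OBJECT_TYPES = ("red_square", "blue_triangle")
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


class GridWorld:
    def __init__(self):
        self.agent = (0, 0)
        self.objects = {}  # maps (row, col) -> object type
        while len(self.objects) < NUM_OBJECTS:
            cell = (random.randrange(GRID_SIZE), random.randrange(GRID_SIZE))
            if cell != self.agent and cell not in self.objects:
                self.objects[cell] = random.choice(OBJECT_TYPES)

    def step(self, action):
        """Move the agent and return the type of object picked up (or None)."""
        dr, dc = ACTIONS[action]
        row = min(max(self.agent[0] + dr, 0), GRID_SIZE - 1)
        col = min(max(self.agent[1] + dc, 0), GRID_SIZE - 1)
        self.agent = (row, col)
        picked_up = self.objects.pop(self.agent, None)
        if picked_up is not None:
            # Respawn an object of a random type in a random unoccupied cell.
            while True:
                cell = (random.randrange(GRID_SIZE), random.randrange(GRID_SIZE))
                if cell != self.agent and cell not in self.objects:
                    self.objects[cell] = random.choice(OBJECT_TYPES)
                    break
        return picked_up
```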
In the paper, the traditional reinforcement learning framework is modified and enhanced so that it can support more sample-efficient algorithms that allow the agent to re-use skills learned from past tasks on new tasks. This section details the changes added to the framework.
The transition feature function $\phi(s,a,s^{'})$ maps each transition (taking action $a$ in state $s$ and arriving at state $s^{'}$) to a vector of features describing that transition.
In the grid environment described above, for example, we could define a transition feature function $\phi_h(s,a,s^{'})$ that outputs a two-dimensional vector indicating which type of object, if any, the agent picked up during the transition.
In the paper, the assumption is made that any reward function / task can be expressed (at least approximately) as a linear combination of the transition features, i.e. $r(s,a,s^{'}) = \phi(s,a,s^{'})^{\top}\text{w}$, where $\text{w}$ is a task vector that defines the reward dynamics of that task.
Assuming the transition dynamics remain the same (and therefore, the transition feature function $\phi$ remains the same), a task is then fully specified by its task vector $\text{w}$.
Given the assumption that the transition dynamics remain the same, the only thing that changes from task to task is the task vector $\text{w}$, which is what will make it possible to re-use what the agent has learned about the environment across tasks.
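To make the linear assumption concrete, here's a small sketch of what a hand-coded transition feature function and the resulting linear rewards could look like in the grid world. The exact feature encoding and the example task vectors are assumptions for illustration.

```python
import numpy as np

# Illustrative hand-coded transition features: a 2-d indicator of which object
# type (if any) was picked up during the transition. The exact encoding used in
# the repo may differ; this mirrors the role of phi_h in the text.
def transition_features(picked_up):
    phi = np.zeros(2)
    if picked_up == "red_square":
        phi[0] = 1.0
    elif picked_up == "blue_triangle":
        phi[1] = 1.0
    return phi

# Under the linear assumption, a task is just a vector w, and the transition
# reward is the dot product of the transition features with w.
w_pick_up_red = np.array([1.0, 0.0])    # +1 for red squares, 0 otherwise
w_red_vs_blue = np.array([1.0, -1.0])   # +1 for red squares, -1 for blue triangles

phi = transition_features("blue_triangle")
print(phi @ w_pick_up_red)   # 0.0
print(phi @ w_red_vs_blue)   # -1.0
```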
The successor feature function is defined as such: $\psi^{\pi}(s,a) = E^{\pi}\left[\sum_{i=t}^{\infty} \gamma^{i-t}\,\phi(s_i,a_i,s_{i+1}) \mid s_t = s, a_t = a\right]$
The successor feature function outputs the total expected discounted transition features the agent will accumulate over the course of the episode (or as the paper defines it, successor features), if it were to apply action $a$ at state $s$ and follow the policy $\pi$ thereafter.
The successor feature function bears a resemblance to the definition of the action-value function, $q^{\pi}(s,a) = E^{\pi}\left[\sum_{i=t}^{\infty} \gamma^{i-t}\, r(s_i,a_i,s_{i+1}) \mid s_t = s, a_t = a\right]$.
The action-value function outputs the total expected discounted transition rewards the agent will accumulate over the course of the episode, if it were to apply action $a$ at state $s$ and follow the policy $\pi$ thereafter.
Indeed, when you take the dot product of the successor feature function with a task vector $\text{w}$, you recover the action-value function for that task: $q^{\pi}_{\text{w}}(s,a) = \psi^{\pi}(s,a)^{\top}\text{w}$.
In the traditional reinforcement learning framework, we obtain the transition reward $r$ directly from the environment after each transition.
In the fast reinforcement learning framework (under the assumption that the reward dynamics can be linearly approximated by the transition features), we first calculate the transition features $\phi(s,a,s^{'})$ and then compute the transition reward as the dot product $\phi(s,a,s^{'})^{\top}\text{w}$ with the task vector.
This is only possible because unlike the dynamics function in the traditional reinforcement learning framework, the transition dynamics (defined by the transition feature function) are separated from the reward dynamics (defined by the task vectors) in this framework (and this separation was made possible because we made the linear assumption). Later on, it will be evident why this separation allows the agent to re-use the skills it learned from past tasks on new tasks.
Suppose we train an agent on a variety of different tasks with different reward dynamics, to produce a set of learned action-value functions, one for each task. Because each action-value function encodes the reward dynamics of the specific task it was trained on, it isn't directly useful for a new task with different reward dynamics.
Now suppose we train an agent on a variety of different tasks with different reward dynamics, to produce a set of learned successor feature functions, one for each task: $\Psi=\{\psi^\pi_1(s,a),...,\psi^\pi_n(s,a)\}$. Because these functions predict transition features rather than rewards, they can be combined with any task vector $\text{w}$ to produce action-values for a new task. This is the idea behind the generalized policy.
More formally, the generalized policy is a function, $\pi_\Psi(s;\text{w})$, that takes in the current state $s$ and a task vector $\text{w}$ and outputs an action, using the set of learned successor feature functions $\Psi$.
(image taken from Fig. 3 of the paper)
For every successor feature function $\psi^\pi_i \in \Psi$, the generalized policy computes the action-values for the task $\text{w}$ as $\psi^\pi_i(s,a)^{\top}\text{w}$, and then selects the action with the highest action-value across all successor feature functions: $\pi_\Psi(s;\text{w}) \in \text{argmax}_a \max_i \psi^\pi_i(s,a)^{\top}\text{w}$.
Since the successor feature functions compute successor features rather than action-values, it is possible to compute new action-values using previously-learned successor feature functions on new tasks. This is in contrast with action-value functions, where the outputted values are based on the specific reward dynamics used when training that function.
Let's go back to the badminton analogy to gain some intuition on why the generalized policy may perform better than learning from scratch. Suppose we've trained an agent on both the singles and doubles formats, producing a corresponding successor feature function for each format.
Therefore, if the agent learned the game formats properly, we should expect the successor feature functions to accurately predict which regions of the court the birdie will land in when following each format's policy.
With the successor feature functions defined this way, we can then create tasks for each corresponding format, with a +1 reward if the birdie landing in that region is considered "in", and a -1 reward if the birdie landing in that region is considered "out".
When playing either format, the agent can use the generalized policy to help decide what optimal action to take:
In these cases, we would trivially expect the generalized policy to use the policy derived from the successor feature function trained on that same format, since it would produce the highest action-values for that task.
Under the traditional reinforcement learning framework, we would have to retrain the agent to learn a new action-value function from scratch for any new game format with different court boundaries (i.e. different reward dynamics). Under the fast reinforcement learning framework, we can instead define a task vector for the new format and use the generalized policy right away.
Assuming that the strategies the agent learned in the tasks it has already trained on are relevant to the new task, the generalized policy should provide a reasonable strategy for the new game format without any additional training.
Intuitively, we can interpret the generalized policy in this case as leveraging the agent's knowledge and skills developed from playing the singles and doubles format (e.g. which shots result in the birdie landing in certain regions of the court) to determine what are the best actions to take under the new game rules.
We can apply the generalized policy in exactly the same way for the grid world environment specified in the paper. If the agent's been trained on a set of tasks to produce a set of successor feature functions, then the generalized policy for a new task simply evaluates each learned successor feature function on that new task and picks, at every state, the action with the highest resulting action-value.
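As a concrete sketch, the generalized policy amounts to a max over actions and over the stored successor feature functions. The `psi_list` interface below (each `psi(s, a)` returning a feature vector) is an assumption for illustration:

```python
import numpy as np

ACTIONS = ["up", "down", "left", "right"]

# Sketch of the generalized policy pi_Psi(s; w): evaluate every previously
# learned successor feature function on the new task vector w and act greedily
# with respect to the best action-value found.
def generalized_policy(psi_list, s, w):
    best_action, best_value = None, -np.inf
    for psi in psi_list:              # max over learned policies
        for a in ACTIONS:             # max over actions
            q = psi(s, a) @ w         # q^pi_w(s, a) = psi^pi(s, a)^T w
            if q > best_value:
                best_action, best_value = a, q
    return best_action
```

For a new task, only `w` changes; the stored successor feature functions are re-used as-is.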
The paper outlines a series of algorithms to test the fast reinforcement learning framework. This section lists the algorithms in detail.
Given a new task, we train the agent to learn an action-value function using Q-learning. The performance of the agent over time using this method of learning will be used as a benchmark to compare against the fast reinforcement learning framework methods.
The algorithm is as follows:
- Randomly initialize an action-value function $q$
- For every training episode, for every time step in the episode, take the optimal action according to the action-value function or uniformly randomly sample an action according to an exploration parameter $\epsilon$
- Apply the selected action $a$ to the current state $s$ and observe the new resulting state $s^{'}$ and transition reward $r$
- Update the action-value function according to this update rule:
  - $\delta \leftarrow r + \gamma q(s^{'},a^{'})-q(s,a)$
  - $\theta \leftarrow \theta + \alpha\delta\nabla_\theta q(s,a)$
  - where $a^{'}$ is the action that maximizes the action-value at state $s^{'}$ (i.e. $a^{'} \leftarrow \text{argmax}_b q(s^{'},b)$), $\gamma$ is the discount factor, $\alpha$ is the learning rate and $\theta$ are the parameters of the action-value function $q$
- Periodically measure the agent's performance by averaging over 100 episodes the total reward it receives when greedily following its policy derived from the current learned action-value function $q$
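Below is a minimal sketch of the update above with a linear action-value approximation. The `state_features` encoding and the `theta` layout (one weight vector per action) are assumptions; the repo's actual implementation is in `q_learning.py`.

```python
import numpy as np

# Semi-gradient Q-learning sketch with q(s, a) = theta[a] . x(s).
def epsilon_greedy_action(theta, state_features, s, actions, epsilon=0.1):
    if np.random.rand() < epsilon:
        return actions[np.random.randint(len(actions))]
    return max(actions, key=lambda b: theta[b] @ state_features(s))


def q_learning_update(theta, state_features, s, a, r, s_next, actions,
                      gamma=0.9, alpha=0.1):
    def q(state, action):
        return theta[action] @ state_features(state)

    a_next = max(actions, key=lambda b: q(s_next, b))          # argmax_b q(s', b)
    delta = r + gamma * q(s_next, a_next) - q(s, a)            # TD error
    theta[a] = theta[a] + alpha * delta * state_features(s)    # semi-gradient step
    return theta
```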
Given the hand-coded transition feature function $\phi_h$, we can train the agent to learn a successor feature function for each of a set of training task vectors, using a modified version of Q-learning.
The algorithm to learn the successor feature functions is as follows:
- Randomly initialize a successor feature function $\psi$ for each task vector we're learning from
- For every training episode, randomly pick one of the task vectors to train on
- For every time step in the episode, take the optimal action according to the corresponding successor feature function or uniformly randomly sample an action according to an exploration parameter $\epsilon$
- Apply the selected action $a$ to the current state $s$ and observe the new resulting state $s^{'}$
- Update the corresponding successor feature function according to this update rule:
  - $\delta \leftarrow \phi_h(s,a,s^{'}) + \gamma\psi(s^{'},a^{'})-\psi(s,a)$
  - $\theta \leftarrow \theta + \alpha\delta\nabla_\theta\psi(s,a)$
  - where $a^{'}$ is the action that maximizes the action-value at state $s^{'}$ (i.e. $a^{'} \leftarrow \text{argmax}_b \psi(s^{'},b)^{\top}\text{w}$), $\gamma$ is the discount factor, $\alpha$ is the learning rate and $\theta$ are the parameters of the corresponding successor feature function $\psi$
Note that the update rule is very similar to the update rule in regular Q-learning. Normally in regular Q-learning, the TD-error is a scalar: the difference between the target (the observed transition reward plus the discounted action-value of the next state) and the current predicted action-value.
In this case, rather than predicting action-values, we're predicting transition features. Therefore the TD-error is a vector: the observed transition features $\phi_h(s,a,s^{'})$ take the place of the transition reward, and the error is computed component-wise between the target successor features and the predicted successor features.
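A sketch of this vector-valued TD update, again assuming a linear approximator with one weight matrix per action (the repo's actual implementation is in `sf_q_learning.py`):

```python
import numpy as np

# Successor-feature TD update: same shape as Q-learning, but the target is the
# transition feature vector phi rather than a scalar reward. theta[a] is a
# (feature_dim x state_dim) matrix so that psi(s, a) = theta[a] @ x(s).
def sf_update(theta, state_features, phi, s, a, s_next, w, actions,
              gamma=0.9, alpha=0.1):
    def psi(state, action):
        return theta[action] @ state_features(state)

    a_next = max(actions, key=lambda b: psi(s_next, b) @ w)   # argmax_b psi(s', b)^T w
    delta = phi + gamma * psi(s_next, a_next) - psi(s, a)     # vector TD error
    theta[a] = theta[a] + alpha * np.outer(delta, state_features(s))
    return theta
```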
Once the successor feature functions are learned for the two training task vectors, the agent can be evaluated on a new task; all it needs to learn is the new task's vector $\text{w}$, which it can then use with the generalized policy $\pi_\Psi(s;\text{w})$.
The algorithm for learning the new task vector and testing the agent's performance is as follows:
- Randomly initialize a vector $\text{w}$ that will learn the reward dynamics of the new task
- For every training episode, for every time step in the training episode, uniformly randomly sample an action
- Apply the selected action $a$ to the current state $s$ and observe the new resulting state $s^{'}$ and transition reward $r$
- Update the task vector $\text{w}$ according to this update rule: $\min_\text{w}[\phi_h(s,a,s^{'})^{\top}\text{w}-r]^2$
- Periodically measure the agent's performance by averaging over 100 episodes the total reward it receives when greedily following its generalized policy $\pi_\Psi(s;\text{w})$, given the current values of the learned task vector $\text{w}$
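The task-vector update is just a stochastic gradient step on the squared error $[\phi_h(s,a,s^{'})^{\top}\text{w}-r]^2$. A minimal sketch (the learning rate and names are illustrative; the repo's actual code is in `task_learning.py`):

```python
import numpy as np

# Single SGD step of the linear reward regression for the new task's vector w.
def task_vector_update(w, phi, r, lr=0.05):
    error = phi @ w - r            # prediction error of the linear reward model
    return w - lr * error * phi    # gradient step (factor of 2 folded into lr)
```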
When not given a hand-coded transition feature function, we can still use the fast reinforcement learning framework. However, we must first train the agent to learn a transition feature function $\phi_l$ from a set of training tasks, before learning the successor feature functions.
The algorithm to learn the transition feature function is as follows:
- Randomly initialize a transition feature function $\phi_l$
- Randomly initialize a set of vectors $\text{w}_1,...,\text{w}_k$ that will learn the reward dynamics of the new tasks
- For every training episode, for every time step in the episode, uniformly randomly sample an action
- Apply the selected action $a$ to the current state $s$ and observe the new resulting state $s^{'}$
- Update the transition feature model and the task vectors according to this update rule:
  - $\min_{\phi_l} \sum\limits_{j=1}^k \min_{\text{w}_j} [\phi_l(s,a,s^{'})^{\top}\text{w}_j-r_j]^2$
  - where $r_j$ is the transition reward associated with task $j$
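A sketch of this joint objective, written with PyTorch purely for illustration (the repo's models live in `classes.py` and may be structured differently). The task vectors are assumed to be `torch.nn.Parameter`s registered with the same optimizer as the transition feature model, so one backward pass updates both:

```python
import torch
import torch.nn as nn

# Illustrative joint training of a transition feature model phi_l and a set of
# task vectors w_1..w_k by minimizing sum_j [phi_l(s,a,s')^T w_j - r_j]^2.
class TransitionFeatureModel(nn.Module):
    def __init__(self, transition_dim, feature_dim=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(transition_dim, 64), nn.ReLU(),
            nn.Linear(64, feature_dim),
        )

    def forward(self, transition):
        return self.net(transition)


def joint_update(phi_model, task_vectors, optimizer, transition, rewards):
    # transition: encoded (s, a, s') tensor; rewards: tensor of r_1..r_k
    phi = phi_model(transition)                             # transition features
    preds = torch.stack([phi @ w for w in task_vectors])    # one prediction per task
    loss = ((preds - rewards) ** 2).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```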
Once the transition feature function is learned, we can apply the same algorithms from the previous section to learn the successor feature functions (except the agent will be using the learned $\phi_l$ in place of the hand-coded $\phi_h$) and then to learn a task vector for a new task.
Note that the transition feature function $\phi_l$ learned this way is not guaranteed to represent the same features as the hand-coded $\phi_h$, so a given task vector (e.g. a unit vector) may not correspond to the same task under both feature functions.
Nevertheless, this potential discrepancy is inconsequential as it turns out that unit task vectors serve as a good enough basis to learn successor feature functions, regardless of what the task actually represents.
The agent was evaluated on a new task where a reward of +1 is given if the agent picks up a red square, a reward of -1 is given if it picks up a blue triangle, and 0 otherwise. This task corresponds to a task vector of $\text{w}=[1,-1]$ (one component per object type).
The total sum of rewards received by the agent averaged over 100 episodes is shown in the graph below:
Every 100 episodes, the agent is evaluated using its current model parameters (i.e. the action-value function for Q-learning and the generalized policy and the learned task vector for the fast reinforcement learning algorithms). Q-learning is represented by the blue line, fast reinforcement learning using a hand-coded transition feature function is represented by the orange line, and fast reinforcement learning using a learned transition feature function is represented by the green line.
As you can see, the fast reinforcement learning algorithms immediately perform much better than Q-learning, making the algorithms much more sample efficient. The agent is able to use the generalized policy to find a reasonable strategy for the new task, despite the fact that it never explicitly trained on a task that gave it a negative reward. However there is no guarantee that the generalized policy is optimal (as it depends on the tasks the agent was trained on when learning the successor feature functions), and so eventually Q-learning outperforms the fast reinforcement learning algorithms after many sample transitions.
For comparison, here are the graphs for the same experiments found as Figure 5 A and B on page 6 of the paper:
The paper describes a method to learn non-linear tasks by using a mapping from tasks to task vectors; in effect, a non-linear task is approximated by the linear task (task vector) that best fits its reward dynamics.
For example, let's define a non-linear task as giving an agent a reward of +1 for picking up the most abundant object and a reward of -1 for picking up anything else. This can be modelled with a reward function that depends on the current object counts in the grid, which cannot be expressed exactly as a fixed linear combination of the transition features.
I explore an alternative method of learning non-linear tasks in which we parameterize the reward function, $u$, as a neural network that takes the transition features $\phi(s,a,s^{'})$ as input and outputs the predicted transition reward, i.e. $r \approx u(\phi(s,a,s^{'}))$.
The generalized policy using a learned reward function $u$ then selects, at every state, the action with the highest value of $u(\psi^\pi(s,a))$ across all learned successor feature functions, i.e. $\pi_\Psi(s;u)$.
The algorithm to learn the transition feature function would then be as follows:
- Randomly initialize a transition feature function $\phi_l$
- Randomly initialize a set of reward functions $u_1,...,u_k$ that will learn the reward dynamics of the new tasks
- For every training episode, for every time step in the episode, uniformly randomly sample an action
- Apply the selected action $a$ to the current state $s$ and observe the new resulting state $s^{'}$
- Update the transition feature model and the reward functions according to this update rule:
  - $\min_{\phi_l} \sum\limits_{j=1}^k \min_{u_j} [u_j(\phi_l(s,a,s^{'})) - r_j]^2$
  - where $r_j$ is the transition reward associated with task $j$
Four non-linear tasks were used to learn the transition feature function. The tasks are as follows:
- +1 reward is given to the agent if it picks up a red square, -1 reward is given if it picks up a blue triangle, and 0 reward is given otherwise, if the count for each object type is even (and vice versa when the count is odd)
- +1 reward is given to the agent if it picks up a red square, -1 reward is given if it picks up a blue triangle, and 0 reward is given otherwise, if the count for each object type is odd (and vice versa when the count is even)
- +1 reward is given to the agent if it picks up an object that is vertically aligned with another object in the grid world, +0.1 reward is given instead if there are no objects vertically aligned with the picked up object, and 0 reward is given otherwise
- +1 reward is given to the agent if it picks up an object that is horizontally aligned with another object in the grid world, +0.1 reward is given instead if there are no objects horizontally aligned with the picked up object, and 0 reward is given otherwise
Since both the transition feature function and the reward functions are approximated by neural networks, the parameters for both models can be trained jointly end-to-end via backpropagation, using the environment transition rewards as the targets. Intuitively, the transition feature function must learn to represent general enough transition features such that they can be useful as inputs for all of the reward functions $u_1,...,u_k$.
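A sketch of this end-to-end setup, again using PyTorch purely for illustration; each task gets its own small reward network $u_j$, and gradients flow through them into the shared transition feature model (all names are assumptions):

```python
import torch
import torch.nn as nn

# Each task j has a reward network u_j mapping transition features to a
# predicted reward; the shared transition feature model phi_l is trained
# jointly through all of them.
class RewardModel(nn.Module):
    def __init__(self, feature_dim=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, 32), nn.ReLU(),
            nn.Linear(32, 1),
        )

    def forward(self, phi):
        return self.net(phi).squeeze(-1)


def joint_update_nonlinear(phi_model, reward_models, optimizer, transition, rewards):
    phi = phi_model(transition)                            # transition features
    preds = torch.stack([u(phi) for u in reward_models])   # u_j(phi_l(s, a, s'))
    loss = ((preds - rewards) ** 2).sum()                  # sum_j [u_j(phi) - r_j]^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```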
Once the transition feature function $\phi_l$ and the successor feature functions are learned, the agent can be evaluated on a new non-linear task; all it needs to learn is a new reward function $u$ for that task.
The algorithm for learning the reward function for the new task and testing the agent's performance is as follows:
- Randomly initialize a reward function $u$ that will learn the reward dynamics of the new task
- For every training episode, for every time step in the training episode, uniformly randomly sample an action
- Apply the selected action $a$ to the current state $s$ and observe the new resulting state $s^{'}$ and transition reward $r$
- Update the reward function $u$ according to this update rule: $\min_u[u(\phi_l(s,a,s^{'})) - r]^2$
- Periodically measure the agent's performance by averaging over 100 episodes the total reward it receives when greedily following its generalized policy $\pi_\Psi(s;u)$, given the current parameters of the learned reward function $u$
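A sketch of this last stage, assuming the learned transition feature model is frozen and only the new reward model $u$ is updated; the generalized policy then scores actions with $u(\psi^\pi(s,a))$. Names and interfaces are illustrative:

```python
import torch

# Learn a reward model u for the new task while phi_l stays frozen, then act
# with the generalized policy pi_Psi(s; u).
def reward_model_update(u, optimizer, phi, r):
    # phi = phi_l(s, a, s'), detached so only u's parameters are updated
    loss = (u(phi.detach()) - r) ** 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


def generalized_policy_u(psi_list, u, s, actions):
    # pi_Psi(s; u): max over successor feature functions and actions of u(psi(s, a))
    with torch.no_grad():
        scores = {
            a: max(float(u(torch.as_tensor(psi(s, a), dtype=torch.float32)))
                   for psi in psi_list)
            for a in actions
        }
    return max(scores, key=scores.get)
```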
For comparison, the previous experiment in the paper was conducted again, but this time on non-linear tasks. A transition feature function was learned by approximating the four non-linear tasks mentioned above using task vectors. The successor feature function was then learned from the learned transition feature function using unit basis task vectors, as before.
The non-linear task used for evaluation is as follows:
- +1 reward is given to the agent if it picks up the more abundant object, -1 reward is given if it picks up the less abundant object, and 0 reward is given otherwise
The total sum of rewards received by the agent averaged over 100 episodes using the learned reward function is shown in the graph below:
Fast reinforcement learning using non-linear reward functions is represented by the orange line, and fast reinforcement learning using task vectors is represented by the blue line.
As you can see, the agent using the non-linear reward functions to approximate the transition rewards performs marginally better than the agent using task vectors. With some tweaking of the reward function's neural network hyperparameters and the choice and number of tasks used, there could be potential for this framework to perform even better.
Using a function approximator like a neural network to model the reward function theoretically allows this framework to generalize to the reward dynamics of any task, linear or non-linear. It is definitely worth exploring some more!
So far, the fast reinforcement learning framework allows the agent to use a generalized policy that draws from past experience to take better actions than it would if it were learning from scratch. But as we've seen, Q-learning will eventually outperform the generalized policy, since Q-learning theoretically converges to an optimal policy, whereas there is no such guarantee for the generalized policy.
However, we've also seen that the generalized policy's initial baseline performance is much better than that of Q-learning, where the agent starts from scratch. A natural extension of the algorithms put forth by the paper is therefore to let the agent leverage the strong initial baseline performance of the generalized policy while also learning a new successor feature function for the current task, using a modified version of Q-learning.
The algorithm would be as follows:
- Given a set of learned successor feature functions from previous tasks, $\Psi=\{\psi^\pi_1(s,a),...,\psi^\pi_n(s,a)\}$, add a new, randomly initialized successor feature function $\psi^\pi_{n+1}$ to the set for the new task
- Randomly initialize a reward function $u_{n+1}$ that will learn the reward dynamics of the new task
- For every training episode, for every time step in the episode, take the optimal action according to the generalized policy $\pi_\Psi(s;u_{n+1})$ or uniformly randomly sample an action according to an exploration parameter $\epsilon$
- Apply the selected action $a$ to the current state $s$ and observe the new resulting state $s^{'}$ and transition reward $r$
- Update the successor feature function according to this update rule:
  - $\delta \leftarrow \phi_l(s,a,s^{'}) + \gamma\psi^\pi_{n+1}(s^{'},a^{'})-\psi^\pi_{n+1}(s,a)$
  - $\theta \leftarrow \theta + \alpha\delta\nabla_\theta\psi^\pi_{n+1}(s,a)$
  - where $a^{'}$ is the action that maximizes the action-value at state $s^{'}$ (i.e. $a^{'} \leftarrow \text{argmax}_b u_{n+1}(\psi^\pi_{n+1}(s^{'},b))$), $\gamma$ is the discount factor, $\alpha$ is the learning rate and $\theta$ are the parameters of the successor feature function $\psi^\pi_{n+1}$
- Update the reward function $u_{n+1}$ according to this update rule: $\min_{u_{n+1}}[u_{n+1}(\phi_l(s,a,s^{'})) - r]^2$
The same algorithm can be applied with task vectors as well: we would simply take the dot product of the transition features and the task vector to get the predicted transition reward and action-values, instead of feeding the transition features into a reward function.
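Here's a rough sketch of what one step of this combined algorithm could look like. The `.update` methods and the environment interface are assumed placeholders for the update rules listed above, not the repo's actual API; `u_new` is assumed to be callable on a feature vector and to expose an `update` method.

```python
import numpy as np

# One step of the proposed extension: explore with the generalized policy over
# all successor feature functions (including the newly added one), but only
# update the new successor feature function psi_new and the new reward model u_new.
def combined_step(psi_list, psi_new, u_new, phi_l, env, s, actions, epsilon=0.1):
    all_psi = psi_list + [psi_new]
    if np.random.rand() < epsilon:
        a = actions[np.random.randint(len(actions))]
    else:
        # generalized policy pi_Psi(s; u_new)
        a = max(actions, key=lambda b: max(u_new(psi(s, b)) for psi in all_psi))

    s_next, r = env.step(a)                   # assumed to return (next state, reward)
    phi = phi_l(s, a, s_next)                 # learned transition features

    psi_new.update(phi, s, a, s_next, u_new)  # SF TD update for the new task's policy
    u_new.update(phi, r)                      # regression update toward the observed reward
    return s_next
```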
This algorithm essentially adds the successor feature learning algorithm to the new task learning algorithm used in earlier experiments. As mentioned before, the new task learning algorithm was able to have decent performance due to the strong initial baseline performance given by the generalized policy. But the performance plateaus because there is no successor feature function being learned for the new task (only the reward dynamics of the new task are learned, either via a task vector or reward function). Adding a separate successor feature function allows it to learn a policy specifically for the new task, using a modified version of Q-learning. Thus, using the modified algorithm allows the agent to leverage the strong initial baseline performance of the generalized policy for the initial episodes of training, while asymptotically learning an optimal policy for the new task.
We should expect to see a decent initial performance of the agent, similar to what we saw in previous experiments when it used its generalized policy. Over time, the successor feature function $\psi^\pi_{n+1}$ should improve its predictions for the new task, and the agent's performance should approach that of an optimal policy, just as Q-learning does asymptotically.
Another benefit is that using the generalized policy could lead to potentially better action selection compared to following a randomly initialized policy, for the initial episodes. Perhaps this strong initial baseline performance may lead to more sample efficient learning and therefore faster convergence to an optimal policy as well.
The only downside of this algorithm is that the agent must know when it is given a new task, so that it can instantiate a new corresponding successor feature function to learn the successor features and reward dynamics of the new task. This wouldn't be possible with the current algorithm in a hypothetical non-stationary, continuous environment, where the reward dynamics could change over time without any signal to the agent.
A possible solution to this is to set a hyperparameter that acts as a threshold on the reward prediction error: if the current reward function's error on recent transitions exceeds this threshold, the agent assumes the reward dynamics have changed and instantiates a new successor feature function (and reward function) for the new task.
The other issue is that if the environment is continuous, the number of successor feature functions to store keeps growing, and the memory required becomes intractably large. Therefore there may be a need to design a pruning mechanism that discards unneeded successor feature functions over time, to keep memory usage manageable. Some possible pruning mechanisms could be:
- maintain the $\eta$-most recent successor feature functions, and discard any that are older
  - this mechanism may work under the assumption that the most recent reward dynamics encountered by the agent in the environment are more relevant and useful
- keep a running count of how often each successor feature function, $\psi^\pi \in \Psi$, is used in the generalized policy, and then over time discard the successor feature functions with the lowest counts (see the sketch after this list)
  - a successor feature function, $\psi^{\pi^{'}}(s,a)$, is considered "used" by the generalized policy if its corresponding action-value is the maximum over all policies $\Pi$ and all actions $\mathscr{A}$ for the current state $s$, i.e. $\max_{a \in \mathscr{A}} \max_{\pi \in \Pi} u( \psi^\pi(s,a) ) = \max_{a \in \mathscr{A}} u( \psi^{\pi^{'}}(s,a) )$
  - this mechanism may work under the assumption that successor feature functions that are used more often by the generalized policy are more relevant and useful
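As an illustration of the second idea, here's a small sketch of a successor-feature library that counts which function "wins" inside the generalized policy and periodically discards the least-used ones. All names and interfaces are assumptions:

```python
from collections import Counter

# Usage-count pruning sketch: track how often each successor feature function
# produces the maximum action-value in the generalized policy, and keep only
# the most-used ones when the library grows too large.
class PrunedSFLibrary:
    def __init__(self, psi_list, max_size):
        self.psi_list = list(psi_list)
        self.max_size = max_size
        self.usage = Counter()

    def act(self, s, actions, u):
        # generalized policy: max over actions and stored psi of u(psi(s, a))
        best = max(
            ((u(psi(s, a)), a, i)
             for i, psi in enumerate(self.psi_list) for a in actions),
            key=lambda t: t[0],
        )
        _, action, winner = best
        self.usage[winner] += 1      # credit the psi that produced the max
        return action

    def prune(self):
        if len(self.psi_list) > self.max_size:
            ranked = sorted(range(len(self.psi_list)),
                            key=lambda i: self.usage[i], reverse=True)
            keep = sorted(ranked[:self.max_size])
            self.psi_list = [self.psi_list[i] for i in keep]
            self.usage = Counter()   # reset counts after pruning
```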
Adding reward functions to approximate non-linear reward dynamics and adding successor feature function learning to the new task learning algorithm will give the agent the power to theoretically learn an optimal policy for any task, faster than traditional Q-learning methods. Adding a pruning mechanism to discard unneeded successor feature functions will give the agent the power to operate in a non-stationary, continuous environment, where the reward dynamics may change over time. The result is a sample efficient algorithm that can allow the agent to learn a variety of tasks in a variety of environments, in a relatively short amount of time. I'm excited to see what other things we can do to improve the algorithm, as the more general we can make it, the more applicable it can be to the real world!
- `q_learning.py` holds the training algorithm for Q-learning
- `sf_q_learning.py` holds the training algorithm to learn successor feature functions using a hand-coded transition feature function
- `task_learning.py` holds the training algorithm to learn a task vector on a new task, using the successor feature functions learned in `sf_q_learning.py`
- `task_and_feature_learning.py` holds training algorithms to learn a transition feature function, learn successor feature functions using the learned transition feature function, and then learn a task vector on a new task, using the learned successor feature functions
- `nonlinear_task_and_feature_learning_task_vector.py` holds the same training algorithms as `task_and_feature_learning.py`, except the agent trains on non-linear tasks instead of linear tasks
- `nonlinear_task_and_feature_learning_reward_model.py` holds the same training algorithms as `nonlinear_task_and_feature_learning_task_vector.py`, except the agent approximates the transition rewards using a reward function neural network, instead of task vectors
- `nonlinear_reward_functions.py` holds non-linear reward functions for the grid world environment
- `classes.py` holds data structure classes for the grid world environment and all neural network models
- `main.py` holds functions that run all of the training algorithms listed above
  - `run_linear_experiments` will run the training algorithms in `q_learning.py`, `sf_q_learning.py`, `task_learning.py` and `task_and_feature_learning.py`, on linear tasks
  - `run_nonlinear_experiments` will run the training algorithms in `nonlinear_task_and_feature_learning_task_vector.py` and `nonlinear_task_and_feature_learning_reward_model.py`, on non-linear tasks
- `models/` holds saved neural network models learned by the agent
  - `models/q_model.pkl` is the Q-learning function learned in `q_learning.py`
  - `models/sf_list.pkl` is the list of successor feature functions learned in `sf_q_learning.py`
  - `models/tf_model.pkl` is the transition feature function learned in `task_and_feature_learning.py`
  - `models/sf_list_with_learned_transition_features_of_dimension_2.pkl` is the list of successor feature functions learned in `task_and_feature_learning.py`
  - `models/nonlinear_tf_model_task_vector.pkl` is the transition feature function learned in `nonlinear_task_and_feature_learning_task_vector.py`
  - `models/nonlinear_sf_list_with_learned_transition_features_of_dimension_2_task_vector.pkl` is the list of successor feature functions learned in `nonlinear_task_and_feature_learning_task_vector.py`
  - `models/nonlinear_tf_model_reward_model.pkl` is the transition feature function learned in `nonlinear_task_and_feature_learning_reward_model.py`
  - `models/nonlinear_sf_list_with_learned_transition_features_of_dimension_2_reward_model.pkl` is the list of successor feature functions learned in `nonlinear_task_and_feature_learning_reward_model.py`
- `evaluation_data/` holds the evaluation data on the agent's performance on a new task over time as it trains
  - `classic_q_learning.pkl` is the evaluation data of the agent in `q_learning.py`
  - `task_learning.pkl` is the evaluation data of the agent in `task_learning.py`
  - `task_learning_with_learned_transition_features_of_dimension_2.pkl` is the evaluation data of the agent in `task_and_feature_learning.py`
  - `nonlinear_task_learning_with_learned_transition_features_of_dimension_2_task_vector.pkl` is the evaluation data of the agent in `nonlinear_task_and_feature_learning_task_vector.py`
  - `nonlinear_task_learning_with_learned_transition_features_of_dimension_2_reward_model.pkl` is the evaluation data of the agent in `nonlinear_task_and_feature_learning_reward_model.py`
- `assets/` holds media files used in this `README.md`
- `requirements.txt` holds all required dependencies, which can be installed by typing `pip install -r requirements.txt` in the command line
For this project, I'm using Python 3.7.11.