
Dyna: Dynamic Programming in RL

Environment

Denny Britz's reinforcement-learning repository was a great help in creating the environment; most of the environment-related code is taken from there.

The Grid World environment comes from chapter 4 of Sutton & Barto's Reinforcement Learning book. You are an agent on an MxN grid, and your goal is to reach a terminal state at the top-left or bottom-right corner.

For example, a 4x4 grid looks as follows:

T  o  o  o
o  x  o  o
o  o  o  o
o  o  o  T

x marks your position, and the two T cells are the terminal states.

You can take actions in each direction (UP=0, RIGHT=1, DOWN=2, LEFT=3). Actions going off the edge leave you in your current state. You receive a reward of -1 at each step until you reach a terminal state.
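For a feel of the environment's interface, here is a minimal, hypothetical usage sketch. It assumes the environment follows Denny Britz's GridworldEnv API, exposing nS (number of states), nA (number of actions), and a transition table env.P[s][a] of (probability, next_state, reward, done) tuples; the import path is illustrative, not necessarily this repository's actual layout.

```python
# Minimal usage sketch, assuming a Britz-style GridworldEnv.
# The import path below is hypothetical; adjust to this repository's layout.
from envs.gridworld import GridworldEnv

UP, RIGHT, DOWN, LEFT = 0, 1, 2, 3

env = GridworldEnv()
print(env.nS, env.nA)  # 16 states and 4 actions for the 4x4 grid

# Transitions are deterministic: each action has a single outcome with
# probability 1.0. From state 1, moving LEFT reaches terminal state 0
# with reward -1.
for prob, next_state, reward, done in env.P[1][LEFT]:
    print(prob, next_state, reward, done)  # (1.0, 0, -1.0, True)
```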

Policy Evaluation & Policy Iteration

  • Iterative Policy Evaluation for estimating the state-value function of an arbitrary policy.
  • Exact Policy Evaluation using the Bellman equation.
  • Policy Iteration (to find an optimal value function and a deterministic optimal policy).

Two policy-evaluation implementations are included:

  1. Exact Policy Evaluation: uses the Bellman equation and solves the resulting system of linear equations directly (see the sketch after this list).
  2. Iterative Policy Evaluation: evaluates the policy by repeatedly sweeping the Bellman expectation update until the values converge.
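As a sketch of the exact approach: under a policy π, the Bellman expectation equation v = r_π + γ P_π v is linear in v, so the value function can be recovered with a single linear solve. The code below is illustrative and assumes the Britz-style env.P interface; the repository's actual implementation may differ in its details.

```python
import numpy as np

def exact_policy_evaluation(env, policy, discount_factor=1.0):
    """Solve the linear system v = r_pi + gamma * P_pi v for v.

    policy is an [nS, nA] array of action probabilities.
    """
    P_pi = np.zeros((env.nS, env.nS))  # state-to-state transitions under the policy
    r_pi = np.zeros(env.nS)            # expected one-step reward under the policy
    for s in range(env.nS):
        for a in range(env.nA):
            for prob, next_s, reward, done in env.P[s][a]:
                P_pi[s, next_s] += policy[s, a] * prob
                r_pi[s] += policy[s, a] * prob * reward

    # Terminal states are absorbing with reward 0, which makes (I - P_pi)
    # singular at discount_factor = 1. Zeroing their rows pins v(terminal) = 0
    # and makes the system regular.
    for s in range(env.nS):
        if all(done and next_s == s
               for a in range(env.nA)
               for _, next_s, _, done in env.P[s][a]):
            P_pi[s, :] = 0.0
            r_pi[s] = 0.0

    return np.linalg.solve(np.eye(env.nS) - discount_factor * P_pi, r_pi)

# Evaluating the uniform random policy reproduces the value grid shown below:
# policy = np.ones((env.nS, env.nA)) / env.nA
# print(exact_policy_evaluation(env, policy).reshape(4, 4))
```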

The iterative implementation offers two functions that perform policy evaluation, contrasted in the sketch after this list:

  1. estimate_state_value_function_inplace
  2. estimate_state_value_function
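The difference between the two is the classic two-array versus in-place sweep from Sutton & Barto. The sketches below, again assuming the Britz-style env.P interface and illustrative signatures, show the distinction; the repository's actual function bodies may differ.

```python
import numpy as np

def estimate_state_value_function(env, policy, discount_factor=1.0, theta=1e-5):
    """Two-array variant: each sweep reads only the previous sweep's values."""
    V = np.zeros(env.nS)
    while True:
        V_new = np.zeros(env.nS)
        for s in range(env.nS):
            for a, action_prob in enumerate(policy[s]):
                for prob, next_s, reward, done in env.P[s][a]:
                    V_new[s] += action_prob * prob * (
                        reward + discount_factor * V[next_s])
        if np.max(np.abs(V_new - V)) < theta:
            return V_new
        V = V_new

def estimate_state_value_function_inplace(env, policy, discount_factor=1.0,
                                          theta=1e-5):
    """In-place variant: updated values are used immediately within a sweep,
    which typically converges in fewer sweeps."""
    V = np.zeros(env.nS)
    while True:
        delta = 0.0
        for s in range(env.nS):
            v_new = sum(action_prob * prob * (reward + discount_factor * V[next_s])
                        for a, action_prob in enumerate(policy[s])
                        for prob, next_s, reward, done in env.P[s][a])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V
```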

For the Grid World environment described above and a uniform random policy, all of these functions converge to the following state values, which match Figure 4.1 (p. 77) of Sutton & Barto's Reinforcement Learning: An Introduction (second edition):

  0.0  -14.0  -20.0  -22.0   
-14.0  -18.0  -20.0  -20.0   
-20.0  -20.0  -18.0  -14.0   
-22.0  -20.0  -14.0    0.0   

The optimal value function (shown as a grid below):

 0.0  -1.0  -2.0  -3.0
-1.0  -2.0  -3.0  -2.0
-2.0  -3.0  -2.0  -1.0
-3.0  -2.0  -1.0   0.0

An optimal deterministic policy (shown as a grid below):

UP     LEFT   LEFT   DOWN   
UP     UP     UP     DOWN   
UP     UP     RIGHT  DOWN   
UP     RIGHT  RIGHT  UP     

To compute the optimal value function and an optimal deterministic policy, run run_policy_iteration.py.
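Here is a sketch of the loop that a script like run_policy_iteration.py presumably drives: alternate policy evaluation with greedy policy improvement until the policy stops changing. It reuses the in-place evaluation sketch above and the same hypothetical env.P interface, and is not the repository's actual code.

```python
import numpy as np

def policy_iteration(env, discount_factor=1.0):
    """Alternate evaluation and greedy improvement until the policy is stable."""
    policy = np.ones((env.nS, env.nA)) / env.nA  # start from the uniform policy
    while True:
        V = estimate_state_value_function_inplace(env, policy, discount_factor)
        policy_stable = True
        for s in range(env.nS):
            # One-step lookahead: action values under the current V.
            q = np.zeros(env.nA)
            for a in range(env.nA):
                for prob, next_s, reward, done in env.P[s][a]:
                    q[a] += prob * (reward + discount_factor * V[next_s])
            best_a = int(np.argmax(q))
            if policy[s, best_a] != 1.0:
                policy_stable = False
            policy[s] = np.eye(env.nA)[best_a]  # make the policy greedy at s
        if policy_stable:
            return policy, V
```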
