This repository contains tabular RL algorithms from Monte Carlo to Q-learning, most of which are implemented on MiniGrid environments.
This repo was made to learn and implement the different classical/tabular algorithms taught in David Silver's (DeepMind) reinforcement learning course.
- Empty
- Dynamic obstacles
- FourRooms
- Empty (Empty grid)
MiniGrid-Empty-5x5-v0
MiniGrid-Empty-8x8-v0
- The reward is 0 everywhere except at the goal position, which gives a reward of 1.
- The total reward received in an episode is `1 - 0.9 * (steps / max_steps)`, as sketched below.
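A minimal sketch of that return computation (the 0.9 scaling and the step bookkeeping mirror MiniGrid's default sparse reward; the function name here is only illustrative):

```python
def goal_reward(step_count: int, max_steps: int) -> float:
    """Return received on reaching the goal; every other step yields 0."""
    # The more of the step budget the agent uses, the smaller the reward.
    return 1 - 0.9 * (step_count / max_steps)

# e.g. reaching the goal in 25 of the 100 allowed steps gives 0.775
print(goal_reward(25, 100))
```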
states |
---|
env.agent_pos (position of the agent in the grid) |
env.agent_dir (direction the agent is facing) (0-3) |

actions |
---|
turn_left (0) |
turn_right (1) |
move_forward (2) |
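Below is a minimal sketch of how this state/action representation can back a tabular Q-function, assuming the gymnasium-based `minigrid` package; the `state_key` helper and the `defaultdict` Q-table are illustrative, not necessarily how this repo implements it:

```python
from collections import defaultdict

import gymnasium as gym
import minigrid  # noqa: F401 -- registers the MiniGrid-* environment IDs

N_ACTIONS = 3  # the three actions listed above

def state_key(env):
    # Discrete state: the agent's grid cell plus the direction it is facing.
    base = env.unwrapped
    return (tuple(base.agent_pos), base.agent_dir)

# Tabular action-value function: unseen states start at zero for every action.
Q = defaultdict(lambda: [0.0] * N_ACTIONS)

env = gym.make("MiniGrid-Empty-5x5-v0")
obs, info = env.reset(seed=0)
print(state_key(env))  # e.g. ((1, 1), 0)
```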
- Episodes: all algorithms are initially run for 150 episodes to train the policy (this may be altered depending on how quickly a particular algorithm converges).
- All algorithms follow an ε-greedy policy, except Q-learning, where the behaviour policy is ε-greedy and the target policy is greedy (sketched below).
- ε decreases by 0.01 every episode to balance exploration and exploitation (the decay may be larger for Monte Carlo).
- The learning rate α is set to 0.3 and works well for all algorithms.
- The discount factor γ is set to 0.9 and works well for all algorithms.
- The trace-decay parameter λ is set to 0.9 for SARSA(λ) and backward-view SARSA.
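A sketch of the ε-greedy behaviour policy and the Q-learning update built from the hyperparameters above (the initial ε of 1.0 is an assumption, since the README only states the decay; `Q` is the table from the previous sketch):

```python
import random

ALPHA, GAMMA = 0.3, 0.9   # learning rate and discount factor listed above
EPSILON_DECAY = 0.01      # per-episode decay used for the Empty grids
N_ACTIONS = 3

def epsilon_greedy(Q, state, epsilon):
    # Behaviour policy: explore with probability epsilon, otherwise act greedily.
    if random.random() < epsilon:
        return random.randrange(N_ACTIONS)
    return max(range(N_ACTIONS), key=lambda a: Q[state][a])

def q_learning_update(Q, s, a, r, s_next, terminated):
    # Target policy is greedy: bootstrap from the best next-state action.
    target = r if terminated else r + GAMMA * max(Q[s_next])
    Q[s][a] += ALPHA * (target - Q[s][a])

# Assumed epsilon schedule: start at 1.0 and decay once per episode.
# epsilon = max(0.0, 1.0 - EPSILON_DECAY * episode)
```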
MiniGrid-Empty-8x8-v0
SARSA (backward view) converges to the optimal policy with much less training than the other algorithms due to its online updates.
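For reference, a hedged sketch of the backward-view SARSA(λ) update behind that observation: a single online TD error updates every recently visited state-action pair through eligibility traces (the accumulating-trace variant and the names here are assumptions, not necessarily the repo's exact code):

```python
from collections import defaultdict

def sarsa_lambda_step(Q, E, s, a, r, s_next, a_next, terminated,
                      alpha=0.3, gamma=0.9, lam=0.9):
    """One online backward-view SARSA(lambda) update."""
    target = r if terminated else r + gamma * Q[s_next][a_next]
    delta = target - Q[s][a]
    E[s][a] += 1.0  # accumulating eligibility trace for the current pair
    for state, traces in list(E.items()):
        for action, e in enumerate(traces):
            if e > 0.0:
                Q[state][action] += alpha * delta * e   # spread the TD error
                traces[action] = gamma * lam * e        # decay the trace

# Traces are reset at the start of every episode:
# E = defaultdict(lambda: [0.0] * 3)
```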
- Dynamic obstacles
MiniGrid-Dynamic-Obstacles-8x8-v0
MiniGrid-Dynamic-Obstacles-5x5-v0
MiniGrid-Dynamic-Obstacles-Random-5x5-v0
MiniGrid-Dynamic-Obstacles-Random-6x6-v0
- The reward is 0 everywhere except at obstacles and the goal position.
- If the agent runs into an obstacle, it gets a reward of -1.
- The goal position gives a reward of 1.
- The total reward received in an episode is `1 - 0.9 * (steps / max_steps)`.
states |
---|
env.agent_pos (position of the agent in the grid) |
env.agent_dir (direction the agent is facing) (0-3) |
env.grid.get(*env.front_pos) (0-1) (whether an obstacle is directly in front of the agent) |

actions |
---|
turn_left (0) |
turn_right (1) |
move_forward (2) |
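A sketch of how the extra state component might be derived, assuming the raw MiniGrid API where `grid.get` returns `None` for an empty cell; treating any object in front (wall or ball) as blocked is an assumption about the exact 0/1 encoding:

```python
def state_key_dynamic(env):
    # State = agent cell + heading + whether the cell directly in front is occupied.
    base = env.unwrapped
    front_obj = base.grid.get(*base.front_pos)   # None if the cell is empty
    front_blocked = 0 if front_obj is None else 1
    return (tuple(base.agent_pos), base.agent_dir, front_blocked)
```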
- Episodes: all algorithms are initially run for 600 episodes to train the policy (this may be altered depending on how quickly a particular algorithm converges).
- All algorithms follow an ε-greedy policy, except Q-learning, where the behaviour policy is ε-greedy and the target policy is greedy.
- ε decreases by 0.002 every episode to balance exploration and exploitation.
- All other parameters are kept the same as in the previous environment (Empty).
MiniGrid-Dynamic-Obstacles-Random-6x6-v0
MiniGrid-Dynamic-Obstacles-8x8-v0
Four rooms
states |
---|
env.agent_pos (position of the agent in the grid) |
env.agent_dir (direction the agent is facing) (0-3) |

actions |
---|
turn_left (0) |
turn_right (1) |
move_forward (2) |
- Both the walls and the goal position are fixed in order to keep the Q-table small.
- Will try a random goal position, which will increase the state space by 3x.
- Random policy:
- Will update this when I find an interesting environment to work with (no more MiniGrid :)).
Future work will be in a different repo:
- Deep Q-Network (action-value function approximation)
- Actor-Critic (policy approximation)