Classical-RL

This repository contains tabular RL algorithms, from Monte-Carlo to Q-learning, most of which are implemented on MiniGrid environments.

This repo was made to learn and implement the different classical/tabular algorithms taught by David Silver (DeepMind) in his RL lectures.

Environment:

Minigrid

Current Status:

  • Empty
  • Dynamic obstacles
  • FourRooms

Present environments:

  • Empty (empty grid)

  • MiniGrid-Empty-5x5-v0
  • MiniGrid-Empty-8x8-v0
Reward:
  • The reward is 0 everywhere except at the goal position, which gives a reward of 1.
  • The total reward received in an episode is 1 - 0.9 * steps / max_steps.
State space (a Q-table key sketch follows below):
  • env.agent_pos (position of the agent in the grid)
  • env.agent_dir (direction the agent is facing, 0-3)
Action space:
  • turn_right (0)
  • turn_left (1)
  • move_forward (2)
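
For the tabular methods these values can be packed into a single hashable Q-table key. A minimal sketch, assuming the gymnasium and minigrid packages; the state_key helper is illustrative and not taken from this repo:

```python
import gymnasium as gym
import minigrid  # noqa: F401  (importing registers the MiniGrid-* environments)

def state_key(env):
    """Illustrative helper (not from this repo): hashable Q-table key."""
    base = env.unwrapped
    x, y = base.agent_pos          # agent position in the grid
    return (x, y, base.agent_dir)  # agent_dir: 0=right, 1=down, 2=left, 3=up

env = gym.make("MiniGrid-Empty-5x5-v0")
env.reset(seed=0)
print(state_key(env))              # e.g. (1, 1, 0)
```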

General parameters and hyperparameters

  • Episodes: all the algorithms are initially run for 150 episodes to train the policy (this may be altered depending on the convergence of a particular algorithm).
  • All algorithms follow an ε-greedy policy, except Q-learning, where the behaviour policy is ε-greedy and the target policy is greedy.
  • ε decreases by 0.01 every episode to balance exploration and exploitation (the decay may be larger for Monte-Carlo); a sketch of this selection rule follows below.
  • The update parameter α is set to 0.3 and works fine for all algorithms.
  • The discount factor γ is set to 0.9 and works fine for all cases.
  • The parameter λ is set to 0.9 for SARSA-λ and backward-view SARSA.
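
A minimal sketch of the ε-greedy selection and linear decay described above, assuming a dict-based Q-table. The numeric constants are the ones listed in this section; the starting ε of 1.0 and the helper names are assumptions:

```python
import random
from collections import defaultdict

ALPHA = 0.3            # update step size α
GAMMA = 0.9            # discount factor γ
LAMBDA_ = 0.9          # trace decay λ for SARSA-λ / backward-view SARSA
EPSILON_DECAY = 0.01   # per-episode ε decay used for the Empty environments

ACTIONS = [0, 1, 2]    # action indices as listed above
Q = defaultdict(lambda: {a: 0.0 for a in ACTIONS})  # tabular action values

def epsilon_greedy(state, epsilon):
    """Random action with probability epsilon, otherwise the greedy action."""
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    return max(Q[state], key=Q[state].get)

def decayed_epsilon(episode, start=1.0):
    """Linear per-episode decay, floored at 0 (start=1.0 is an assumption)."""
    return max(0.0, start - EPSILON_DECAY * episode)
```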

Reward vs episodes

  • MiniGrid-Empty-8x8-v0: reward curves (plots) for Monte-Carlo, SARSA/SARSA(0), SARSA-λ (forward view), and SARSA (backward view).

SARSA (backward view) converges to the optimal policy with far less training than the other algorithms, thanks to its online updates (sketched below).
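
A minimal sketch of one backward-view SARSA(λ) step with accumulating eligibility traces, which is what makes the updates online; the constants repeat the values above and the function is illustrative, not the repo's code:

```python
from collections import defaultdict

ALPHA, GAMMA, LAMBDA_ = 0.3, 0.9, 0.9   # values listed above
ACTIONS = [0, 1, 2]
Q = defaultdict(lambda: {a: 0.0 for a in ACTIONS})
E = defaultdict(float)                   # eligibility trace per (state, action)

def sarsa_lambda_step(s, a, r, s_next, a_next, done):
    """One online backward-view SARSA(λ) update (illustrative)."""
    td_error = r + (0.0 if done else GAMMA * Q[s_next][a_next]) - Q[s][a]
    E[(s, a)] += 1.0                     # accumulating trace
    for (state, action), trace in list(E.items()):
        Q[state][action] += ALPHA * td_error * trace   # every traced pair learns now
        E[(state, action)] = GAMMA * LAMBDA_ * trace   # decay the trace
```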

  • MiniGrid-Empty-8x8-v0: reward curve (plot) for Q-learning.

  • Dynamic obstacles

  • MiniGrid-Dynamic-Obstacles-8x8-v0
  • MiniGrid-Dynamic-Obstacles-5x5-v0
  • MiniGrid-Dynamic-Obstacles-Random-5x5-v0
  • MiniGrid-Dynamic-Obstacles-Random-6x6-v0
Reward:
  • The reward is 0 everywhere except at obstacles and the goal position.
  • If the agent runs into an obstacle, it gets a reward of -1.
  • The goal position gives a reward of 1.
  • The total reward received in an episode is 1 - 0.9 * steps / max_steps.
State space (a key-construction sketch follows below):
  • env.agent_pos (position of the agent in the grid)
  • env.agent_dir (direction the agent is facing, 0-3)
  • env.grid.get(*env.front_pos) (0-1, whether an obstacle is in the cell directly in front of the agent)
Action space:
  • turn_right (0)
  • turn_left (1)
  • move_forward (2)
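
A minimal sketch of how the extra obstacle flag might be read from the environment; the ball-type check is an assumption about how the 0/1 value is derived, and state_key is illustrative rather than the repo's helper:

```python
import gymnasium as gym
import minigrid  # noqa: F401  (registers the MiniGrid-Dynamic-Obstacles-* environments)

def state_key(env):
    """Illustrative helper (not from this repo): position, heading, obstacle flag."""
    base = env.unwrapped
    x, y = base.agent_pos
    front_cell = base.grid.get(*base.front_pos)   # None when the cell is empty
    obstacle_ahead = int(front_cell is not None and front_cell.type == "ball")
    return (x, y, base.agent_dir, obstacle_ahead)

env = gym.make("MiniGrid-Dynamic-Obstacles-5x5-v0")
env.reset(seed=0)
print(state_key(env))
```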

General parameters and hyperparameters

  • Episodes: all the algorithms are initially run for 600 episodes to train the policy (this may be altered depending on the convergence of a particular algorithm).
  • All algorithms follow an ε-greedy policy, except Q-learning, where the behaviour policy is ε-greedy and the target policy is greedy (see the sketch after this list).
  • ε decreases by 0.002 every episode to balance exploration and exploitation.
  • The remaining parameters are kept the same as in the previous environment (Empty).
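
A minimal sketch of the off-policy Q-learning update referred to above: actions are chosen by the ε-greedy behaviour policy, while the target bootstraps from the greedy (max) value of the next state. The names and constants mirror the earlier sketches and are not taken from the repo:

```python
from collections import defaultdict

ALPHA, GAMMA = 0.3, 0.9
ACTIONS = [0, 1, 2]
Q = defaultdict(lambda: {a: 0.0 for a in ACTIONS})

def q_learning_step(s, a, r, s_next, done):
    """One off-policy Q-learning update (illustrative)."""
    target = r if done else r + GAMMA * max(Q[s_next].values())  # greedy target
    Q[s][a] += ALPHA * (target - Q[s][a])
```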

Reward vs episodes

  • MiniGrid-Dynamic-Obstacles-Random-6x6-v0: reward curve (plot) for SARSA (backward view).
  • MiniGrid-Dynamic-Obstacles-8x8-v0: reward curve (plot) for Q-learning.
  • MiniGrid-Dynamic-Obstacles-Random-6x6-v0: reward curve (plot).

  • Four rooms

State space:
  • env.agent_pos (position of the agent in the grid)
  • env.agent_dir (direction the agent is facing, 0-3)
Action space:
  • turn_right (0)
  • turn_left (1)
  • move_forward (2)
  • Both the walls and the goal position are fixed in order to keep the Q-table small.

  • Will try a random goal position, which will increase the state space by 3x.

  • Plots: random policy and SARSA (backward view).

Future environments:

  • Will update this when I find an interesting environment to work with (no more MiniGrid :) ).

Future work:

Future work will be in a different repo.

Deep Reinforcement Learning
Basic:
  • Deep Q-Network (action-value function approximation)
  • Actor-Critic (policy approximation)
