Training progress in the Maze environment:
0 training episodes:
Agent acts randomly, has no notion of goal states and collides with obstacles multiple times.
50 training episodes:
Agent learns to avoid obstacles, but doesn't yet know that reaching the goal state is more rewarding.
100 training episodes:
Agent learns to reach goal state quickly, but collides with obstacles on the way.
200 training episodes:
Agent learns to trade off collisions against the time taken to reach the goal state. The current policy seems to be close to optimal human behaviour.
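
The progression above is consistent with a reward that combines a per-step time cost, a collision penalty and a goal bonus. The sketch below is only an illustration of that trade-off: the reward constants, the function name episode_return and the step/collision counts for each stage are all hypothetical and are not taken from the actual environment or training runs.

```python
# Hypothetical reward constants; the real environment's values are not given.
STEP_PENALTY = -1.0        # cost of each time step (encourages speed)
COLLISION_PENALTY = -5.0   # cost of each collision with an obstacle
GOAL_REWARD = 100.0        # bonus for reaching the goal state


def episode_return(steps, collisions, reached_goal):
    """Undiscounted return of one episode under the assumed reward scheme."""
    total = steps * STEP_PENALTY + collisions * COLLISION_PENALTY
    if reached_goal:
        total += GOAL_REWARD
    return total


# Illustrative (made-up) episodes mirroring the four stages described above.
random_agent = episode_return(steps=200, collisions=30, reached_goal=False)
cautious_agent = episode_return(steps=200, collisions=0, reached_goal=False)
hasty_agent = episode_return(steps=20, collisions=8, reached_goal=True)
balanced_agent = episode_return(steps=30, collisions=0, reached_goal=True)

for name, value in [
    ("random", random_agent),
    ("cautious", cautious_agent),
    ("hasty", hasty_agent),
    ("balanced", balanced_agent),
]:
    print(f"{name:>8}: {value:7.1f}")
```

Under these assumed numbers, each stage earns a higher return than the one before it, and the balanced behaviour scores highest because it reaches the goal without paying collision penalties.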