Taxi-v3 infinite loop in policy_iteration #19

Open
jblackwood12 opened this issue Apr 2, 2024 · 0 comments
The "value_iteration()" works perfectly.

However, "policy_iteration()" appears to get stuck on line 135 of planner.py in "policy_evaluation()" as the break statement isn't reached.

if np.max(np.abs(prev_V - V)) < theta:
    break

This appears to happen because the Taxi-v3 environment (link) gives a -1 reward for each step taken, which keeps the absolute difference on line 135 from ever dropping below theta.
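
For intuition, here is a minimal sketch (not the library's code), assuming the evaluation is effectively undiscounted (gamma = 1): if the current policy keeps a state cycling on -1 rewards without ever reaching the terminal state, every sweep lowers that state's value by 1, so the sweep-to-sweep change never shrinks below theta:

import numpy as np

gamma = 1.0       # assumption: effectively undiscounted evaluation
theta = 1e-10
V = np.zeros(1)   # one non-terminal state whose policy action loops back to it

for sweep in range(5):
    prev_V = V.copy()
    V = -1.0 + gamma * prev_V              # Bellman backup: reward -1, stay put
    delta = np.max(np.abs(prev_V - V))
    print(f"sweep {sweep}: V={V[0]:.1f}, delta={delta:.1f}")
    if delta < theta:                      # same style of check as line 135
        break

Here delta stays at 1.0 on every sweep, so the break is never reached; with gamma < 1 the same update contracts and the check would eventually pass.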

Code to reproduce:

import gymnasium as gym
from bettermdptools.algorithms.planner import Planner

large_mdp = gym.make('Taxi-v3', render_mode=None)
observation, info = large_mdp.reset(seed=555) # passenger at green, destination at yellow
V, V_track, pi = Planner(large_mdp.unwrapped.P).policy_iteration()

To bypass the infinite loop and reach a result identical to value iteration, I increased theta and n_iters. This works, but it seems against the spirit of policy iteration:

import gymnasium as gym
from bettermdptools.algorithms.planner import Planner

large_mdp = gym.make('Taxi-v3', render_mode=None)
observation, info = large_mdp.reset(seed=555) # passenger at green, destination at yellow
V, V_track, pi = Planner(large_mdp.unwrapped.P).policy_iteration(n_iters=10000, theta=10000)

From reading up on policy evaluation, it seems that convergence requires the value function (within policy_evaluation()) to increase monotonically, and the Taxi-v3 environment doesn't satisfy that requirement since its step rewards are -1.

I wonder if a different convergence criterion could be used in "policy_evaluation()", perhaps np.isclose with a relative difference?
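
For example, a relative-tolerance stopping test along these lines (hypothetical, not existing planner.py code) would treat large-magnitude values, like Taxi's accumulated -1 step costs, as converged once they stop changing relative to their own size:

import numpy as np

def values_converged(prev_V, V, rtol=1e-6, atol=1e-8):
    # np.allclose checks |prev_V - V| <= atol + rtol * |V| elementwise,
    # i.e. a relative rather than a purely absolute difference
    return np.allclose(prev_V, V, rtol=rtol, atol=atol)

# inside the evaluation sweep, in place of the absolute-difference check:
# if values_converged(prev_V, V):
#     break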
