However, "policy_iteration()" appears to get stuck on line 135 of planner.py in "policy_evaluation()" as the break statement isn't reached.
if np.max(np.abs(prev_V - V)) < theta:
break
This appears to happen because the Taxi-v3 environment (link) gives a reward of -1 for every step taken, which keeps the absolute difference checked on line 135 from ever dropping below theta.
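For context, here is a minimal sketch of standard iterative policy evaluation (Sutton & Barto style) showing where such a break condition sits. This is only an illustration of the algorithm, not the actual planner.py code, and it assumes the gymnasium transition-dict layout and a callable policy:

import numpy as np

def policy_evaluation_sketch(P, pi, gamma=0.99, theta=1e-10):
    # Assumption: P[s][a] is a list of (prob, next_state, reward, done) tuples,
    # as exposed by env.unwrapped.P, and pi maps a state to an action.
    V = np.zeros(len(P), dtype=np.float64)
    while True:
        prev_V = V.copy()
        V = np.zeros(len(P), dtype=np.float64)
        for s in range(len(P)):
            for prob, next_state, reward, done in P[s][pi(s)]:
                # Bellman expectation backup under the fixed policy pi
                V[s] += prob * (reward + gamma * prev_V[next_state] * (not done))
        # the break in question: stop once successive sweeps differ by less than theta
        if np.max(np.abs(prev_V - V)) < theta:
            break
    return V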
Code to reproduce:
import gymnasium as gym
from bettermdptools.algorithms.planner import Planner
large_mdp = gym.make('Taxi-v3', render_mode=None)
observation, info = large_mdp.reset(seed=555) # passenger at green, destination at yellow
V, V_track, pi = Planner(large_mdp.unwrapped.P).policy_iteration()
To bypass the infinite loop and still reach a result identical to value iteration, I increased theta and n_iters. This works, but it seems against the spirit of policy iteration:
import gymnasium as gym
from bettermdptools.algorithms.planner import Planner
large_mdp = gym.make('Taxi-v3', render_mode=None)
observation, info = large_mdp.reset(seed=555) # passenger at green, destination at yellow
V, V_track, pi = Planner(large_mdp.unwrapped.P).policy_iteration(n_iters=10000, theta=10000)
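One way to check the "identical result to value iteration" claim (a hypothetical sanity check on my part, assuming value_iteration() returns the same V, V_track, pi triple as policy_iteration()) is to compare the two value arrays directly:

import numpy as np

# Hypothetical sanity check: run value iteration on the same MDP and compare values
V_vi, V_vi_track, pi_vi = Planner(large_mdp.unwrapped.P).value_iteration()
print(np.allclose(np.asarray(V), np.asarray(V_vi)))  # True when the two planners agree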
Reading up on policy evaluation, it seems that convergence requires the value function (within policy_evaluation()) to be monotonically increasing, and the Taxi-v3 environment doesn't satisfy that requirement because of its -1 step rewards.
I wonder if a different convergence criterion could be used in "policy_evaluation()", perhaps np.isclose with a relative difference?
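As a rough illustration of what I mean, a relative-tolerance check could look something like this, using np.allclose (the array-wide form of np.isclose); the rtol/atol values are placeholders, not tuned:

import numpy as np

def evaluation_converged(prev_V, V, rtol=1e-8, atol=1e-10):
    # Relative stopping test: converged once every state's value changes by
    # less than atol + rtol * |prev_V[s]|. The bound scales with the magnitude
    # of the values, so the large negative values accumulated from -1 step
    # rewards no longer have to shrink below a fixed absolute theta.
    return np.allclose(V, prev_V, rtol=rtol, atol=atol)

# inside policy_evaluation(), in place of np.max(np.abs(prev_V - V)) < theta:
# if evaluation_converged(prev_V, V):
#     break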