
Maximal Iteration cycle #34

Closed
ga72kud opened this issue Jan 20, 2022 · 12 comments

Comments

@ga72kud
ga72kud commented Jan 20, 2022

I want to set a maximal number of iterations for the MDP (here a state variable x[3] describes the current iteration). If x[3] is greater than 20, the MDP reaches the terminal state. Warning: the following code is only for illustration. I am wondering whether there is another way to set this maximal limit, and whether the current iteration count is readable.

mdp = QuickMDP(
	function gen(x, u, rng)
		x₁, x₂, x₃= x
		x₃=+1
		x′ = [x₁, x₂, x₃]
	isterminal = x[3] > 20,

In the solver one can set this variable, but it does not seem to fit here:

max_iterations=100
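For context, max_iterations is a solver keyword rather than part of the MDP definition; a minimal sketch, assuming DiscreteValueIteration.jl is the solver in use:

using DiscreteValueIteration

# max_iterations bounds the number of value-iteration sweeps during solving;
# it does not limit how many steps a simulated episode takes.
solver = ValueIterationSolver(max_iterations = 100)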
@zsunberg
Member

This should work. One thing that I immediately note is that you have written x₃=+1, which will set x₃ to 1. Did you mean x₃+=1 (which will increment x₃) instead?
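A quick plain-Julia illustration of the difference (not QuickMDP-specific):

x₃ = 5
x₃ = +1    # unary plus: this assigns 1, discarding the old value
x₃ += 1    # increment: x₃ is now 2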

@ga72kud
Author

ga72kud commented Jan 21, 2022

Sorry, that was a typo. I checked it before without the typo, and this is an approach that will work. The disadvantage I see is that it extends the state space by one dimension, which I guess could affect the performance of the solver (for example in the Grid Interpolation example). What I am currently trying to test is whether CommonRLInterface would help as a workaround. I have not tested it yet.

@ga72kud
Author

ga72kud commented Jan 21, 2022

Something like this stops the simulation after MAXITER iterations... As far as I understand CommonRLInterface, I can now connect POMDPs.jl, ReinforcementLearning.jl, etc.

using CommonRLInterface
include("envs/myEnv.jl")



env = myEnv(.1,0)

reset!(env)

rsum = 0.0
while !terminated(env)
    global rsum += act!(env, rand(actions(env)))
end

@show rsum
using CommonRLInterface
using StaticArrays
using Compose
using Plots
import ColorSchemes

const MAXITER = 5   # maximum number of steps per episode

# A simple test environment: a 1-D state with a step counter c that ends the
# episode once MAXITER steps have been taken.
mutable struct myEnv <: AbstractEnv
    s::Float64   # state
    c::Int64     # step counter
end

function CommonRLInterface.reset!(env::myEnv)
    env.s = 0.0
    env.c = 0
end

CommonRLInterface.actions(env::myEnv) = (-1.0, 0.0, 1.0)
CommonRLInterface.observe(env::myEnv) = env.s
CommonRLInterface.terminated(env::myEnv) = env.c >= MAXITER  # stop after MAXITER steps

function CommonRLInterface.act!(env::myEnv, a)
    print(".")
    env.c += 1
    r = -env.s^2 - a^2
    env.s = env.s + a + randn()
    return r
end

@ga72kud
Author

ga72kud commented Jan 21, 2022

I am struggling to use POMDPs.jl with CommonRLInterface. Is there a minimal example? At least something is mentioned here: https://juliareinforcementlearning.org/CommonRLInterface.jl/dev/faqs/

  • Suppose you have an abstract environment type in your package called YourEnv. Support for AbstractEnv means:

  • You provide convert methods:

        convert(::Type{YourEnv}, ::AbstractEnv)
        convert(::Type{AbstractEnv}, ::YourEnv)

    If there are additional options in the conversion, you are encouraged to create and document constructors with additional arguments.

  • You provide an implementation of the interface functions from your framework only using functions from CommonRLInterface.

  • You implement at minimum the required interface and as many optional functions as you'd like to support, where YourCommonEnv is the concrete type returned by convert(Type{AbstractEnv}, ::YourEnv).

@ga72kud
Author

ga72kud commented Jan 21, 2022

Is it something like the following? First I provide the environment with the state, action, and observation spaces via CommonRLInterface, and then I have to use the convert function to pass along the action, observation, and state spaces?

@zsunberg
Member

The instructions that you quote above are for package developers, not users. For users, you can just use convert, but you have to import the POMDPModelTools package:

https://juliapomdp.github.io/POMDPModelTools.jl/stable/common_rl/#CommonRLInterface-Integration
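A minimal sketch of that user-level route, assuming the myEnv environment defined above and that POMDPModelTools is installed (the exact wrapper type returned depends on which optional CommonRLInterface functions the environment provides):

using POMDPs
using POMDPModelTools     # provides the CommonRLInterface integration
using CommonRLInterface

include("envs/myEnv.jl")
env = myEnv(0.1, 0)
m = convert(MDP, env)     # wrap the CommonRL environment as a POMDPs.jl MDP
# m can now be used with POMDPs.jl simulators and generative (simulation-based)
# solvers; solvers that need explicit transition probabilities will not work.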

@ga72kud
Author

ga72kud commented Jan 21, 2022

@zsunberg thank you. It seems that I misunderstood the CommonRL package, but I think it can solve my initial question about the maximal iteration cycle:

# POMDP <-- CommonRLInterface
m = convert(POMDP, env)
planner = solve(xSolver(), m)      # xSolver stands in for a concrete solver
a = action(planner, initialstate(m))

Is there really no other way to do this directly in QuickPOMDP?!

@ga72kud
Author

ga72kud commented Jan 25, 2022

Is there a possibility to use additional functions in QuickMDP or QuickPOMDP?

In this minimal, incomplete example, cnt is a counting variable, and the QuickMDP should stop once it reaches the threshold of 10. The current workaround is to extend the state space by a new variable (which makes no sense in my view):

cnt = 0
mdp = QuickMDP(
    function gen(s, a, rng)
        x, v = s
        #incr_cnt()
        xₚ = clamp(x + Ts*v + rand(rng), PXMIN, PXMAX)
        vₚ = clamp(v + Ts*a, VMIN, VMAX)
        r = v > 0.5 ? 0.5 : -1
        return (sp=[xₚ, vₚ], r=r)
    end,
    actions = collect(0.:.1:1),
    initialstate = [[0.0, 0.0]],
    discount = 0.95,
    cnt += 1,
    isterminal = function(cnt)   # or isterminal = cnt -> cnt > 10
        cnt > 10
    end,

@zsunberg
Member

Sorry for the delay in responding to this. Using a global cnt variable won't be a very good solution: if you try to simulate several of these MDPs at once, or use an online planner, both models would try to use the same global variable and cause issues. I think augmenting the state space with time is probably the best solution to get something working quickly.

You may also be interested in: https://github.com/JuliaPOMDP/FiniteHorizonPOMDPs.jl .
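A minimal sketch of the time-augmentation idea, adapted from the QuickMDP snippet above (Ts, PXMIN, PXMAX, VMIN and VMAX are assumed to be defined as in that snippet):

mdp = QuickMDP(
    function (s, a, rng)
        x, v, t = s                    # t is the step counter, carried in the state
        xₚ = clamp(x + Ts*v + rand(rng), PXMIN, PXMAX)
        vₚ = clamp(v + Ts*a, VMIN, VMAX)
        r = v > 0.5 ? 0.5 : -1
        return (sp=[xₚ, vₚ, t + 1], r=r)
    end,
    actions = collect(0.:.1:1),
    initialstate = [[0.0, 0.0, 0.0]],
    discount = 0.95,
    isterminal = s -> s[3] > 10        # terminal once more than 10 steps have passed
)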

@ga72kud
Author

ga72kud commented Jan 31, 2022

Thank you for the link; I appreciate your help and the information. It might be interesting for my application, and I agree that augmenting the state space is probably the quickest way.
I want to train an MDP to reach a particular goal in a fixed time, without teaching it a fixed goal state in advance. This is a common use case in optimal control, and I am not sure whether I can handle it with an MDP. Is it necessary to use an online MDP, or to add a variable for the goal state in the generative function (or in the transition and reward functions)? The question I have is: what if the reward function changes over time? I read many examples, but I did not see a use case like this.
I use an MDP with a classical value iteration algorithm. Suppose I had a regular grid with 10x10 states; I would need 100 policies for the different end states, which makes no sense. If I use the approximate solution, I interpolate between the states to get an approximated value for the reward function. I want to train a universal MDP capable of moving toward a specific goal without specifying the goal state in advance. It might be interesting if additional information could be passed into the gen function. What happens in situations where the MDP needs memory (states are memory in some sense, but with dedicated memory states I could keep the state space small) and prior knowledge?

@zsunberg
Member

Hi @ga72kud ,

In order to find an optimal policy for a finite horizon problem, you have two options:

  1. Find a single stationary policy for the augmented state space that includes time.
  2. Find a non-stationary policy that consists of a list of policies, one for each time step.

If implemented properly, these are computationally identical. (The solvers in DiscreteValueIteration.jl are not optimized for this, but they should work to get a good start.)

You are right to say that if you want to find a single optimal policy for reaching any goal, you have to include both the goal and the vehicle's position in the MDP state. So, for a 2D grid world, the MDP state would be four-dimensional, and five-dimensional if time is included.

My advice is to make the MDP model include everything so that it represents the problem correctly; i.e. do not include any approximations and do not worry about how hard the problem will be to solve while you are formulating it. Then, when you get to the solution stage, you can make approximations. This may include using neural networks for the value function as in DQN, or starting with simplified formulations of the problem.
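As an illustration of that modeling advice, here is a sketch of a goal-conditioned, finite-horizon grid-world state (the type and function names are hypothetical, not part of any package API):

struct GoalGridState
    pos::Tuple{Int,Int}    # vehicle position on the grid
    goal::Tuple{Int,Int}   # goal cell, carried in the state so one policy covers all goals
    t::Int                 # elapsed time steps, for the finite horizon
end

# Because the goal and the time live inside the state, the reward and terminal
# checks stay stationary; they never have to "change over time".
grid_reward(s::GoalGridState) = s.pos == s.goal ? 1.0 : 0.0
grid_isterminal(s::GoalGridState) = s.pos == s.goal || s.t >= 10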

@ga72kud
Author

ga72kud commented Feb 14, 2022

Thank you, I appreciate the information.

@ga72kud ga72kud closed this as completed Feb 14, 2022