
Maximal Iteration cycle #34

Closed
ga72kud opened this issue Jan 20, 2022 · 12 comments

Comments

@ga72kud
ga72kud commented Jan 20, 2022

I want to set a maximal number of iterations for the MDP (here a state variable x[3] describes the current iteration). If x[3] is greater than 20, the MDP reaches the terminal state. Warning: the following code is only for illustration. I am wondering whether there is another way to set this maximal limit, and whether the current iteration count is readable.

mdp = QuickMDP(
	function gen(x, u, rng)
		x₁, x₂, x₃= x
		x₃=+1
		x′ = [x₁, x₂, x₃]
	isterminal = x[3] > 20,

In the solver one can set this variable, but it does not seem to fit here:

max_iterations=100
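For context, max_iterations is a solver keyword rather than part of the MDP definition; a minimal sketch, assuming DiscreteValueIteration.jl is the solver in use:

using DiscreteValueIteration

# max_iterations bounds the number of value-iteration sweeps during solving;
# it does not limit how many steps a simulated episode takes.
solver = ValueIterationSolver(max_iterations = 100)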
@zsunberg
Member

This should work. One thing that I immediately note is that you have written x₃=+1, which will set x₃ to 1. Did you mean x₃+=1 (which will increment x₃) instead?
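A quick plain-Julia illustration of the difference (not QuickMDP-specific):

x₃ = 5
x₃ = +1    # unary plus: this assigns 1, discarding the old value
x₃ += 1    # increment: x₃ is now 2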

@ga72kud
Author

ga72kud commented Jan 21, 2022

Sorry, that was a typo. I checked it before without the typo, and this is an approach that will work. The disadvantage I see is that it extends the state space by one dimension, which I guess could affect the performance of the solver (for example in the Grid Interpolation example). What I am currently trying to test is whether CommonRLInterface would help as a workaround. I have not tested it yet.

@ga72kud
Author

ga72kud commented Jan 21, 2022

Something like this stops the simulation after MAXITER iterations... As far as I understand CommonRLInterface, I can now connect POMDPs.jl, ReinforcementLearning.jl, etc.

using CommonRLInterface
include("envs/myEnv.jl")



env = myEnv(.1,0)

reset!(env)

rsum = 0.0
while !terminated(env)
    global rsum += act!(env, rand(actions(env)))
end

@show rsum
using CommonRLInterface
using StaticArrays
using Compose
using Plots
import ColorSchemes

const MAXITER = 5   # maximum number of steps per episode

# A simple test environment: a 1-D state with a step counter c that ends the
# episode once MAXITER steps have been taken.
mutable struct myEnv <: AbstractEnv
    s::Float64   # state
    c::Int64     # step counter
end

function CommonRLInterface.reset!(env::myEnv)
    env.s = 0.0
    env.c = 0
end

CommonRLInterface.actions(env::myEnv) = (-1.0, 0.0, 1.0)
CommonRLInterface.observe(env::myEnv) = env.s
CommonRLInterface.terminated(env::myEnv) = env.c >= MAXITER  # stop after MAXITER steps

function CommonRLInterface.act!(env::myEnv, a)
    print(".")
    env.c += 1
    r = -env.s^2 - a^2
    env.s = env.s + a + randn()
    return r
end

@ga72kud
Author

ga72kud commented Jan 21, 2022

I am struggling to use POMDPs.jl with CommonRLInterface. Is there a minimal example? At least something is mentioned here: https://juliareinforcementlearning.org/CommonRLInterface.jl/dev/faqs/

  • Suppose you have an abstract environment type in your package called YourEnv. Support for AbstractEnv means:

  • You provide convert methods:

        convert(::Type{YourEnv}, ::AbstractEnv)
        convert(::Type{AbstractEnv}, ::YourEnv)

    If there are additional options in the conversion, you are encouraged to create and document constructors with additional arguments.

  • You provide an implementation of the interface functions from your framework only using functions from CommonRLInterface.

  • You implement at minimum the required interface and as many optional functions as you'd like to support, where YourCommonEnv is the concrete type returned by convert(Type{AbstractEnv}, ::YourEnv).

@ga72kud
Author

ga72kud commented Jan 21, 2022

Is it something like the following? First I provide the environment with the state, action, and observation spaces via CommonRLInterface, and then I have to use the convert function to pass along the action, observation, and state spaces?

@zsunberg
Member

The instructions that you quote above are for package developers, not users. For users, you can just use convert, but you have to import the POMDPModelTools package:

https://juliapomdp.github.io/POMDPModelTools.jl/stable/common_rl/#CommonRLInterface-Integration
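A minimal sketch of that user-level route, assuming the myEnv environment defined above and that POMDPModelTools is installed (the exact wrapper type returned depends on which optional CommonRLInterface functions the environment provides):

using POMDPs
using POMDPModelTools     # provides the CommonRLInterface integration
using CommonRLInterface

include("envs/myEnv.jl")
env = myEnv(0.1, 0)
m = convert(MDP, env)     # wrap the CommonRL environment as a POMDPs.jl MDP
# m can now be used with POMDPs.jl simulators and generative (simulation-based)
# solvers; solvers that need explicit transition probabilities will not work.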

@ga72kud
Author

ga72kud commented Jan 21, 2022

@zsunberg thank you. It seems that I misunderstood the CommonRL package, but I think it can solve my initial question about the maximal iteration cycle:

# POMDP <-- CommonRLInterface
m = convert(POMDP, env)
planner = solve(xSolver(), m)      # xSolver stands in for a concrete solver
a = action(planner, initialstate(m))

Is there really no other way to do this directly in QuickPOMDP?!

@ga72kud
Author

ga72kud commented Jan 25, 2022

Is there a possibility to use additional functions in QuickMDP or QuickPOMDP?

In this minimal, incomplete example, cnt is a counting variable, and the QuickMDP should stop once it reaches the threshold of 10. The current workaround is to extend the state space by a new variable (which makes no sense in my view):

cnt = 0
mdp = QuickMDP(
    function gen(s, a, rng)
        x, v = s
        #incr_cnt()
        xₚ = clamp(x + Ts*v + rand(rng), PXMIN, PXMAX)
        vₚ = clamp(v + Ts*a, VMIN, VMAX)
        r = v > 0.5 ? 0.5 : -1
        return (sp=[xₚ, vₚ], r=r)
    end,
    actions = collect(0.:.1:1),
    initialstate = [[0.0, 0.0]],
    discount = 0.95,
    cnt += 1,
    isterminal = function(cnt)   # or isterminal = cnt -> cnt > 10
        cnt > 10
    end,

@zsunberg
Member

Sorry for the delay in responding to this. Using a global cnt variable won't be a very good solution: if you try to simulate several of these MDPs at once, or use an online planner, both models would try to use the same global variable and cause issues. I think augmenting the state space with time is probably the best solution to get something working quickly.

You may also be interested in: https://github.com/JuliaPOMDP/FiniteHorizonPOMDPs.jl .
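A minimal sketch of the time-augmentation idea, adapted from the QuickMDP snippet above (Ts, PXMIN, PXMAX, VMIN and VMAX are assumed to be defined as in that snippet):

mdp = QuickMDP(
    function (s, a, rng)
        x, v, t = s                    # t is the step counter, carried in the state
        xₚ = clamp(x + Ts*v + rand(rng), PXMIN, PXMAX)
        vₚ = clamp(v + Ts*a, VMIN, VMAX)
        r = v > 0.5 ? 0.5 : -1
        return (sp=[xₚ, vₚ, t + 1], r=r)
    end,
    actions = collect(0.:.1:1),
    initialstate = [[0.0, 0.0, 0.0]],
    discount = 0.95,
    isterminal = s -> s[3] > 10        # terminal once more than 10 steps have passed
)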

@ga72kud
Author

ga72kud commented Jan 31, 2022

Thank you for the link; I appreciate your help and the information. It might be interesting for my application, and I agree that augmenting the state space is probably the quickest way.
I want to train an MDP to reach a particular goal in a fixed time, without teaching it a fixed goal state in advance. This is a common use case in optimal control, and I am not sure whether I can handle it with an MDP. Is it necessary to use an online MDP, or to add a variable for the goal state in the generative function (or in the transition and reward functions)? The question I have is: what if the reward function changes over time? I read many examples, but I did not see a use case like this.
I use an MDP with a classical value iteration algorithm. Suppose I had a regular grid with 10x10 states; I would need 100 policies for the different end states, which makes no sense. If I use the approximate solution, I interpolate between the states to get an approximated value for the reward function. I want to train a universal MDP capable of moving toward a specific goal without specifying the goal state in advance. It might be interesting if additional information could be passed into the gen function. What happens in situations where the MDP needs memory (states are memory in some sense, but with dedicated memory states I could keep the state space small) and prior knowledge?

@zsunberg
Member

Hi @ga72kud ,

In order to find an optimal policy for a finite horizon problem, you have two options:

  1. Find a single stationary policy for the augmented state space that includes time.
  2. Find a non-stationary policy that consists of a list of policies, one for each time step.

If implemented properly, these are computationally identical. (The solvers in DiscreteValueIteration.jl are not optimized for this, but they should work to get a good start.)

You are right to say that if you want to find a single optimal policy for reaching any goal, you have to include both the goal and the vehicle's position in the MDP state. So, for a 2D grid world, the MDP state would be four-dimensional, and five-dimensional if time is included.

My advice is to make the MDP model include everything so that it represents the problem correctly; i.e. do not include any approximations and do not worry about how hard the problem will be to solve while you are formulating it. Then, when you get to the solution stage, you can make approximations. This may include using neural networks for the value function as in DQN, or starting with simplified formulations of the problem.
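As an illustration of that modeling advice, here is a sketch of a goal-conditioned, finite-horizon grid-world state (the type and function names are hypothetical, not part of any package API):

struct GoalGridState
    pos::Tuple{Int,Int}    # vehicle position on the grid
    goal::Tuple{Int,Int}   # goal cell, carried in the state so one policy covers all goals
    t::Int                 # elapsed time steps, for the finite horizon
end

# Because the goal and the time live inside the state, the reward and terminal
# checks stay stationary; they never have to "change over time".
grid_reward(s::GoalGridState) = s.pos == s.goal ? 1.0 : 0.0
grid_isterminal(s::GoalGridState) = s.pos == s.goal || s.t >= 10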

@ga72kud
Author

ga72kud commented Feb 14, 2022

Thank you, I appreciate the information.

@ga72kud ga72kud closed this as completed Feb 14, 2022