Reinforcement Learning Task: Direct Policy Search in OpenAI Gym
Direct policy search often yields high-quality policies in complex reinforcement learning problems. It employs an optimization algorithm to search the parameters of the policy so as to maximize its total reward.
This page shows how to use ZOOpt to perform direct policy search on OpenAI Gym tasks.
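To make the idea concrete, here is a toy sketch of direct policy search (everything below, including the tiny one-dimensional "environment", is made up for illustration and is not part of ZOOpt or Gym): plain random search over the weights of a linear policy, keeping the weight vector with the highest total reward. ZOOpt essentially replaces the random sampling with a smarter derivative-free optimizer.

```python
# Toy sketch of direct policy search: random search over the weights of a
# linear policy on a made-up 1-d environment (hypothetical, for illustration).
import random

def rollout(w, steps=100):
    """Run one episode of a trivial 1-d environment, return the total reward."""
    state, total = 1.0, 0.0
    for _ in range(steps):
        action = w[0] * state + w[1]                  # linear policy
        state = max(-5.0, min(5.0, state - action))   # toy dynamics
        total += -abs(state)                          # reward: keep state near 0
    return total

best_w, best_reward = None, float('-inf')
for _ in range(1000):                                 # evaluation budget
    w = [random.uniform(-1, 1), random.uniform(-1, 1)]
    r = rollout(w)
    if r > best_reward:
        best_w, best_reward = w, r
print(best_w, best_reward)
```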
To simplify this procedure, we provide the necessary APIs in the `example/direct_policy_search_for_gym/gym_task.py` file.
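The wrapper's job is to turn a flat weight vector into an episode return. Below is a minimal sketch of what such a wrapper could look like; it is not the actual `gym_task.py`, and details such as the action decoding and the gym step API (the 4-tuple return assumes gym < 0.26) are simplifying assumptions of the sketch:

```python
# A simplified, hypothetical GymTask-like wrapper (the real gym_task.py differs).
import numpy as np
import gym

class SimpleGymTask:
    def __init__(self, task_name):
        self.env = gym.make(task_name)  # choose a task by name
        self.max_step = 1000
        self.layers = []
        self.w_size = 0

    def new_nnmodel(self, layers):
        # layers, e.g. [2, 5, 1]: neurons in the input, hidden and output layers
        self.layers = layers
        # number of weights of a fully connected network (biases omitted here)
        self.w_size = sum(layers[i] * layers[i + 1] for i in range(len(layers) - 1))

    def get_w_size(self):
        return self.w_size

    def set_max_step(self, max_step):
        self.max_step = max_step

    def _forward(self, w, x):
        # slice the flat weight vector into per-layer matrices, tanh activations
        start = 0
        for i in range(len(self.layers) - 1):
            n_in, n_out = self.layers[i], self.layers[i + 1]
            mat = np.asarray(w[start:start + n_in * n_out]).reshape(n_in, n_out)
            x = np.tanh(x @ mat)
            start += n_in * n_out
        return x

    def sum_reward(self, solution):
        # ZOOpt passes a Solution; its coordinates are the network weights
        w = solution.get_x()
        obs = self.env.reset()  # gym < 0.26 API assumed
        total = 0.0
        for _ in range(self.max_step):
            out = self._forward(w, np.asarray(obs, dtype=float))
            # naive action decoding: round the single output to a discrete action;
            # the real wrapper adapts to the task's action space
            action = int(np.clip(np.rint(out[0]), 0, self.env.action_space.n - 1))
            obs, reward, done, _ = self.env.step(action)
            total += reward
            if done:
                break
        # ExpOpt.min minimizes, so return the negated episode return
        return -total
```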
We also provide a function for running the test in the `example/direct_policy_search_for_gym/run.py` file.
This function does the following things:
- Construct a `GymTask` environment.

  ```python
  gym_task = GymTask(task_name)    # choose a task by name
  gym_task.new_nnmodel(layers)     # construct a neural network
  gym_task.set_max_step(max_step)  # set the max step in gym
  ```
- Define the corresponding objective and parameter.

  ```python
  # set dimension
  dim_size = gym_task.get_w_size()
  dim_regs = [[-10, 10]] * dim_size
  dim_tys = [True] * dim_size
  dim = Dimension(dim_size, dim_regs, dim_tys)
  # form up the objective function
  objective = Objective(gym_task.sum_reward, dim)
  # terminal_value: the procedure stops early once this value is reached;
  # it is not necessary for this example
  parameter = Parameter(budget=budget, terminal_value=terminal_value)
  ```
- Optimize.

  ```python
  solution_list = ExpOpt.min(objective, parameter, repeat=repeat, plot=True)
  ```
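`ExpOpt.min` returns a list of solutions, one per repeat. As a small usage sketch (assuming ZOOpt's `Solution` accessors `get_x` and `get_value`; treating the value as a negated episode return follows the wrapper sketch above), the results can be read back like this:

```python
# read back the result of each repeat
for solution in solution_list:
    weights = solution.get_x()    # the optimized network weights
    value = solution.get_value()  # the minimized objective value
    print(len(weights), value)
```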
The whole process is listed below.
```python
from gym_task import GymTask
from zoopt import Dimension, Objective, Parameter, ExpOpt


def run_test(task_name, layers, in_budget, max_step, repeat, terminal_value=None):
    """
    API of running direct policy search for a gym task.

    :param task_name: gym task name
    :param layers:
        layer information of the neural network,
        e.g., [2, 5, 1] means the input layer has 2 neurons, the (single)
        hidden layer has 5 and the output layer has 1
    :param in_budget: number of calls to the objective function
    :param max_step: max step in gym
    :param repeat: number of repetitions of the optimization
    :param terminal_value: early stop; the algorithm stops once this value is reached
    :return: no return
    """
    gym_task = GymTask(task_name)    # choose a task by name
    gym_task.new_nnmodel(layers)     # construct a neural network
    gym_task.set_max_step(max_step)  # set the max step in gym

    budget = in_budget       # number of calls to the objective function
    rand_probability = 0.95  # the probability of sampling from the learned model

    # set dimension
    dim_size = gym_task.get_w_size()
    dim_regs = [[-10, 10]] * dim_size
    dim_tys = [True] * dim_size
    dim = Dimension(dim_size, dim_regs, dim_tys)
    # form up the objective function
    objective = Objective(gym_task.sum_reward, dim)

    parameter = Parameter(budget=budget, terminal_value=terminal_value)
    parameter.set_probability(rand_probability)

    solution_list = ExpOpt.min(objective, parameter, repeat=repeat, plot=True)
```
With the help of this function, users can run a test in a few lines.
```python
if __name__ == '__main__':
    mountain_car_layers = [2, 5, 1]
    run_test('MountainCar-v0', mountain_car_layers, 2000, 1000, 1)
```
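Other tasks only differ in the network shape. As a hypothetical variation (the budget and max step below are arbitrary, and it assumes the wrapper supports this task), CartPole-v0 has a 4-dimensional observation, so the input layer would get 4 neurons:

```python
# hypothetical second run: CartPole-v0's observation is 4-dimensional
cartpole_layers = [4, 5, 1]
run_test('CartPole-v0', cartpole_layers, 2000, 1000, 1)
```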
After a few seconds, the optimization is done, and the optimization progress is visualized (enabled by the `plot=True` argument).
More concrete examples are available in the `example/direct_policy_search_for_gym` directory.