This repository has been archived by the owner on Dec 11, 2022. It is now read-only.

Release 0.9
Main changes are detailed below:

New features -
* CARLA 0.7 simulator integration
* Human control of the game play
* Recording of human game play and storing / loading the replay buffer
* Behavioral cloning agent and presets
* Golden tests for several presets
* Selecting between deep / shallow image embedders
* Rendering through pygame (with some boost in performance)

API changes -
* Improved environment wrapper API
* Added an evaluate flag to allow convenient evaluation of existing checkpoints
* Improve frameskip definition in Gym

Bug fixes -
* Fixed loading of checkpoints for agents with more than one network
* Fixed the N Step Q learning agent python3 compatibility
itaicaspi-intel authored Dec 19, 2017
1 parent 11faf19 commit 125c7ee
Showing 41 changed files with 1,713 additions and 260 deletions.
68 changes: 40 additions & 28 deletions README.md
@@ -13,10 +13,16 @@ Training an agent to solve an environment is as easy as running:
python3 coach.py -p CartPole_DQN -r
```

<img src="img/doom.gif" alt="Doom Health Gathering" width="265" height="200"/><img src="img/minitaur.gif" alt="PyBullet Minitaur" width="265" height="200"/> <img src="img/ant.gif" alt="Gym Extensions Ant" width="250" height="200"/>
<img src="img/doom_deathmatch.gif" alt="Doom Deathmatch" width="267" height="200"/> <img src="img/carla.gif" alt="CARLA" width="284" height="200"/> <img src="img/montezuma.gif" alt="MontezumaRevenge" width="152" height="200"/>

Blog post from the Intel® Nervana™ website can be found [here](https://www.intelnervana.com/reinforcement-learning-coach-intel).


## Documentation

Framework documentation, algorithm description and instructions on how to contribute a new agent/environment can be found [here](http://coach.nervanasys.com).


## Installation

Note: Coach has only been tested on Ubuntu 16.04 LTS, and with Python 3.5.
Expand Down Expand Up @@ -103,6 +109,8 @@ For example:

It is easy to create new presets for different levels or environments by following the same pattern as in presets.py

More usage examples can be found [here](http://coach.nervanasys.com/usage/index.html).

## Running Coach Dashboard (Visualization)
Training an agent to solve an environment can be tricky, at times.

@@ -121,11 +129,6 @@ python3 dashboard.py
<img src="img/dashboard.png" alt="Coach Design" style="width: 800px;"/>


## Documentation

Framework documentation, algoritmic description and instructions on how to contribute a new agent/environment can be found [here](http://coach.nervanasys.com).


## Parallelizing an Algorithm

Since the introduction of [A3C](https://arxiv.org/abs/1602.01783) in 2016, many algorithms have been shown to benefit from running multiple instances in parallel, on many CPU cores. So far, these algorithms include [A3C](https://arxiv.org/abs/1602.01783), [DDPG](https://arxiv.org/pdf/1704.03073.pdf), [PPO](https://arxiv.org/pdf/1707.06347.pdf), and [NAF](https://arxiv.org/pdf/1610.00633.pdf), and this is most probably only the beginning.
@@ -150,36 +153,45 @@ python3 coach.py -p Hopper_A3C -n 16

## Supported Environments

* OpenAI Gym
* *OpenAI Gym:*

Installed by default by Coach's installer.

* ViZDoom:
* *ViZDoom:*

Follow the instructions described in the ViZDoom repository -

https://github.com/mwydmuch/ViZDoom

Additionally, Coach assumes that the environment variable VIZDOOM_ROOT points to the ViZDoom installation directory.

* Roboschool:
* *Roboschool:*

Follow the instructions described in the roboschool repository -

https://github.com/openai/roboschool

* GymExtensions:
* *GymExtensions:*

Follow the instructions described in the GymExtensions repository -

https://github.com/Breakend/gym-extensions

Additionally, add the installation directory to the PYTHONPATH environment variable.

* PyBullet
* *PyBullet:*

Follow the instructions described in the [Quick Start Guide](https://docs.google.com/document/d/10sXEhzFRSnvFcl3XxNGhnD4N2SedqwdAvK3dsihxVUA) (basically just - 'pip install pybullet')

* *CARLA:*

Download release 0.7 from the CARLA repository -

https://github.com/carla-simulator/carla/releases

Create a new CARLA_ROOT environment variable pointing to CARLA's installation directory.

A simple CARLA settings file (```CarlaSettings.ini```) is supplied with Coach, and is located in the ```environments``` directory.
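
The environment variables mentioned above (`VIZDOOM_ROOT`, `CARLA_ROOT`) can be checked before launching a run. The snippet below is a minimal sketch and is not part of Coach; `REQUIRED_ROOTS` and `missing_roots` are hypothetical names introduced only for illustration.

```python
import os

# Hypothetical mapping (not part of Coach): simulator name -> variable the README asks for.
REQUIRED_ROOTS = {"ViZDoom": "VIZDOOM_ROOT", "CARLA": "CARLA_ROOT"}


def missing_roots(environ=os.environ):
    """Return the variables that are unset or do not point to an existing directory."""
    missing = []
    for variable in REQUIRED_ROOTS.values():
        path = environ.get(variable)
        if not path or not os.path.isdir(path):
            missing.append(variable)
    return sorted(missing)


if __name__ == "__main__":
    for variable in missing_roots():
        print("{} is not set or is not a directory".format(variable))
```

Only the simulators you actually intend to use need their variable set, so a reported variable is a hint rather than an error.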


## Supported Algorithms
@@ -190,24 +202,24 @@ python3 coach.py -p Hopper_A3C -n 16



* [Deep Q Network (DQN)](https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf)
* [Double Deep Q Network (DDQN)](https://arxiv.org/pdf/1509.06461.pdf)
* [Deep Q Network (DQN)](https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf) ([code](agents/dqn_agent.py))
* [Double Deep Q Network (DDQN)](https://arxiv.org/pdf/1509.06461.pdf) ([code](agents/ddqn_agent.py))
* [Dueling Q Network](https://arxiv.org/abs/1511.06581)
* [Mixed Monte Carlo (MMC)](https://arxiv.org/abs/1703.01310)
* [Persistent Advantage Learning (PAL)](https://arxiv.org/abs/1512.04860)
* [Categorical Deep Q Network (C51)](https://arxiv.org/abs/1707.06887)
* [Quantile Regression Deep Q Network (QR-DQN)](https://arxiv.org/pdf/1710.10044v1.pdf)
* [Bootstrapped Deep Q Network](https://arxiv.org/abs/1602.04621)
* [N-Step Q Learning](https://arxiv.org/abs/1602.01783) | **Distributed**
* [Neural Episodic Control (NEC)](https://arxiv.org/abs/1703.01988)
* [Normalized Advantage Functions (NAF)](https://arxiv.org/abs/1603.00748.pdf) | **Distributed**
* [Policy Gradients (PG)](http://www-anw.cs.umass.edu/~barto/courses/cs687/williams92simple.pdf) | **Distributed**
* [Asynchronous Advantage Actor-Critic (A3C)](https://arxiv.org/abs/1602.01783) | **Distributed**
* [Deep Deterministic Policy Gradients (DDPG)](https://arxiv.org/abs/1509.02971) | **Distributed**
* [Proximal Policy Optimization (PPO)](https://arxiv.org/pdf/1707.06347.pdf)
* [Clipped Proximal Policy Optimization](https://arxiv.org/pdf/1707.06347.pdf) | **Distributed**
* [Direct Future Prediction (DFP)](https://arxiv.org/abs/1611.01779) | **Distributed**

* [Mixed Monte Carlo (MMC)](https://arxiv.org/abs/1703.01310) ([code](agents/mmc_agent.py))
* [Persistent Advantage Learning (PAL)](https://arxiv.org/abs/1512.04860) ([code](agents/pal_agent.py))
* [Categorical Deep Q Network (C51)](https://arxiv.org/abs/1707.06887) ([code](agents/categorical_dqn_agent.py))
* [Quantile Regression Deep Q Network (QR-DQN)](https://arxiv.org/pdf/1710.10044v1.pdf) ([code](agents/qr_dqn_agent.py))
* [Bootstrapped Deep Q Network](https://arxiv.org/abs/1602.04621) ([code](agents/bootstrapped_dqn_agent.py))
* [N-Step Q Learning](https://arxiv.org/abs/1602.01783) | **Distributed** ([code](agents/n_step_q_agent.py))
* [Neural Episodic Control (NEC)](https://arxiv.org/abs/1703.01988) ([code](agents/nec_agent.py))
* [Normalized Advantage Functions (NAF)](https://arxiv.org/abs/1603.00748.pdf) | **Distributed** ([code](agents/naf_agent.py))
* [Policy Gradients (PG)](http://www-anw.cs.umass.edu/~barto/courses/cs687/williams92simple.pdf) | **Distributed** ([code](agents/policy_gradients_agent.py))
* [Asynchronous Advantage Actor-Critic (A3C)](https://arxiv.org/abs/1602.01783) | **Distributed** ([code](agents/actor_critic_agent.py))
* [Deep Deterministic Policy Gradients (DDPG)](https://arxiv.org/abs/1509.02971) | **Distributed** ([code](agents/ddpg_agent.py))
* [Proximal Policy Optimization (PPO)](https://arxiv.org/pdf/1707.06347.pdf) ([code](agents/ppo_agent.py))
* [Clipped Proximal Policy Optimization](https://arxiv.org/pdf/1707.06347.pdf) | **Distributed** ([code](agents/clipped_ppo_agent.py))
* [Direct Future Prediction (DFP)](https://arxiv.org/abs/1611.01779) | **Distributed** ([code](agents/dfp_agent.py))
* Behavioral Cloning (BC) ([code](agents/bc_agent.py))



3 changes: 3 additions & 0 deletions agents/__init__.py
@@ -16,13 +16,16 @@

from agents.actor_critic_agent import *
from agents.agent import *
from agents.bc_agent import *
from agents.bootstrapped_dqn_agent import *
from agents.clipped_ppo_agent import *
from agents.ddpg_agent import *
from agents.ddqn_agent import *
from agents.dfp_agent import *
from agents.dqn_agent import *
from agents.categorical_dqn_agent import *
from agents.human_agent import *
from agents.imitation_agent import *
from agents.mmc_agent import *
from agents.n_step_q_agent import *
from agents.naf_agent import *
104 changes: 41 additions & 63 deletions agents/agent.py
@@ -50,6 +50,7 @@ def __init__(self, env, tuning_parameters, replicated_device=None, task_id=0):
self.task_id = task_id
self.sess = tuning_parameters.sess
self.env = tuning_parameters.env_instance = env
self.imitation = False

# i/o dimensions
if not tuning_parameters.env.desired_observation_width or not tuning_parameters.env.desired_observation_height:
@@ -61,7 +62,12 @@ def __init__(self, env, tuning_parameters, replicated_device=None, task_id=0):
self.measurements_size = tuning_parameters.env.measurements_size = (self.measurements_size[0] + 1,)

# modules
self.memory = eval(tuning_parameters.memory + '(tuning_parameters)')
if tuning_parameters.agent.load_memory_from_file_path:
screen.log_title("Loading replay buffer from pickle. Pickle path: {}"
.format(tuning_parameters.agent.load_memory_from_file_path))
self.memory = read_pickle(tuning_parameters.agent.load_memory_from_file_path)
else:
self.memory = eval(tuning_parameters.memory + '(tuning_parameters)')
# self.architecture = eval(tuning_parameters.architecture)

self.has_global = replicated_device is not None
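
The hunk above restores the replay buffer from a pickle file when `load_memory_from_file_path` is set, and otherwise builds an empty one. The following is a minimal sketch of that branch. It assumes `read_pickle` is a thin wrapper around `pickle.load`; its real implementation is not shown in this excerpt. `write_pickle` and `build_memory` are hypothetical helpers, and a plain list stands in for the memory class.

```python
import os
import pickle
import tempfile


def write_pickle(path, obj):
    """Hypothetical counterpart to read_pickle: serialize obj to path."""
    with open(path, "wb") as handle:
        pickle.dump(obj, handle, protocol=pickle.HIGHEST_PROTOCOL)


def read_pickle(path):
    """Assumed behaviour: load a single pickled object from path."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


def build_memory(load_path, make_empty_memory):
    """Load a stored buffer if a path is given, otherwise create a fresh one."""
    if load_path:
        return read_pickle(load_path)
    return make_empty_memory()


# Round trip with a plain list standing in for the replay buffer.
with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, "replay_buffer.p")
    write_pickle(path, ["transition-0", "transition-1"])
    restored = build_memory(path, list)
    fresh = build_memory(None, list)
```

Unpickling executes code embedded in the file, so a buffer should only be loaded from a source you trust.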
@@ -121,11 +127,12 @@ def __init__(self, env, tuning_parameters, replicated_device=None, task_id=0):

def log_to_screen(self, phase):
# log to screen
if self.current_episode > 0:
if phase == RunPhase.TEST:
exploration = self.evaluation_exploration_policy.get_control_param()
else:
if self.current_episode >= 0:
if phase == RunPhase.TRAIN:
exploration = self.exploration_policy.get_control_param()
else:
exploration = self.evaluation_exploration_policy.get_control_param()

screen.log_dict(
OrderedDict([
("Worker", self.task_id),
@@ -135,7 +142,7 @@ def log_to_screen(self, phase):
("steps", self.total_steps_counter),
("training iteration", self.training_iteration)
]),
prefix="Heatup" if self.in_heatup else "Training" if phase == RunPhase.TRAIN else "Testing"
prefix=phase
)

def update_log(self, phase=RunPhase.TRAIN):
@@ -146,7 +153,7 @@ def update_log(self, phase=RunPhase.TRAIN):
# log all the signals to file
logger.set_current_time(self.current_episode)
logger.create_signal_value('Training Iter', self.training_iteration)
logger.create_signal_value('In Heatup', int(self.in_heatup))
logger.create_signal_value('In Heatup', int(phase == RunPhase.HEATUP))
logger.create_signal_value('ER #Transitions', self.memory.num_transitions())
logger.create_signal_value('ER #Episodes', self.memory.length())
logger.create_signal_value('Episode Length', self.current_episode_steps_counter)
@@ -197,24 +204,6 @@ def reset_game(self, do_not_reset_env=False):
network.curr_rnn_c_in = network.middleware_embedder.c_init
network.curr_rnn_h_in = network.middleware_embedder.h_init

def stack_observation(self, curr_stack, observation):
"""
Adds a new observation to an existing stack of observations from previous time-steps.
:param curr_stack: The current observations stack.
:param observation: The new observation
:return: The updated observation stack
"""

if curr_stack == []:
# starting an episode
curr_stack = np.vstack(np.expand_dims([observation] * self.tp.env.observation_stack_size, 0))
curr_stack = self.switch_axes_order(curr_stack, from_type='channels_first', to_type='channels_last')
else:
curr_stack = np.append(curr_stack, np.expand_dims(np.squeeze(observation), axis=-1), axis=-1)
curr_stack = np.delete(curr_stack, 0, -1)

return curr_stack

def preprocess_observation(self, observation):
"""
Preprocesses the given observation.
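
The hunk above removes the `stack_observation` method, and later hunks call a free function `stack_observation(curr_stack, observation, stack_size)` instead. Its new body is not shown in this excerpt, so the version below is only a sketch of the same sliding-window behaviour with that call signature. It builds the channels-last stack directly with `np.stack`, and it tests for an empty stack by length rather than comparing an array to `[]`.

```python
import numpy as np


def stack_observation(curr_stack, observation, stack_size):
    """Sketch: keep the last `stack_size` observations along the final axis."""
    observation = np.asarray(observation)
    if curr_stack is None or len(curr_stack) == 0:
        # Start of an episode: repeat the first observation to fill the stack.
        return np.stack([observation] * stack_size, axis=-1)
    # Drop the oldest frame and append the newest one as the last channel.
    newest = np.expand_dims(np.squeeze(observation), axis=-1)
    return np.concatenate([curr_stack[..., 1:], newest], axis=-1)


first = np.zeros((4, 3))
stack = stack_observation([], first, stack_size=4)           # shape (4, 3, 4)
stack = stack_observation(stack, np.ones((4, 3)), stack_size=4)
```

After the second call the last channel holds the new frame while the three older channels still hold copies of the first one.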
@@ -335,26 +324,6 @@ def preprocess_reward(self, reward):
reward = max(reward, self.tp.env.reward_clipping_min)
return reward

def switch_axes_order(self, observation, from_type='channels_first', to_type='channels_last'):
"""
transpose an observation axes from channels_first to channels_last or vice versa
:param observation: a numpy array
:param from_type: can be 'channels_first' or 'channels_last'
:param to_type: can be 'channels_first' or 'channels_last'
:return: a new observation with the requested axes order
"""
if from_type == to_type or len(observation.shape) == 1:
return observation
assert 2 <= len(observation.shape) <= 3, 'num axes of an observation must be 2 for a vector or 3 for an image'
assert type(observation) == np.ndarray, 'observation must be a numpy array'
if len(observation.shape) == 3:
if from_type == 'channels_first' and to_type == 'channels_last':
return np.transpose(observation, (1, 2, 0))
elif from_type == 'channels_last' and to_type == 'channels_first':
return np.transpose(observation, (2, 0, 1))
else:
return np.transpose(observation, (1, 0))

def act(self, phase=RunPhase.TRAIN):
"""
Take one step in the environment according to the network prediction and store the transition in memory
@@ -370,15 +339,15 @@ def act(self, phase=RunPhase.TRAIN):
is_first_transition_in_episode = (self.curr_state == [])
if is_first_transition_in_episode:
observation = self.preprocess_observation(self.env.observation)
observation = self.stack_observation([], observation)
observation = stack_observation([], observation, self.tp.env.observation_stack_size)

self.curr_state = {'observation': observation}
if self.tp.agent.use_measurements:
self.curr_state['measurements'] = self.env.measurements
if self.tp.agent.use_accumulated_reward_as_measurement:
self.curr_state['measurements'] = np.append(self.curr_state['measurements'], 0)

if self.in_heatup: # we do not have a stacked curr_state yet
if phase == RunPhase.HEATUP and not self.tp.heatup_using_network_decisions:
action = self.env.get_random_action()
else:
action, action_info = self.choose_action(self.curr_state, phase=phase)
@@ -394,11 +363,11 @@ def act(self, phase=RunPhase.TRAIN):
observation = self.preprocess_observation(result['observation'])

# plot action values online
if self.tp.visualization.plot_action_values_online and not self.in_heatup:
if self.tp.visualization.plot_action_values_online and phase != RunPhase.HEATUP:
self.plot_action_values_online()

# initialize the next state
observation = self.stack_observation(self.curr_state['observation'], observation)
observation = stack_observation(self.curr_state['observation'], observation, self.tp.env.observation_stack_size)

next_state = {'observation': observation}
if self.tp.agent.use_measurements and 'measurements' in result.keys():
@@ -407,7 +376,7 @@ def act(self, phase=RunPhase.TRAIN):
next_state['measurements'] = np.append(next_state['measurements'], self.total_reward_in_current_episode)

# store the transition only if we are training
if phase == RunPhase.TRAIN:
if phase == RunPhase.TRAIN or phase == RunPhase.HEATUP:
transition = Transition(self.curr_state, result['action'], shaped_reward, next_state, result['done'])
for key in action_info.keys():
transition.info[key] = action_info[key]
@@ -427,7 +396,7 @@ def act(self, phase=RunPhase.TRAIN):
self.update_log(phase=phase)
self.log_to_screen(phase=phase)

if phase == RunPhase.TRAIN:
if phase == RunPhase.TRAIN or phase == RunPhase.HEATUP:
self.reset_game()

self.current_episode += 1
@@ -462,11 +431,12 @@ def evaluate(self, num_episodes, keep_networks_synced=False):
for network in self.networks:
network.sync()

if self.tp.visualization.dump_gifs and self.total_reward_in_current_episode > max_reward_achieved:
if self.total_reward_in_current_episode > max_reward_achieved:
max_reward_achieved = self.total_reward_in_current_episode
frame_skipping = int(5/self.tp.env.frame_skip)
logger.create_gif(self.last_episode_images[::frame_skipping],
name='score-{}'.format(max_reward_achieved), fps=10)
if self.tp.visualization.dump_gifs:
logger.create_gif(self.last_episode_images[::frame_skipping],
name='score-{}'.format(max_reward_achieved), fps=10)

average_evaluation_reward += self.total_reward_in_current_episode
self.reset_game()
@@ -496,7 +466,7 @@ def improve(self):
screen.log_title("Starting heatup {}".format(self.task_id))
num_steps_required_for_one_training_batch = self.tp.batch_size * self.tp.env.observation_stack_size
for step in range(max(self.tp.num_heatup_steps, num_steps_required_for_one_training_batch)):
self.act()
self.act(phase=RunPhase.HEATUP)

# training phase
self.in_heatup = False
@@ -509,7 +479,12 @@ def improve(self):
# evaluate
evaluate_agent = (self.last_episode_evaluation_ran is not self.current_episode) and \
(self.current_episode % self.tp.evaluate_every_x_episodes == 0)
evaluate_agent = evaluate_agent or \
(self.imitation and self.training_iteration > 0 and
self.training_iteration % self.tp.evaluate_every_x_training_iterations == 0)

if evaluate_agent:
self.env.reset()
self.last_episode_evaluation_ran = self.current_episode
self.evaluate(self.tp.evaluation_episodes)

@@ -522,21 +497,24 @@ def improve(self):
self.save_model(model_snapshots_periods_passed)

# play and record in replay buffer
if self.tp.agent.step_until_collecting_full_episodes:
step = 0
while step < self.tp.agent.num_consecutive_playing_steps or self.memory.get_episode(-1).length() != 0:
self.act()
step += 1
else:
for step in range(self.tp.agent.num_consecutive_playing_steps):
self.act()
if self.tp.agent.collect_new_data:
if self.tp.agent.step_until_collecting_full_episodes:
step = 0
while step < self.tp.agent.num_consecutive_playing_steps or self.memory.get_episode(-1).length() != 0:
self.act()
step += 1
else:
for step in range(self.tp.agent.num_consecutive_playing_steps):
self.act()

# train
if self.tp.train:
for step in range(self.tp.agent.num_consecutive_training_steps):
loss = self.train()
self.loss.add_sample(loss)
self.training_iteration += 1
if self.imitation:
self.log_to_screen(RunPhase.TRAIN)
self.post_training_commands()

def save_model(self, model_id):
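
Two rules introduced by the `agent.py` hunks can be restated as small pure functions: transitions are now stored during heatup as well as training, and imitation agents also trigger evaluation every fixed number of training iterations. The sketch below is not code from the commit. The string values of `RunPhase` are assumptions, since the diff only shows that `TRAIN`, `TEST` and `HEATUP` exist and that the phase is passed as a log prefix. The episode check uses `!=` where the diff uses `is not`, because identity comparison of integers is not reliable in general.

```python
from enum import Enum


class RunPhase(Enum):
    # Member names come from the diff; the string values are assumed.
    HEATUP = "Heatup"
    TRAIN = "Training"
    TEST = "Testing"


def should_store_transition(phase):
    """Transitions are kept while training and while heating up, not while testing."""
    return phase in (RunPhase.TRAIN, RunPhase.HEATUP)


def should_evaluate(current_episode, last_episode_evaluation_ran,
                    evaluate_every_x_episodes, imitation=False,
                    training_iteration=0,
                    evaluate_every_x_training_iterations=1):
    """Mirror the two evaluation triggers in improve()."""
    episode_trigger = (last_episode_evaluation_ran != current_episode
                       and current_episode % evaluate_every_x_episodes == 0)
    iteration_trigger = (imitation and training_iteration > 0
                         and training_iteration
                         % evaluate_every_x_training_iterations == 0)
    return episode_trigger or iteration_trigger
```

Keeping these checks free of agent state makes them easy to exercise without an environment or a network.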