This repository contains an implementation of the Representation Balancing MDP (RepBM) off-policy policy evaluation (OPPE) estimator from the paper Representation Balancing MDPs for Off-Policy Policy Evaluation. The code is written in Python 3.6 using PyTorch 0.4.1 and NumPy 1.14.2.
We implemented RepBM using neural networks as the function approximator and focus on the deterministic
transition case (or stochastic transitions in a tabular state space). The RepBM model is defined in
. The core components of the RepBM algorithm, such as the loss function, are implemented
in the mdpmodel_train
function in
. Hyper-parameters of the neural network model are specified
in src/
Example domains from the experiment section of the paper are included in this repository.
We use the CartPole-v0 and MountainCar-v0 domains from OpenAI Gym.
An example of running the experiment in the CartPole domain:
$ python
$ python
will learn a near-optimal value function and save it in the directory
. The greedy policy based on the learned value function is used as the evaluation
policy, and the epsilon-greedy policy serves as the behavior policy. As an example, we also include the policies used in
the experiment section of the paper, in target_policies
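
The relationship between the two policies can be illustrated with a minimal sketch. This is not the repository's code: the function names, the Q-values, and the value of epsilon below are hypothetical and chosen only for illustration. It assumes a discrete action space and a list of Q-values for a single state.

```python
import random


def greedy_action(q_values):
    """Return the index of the largest Q-value (ties go to the lowest index)."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities of an epsilon-greedy policy around the greedy action."""
    n = len(q_values)
    # Every action receives epsilon / n; the greedy action gets the rest.
    probs = [epsilon / n] * n
    probs[greedy_action(q_values)] += 1.0 - epsilon
    return probs


def sample_action(probs, rng=random):
    """Draw one action index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Hypothetical Q-values for one state of a two-action domain such as CartPole.
q = [0.2, 1.5]

# Behavior policy: epsilon-greedy (epsilon = 0.2 is an assumed value).
behavior = epsilon_greedy_probs(q, epsilon=0.2)

# Evaluation policy: greedy, i.e. the epsilon = 0 special case.
evaluation = epsilon_greedy_probs(q, epsilon=0.0)

print("behavior:", behavior)
print("evaluation:", evaluation)
print("sampled behavior action:", sample_action(behavior))
```

With epsilon set to 0 the behavior policy coincides with the evaluation policy; larger values make the logged data cover actions the evaluation policy would never take.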
To run the experiment across several values of the hyper-parameter alpha:
$ python
The HIV simulator and an FQI learning algorithm are implemented in the directory hiv_domain
The code is modified from RLPy and the Harvard DTAK group's implementation. To run the experiment:
$ python hiv_domain/
$ python
$ python
$ python
title = {Representation Balancing MDPs for Off-policy Policy Evaluation},
author = {Liu, Yao and Gottesman, Omer and Raghu, Aniruddh and Komorowski, Matthieu and Faisal, Aldo A and Doshi-Velez, Finale and Brunskill, Emma},
booktitle = {Advances in Neural Information Processing Systems 31},
editor = {S. Bengio and H. Wallach and H. Larochelle and K. Grauman and N. Cesa-Bianchi and R. Garnett},
pages = {2645--2654},
year = {2018},
publisher = {Curran Associates, Inc.},
url = {}