Learning to Assign Credit in Input-driven Environments (LACIE) reduces the variance of advantage estimates in noisy MDPs by using a hindsight distribution over input sequences.
Input-driven MDPs are Markov decision processes governed not only by the agent's actions but also by stochastic, exogenous input processes [1]. These environments are inherently high-variance, which makes it hard to learn an optimal policy.
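For intuition, here is a minimal toy sketch (not part of this repository; the queueing dynamics are assumed purely for illustration) of an environment whose transitions depend on both the agent's action and an exogenous random input, so the same policy can receive very different returns depending on the input sequence:

```python
import numpy as np

# Toy input-driven MDP (illustration only, not part of this repo):
# the next state depends on the agent's action AND an exogenous input z_t.
class ToyInputDrivenEnv:
    def __init__(self, seed=0):
        self.rng = np.random.default_rng(seed)
        self.queue = 0.0

    def reset(self):
        self.queue = 0.0
        return self.queue

    def step(self, action):
        z = self.rng.exponential(scale=1.0)             # exogenous input, e.g. a job arrival
        self.queue = max(self.queue - action, 0.0) + z  # dynamics driven by both action and z
        reward = -self.queue                            # identical policies can see very
        return self.queue, reward, False, {"input": z}  # different returns across input draws
```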
This repository implements:
- The input-dependent baseline proposed in [1].
- LACIE, an algorithm that learns to weight the advantages of each rollout in hindsight with respect to future input sequences (see the sketch after this list).
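Very roughly, the hindsight weighting can be pictured as below. This is only an illustrative sketch, not the repository's actual implementation: `hindsight_weighted_advantages`, `weight_net`, and the softmax normalization are assumptions made for exposition.

```python
import torch

def hindsight_weighted_advantages(returns, values, input_seq, weight_net):
    """Illustration only (not this repo's API): reweight per-rollout advantages
    with a network conditioned on the future input sequence observed in hindsight."""
    advantages = returns - values               # ordinary advantage estimates, one per rollout
    scores = weight_net(input_seq).squeeze(-1)  # one score per rollout, computed from its input sequence
    weights = torch.softmax(scores, dim=0)      # normalize scores into weights over rollouts
    return weights.detach() * advantages        # scale each rollout's advantage by its hindsight weight
```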
- Install PyTorch
pip install torch torchvision
- Install TensorFlow 2
pip install tensorflow==2.2
or
pip install tensorflow-gpu==2.2
- Install OpenAI baselines (TensorFlow 2 branch)
git clone https://github.com/openai/baselines.git -b tf2 && \
cd baselines && \
pip install -e .
Note: I haven't tested the code with TensorFlow 1 yet, but it should work as well.
- Install the Park platform. I modified the platform slightly to make it compatible with OpenAI baselines.
git clone https://github.com/lehduong/park &&\
cd park && \
pip install -e .
See `scripts` for examples.
Reward of A2C+LACIE (yellow) vs. A2C (blue) during training:
Value loss of A2C+LACIE (yellow) vs. A2C (blue) during training:
[1] Mao et al. Variance Reduction for Reinforcement Learning in Input-Driven Environments. ICLR 2019.
The starter code is based on ikostrikov's repository.