A simple and easy-to-use Python library for solving Markov decision processes (MDPs) and robust MDPs. The library consists mostly of helper functions around the CRAAM Python interface. In particular, it contains basic simulation routines, approximate dynamic programming through state aggregation, and the construction of MDPs from simulation.
The library supports standard finite- and infinite-horizon discounted MDPs [Puterman2005]. It assumes maximization over actions; the sets of states and actions must be finite.
The robust model extends regular MDPs [Iyengar2005]. The library allows modeling uncertainty in both the transitions and the rewards, unlike some published work on this topic. Uncertainty is modeled by adding an outcome to each action; the outcome is assumed to be minimized by nature, similar to [Filar1997].
In summary, the robust MDP problem being solved is:
v(s) = \max_{a \in \mathcal{A}} \min_{o \in \mathcal{O}} \sum_{s' \in \mathcal{S}} P(s,a,o,s') \left( r(s,a,o,s') + \gamma v(s') \right) ~.
Here, \mathcal{S} are the states, \mathcal{A} are the actions, \mathcal{O} are the outcomes, and \gamma is the discount factor.
The included algorithms are value iteration and modified policy iteration. The library supports both the plain worst-case over outcomes and a worst case with respect to a base distribution. The precise algorithms are implemented in C++ in CRAAM; see the project website for a detailed description.
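To make the objective above concrete, here is a small, purely illustrative value iteration in plain numpy. It is not the CRAAM implementation (which is in C++), and the arrays P and r below are randomly generated stand-ins for the transition probabilities P(s,a,o,s') and rewards r(s,a,o,s'):

import numpy as np

# dimensions of a toy problem: states, actions, outcomes
S, A, O, gamma = 4, 2, 3, 0.9
np.random.seed(0)
P = np.random.rand(A, O, S, S)
P /= P.sum(axis=-1, keepdims=True)       # normalize rows into probability distributions
r = np.random.rand(A, O, S, S)

v = np.zeros(S)
for _ in range(1000):
    # q[a,o,s] = sum_{s'} P(s,a,o,s') * ( r(s,a,o,s') + gamma * v(s') )
    q = (P * (r + gamma * v)).sum(axis=-1)
    # nature minimizes over outcomes, the agent maximizes over actions
    v_new = q.min(axis=1).max(axis=0)
    if np.abs(v_new - v).max() < 1e-6:
        break
    v = v_new
print(v)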
The algorithms that approximate MDPs through robust state aggregation are described in [Petrik2014]. The robust algorithm generalizes standard state aggregation by capturing the introduced model error through robust models.
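As a rough, simplified sketch of that idea (not the actual algorithm from [Petrik2014]): each aggregate state groups several original states, and during the backup nature is allowed to pick the worst member state of the chosen aggregate, which plays the role of the outcome. The toy MDP and all variable names below are made up for illustration:

import numpy as np

# toy fine-grained MDP: 4 states, 2 actions
np.random.seed(1)
S, A, gamma = 4, 2, 0.9
P = np.random.rand(A, S, S)
P /= P.sum(axis=-1, keepdims=True)
r = np.random.rand(A, S)

# aggregation map: states 0,1 form aggregate 0 and states 2,3 form aggregate 1
phi = np.array([0, 0, 1, 1])
n_agg = 2

v = np.zeros(n_agg)                      # value function over aggregate states
for _ in range(1000):
    # action values of the original states, evaluated with the aggregate values
    q = r + gamma * P.dot(v[phi])        # q[a, s]
    # the agent picks an action per aggregate state; nature then picks the
    # worst member state of that aggregate (the "outcome")
    v_new = np.array([q[:, phi == k].min(axis=1).max() for k in range(n_agg)])
    if np.abs(v_new - v).max() < 1e-6:
        break
    v = v_new
print(v)

Standard aggregation would instead average over the member states with fixed weights; the minimization above is what captures the aggregation error robustly.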
Requirements:
- craam 1.0+ (C++ implementation of methods for solving MDPs)
- Python 3.5+ (Python 2 is NOT supported)
- Setuptools 7.0
- Numpy 1.8+
- Scipy 0.13
- Cython 0.21+
The package has been tested only on Linux.
Optional dependencies:
- Matplotlib 1.0+ for plotting support
To install, run (use --user to install locally):
$ python setup.py install
To install in a development mode, execute:
$ python setup.py develop
The development installation will not copy project files to site-packages; any changes to the Python code will be reflected without the need to reinstall.
To test the installation, run the following python code:
import raam
import raam.test
raam.test.test()
It is also possible to run the tests from the command line as:
python -mraam.test
The project consists of the following main modules:
raam.robust
- a pure Python implementation of selected robust optimization methods
raam.simulator
- framework code for implementing a simulation-based MDP formulation and optimization
raam.samples
- methods for handling samples
raam.features
- methods for defining state features
raam.plotting
- basic plotting support
raam.examples
- example MDP domains
raam.test
- code unit tests
Methods for solving robust MDPs are provided by craam.robust.
See the CRAAM library for a simple example.
First, define a simulator for a simple MDP chain and sample from it.
import raam
import random
import itertools
class StatefulCounter(raam.simulator.Simulator):
    """
    State: position in chain
    Initial (decision) state: 0
    Actions: {plus, minus}
    Rewards: 90%: next position, 10%: this position in chain
    """
    def __init__(self):
        self.state = 0

    @property
    def discount(self):
        return 0.9

    def transition(self, state, action):
        if action == 'plus': act = 1
        elif action == 'minus': act = -1
        else: raise ValueError('Invalid action')
        # move along the chain with 90% probability, otherwise stay in place
        if random.random() <= 0.9: self.state = state + act
        else: self.state = state
        return state, self.state

    def end_condition(self, state):
        return False

    def initstates(self):
        return itertools.repeat(0)

    def actions(self, state):
        return ['plus', 'minus']
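For a quick sanity check, the simulator can also be stepped by hand; this simply calls the transition method defined above:

sim = StatefulCounter()
# with 90% probability this prints (0, 1), otherwise (0, 0)
print(sim.transition(0, 'plus'))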
The next step is to generate samples as follows:
horizon = 100
runs = 5
sim = StatefulCounter()
samples = sim.simulate(horizon,sim.random_policy(),runs)
print("Samples:\n", list(samples.samples()))
These samples use the raw state and action representation. The state is an integer in this case, but it could in principle be any Python object. To formulate an MDP, we therefore need to assign unique indices to the states, as follows:
from craam import crobust
from raam import features
# transform the samples to discrete samples (build a discrete view)
# each state in (-infinity, +infinity) is assigned a unique index by IdCache
sv = raam.SampleView(samples=samples, statemap=features.IdCache(), actmap=features.IdCache(), readonly=True)
ds = crobust.DiscreteSamples()
for i in sv.initialsamples():
    ds.add_initial(i)
for s in sv.samples():
    ds.add_sample(s.statefrom, s.action, s.stateto, s.reward, s.weight, s.step, s.run)
And finally, the following code will actually solve the MDP.
# construct MDP from the samples
smdp = crobust.SampledMDP()
smdp.add_samples(ds)
mdp = smdp.get_mdp(0.9)
print(mdp.mpi_jac())
Note that the value function and policy computed by the solver are expressed in terms of the internal state indices and must be mapped back to the original state representation. This is because the sampled robust MDP uses an internal representation that separates decision and expectation states in order to improve computational efficiency.
More examples are provided in the subdirectory examples.
[Filar1997] Filar, J., & Vrieze, K. (1997). Competitive Markov Decision Processes. Springer.
[Puterman2005] Puterman, M. L. (2005). Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons.
[Iyengar2005] Iyengar, G. N. (2005). Robust dynamic programming. Mathematics of Operations Research, 30(2), 257–280.
[Petrik2014] Petrik, M., & Subramanian, D. (2014). RAAM: The benefits of robustness in approximating aggregated MDPs in reinforcement learning. In Advances in Neural Information Processing Systems (NIPS).