Implementations and examples of common offline policy evaluation methods in Python. For more information on offline policy evaluation see this tutorial.
```
pip install offline-evaluation
```
```python
import pandas as pd

from ope.methods import doubly_robust
```
Get some historical logs generated by a previous policy (`action_prob` is the probability the logging policy assigned to the logged action):
```python
df = pd.DataFrame([
    {"context": {"p_fraud": 0.08}, "action": "blocked", "action_prob": 0.90, "reward": 0},
    {"context": {"p_fraud": 0.03}, "action": "allowed", "action_prob": 0.90, "reward": 20},
    {"context": {"p_fraud": 0.02}, "action": "allowed", "action_prob": 0.90, "reward": 10},
    {"context": {"p_fraud": 0.01}, "action": "allowed", "action_prob": 0.90, "reward": 20},
    {"context": {"p_fraud": 0.09}, "action": "allowed", "action_prob": 0.10, "reward": -20},
    {"context": {"p_fraud": 0.40}, "action": "allowed", "action_prob": 0.10, "reward": -10},
])
```
Define a function that computes `P(action | context)` under the new policy:
```python
def action_probabilities(context):
    epsilon = 0.10
    if context["p_fraud"] > 0.10:
        return {"allowed": epsilon, "blocked": 1 - epsilon}
    return {"allowed": 1 - epsilon, "blocked": epsilon}
```
Conduct the evaluation:
```python
doubly_robust.evaluate(df, action_probabilities)
> {'expected_reward_logging_policy': 3.33, 'expected_reward_new_policy': -28.47}
```
These estimates suggest the new policy would perform substantially worse than the logging policy. Instead of A/B testing this new policy online, it would be better to test some other policies offline first.
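For intuition, a doubly robust estimate combines a direct (reward-model) term with an inverse-propensity-weighted correction on the logged rewards; the logging-policy value above matches the mean of the logged rewards (20/6 ≈ 3.33). Below is a minimal sketch of that estimator, not the library's implementation, and the per-action mean reward model used here is a stand-in assumption for whatever reward model the library fits:

```python
def dr_estimate(df, action_probabilities, reward_model):
    """Minimal doubly robust sketch: direct-method term plus an
    importance-weighted correction based on the logged action."""
    total = 0.0
    for _, row in df.iterrows():
        new_probs = action_probabilities(row["context"])
        # Direct-method term: expected model reward under the new policy.
        dm = sum(p * reward_model(row["context"], a) for a, p in new_probs.items())
        # IPS correction: reweight the residual of the logged reward.
        weight = new_probs[row["action"]] / row["action_prob"]
        total += dm + weight * (row["reward"] - reward_model(row["context"], row["action"]))
    return total / len(df)

# Stand-in reward model (an assumption): predict the per-action mean logged reward.
action_means = df.groupby("action")["reward"].mean().to_dict()
dr_estimate(df, action_probabilities, lambda context, action: action_means.get(action, 0.0))
```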
See examples for more detailed tutorials.

Supported methods:
- Inverse propensity scoring (a minimal sketch of this estimator follows the list below)
- Direct method
- Doubly robust (paper)
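As a reference point, here is a minimal sketch of the inverse propensity scoring estimator mentioned above (not the library's code): it reweights each logged reward by the ratio of the new policy's probability of the logged action to the logging policy's probability.

```python
def ips_estimate(df, action_probabilities):
    """Inverse propensity scoring: average of logged rewards reweighted
    by new-policy probability / logging-policy probability."""
    weights = df.apply(
        lambda row: action_probabilities(row["context"])[row["action"]] / row["action_prob"],
        axis=1,
    )
    return (weights * df["reward"]).mean()

ips_estimate(df, action_probabilities)
```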