Offline Evaluation Metrics Implementations (#9)
Adds inverse propensity scoring and doubly robust evaluation metrics
Emily Strong authored Jul 30, 2021
1 parent e7813ad commit 1a6797e
Showing 17 changed files with 536 additions and 194 deletions.
10 changes: 8 additions & 2 deletions CHANGELOG.txt
@@ -2,6 +2,12 @@
CHANGELOG
=========

-------------------------------------------------------------------------------
July 29, 2021 1.3.0
-------------------------------------------------------------------------------

- Added Inverse Propensity Scoring (IPS) and Doubly Robust Estimation (DR) CTR estimation methods.

-------------------------------------------------------------------------------
July 12, 2021 1.2.2
-------------------------------------------------------------------------------
@@ -18,7 +24,7 @@ June 23, 2021 1.2.1
April 16, 2021 1.2.0
-------------------------------------------------------------------------------

- Fixed deprecation warning of numpy 1.20 dtype

-------------------------------------------------------------------------------
April 13, 2021 1.1.0
@@ -37,4 +43,4 @@ February 1, 2021 1.0.0
December 1, 2020
-------------------------------------------------------------------------------

- Development starts.
14 changes: 10 additions & 4 deletions README.md
@@ -23,10 +23,12 @@ Jurity is developed by the Artificial Intelligence Center of Excellence at Fidel
## Recommenders Metrics
* [AUC: Area Under the Curve](https://fidelity.github.io/jurity/about_reco.html#auc-area-under-the-curve)
* [CTR: Click-through rate](https://fidelity.github.io/jurity/about_reco.html#ctr-click-through-rate)
* [DR: Doubly robust estimation](https://fidelity.github.io/jurity/about_reco.html#ctr-click-through-rate)
* [IPS: Inverse propensity scoring](https://fidelity.github.io/jurity/about_reco.html#ctr-click-through-rate)
* [MAP@K: Mean Average Precision](https://fidelity.github.io/jurity/about_reco.html#map-mean-average-precision)
* [NDCG: Normalized discounted cumulative gain](https://fidelity.github.io/jurity/about_reco.html#ndcg-normalized-discounted-cumulative-gain)
* [Precision@K](https://fidelity.github.io/jurity/about_reco.html#precision)
* [Recall@K](https://fidelity.github.io/jurity/about_reco.html#recall)

## Classification Metrics
* [Accuracy](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html)
@@ -104,18 +106,22 @@ predicted = pd.DataFrame({"user_id": [1, 2, 3, 4], "item_id": [1, 2, 2, 3], "cli
# Metrics
auc = BinaryRecoMetrics.AUC(click_column="clicks")
ctr = BinaryRecoMetrics.CTR(click_column="clicks")
dr = BinaryRecoMetrics.CTR(click_column="clicks", estimation='dr')
ips = BinaryRecoMetrics.CTR(click_column="clicks", estimation='ips')
map_k = RankingRecoMetrics.MAP(click_column="clicks", k=2)
ncdg_k = RankingRecoMetrics.NDCG(click_column="clicks", k=3)
precision_k = RankingRecoMetrics.Precision(click_column="clicks", k=2)
recall_k = RankingRecoMetrics.Recall(click_column="clicks", k=2)

# Scores
print("AUC:", auc.get_score(actual, predicted))
print("CTR:", ctr.get_score(actual, predicted))
print("Doubly Robust:", dr.get_score(actual, predicted))
print("IPS:", ips.get_score(actual, predicted))
print("MAP@K:", map_k.get_score(actual, predicted))
print("NCDG:", ncdg_k.get_score(actual, predicted))
print("Precision@K:", precision_k.get_score(actual, predicted))
print("Recall@K:", recall_k.get_score(actual, predicted))
print("MAP@K:", map_k.get_score(actual, predicted))
```

## Quick Start: Classification Evaluation
2 changes: 1 addition & 1 deletion docs/.buildinfo
@@ -1,4 +1,4 @@
# Sphinx build info version 1
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
config: 2e1ad5d9e655c410a8bf3d73cfd7b84d
config: 01e4225d941cab66da9ab9ff047a8f1f
tags: 645f666f9bcd5a90fca523b33c5a78b7
32 changes: 31 additions & 1 deletion docs/_sources/about_reco.rst.txt
@@ -17,13 +17,43 @@ Binary recommender metrics directly measure the click interaction.
CTR: Click-through Rate
^^^^^^^^^^^^^^^^^^^^^^^

CTR offers three reward estimation methods.

Direct estimation ("matching") measures the accuracy of the recommendations over the subset of user-item pairs that appear in both actual ratings and recommendations.

Let :math:`M` denote the set of user-item pairs that appear in both actual ratings and recommendations, and :math:`C(M_i)` be an indicator function that produces :math:`1` if the user clicked on the item, and :math:`0` if they didn't.

.. math::
CTR = \frac{1}{\left | M \right |}\sum_{i=1}^{\left | M \right |} C(M_i)
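
For intuition, a minimal pandas sketch of the direct estimate is shown below. The ``direct_ctr``
helper is hypothetical (not Jurity's API) and assumes the ``user_id``, ``item_id``, and ``clicks``
columns used in the README quick start.

.. code-block:: python

    import pandas as pd

    def direct_ctr(actual: pd.DataFrame, predicted: pd.DataFrame) -> float:
        # Restrict to user-item pairs present in both actual ratings and recommendations
        matched = actual.merge(predicted, on=["user_id", "item_id"],
                               suffixes=("_actual", "_pred"))
        # Average the observed click indicator over the matched pairs
        return float(matched["clicks_actual"].mean())
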
Inverse propensity scoring (IPS) weights each user-item pair that appears in the historic data by the inverse of
the probability that the historic policy recommended that item. Because of this inversion, less likely items are
given more weight.

.. math::
IPS = \frac{1}{n} \sum r_a \times \frac{I(\hat{a} = a)}{P(a|x,h)}

In this calculation, :math:`n` is the total size of the test data; :math:`r_a` is the observed reward;
:math:`\hat{a}` is the recommended item; :math:`I(\hat{a} = a)` is an indicator that is :math:`1` when the
user-item pair appears in the historic data and :math:`0` otherwise; and :math:`P(a|x,h)` is the probability
of the item being recommended for the test context given the historic data.
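
As a rough illustration of the formula (not Jurity's internal implementation), the IPS estimate can be
computed from per-sample rewards, match indicators, and historic propensities:

.. code-block:: python

    import numpy as np

    def ips_estimate(rewards, matches, propensities):
        # rewards: observed reward r_a for each test sample
        # matches: 1 if the recommended item appears in the historic data, else 0
        # propensities: P(a|x,h), the historic policy's probability of the item
        rewards, matches, propensities = (np.asarray(v, dtype=float)
                                          for v in (rewards, matches, propensities))
        return float(np.mean(rewards * matches / propensities))

    # Rare items (small propensity) dominate when they match: (1/0.5 + 1/0.2) / 4 = 1.75
    print(ips_estimate([1, 0, 1, 1], [1, 1, 0, 1], [0.5, 0.1, 0.25, 0.2]))
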

Doubly robust estimation (DR) combines the directly predicted values with a correction based on how
likely an item was to be recommended by the historic policy if the user saw the item in the historic data.

.. math::
DR = \frac{1}{n} \sum \left( \hat{r}_a + \frac{(r_a - \hat{r}_a) \, I(\hat{a} = a)}{P(a|x,h)} \right)

In this calculation, :math:`\hat{r}_a` is the predicted reward.

At a high level, doubly robust estimation combines a direct estimate with an IPS-like correction if historic data is
available. If historic data is not available, the second term is 0 and only the predicted reward is used for the
user-item pair.
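
Continuing the hypothetical sketch above, the doubly robust estimate adds the propensity-weighted
correction to the predicted reward, so unmatched samples contribute only :math:`\hat{r}_a`:

.. code-block:: python

    def dr_estimate(rewards, predicted_rewards, matches, propensities):
        # predicted_rewards: the model's estimate of r_a for each recommended item
        rewards, predicted_rewards, matches, propensities = (
            np.asarray(v, dtype=float)
            for v in (rewards, predicted_rewards, matches, propensities))
        # IPS-like correction, zeroed out where there is no historic match
        correction = (rewards - predicted_rewards) * matches / propensities
        return float(np.mean(predicted_rewards + correction))
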


The IPS and DR implementations are based on: Dudík, Miroslav, John Langford, and Lihong Li.
"Doubly Robust Policy Evaluation and Learning." Proceedings of the 28th International Conference on
Machine Learning (ICML), 2011. Available as arXiv preprint arXiv:1103.4601.

AUC: Area Under the Curve
^^^^^^^^^^^^^^^^^^^^^^^^^

7 changes: 1 addition & 6 deletions docs/_static/pygments.css
@@ -1,10 +1,5 @@
pre { line-height: 125%; }
td.linenos .normal { color: inherit; background-color: transparent; padding-left: 5px; padding-right: 5px; }
span.linenos { color: inherit; background-color: transparent; padding-left: 5px; padding-right: 5px; }
td.linenos .special { color: #000000; background-color: #ffffc0; padding-left: 5px; padding-right: 5px; }
span.linenos.special { color: #000000; background-color: #ffffc0; padding-left: 5px; padding-right: 5px; }
.highlight .hll { background-color: #ffffcc }
.highlight { background: #f8f8f8; }
.highlight .c { color: #408080; font-style: italic } /* Comment */
.highlight .err { border: 1px solid #FF0000 } /* Error */
.highlight .k { color: #008000; font-weight: bold } /* Keyword */
22 changes: 21 additions & 1 deletion docs/about_reco.html
@@ -184,10 +184,30 @@ <h2>Binary Recommender Metrics<a class="headerlink" href="#binary-recommender-me
<p>Binary recommender metrics directly measure the click interaction.</p>
<div class="section" id="ctr-click-through-rate">
<h3>CTR: Click-through Rate<a class="headerlink" href="#ctr-click-through-rate" title="Permalink to this headline"></a></h3>
<p>CTR measures the accuracy of the recommendations over the subset of user-item pairs that appear in both actual ratings and recommendations.</p>
<p>CTR offers three reward estimation methods.</p>
<p>Direct estimation (“matching”) measures the accuracy of the recommendations over the subset of user-item pairs that appear in both actual ratings and recommendations.</p>
<p>Let <span class="math notranslate nohighlight">\(M\)</span> denote the set of user-item pairs that appear in both actual ratings and recommendations, and <span class="math notranslate nohighlight">\(C(M_i)\)</span> be an indicator function that produces <span class="math notranslate nohighlight">\(1\)</span> if the user clicked on the item, and <span class="math notranslate nohighlight">\(0\)</span> if they didn’t.</p>
<div class="math notranslate nohighlight">
\[CTR = \frac{1}{\left | M \right |}\sum_{i=1}^{\left | M \right |} C(M_i)\]</div>
<p>Inverse propensity scoring (IPS) weights the items by how likely they were to be recommended by the historic policy
if the user saw the item in the historic data. Due to the probability inversion, less likely items are given more weight.</p>
<div class="math notranslate nohighlight">
\[IPS = \frac{1}{n} \sum r_a \times \frac{I(\hat{a} = a)}{P(a|x,h)}\]</div>
<p>In this calculation: n is the total size of the test data; <span class="math notranslate nohighlight">\(r_a\)</span> is the observed reward;
<span class="math notranslate nohighlight">\(\hat{a}\)</span> is the recommended item; <span class="math notranslate nohighlight">\(I(\hat{a} = a}\)</span> is a boolean of whether the user-item pair has
historic data; and <span class="math notranslate nohighlight">\(P(a|x,h)\)</span> is the probability of the item being recommended for the test context given
the historic data.</p>
<p>Doubly robust estimation (DR) combines the directly predicted values with a correction based on how
likely an item was to be recommended by the historic policy if the user saw the item in the historic data.</p>
<div class="math notranslate nohighlight">
\[DR = \frac{1}{n} \sum \left( \hat{r}_a + \frac{(r_a - \hat{r}_a) \, I(\hat{a} = a)}{P(a|x,h)} \right)\]</div>
<p>In this calculation, <span class="math notranslate nohighlight">\(\hat{r}_a\)</span> is the predicted reward.</p>
<p>At a high level, doubly robust estimation combines a direct estimate with an IPS-like correction if historic data is
available. If historic data is not available, the second term is 0 and only the predicted reward is used for the
user-item pair.</p>
<p>The IPS and DR implementations are based on: Dudík, Miroslav, John Langford, and Lihong Li.
“Doubly robust policy evaluation and learning.” Proceedings of the 28th International Conference on International
Conference on Machine Learning. 2011. Available as arXiv preprint arXiv:1103.4601</p>
</div>
<div class="section" id="auc-area-under-the-curve">
<h3>AUC: Area Under the Curve<a class="headerlink" href="#auc-area-under-the-curve" title="Permalink to this headline"></a></h3>