Offline Evaluation Metrics Implementations (#9)
Adds inverse propensity scoring and doubly robust evaluation metrics
Emily Strong authored Jul 30, 2021
1 parent e7813ad commit 1a6797e
Showing 17 changed files with 536 additions and 194 deletions.
10 changes: 8 additions & 2 deletions CHANGELOG.txt
@@ -2,6 +2,12 @@
CHANGELOG
=========

-------------------------------------------------------------------------------
July 29, 2021 1.3.0
-------------------------------------------------------------------------------

- Added Inverse Propensity Scoring (IPS) and Doubly Robust Estimation (DR) CTR estimation methods.

-------------------------------------------------------------------------------
July 12, 2021 1.2.2
-------------------------------------------------------------------------------
@@ -18,7 +24,7 @@ June 23, 2021 1.2.1
April 16, 2021 1.2.0
-------------------------------------------------------------------------------

- Fixed deprecation warning of numpy 1.20 dtype

-------------------------------------------------------------------------------
April 13, 2021 1.1.0
@@ -37,4 +43,4 @@ February 1, 2021 1.0.0
December 1, 2020
-------------------------------------------------------------------------------

- Development starts.
14 changes: 10 additions & 4 deletions README.md
@@ -23,10 +23,12 @@ Jurity is developed by the Artificial Intelligence Center of Excellence at Fidel
## Recommenders Metrics
* [AUC: Area Under the Curve](https://fidelity.github.io/jurity/about_reco.html#auc-area-under-the-curve)
* [CTR: Click-through rate](https://fidelity.github.io/jurity/about_reco.html#ctr-click-through-rate)
* [DR: Doubly robust estimation](https://fidelity.github.io/jurity/about_reco.html#ctr-click-through-rate)
* [IPS: Inverse propensity scoring](https://fidelity.github.io/jurity/about_reco.html#ctr-click-through-rate)
* [MAP@K: Mean Average Precision](https://fidelity.github.io/jurity/about_reco.html#map-mean-average-precision)
* [NDCG: Normalized discounted cumulative gain](https://fidelity.github.io/jurity/about_reco.html#ndcg-normalized-discounted-cumulative-gain)
* [Precision@K](https://fidelity.github.io/jurity/about_reco.html#precision)
* [Recall@K](https://fidelity.github.io/jurity/about_reco.html#recall)

## Classification Metrics
* [Accuracy](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html)
@@ -104,18 +106,22 @@ predicted = pd.DataFrame({"user_id": [1, 2, 3, 4], "item_id": [1, 2, 2, 3], "cli
# Metrics
auc = BinaryRecoMetrics.AUC(click_column="clicks")
ctr = BinaryRecoMetrics.CTR(click_column="clicks")
dr = BinaryRecoMetrics.CTR(click_column="clicks", estimation='dr')
ips = BinaryRecoMetrics.CTR(click_column="clicks", estimation='ips')
map_k = RankingRecoMetrics.MAP(click_column="clicks", k=2)
ncdg_k = RankingRecoMetrics.NDCG(click_column="clicks", k=3)
precision_k = RankingRecoMetrics.Precision(click_column="clicks", k=2)
recall_k = RankingRecoMetrics.Recall(click_column="clicks", k=2)

# Scores
print("AUC:", auc.get_score(actual, predicted))
print("CTR:", ctr.get_score(actual, predicted))
print("Doubly Robust:", dr.get_score(actual, predicted))
print("IPS:", ips.get_score(actual, predicted))
print("MAP@K:", map_k.get_score(actual, predicted))
print("NCDG:", ncdg_k.get_score(actual, predicted))
print("Precision@K:", precision_k.get_score(actual, predicted))
print("Recall@K:", recall_k.get_score(actual, predicted))
print("MAP@K:", map_k.get_score(actual, predicted))
```

## Quick Start: Classification Evaluation
2 changes: 1 addition & 1 deletion docs/.buildinfo
@@ -1,4 +1,4 @@
# Sphinx build info version 1
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
config: 2e1ad5d9e655c410a8bf3d73cfd7b84d
config: 01e4225d941cab66da9ab9ff047a8f1f
tags: 645f666f9bcd5a90fca523b33c5a78b7
32 changes: 31 additions & 1 deletion docs/_sources/about_reco.rst.txt
@@ -17,13 +17,43 @@ Binary recommender metrics directly measure the click interaction.
CTR: Click-through Rate
^^^^^^^^^^^^^^^^^^^^^^^

CTR offers three reward estimation methods.

Direct estimation ("matching") measures the accuracy of the recommendations over the subset of user-item pairs that appear in both actual ratings and recommendations.

Let :math:`M` denote the set of user-item pairs that appear in both actual ratings and recommendations, and :math:`C(M_i)` be an indicator function that produces :math:`1` if the user clicked on the item, and :math:`0` if they didn't.

.. math::
CTR = \frac{1}{\left | M \right |}\sum_{i=1}^{\left | M \right |} C(M_i)
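
For intuition, a minimal pandas sketch of the direct estimate is shown below. The ``direct_ctr``
helper is hypothetical (not Jurity's API) and assumes the ``user_id``, ``item_id``, and ``clicks``
columns used in the README quick start.

.. code-block:: python

    import pandas as pd

    def direct_ctr(actual: pd.DataFrame, predicted: pd.DataFrame) -> float:
        # Restrict to user-item pairs present in both actual ratings and recommendations
        matched = actual.merge(predicted, on=["user_id", "item_id"],
                               suffixes=("_actual", "_pred"))
        # Average the observed click indicator over the matched pairs
        return float(matched["clicks_actual"].mean())
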
Inverse propensity scoring (IPS) weights each user-item pair that appears in the historic data by the inverse of
the probability that the historic policy recommended that item. Because of this inversion, less likely items are
given more weight.

.. math::
IPS = \frac{1}{n} \sum r_a \times \frac{I(\hat{a} = a)}{P(a|x,h)}

In this calculation, :math:`n` is the total size of the test data; :math:`r_a` is the observed reward;
:math:`\hat{a}` is the recommended item; :math:`I(\hat{a} = a)` is an indicator that is :math:`1` when the
user-item pair appears in the historic data and :math:`0` otherwise; and :math:`P(a|x,h)` is the probability
of the item being recommended for the test context given the historic data.
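
As a rough illustration of the formula (not Jurity's internal implementation), the IPS estimate can be
computed from per-sample rewards, match indicators, and historic propensities:

.. code-block:: python

    import numpy as np

    def ips_estimate(rewards, matches, propensities):
        # rewards: observed reward r_a for each test sample
        # matches: 1 if the recommended item appears in the historic data, else 0
        # propensities: P(a|x,h), the historic policy's probability of the item
        rewards, matches, propensities = (np.asarray(v, dtype=float)
                                          for v in (rewards, matches, propensities))
        return float(np.mean(rewards * matches / propensities))

    # Rare items (small propensity) dominate when they match: (1/0.5 + 1/0.2) / 4 = 1.75
    print(ips_estimate([1, 0, 1, 1], [1, 1, 0, 1], [0.5, 0.1, 0.25, 0.2]))
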

Doubly robust estimation (DR) combines the directly predicted values with a correction based on how
likely an item was to be recommended by the historic policy if the user saw the item in the historic data.

.. math::
DR = \frac{1}{n} \sum \left( \hat{r}_a + \frac{(r_a - \hat{r}_a) \, I(\hat{a} = a)}{P(a|x,h)} \right)

In this calculation, :math:`\hat{r}_a` is the predicted reward.

At a high level, doubly robust estimation combines a direct estimate with an IPS-like correction if historic data is
available. If historic data is not available, the second term is 0 and only the predicted reward is used for the
user-item pair.
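
Continuing the hypothetical sketch above, the doubly robust estimate adds the propensity-weighted
correction to the predicted reward, so unmatched samples contribute only :math:`\hat{r}_a`:

.. code-block:: python

    def dr_estimate(rewards, predicted_rewards, matches, propensities):
        # predicted_rewards: the model's estimate of r_a for each recommended item
        rewards, predicted_rewards, matches, propensities = (
            np.asarray(v, dtype=float)
            for v in (rewards, predicted_rewards, matches, propensities))
        # IPS-like correction, zeroed out where there is no historic match
        correction = (rewards - predicted_rewards) * matches / propensities
        return float(np.mean(predicted_rewards + correction))
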


The IPS and DR implementations are based on: Dudík, Miroslav, John Langford, and Lihong Li.
"Doubly Robust Policy Evaluation and Learning." Proceedings of the 28th International Conference on
Machine Learning (ICML), 2011. Available as arXiv preprint arXiv:1103.4601.

AUC: Area Under the Curve
^^^^^^^^^^^^^^^^^^^^^^^^^

7 changes: 1 addition & 6 deletions docs/_static/pygments.css
@@ -1,10 +1,5 @@
pre { line-height: 125%; }
td.linenos .normal { color: inherit; background-color: transparent; padding-left: 5px; padding-right: 5px; }
span.linenos { color: inherit; background-color: transparent; padding-left: 5px; padding-right: 5px; }
td.linenos .special { color: #000000; background-color: #ffffc0; padding-left: 5px; padding-right: 5px; }
span.linenos.special { color: #000000; background-color: #ffffc0; padding-left: 5px; padding-right: 5px; }
.highlight .hll { background-color: #ffffcc }
.highlight { background: #f8f8f8; }
.highlight .c { color: #408080; font-style: italic } /* Comment */
.highlight .err { border: 1px solid #FF0000 } /* Error */
.highlight .k { color: #008000; font-weight: bold } /* Keyword */
22 changes: 21 additions & 1 deletion docs/about_reco.html
@@ -184,10 +184,30 @@ <h2>Binary Recommender Metrics<a class="headerlink" href="#binary-recommender-me
<p>Binary recommender metrics directly measure the click interaction.</p>
<div class="section" id="ctr-click-through-rate">
<h3>CTR: Click-through Rate<a class="headerlink" href="#ctr-click-through-rate" title="Permalink to this headline"></a></h3>
<p>CTR measures the accuracy of the recommendations over the subset of user-item pairs that appear in both actual ratings and recommendations.</p>
<p>CTR offers three reward estimation methods.</p>
<p>Direct estimation (“matching”) measures the accuracy of the recommendations over the subset of user-item pairs that appear in both actual ratings and recommendations.</p>
<p>Let <span class="math notranslate nohighlight">\(M\)</span> denote the set of user-item pairs that appear in both actual ratings and recommendations, and <span class="math notranslate nohighlight">\(C(M_i)\)</span> be an indicator function that produces <span class="math notranslate nohighlight">\(1\)</span> if the user clicked on the item, and <span class="math notranslate nohighlight">\(0\)</span> if they didn’t.</p>
<div class="math notranslate nohighlight">
\[CTR = \frac{1}{\left | M \right |}\sum_{i=1}^{\left | M \right |} C(M_i)\]</div>
<p>Inverse propensity scoring (IPS) weights the items by how likely they were to be recommended by the historic policy
if the user saw the item in the historic data. Due to the probability inversion, less likely items are given more weight.</p>
<div class="math notranslate nohighlight">
\[IPS = \frac{1}{n} \sum r_a \times \frac{I(\hat{a} = a)}{P(a|x,h)}\]</div>
<p>In this calculation: n is the total size of the test data; <span class="math notranslate nohighlight">\(r_a\)</span> is the observed reward;
<span class="math notranslate nohighlight">\(\hat{a}\)</span> is the recommended item; <span class="math notranslate nohighlight">\(I(\hat{a} = a}\)</span> is a boolean of whether the user-item pair has
historic data; and <span class="math notranslate nohighlight">\(P(a|x,h)\)</span> is the probability of the item being recommended for the test context given
the historic data.</p>
<p>Doubly robust estimation (DR) combines the directly predicted values with a correction based on how
likely an item was to be recommended by the historic policy if the user saw the item in the historic data.</p>
<div class="math notranslate nohighlight">
\[DR = \frac{1}{n} \sum \left( \hat{r}_a + \frac{(r_a - \hat{r}_a) \, I(\hat{a} = a)}{P(a|x,h)} \right)\]</div>
<p>In this calculation, <span class="math notranslate nohighlight">\(\hat{r}_a\)</span> is the predicted reward.</p>
<p>At a high level, doubly robust estimation combines a direct estimate with an IPS-like correction if historic data is
available. If historic data is not available, the second term is 0 and only the predicted reward is used for the
user-item pair.</p>
<p>The IPS and DR implementations are based on: Dudík, Miroslav, John Langford, and Lihong Li.
“Doubly robust policy evaluation and learning.” Proceedings of the 28th International Conference on International
Conference on Machine Learning. 2011. Available as arXiv preprint arXiv:1103.4601</p>
</div>
<div class="section" id="auc-area-under-the-curve">
<h3>AUC: Area Under the Curve<a class="headerlink" href="#auc-area-under-the-curve" title="Permalink to this headline"></a></h3>