Merge pull request #443 from scikit-learn-contrib/249-clarification-o…

…f-the-symmetry-argument-in-cqr-and-more-general-documentation-about-cqr 249 clarification of the symmetry argument in cqr and more general documentation about cqr
scikit-learn-contrib · May 27, 2024 · be78d7a
2 parents 44afae4 + a78b57e
commit be78d7a
Showing 3 changed files with 145 additions and 17 deletions.
diff --git a/HISTORY.rst b/HISTORY.rst
@@ -9,6 +9,7 @@ History
 * Reduce precision for test in `MapieCalibrator`.
 * Fix invalid certificate when downloading data.
 * Add citations utility to the documentation.
+* Add explanation and example for symmetry argument in CQR.
 
 0.8.3 (2024-03-01)
 ------------------

diff --git a/doc/theoretical_description_regression.rst b/doc/theoretical_description_regression.rst
@@ -245,30 +245,43 @@ uncertainty is higher than :math:`CV+`, because the models' prediction spread
 is then higher.
 
 
-9. The conformalized quantile regression (CQR) method
+9. The Conformalized Quantile Regression (CQR) Method
 =====================================================
 
-The conformalized quantile method allows for better interval widths with
-heteroscedastic data. It uses quantile regressors with different quantile
-values to estimate the prediction bounds and the residuals of these methods are
-used to create the guaranteed coverage value.
+The conformalized quantile regression (CQR) method yields interval widths that adapt better to
+heteroscedastic data. It uses quantile regressors with different quantile values to estimate
+the prediction bounds. The residuals of these regressors on the calibration set are then used
+to correct the bounds so that the target coverage is guaranteed.
+
+Notations and Definitions
+-------------------------
+- :math:`\mathcal{I}_1`: Set of indices of the data in the training set.
+- :math:`\mathcal{I}_2`: Set of indices of the data in the calibration set.
+- :math:`\hat{q}_{\alpha_{\text{low}}}`: Lower quantile model trained on :math:`\{(X_i, Y_i) : i \in \mathcal{I}_1\}`.
+- :math:`\hat{q}_{\alpha_{\text{high}}}`: Upper quantile model trained on :math:`\{(X_i, Y_i) : i \in \mathcal{I}_1\}`.
+- :math:`E_i`: Residual for the i-th sample in the calibration set.
+- :math:`E_{\text{low}}`: Residuals from the lower quantile model.
+- :math:`E_{\text{high}}`: Residuals from the upper quantile model.
+- :math:`Q_{1-\alpha}(E, \mathcal{I}_2)`: The :math:`(1-\alpha)(1+1/|\mathcal{I}_2|)`-th empirical quantile of the set :math:`\{E_i : i \in \mathcal{I}_2\}`.
+
+Mathematical Formulation
+------------------------
+The prediction interval :math:`\hat{C}_{n, \alpha}^{\text{CQR}}(X_{n+1})` for a new sample :math:`X_{n+1}` is given by:
 
-.. math:: 
+.. math::
+
+    \hat{C}_{n, \alpha}^{\text{CQR}}(X_{n+1}) = 
+    [\hat{q}_{\alpha_{\text{low}}}(X_{n+1}) - Q_{1-\alpha}(E_{\text{low}}, \mathcal{I}_2),
+    \hat{q}_{\alpha_{\text{high}}}(X_{n+1}) + Q_{1-\alpha}(E_{\text{high}}, \mathcal{I}_2)]
 
-    \hat{C}_{n, \alpha}^{\rm CQR}(X_{n+1}) = 
-    [\hat{q}_{\alpha_{lo}}(X_{n+1}) - Q_{1-\alpha}(E_{low}, \mathcal{I}_2),
-    \hat{q}_{\alpha_{hi}}(X_{n+1}) + Q_{1-\alpha}(E_{high}, \mathcal{I}_2)]
+Where:
 
-Where :math:`Q_{1-\alpha}(E, \mathcal{I}_2) := (1-\alpha)(1+1/ |\mathcal{I}_2|)`-th
-empirical quantile of :math:`{E_i : i \in \mathcal{I}_2}` and :math:`\mathcal{I}_2` is the
-residuals of the estimator fitted on the calibration set. Note that in the symmetric method, 
-:math:`E_{low}` and :math:`E_{high}` are equal.
+- :math:`\hat{q}_{\alpha_{\text{low}}}(X_{n+1})` is the predicted lower quantile for the new sample.
+- :math:`\hat{q}_{\alpha_{\text{high}}}(X_{n+1})` is the predicted upper quantile for the new sample.
 
-As justified by [3], this method offers a theoretical guarantee of the target coverage 
-level :math:`1-\alpha`.
+Note: In the symmetric method, :math:`E_{\text{low}}` and :math:`E_{\text{high}}` are no longer treated as distinct sets. Instead, the union :math:`E_{\text{all}} = E_{\text{low}} \cup E_{\text{high}}` is used directly, and a single empirical quantile is computed over all of its absolute (positive) residuals and applied to both bounds.
 
-Note that only the split method has been implemented and that it will run three separate 
-regressions when using :class:`mapie.quantile_regression.MapieQuantileRegressor`.
+As justified by [3], this method offers a theoretical guarantee of the target coverage level :math:`1-\alpha`.
 
 
 10. The ensemble batch prediction intervals (EnbPI) method
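
The interval formula added in this hunk can be sketched in plain NumPy. This is a minimal illustration under stated assumptions, not MAPIE's implementation: the helper names `conformal_quantile` and `cqr_interval` are hypothetical, the residuals are assumed to be signed distances (`E_low = q_low(X_i) - Y_i` and `E_high = Y_i - q_high(X_i)`), and the same level `1 - alpha` is applied to both bounds exactly as the formula is written.

```python
import numpy as np


def conformal_quantile(residuals, alpha):
    """(1 - alpha)(1 + 1/n)-th empirical quantile of the calibration residuals."""
    r = np.sort(np.asarray(residuals, dtype=float))
    n = r.size
    # Rank of the order statistic implied by the finite-sample correction.
    k = int(np.ceil((1 - alpha) * (n + 1)))
    # Too few calibration points for this level: fall back to an infinite bound.
    return np.inf if k > n else float(r[k - 1])


def cqr_interval(q_low_cal, q_high_cal, y_cal, q_low_new, q_high_new, alpha):
    """Correct the predicted quantiles with calibration residuals (assumed signed)."""
    e_low = q_low_cal - y_cal    # positive when y falls below the lower quantile
    e_high = y_cal - q_high_cal  # positive when y falls above the upper quantile
    lower = q_low_new - conformal_quantile(e_low, alpha)
    upper = q_high_new + conformal_quantile(e_high, alpha)
    return lower, upper


# Toy usage: constant quantile predictions of -1 and +1 on Gaussian targets.
rng = np.random.default_rng(0)
y_cal = rng.normal(size=200)
lower, upper = cqr_interval(
    np.full(200, -1.0), np.full(200, 1.0), y_cal,
    np.array([-1.0]), np.array([1.0]), alpha=0.2,
)
print(lower, upper)
```

Returning an infinite correction when the calibration set is too small for the requested level keeps the interval valid, if uninformative, instead of silently using a lower quantile.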

diff --git a/examples/regression/1-quickstart/plot_cqr_symmetry_difference.py b/examples/regression/1-quickstart/plot_cqr_symmetry_difference.py
@@ -0,0 +1,114 @@
+"""
+====================================
+Plotting CQR with symmetric argument
+====================================
+An example plot of :class:`~mapie.quantile_regression.MapieQuantileRegressor`
+illustrating the impact of the symmetry parameter.
+"""
+import numpy as np
+from matplotlib import pyplot as plt
+from sklearn.datasets import make_regression
+from sklearn.ensemble import GradientBoostingRegressor
+
+from mapie.metrics import regression_coverage_score
+from mapie.quantile_regression import MapieQuantileRegressor
+
+random_state = 2
+
+##############################################################################
+# We generate synthetic data.
+
+X, y = make_regression(n_samples=500, n_features=1, noise=20, random_state=59)
+
+# Define alpha level
+alpha = 0.2
+
+# Define a Gradient Boosting Regressor with a quantile loss
+gb_reg = GradientBoostingRegressor(
+    loss="quantile", alpha=0.5, random_state=random_state
+)
+
+# MAPIE Quantile Regressor
+mapie_qr = MapieQuantileRegressor(estimator=gb_reg, alpha=alpha)
+mapie_qr.fit(X, y, random_state=random_state)
+y_pred_sym, y_pis_sym = mapie_qr.predict(X, symmetry=True)
+y_pred_asym, y_pis_asym = mapie_qr.predict(X, symmetry=False)
+y_qlow = mapie_qr.estimators_[0].predict(X)
+y_qup = mapie_qr.estimators_[1].predict(X)
+
+# Calculate coverage scores
+coverage_score_sym = regression_coverage_score(
+    y, y_pis_sym[:, 0], y_pis_sym[:, 1]
+)
+coverage_score_asym = regression_coverage_score(
+    y, y_pis_asym[:, 0], y_pis_asym[:, 1]
+)
+
+# Sort the values for plotting
+order = np.argsort(X[:, 0])
+X_sorted = X[order]
+y_pred_sym_sorted = y_pred_sym[order]
+y_pis_sym_sorted = y_pis_sym[order]
+y_pred_asym_sorted = y_pred_asym[order]
+y_pis_asym_sorted = y_pis_asym[order]
+y_qlow = y_qlow[order]
+y_qup = y_qup[order]
+
+##############################################################################
+# We will plot the prediction intervals obtained with and without symmetry.
+# The solid lines represent the lower and upper quantile estimates, the
+# dashed lines represent the conformalized interval bounds, and the shaded
+# area represents the resulting prediction interval.
+
+plt.figure(figsize=(14, 7))
+
+# Plot symmetric prediction intervals
+plt.subplot(1, 2, 1)
+plt.xlabel("x")
+plt.ylabel("y")
+plt.scatter(X, y, alpha=0.3)
+plt.plot(X_sorted, y_qlow, color="C1")
+plt.plot(X_sorted, y_qup, color="C1")
+plt.plot(X_sorted, y_pis_sym_sorted[:, 0], color="C1", ls="--")
+plt.plot(X_sorted, y_pis_sym_sorted[:, 1], color="C1", ls="--")
+plt.fill_between(
+    X_sorted.ravel(),
+    y_pis_sym_sorted[:, 0].ravel(),
+    y_pis_sym_sorted[:, 1].ravel(),
+    alpha=0.2,
+)
+plt.title(
+    f"Symmetric Intervals\n"
+    f"Target and effective coverages for "
+    f"alpha={alpha:.2f}: ({1-alpha:.3f}, {coverage_score_sym:.3f})"
+)
+
+# Plot asymmetric prediction intervals
+plt.subplot(1, 2, 2)
+plt.xlabel("x")
+plt.ylabel("y")
+plt.scatter(X, y, alpha=0.3)
+plt.plot(X_sorted, y_qlow, color="C2")
+plt.plot(X_sorted, y_qup, color="C2")
+plt.plot(X_sorted, y_pis_asym_sorted[:, 0], color="C2", ls="--")
+plt.plot(X_sorted, y_pis_asym_sorted[:, 1], color="C2", ls="--")
+plt.fill_between(
+    X_sorted.ravel(),
+    y_pis_asym_sorted[:, 0].ravel(),
+    y_pis_asym_sorted[:, 1].ravel(),
+    alpha=0.2,
+)
+plt.title(
+    f"Asymmetric Intervals\n"
+    f"Target and effective coverages for "
+    f"alpha={alpha:.2f}: ({1-alpha:.3f}, {coverage_score_asym:.3f})"
+)
+plt.tight_layout()
+plt.show()
+
+##############################################################################
+# The symmetric intervals (`symmetry=True`) use a combined set of residuals
+# for both bounds, while the asymmetric intervals use distinct residuals for
+# each bound, so the lower and upper corrections can differ. Note that the
+# effective coverages above are computed on the same data used to fit and
+# calibrate the model, so they only illustrate the target coverage level
+# :math:`1 - \alpha`; a held-out test set is needed to check the guarantee.
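
The difference between the two settings can be made concrete on a handful of residuals. The sketch below is a literal reading of the symmetry note in the documentation hunk (pool `E_low` and `E_high`, take absolute values, use one quantile for both bounds); it is an assumption for illustration and is not taken from MAPIE's source. The helper names `conformal_quantile` and `corrections` and the residual values are hypothetical.

```python
import numpy as np


def conformal_quantile(residuals, alpha):
    """(1 - alpha)(1 + 1/n)-th empirical quantile of the residuals."""
    r = np.sort(np.asarray(residuals, dtype=float))
    n = r.size
    k = int(np.ceil((1 - alpha) * (n + 1)))
    return np.inf if k > n else float(r[k - 1])


def corrections(e_low, e_high, alpha, symmetry):
    """Return the (lower, upper) corrections applied to the quantile predictions."""
    if symmetry:
        # Assumed symmetric rule: one quantile over the pooled absolute residuals.
        pooled = np.abs(np.concatenate([e_low, e_high]))
        q = conformal_quantile(pooled, alpha)
        return q, q
    # Asymmetric rule: each bound is corrected with its own residuals.
    return conformal_quantile(e_low, alpha), conformal_quantile(e_high, alpha)


# Hypothetical calibration residuals, larger on the upper side.
e_low = np.array([0.5, 1.0, 1.5, 2.0, 2.5])
e_high = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

sym = corrections(e_low, e_high, alpha=0.4, symmetry=True)
asym = corrections(e_low, e_high, alpha=0.4, symmetry=False)
print("symmetric:", sym)
print("asymmetric:", asym)
```

With these residuals the symmetric rule widens both bounds by the same amount, while the asymmetric rule widens the upper bound more than the lower one.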