Merge pull request #443 from scikit-learn-contrib/249-clarification-of-the-symmetry-argument-in-cqr-and-more-general-documentation-about-cqr

249 clarification of the symmetry argument in cqr and more general documentation about cqr
LacombeLouis authored May 27, 2024
2 parents 44afae4 + a78b57e commit be78d7a
Showing 3 changed files with 145 additions and 17 deletions.
1 change: 1 addition & 0 deletions HISTORY.rst
@@ -9,6 +9,7 @@ History
* Reduce precision for test in `MapieCalibrator`.
* Fix invalid certificate when downloading data.
* Add citations utility to the documentation.
* Add explanation and example for symmetry argument in CQR.

0.8.3 (2024-03-01)
------------------
47 changes: 30 additions & 17 deletions doc/theoretical_description_regression.rst
@@ -245,30 +245,43 @@ uncertainty is higher than :math:`CV+`, because the models' prediction spread
is then higher.


9. The conformalized quantile regression (CQR) method
9. The Conformalized Quantile Regression (CQR) Method
=====================================================

The conformalized quantile method allows for better interval widths with
heteroscedastic data. It uses quantile regressors with different quantile
values to estimate the prediction bounds and the residuals of these methods are
used to create the guaranteed coverage value.
The conformalized quantile regression (CQR) method yields interval widths that adapt to
heteroscedastic data. It uses quantile regressors with different quantile values to estimate
the prediction bounds. The residuals of these regressors on the calibration set are then used
to adjust the bounds so that the target coverage is guaranteed.

Notations and Definitions
-------------------------
- :math:`\mathcal{I}_1` is the set of indices of the data in the training set.
- :math:`\mathcal{I}_2` is the set of indices of the data in the calibration set.
- :math:`\hat{q}_{\alpha_{\text{lo}}}`: Lower quantile model trained on :math:`\{(X_i, Y_i) : i \in \mathcal{I}_1\}`.
- :math:`\hat{q}_{\alpha_{\text{hi}}}`: Upper quantile model trained on :math:`\{(X_i, Y_i) : i \in \mathcal{I}_1\}`.
- :math:`E_i`: Residual of the i-th sample in the calibration set.
- :math:`E_{\text{low}}`: Residuals from the lower quantile model.
- :math:`E_{\text{high}}`: Residuals from the upper quantile model.
- :math:`Q_{1-\alpha}(E, \mathcal{I}_2)`: The :math:`(1-\alpha)(1+1/|\mathcal{I}_2|)`-th empirical quantile of the set :math:`\{E_i : i \in \mathcal{I}_2\}`.
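
The corrected quantile level :math:`(1-\alpha)(1+1/|\mathcal{I}_2|)` can be illustrated with a small, dependency-free sketch. The helper name ``conformal_quantile`` is hypothetical, and the sketch assumes one common convention, namely taking the :math:`\lceil (1-\alpha)(n+1) \rceil`-th smallest residual. A library implementation may use a different interpolation rule.

```python
import math


def conformal_quantile(scores, alpha):
    """Return the ceil((1 - alpha) * (n + 1))-th smallest score.

    Assumption: this is one common way to realise the
    (1 - alpha)(1 + 1/n)-th empirical quantile of n calibration scores.
    """
    if not 0 < alpha < 1:
        raise ValueError("alpha must lie strictly between 0 and 1")
    n = len(scores)
    if n == 0:
        raise ValueError("scores must not be empty")
    # 1-based rank of the order statistic to return.
    rank = math.ceil((1 - alpha) * (n + 1))
    if rank > n:
        # Too few calibration samples for this alpha: no finite bound exists.
        return math.inf
    return sorted(scores)[rank - 1]
```

With very few calibration samples the requested rank can exceed the sample size, in which case the sketch returns infinity rather than silently under-covering.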

Mathematical Formulation
------------------------
The prediction interval :math:`\hat{C}_{n, \alpha}^{\text{CQR}}(X_{n+1})` for a new sample :math:`X_{n+1}` is given by:

.. math::
.. math::
\hat{C}_{n, \alpha}^{\text{CQR}}(X_{n+1}) =
[\hat{q}_{\alpha_{\text{lo}}}(X_{n+1}) - Q_{1-\alpha}(E_{\text{low}}, \mathcal{I}_2),
\hat{q}_{\alpha_{\text{hi}}}(X_{n+1}) + Q_{1-\alpha}(E_{\text{high}}, \mathcal{I}_2)]
\hat{C}_{n, \alpha}^{\rm CQR}(X_{n+1}) =
[\hat{q}_{\alpha_{lo}}(X_{n+1}) - Q_{1-\alpha}(E_{low}, \mathcal{I}_2),
\hat{q}_{\alpha_{hi}}(X_{n+1}) + Q_{1-\alpha}(E_{high}, \mathcal{I}_2)]
Where:

Where :math:`Q_{1-\alpha}(E, \mathcal{I}_2) := (1-\alpha)(1+1/ |\mathcal{I}_2|)`-th
empirical quantile of :math:`{E_i : i \in \mathcal{I}_2}` and :math:`\mathcal{I}_2` is the
residuals of the estimator fitted on the calibration set. Note that in the symmetric method,
:math:`E_{low}` and :math:`E_{high}` are equal.
- :math:`\hat{q}_{\alpha_{\text{lo}}}(X_{n+1})` is the predicted lower quantile for the new sample.
- :math:`\hat{q}_{\alpha_{\text{hi}}}(X_{n+1})` is the predicted upper quantile for the new sample.

As justified by [3], this method offers a theoretical guarantee of the target coverage
level :math:`1-\alpha`.
Note: In the symmetric method, the sets :math:`E_{\text{low}}` and :math:`E_{\text{high}}` are no longer treated separately. Instead, the union set :math:`E_{\text{all}} = E_{\text{low}} \cup E_{\text{high}}` is considered directly, and the empirical quantile is computed over all of its absolute (positive) residuals.

Note that only the split method has been implemented and that it will run three separate
regressions when using :class:`mapie.quantile_regression.MapieQuantileRegressor`.
As justified by the literature, this method offers a theoretical guarantee of the target coverage level :math:`1-\alpha`.
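
The interval formula above can be sketched in plain Python. This is an illustration under stated assumptions, not MAPIE's implementation: the function names are hypothetical, the residuals are assumed to be :math:`E_{\text{low}} = \hat{q}_{\alpha_{\text{lo}}}(X_i) - Y_i` and :math:`E_{\text{high}} = Y_i - \hat{q}_{\alpha_{\text{hi}}}(X_i)`, and the symmetric branch assumes the per-sample maximum of the two residuals as the combined score, as in the original CQR construction.

```python
import math


def conformal_quantile(scores, alpha):
    """ceil((1 - alpha) * (n + 1))-th smallest score (assumed convention)."""
    n = len(scores)
    rank = math.ceil((1 - alpha) * (n + 1))
    return math.inf if rank > n else sorted(scores)[rank - 1]


def cqr_interval(q_lo_cal, q_hi_cal, y_cal, q_lo_new, q_hi_new, alpha,
                 symmetry=True):
    """Conformalize one pair of quantile predictions (hypothetical helper).

    q_lo_cal, q_hi_cal: quantile predictions on the calibration set.
    y_cal: calibration targets.
    q_lo_new, q_hi_new: quantile predictions for the new sample.
    """
    # Assumed residual definitions: positive when y falls outside the band.
    e_low = [q_lo - y for q_lo, y in zip(q_lo_cal, y_cal)]
    e_high = [y - q_hi for q_hi, y in zip(q_hi_cal, y_cal)]
    if symmetry:
        # One combined score per sample, so both bounds get the same shift.
        combined = [max(lo, hi) for lo, hi in zip(e_low, e_high)]
        shift_low = shift_high = conformal_quantile(combined, alpha)
    else:
        # Each bound is corrected with its own residuals, following the
        # formula as written above (Q_{1-alpha} on each side).
        shift_low = conformal_quantile(e_low, alpha)
        shift_high = conformal_quantile(e_high, alpha)
    return q_lo_new - shift_low, q_hi_new + shift_high
```

For example, with calibration bands ``[0, 10]`` and targets ``[-1, 5, 12]`` at ``alpha=0.25``, the asymmetric branch widens the lower bound by 1 and the upper bound by 2, while the symmetric branch widens both by 2.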


10. The ensemble batch prediction intervals (EnbPI) method
114 changes: 114 additions & 0 deletions examples/regression/1-quickstart/plot_cqr_symmetry_difference.py
@@ -0,0 +1,114 @@
"""
====================================
Plotting CQR with symmetric argument
====================================
An example plot of :class:`~mapie.quantile_regression.MapieQuantileRegressor`
illustrating the impact of the symmetry parameter.
"""
import numpy as np
from matplotlib import pyplot as plt
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

from mapie.metrics import regression_coverage_score
from mapie.quantile_regression import MapieQuantileRegressor

random_state = 2

##############################################################################
# We generate synthetic data.

X, y = make_regression(n_samples=500, n_features=1, noise=20, random_state=59)

# Define alpha level
alpha = 0.2

# Fit a Gradient Boosting Regressor for quantile regression
gb_reg = GradientBoostingRegressor(
loss="quantile", alpha=0.5, random_state=random_state
)

# MAPIE Quantile Regressor
mapie_qr = MapieQuantileRegressor(estimator=gb_reg, alpha=alpha)
mapie_qr.fit(X, y, random_state=random_state)
y_pred_sym, y_pis_sym = mapie_qr.predict(X, symmetry=True)
y_pred_asym, y_pis_asym = mapie_qr.predict(X, symmetry=False)
y_qlow = mapie_qr.estimators_[0].predict(X)
y_qup = mapie_qr.estimators_[1].predict(X)

# Calculate coverage scores
coverage_score_sym = regression_coverage_score(
y, y_pis_sym[:, 0], y_pis_sym[:, 1]
)
coverage_score_asym = regression_coverage_score(
y, y_pis_asym[:, 0], y_pis_asym[:, 1]
)

# Sort the values for plotting
order = np.argsort(X[:, 0])
X_sorted = X[order]
y_pred_sym_sorted = y_pred_sym[order]
y_pis_sym_sorted = y_pis_sym[order]
y_pred_asym_sorted = y_pred_asym[order]
y_pis_asym_sorted = y_pis_asym[order]
y_qlow = y_qlow[order]
y_qup = y_qup[order]

##############################################################################
# We plot the intervals obtained with and without the symmetry argument.
# In each panel, the solid lines show the lower and upper quantile estimates
# of the underlying regressors, the dashed lines show the bounds of the
# conformalized prediction intervals, and the shaded area is the region
# between those bounds.

plt.figure(figsize=(14, 7))

plt.subplot(1, 2, 1)
plt.xlabel("x")
plt.ylabel("y")
plt.scatter(X, y, alpha=0.3)
plt.plot(X_sorted, y_qlow, color="C1")
plt.plot(X_sorted, y_qup, color="C1")
plt.plot(X_sorted, y_pis_sym_sorted[:, 0], color="C1", ls="--")
plt.plot(X_sorted, y_pis_sym_sorted[:, 1], color="C1", ls="--")
plt.fill_between(
X_sorted.ravel(),
y_pis_sym_sorted[:, 0].ravel(),
y_pis_sym_sorted[:, 1].ravel(),
alpha=0.2,
)
plt.title(
f"Symmetric Intervals\n"
f"Target and effective coverages for "
f"alpha={alpha:.2f}: ({1-alpha:.3f}, {coverage_score_sym:.3f})"
)

# Plot asymmetric prediction intervals
plt.subplot(1, 2, 2)
plt.xlabel("x")
plt.ylabel("y")
plt.scatter(X, y, alpha=0.3)
plt.plot(X_sorted, y_qlow, color="C2")
plt.plot(X_sorted, y_qup, color="C2")
plt.plot(X_sorted, y_pis_asym_sorted[:, 0], color="C2", ls="--")
plt.plot(X_sorted, y_pis_asym_sorted[:, 1], color="C2", ls="--")
plt.fill_between(
X_sorted.ravel(),
y_pis_asym_sorted[:, 0].ravel(),
y_pis_asym_sorted[:, 1].ravel(),
alpha=0.2,
)
plt.title(
f"Asymmetric Intervals\n"
f"Target and effective coverages for "
f"alpha={alpha:.2f}: ({1-alpha:.3f}, {coverage_score_asym:.3f})"
)
plt.tight_layout()
plt.show()

##############################################################################
# The symmetric intervals (`symmetry=True`) use a single combined set of
# residuals to correct both bounds, while the asymmetric intervals use a
# distinct set of residuals for each bound. The lower and upper corrections
# can then differ, which gives more flexible intervals. The effective
# coverages in the plot titles can be compared with the target coverage
# level :math:`1 - \alpha`; note that they are computed here on the same
# data that was passed to `fit`.
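
The effective coverage shown in the plot titles is the fraction of targets that fall inside their prediction interval. A minimal stand-in for that computation is sketched below; the name `empirical_coverage` is hypothetical, it is not MAPIE's `regression_coverage_score`, and inclusive bounds are an assumption.

```python
def empirical_coverage(y_true, lower, upper):
    """Fraction of targets lying inside [lower, upper] (bounds inclusive)."""
    if not (len(y_true) == len(lower) == len(upper)):
        raise ValueError("all inputs must have the same length")
    if len(y_true) == 0:
        raise ValueError("inputs must not be empty")
    # Count the samples whose target is covered by its own interval.
    hits = sum(1 for y, lo, hi in zip(y_true, lower, upper) if lo <= y <= hi)
    return hits / len(y_true)
```

Applying such a helper to samples that were held out from both fitting and calibration would give a stricter check of the coverage than reusing the training data.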
