Feat/kpcovr fitted regressor #113

Merged

Conversation

bhelfrecht (Contributor)

Addresses #112. It's a little messier to use pre-fitted regressors with KPCovR because of all the extra kernel arguments, but I think it's worth it to ensure that X, K, Y, and Yhat are always consistent.

The wrinkle is how to pass the kernel arguments cleanly. The way I have it set up right now is that all of the kernel arguments come in through the regressor, pre-fitted or not. This means when instantiating the class you don't have to specify kernel parameters for both the regressor and for KPCovR. However, it also means that if you want to use non-default kernel parameters for mixing = 1, you still have to import and pass a KernelRidge instance, which seems a bit silly.

On the other hand, we could have both KPCovR and the regressor accept kernel arguments, and have one set override the other depending on the particular situation. The kernel parameters associated with a pre-fitted regressor would override the ones passed directly to KPCovR (since they need to be consistent), and maybe in the case of a non-fitted regressor the opposite would be true. The downside is that this clutters up the __init__ call. Thoughts?
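
For concreteness, usage with a pre-fitted regressor under the current setup would look roughly like the sketch below (the KernelPCovR import path and the transform call are assumed here for illustration, not taken from this PR):

import numpy as np
from sklearn.kernel_ridge import KernelRidge

from skcosmo.decomposition import KernelPCovR

X = np.random.rand(20, 4)
Y = np.random.rand(20, 2)

# Fit the kernel regressor first; its kernel parameters are the ones
# KernelPCovR will reuse, so X, K, Y, and Yhat stay consistent
regressor = KernelRidge(alpha=1e-8, kernel="rbf", gamma=0.1)
regressor.fit(X, Y)

kpcovr = KernelPCovR(mixing=0.5, regressor=regressor)
kpcovr.fit(X, Y)
T = kpcovr.transform(X)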

@agoscinski (Collaborator) commented May 4, 2021

However, it also means that if you want to use non-default kernel parameters for mixing = 1, you still have to import and pass a KernelRidge instance, which seems a bit silly.

I see that it's conceptually nicer to separate the regressor and the kernel parameters, but it seems more complicated in terms of the amount of code needed to cover the different cases once pre-fitted regressors are also allowed, and also more complicated for the user to keep the different input parameters straight. So it is fine for me. As long as the docs state that the kernel parameters from the regressor are used, I am okay as a user.

I really think you should change the default arguments to None and do the checking in sklearn style. Something could happen when two instances of PCovR are used, since you consider pre-fitted regressors. I know it is a bit contrived, but it is still something that could happen:

class Regressor: 
    def __init__(self): 
        self.fitted = False

    def fit(self, X): 
        self.fitted = True

class PCovR: 
    def __init__(self, regressor=Regressor()): 
        self.regressor = regressor 

    def fit(self, X): 
        self.regressor.fit(X)

a = PCovR()
print("a.regressor.fitted", a.regressor.fitted)
a.fit(None)
print("a.regressor.fitted", a.regressor.fitted)

b = PCovR()
print("b.regressor.fitted", b.regressor.fitted)

Output:

a.regressor.fitted False
a.regressor.fitted True
b.regressor.fitted True

The regressor in b was not intentionally pre-fitted.

@bhelfrecht (Contributor, Author)

That is some interesting behavior. My intuition would be that the instantiation of b would use a fresh instance of Regressor given the __init__ call of PCovR, but apparently not!

Either way, would it not be enough to just deepcopy or clone self.regressor during the fit? For instance,

from copy import deepcopy


class PCovR:
    def __init__(self, regressor=Regressor()):
        self.regressor = regressor

    def fit(self, X):
        # Work on a copy so the shared default (or a user-supplied instance)
        # is never mutated by fit
        self.regressor_ = deepcopy(self.regressor)
        self.regressor_.fit(X)

This is what I have currently done through check_krr_fit (and check_lr_fit in PCovR) -- if the regressor is unfitted, it is cloned (https://scikit-learn.org/stable/modules/generated/sklearn.base.clone.html), and if it is fitted, a full deepcopy is made.

If I understand correctly, cloning estimators during the fit is how sklearn tends to handle estimators-containing-estimators like TransformedTargetRegressor, OneVsRestClassifier, ColumnTransformer, Pipeline with caching, etc.
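
For reference, a minimal sketch of that branch (the helper name _copy_regressor is hypothetical; check_krr_fit and check_lr_fit in this PR do more around it):

from copy import deepcopy

from sklearn.base import clone
from sklearn.exceptions import NotFittedError
from sklearn.utils.validation import check_is_fitted


def _copy_regressor(regressor):
    """Clone an unfitted regressor; deepcopy a fitted one."""
    try:
        check_is_fitted(regressor)
    except NotFittedError:
        # Unfitted: clone keeps only the hyperparameters
        return clone(regressor)
    # Fitted: deepcopy also preserves the learned coefficients
    return deepcopy(regressor)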

@agoscinski (Collaborator) commented May 5, 2021

Yes, deepcopy should prevent this from happening when the user only uses the public interface of the PCovR class, but the user could just access member variables directly. They are not even marked as private.

a = PCovR()
print("a.regressor.alpha", a.regressor.alpha)
# want to update only one parameter instead of defining the whole KernelRidge
a.regressor.alpha = 0.5
print("a.regressor.alpha", a.regressor.alpha)

b = PCovR()
print("b.regressor.alpha", b.regressor.alpha)

Output:

a.regressor.alpha 0.8
a.regressor.alpha 0.5
b.regressor.alpha 0.5

@Luthaf (Collaborator) commented May 5, 2021

That is some interesting behavior. My intuition would be that the instantiation of b would use a fresh instance of Regressor given the __init__ call of PCovR, but apparently not!

That's an issue with using mutable instances as default values. The default Regressor() is created once, when the function definition is evaluated, and stored with the function; the same instance is then reused by every new instance of PCovR. Usually you want to avoid mutable values as default parameters. Something like this should work fine:

class PCovR:
    def __init__(self, regressor=None):
        # A fresh Regressor is created per instance instead of sharing one default
        if regressor is None:
            regressor = Regressor()

        self.regressor = regressor

    def fit(self, X):
        self.regressor.fit(X)

Using a deep copy could also work, although I think it makes the code more complex.

@agoscinski (Collaborator) commented May 5, 2021

I think we should use None as the default argument and deepcopy in the fit function; it is very close to what sklearn does:
https://github.com/scikit-learn/scikit-learn/blob/15a949460/sklearn/compose/_target.py#L128-L140
We don't use clone because we also allow pre-fitted regressors.

EDIT: and keep the default KernelRidge as a global variable in the module, something like DEFAULT_KPCOVR_KERNEL_RIDGE. I did not do this for the reconstruction measures, and it is a bit annoying now when I only want to adapt one of the default parameters.
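
Roughly, the proposed pattern would look like the sketch below (only the DEFAULT_KPCOVR_KERNEL_RIDGE name comes from the suggestion above; the class body is heavily simplified and illustrative):

from copy import deepcopy

from sklearn.kernel_ridge import KernelRidge

# Module-level default, so a user can tweak a single parameter
# without reconstructing the whole KernelRidge
DEFAULT_KPCOVR_KERNEL_RIDGE = KernelRidge(kernel="linear")


class KernelPCovR:
    def __init__(self, mixing=0.5, regressor=None):
        self.mixing = mixing
        self.regressor = regressor  # None means "use the module default at fit time"

    def fit(self, X, Y):
        # Resolve the default lazily and copy it, so neither the module default
        # nor a user-supplied (possibly pre-fitted) regressor is ever mutated
        regressor = self.regressor if self.regressor is not None else DEFAULT_KPCOVR_KERNEL_RIDGE
        self.regressor_ = deepcopy(regressor)
        self.regressor_.fit(X, Y)
        return self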

@bhelfrecht bhelfrecht marked this pull request as ready for review May 6, 2021 11:24
@bhelfrecht bhelfrecht requested review from Luthaf and rosecers May 7, 2021 14:33
@bhelfrecht bhelfrecht linked an issue May 7, 2021 that may be closed by this pull request
@rosecers (Collaborator) left a comment

I have some thoughts on this whole KRR params vs KPCovR params -- may take the conversation offline to hash it out

skcosmo/decomposition/_kernel_pcovr.py (outdated, resolved)
skcosmo/decomposition/_kernel_pcovr.py (outdated, resolved)
skcosmo/decomposition/_kernel_pcovr.py (outdated, resolved)
skcosmo/utils/_pcovr_utils.py (outdated, resolved)
@bhelfrecht (Contributor, Author)

I've reorganized the regressor handling a bit based on @rosecers' feedback. All of the parameter checking is now done in fit, and KernelPCovR now takes kernel parameters, as does the regressor. If regressor is None, the KernelPCovR kernel parameters are used to initialize the regressor. If regressor is not None and the kernel parameters are inconsistent, an error is raised.

Additionally, KernelPCovR and PCovR now also accept regressor_params, additional keyword arguments that are passed to the regressor if regressor is set to None.
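
As a rough sketch, the fit-time resolution described above could look like this (the helper name, attribute names, and error message wording are illustrative, not the literal code in this PR):

from sklearn.kernel_ridge import KernelRidge


def _resolve_regressor(kpcovr):
    # Kernel parameters now live on KernelPCovR itself
    kernel_params = {
        "kernel": kpcovr.kernel,
        "gamma": kpcovr.gamma,
        "degree": kpcovr.degree,
        "coef0": kpcovr.coef0,
        "kernel_params": kpcovr.kernel_params,
    }

    if kpcovr.regressor is None:
        # Initialize a KernelRidge from KernelPCovR's kernel parameters, plus any
        # extra keyword arguments passed through regressor_params
        return KernelRidge(**kernel_params, **(kpcovr.regressor_params or {}))

    # A supplied regressor must agree with KernelPCovR on the kernel parameters
    mismatched = [
        name for name, value in kernel_params.items()
        if getattr(kpcovr.regressor, name) != value
    ]
    if mismatched:
        raise ValueError(
            f"Kernel parameter(s) {mismatched} of the regressor "
            "and KernelPCovR do not match"
        )
    return kpcovr.regressor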

@bhelfrecht bhelfrecht requested a review from rosecers May 10, 2021 10:47
@rosecers (Collaborator) left a comment

Hey! Logic looks good, although I'd drastically streamline the regressor checks; see the comments.

Comment on lines 79 to 92
Kernel. Default="linear".
Kernel. Default="linear".
Collaborator:

extraneous change -- please remove

Comment on lines 182 to 185
[[-0.55119827, -0.21793572],
[ 0.3768726 , 0.31208068],
[-0.76898956, 0.08511876],
[ 0.92488574, -0.18627707]]
Collaborator:

Why do the results change so drastically if the data stays the same?

Contributor (Author):

The example was broken as written. You'll notice that the (old) example input initializes with gamma=2, but the corresponding example output shows gamma=0.01. So I ran the example myself with gamma=2, and this is what came out. I can try with gamma=0.01 to see if the results are the same as the old.

>>> kpcovr.score(X, Y)
(0.5312320029915978, 0.06254540655698511)
1.0000774522028972
Collaborator:

This seems bad -- the example is designed to be better than this

Contributor (Author):

In the old example, the score output was a tuple instead of the single value shown now; that issue predates even this PR. Part of the problem might be the broken example (see comment above). I'll re-run with a different gamma and see what happens.

Collaborator:

Oh I don't have an issue with the single value -- it's more that a loss of 1.0 seems awful.

Contributor (Author):

I agree. I'll try re-doing this example with a different gamma.

Returns
-------
self: object
Returns the instance itself.

"""

if not any([self.regressor is None, isinstance(self.regressor, KernelRidge)]):
Collaborator:

Why this syntax? For simple two-component logic, it would be more typical to write

Suggested change
if not any([self.regressor is None, isinstance(self.regressor, KernelRidge)]):
if self.regressor is not None and not isinstance(self.regressor, KernelRidge):

I find it much easier to read and understand

Contributor (Author):

I just used this syntax because it's consistent with what I had in PCovR, where there are more valid regressor possibilities. I have no problem changing it.

Comment on lines 13 to 88
Checks that the coefficients of a fitted
regression model are compatible with the shapes
of X and y

:param regressor: sklearn-style regressor
:type regressor: object
:param X: feature matrix with which to compare the
regression coefficients
:type X: array
:param y: target values with which to compare the
regression coefficients
:type y: array

"""
if fitted_regressor.coef_.ndim != y.ndim:
raise ValueError(
"The target regressor has a shape incompatible "
"with the supplied target space"
)
elif fitted_regressor.coef_.ndim == 1:
if fitted_regressor.coef_.shape[0] != X.shape[1]:
raise ValueError(
"The target regressor has a shape incompatible "
"with the supplied feature space"
)
else:
if fitted_regressor.coef_.shape[0] != y.shape[1]:
raise ValueError(
"The target regressor has a shape incompatible "
"with the supplied target space"
)
elif fitted_regressor.coef_.shape[1] != X.shape[1]:
raise ValueError(
"The target regressor has a shape incompatible "
"with the supplied feature space"
)


def _check_dual_coefs(fitted_regressor, K, y):
r"""
Checks that the dual coefficients of a fitted
regression model are compatible with the shapes
of K and y

:param regressor: sklearn-style regressor
:type regressor: object
:param K: kernel matrix with which to compare the
regression coefficients
:type K: array
:param y: target values with which to compare the
regression coefficients
:type y: array

"""
if fitted_regressor.dual_coef_.ndim != y.ndim:
raise ValueError(
"The target regressor has a shape incompatible "
"with the supplied target space"
)
elif fitted_regressor.dual_coef_.ndim == 1:
if fitted_regressor.dual_coef_.shape[0] != K.shape[0]:
raise ValueError(
"The target regressor has a shape incompatible "
"with the supplied sample space"
)
else:
if fitted_regressor.dual_coef_.shape[0] != K.shape[0]:
raise ValueError(
"The target regressor has a shape incompatible "
"with the supplied sample space"
)
elif fitted_regressor.dual_coef_.shape[1] != y.shape[1]:
raise ValueError(
"The target regressor has a shape incompatible "
"with the supplied target space"
)
@rosecers (Collaborator), May 12, 2021:

Shouldn't this all be caught if you try and pass X to regressor.predict? I feel like you could avoid all of these raised errors by relying on the errors already thrown by sklearn.

Suggested change: remove _check_coefs and _check_dual_coefs entirely (the block quoted above).

Collaborator:

You can use self._validate_data(X, y) from anything inheriting from BaseEstimator to check the dimensionality of X with respect to the coefficients. Checking y now.

Collaborator:

It doesn't look like there is a way to validate that the regressor outputs a y consistent with the supplied y, but self._validate_data(X, y) will flag an error if dim(y) is unexpected.

By the way, all of this applies to kernel regressors as well, so there's no need for separate functions. I would be happy if you just called fitted_regressor._validate_data(X, y) wherever you intended to call this function. That seems to be the approach in sklearn.
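
Put together, a check_krr_fit built on these suggestions could shrink to roughly the following (a sketch only: the reset=False flag and the overall structure are assumptions, not part of the suggested change itself):

from copy import deepcopy

from sklearn.base import clone
from sklearn.exceptions import NotFittedError
from sklearn.utils.validation import check_is_fitted


def check_krr_fit(regressor, K, y):
    # Rely on sklearn's own validation instead of hand-rolled shape checks
    try:
        check_is_fitted(regressor)
        fitted_regressor = deepcopy(regressor)
        # reset=False compares the number of features in K against the
        # n_features_in_ recorded at fit time, and checks that K and y have
        # consistent numbers of samples (this flag is an assumption, not part
        # of the suggestion above)
        fitted_regressor._validate_data(K, y, reset=False)
    except NotFittedError:
        # Unfitted regressor: clone it and fit on the supplied kernel
        fitted_regressor = clone(regressor)
        fitted_regressor.fit(K, y)
    return fitted_regressor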

Comment on lines 256 to 306
def test_incompatible_coef_shape(self):

# 1D properties (self.Y is 2D with two targets)
# X shape doesn't match
regressor = KernelRidge(alpha=1e-8, kernel="linear")
regressor.fit(self.X, self.Y[:, 0])
kpcovr = self.model(mixing=0.5, regressor=regressor)

with self.assertRaises(ValueError) as cm:
kpcovr.fit(self.X[0:-1], self.Y[0:-1, 0])
self.assertTrue(
str(cm.message),
"The target regressor has a shape incompatible "
"with the supplied sample space",
)

# >= 2D properties
# Y shape doesn't match
regressor = KernelRidge(alpha=1e-8, kernel="linear")
regressor.fit(self.X, self.Y[:, 0][:, np.newaxis])
kpcovr = self.model(mixing=0.5, regressor=regressor)

with self.assertRaises(ValueError) as cm:
kpcovr.fit(self.X, self.Y[:, 0])
self.assertTrue(
str(cm.message),
"The target regressor has a shape incompatible "
"with the supplied target space",
)

with self.assertRaises(ValueError) as cm:
kpcovr.fit(self.X, self.Y)
self.assertTrue(
str(cm.message),
"The target regressor has a shape incompatible "
"with the supplied target space",
)

# X shape doesn't match
regressor = KernelRidge(alpha=1e-8, kernel="linear")
regressor.fit(self.X, self.Y)
kpcovr = self.model(mixing=0.5, regressor=regressor)

with self.assertRaises(ValueError) as cm:
kpcovr.fit(self.X[0:-1], self.Y[0:-1])
self.assertTrue(
str(cm.message),
"The target regressor has a shape incompatible "
"with the supplied sample space",
)

Collaborator:

So much of this would be avoided by the suggestion above. All you'd have to do is check the regressor output shape

Comment on lines +434 to +455
def test_regressor_modifications(self):
regressor = Ridge(alpha=1e-8)
pcovr = self.model(mixing=0.5, regressor=regressor)

# PCovR regressor matches the original
self.assertTrue(regressor.get_params() == pcovr.regressor.get_params())

# PCovR regressor updates its parameters
# to match the original regressor
regressor.set_params(alpha=1e-6)
self.assertTrue(regressor.get_params() == pcovr.regressor.get_params())

# Fitting regressor outside PCovR fits the PCovR regressor
regressor.fit(self.X, self.Y)
self.assertTrue(hasattr(pcovr.regressor, "coef_"))

# PCovR regressor doesn't change after fitting
pcovr.fit(self.X, self.Y)
regressor.set_params(alpha=1e-4)
self.assertTrue(hasattr(pcovr.regressor_, "coef_"))
self.assertTrue(regressor.get_params() != pcovr.regressor_.get_params())

Collaborator:

Same here

Contributor (Author):

I think this should stay, since it's different from checking the regressor shape. These tests are to ensure consistency between the regressor and the KPCovR kernel parameters.

Collaborator:

Oops! I read this wrong -- my comment is more for lines 475-524

Contributor (Author):

Yes, I agree. If there already exist sklearn utilities that we could adopt to do these checks, then we could dramatically simplify these tests. We might still need something to keep the codecov bot happy, but the tests could be simplified.

"The target regressor has a shape incompatible "
"with the supplied feature space"
)
_check_coefs(regressor, X, y)
Collaborator:

Suggested change
_check_coefs(regressor, X, y)
regressor._validate_data(X, y)

try:
check_is_fitted(regressor)
fitted_regressor = deepcopy(regressor)
_check_dual_coefs(regressor, K, y)
Collaborator:

Suggested change
_check_dual_coefs(regressor, K, y)
fitted_regressor._validate_data(K, y)

@bhelfrecht bhelfrecht merged commit b4e06ae into scikit-learn-contrib:main May 12, 2021
@bhelfrecht bhelfrecht deleted the feat/kpcovr_fitted_regressor branch May 12, 2021 16:04

Successfully merging this pull request may close these issues.

Pre-fitted regressors with KPCovR