Feat/kpcovr fitted regressor #113

Merged

Conversation

bhelfrecht (Contributor)

Addresses #112. It's a little messier to use pre-fitted regressors with KPCovR because of all the extra kernel arguments, but I think it's worth it to ensure that X, K, Y, and Yhat are always consistent.

The wrinkle is how to pass the kernel arguments cleanly. The way I have it set up right now is that all of the kernel arguments come in through the regressor, pre-fitted or not. This means when instantiating the class you don't have to specify kernel parameters for both the regressor and for KPCovR. However, it also means that if you want to use non-default kernel parameters for mixing = 1, you still have to import and pass a KernelRidge instance, which seems a bit silly.

On the other hand, we could have both KPCovR and the regressor accept kernel arguments, and have one set override the other depending on the particular situation. The kernel parameters associated with a pre-fitted regressor would override the ones passed directly to KPCovR (since they need to be consistent), and maybe in the case of a non-fitted regressor the opposite would be true. The downside is that this clutters up the __init__ call. Thoughts?
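
For concreteness, usage with a pre-fitted regressor under the current setup would look roughly like the sketch below (the KernelPCovR import path and the transform call are assumed here for illustration, not taken from this PR):

import numpy as np
from sklearn.kernel_ridge import KernelRidge

from skcosmo.decomposition import KernelPCovR

X = np.random.rand(20, 4)
Y = np.random.rand(20, 2)

# Fit the kernel regressor first; its kernel parameters are the ones
# KernelPCovR will reuse, so X, K, Y, and Yhat stay consistent
regressor = KernelRidge(alpha=1e-8, kernel="rbf", gamma=0.1)
regressor.fit(X, Y)

kpcovr = KernelPCovR(mixing=0.5, regressor=regressor)
kpcovr.fit(X, Y)
T = kpcovr.transform(X)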

@agoscinski (Collaborator) commented May 4, 2021

However, it also means that if you want to use non-default kernel parameters for mixing = 1, you still have to import and pass a KernelRidge instance, which seems a bit silly.

I see that it's conceptually nicer to separate the regressor and the kernel parameters, but it seems more complicated in terms of the amount of code needed to cover the different cases once pre-fitted regressors are also allowed, and also more complicated for the user to keep the different input parameters straight. So it is fine for me. As long as the docs state that the kernel parameters from the regressor are used, I am okay as a user.

I really think you should change the default arguments to None and do the checking in sklearn style. Something could happen when two instances of PCovR are used, since you consider pre-fitted regressors. I know it is a bit contrived, but it is still something that could happen:

class Regressor: 
    def __init__(self): 
        self.fitted = False

    def fit(self, X): 
        self.fitted = True

class PCovR: 
    def __init__(self, regressor=Regressor()): 
        self.regressor = regressor 

    def fit(self, X): 
        self.regressor.fit(X)

a = PCovR()
print("a.regressor.fitted", a.regressor.fitted)
a.fit(None)
print("a.regressor.fitted", a.regressor.fitted)

b = PCovR()
print("b.regressor.fitted", b.regressor.fitted)

Output:

a.regressor.fitted False
a.regressor.fitted True
b.regressor.fitted True

The regressor in b was not intentionally pre-fitted.

@bhelfrecht (Contributor, Author)

That is some interesting behavior. My intuition would be that the instantiation of b would use a fresh instance of Regressor given the __init__ call of PCovR, but apparently not!

Either way, would it not be enough to just deepcopy or clone self.regressor during the fit? For instance,

from copy import deepcopy


class PCovR:
    def __init__(self, regressor=Regressor()):
        self.regressor = regressor

    def fit(self, X):
        # Work on a copy so the shared default (or a user-supplied instance)
        # is never mutated by fit
        self.regressor_ = deepcopy(self.regressor)
        self.regressor_.fit(X)

This is what I have currently done through check_krr_fit (and check_lr_fit in PCovR) -- if the regressor is unfitted, it is cloned (https://scikit-learn.org/stable/modules/generated/sklearn.base.clone.html), and if it is fitted, a full deepcopy is made.

If I understand correctly, cloning estimators during the fit is how sklearn tends to handle estimators-containing-estimators like TransformedTargetRegressor, OneVsRestClassifier, ColumnTransformer, Pipeline with caching, etc.
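
For reference, a minimal sketch of that branch (the helper name _copy_regressor is hypothetical; check_krr_fit and check_lr_fit in this PR do more around it):

from copy import deepcopy

from sklearn.base import clone
from sklearn.exceptions import NotFittedError
from sklearn.utils.validation import check_is_fitted


def _copy_regressor(regressor):
    """Clone an unfitted regressor; deepcopy a fitted one."""
    try:
        check_is_fitted(regressor)
    except NotFittedError:
        # Unfitted: clone keeps only the hyperparameters
        return clone(regressor)
    # Fitted: deepcopy also preserves the learned coefficients
    return deepcopy(regressor)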

@agoscinski (Collaborator) commented May 5, 2021

Yes, deepcopy should prevent this from happening when the user only uses the public interface of the PCovR class, but the user could just access member variables directly. They are not even marked as private.

a = PCovR()
print("a.regressor.alpha", a.regressor.alpha)
# want to update only one parameter instead of defining the whole KernelRidge
a.regressor.alpha = 0.5
print("a.regressor.alpha", a.regressor.alpha)

b = PCovR()
print("b.regressor.alpha", b.regressor.alpha)

Output:

a.regressor.alpha 0.8
a.regressor.alpha 0.5
b.regressor.alpha 0.5

@Luthaf (Collaborator) commented May 5, 2021

That is some interesting behavior. My intuition would be that the instantiation of b would use a fresh instance of Regressor given the __init__ call of PCovR, but apparently not!

That's an issue with using mutable instances as default values. The default Regressor() is created once, when the function definition is evaluated, and stored with the function; the same instance is then reused by every new instance of PCovR. Usually you want to avoid mutable values as default parameters. Something like this should work fine:

class PCovR:
    def __init__(self, regressor=None):
        # A fresh Regressor is created per instance instead of sharing one default
        if regressor is None:
            regressor = Regressor()

        self.regressor = regressor

    def fit(self, X):
        self.regressor.fit(X)

Using a deep copy could also work, although I think it makes the code more complex.

@agoscinski (Collaborator) commented May 5, 2021

I think we should use None as the default argument and deepcopy in the fit function; it is very close to what sklearn does:
https://github.com/scikit-learn/scikit-learn/blob/15a949460/sklearn/compose/_target.py#L128-L140
We don't use clone because we also allow pre-fitted regressors.

EDIT: and keep the default KernelRidge as a global variable in the module, something like DEFAULT_KPCOVR_KERNEL_RIDGE. I did not do this for the reconstruction measures, and it is a bit annoying now when I only want to adapt one of the default parameters.
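
Roughly, the proposed pattern would look like the sketch below (only the DEFAULT_KPCOVR_KERNEL_RIDGE name comes from the suggestion above; the class body is heavily simplified and illustrative):

from copy import deepcopy

from sklearn.kernel_ridge import KernelRidge

# Module-level default, so a user can tweak a single parameter
# without reconstructing the whole KernelRidge
DEFAULT_KPCOVR_KERNEL_RIDGE = KernelRidge(kernel="linear")


class KernelPCovR:
    def __init__(self, mixing=0.5, regressor=None):
        self.mixing = mixing
        self.regressor = regressor  # None means "use the module default at fit time"

    def fit(self, X, Y):
        # Resolve the default lazily and copy it, so neither the module default
        # nor a user-supplied (possibly pre-fitted) regressor is ever mutated
        regressor = self.regressor if self.regressor is not None else DEFAULT_KPCOVR_KERNEL_RIDGE
        self.regressor_ = deepcopy(regressor)
        self.regressor_.fit(X, Y)
        return self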

@bhelfrecht bhelfrecht marked this pull request as ready for review May 6, 2021 11:24
@bhelfrecht bhelfrecht requested review from Luthaf and rosecers May 7, 2021 14:33
@bhelfrecht bhelfrecht linked an issue May 7, 2021 that may be closed by this pull request
@rosecers (Collaborator) left a comment

I have some thoughts on this whole KRR params vs KPCovR params -- may take the conversation offline to hash it out

skcosmo/decomposition/_kernel_pcovr.py (outdated, resolved)
skcosmo/decomposition/_kernel_pcovr.py (outdated, resolved)
skcosmo/decomposition/_kernel_pcovr.py (outdated, resolved)
skcosmo/utils/_pcovr_utils.py (outdated, resolved)
@bhelfrecht (Contributor, Author)

I've reorganized the regressor handling a bit based on @rosecers' feedback. All of the parameter checking is now done in fit, and KernelPCovR now takes kernel parameters, as does the regressor. If regressor is None, the KernelPCovR kernel parameters are used to initialize the regressor. If regressor is not None and the kernel parameters are inconsistent, an error is raised.

Additionally, KernelPCovR and PCovR now also accept regressor_params, additional keyword arguments that are passed to the regressor if regressor is set to None.
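
As a rough sketch, the fit-time resolution described above could look like this (the helper name, attribute names, and error message wording are illustrative, not the literal code in this PR):

from sklearn.kernel_ridge import KernelRidge


def _resolve_regressor(kpcovr):
    # Kernel parameters now live on KernelPCovR itself
    kernel_params = {
        "kernel": kpcovr.kernel,
        "gamma": kpcovr.gamma,
        "degree": kpcovr.degree,
        "coef0": kpcovr.coef0,
        "kernel_params": kpcovr.kernel_params,
    }

    if kpcovr.regressor is None:
        # Initialize a KernelRidge from KernelPCovR's kernel parameters, plus any
        # extra keyword arguments passed through regressor_params
        return KernelRidge(**kernel_params, **(kpcovr.regressor_params or {}))

    # A supplied regressor must agree with KernelPCovR on the kernel parameters
    mismatched = [
        name for name, value in kernel_params.items()
        if getattr(kpcovr.regressor, name) != value
    ]
    if mismatched:
        raise ValueError(
            f"Kernel parameter(s) {mismatched} of the regressor "
            "and KernelPCovR do not match"
        )
    return kpcovr.regressor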

@bhelfrecht bhelfrecht requested a review from rosecers May 10, 2021 10:47
@rosecers (Collaborator) left a comment

Hey! Logic looks good, although I'd drastically streamline the regressor checks; see the comments.

Comment on lines 79 to 92
Kernel. Default="linear".
Kernel. Default="linear".
Collaborator:

extraneous change -- please remove

Comment on lines 182 to 185
[[-0.55119827, -0.21793572],
[ 0.3768726 , 0.31208068],
[-0.76898956, 0.08511876],
[ 0.92488574, -0.18627707]]
Collaborator:

Why do the results change so drastically if the data stays the same?

Contributor (Author):

The example was broken as written. You'll notice that the (old) example input initializes with gamma=2, but the corresponding example output shows gamma=0.01. So I ran the example myself with gamma=2, and this is what came out. I can try with gamma=0.01 to see if the results are the same as the old.

>>> kpcovr.score(X, Y)
(0.5312320029915978, 0.06254540655698511)
1.0000774522028972
Collaborator:

This seems bad -- the example is designed to be better than this

Contributor (Author):

In the old example, the score output was a tuple instead of the single value shown now; that issue predates even this PR. Part of the problem might be the broken example (see comment above). I'll re-run with a different gamma and see what happens.

Collaborator:

Oh I don't have an issue with the single value -- it's more that a loss of 1.0 seems awful.

Contributor (Author):

I agree. I'll try re-doing this example with a different gamma.

Returns
-------
self: object
Returns the instance itself.

"""

if not any([self.regressor is None, isinstance(self.regressor, KernelRidge)]):
Collaborator:

Why this syntax? For simple two-component logic, it would be more typical to write

Suggested change
if not any([self.regressor is None, isinstance(self.regressor, KernelRidge)]):
if self.regressor is not None and not isinstance(self.regressor, KernelRidge):

I find it much easier to read and understand

Contributor (Author):

I just used this syntax because it's consistent with what I had in PCovR, where there are more valid regressor possibilities. I have no problem changing it.

Comment on lines 13 to 88
Checks that the coefficients of a fitted
regression model are compatible with the shapes
of X and y

:param regressor: sklearn-style regressor
:type regressor: object
:param X: feature matrix with which to compare the
regression coefficients
:type X: array
:param y: target values with which to compare the
regression coefficients
:type y: array

"""
if fitted_regressor.coef_.ndim != y.ndim:
raise ValueError(
"The target regressor has a shape incompatible "
"with the supplied target space"
)
elif fitted_regressor.coef_.ndim == 1:
if fitted_regressor.coef_.shape[0] != X.shape[1]:
raise ValueError(
"The target regressor has a shape incompatible "
"with the supplied feature space"
)
else:
if fitted_regressor.coef_.shape[0] != y.shape[1]:
raise ValueError(
"The target regressor has a shape incompatible "
"with the supplied target space"
)
elif fitted_regressor.coef_.shape[1] != X.shape[1]:
raise ValueError(
"The target regressor has a shape incompatible "
"with the supplied feature space"
)


def _check_dual_coefs(fitted_regressor, K, y):
r"""
Checks that the dual coefficients of a fitted
regression model are compatible with the shapes
of K and y

:param regressor: sklearn-style regressor
:type regressor: object
:param K: kernel matrix with which to compare the
regression coefficients
:type K: array
:param y: target values with which to compare the
regression coefficients
:type y: array

"""
if fitted_regressor.dual_coef_.ndim != y.ndim:
raise ValueError(
"The target regressor has a shape incompatible "
"with the supplied target space"
)
elif fitted_regressor.dual_coef_.ndim == 1:
if fitted_regressor.dual_coef_.shape[0] != K.shape[0]:
raise ValueError(
"The target regressor has a shape incompatible "
"with the supplied sample space"
)
else:
if fitted_regressor.dual_coef_.shape[0] != K.shape[0]:
raise ValueError(
"The target regressor has a shape incompatible "
"with the supplied sample space"
)
elif fitted_regressor.dual_coef_.shape[1] != y.shape[1]:
raise ValueError(
"The target regressor has a shape incompatible "
"with the supplied target space"
)
@rosecers (Collaborator), May 12, 2021:

Shouldn't this all be caught if you try and pass X to regressor.predict? I feel like you could avoid all of these raised errors by relying on the errors already thrown by sklearn.

Suggested change: remove _check_coefs and _check_dual_coefs entirely (the block quoted above).

Collaborator:

You can use self._validate_data(X, y) from anything inheriting from BaseEstimator to check the dimensionality of X with respect to the coefficients. Checking y now.

Collaborator:

It doesn't look like there is a way to validate that the regressor outputs a y consistent with the supplied y, but self._validate_data(X, y) will flag an error if dim(y) is unexpected.

By the way, all of this applies to kernel regressors as well, so there's no need for separate functions. I would be happy if you just called fitted_regressor._validate_data(X, y) wherever you intended to call this function. That seems to be the approach in sklearn.
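
Put together, a check_krr_fit built on these suggestions could shrink to roughly the following (a sketch only: the reset=False flag and the overall structure are assumptions, not part of the suggested change itself):

from copy import deepcopy

from sklearn.base import clone
from sklearn.exceptions import NotFittedError
from sklearn.utils.validation import check_is_fitted


def check_krr_fit(regressor, K, y):
    # Rely on sklearn's own validation instead of hand-rolled shape checks
    try:
        check_is_fitted(regressor)
        fitted_regressor = deepcopy(regressor)
        # reset=False compares the number of features in K against the
        # n_features_in_ recorded at fit time, and checks that K and y have
        # consistent numbers of samples (this flag is an assumption, not part
        # of the suggestion above)
        fitted_regressor._validate_data(K, y, reset=False)
    except NotFittedError:
        # Unfitted regressor: clone it and fit on the supplied kernel
        fitted_regressor = clone(regressor)
        fitted_regressor.fit(K, y)
    return fitted_regressor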

Comment on lines 256 to 306
def test_incompatible_coef_shape(self):

# 1D properties (self.Y is 2D with two targets)
# X shape doesn't match
regressor = KernelRidge(alpha=1e-8, kernel="linear")
regressor.fit(self.X, self.Y[:, 0])
kpcovr = self.model(mixing=0.5, regressor=regressor)

with self.assertRaises(ValueError) as cm:
kpcovr.fit(self.X[0:-1], self.Y[0:-1, 0])
self.assertTrue(
str(cm.message),
"The target regressor has a shape incompatible "
"with the supplied sample space",
)

# >= 2D properties
# Y shape doesn't match
regressor = KernelRidge(alpha=1e-8, kernel="linear")
regressor.fit(self.X, self.Y[:, 0][:, np.newaxis])
kpcovr = self.model(mixing=0.5, regressor=regressor)

with self.assertRaises(ValueError) as cm:
kpcovr.fit(self.X, self.Y[:, 0])
self.assertTrue(
str(cm.message),
"The target regressor has a shape incompatible "
"with the supplied target space",
)

with self.assertRaises(ValueError) as cm:
kpcovr.fit(self.X, self.Y)
self.assertTrue(
str(cm.message),
"The target regressor has a shape incompatible "
"with the supplied target space",
)

# X shape doesn't match
regressor = KernelRidge(alpha=1e-8, kernel="linear")
regressor.fit(self.X, self.Y)
kpcovr = self.model(mixing=0.5, regressor=regressor)

with self.assertRaises(ValueError) as cm:
kpcovr.fit(self.X[0:-1], self.Y[0:-1])
self.assertTrue(
str(cm.message),
"The target regressor has a shape incompatible "
"with the supplied sample space",
)

Collaborator:

So much of this would be avoided by the suggestion above. All you'd have to do is check the regressor output shape

Comment on lines +434 to +455
def test_regressor_modifications(self):
regressor = Ridge(alpha=1e-8)
pcovr = self.model(mixing=0.5, regressor=regressor)

# PCovR regressor matches the original
self.assertTrue(regressor.get_params() == pcovr.regressor.get_params())

# PCovR regressor updates its parameters
# to match the original regressor
regressor.set_params(alpha=1e-6)
self.assertTrue(regressor.get_params() == pcovr.regressor.get_params())

# Fitting regressor outside PCovR fits the PCovR regressor
regressor.fit(self.X, self.Y)
self.assertTrue(hasattr(pcovr.regressor, "coef_"))

# PCovR regressor doesn't change after fitting
pcovr.fit(self.X, self.Y)
regressor.set_params(alpha=1e-4)
self.assertTrue(hasattr(pcovr.regressor_, "coef_"))
self.assertTrue(regressor.get_params() != pcovr.regressor_.get_params())

Collaborator:

Same here

Contributor (Author):

I think this should stay, since it's different from checking the regressor shape. These tests are to ensure consistency between the regressor and the KPCovR kernel parameters.

Collaborator:

Oops! I read this wrong -- my comment is more for lines 475-524

Contributor (Author):

Yes, I agree. If there already exist sklearn utilities that we could adopt to do these checks, then we could dramatically simplify these tests. We might still need something to keep the codecov bot happy, but the tests could be simplified.

"The target regressor has a shape incompatible "
"with the supplied feature space"
)
_check_coefs(regressor, X, y)
Collaborator:

Suggested change
_check_coefs(regressor, X, y)
regressor._validate_data(X, y)

try:
check_is_fitted(regressor)
fitted_regressor = deepcopy(regressor)
_check_dual_coefs(regressor, K, y)
Collaborator:

Suggested change
_check_dual_coefs(regressor, K, y)
fitted_regressor._validate_data(K, y)

@bhelfrecht bhelfrecht merged commit b4e06ae into scikit-learn-contrib:main May 12, 2021
@bhelfrecht bhelfrecht deleted the feat/kpcovr_fitted_regressor branch May 12, 2021 16:04

Successfully merging this pull request may close these issues.

Pre-fitted regressors with KPCovR