
Adhere to scikit-learn estimator interface #94

Closed
4 of 6 tasks
kiudee opened this issue Mar 18, 2020 · 9 comments · May be fixed by #116
Labels: enhancement (New feature or request), Priority: High
Milestone: 1.2

kiudee commented Mar 18, 2020

Rationale

Most of the learners implemented in cs-ranking already follow an interface similar to the one described in https://scikit-learn.org/stable/developers/develop.html#rolling-your-own-estimator, i.e., they usually implement fit and predict methods.
To let users plug all learners effortlessly into a scikit-learn pipeline.Pipeline or apply model_selection.GridSearchCV, we should make sure that all additional requirements are also fulfilled.

To do

  • Use get_params and set_params to get and set parameters. This is important, since GridSearchCV or BayesSearchCV call set_params during hyperparameter optimization. sklearn.base.BaseEstimator implements basic versions of both. The current way we handle hyperparameters should be deprecated.
  • Do not perform any parameter validation in __init__; do it in fit instead. set_params is supposed to do exactly the same thing as __init__ with respect to parameters.
  • Store init parameters unchanged as attributes of the same name. All attributes generated during fitting should carry a trailing underscore (_).
  • There should be no mandatory parameters; the user should be able to instantiate a learner without providing any arguments.
  • Implement a score method. This is helpful, since hyperparameter optimizers call it by default; otherwise the user has to supply a custom scorer.
  • Support cloning (sklearn.base.clone) for each learner.

Most of these changes are independent of each other and could be done using separate branches.
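The conventions above can be condensed into a minimal sketch. ToyRanker and its parameters are hypothetical stand-ins, not cs-ranking code; the point is that subclassing sklearn.base.BaseEstimator gives get_params/set_params for free, provided __init__ stores its arguments unchanged and defers validation to fit:

```python
import numpy as np
from sklearn.base import BaseEstimator


class ToyRanker(BaseEstimator):
    def __init__(self, n_hidden=16, learning_rate=0.001):
        # Store parameters unchanged; no defaults computed, no validation.
        self.n_hidden = n_hidden
        self.learning_rate = learning_rate

    def fit(self, X, y):
        # Validation happens here, not in __init__.
        if self.n_hidden <= 0:
            raise ValueError("n_hidden must be positive")
        # Attributes produced by fitting get a trailing underscore.
        self.coef_ = np.zeros(X.shape[-1])
        return self

    def predict(self, X):
        return X @ self.coef_
```

Because __init__ is a pure parameter store, BaseEstimator can introspect it, so `ToyRanker().get_params()` and `ToyRanker().set_params(n_hidden=8)` work without any extra code, which is exactly what GridSearchCV relies on.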

@kiudee kiudee added enhancement New feature or request Priority: Medium labels Mar 18, 2020
@kiudee kiudee added this to the 1.2 milestone Mar 18, 2020

kiudee commented Apr 15, 2020

One addendum to the above, which is related to the use of the models in pipelines:
We should think about how device placement for CPU/GPU works. How does the user specify on which device the model is to be executed? Will the fit and predict calls be wrapped in a context manager

with tf.device('/device:GPU:2'):
    # call fit here

and will this even work correctly with our current code?
Or do we pass device="..." to __init__, as is done in skorch?
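The second option can be sketched without TensorFlow. Here a stand-in context manager takes the place of tf.device (an assumption, purely for illustration; DeviceAwareRanker is hypothetical), showing how a device parameter passed to __init__ would flow into fit:

```python
from contextlib import contextmanager


@contextmanager
def device(name):
    # Stand-in for tf.device: in real code, ops created inside this
    # block would be placed on the named device.
    print(f"placing ops on {name}")
    yield


class DeviceAwareRanker:
    """Skorch-style sketch: the device is an ordinary __init__ parameter."""

    def __init__(self, device="/device:CPU:0"):
        self.device = device

    def fit(self, X, y):
        with device(self.device):
            # Model building and training would happen here.
            self.is_fitted_ = True
        return self
```

This keeps device placement a plain hyperparameter, so it survives get_params/set_params and cloning, which the context-manager-at-call-site option would not.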


timokau commented Apr 30, 2020

I'll take a stab at this. I don't have much experience with sklearn though, so it'll probably take a while to understand the desired API.

@timokau timokau self-assigned this Apr 30, 2020

timokau commented Apr 30, 2020

I have familiarized myself with the scikit-learn architecture. I think the best way forward is to create a test like this one for all our estimators and then gradually make it pass for each one.
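Such a test can lean on scikit-learn's own conformance suite. A minimal sketch (using LinearRegression as a stand-in, since it is known to conform; the cs-ranking learners would be substituted in):

```python
from sklearn.linear_model import LinearRegression
from sklearn.utils.estimator_checks import check_estimator

# check_estimator runs scikit-learn's battery of API-conformance checks
# against an estimator instance: a conforming estimator passes silently,
# a non-conforming one raises an exception.
check_estimator(LinearRegression())
```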


timokau commented Apr 30, 2020

Continued in #116.


timokau commented May 10, 2020

While working on #116, I noticed a pretty fundamental incompatibility between our API and scikit-learn's assumptions:

scikit-learn assumes X and Y are numpy arrays of dtype float64 / int64 and shape (n_samples, n_features) for X, (n_samples,) for Y. Since our "samples" are rankings or choice sets, that doesn't fit our task. We have an additional dimension.

We could of course try to flatten our data. The inputs could be concatenated together fairly straightforwardly, and the output could be treated as classification with each possible ranking being its own class. That's not particularly elegant, however, and it actually loses information in the case of the output.

The other option is to just adhere to the sklearn API as much as possible, but ignore their data shape assumptions. I do not know how much functionality that will break.
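To make the mismatch concrete, here is the extra axis in array terms (the sizes are made up for illustration):

```python
import numpy as np

n_instances, n_objects, n_features = 100, 5, 3

# One set of objects per instance: an extra "objects" axis in X.
X = np.random.rand(n_instances, n_objects, n_features)
# One ranking (a permutation of the objects) per instance: 2d Y.
Y = np.argsort(np.random.rand(n_instances, n_objects), axis=1)

# scikit-learn's input validation expects X.ndim == 2 and Y.ndim == 1;
# both arrays here violate that convention.
print(X.shape, Y.shape)  # (100, 5, 3) (100, 5)
```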


kiudee commented May 10, 2020

I noticed one problem with the data shape:
If you use the StandardScaler in a pipeline, it only works if you have a 2d array.

I am inclined to ignore the shape requirement. If need be, we can write appropriate transformers, which can transform the data in a pipeline.
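The transformer idea can be sketched as a reshape around StandardScaler: flatten the object axis so the scaler sees a 2d array, then restore the original shape. (scale_rankings is a hypothetical helper, not cs-ranking code.)

```python
import numpy as np
from sklearn.preprocessing import StandardScaler


def scale_rankings(X):
    """Standard-scale features of 3d ranking data of shape
    (n_instances, n_objects, n_features)."""
    n_instances, n_objects, n_features = X.shape
    # Collapse instances and objects into one sample axis for the scaler.
    flat = X.reshape(-1, n_features)
    scaled = StandardScaler().fit_transform(flat)
    # Restore the (n_instances, n_objects, n_features) layout.
    return scaled.reshape(n_instances, n_objects, n_features)
```

Wrapped in a sklearn.preprocessing.FunctionTransformer, a helper like this could slot into a Pipeline in front of a learner that expects 3d input.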


kiudee commented Mar 20, 2021

Is there anything still missing for this issue, or can we close it?


timokau commented Mar 22, 2021

Our scikit-learn API conformance is not perfect. Support for get_params(deep=True) is limited. Some of the scikit-learn estimator checks are not passing (see #116 for more context). We cannot pass all of them with our current API design, but as far as I recall there are still some things that could be improved.

I think the first four points of this issue are addressed. The score methods are not implemented. Cloning should be covered "for free" by the stateless initialization.

In summary: The most important parts should be covered and I expect that we will get better conformance by using skorch in the pytorch migration (#164). The biggest limitation is probably the get_params implementation.
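For context on what full deep=True support means, here is the behavior in standard scikit-learn components: nested estimators' parameters are reported under double-underscore keys, which is how GridSearchCV addresses them.

```python
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([("scale", StandardScaler()), ("model", LinearRegression())])

# deep=True recurses into nested estimators and exposes their parameters
# as "<step>__<param>" keys.
params = pipe.get_params(deep=True)
print("model__fit_intercept" in params)  # True
```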


kiudee commented Mar 22, 2021

Then I would say we can close this particular issue, since the issues with get_params can be worked around and we are going to switch to PyTorch soon anyway.

@kiudee kiudee closed this as completed Mar 22, 2021