
Adhere to scikit-learn estimator interface #94

Closed
4 of 6 tasks
kiudee opened this issue Mar 18, 2020 · 9 comments · May be fixed by #116
Labels: enhancement (New feature or request), Priority: High
Milestone: 1.2

kiudee commented Mar 18, 2020

Rationale

Most of the learners implemented in cs-ranking already follow an interface similar to the one described in https://scikit-learn.org/stable/developers/develop.html#rolling-your-own-estimator, i.e., they usually implement fit and predict methods.
To let users plug all learners effortlessly into a scikit-learn pipeline.Pipeline or apply model_selection.GridSearchCV, we should make sure that all additional requirements are also fulfilled.

To do

  • Use get_params and set_params to get and set parameters. This is important, since GridSearchCV or BayesSearchCV call set_params during hyperparameter optimization. sklearn.base.BaseEstimator implements basic versions of both. The current way we handle hyperparameters should be deprecated.
  • Do not perform any parameter validation in __init__; do it in fit instead. set_params is supposed to do exactly the same thing as __init__ with respect to parameters.
  • Store init parameters unchanged as attributes of the same name. All attributes generated during fitting should carry a trailing underscore (_).
  • There should be no mandatory parameters; the user should be able to instantiate a learner without providing any arguments.
  • Implement a score method. This is helpful, since hyperparameter optimizers call it by default; otherwise the user has to supply a custom scorer.
  • Support cloning (sklearn.base.clone) for each learner.

Most of these changes are independent of each other and could be done using separate branches.
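The conventions above can be condensed into a minimal sketch. ToyRanker and its parameters are hypothetical stand-ins, not cs-ranking code; the point is that subclassing sklearn.base.BaseEstimator gives get_params/set_params for free, provided __init__ stores its arguments unchanged and defers validation to fit:

```python
import numpy as np
from sklearn.base import BaseEstimator


class ToyRanker(BaseEstimator):
    def __init__(self, n_hidden=16, learning_rate=0.001):
        # Store parameters unchanged; no defaults computed, no validation.
        self.n_hidden = n_hidden
        self.learning_rate = learning_rate

    def fit(self, X, y):
        # Validation happens here, not in __init__.
        if self.n_hidden <= 0:
            raise ValueError("n_hidden must be positive")
        # Attributes produced by fitting get a trailing underscore.
        self.coef_ = np.zeros(X.shape[-1])
        return self

    def predict(self, X):
        return X @ self.coef_
```

Because __init__ is a pure parameter store, BaseEstimator can introspect it, so `ToyRanker().get_params()` and `ToyRanker().set_params(n_hidden=8)` work without any extra code, which is exactly what GridSearchCV relies on.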

@kiudee kiudee added enhancement New feature or request Priority: Medium labels Mar 18, 2020
@kiudee kiudee added this to the 1.2 milestone Mar 18, 2020

kiudee commented Apr 15, 2020

One addendum to the above, which is related to the use of the models in pipelines:
We should think about how device placement for CPU/GPU works. How does the user specify on which device the model is to be executed? Will the fit and predict calls be wrapped in a context manager

with tf.device('/device:GPU:2'):
    # call fit here

and will this even work correctly with our current code?
Or do we pass device="..." to __init__, as is done in skorch?
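The second option can be sketched without TensorFlow. Here a stand-in context manager takes the place of tf.device (an assumption, purely for illustration; DeviceAwareRanker is hypothetical), showing how a device parameter passed to __init__ would flow into fit:

```python
from contextlib import contextmanager


@contextmanager
def device(name):
    # Stand-in for tf.device: in real code, ops created inside this
    # block would be placed on the named device.
    print(f"placing ops on {name}")
    yield


class DeviceAwareRanker:
    """Skorch-style sketch: the device is an ordinary __init__ parameter."""

    def __init__(self, device="/device:CPU:0"):
        self.device = device

    def fit(self, X, y):
        with device(self.device):
            # Model building and training would happen here.
            self.is_fitted_ = True
        return self
```

This keeps device placement a plain hyperparameter, so it survives get_params/set_params and cloning, which the context-manager-at-call-site option would not.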


timokau commented Apr 30, 2020

I'll take a stab at this. I don't have much experience with sklearn though, so it'll probably take a while to understand the desired API.

@timokau timokau self-assigned this Apr 30, 2020

timokau commented Apr 30, 2020

I have familiarized myself with the scikit-learn architecture. I think the best way forward is to create a test like this one for all our estimators and then gradually make it pass for each one.
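Such a test can lean on scikit-learn's own conformance suite. A minimal sketch (using LinearRegression as a stand-in, since it is known to conform; the cs-ranking learners would be substituted in):

```python
from sklearn.linear_model import LinearRegression
from sklearn.utils.estimator_checks import check_estimator

# check_estimator runs scikit-learn's battery of API-conformance checks
# against an estimator instance: a conforming estimator passes silently,
# a non-conforming one raises an exception.
check_estimator(LinearRegression())
```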


timokau commented Apr 30, 2020

Continued in #116.


timokau commented May 10, 2020

While working on #116, I noticed a pretty fundamental incompatibility between our API and scikit-learn's assumptions:

scikit-learn assumes X and Y are numpy arrays of dtype float64 / int64 and shape (n_samples, n_features) for X, (n_samples,) for Y. Since our "samples" are rankings or choice sets, that doesn't fit our task. We have an additional dimension.

We could of course try to flatten our data. The inputs could be concatenated together fairly straightforwardly, and the output could be treated as classification with each possible ranking being its own class. That's not particularly elegant, however, and it actually loses information in the case of the output.

The other option is to just adhere to the sklearn API as much as possible, but ignore their data shape assumptions. I do not know how much functionality that will break.
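To make the mismatch concrete, here is the extra axis in array terms (the sizes are made up for illustration):

```python
import numpy as np

n_instances, n_objects, n_features = 100, 5, 3

# One set of objects per instance: an extra "objects" axis in X.
X = np.random.rand(n_instances, n_objects, n_features)
# One ranking (a permutation of the objects) per instance: 2d Y.
Y = np.argsort(np.random.rand(n_instances, n_objects), axis=1)

# scikit-learn's input validation expects X.ndim == 2 and Y.ndim == 1;
# both arrays here violate that convention.
print(X.shape, Y.shape)  # (100, 5, 3) (100, 5)
```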


kiudee commented May 10, 2020

I noticed one problem with the data shape:
If you use the StandardScaler in a pipeline, it only works if you have a 2d array.

I am inclined to ignore the shape requirement. If need be, we can write appropriate transformers, which can transform the data in a pipeline.
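The transformer idea can be sketched as a reshape around StandardScaler: flatten the object axis so the scaler sees a 2d array, then restore the original shape. (scale_rankings is a hypothetical helper, not cs-ranking code.)

```python
import numpy as np
from sklearn.preprocessing import StandardScaler


def scale_rankings(X):
    """Standard-scale features of 3d ranking data of shape
    (n_instances, n_objects, n_features)."""
    n_instances, n_objects, n_features = X.shape
    # Collapse instances and objects into one sample axis for the scaler.
    flat = X.reshape(-1, n_features)
    scaled = StandardScaler().fit_transform(flat)
    # Restore the (n_instances, n_objects, n_features) layout.
    return scaled.reshape(n_instances, n_objects, n_features)
```

Wrapped in a sklearn.preprocessing.FunctionTransformer, a helper like this could slot into a Pipeline in front of a learner that expects 3d input.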


kiudee commented Mar 20, 2021

Is there anything still missing for this issue, or can we close it?


timokau commented Mar 22, 2021

Our scikit-learn API conformance is not perfect. Support for get_params(deep=True) is limited. Some of the scikit-learn estimator checks are not passing (see #116 for more context). We cannot pass all of them with our current API design, but as far as I recall there are still some things that could be improved.

I think the first four points of this issue are addressed. The score methods are not implemented. Cloning should be covered "for free" by the stateless initialization.

In summary: The most important parts should be covered and I expect that we will get better conformance by using skorch in the pytorch migration (#164). The biggest limitation is probably the get_params implementation.
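For context on what full deep=True support means, here is the behavior in standard scikit-learn components: nested estimators' parameters are reported under double-underscore keys, which is how GridSearchCV addresses them.

```python
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([("scale", StandardScaler()), ("model", LinearRegression())])

# deep=True recurses into nested estimators and exposes their parameters
# as "<step>__<param>" keys.
params = pipe.get_params(deep=True)
print("model__fit_intercept" in params)  # True
```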


kiudee commented Mar 22, 2021

Then I would say we can close this particular issue, since the issues with get_params can be worked around and we are going to switch to PyTorch soon anyway.

@kiudee kiudee closed this as completed Mar 22, 2021