
[WIP] Make our estimators compatible with scikit-learn #116

Draft
wants to merge 15 commits into master from sklearn-compatibility

Conversation

@timokau (Collaborator) commented Apr 30, 2020

Description

This is a work-in-progress of fixing #94. See that issue for motivation and context.

Currently the one added test is failing since FETADiscreteChoice does not yet conform to the interface. The commit is mostly just to get the PR started. I will incrementally fix the estimators and add them to the list of tested estimators.

Collecting and testing all of our estimators will be easier after #115 is done.

How Has This Been Tested?

Does this close/impact existing issues?

Fixes #94.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have added tests to cover my changes.

@timokau (Collaborator, Author) commented Apr 30, 2020

Currently our FETANetwork base class requires the mandatory n_objects and n_object_features parameters. According to sklearn's guidelines, those should be passed to fit instead. That effectively means that we should only construct the network in fit too, which will probably require quite some refactoring.
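A minimal sketch of that refactoring, assuming scikit-learn's conventions (the class and method names here are illustrative, not the actual csrank code): the constructor only stores hyperparameters, and the shape-dependent network construction moves into fit, where the data is available.

```python
import numpy as np
from sklearn.base import BaseEstimator


class FETASketch(BaseEstimator):
    """Illustrative estimator following scikit-learn's conventions."""

    def __init__(self, learning_rate=0.001):
        # Per sklearn's guidelines: only store parameters here; no
        # validation and no model construction in __init__.
        self.learning_rate = learning_rate

    def fit(self, X, Y):
        # Shape-dependent settings are inferred from the data in fit
        # instead of being mandatory constructor parameters.
        n_instances, n_objects, n_object_features = X.shape
        self.n_objects_ = n_objects
        self.n_object_features_ = n_object_features
        # The actual network would be constructed here, e.g.
        # self.model_ = self._construct_model(n_objects, n_object_features)
        return self
```

With this shape, `get_params`/`set_params` and `clone` work out of the box, since `BaseEstimator` derives them from the `__init__` signature.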

@codecov bot commented Apr 30, 2020

Codecov Report

Merging #116 into master will increase coverage by 0.30%.
The diff coverage is 0.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #116      +/-   ##
==========================================
+ Coverage   57.04%   57.35%   +0.30%     
==========================================
  Files         113      114       +1     
  Lines        6560     7027     +467     
==========================================
+ Hits         3742     4030     +288     
- Misses       2818     2997     +179     
Impacted Files Coverage Δ
csrank/tests/test_estimators.py 0.00% <0.00%> (ø)
csrank/discretechoice/ranknet_discrete_choice.py 83.78% <0.00%> (-16.22%) ⬇️
csrank/objectranking/baseline.py 55.00% <0.00%> (-14.24%) ⬇️
csrank/discretechoice/baseline.py 55.00% <0.00%> (-14.24%) ⬇️
csrank/discretechoice/cmpnet_discrete_choice.py 86.11% <0.00%> (-13.89%) ⬇️
csrank/objectranking/cmp_net.py 86.48% <0.00%> (-13.52%) ⬇️
csrank/objectranking/rank_net.py 86.48% <0.00%> (-13.52%) ⬇️
csrank/choicefunction/ranknet_choice.py 80.85% <0.00%> (-12.01%) ⬇️
csrank/core/cmpnet_core.py 87.50% <0.00%> (-10.21%) ⬇️
csrank/core/ranknet_core.py 87.50% <0.00%> (-10.21%) ⬇️
... and 62 more

@timokau timokau changed the title [WIP] Test FETADiscreteChoice estimator interface [WIP] Make our estimators compatible with scikit-learn May 2, 2020
@timokau
Copy link
Collaborator Author

timokau commented May 2, 2020

I've adapted FETALinear's __init__. I'll have to do the same for our other cores and estimators, and then take care of the other parts of the scikit-learn estimator interface.

@timokau (Collaborator, Author) commented May 12, 2020

This is turning out to be a lot more effort than expected. To make it somewhat manageable and reviewable I'll proceed as follows:

  • Gradually make all estimators pass the "default constructible" test. There is some highly repetitive work to be done here. The first part is moving all random state validation out of __init__. I'll split that repetitive work into separate PRs for easier review and faster merge.
  • Create a new test that calls check_estimator but blacklists all currently failing sub-checks.
  • Gradually fix the failing sub-checks, except for those that are inherently incompatible with our library.
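As a sketch of the second step (the check names in the blacklist are only examples, not the actual failing checks): check_estimator can yield its sub-checks individually, so known failures can be skipped by name.

```python
from functools import partial

from sklearn.utils.estimator_checks import check_estimator

# Example blacklist; the real one would list the sub-checks that
# currently fail for our estimators.
FAILING_CHECKS = {
    "check_fit2d_predict1d",
    "check_methods_subset_invariance",
}


def check_name(check):
    # Sub-checks are yielded as functools.partial objects wrapping the
    # underlying check function.
    return getattr(check, "func", check).__name__


def run_supported_checks(estimator):
    """Run all estimator checks except the blacklisted ones."""
    for est, check in check_estimator(estimator, generate_only=True):
        if check_name(check) in FAILING_CHECKS:
            continue
        check(est)
```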

@timokau timokau mentioned this pull request May 12, 2020
@timokau (Collaborator, Author) commented May 12, 2020

The first part is ready for review: #117

@timokau (Collaborator, Author) commented May 22, 2020

I have rebased this on top of #118. After #118, the biggest remaining blocker for the default-constructible check is the optimizer parameter of our estimators. See #119 for that.

@timokau (Collaborator, Author) commented May 26, 2020

Rebased, now that #118 is merged.

@kiudee kiudee added this to the 1.3 milestone Jun 5, 2020
@timokau timokau force-pushed the sklearn-compatibility branch 2 times, most recently from 45267a0 to 352cba0 on June 27, 2020 17:25
@timokau (Collaborator, Author) commented Jun 27, 2020

I have continued work on top of #119 to avoid merge conflicts. We're getting very close to passing the default_constructible test! Besides scikit-learn/scikit-learn#17756, with the fixes in this PR and #119 the only remaining failure is due to the kernel_regularizer parameter. We'll need to give it the same treatment as optimizer.

I already started by removing all default values for the regularizer parameters (82a2869). @kiudee do you think that is the right call for the regularizers as well?
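To illustrate why instantiated defaults for such parameters are problematic (a toy sketch, not the actual csrank or Keras code): a default like `optimizer=SGD()` is a single mutable object shared by every instance, which interacts badly with scikit-learn's clone(); defaulting to None and resolving it in fit gives every fit its own fresh instance.

```python
from sklearn.base import BaseEstimator, clone


class DummyOptimizer:
    """Stand-in for a stateful optimizer object (e.g. a Keras optimizer)."""

    def __init__(self, lr=0.01):
        self.lr = lr


class OptimizerSketch(BaseEstimator):
    def __init__(self, optimizer=None):
        # Store the parameter exactly as passed; no mutable default
        # instance baked into the signature.
        self.optimizer = optimizer

    def fit(self, X, Y=None):
        # Resolve the default lazily so each fit gets a fresh optimizer
        # rather than sharing one object across clones.
        self.optimizer_ = (
            self.optimizer if self.optimizer is not None else DummyOptimizer()
        )
        return self
```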

I had to patch scikit-learn to get more useful/actionable error messages in the test. I'll try to upstream those improvements once scikit-learn/scikit-learn#17756 is resolved.

@timokau (Collaborator, Author) commented Jul 16, 2020

All our estimators are default constructible now and the test passes with scikit-learn/scikit-learn#17936 🎉

That means we cannot add the test until that PR is merged and a new sklearn release is out, though.

@timokau timokau mentioned this pull request Aug 1, 2020
@timokau (Collaborator, Author) commented Aug 7, 2020

The PR was merged upstream. I'm not sure if it'll make it into 0.23.3 or if we have to wait for 0.24. Either way, there should be a stable version with the necessary patch at some point.

@timokau timokau mentioned this pull request Aug 21, 2020
@timokau (Collaborator, Author) commented Sep 17, 2020

Good news and bad news.

Good news: After #159, we are now compliant with "no attributes in init" 🎉

Bad news: Basically all other checks are unusable as long as our fit function doesn't match the data dimensionality that scikit-learn assumes. I have updated this PR and blacklisted all the checks that do not work. So we either have to think again about how we can make our fit function fit (pun intended) or ignore all those checks.

@timokau (Collaborator, Author) commented Sep 17, 2020

Just to re-state the problem (we talked about this in some other PR/issue, but I can't remember which):

According to scikit-learn, fit should expect an X of shape (n_samples, n_features). For us, however, every sample consists of a ranking, which contains multiple objects. Therefore, X has the shape (n_instances, n_objects, n_features).

It should be possible to flatten each ranking instance into a single feature array to please scikit-learn. We could write functions to transform between the two formats. The question is if that is what we want.
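Assuming each instance has the same number of objects, the two converters could be as simple as a reshape (a sketch, not existing csrank code):

```python
import numpy as np


def flatten_rankings(X):
    """(n_instances, n_objects, n_features) -> (n_instances, n_objects * n_features)."""
    n_instances, n_objects, n_features = X.shape
    return X.reshape(n_instances, n_objects * n_features)


def unflatten_rankings(X_flat, n_objects):
    """Inverse of flatten_rankings, given the number of objects per instance."""
    n_instances, flat_width = X_flat.shape
    return X_flat.reshape(n_instances, n_objects, flat_width // n_objects)
```

The catch is that n_objects becomes implicit in the flat format, so it would have to be carried around separately (or fixed per estimator).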

CC @kiudee @prithagupta

@kiudee (Owner) commented Sep 17, 2020

> Good news: After #159, we are now compliant with "no attributes in init" 🎉

🥳

> Bad news: Basically all other checks are unusable as long as our fit function doesn't match the data dimensionality that scikit-learn assumes. I have updated this PR and blacklisted all the checks that do not work. So we either have to think again about how we can make our fit function fit (pun intended) or ignore all those checks.

I think that (writing converters and making the flat format the default) would compromise a bit too much. Our setting is different, and it would not make sense to force it into that format. It would only buy us compatibility with a few preprocessors (let me know if I forgot an important use case), which likely do not make sense for our setting anyway. Is there anything important we would lose?

@timokau (Collaborator, Author) commented Sep 18, 2020

I tend to agree. I will look into writing a wrapper anyway, which will at least enable us to make use of the remaining checks.

@timokau (Collaborator, Author) commented Sep 18, 2020

Okay, I have done that. Just for rankers for now; I'll fix the checks there first (of course they don't all pass yet). I have already added some small fixes to this PR. The checks are very slow, though, so we will probably not run them as part of the regular test suite.

@timokau (Collaborator, Author) commented Oct 1, 2020

Now that we use poetry & nix, I have included the environment with the patched scikit-learn that is necessary to run these tests. A simple nix-shell will load it (though it will require some build time). It includes a backport of scikit-learn/scikit-learn#17936.

Successfully merging this pull request may close these issues.

Adhere to scikit-learn estimator interface