Determine data dimensions lazily on fit instead on init #118

Merged
Merged 16 commits into kiudee:master from lazy-data-dimension on May 26, 2020

Conversation

timokau
Collaborator

@timokau timokau commented May 14, 2020

Description

Continuation of #94.

This is a work in progress. I'm putting this out there since it is likely I won't have any time left to finish it this week.

Things to do:

  • Unify all the changes. Initially I started off by just inferring the data dimensionality from the current data, without storing it in an instance attribute. That requires more invasive refactoring, however, since the dimensionality is then only available inside the fit function and has to be passed along as a parameter. It also doesn't always work: sometimes the dimensionality is needed at prediction time, and then it can matter whether the data we're predicting on has a different dimensionality than the data we trained on. I therefore switched to storing the dimensionality the current model was fit on in instance attributes (see the sketch after this list). I should adapt the earlier changes to this style to increase uniformity and reduce the chance of errors.
  • Verify that pre-commit hooks and the full test suite pass after every commit.
  • Find all initializations in which the dimensionality is passed to the estimator. Currently this doesn't throw an error since the estimators accept **kwargs. Still, the calls should be adapted.
  • Double-check if any more documentation needs to be modified to remove references to n_objects and n_object_features.
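
As referenced in the first item above, here is a minimal sketch of the instance-attribute style; the attribute names (n_objects_fit_, n_object_features_fit_) are illustrative and not necessarily the exact csrank API:

```python
class Learner:
    def fit(self, X, Y, **kwargs):
        # Remember the dimensionality of the data the model was fit on,
        # instead of requiring it at construction time.
        _, self.n_objects_fit_, self.n_object_features_fit_ = X.shape
        # ... build and train the model using the stored dimensions ...
        return self

    def predict(self, X, **kwargs):
        # At prediction time the number of objects per instance may differ
        # from fit time; the stored attributes make such checks explicit.
        if X.shape[-1] != self.n_object_features_fit_:
            raise ValueError("Feature dimensionality differs from fit time.")
        # ... predict using self.n_objects_fit_ where needed ...
```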

Motivation and Context

The scikit-learn estimator API only allows parameters with default values for __init__, and there are no sensible defaults for the data dimensions. Furthermore, it is simply unnecessary to fix the data dimensions at init time. The developer experience is much nicer when they are inferred from the data passed to fit.
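
For illustration only, a rough before/after usage sketch; the learner class, data shapes, and dummy targets are assumptions made for the example, not prescriptive:

```python
import numpy as np
from csrank import FETAObjectRanker  # any csrank learner; used here only as an example

X_train = np.random.rand(100, 5, 10)        # (n_instances, n_objects, n_object_features)
Y_train = np.tile(np.arange(5), (100, 1))   # dummy rankings, purely illustrative

# Before: the data dimensions had to be passed to the constructor.
# ranker = FETAObjectRanker(n_objects=5, n_object_features=10)

# After: the constructor takes no data dimensions; they are inferred in fit.
ranker = FETAObjectRanker()
ranker.fit(X_train, Y_train)
```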

How Has This Been Tested?

Ran the full test suite.

Does this close/impact existing issues?

Impacts #94, #116

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have added tests to cover my changes.

@timokau timokau marked this pull request as draft May 14, 2020 15:52
@timokau timokau changed the title [WIP] determine data dimensions lazily on fit instead on init [WIP] Determine data dimensions lazily on fit instead on init May 14, 2020
@timokau
Collaborator Author

timokau commented May 20, 2020

Okay, I think/hope I found and fixed all instances now and removed mentions of the data dimensionality from the init docstrings. Tests pass. Still needs cleanup.

@timokau timokau force-pushed the lazy-data-dimension branch 3 times, most recently from befafc2 to e053a05 Compare May 20, 2020 21:12
@codecov

codecov bot commented May 20, 2020

Codecov Report

Merging #118 into master will decrease coverage by 0.17%.
The diff coverage is 84.46%.


@@            Coverage Diff             @@
##           master     #118      +/-   ##
==========================================
- Coverage   60.18%   60.01%   -0.18%     
==========================================
  Files         116      116              
  Lines        7656     7642      -14     
==========================================
- Hits         4608     4586      -22     
- Misses       3048     3056       +8     
Impacted Files Coverage Δ
csrank/choicefunction/fate_choice.py 91.30% <ø> (ø)
csrank/choicefunction/fatelinear_choice.py 92.85% <ø> (ø)
csrank/choicefunction/fetalinear_choice.py 92.85% <ø> (ø)
csrank/discretechoice/fate_discrete_choice.py 94.44% <ø> (ø)
...srank/discretechoice/fatelinear_discrete_choice.py 100.00% <ø> (ø)
...srank/discretechoice/fetalinear_discrete_choice.py 100.00% <ø> (ø)
csrank/objectranking/expected_rank_regression.py 91.93% <ø> (-0.26%) ⬇️
csrank/objectranking/fate_object_ranker.py 93.33% <ø> (ø)
csrank/objectranking/fatelinear_object_ranker.py 100.00% <ø> (ø)
csrank/objectranking/feta_object_ranker.py 92.85% <ø> (ø)
... and 33 more

@timokau timokau changed the title [WIP] Determine data dimensions lazily on fit instead on init Determine data dimensions lazily on fit instead on init May 20, 2020
@timokau timokau marked this pull request as ready for review May 20, 2020 21:46
@timokau
Collaborator Author

timokau commented May 20, 2020

I have cleaned everything up and reviewed my changes. This is ready for another pair of eyes now.

This is quite a big change. The basic idea is to get rid of the required n_objects and n_object_features arguments in our Learner constructors; we can infer that information from the data that is passed to fit. As a consequence, some other initializations that depend on the dimensionality have to be delayed to fit as well. Technically it's a breaking change, though in practice it doesn't make much of a difference, since all our learners quietly "swallow" additional unused arguments to init anyway. To keep the potential for breakage as small as possible, I kept the changes minimally invasive. It's tempting to do some fixups along the way, but that would be harder to do right with so many changes in so many places.
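
To make the pattern concrete, here is a minimal sketch of what delaying dimension-dependent initialization to fit looks like; the class and method names are made up for illustration and are not the exact csrank internals:

```python
class SomeLinearLearner:
    def __init__(self, learning_rate=1e-3):
        # Only real hyperparameters with defaults belong here, as required by
        # the scikit-learn estimator API -- no n_objects / n_object_features.
        self.learning_rate = learning_rate

    def _construct_model(self, n_objects, n_object_features):
        # Everything whose shape depends on the data dimensions is built here,
        # so construction can happen lazily once the data is known.
        ...

    def fit(self, X, Y, **kwargs):
        n_instances, n_objects, n_object_features = X.shape
        self.n_object_features_fit_ = n_object_features
        self.model_ = self._construct_model(n_objects, n_object_features)
        # ... train self.model_ on (X, Y) ...
        return self
```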

The repetitiveness of some of the changes shows that there is a lot of duplicated code. I haven't introduced any new abstractions for now, because of the risk of introducing errors and because some big refactor / partial rewrite is on the horizon anyway (either the tf2 update or a switch to PyTorch).

I hope we can get this merged relatively quickly, since it touches a lot of code and will otherwise likely cause merge conflicts. Keep in mind that I've already reviewed all the changes myself, so a second review doesn't have to be too detailed. I have left two comments in the diff on the parts that may need closer review.

We can ignore Codecov; it looks like there is still some configuration issue. If anything, this change should increase coverage, since it decreases the number of lines of code.

Owner

@kiudee kiudee left a comment


Looks good to me. The only thing which needs to be fixed is the n_top in ListNet.

Currently we require the data dimensionality to be passed into the
FETALinear constructor. It is then checked that it matches the
dimensionality of the data in `fit`. This is somewhat redundant and not
compatible with the scikit-learn estimator API.

We were already creating the model in the `fit` function. It was also
created when updating the hyperparameters, but that is redundant. In the
`fit` function we can simply derive the dimensionality from the data.
@timokau timokau force-pushed the lazy-data-dimension branch 2 times, most recently from d5a5679 to fdcec9f Compare May 25, 2020 16:03
@timokau timokau merged commit 5bdc0e4 into kiudee:master May 26, 2020
@timokau timokau deleted the lazy-data-dimension branch May 26, 2020 13:35
@timokau timokau mentioned this pull request Jun 4, 2020