Determine data dimensions lazily on fit instead of on init #118
Conversation
Force-pushed from 9267127 to 5c87953
Okay, I think/hope I found and fixed all instances now and removed mentions of the data dimensionality from the init docstrings. Tests pass. Still needs cleanup.
Force-pushed from befafc2 to e053a05
Codecov Report
```diff
@@            Coverage Diff             @@
##           master     #118      +/-   ##
==========================================
- Coverage   60.18%   60.01%   -0.18%
==========================================
  Files         116      116
  Lines        7656     7642      -14
==========================================
- Hits         4608     4586      -22
- Misses       3048     3056       +8
```
I have cleaned everything up and reviewed my changes. This is ready for another pair of eyes now.

This is quite a big change. The basic idea is to get rid of the required data-dimension constructor parameters and infer the dimensionality from the data instead.

The repetitiveness of some of the changes shows that there is a lot of duplicate code. I haven't introduced any new abstractions for now, because of the risk of introducing errors and because some big refactor / partial rewrite is on the horizon anyway (either the tf2 update or a switch to PyTorch).

I hope we can get this merged relatively quickly, since it touches a lot of code and will otherwise likely cause merge conflicts. Keep in mind that I've already reviewed all the changes myself, so a second review doesn't have to be too detailed. I have left two comments on things that may need more review in the diff.

We can ignore CodeCov. It looks like there is still some configuration issue; if anything, this change should increase coverage by decreasing the lines of code.
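The pattern described above can be sketched roughly like this (a minimal illustration under my own assumptions, not the actual library code; the class and attribute names are hypothetical):

```python
import numpy as np


class LearnerSketch:
    """Minimal sketch of the new pattern: no data dimensions in __init__."""

    def __init__(self, learning_rate=1e-3):
        # Only hyperparameters with sensible defaults, as the
        # scikit-learn estimator API requires.
        self.learning_rate = learning_rate

    def fit(self, X, Y):
        # Derive the data dimensionality lazily from the training data
        # instead of requiring it at construction time.
        _, self.n_objects_fit_, self.n_object_features_fit_ = X.shape
        # ... construct and train the actual model here ...
        return self


learner = LearnerSketch().fit(np.zeros((10, 5, 3)), np.zeros((10, 5)))
print(learner.n_objects_fit_, learner.n_object_features_fit_)  # 5 3
```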
Force-pushed from fa27c65 to e053a05
Looks good to me. The only thing that needs to be fixed is the `n_top` in `ListNet`.
Currently we require the data dimensionality to be passed into the `FETALinear` constructor. `fit` then checks that it matches the dimensionality of the data. This is somewhat redundant and not compatible with the scikit-learn estimator API. We were already creating the model in the `fit` function; it was also created when updating the hyperparameters, but that is redundant. In `fit` we can simply derive the dimensionality from the data.
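A hedged sketch of what that change boils down to (hypothetical names, not the actual `FETALinear` implementation):

```python
import numpy as np


class FETALinearSketch:
    # Before: __init__(self, n_object_features, ...) stored the dimension
    # and fit() only checked that X.shape[-1] matched it.
    # After: fit() derives the dimension and builds the model itself.

    def fit(self, X, Y):
        # X is assumed to have shape (n_instances, n_objects, n_object_features).
        n_objects, n_object_features = X.shape[1], X.shape[2]
        self._model = self._construct_model(n_objects, n_object_features)
        # ... run the actual training loop here ...
        return self

    def _construct_model(self, n_objects, n_object_features):
        # Placeholder standing in for the real model construction.
        return {"n_objects": n_objects, "n_object_features": n_object_features}
```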
Force-pushed from d5a5679 to fdcec9f
After the last few commits the learners no longer need that information at initialization time. Instead, they determine it from the data when fitting.
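One consequence worth noting: some learners also need the dimensionality at prediction time, so the fit-time value is kept in an instance attribute. A rough sketch of that pattern (names and the mismatch handling are hypothetical):

```python
import numpy as np


class StoresFitDimensions:
    def fit(self, X, Y=None):
        # Remember the number of objects the model was fit on; predict()
        # may later be called on data with a different second dimension.
        self.n_objects_fit_ = X.shape[1]
        return self

    def predict(self, X):
        if X.shape[1] != self.n_objects_fit_:
            # Hypothetical handling: a real learner might instead pad,
            # subsample, or rebuild parts of the model here.
            raise ValueError(
                f"fit on {self.n_objects_fit_} objects, "
                f"asked to predict on {X.shape[1]}"
            )
        return np.zeros(len(X))
```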
Force-pushed from fdcec9f to 7c0eef0
Description
Continuation of #94.
This is a work in progress. I'm putting this out there since it is likely I won't have any time left to finish it this week.
Things to do:
- Unify all the changes. Initially I started off by just inferring the data dimensionality from the current data, without storing it in an instance attribute. This requires more invasive refactoring, however, since it means the dimension will only be available from the `fit` function, and `fit` has to pass it along as parameters. It doesn't always work either, since sometimes the dimensionality is needed at prediction time, and then it can matter whether the dimensionality of the data we're predicting on differs from the dimensionality of the data we trained on. Therefore I started storing the dimensionality the current model was fit on in instance attributes. I should adapt the earlier changes to this style to increase uniformity and reduce the chance of error.
- Verify that pre-commit hooks and the full test suite pass after every commit.
- Find all initializations in which the dimensionality is passed to the estimator. Currently this doesn't throw an error since the estimators accept `**kwargs`. Still, the calls should be adapted.
- Double-check whether any more documentation needs to be modified to remove references to `n_objects` and `n_object_features`.

Motivation and Context
The `scikit-learn` estimator API only allows parameters with default values for `__init__`. There are no sensible defaults for the data dimensions. Further, it's just unnecessary to fix the data dimensions on init. The developer experience is much nicer when they can simply be inferred from the data passed to `fit`.

How Has This Been Tested?
Ran the full test suite.
Does this close/impact existing issues?
Impacts #94, #116