Determine data dimensions lazily on fit instead of on init #118
Conversation
Force-pushed from 9267127 to 5c87953
Okay, I think/hope I found and fixed all instances now and removed mentions of the data dimensionality from the init docstrings. Tests pass. Still needs cleanup.
Force-pushed from befafc2 to e053a05
Codecov Report
```diff
@@            Coverage Diff             @@
##           master     #118      +/-   ##
==========================================
- Coverage   60.18%   60.01%   -0.18%
==========================================
  Files         116      116
  Lines        7656     7642      -14
==========================================
- Hits         4608     4586      -22
- Misses       3048     3056       +8
```
I have cleaned everything up and reviewed my changes. This is ready for another pair of eyes now.

This is quite a big change. The basic idea is to get rid of the required data-dimension constructor parameters and infer the dimensionality from the data instead.

The repetitiveness of some of the changes shows that there is a lot of duplicate code. I haven't introduced any new abstractions for now, because of the risk of introducing errors and because some big refactor / partial rewrite is on the horizon anyway (either the tf2 update or a switch to PyTorch).

I hope we can get this merged relatively quickly, since it touches a lot of code and will otherwise likely cause merge conflicts. Keep in mind that I've already reviewed all the changes myself, so a second review doesn't have to be too detailed. I have left two comments on things that may need more review in the diff.

We can ignore CodeCov. It looks like there is still some configuration issue; if anything, this change should increase coverage by decreasing the lines of code.
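The pattern described above can be sketched roughly like this (a minimal illustration under my own assumptions, not the actual library code; the class and attribute names are hypothetical):

```python
import numpy as np


class LearnerSketch:
    """Minimal sketch of the new pattern: no data dimensions in __init__."""

    def __init__(self, learning_rate=1e-3):
        # Only hyperparameters with sensible defaults, as the
        # scikit-learn estimator API requires.
        self.learning_rate = learning_rate

    def fit(self, X, Y):
        # Derive the data dimensionality lazily from the training data
        # instead of requiring it at construction time.
        _, self.n_objects_fit_, self.n_object_features_fit_ = X.shape
        # ... construct and train the actual model here ...
        return self


learner = LearnerSketch().fit(np.zeros((10, 5, 3)), np.zeros((10, 5)))
print(learner.n_objects_fit_, learner.n_object_features_fit_)  # 5 3
```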
Force-pushed from fa27c65 to e053a05
Looks good to me. The only thing that needs to be fixed is the `n_top` in `ListNet`.
Currently we require the data dimensionality to be passed into the `FETALinear` constructor. `fit` then checks that it matches the dimensionality of the data. This is somewhat redundant and not compatible with the scikit-learn estimator API. We were already creating the model in the `fit` function; it was also created when updating the hyperparameters, but that is redundant. In `fit` we can simply derive the dimensionality from the data.
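A hedged sketch of what that change boils down to (hypothetical names, not the actual `FETALinear` implementation):

```python
import numpy as np


class FETALinearSketch:
    # Before: __init__(self, n_object_features, ...) stored the dimension
    # and fit() only checked that X.shape[-1] matched it.
    # After: fit() derives the dimension and builds the model itself.

    def fit(self, X, Y):
        # X is assumed to have shape (n_instances, n_objects, n_object_features).
        n_objects, n_object_features = X.shape[1], X.shape[2]
        self._model = self._construct_model(n_objects, n_object_features)
        # ... run the actual training loop here ...
        return self

    def _construct_model(self, n_objects, n_object_features):
        # Placeholder standing in for the real model construction.
        return {"n_objects": n_objects, "n_object_features": n_object_features}
```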
Force-pushed from d5a5679 to fdcec9f
After the last few commits the learners no longer need that information at initialization time. Instead, they determine it from the data when fitting.
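One consequence worth noting: some learners also need the dimensionality at prediction time, so the fit-time value is kept in an instance attribute. A rough sketch of that pattern (names and the mismatch handling are hypothetical):

```python
import numpy as np


class StoresFitDimensions:
    def fit(self, X, Y=None):
        # Remember the number of objects the model was fit on; predict()
        # may later be called on data with a different second dimension.
        self.n_objects_fit_ = X.shape[1]
        return self

    def predict(self, X):
        if X.shape[1] != self.n_objects_fit_:
            # Hypothetical handling: a real learner might instead pad,
            # subsample, or rebuild parts of the model here.
            raise ValueError(
                f"fit on {self.n_objects_fit_} objects, "
                f"asked to predict on {X.shape[1]}"
            )
        return np.zeros(len(X))
```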
Force-pushed from fdcec9f to 7c0eef0
Description
Continuation of #94.
This is a work in progress. I'm putting this out there since it is likely I won't have any time left to finish it this week.
Things to do:
- Unify all the changes. Initially I started off by just inferring the data dimensionality from the current data, without storing it in an instance attribute. This requires more invasive refactoring, however, since it means the dimension will only be available from the `fit` function, and `fit` has to pass it along as parameters. It doesn't always work either, since sometimes the dimensionality is needed at prediction time, and then it can matter whether the dimensionality of the data we're predicting on differs from the dimensionality of the data we trained on. Therefore I started storing the dimensionality the current model was fit on in instance attributes. I should adapt the earlier changes to this style to increase uniformity and reduce the chance of error.
- Verify that pre-commit hooks and the full test suite pass after every commit.
- Find all initializations in which the dimensionality is passed to the estimator. Currently this doesn't throw an error since the estimators accept `**kwargs`. Still, the calls should be adapted.
- Double-check whether any more documentation needs to be modified to remove references to `n_objects` and `n_object_features`.

Motivation and Context
The `scikit-learn` estimator API only allows parameters with default values for `__init__`. There are no sensible defaults for the data dimensions. Further, it's just unnecessary to fix the data dimensions on init. The developer experience is much nicer when they can simply be inferred from the data passed to `fit`.

How Has This Been Tested?
Ran the full test suite.
Does this close/impact existing issues?
Impacts #94, #116