
Require uninitialized optimizers for our learners #119

Merged
timokau merged 2 commits into kiudee:master from timokau:no-direct-optimizer on Jul 1, 2020

Conversation

@timokau (Collaborator) commented May 22, 2020

Description

An initialized optimizer is a tensorflow object, which (at least in
graph mode in tf1) is not deepcopy-able. Even if we were able to
deepcopy it, we probably wouldn't want to, since it contains state.
Scikit-learn needs to be able to deepcopy an estimator's arguments so
that it can create copies and derivatives of it.

Instead we require the uninitialized optimizer and its parameters to be
passed to our learners separately. The learner can then initialize the
optimizer as needed.

I have only done this for CmpNet so far. We'd have to do it for all estimators that take an optimizer (i.e. pretty much all of them). Before I do that, I would like some feedback on the general approach. I went with a function without arguments, since that retains the most flexibility. It may be a bit weird to users who want to override the optimizer though. An alternative would be to pass the name of a keras optimizer and a configuration dict. That would be less "weird", but also less flexible, since users could no longer pass their own custom optimizers. @kiudee @prithagupta any thoughts?
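For concreteness, the function-based variant currently in this PR looks roughly like this (a sketch only; the keyword name `make_optimizer` and the defaults are illustrative, not copied from the diff):

```python
from keras.optimizers import SGD

from csrank import CmpNet  # assuming CmpNet is exported at the package top level


def make_default_optimizer():
    # A zero-argument callable: the learner calls it inside fit(), so no
    # initialized (and therefore non-deepcopy-able) optimizer is ever stored
    # on the estimator itself.
    return SGD(lr=1e-4, momentum=0.9, nesterov=True)


learner = CmpNet(make_optimizer=make_default_optimizer)  # hypothetical keyword name
```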

Motivation and Context

Continuation of #116.

How Has This Been Tested?

Ran the test suite and the pre-commit hooks.

Does this close/impact existing issues?

#94, #116

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have added tests to cover my changes.

@timokau (Collaborator, Author) commented May 26, 2020

Rebased.

@codecov bot commented May 26, 2020

Codecov Report

Merging #119 into master will increase coverage by 0.12%.
The diff coverage is 84.44%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #119      +/-   ##
==========================================
+ Coverage   59.96%   60.08%   +0.12%     
==========================================
  Files         116      116              
  Lines        7658     7662       +4     
==========================================
+ Hits         4592     4604      +12     
+ Misses       3066     3058       -8     
Impacted Files Coverage Δ
csrank/choicefunction/cmpnet_choice.py 84.78% <ø> (ø)
csrank/choicefunction/fate_choice.py 91.30% <ø> (ø)
csrank/choicefunction/feta_choice.py 66.00% <ø> (ø)
csrank/choicefunction/ranknet_choice.py 81.63% <ø> (ø)
csrank/discretechoice/cmpnet_discrete_choice.py 86.84% <ø> (ø)
csrank/discretechoice/fate_discrete_choice.py 94.44% <ø> (ø)
csrank/discretechoice/feta_discrete_choice.py 55.00% <ø> (ø)
csrank/discretechoice/ranknet_discrete_choice.py 84.61% <ø> (ø)
csrank/objectranking/cmp_net.py 87.17% <ø> (ø)
csrank/objectranking/fate_object_ranker.py 93.33% <ø> (ø)
... and 16 more

@kiudee (Owner) commented May 30, 2020

I took a look at what skorch (a scikit-learn wrapper for PyTorch) is doing. Their solution can be found here:
https://github.com/skorch-dev/skorch/blob/a621d0fa31c87adbdbb3b297328caefb09d42b8c/skorch/net.py#L504-L535
Basically, they allow users to pass in an uninitialized class, which will be initialized on fit. The user can specify parameters using the nested __ notation (which turns out to be the official way to do it in scikit-learn).
An alternative would be, as you say, to pass in a configuration dictionary.
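Roughly, the routing works like this (not skorch's actual code, just a minimal sketch of the mechanism):

```python
def collect_prefixed(params, prefix="optimizer__"):
    """Turn keys like 'optimizer__lr' into {'lr': ...}."""
    return {key[len(prefix):]: value
            for key, value in params.items()
            if key.startswith(prefix)}


class LearnerSketch:
    def __init__(self, optimizer, **optimizer_params):
        # `optimizer` is an *uninitialized* class; only plain values are
        # stored on self, so scikit-learn can clone/deepcopy the estimator.
        self.optimizer = optimizer
        self.optimizer_params = optimizer_params

    def fit(self, X, Y):
        # The optimizer is instantiated only here, with the routed parameters.
        kwargs = collect_prefixed(self.optimizer_params)
        self.optimizer_ = self.optimizer(**kwargs)
        # ... build and compile the model with self.optimizer_ ...
        return self
```

In a real scikit-learn estimator the `optimizer__*` parameters would need to be explicit keyword arguments (or go through `get_params`/`set_params`) rather than `**kwargs`, but the idea is the same.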

@kiudee (Owner) commented May 30, 2020

One big consideration here is hyperparameter optimization. The parameters of an optimizer (e.g. learning rate) are often optimized jointly with other parameters of the learner.

So an important use case is, for instance, that RandomizedSearchCV is run on the model.
The way this is organized in skorch and scikit-learn, a user would be able to pass something like:

{
    'module__n_hidden_layers': ...,
    'module__n_hidden_units': ...,
    'criterion__param1': ...,
    'optimizer__learning_rate': ...,
}

And the hierarchy there can be nested.
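With that convention, joint tuning would look roughly like this (the learner and its parameters are illustrative of the interface after this PR; RandomizedSearchCV and loguniform are the actual scikit-learn/scipy APIs):

```python
from keras.optimizers import SGD
from scipy.stats import loguniform
from sklearn.model_selection import RandomizedSearchCV

from csrank import CmpNet  # assuming the top-level export

search = RandomizedSearchCV(
    estimator=CmpNet(optimizer=SGD),  # uninitialized optimizer class
    param_distributions={
        "optimizer__lr": loguniform(1e-5, 1e-1),
        "optimizer__momentum": [0.0, 0.5, 0.9],
    },
    n_iter=20,
)
# search.fit(X, Y) clones the learner and sets the nested parameters via
# set_params, which is why the constructor arguments must be deepcopy-able.
```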

@timokau (Collaborator, Author) commented May 30, 2020

> I took a look at what skorch (a scikit-learn wrapper for PyTorch) is doing. Their solution can be found here:
> https://github.com/skorch-dev/skorch/blob/a621d0fa31c87adbdbb3b297328caefb09d42b8c/skorch/net.py#L504-L535
> Basically, they allow users to pass in an uninitialized class, which will be initialized on fit. The user can specify parameters using the nested __ notation (which turns out to be the official way to do it in scikit-learn).
> An alternative would be, as you say, to pass in a configuration dictionary.

The __ approach seems hacky to me. There is a natural way to pass nested configuration in Python: dicts.

> One big consideration here is hyperparameter optimization. The parameters of an optimizer (e.g. learning rate) are often optimized jointly with other parameters of the learner.

Of course! I don't know why I didn't think of that. That's a very good reason not to use the initializer-function approach I demonstrated in this PR. Good thing I waited for feedback first.

> So an important use case is, for instance, that RandomizedSearchCV is run on the model.
> The way this is organized in skorch and scikit-learn, a user would be able to pass something like:
>
> {
>     'module__n_hidden_layers': ...,
>     'module__n_hidden_units': ...,
>     'criterion__param1': ...,
>     'optimizer__learning_rate': ...,
> }
>
> And the hierarchy there can be nested.

It looks to me like the __ thing is a hack around the limitations of scikit-learn's hyperparameter optimization. From a brief skim over their documentation, it seems like there is no easy way to define distributions over dictionary-nested parameters. You would have to define a custom distribution function.
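To illustrate, such a custom distribution over a dict-valued parameter would have to look something like this (purely hypothetical; `optimizer_config` is an imagined dict-valued parameter, not anything in csrank):

```python
from scipy.stats import loguniform


class OptimizerConfigSampler:
    """Samples a whole nested config dict as a single 'distribution'.

    scikit-learn's ParameterSampler accepts any object with an rvs() method,
    so this works, but it has to be hand-written for every nesting.
    """

    def rvs(self, random_state=None):
        return {"lr": loguniform(1e-5, 1e-1).rvs(random_state=random_state)}


param_distributions = {
    "optimizer_config": OptimizerConfigSampler(),  # hypothetical dict-valued parameter
}
```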

If that was more easily possible, you could also define different sets of hyperparameters for different optimizers and then hierarchically pick an optimizer first, then pick the hyperparameters for the optimizer.

Still, scikit-learn's design, combined with the fact that it seems to be the standard, is a strong argument in favor of __. So even though I think dict-based configuration would be more elegant, we should probably go ahead with the skorch approach. Do you agree?

@timokau timokau mentioned this pull request May 30, 2020
@kiudee (Owner) commented May 30, 2020

Indeed, that felt very hacky to me too.
But I also agree that it is something people are used to and that it is not that difficult to adapt to as a user. So I think we should follow that.

@kiudee kiudee marked this pull request as draft June 1, 2020 13:09
@kiudee kiudee added this to the 1.3 milestone Jun 5, 2020
@timokau (Collaborator, Author) commented Jun 5, 2020

I have pushed a proof of concept. It doesn't quite pass the tests yet, but it shows two issues with this approach when specifying default optimizer arguments:

  • it adds a lot of boilerplate for passing the defaults around
  • how should we handle the case where the user picks a different optimizer that doesn't take the nesterov argument?

The first one could probably be worked around by centralizing our defaults (which we should do anyway, I think), but the second one is a bit ugly (see the sketch below).
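To make the second point concrete (a sketch only; Adam is just an example of an optimizer without a nesterov argument):

```python
from keras.optimizers import SGD, Adam

our_defaults = {"lr": 1e-4, "momentum": 0.9, "nesterov": True}

SGD(**our_defaults)   # fine
Adam(**our_defaults)  # TypeError: Adam accepts neither `momentum` nor `nesterov`,
                      # so blindly forwarding SGD-specific defaults breaks as soon
                      # as the user swaps the optimizer
```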

@@ -85,6 +88,9 @@ def __init__(
kernel_initializer=kernel_initializer,
activation=activation,
optimizer=optimizer,
optimizer__lr=optimizer__lr,
optimizer__momentum=optimizer__momentum,
optimizer__nesterov=optimizer__nesterov,
timokau (Collaborator, Author) commented on this diff:

This is what I am talking about.

@timokau (Collaborator, Author) commented Jun 5, 2020

I see some possible solutions:

  1. Simply drop the nesterov and momentum defaults. That is what skorch does: They only provide a default for lr and rely on the optimizer's defaults for other parameters.
  2. Provide a parallel mechanism of passing optimizer parameters, e.g. an optimizer_config: dict attribute. We could supply a default for that which is then overridden by any manual changes. It would likely be confusing to merge & use the two configuration mechanisms though.
  3. Set the default optimizer to None and specify in the docs that None refers to SGD with our default options.
  4. Try to figure out which parameters apply to which optimizer, using introspection. This is hard to get right and brittle.

All in all, I think the most elegant approach would be to fall back to either the lambda-parameter or a tuple (optimizer, param_dict). Given that we want to use the __ approach, solution (1) seems best to me. Is there any particular reason that we're overriding the optimizer's defaults here?

Are there other options I'm not seeing?

@timokau (Collaborator, Author) commented Jun 5, 2020

If we want to keep the parameters, (3) is probably best (we could also choose a string instead of None to represent that default). Or we could define our own optimizer, which is just SGD with those changed defaults.

@kiudee (Owner) commented Jun 8, 2020

I agree that the cleanest solution would be (1) here. The only reason I see for providing different defaults is that we think they work better out of the box.
But that is hard to guarantee in general.

@timokau (Collaborator, Author) commented Jun 9, 2020

So should I go ahead then and remove the nesterov and momentum defaults? What about the learning rate? It's always set to 1e-4. The optimizer's default is 1e-2, so maybe that is too high for our purposes.

@timokau (Collaborator, Author) commented Jun 9, 2020

If we want to keep lr, that is probably okay, since I expect all optimizers will accept an lr argument. Our default may not make sense for every optimizer though, and it may be somewhat cleaner to be "consistent" in not overriding the optimizer's defaults.

@kiudee (Owner) commented Jun 9, 2020

I would say we can leave all of those defaults to the corresponding optimizers. Usually, you need to adapt the parameters to your specific problem anyways.

timokau added a commit to timokau/cs-ranking that referenced this pull request Jun 10, 2020
As discussed in kiudee#119. The
reasoning is that we don't have a good reason to override those; the
library user will likely have to tune them anyway (or use a different
optimizer altogether). At the same time, they make the design proposed
in kiudee#119 (passing uninitialized optimizers and their parameters
separately) more difficult.
@timokau (Collaborator, Author) commented Jun 10, 2020

I moved that step to #142.

@timokau timokau force-pushed the no-direct-optimizer branch 4 times, most recently from ec39655 to e90699d on June 11, 2020 16:06
@timokau (Collaborator, Author) commented Jun 11, 2020

I started with the transition, but there is still quite a bit of work to be done. Still WIP.

@timokau (Collaborator, Author) commented Jun 14, 2020

The test suite passes now. I had to disable one test (FATELINEAR_DC), though, where the reason for the failure was not obvious. I'll have to look into that in more detail. Some cleanup is needed as well.

@timokau (Collaborator, Author) commented Jun 14, 2020

The FATELINEAR_DC issue was straightforward after all. Only documentation & a self-review remaining.

@timokau timokau changed the title from "[WIP] Do not directly pass an optimizer to our estimators" to "Require uninitialized optimizers for our learners" on Jun 17, 2020
@timokau timokau marked this pull request as ready for review June 17, 2020 16:01
@timokau (Collaborator, Author) commented Jun 17, 2020

Ready for review. This is a breaking change.
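For reviewers, the migration looks roughly like this (a sketch; exact class names and parameters should be checked against the diff):

```python
from keras.optimizers import SGD

from csrank import CmpNet  # assuming the top-level export

# Before this PR: an already-initialized optimizer was passed in.
learner = CmpNet(optimizer=SGD(lr=1e-3))

# After this PR: pass the uninitialized class plus optimizer__* parameters;
# the learner instantiates SGD(lr=1e-3) itself when fit() is called.
learner = CmpNet(optimizer=SGD, optimizer__lr=1e-3)
```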

@timokau timokau requested a review from kiudee June 17, 2020 16:02
@timokau timokau modified the milestones: 1.3, 2.0 Jun 17, 2020
@timokau (Collaborator, Author) commented Jun 17, 2020

Switched the milestone following semver.

kiudee previously approved these changes Jun 24, 2020

@kiudee (Owner) left a comment

This looks good to me.

Since we now break backward compatibility, but still might want to backport patches for 1.x.y, we have to think about establishing legacy branches, which travis-ci also deploys from.

@timokau (Collaborator, Author) commented Jun 24, 2020

Thanks for the review!

> Since we now break backward compatibility, but still might want to backport patches for 1.x.y, we have to think about establishing legacy branches, which travis-ci also deploys from.

Yes, that'd be something to explore. Branches could always be created retroactively though once we actually have to ship some fix.

@timokau (Collaborator, Author) commented Jun 24, 2020

Speaking of Travis, there seems to be something wrong with the GitHub integration. The check in GitHub is still marked as in progress, but when clicking on "Details", Travis says that it finished long ago. Is it possible to restart the job?

@kiudee (Owner) commented Jun 25, 2020

After restarting the job, it fails now with:

'Keras requires TensorFlow 2.2 or higher. '

Newer keras versions delegate to tf.keras and therefore need tf2. See
https://github.com/keras-team/keras/releases/tag/2.4.0.
@timokau (Collaborator, Author) commented Jun 26, 2020

I have pinned keras and travis is happy. Please have another look.
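The pin is along these lines (the exact bound and file are an assumption on my part, not copied from the diff):

```python
# setup.py (sketch): Keras 2.4 delegates to tf.keras and needs TensorFlow >= 2.2,
# so staying below 2.4 keeps the current tf1/graph-mode setup working on CI.
install_requires = [
    "keras<2.4",
]
```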

@timokau (Collaborator, Author) commented Jun 27, 2020

It turns out that the optimizer=uninitialized_optimizer pattern still doesn't pass scikit-learn's estimator check. There seems to be no good reason for this, though; I did some digging and opened an upstream issue: scikit-learn/scikit-learn#17756

I think this is still good to merge. The test is likely just overly restrictive.
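For context, the check in question is scikit-learn's estimator test suite, invoked roughly like this (the learner shown is illustrative):

```python
from sklearn.utils.estimator_checks import check_estimator

from csrank import CmpNet  # assuming the top-level export

# The failing check appears to be overly restrictive about the kinds of
# constructor arguments it allows; see scikit-learn/scikit-learn#17756.
check_estimator(CmpNet())
```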

@timokau timokau merged commit 8c85937 into kiudee:master Jul 1, 2020
@timokau timokau deleted the no-direct-optimizer branch July 1, 2020 13:48