Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

make eps a parameter of optimisers #1819

Merged
merged 1 commit into from
Dec 29, 2021
Merged

Conversation

cossio
Copy link
Contributor

@cossio cossio commented Dec 28, 2021

Makes epsilon a parameter of each optimiser. Closes #1818.

PR Checklist

  • Tests are added

@cossio cossio mentioned this pull request Dec 28, 2021
Copy link
Member

@darsnack darsnack left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only other comment is to call the field epsilon since that's the name we chose in Optimisers.jl (to avoid confusion with the function of the same name).

src/optimise/optimisers.jl Show resolved Hide resolved
src/optimise/optimisers.jl Show resolved Hide resolved
src/optimise/optimisers.jl Show resolved Hide resolved
src/optimise/optimisers.jl Show resolved Hide resolved
src/optimise/optimisers.jl Show resolved Hide resolved
src/optimise/optimisers.jl Show resolved Hide resolved
src/optimise/optimisers.jl Show resolved Hide resolved
src/optimise/optimisers.jl Show resolved Hide resolved
src/optimise/optimisers.jl Show resolved Hide resolved
src/optimise/optimisers.jl Show resolved Hide resolved
@cossio cossio force-pushed the eps branch 3 times, most recently from 9a21a7a to bb0e1e7 Compare December 28, 2021 14:40
@cossio
Copy link
Contributor Author

cossio commented Dec 28, 2021

@darsnack I didn't want to change the value of eps Flux is currently using, so as to not change behavior. So I prefer not to use eps(typeof(eta)) as you suggest. I continue to use to the value of 1e-8 here.

Using eps(typeof(eta)) (which changes the value of epsilong used) is technically a breaking change so it could be adressed in a separate PR.

Also note that eta is a Float64 in all the optimizers, so using epsilon with type typeof(eta) is not really helping.

I renamed eps -> epsilon following your other suggestion.

Copy link
Member

@darsnack darsnack left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough, I didn't realize the type was hard coded to Float64.

@cossio
Copy link
Contributor Author

cossio commented Dec 28, 2021

Merge?

@darsnack
Copy link
Member

bors r+

bors bot added a commit that referenced this pull request Dec 28, 2021
1819: make eps a parameter of optimisers r=darsnack a=cossio

Makes epsilon a parameter of each optimiser. Closes #1818.

### PR Checklist

- [X] Tests are added


Co-authored-by: cossio <j.cossio.diaz@gmail.com>
@bors
Copy link
Contributor

bors bot commented Dec 28, 2021

Build failed:

@darsnack
Copy link
Member

Failure seems related to #1808 (@ToucheSir)?

@ToucheSir
Copy link
Member

Yes, #1804 specifically. I still haven't been able to repro the issue locally (will try replicating the Buildkite env when I have time in a couple of days). Feel free to @test_skip that one if you need any PRs in sooner, otherwise help would be much appreciated on figuring out the CI failure.

@cossio
Copy link
Contributor Author

cossio commented Dec 29, 2021

Merge this? buildkite failure looks unrelated to this PR, and all other tests pass.

@darsnack
Copy link
Member

@CarloLucibello @DhairyaLGandhi I think an org owner needs to merge.

@CarloLucibello CarloLucibello merged commit 7f375aa into FluxML:master Dec 29, 2021
@cossio cossio deleted the eps branch December 29, 2021 16:56
@DhairyaLGandhi
Copy link
Member

This was a bit too wide a change for the api and can break a bunch of code...

@darsnack
Copy link
Member

darsnack commented Dec 29, 2021

can break a bunch of code

How so? It adds an optional positional argument at the end that has the same default value as before.

@DhairyaLGandhi
Copy link
Member

It adds a field to the optimisers, any code using the default "struct" constructors would break.

@ToucheSir
Copy link
Member

Sorry, I should've caught that after #1778. Shall we quarantine both off for 0.13?

@DhairyaLGandhi
Copy link
Member

#1778 is fine still because the disruption is limited to one type with a clear way of making a case for a non breaking type.

@cossio
Copy link
Contributor Author

cossio commented Dec 29, 2021

Sorry I didn't realize that. To be clear, what you mean is that a call like:

RMSProp(η, ρ, IdDict())

no longer works, because it now needs to be replaced by:

RMSProp(η, ρ, ϵ, IdDict())

right?

A possible solution is to make another PR (overwriting this one), which makes ϵ an optional keyword argument. Would that work?

RMSProp(η, ρ, IdDict(); ϵ = ϵ)

@DhairyaLGandhi
Copy link
Member

DhairyaLGandhi commented Dec 29, 2021 via email

@ToucheSir
Copy link
Member

ToucheSir commented Dec 29, 2021

Well, I'm still not sure if adding struct fields constitutes a SemVer break, as (de)serialization (among other things) is affected. If anyone is in the know about this, please do leave a comment.

@darsnack
Copy link
Member

darsnack commented Dec 29, 2021

@DhairyaLGandhi thanks for the explanation, sorry I missed that.

How frequently is the IdDict manually being passed in though? I'd rather have a breaking change for v0.13 than complicate the API by adding a keyword just to make it not breaking.

@DhairyaLGandhi
Copy link
Member

DhairyaLGandhi commented Dec 29, 2021

It's got less to do with frequency and more about necessity and API compactness. In cases where we need literal copies of gradients (for synchronisation for example) over different models it's helpful to store states and gradients as the same type as literal Dicts for simplicity, and making users add in an epsilon fiels (especially when not needed in the optimisation step) is unnecessary.

@darsnack
Copy link
Member

I get that, but I'm saying if the frequency is low enough (i.e. a handful of users), then it's fair to ask that handful to add the new field into their code on a breaking release. Anyone manually passing in the IdDict is already passing in a bunch of parameters other than the LR. This seems reasonable to me in order to not have a global constant in the package affecting the behavior of many functions.

@cossio
Copy link
Contributor Author

cossio commented Dec 29, 2021

How frequently is the IdDict manually being passed in though? I'd rather have a breaking change for v0.13 than complicate the API by adding a keyword just to make it not breaking.

I don't see why making epsilon a keyword would complicate the API? It would be an optional keyword that most users need not even notice.

@DhairyaLGandhi
Copy link
Member

Passing in Dict explicitly is more than passing a bunch of arguments though, it is arguably more related to the use case of the optimisers than the value of an epsilon. It's already very rare to need to change epsilon.

@darsnack
Copy link
Member

I meant that if you are manually passing in the dict, then you are already writing ADAM(lr, (0.9, 0.999), IdDict(...)). No one changes the (0.9, 0.999) either, but these are extra arguments the user is copy-pasting in order to pass in the IdDict. So having to update the code to ADAM(lr, (0.9, 0.999), 1e-8, IdDict(...)) on a breaking release seems okay to me.

@DhairyaLGandhi
Copy link
Member

But even a long list of arguments to the constructor maintains a rational order (going by what is likely to be used and requires changing for the majority of use cases), which epsilon in the position in the pr does not since it comes before the storage. It should be after.

@darsnack
Copy link
Member

Sure, a PR to fix the order is okay for me.

@ToucheSir
Copy link
Member

I would favour grouping the hparams together. Perhaps if the IdDict was the first parameter in the default constructor?

@cossio
Copy link
Contributor Author

cossio commented Dec 29, 2021

I'd prefer

RMSProp(η, ρ, IdDict(); ϵ = ϵ)

over

RMSProp(η, ρ, IdDict(), ϵ)

In fact, having all hyper-params as keyword arguments wouldn't be bad I think,

RMSProp(IdDict(); η=η, ρ=ρ, ϵ=ϵ)

but that's another matter.

@ToucheSir
Copy link
Member

For ADAM at least, I'd argue β is not sufficiently more worthy of tuning than ϵ as to warrant being before the IdDict than after. Therein lies the challenge though: can we agree on which hyperparams are "privileged" and which are relegated to kwargs? If not, putting the IdDict first by convention prevents (most) any bikeshedding there.

@cossio
Copy link
Contributor Author

cossio commented Jan 16, 2022

Any decision here?

@DhairyaLGandhi
Copy link
Member

DhairyaLGandhi commented Jan 16, 2022

I prefer the first proposal in #1819 (comment). Since ϵ is common amongst many optimisers, and fitting it into different slots of optimiser API might get meddlesome, having it consistently as a kwarg makes sense. Besides, while experimenting with the hyper parameters of ADAM, I'd imagine modifying β would happen more frequently than ϵ (which strictly may not be a parameter at all).

@cossio
Copy link
Contributor Author

cossio commented Jan 16, 2022

I prefer the first proposal in #1819 (comment).

I agree. I can make a PR.

@cossio cossio mentioned this pull request Jan 17, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Optimisers epsilon
5 participants