Adam optimizer can produce NaNs with Float16 due to small epsilon #167

pevnak · 2024-02-17T19:35:00Z

Dear All,

I have found a bug in Adam's optimiser when used with Float16.
An MWE woud look like this

using Optimisers

x = Float16[0.579, -0.729, 0.5493]
δx = Float16[-0.001497, 0.0001875, -0.013176]

os = Optimisers.setup(Optimisers.Adam(Float16(1e-4)), x);
os, x = Optimisers.update(os, x, δx)

where after update x = Float16[Inf, -Inf, 0.5493]. This happens with Optimisers 0.3.2. I believe the problem is caused on this line

Optimisers.jl/src/rules.jl

Line 221 in 1908a1c

dx′ = @lazy mt / (1 - βt[1]) / (sqrt(vt / (1 - βt[2])) + ϵ) * η

because default ϵ is in Float64 and it is too small.
When I initialize Adam with different epsilon as Optimisers.Adam(1e-4, (0.9, 0.999), eps(Float16)), it works well.
A possible solution would be to make sure that the epsilon has at least some value with respect to the given type of Float. For example change T(o.epsilon) to max(T(o.epsilon), eps(T)) on line 216.

I can prepare a pull-request with this test.

The text was updated successfully, but these errors were encountered:

mcabbott · 2024-02-17T19:45:04Z

It used to use eps(typeof(eta)), which had the unfortunate effect that Adam(1e-4) and Adam(1f-4) did different things. Even if your model was all Float32, you could easily provide eta::Float64 by mistake. But fixing that means that Adam(Float16(1e-4)) also doesn't change epsilon.

ϵ = max(T(o.epsilon), eps(T)) isn't a crazy idea.

Another possibility is this: The default could be to store nothing, and compute epsilon when called, to be eps(T). Then it won't ever disobey an explicit order, such as Adam(1e-4, (0.9, 0.99), 0.0), but by default it will adjust.

This could be written ϵ = T(something(o.epsilon, eps(T))) in the rule, and some change to the macro to make struct Adam ... epsilon::Union{Nothing,Float64} with default nothing.

pevnak · 2024-02-18T20:29:20Z

Personally, I would go with easy ϵ = max(T(o.epsilon), eps(T)).
Not that I need to promote my own stuff, but the idea with nothing is a bit overengineered to my taste.
But I can implement it if you will prefer it.
Just tell me which is more in-line with the spirit of the library and I can do the PR.

CarloLucibello · 2024-04-08T07:18:46Z

The ϵ = max(T(o.epsilon), eps(T)) solution seems good. A PR would be appreciated @pevnak !

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Adam optimizer can produce NaNs with Float16 due to small epsilon #167

Adam optimizer can produce NaNs with Float16 due to small epsilon #167

pevnak commented Feb 17, 2024 •

edited

Loading

mcabbott commented Feb 17, 2024

pevnak commented Feb 18, 2024

CarloLucibello commented Apr 8, 2024

Adam optimizer can produce NaNs with Float16 due to small epsilon #167

Adam optimizer can produce NaNs with Float16 due to small epsilon #167

Comments

pevnak commented Feb 17, 2024 • edited Loading

mcabbott commented Feb 17, 2024

pevnak commented Feb 18, 2024

CarloLucibello commented Apr 8, 2024

pevnak commented Feb 17, 2024 •

edited

Loading