
Remove Recurrent cell eltype restriction #1755

Closed
rmsrosa wants to merge 6 commits

Conversation

@rmsrosa commented Oct 25, 2021

Remove the eltype restriction in the methods associated with the recurrent cells RNN, LSTM, GRU and GRUv3.

Change the conversion of Nil to other number types, so that Nil() converts to NaN16, NaN32 and NaN for the corresponding float types, and likewise for complex types with floating-point real and imaginary parts. Conversion to Integer and Rational types throws an error, in line with the errors thrown when converting non-integer floats.

This allows outputsize to work with recurrent cells as well.
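
For illustration, something along these lines should now work (a rough sketch with made-up sizes, not copied from the added tests):

using Flux

m = RNN(3, 5)                          # recurrent layer: 3 features in, 5 out
Flux.outputsize(m, (3, 16))            # expected: (5, 16) for a batch of 16
Flux.outputsize(Chain(RNN(3, 5), Dense(5, 2)), (3, 16))   # expected: (2, 16)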

Suitable tests were added.

This should fix #1565.

PR Checklist

  • [x] Tests are added
  • Entry in NEWS.md
  • Documentation, if applicable
  • API changes require approval from a committer (different from the author, if applicable)

make Nil() convert to float as NaN
and error when converting to other Number subtypes

Allows Nil[] to be passed to recurrent layers (when type not enforced)
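
Roughly, the intended conversion behaviour (a sketch, assuming the internal Flux.nil singleton from src/outputsize.jl):

using Flux

Float16(Flux.nil)          # expected: NaN16
Float32(Flux.nil)          # expected: NaN32
Float64(Flux.nil)          # expected: NaN
ComplexF64(Flux.nil)       # expected: a NaN-valued ComplexF64
# Int(Flux.nil)            # expected to error, like Int(NaN)
# Rational{Int}(Flux.nil)  # likewise expected to error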
@DhairyaLGandhi (Member):

Looks quite good to me, thanks!

@DhairyaLGandhi (Member):

Could we add a few quick tests to make sure that mismatched eltypes work as expected in the RNN layers on the GPU?

DhairyaLGandhi previously approved these changes Oct 25, 2021
@rmsrosa (Author) commented Oct 25, 2021

Yeah, I didn't check any CUDA-related stuff; I have no knowledge of that.

How bad are the failing tests above?

@DhairyaLGandhi
Copy link
Member

For the RNN tests, first I'd create a regular rnn layer, and then pass it to Flux.f64 and run it with Float32 data.

And rinse and repeat for CUDA.
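
Something like this, I'd assume (hypothetical sizes; Flux.f64 promotes the parameters to Float64 while the data stays Float32):

using Flux

rnn = RNN(3, 5)              # Float32 parameters by default
rnn64 = Flux.f64(rnn)        # promote the parameters to Float64
x32 = rand(Float32, 3, 7)    # Float32 input, batch of 7

y = rnn64(x32)               # should run once the eltype restriction is gone

# rinse and repeat on the GPU (keeping the eltypes mismatched on purpose):
# using CUDA
# rnn_gpu = rnn |> gpu                             # Float32 parameters on the GPU
# y_gpu = rnn_gpu(CuArray(rand(Float64, 3, 7)))    # Float64 input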

Review thread on src/outputsize.jl (outdated, resolved)
@darsnack (Member) commented Oct 26, 2021

This goes against #1483, so I don't think we should merge without a discussion. The implementation is good, though, other than @CarloLucibello's suggestion.

Are the errors from #1483 no longer an issue? What made them go away? Did something change on the Zygote side of things, @mcabbott?

@mcabbott (Member):

No idea, have never looked at #1483. When I try, I get different errors:

julia> Flux.train!(loss, Flux.params(m), x, ADAM())
ERROR: MethodError: no method matching (::Flux.RNNCell{typeof(tanh), Matrix{Float32}, Vector{Float32}, Matrix{Float32}})(::Matrix{Float32}, ::Float64)

julia> x32 = [rand(Float32, 5) for i in 1:2]
2-element Vector{Vector{Float32}}:
 [0.19781059, 0.095588446, 0.7885516, 0.20238525, 0.21201706]
 [0.85013026, 0.44129986, 0.82300264, 0.8385034, 0.3073842]

julia> Flux.train!(loss, Flux.params(m), x32, ADAM())
ERROR: MethodError: no method matching (::Flux.RNNCell{typeof(tanh), Matrix{Float32}, Vector{Float32}, Matrix{Float32}})(::Matrix{Float32}, ::Float32)
Closest candidates are:
  (::Flux.RNNCell{F, A, V, <:AbstractMatrix{T}})(::Any, ::Union{AbstractVector{T}, AbstractMatrix{T}, Flux.OneHotArray}) where {F, A, V, T} at ~/.julia/packages/Flux/ZnXxS/src/layers/recurrent.jl:103
Stacktrace:
  [1] macro expansion
    @ ~/.julia/packages/Zygote/rv6db/src/compiler/interface2.jl:0 [inlined]
  [2] _pullback(::Zygote.Context, ::Flux.RNNCell{typeof(tanh), Matrix{Float32}, Vector{Float32}, Matrix{Float32}}, ::Matrix{Float32}, ::Float32)
    @ Zygote ~/.julia/packages/Zygote/rv6db/src/compiler/interface2.jl:9
  [3] _pullback
    @ ~/.julia/packages/Flux/ZnXxS/src/layers/recurrent.jl:47 [inlined]

I see that Flux.Recur is mutable, and Zygote currently has issues with some mutable structs, and almost no tests for them... does Flux have decent tests for these?

@ToucheSir (Member) commented Oct 26, 2021

I don't think Flux.train! does the right thing at all when given batched sequence data, so the MWEs from #1483 definitely need to be tweaked.
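
For reference, the kind of pattern I'd expect for recurrent training (a sketch with made-up sizes, using the implicit-params API of the time, not the MWE from #1483):

using Flux

m = RNN(5, 3)
seq = [rand(Float32, 5, 7) for _ in 1:4]   # 4 timesteps, batch of 7

function loss(seq)
    Flux.reset!(m)                  # reset the hidden state for each sequence
    sum(sum(m(x)) for x in seq)     # run the timesteps in order
end

gs = gradient(() -> loss(seq), Flux.params(m))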

Also changes float(::Type{Nil}) for consistency
@@ -16,10 +16,9 @@ const nil = Nil()
Nil(::T) where T<:Number = nil
Nil(::Nil) = nil
(::Type{T})(::Nil) where T<:Number = T(NaN)
Base.convert(::Type{Nil}, ::Type{T}) where T<:Number = T(NaN)
Base.convert(::Type{T}, ::Nil) where T<:Number = T(NaN)
@darsnack (Member) commented Oct 26, 2021

Let's also keep the old method.

Suggested change:
- Base.convert(::Type{T}, ::Nil) where T<:Number = T(NaN)
+ Base.convert(::Nil, ::Type{T}) where T<:Number = nil
+ Base.convert(::Type{T}, ::Nil) where T<:Number = T(NaN)



- Base.float(::Type{Nil}) = Nil
+ Base.float(::Type{Nil}) = Float64
@darsnack (Member) commented Oct 26, 2021

Suggested change:
- Base.float(::Type{Nil}) = Float64
+ Base.float(::Nil) = NaN

@rmsrosa (Author):

Hmm, since NaN isa Float64, shouldn't it actually be both:

Base.float(::Type{Nil}) = Float64
Base.float(::Nil) = NaN
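
For what it's worth, that would mirror how Base treats other number types (quick REPL check, not part of the diff):

julia> float(Int)    # on a type: returns the corresponding float type
Float64

julia> float(1)      # on a value: returns the converted value
1.0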

@mcabbott added the RNN label Oct 30, 2021
@jeremiedb (Contributor) commented Nov 5, 2021

I'm afraid the proposed change results in hiding an error that arises from an inconsistent eltype between the input (X) and the cell's initial state (state0) when taking gradients on the GPU. For example, when feeding an RNN with Float64 data while the model's state0 is initialized with the default Float32:

using Flux
using Statistics: mean
using Random: seed!
using BenchmarkTools
using CUDA

#####################################
# RNN vanilla
#####################################
seed!(123)
feat = 8
h_size = 3
seq_len = 5
batch_size = 15

rnn = RNN(feat, h_size)

X = [rand(Float64, feat, batch_size) for i in 1:seq_len]
Y = rand(Float64, batch_size, seq_len) ./ 10

#### transfer to gpu ####
rnn_gpu = rnn |> gpu
X_gpu = CuArray{Float64}.(X)
Y_gpu = Float64.(Y)

θ = Flux.params(rnn)
θ_gpu = Flux.params(rnn_gpu)

function loss_cpu(x, y)
    # Flux.reset!(rnn)
    l = sum(Flux.stack(map(rnn, x), dims=2))  # stack the per-timestep outputs along dim 2
    return l
end

function loss_gpu(x, y)
    l = sum(Flux.stack(map(rnn_gpu, x), dims=2))
    return l
end

loss_cpu(X, Y)
loss_gpu(X_gpu, Y_gpu)

opt = ADAM(1e-3)
Flux.train!(loss_cpu, θ, [(X, Y)], opt)

opt_gpu = ADAM(1e-3)
Flux.train!(loss_gpu, θ_gpu, [(X_gpu, Y_gpu)], opt_gpu)

The CPU train! works fine, while the GPU one results in:

ERROR: LoadError: this intrinsic must be compiled to be called
Stacktrace:
  [1] macro expansion
    @ C:\Users\jerem\.julia\packages\Zygote\rv6db\src\compiler\interface2.jl:0 [inlined]
  [2] _pullback(::Zygote.Context, ::Core.IntrinsicFunction, ::String, ::Type{Int64}, ::Type{Tuple{Ptr{Int64}}}, ::Ptr{Int64})
...

If using Float32 instead (X_gpu = CuArray{Float32}.(X)), there's no issue.

Also, under the current forced parametrization in the RNNCell forward-pass definition, the issue would be caught in advance through the mismatched types:

ERROR: LoadError: MethodError: no method matching (::Flux.RNNCell{typeof(tanh), CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}})(::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, ::CuArray{Float64, 2, CUDA.Mem.DeviceBuffer})
Closest candidates are:
  (::Flux.RNNCell{F, A, V, var"#s526"} where var"#s526"<:AbstractMatrix{T})(::Any, ::Union{AbstractMatrix{T}, AbstractVector{T}, Flux.OneHotArray}) where {F, A, V, T} at 
c:\github\Flux.jl\src\layers\recurrent.jl:104

A way to properly handle Float32 params with Float64 input could be to specify the init_state type:

rnn = RNN(feat, h_size, init_state=zeros)

which results in a valid model on the CPU. However, when moving to the GPU, the |> gpu approach converts all model parameters to Float32, which keeps the problem in place.

I guess the ideal scenario is for the GPU gradient calculation to handle the case where the state0 type differs from the input/hidden state type, but I'm missing the underlying CUDA implications :\

@DhairyaLGandhi (Member):

Eltype promotion is how Julia handles multiple code paths and is a necessary component in, say, working with complex data, duals, etc. We also have tools to manage eltypes consistently (f64, f32), and type stability is a pretty common pattern to follow too. In conv we warn on mismatched eltypes but don't error, so this can be made consistent.

@darsnack (Member) commented Nov 5, 2021

I think having at least a warning here is important. It's a more complex issue than the kind of promotion that occurs for convs, because of the type of the storage field.

I do think the better long term solution is an API. Per Julia's promotion system, the expected state type should be the promotion of the weights and the input. Right now we "guess" the state type based on the weights alone.
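
A rough sketch of such a rule (hypothetical helper, not part of this PR):

# Hypothetical: choose the state eltype by promoting the weight and input eltypes,
# rather than guessing it from the weights alone.
state_eltype(Wi::AbstractMatrix, x::AbstractArray) = promote_type(eltype(Wi), eltype(x))

# e.g. Float32 weights with Float64 input would give a Float64 state:
state_eltype(rand(Float32, 3, 3), rand(Float64, 3)) === Float64   # true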

@jeremiedb (Contributor):

Maybe there's a necessary compromise here, since if Recur's state isn't parametrized (in the struct definition), then the promotion will happen successfully on the state, even on GPU. However, Recur's output type then cannot be inferred, which has resulted in reported performance issues.
But if the "this intrinsic must be compiled" error on GPU can be resolved, it would result in the desired behavior, right? Would it be worth opening an issue in CUDA.jl?

As a side note, if one is aware of the current gotchas and wants to initialize a model with Float32 weights and Float64 input and state, an issue is that the gpu conversion moves everything to Float32, thus losing the required Float64 for the state:

rnn_cpu = RNN(feat, h_size, init=Flux.zeros32, initb=Flux.zeros32, init_state=zeros)
rnn_gpu = rnn_cpu |> gpu
julia> rnn_cpu.state
3×1 Matrix{Float64}:
 0.0
 0.0
 0.0

julia> rnn_gpu.state
3×1 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
 0.0
 0.0
 0.0

Wouldn't it be desirable for gpu to preserve the types? If that were the case, then handling different types for weights vs input/state would at least be more convenient, as the above would result in valid code for the GPU.

@DhairyaLGandhi (Member):

> Wouldn't it be desirable for gpu to preserve the types?

I'm not sure about that last bit, since GPUs can have very different performance characteristics depending on precision. Either way, if we know ahead of time that gpu produces f32, we can much more easily predict the eltypes we will get. Note that the error raised successfully prevents the user from mixing eltypes on the GPU, which can have vastly larger performance differences. Is it common for users to need different eltypes for different parameters? Most cases I have seen don't seem to mind f32 everywhere, and don't actually intend to use the full precision of f64. Relaxing the eltypes makes it much easier to handle mixed-precision cases without needing to hold on to too many specific kernels for different eltypes.

@rmsrosa (Author) commented May 5, 2023

superseded by #2234

@rmsrosa closed this May 5, 2023
Successfully merging this pull request may close these issues:

  • Recurrent cell eltype restriction breaks outputsize