
Weird behaviour of mean function in CuArray #1773

Open
gabrielpreviato opened this issue Feb 13, 2023 · 8 comments
Labels
bug Something isn't working needs information Further information is requested


@gabrielpreviato

Describe the bug

Sometimes when using a CuArray, I get odd results from some Flux normalization layers.
While trying to track down the problem, I found that this happens (at least) when using the Statistics mean function.
On the CPU it always gives the expected values, but on the GPU it sometimes shows this strange behavior.

To reproduce

The Minimal Working Example (MWE) for this bug:

julia> using CUDA, Statistics

julia> a = CUDA.ones((640, 640, 32, 1))

julia> b = mean(a; dims=[1])
1×640×32×1 CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}:
[:, :, 1, 1] =
 0.999994  0.999994  0.999994  0.999994  …  0.999994  0.999994  0.999994  0.999994

[:, :, 2, 1] =
 0.999994  0.999994  0.999994  0.999994  …  0.999994  0.999994  0.999994  0.999994

[:, :, 3, 1] =
 0.999994  0.999994  0.999994  0.999994  …  0.999994  0.999994  0.999994  0.999994

;;; …

[:, :, 30, 1] =
 0.999994  0.999994  0.999994  0.999994  …  0.999994  0.999994  0.999994  0.999994

[:, :, 31, 1] =
 0.999994  0.999994  0.999994  0.999994  …  0.999994  0.999994  0.999994  0.999994

[:, :, 32, 1] =
 0.999994  0.999994  0.999994  0.999994  …  0.999994  0.999994  0.999994  0.999994

julia> b = mean(a |> cpu; dims=[1])
1×640×32×1 Array{Float32, 4}:
[:, :, 1, 1] =
 1.0  1.0  1.0  1.0  1.0  1.0  …  1.0  1.0  1.0  1.0  1.0  1.0

[:, :, 2, 1] =
 1.0  1.0  1.0  1.0  1.0  1.0  …  1.0  1.0  1.0  1.0  1.0  1.0

[:, :, 3, 1] =
 1.0  1.0  1.0  1.0  1.0  1.0  …  1.0  1.0  1.0  1.0  1.0  1.0

;;; …

[:, :, 30, 1] =
 1.0  1.0  1.0  1.0  1.0  1.0  …  1.0  1.0  1.0  1.0  1.0  1.0

[:, :, 31, 1] =
 1.0  1.0  1.0  1.0  1.0  1.0  …  1.0  1.0  1.0  1.0  1.0  1.0

[:, :, 32, 1] =
 1.0  1.0  1.0  1.0  1.0  1.0  …  1.0  1.0  1.0  1.0  1.0  1.0
Manifest.toml

[[deps.CUDA]]
deps = ["AbstractFFTs", "Adapt", "BFloat16s", "CEnum", "CUDA_Driver_jll", "CUDA_Runtime_Discovery", "CUDA_Runtime_jll", "CompilerSupportLibraries_jll", "ExprTools", "GPUArrays", "GPUCompiler", "LLVM", "LazyArtifacts", "Libdl", "LinearAlgebra", "Logging", "Preferences", "Printf", "Random", "Random123", "RandomNumbers", "Reexport", "Requires", "SparseArrays", "SpecialFunctions"]
git-tree-sha1 = "edff14c60784c8f7191a62a23b15a421185bc8a8"
uuid = "052768ef-5323-5732-b1bb-66c8b64840ba"
version = "4.0.1"

[[deps.CUDA_Driver_jll]]
deps = ["Artifacts", "JLLWrappers", "LazyArtifacts", "Libdl", "Pkg"]
git-tree-sha1 = "75d7896d1ec079ef10d3aee8f3668c11354c03a1"
uuid = "4ee394cb-3365-5eb0-8335-949819d2adfc"
version = "0.2.0+0"

[[deps.CUDA_Runtime_Discovery]]
deps = ["Libdl"]
git-tree-sha1 = "58dd8ec29f54f08c04b052d2c2fa6760b4f4b3a4"
uuid = "1af6417a-86b4-443c-805f-a4643ffb695f"
version = "0.1.1"

[[deps.CUDA_Runtime_jll]]
deps = ["Artifacts", "CUDA_Driver_jll", "JLLWrappers", "LazyArtifacts", "Libdl", "Pkg", "TOML"]
git-tree-sha1 = "d3e6ccd30f84936c1a3a53d622d85d7d3f9b9486"
uuid = "76a88914-d11a-5bdc-97e0-2f5a05c973a2"
version = "0.2.3+2"

[[deps.CUDNN_jll]]
deps = ["Artifacts", "CUDA_Runtime_jll", "JLLWrappers", "LazyArtifacts", "Libdl", "Pkg", "TOML"]
git-tree-sha1 = "57011df4fce448828165e566af9befa2ea94350a"
uuid = "62b44479-cb7b-5706-934f-f13b2eb2e645"
version = "8.6.0+3"

[[deps.AbstractFFTs]]
deps = ["ChainRulesCore", "LinearAlgebra"]
git-tree-sha1 = "69f7020bd72f069c219b5e8c236c1fa90d2cb409"
uuid = "621f4979-c628-5d54-868e-fcf4e3e8185c"
version = "1.2.1"

[[deps.Adapt]]
deps = ["LinearAlgebra"]
git-tree-sha1 = "0310e08cb19f5da31d08341c6120c047598f5b9c"
uuid = "79e6a3ab-5dfb-504d-930d-738a2a938a0e"
version = "3.5.0"

[[deps.BFloat16s]]
deps = ["LinearAlgebra", "Printf", "Random", "Test"]
git-tree-sha1 = "dbf84058d0a8cbbadee18d25cf606934b22d7c66"
uuid = "ab4f0b2a-ad5b-11e8-123f-65d77653426b"
version = "0.4.2"

[[deps.CEnum]]
git-tree-sha1 = "eb4cb44a499229b3b8426dcfb5dd85333951ff90"
uuid = "fa961155-64e5-5f13-b03f-caf6b980ea82"
version = "0.4.2"

[[deps.GPUArrays]]
deps = ["Adapt", "GPUArraysCore", "LLVM", "LinearAlgebra", "Printf", "Random", "Reexport", "Serialization", "Statistics"]
git-tree-sha1 = "4dfaff044eb2ce11a897fecd85538310e60b91e6"
uuid = "0c68f7d7-f131-5f86-a1c3-88cf8149b2d7"
version = "8.6.2"

[[deps.GPUArraysCore]]
deps = ["Adapt"]
git-tree-sha1 = "57f7cde02d7a53c9d1d28443b9f11ac5fbe7ebc9"
uuid = "46192b85-c4d5-4398-a991-12ede77f4527"
version = "0.1.3"

[[deps.GPUCompiler]]
deps = ["ExprTools", "InteractiveUtils", "LLVM", "Libdl", "Logging", "TimerOutputs", "UUIDs"]
git-tree-sha1 = "95185985a5d2388c6d0fedb06181ad4ddd40e0cb"
uuid = "61eb1bfa-7361-4325-ad38-22787b887f55"
version = "0.17.2"

[[deps.LLVM]]
deps = ["CEnum", "LLVMExtra_jll", "Libdl", "Printf", "Unicode"]
git-tree-sha1 = "df115c31f5c163697eede495918d8e85045c8f04"
uuid = "929cbde3-209d-540e-8aea-75f648917ca0"
version = "4.16.0"

Version info

Details on Julia:
Julia 1.8.2

julia> versioninfo()
Julia Version 1.8.2
Commit 36034abf26 (2022-09-29 15:21 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 16 × AMD Ryzen 7 4800H with Radeon Graphics
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, znver2)
  Threads: 4 on 16 virtual cores
Environment:
  JULIA_EDITOR = code
  JULIA_NUM_THREADS = 4

Details on CUDA:

julia> CUDA.versioninfo()
CUDA runtime 11.8, artifact installation
CUDA driver 12.0
NVIDIA driver 527.56.0

Libraries:
- CUBLAS: 11.11.3
- CURAND: 10.3.0
- CUFFT: 10.9.0
- CUSOLVER: 11.4.1
- CUSPARSE: 11.7.5
- CUPTI: 18.0.0
- NVML: 12.0.0+527.56

Toolchain:
- Julia: 1.8.2
- LLVM: 13.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0, 7.1, 7.2
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80, sm_86

1 device:
  0: NVIDIA GeForce GTX 1650 Ti (sm_75, 2.901 GiB / 4.000 GiB available)
@gabrielpreviato gabrielpreviato added the bug Something isn't working label Feb 13, 2023
@maleadt maleadt added the needs information Further information is requested label Feb 13, 2023
@maleadt
Member

maleadt commented Feb 13, 2023

Hmm, that's not enough information to help you.
I see EDITOR=code in there; are you executing this from the VS Code terminal? If so, see #875.

@gabrielpreviato
Author

The same happens in a REPL outside VS Code.
I think the problem may be in the Statistics implementation of the mean function, but it works fine on the CPU.

Example:

julia> A = CUDA.ones((640, 640, 32, 1))

julia> B = ones((640, 640, 32, 1))

julia> mean(A; dims=1)
1×640×32×1 CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}:
[:, :, 1, 1] =
 0.999994  0.999994  0.999994  …  0.999994  0.999994  0.999994

[:, :, 2, 1] =
 0.999994  0.999994  0.999994  …  0.999994  0.999994  0.999994

[:, :, 3, 1] =
 0.999994  0.999994  0.999994  …  0.999994  0.999994  0.999994

;;; …

[:, :, 30, 1] =
 0.999994  0.999994  0.999994  …  0.999994  0.999994  0.999994

[:, :, 31, 1] =
 0.999994  0.999994  0.999994  …  0.999994  0.999994  0.999994

[:, :, 32, 1] =
 0.999994  0.999994  0.999994  …  0.999994  0.999994  0.999994

julia> mean(A; dims=1)
1×640×32×1 CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}:
[:, :, 1, 1] =
 0.999994  0.999994  0.999994  …  0.999994  0.999994  0.999994

[:, :, 2, 1] =
 0.999994  0.999994  0.999994  …  0.999994  0.999994  0.999994

[:, :, 3, 1] =
 0.999994  0.999994  0.999994  …  0.999994  0.999994  0.999994

;;; …

[:, :, 30, 1] =
 0.999994  0.999994  0.999994  …  0.999994  0.999994  0.999994

[:, :, 31, 1] =
 0.999994  0.999994  0.999994  …  0.999994  0.999994  0.999994

[:, :, 32, 1] =
 0.999994  0.999994  0.999994  …  0.999994  0.999994  0.999994

julia> mean(A; dims=[1,2])
1×1×32×1 CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}:
[:, :, 1, 1] =
 1.0000081

[:, :, 2, 1] =
 1.0000081

[:, :, 3, 1] =
 1.0000081

;;; …

[:, :, 30, 1] =
 1.0000081

[:, :, 31, 1] =
 1.0000081

[:, :, 32, 1] =
 1.0000081

julia> mean(A; dims=[1,2,3])
1×1×1×1 CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}:
[:, :, 1, 1] =
 1.0000155

julia> mean(B; dims=[1,2,3])
1×1×1×1 Array{Float64, 4}:
[:, :, 1, 1] =
 1.0

julia> versioninfo()
Julia Version 1.8.2
Commit 36034abf26 (2022-09-29 15:21 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 16 × AMD Ryzen 7 4800H with Radeon Graphics
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, znver2)
  Threads: 1 on 16 virtual cores

@gabrielpreviato
Author

I think I found what may be causing this: the Statistics mean function dispatches to _mean, which is overridden in GPUArrays.
function Statistics._mean(f, A::AbstractGPUArray, dims)

It seems that when the inverse of the size of the reduced dimensions (a Float64) is converted to Float32 (in the situation where you have a Float32 CuArray), a small rounding error is introduced, which creates the behavior I mentioned above.
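To illustrate the rounding effect on the CPU (a minimal sketch of the scale-then-sum strategy described above; the actual GPUArrays code path may differ in reduction order):

```julia
n = 640
λ = Float32(inv(n))    # 1/640 is not exactly representable in binary Float32
# Scaling each element by λ first and then summing rounds at every addition,
# so the result drifts away from 1:
scaled = sum(fill(λ, n))            # ≈ 0.999999f0, not exactly 1.0f0
# Summing first and dividing once keeps the intermediate sum exact
# (integers below 2^24 are exact in Float32):
divided = sum(fill(1.0f0, n)) / n   # 640.0f0 / 640 == 1.0f0
```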

I executed this _mean function statement by statement and found this:

julia> A = CUDA.ones((640, 640, 32, 1))
640×640×32×1 CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}:

julia> T = float(eltype(A))
Float32

julia> λ = convert(T, inv(_mean_denom(A, dims)))
0.0015625f0

julia> sum(Base.Fix1(*,λ), A; dims)
1×640×32×1 CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}:
[:, :, 1, 1] =
 0.999994  0.999994  0.999994  0.999994  …  0.999994  0.999994  0.999994  0.999994

[:, :, 2, 1] =
 0.999994  0.999994  0.999994  0.999994  …  0.999994  0.999994  0.999994  0.999994

[:, :, 3, 1] =
 0.999994  0.999994  0.999994  0.999994  …  0.999994  0.999994  0.999994  0.999994

;;; …

[:, :, 30, 1] =
 0.999994  0.999994  0.999994  0.999994  …  0.999994  0.999994  0.999994  0.999994

[:, :, 31, 1] =
 0.999994  0.999994  0.999994  0.999994  …  0.999994  0.999994  0.999994  0.999994

[:, :, 32, 1] =
 0.999994  0.999994  0.999994  0.999994  …  0.999994  0.999994  0.999994  0.999994

But with a smaller array this does not happen:

julia> A = CUDA.ones((320, 320, 32, 1))
320×320×32×1 CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}:

julia> T = float(eltype(A))
Float32

julia> λ = convert(T, inv(_mean_denom(A, dims)))
0.003125f0

julia> sum(Base.Fix1(*,λ), A; dims)
1×320×32×1 CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}:
[:, :, 1, 1] =
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  …  1.0  1.0  1.0  1.0  1.0  1.0  1.0

[:, :, 2, 1] =
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  …  1.0  1.0  1.0  1.0  1.0  1.0  1.0

[:, :, 3, 1] =
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  …  1.0  1.0  1.0  1.0  1.0  1.0  1.0

;;; …

[:, :, 30, 1] =
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  …  1.0  1.0  1.0  1.0  1.0  1.0  1.0

[:, :, 31, 1] =
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  …  1.0  1.0  1.0  1.0  1.0  1.0  1.0

[:, :, 32, 1] =
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  …  1.0  1.0  1.0  1.0  1.0  1.0  1.0

@mcabbott can you also check this?

@gabrielpreviato gabrielpreviato changed the title Weird behaviour of CuArray Weird behaviour of mean function in CuArray Feb 13, 2023
@mcabbott
Contributor

I don't see this with the size above:

julia> size(a)
(640, 640, 32, 1)

julia> mean(a; dims=1)[1:2]
2-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:
 1.0
 1.0

(@v1.10) pkg> st CUDA GPUArrays
Status `~/.julia/environments/v1.10/Project.toml`
  [052768ef] CUDA v4.0.1
  [0c68f7d7] GPUArrays v8.6.2

julia> nextfloat(0.999994f0, 101)
1.0f0

What is the actual value you get? The compact printing may have lost some digits from 0.999994.

@gabrielpreviato
Author

gabrielpreviato commented Feb 14, 2023

julia> mean(A; dims=1)[1:2]
2-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:
 0.9999937
 0.9999937

(@v1.8) pkg> st CUDA GPUArrays
Project YOLOv7 v0.1.0
Status `C:\Users\gabri\.julia\environments\v1.8\Project.toml`
  [052768ef] CUDA v4.0.1
  [0c68f7d7] GPUArrays v8.6.2

julia> nextfloat(0.9999937f0, 101)
0.9999997f0

I made the array larger just to check, and look:

julia> size(A)
(1280, 1280, 32, 1)

julia> mean(A; dims=1)[1:2]
2-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:
 0.9999892
 0.9999892

I noticed this in my Flux BatchNorm layers: they were producing wrong values, and that's how I ended up finding this problem with larger arrays.

Another example, with smaller values:

julia> B = CUDA.fill(1.0f-5, (640, 640, 32, 1))
640×640×32×1 CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}:

julia> mean(B; dims=1)[1:2]
2-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:
 1.0000034f-5
 1.0000034f-5

@mcabbott
Contributor

mcabbott commented Feb 14, 2023

Some accuracy experiments (on CPU):

julia> function countepsfrom(x::T, xtrue) where {T<:AbstractFloat}
           target = T(xtrue)
           for n in Iterators.flatten(zip(0:100, -1:-1:-100))
               nextfloat(x, n) === target && return n
           end
           return (target - x) / eps(x)
       end;

julia> mean1(x, λ=convert(eltype(x), inv(length(x)))) = sum(Base.Fix1(*,λ), x);  # like CUDA at present

julia> mean1(ones(Float32, 640))
0.9999992f0

julia> countepsfrom(ans, 1)
13

julia> mean2(x, l=length(x)) = sum(Base.Fix2(/,l), x);  # divide instead, same result

julia> mean2(ones(Float32, 640))
0.9999992f0

julia> countepsfrom(ans, 1)
13

julia> mean3(x) = sum(x)/length(x);  # naive: sum, then divide

julia> mean3(ones(Float32, 640))
1.0f0

Float16 to look for overflow:

julia> mean1(ones(Float16, 10^5))
Float16(1.002)

julia> countepsfrom(ans, 1)
-2

julia> mean2(ones(Float16, 10^5))
Float16(0.0)

julia> mean3(ones(Float16, 10^5))
NaN16

julia> x = rand(Float16, 10^5);

julia> xbar = mean(big.(x));  # true result

julia> countepsfrom(mean1(x), xbar)
-2

julia> countepsfrom(mean2(x), xbar)
Inf16

julia> countepsfrom(mean3(x), xbar)
Inf16

@gabrielpreviato
Author

gabrielpreviato commented Feb 14, 2023

The solution I proposed in JuliaGPU/GPUArrays.jl#453 fails for Float16 but not for Float32.

julia> mean4(x, λ=convert(eltype(x), inv(length(x)))) = sum(x) .* λ;

julia> mean4(ones(Float32, 640))
1.0f0

julia> mean4(ones(Float16, 10^5))
Inf16

julia> x = rand(Float16, 10^5);

julia> xbar = mean(big.(x));

julia> countepsfrom(mean4(x), xbar)
-2

julia> mean1(ones(Float32, 10^9))
0.99999976f0

julia> countepsfrom(ans, 1)
4

julia> mean4(ones(Float32, 10^9))
1.0f0

I've never worked with Float16, so I'm not sure how best to handle it in these operations.
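One possible way to handle Float16 (a sketch only; `mean5` is a hypothetical helper, not something GPUArrays currently does): accumulate in a wider type and divide once before converting back, which avoids both the overflow of summing in Float16 and the per-element scaling error.

```julia
# Hypothetical helper: accumulate Float16 input in Float32,
# divide once, and convert back only at the end.
mean5(x::AbstractArray{Float16}) = Float16(sum(Float32, x) / length(x))

mean5(ones(Float16, 10^5))   # Float16(1.0): no overflow, no drift
```

The trade-off is a wider accumulator (which a GPU kernel may or may not want to pay for), but the reduction itself stays a plain sum.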

@maleadt
Member

maleadt commented Feb 14, 2023

The reason this doesn't reproduce consistently is the big_mapreduce_threshold cutoff that mapreduce uses.
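For context, a CPU sketch of how the reduction order alone can change a Float32 result (the GPU's tree reduction is yet another order, and which kernel runs depends on the array size, hence the inconsistency between the 320 and 640 cases):

```julia
xs = fill(Float32(1/640), 640)
seq  = foldl(+, xs)   # strictly left-to-right Float32 accumulation
pair = sum(xs)        # Julia's pairwise (blocked) summation
# seq and pair can differ in the last bits, and neither needs to
# equal 1.0f0 exactly; only the reduction strategy changed.
```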
