Possible way to implement a LoopVectorization extension for conv2d & meanpool2d & activations #540
Conversation
[FluxMLBenchmarks CI bot: benchmark report with judge, target and baseline results, benchmark group list, Julia versioninfo and runtime information; table contents collapsed.]
Was this script run as well? https://github.com/FluxML/FluxMLBenchmarks.jl/blob/main/benchmark/benchmark/nnlib/conv.jl
It should've run; I've asked folks more familiar with the benchmarking setup why it didn't. For now, your local benchmarking results are fine.
Just updated.
Does anyone have an idea why the results on some CI devices are correct but sometimes totally wrong? Some weird LV behavior (the version should be the same on all devices)?
Note that … Note also that …
Bugs? It would be good if someone has a minimal example of the results being totally wrong, plus the CPU model/arch on which they were wrong.
```julia
m = y_out + (y_stride - 1) * (y_out - 1)
n = x_out + (x_stride - 1) * (x_out - 1)
value = zero(T)
for in_channel in 1:in_channels, y_w in 1:weight_height, x_w in 1:weight_width
    y_in = m + (y_w - 1) * y_dilation
    x_in = n + (x_w - 1) * x_dilation
```
I'm not sure if it makes a difference, but LV's handling of indices written within an index expression might be better. It has separate code for building indices by parsing index expressions vs parsing the loop body. I'd have to look at the details to see the degree to which they're different. In theory, they should do the same thing.

LV would definitely benefit from the special case of `x_dilation` being `1`. Might be worth branching over that (also, special-case `-1` or other common small factors).
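A minimal sketch of what such a branch could look like, reusing the loop structure and variable names from the diff above (a hypothetical illustration, not the PR's actual code; only the stride-1, padding-free, ungrouped case is shown):

```julia
using LoopVectorization

# Hypothetical sketch: branch once on the dilation so that the common
# x_dilation == y_dilation == 1 case gives @turbo plain unit-step index expressions.
# Array sizes are assumed to be consistent (output spatial size matches input/kernel/dilation).
function conv2d_dilation_branch!(output, input, weight, x_dilation::Int, y_dilation::Int)
    weight_width, weight_height, in_channels, out_channels = size(weight)
    output_width, output_height, _, batch_size = size(output)
    for index_batch in 1:batch_size
        if x_dilation == 1 && y_dilation == 1
            @turbo for out_channel in 1:out_channels, y_out in 1:output_height, x_out in 1:output_width
                value = zero(eltype(output))
                for in_channel in 1:in_channels, y_w in 1:weight_height, x_w in 1:weight_width
                    value += input[x_out + x_w - 1, y_out + y_w - 1, in_channel, index_batch] *
                             weight[x_w, y_w, in_channel, out_channel]
                end
                output[x_out, y_out, out_channel, index_batch] = value
            end
        else
            @turbo for out_channel in 1:out_channels, y_out in 1:output_height, x_out in 1:output_width
                value = zero(eltype(output))
                for in_channel in 1:in_channels, y_w in 1:weight_height, x_w in 1:weight_width
                    value += input[x_out + (x_w - 1) * x_dilation, y_out + (y_w - 1) * y_dilation, in_channel, index_batch] *
                             weight[x_w, y_w, in_channel, out_channel]
                end
                output[x_out, y_out, out_channel, index_batch] = value
            end
        end
    end
    return output
end
```

The point of the branch is that in the first case the index expressions reduce to `x_out + x_w - 1`, which LV can treat as unit-step, while the general case keeps the runtime dilation factors.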
@chriselrod Thank you for the quick reply! Just added a minimal example.

1):

```julia
for index_batch in 1:current_batch_size # normally @threads is used here
    @turbo for out_channel in 1:out_channels, y_out in 1:output_height, x_out in 1:output_width
        for in_channel in 1:in_channels, y_w in 1:weight_height, x_w in 1:weight_width
            input_gradient[x_out + x_w - 1, y_out + y_w - 1, in_channel, index_batch] += weight[x_w, y_w, in_channel, out_channel] * output_gradient[x_out, y_out, out_channel, index_batch]
        end
    end
end
```

2):

```julia
for index_batch in 1:current_batch_size # normally @threads is used here
    for out_channel in 1:out_channels, y_out in 1:output_height, x_out in 1:output_width
        for in_channel in 1:in_channels, y_w in 1:weight_height, x_w in 1:weight_width
            input_gradient[x_out + x_w - 1, y_out + y_w - 1, in_channel, index_batch] += weight[x_w, y_w, in_channel, out_channel] * output_gradient[x_out, y_out, out_channel, index_batch]
        end
    end
end
```

On my local machine, I never got unequal results (using Julia 1.9.3 on Windows 10, running on an AMD Ryzen 9 5900X). It seems that the results are also always correct on CI devices running macOS (maybe the macOS runner always uses the same CPU).
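For reference, a self-contained way to compare the two variants above end-to-end (a sketch with made-up small sizes and hypothetical helper names, not the CI setup):

```julia
using LoopVectorization, Random

# Variant 1) wrapped in a function: @turbo over the outer loops.
function dconv_turbo!(input_gradient, weight, output_gradient)
    weight_width, weight_height, in_channels, out_channels = size(weight)
    output_width, output_height, _, current_batch_size = size(output_gradient)
    for index_batch in 1:current_batch_size
        @turbo for out_channel in 1:out_channels, y_out in 1:output_height, x_out in 1:output_width
            for in_channel in 1:in_channels, y_w in 1:weight_height, x_w in 1:weight_width
                input_gradient[x_out + x_w - 1, y_out + y_w - 1, in_channel, index_batch] +=
                    weight[x_w, y_w, in_channel, out_channel] * output_gradient[x_out, y_out, out_channel, index_batch]
            end
        end
    end
    return input_gradient
end

# Variant 2) wrapped in a function: the same loops without @turbo.
function dconv_plain!(input_gradient, weight, output_gradient)
    weight_width, weight_height, in_channels, out_channels = size(weight)
    output_width, output_height, _, current_batch_size = size(output_gradient)
    for index_batch in 1:current_batch_size
        for out_channel in 1:out_channels, y_out in 1:output_height, x_out in 1:output_width
            for in_channel in 1:in_channels, y_w in 1:weight_height, x_w in 1:weight_width
                input_gradient[x_out + x_w - 1, y_out + y_w - 1, in_channel, index_batch] +=
                    weight[x_w, y_w, in_channel, out_channel] * output_gradient[x_out, y_out, out_channel, index_batch]
            end
        end
    end
    return input_gradient
end

Random.seed!(0)
weight          = rand(Float32, 3, 3, 4, 8)     # (x_w, y_w, in_channel, out_channel)
output_gradient = rand(Float32, 14, 14, 8, 2)   # (x_out, y_out, out_channel, batch)
input_gradient  = zeros(Float32, 16, 16, 4, 2)  # spatial size 14 + 3 - 1 in both dims

g1 = dconv_turbo!(copy(input_gradient), weight, output_gradient)
g2 = dconv_plain!(copy(input_gradient), weight, output_gradient)
@show maximum(abs, g1 .- g2)  # a nonzero value here would reproduce the mismatch seen on some CI machines
```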
Some logs: worked on one CI device, failed on two others (device names and log details collapsed).
Oh right, thank you, I had apparently forgotten about that. Thanks for linking the article!
After a lot of benchmarking and many smaller attempts to raise the performance a bit more, I mainly found out that: …

I also implemented LV …

For benchmarking with whole models, I wrote this script, which can compare inference/backward times for a customizable set of models and batch sizes. It looks like only the buildkite CI is still run on this PR. This PR should now be ready for review :)
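For a quick local approximation of that kind of model-level comparison (a sketch; the layer and sizes here are made up, not the PR's benchmark set):

```julia
using Flux, LoopVectorization, BenchmarkTools  # loading LoopVectorization opts into the LV kernels on this branch

m = Conv((3, 3), 3 => 16, relu; pad=1)   # example layer
x = rand(Float32, 224, 224, 3, 16)

@btime $m($x)                                  # inference
@btime Flux.gradient(m -> sum(m($x)), $m)      # forward + backward
```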
If it's not too much trouble, could you try setting …
@ToucheSir Sure, got these results.
Just had a quick look at the results on buildkite on the AMD GPU server. I saw massive (very unusual) differences between LV (75 ms) and im2col (15+ seconds); this never occurred otherwise, e.g. not even on the CUDA GPU server. Does anyone know what the reasons for this could be? https://buildkite.com/julialang/nnlib-dot-jl/builds/1116#018b13d7-e2fe-469f-8b5a-ac06f25b0bee/286-568
A quick bug report from trying this out on this example: FluxML/Flux.jl#2350 fails:

```julia
(jl_fgGTBI) pkg> add https://github.com/jonas208/NNlib.jl#lv-ext2 ^C
julia> using NNlib, LoopVectorization
julia> function train_loop(model, optimizer, train_loader, test_loader; epochs=5)
for epoch ∈ 1:epochs
iter = tqdm(train_loader)
total = 0
corrects = 0
for (X, Y) ∈ iter
grads = Flux.gradient(model) do m
predicted = m(X)
ignore() do
b_size = size(X)[end]
corrects += sum(onecold(predicted, 0:9) .== onecold(Y, 0:9))
total += b_size
end
logitcrossentropy(predicted, Y)
end
optimizer, model = Flux.Optimise.update!(optimizer, model, grads[1])
set_postfix(iter, accuracy=corrects / total)
end
val_accuracy = accuracy(model, test_loader)
@info "Epoch $epoch/5 | Accuracy : $val_accuracy"
end
end
train_loop (generic function with 1 method)
julia> train_loop(model, optimizer, TRAIN_LOADER, TEST_LOADER)
0.0%┣ ┫ 0/469 [00:00<00:00, -0s/it]
ERROR: conversion to pointer not defined for Array{Float32, 4}
Stacktrace:
[1] error(s::String)
@ Base ./error.jl:35
[2] unsafe_convert(::Type{Ptr{Float32}}, a::Array{Float32, 4})
@ Base ./pointer.jl:67
[3] unsafe_convert
@ ~/.julia/packages/OffsetArrays/0MOrf/src/OffsetArrays.jl:464 [inlined]
[4] pointer(x::OffsetArrays.OffsetArray{Float32, 4, Array{Float32, 4}})
@ Base ./abstractarray.jl:1253
[5] memory_reference
@ ~/.julia/packages/LayoutPointers/qGkBo/src/stridedpointers.jl:21 [inlined]
[6] memory_reference
@ ~/.julia/packages/LayoutPointers/qGkBo/src/stridedpointers.jl:18 [inlined]
[7] stridedpointer_preserve
@ ~/.julia/packages/LayoutPointers/qGkBo/src/stridedpointers.jl:100 [inlined]
[8] ∇conv_data!(input_gradient::Array{…}, output_gradient::Array{…}, weight::Array{…}, cdims::DenseConvDims{…})
@ NNlibLoopVectorizationExt ~/.julia/packages/NNlib/HV0kw/ext/NNlibLoopVectorizationExt/conv.jl:160
[9] #∇conv_data#241
@ ~/.julia/packages/NNlib/HV0kw/src/conv.jl:99 [inlined]
[10] ∇conv_data
@ ~/.julia/packages/NNlib/HV0kw/src/conv.jl:95 [inlined]
[11] #380
@ ~/.julia/packages/NNlib/HV0kw/src/conv.jl:350 [inlined]
[12] unthunk
@ ~/.julia/packages/ChainRulesCore/7MWx2/src/tangent_types/thunks.jl:204 [inlined]
[13] wrap_chainrules_output
@ ~/.julia/packages/Zygote/YYT6v/src/compiler/chainrules.jl:110 [inlined]
[14] map
@ ./tuple.jl:283 [inlined]
[15] map
@ ./tuple.jl:284 [inlined]
[16] wrap_chainrules_output
@ ~/.julia/packages/Zygote/YYT6v/src/compiler/chainrules.jl:111 [inlined]
[17] ZBack
@ ~/.julia/packages/Zygote/YYT6v/src/compiler/chainrules.jl:211 [inlined]
[18] Conv
@ ~/.julia/packages/Flux/u7QSl/src/layers/conv.jl:202 [inlined]
[19] (::Zygote.Pullback{Tuple{…}, Tuple{…}})(Δ::Array{Float32, 4})
@ Zygote ~/.julia/packages/Zygote/YYT6v/src/compiler/interface2.jl:0
[20] macro expansion
@ ~/.julia/packages/Flux/u7QSl/src/layers/basic.jl:53 [inlined]
[21] _applychain
@ ~/.julia/packages/Flux/u7QSl/src/layers/basic.jl:53 [inlined]
[22] (::Zygote.Pullback{Tuple{…}, Tuple{…}})(Δ::Matrix{Float32})
@ Zygote ~/.julia/packages/Zygote/YYT6v/src/compiler/interface2.jl:0
[23] Chain
@ ~/.julia/packages/Flux/u7QSl/src/layers/basic.jl:51 [inlined]
[24] (::Zygote.Pullback{Tuple{…}, Tuple{…}})(Δ::Matrix{Float32})
@ Zygote ~/.julia/packages/Zygote/YYT6v/src/compiler/interface2.jl:0
[25] #25
@ ./REPL[53]:8 [inlined]
[26] (::Zygote.Pullback{Tuple{…}, Tuple{…}})(Δ::Float32)
@ Zygote ~/.julia/packages/Zygote/YYT6v/src/compiler/interface2.jl:0
[27] (::Zygote.var"#75#76"{Zygote.Pullback{Tuple{…}, Tuple{…}}})(Δ::Float32)
@ Zygote ~/.julia/packages/Zygote/YYT6v/src/compiler/interface.jl:45
[28] gradient(f::Function, args::Chain{Tuple{…}})
@ Zygote ~/.julia/packages/Zygote/YYT6v/src/compiler/interface.jl:97
[29] train_loop(model::Chain{…}, optimizer::@NamedTuple{…}, train_loader::DataLoader{…}, test_loader::DataLoader{…}; epochs::Int64)
@ Main ./REPL[53]:7
[30] train_loop(model::Chain{…}, optimizer::@NamedTuple{…}, train_loader::DataLoader{…}, test_loader::DataLoader{…})
@ Main ./REPL[53]:1
[31] top-level scope
@ REPL[54]:1
Some type information was truncated. Use `show(err)` to see complete types.
(jl_fgGTBI) pkg> st
Status `/private/var/folders/yq/4p2zwd614y59gszh7y9ypyhh0000gn/T/jl_fgGTBI/Project.toml`
[bdcacae8] LoopVectorization v0.12.165
[872c559c] NNlib v0.9.7 `https://github.com/jonas208/NNlib.jl#lv-ext2`
julia> versioninfo()
Julia Version 1.11.0-DEV.773
Commit 855cd5662d* (2023-10-29 15:45 UTC)
Platform Info:
OS: macOS (arm64-apple-darwin21.6.0)
CPU: 8 × Apple M1
WORD_SIZE: 64
LLVM: libLLVM-15.0.7 (ORCJIT, apple-m1)
Threads: 5 on 4 virtual cores
Environment:
  JULIA_NUM_THREADS = 4
```
That's because you're on Julia master, which removed … I guess I should've been calling …
Ah thanks, I didn't know. Piratically defining …

One minor comment is that this example prints a `check_args` warning:

```julia
julia> Flux.@autosize (28,28,1,1) Chain(MeanPool((2,2)), Flux.flatten, Dense(_ => 10))
┌ Warning: #= /Users/me/.julia/packages/NNlib/HV0kw/ext/NNlibLoopVectorizationExt/pooling.jl:57 =#:
│ `LoopVectorization.check_args` on your inputs failed; running fallback `@inbounds @fastmath` loop instead.
│ Use `warn_check_args=false`, e.g. `@turbo warn_check_args=false ...`, to disable this warning.
└ @ NNlibLoopVectorizationExt ~/.julia/packages/LoopVectorization/xHfLl/src/condense_loopset.jl:1148
Chain(
  MeanPool((2, 2)),
  Flux.flatten,
  Dense(196 => 10),                     # 1_970 parameters
)
```
That message is just `@turbo`'s logging telling you that it has fallen back to the plain `@inbounds @fastmath` loop.
Yes, I understand about the logging. But I think it's a sign that e.g. …
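(For reference, the check behind that warning can also be queried directly; a sketch, not something this PR adds:)

```julia
using LoopVectorization

x = rand(Float32, 28, 28, 1, 1)
# The same check that @turbo runs before deciding between the vectorized path
# and the @inbounds @fastmath fallback mentioned in the warning above.
LoopVectorization.check_args(x)
```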
@mcabbott (and maybe @chriselrod if you're interested) Sorry for the late reply! I only just saw this again now.

Yes, thanks, this would be better. I will probably change this in the next few days.
Due to the large number of channels relative to the small spatial size of 28x28, it's somewhat expected that this example isn't optimal for the current LV implementation. On the other hand, no stride/dilation/groups are used, so the most optimized path is chosen, and in general it shouldn't be significantly slower than im2col in such cases (probably at least on x86 AVX2). On my Ryzen processor, the LV impl took ~33s and the default impl took ~38s with 24 Julia threads and 12 BLAS threads, or ~30s with 24 Julia threads and 1 BLAS thread. I have found that turning off BLAS threading can often speed up the default im2col approach, but sometimes it can also make it worse (especially the backward passes). Apart from that, if you have time, it would be interesting to see some more benchmark results on the M1, for example the tests in the benchmark script which produces this CSV table to compare LV against im2col.
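(For anyone reproducing the threading comparison, this is the relevant knob; a sketch assuming the stock LinearAlgebra BLAS interface:)

```julia
using LinearAlgebra

# Julia threads are fixed at startup (e.g. `julia -t 24`); only BLAS threads can be changed at runtime.
println("Julia threads: ", Threads.nthreads())
BLAS.set_num_threads(1)        # im2col's gemm then runs on a single BLAS thread
println("BLAS threads:  ", BLAS.get_num_threads())
```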
Just a side note: the PyTorch timings are probably sub-optimal. One would use …
Some tests have shown that standard for-loop implementations of conv using LoopVectorization (LV) can be faster than NNlib's standard im2col + gemm approach. There has already been some talk about a possible LV extension on Slack (and in one of the ML community meetings). For example, meanpooling can also be implemented using LV, but the small boost in performance is probably negligible in large networks (pooling is not as performance-critical as conv, I think). For conv I found out that:

- `∇conv_filter!` isn't really faster than the original implementation in some situations; for that reason, I left it out for the moment. (The same applies to the backward pass of meanpool.)
Tests were run on a Ryzen 9 5900X (on Windows 10).
I also wrote some tests (not sure if they fit properly into NNlib's test strategy).
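(One plausible shape for such a test, as a sketch: it compares the extension's conv against NNlib's internal direct implementation, which is not public API and may change between versions:)

```julia
using NNlib, LoopVectorization, Test

x = rand(Float32, 16, 16, 3, 2)
w = rand(Float32, 3, 3, 3, 4)
cdims = DenseConvDims(x, w)

y_ext = conv(x, w, cdims)                                  # LV path when the extension is active
y_ref = NNlib.conv_direct!(similar(y_ext), x, w, cdims)    # naive reference implementation
@test y_ext ≈ y_ref
```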
I hope the code is somewhat usable and helpful.
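(For completeness, opting in is just a matter of loading both packages, assuming the weak-dependency setup the stack traces above show, i.e. an `NNlibLoopVectorizationExt` package extension:)

```julia
# NNlib alone keeps the default im2col/direct code paths.
using NNlib

# Also loading LoopVectorization (Julia >= 1.9) triggers the NNlibLoopVectorizationExt
# package extension on this branch; conv/meanpool calls then dispatch to the @turbo kernels.
using LoopVectorization
```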