
Almost finished #5

Merged
merged 21 commits
Aug 31, 2016
Conversation

@timholy timholy commented Aug 30, 2016

This isn't quite done yet, but I can't resist the opportunity for a status update. It has taken longer than expected, but I decided to take the opportunity to really push the architecture and see how flexible and performant I could make this. It turns out that quite a lot could be done, and the results are impressive if I say so myself 😄. One of the motivations was benchmarking some of @mronian's code, which pointed out that imgradients (and really, imfilter) was the bottleneck. So aside from being an exercise in testing the architecture, it seemed there might be real-world performance benefits to be had.

This isn't finished yet, so this is uglier than it will be, but check this out:

julia> img = rand(1000,1000);

julia> r = CPUThreads(Algorithm.FIR())
ComputationalResources.CPUThreads{ImagesFiltering.Algorithm.FIR}(ImagesFiltering.Algorithm.FIR())

julia> function my_sobel(r, img)
           kernel = KernelFactors.sobel()
           imfilter(r, img, kernel[1]), imfilter(r, img, kernel[2])
       end

julia> @benchmark my_sobel($r, $img)
BenchmarkTools.Trial: 
  samples:          782
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  15.72 mb
  allocs estimate:  4442
  minimum time:     5.11 ms (0.00% GC)
  median time:      6.00 ms (9.29% GC)
  mean time:        6.39 ms (13.94% GC)
  maximum time:     112.49 ms (94.52% GC)

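As an aside (not part of the PR itself), the reason factored kernels like KernelFactors.sobel() pay off is that a separable 2-D kernel can be applied as two cheap 1-D passes. Here is a minimal plain-Julia sketch of that idea; the names conv1d and sobel_x are illustrative only, not ImageFiltering's API:

```julia
# Sketch of the separable-filtering idea behind KernelFactors.sobel().
# A 3x3 Sobel kernel factors into a smoothing vector and a derivative vector,
# each applied along one dimension.
const smooth = [1.0, 2.0, 1.0]   # triangular smoothing
const deriv  = [-1.0, 0.0, 1.0]  # central-difference derivative

# Apply a length-3 kernel `v` along dimension `dim`, interior pixels only
function conv1d(img, v, dim)
    out = zeros(eltype(img), size(img))
    h, w = size(img)
    if dim == 1
        for j in 1:w, i in 2:h-1
            out[i, j] = v[1]*img[i-1, j] + v[2]*img[i, j] + v[3]*img[i+1, j]
        end
    else
        for j in 2:w-1, i in 1:h
            out[i, j] = v[1]*img[i, j-1] + v[2]*img[i, j] + v[3]*img[i, j+1]
        end
    end
    out
end

# Horizontal gradient: smooth down columns, then differentiate across rows
sobel_x(img) = conv1d(conv1d(img, smooth, 1), deriv, 2)
```

Two 1-D passes cost O(2k) multiplies per pixel instead of O(k^2) for the equivalent 2-D convolution, which is where much of the speedup comes from.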
Compare to Images:

julia> @benchmark imgradients($img, $("sobel"))
BenchmarkTools.Trial: 
  samples:          146
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  61.36 mb
  allocs estimate:  219
  minimum time:     29.41 ms (6.81% GC)
  median time:      35.32 ms (5.72% GC)
  mean time:        34.42 ms (7.85% GC)
  maximum time:     124.72 ms (72.91% GC)

And compare to a GPU implementation (note this moves data to and from the GPU, because for me @benchmark otherwise crashes with a "Device out of memory" error):

julia> using ArrayFire

julia> af_sobel(img) = map(A->convert(Array, A), sobel(AFArray(img)))
af_sobel (generic function with 1 method)

julia> @benchmark af_sobel($img)
BenchmarkTools.Trial: 
  samples:          97
  evals/sample:     1
  time tolerance:   15.00%
  memory tolerance: 1.00%
  memory estimate:  15.26 mb
  allocs estimate:  93
  minimum time:     8.00 ms (0.00% GC)
  median time:      10.41 ms (2.65% GC)
  mean time:        10.31 ms (4.01% GC)
  maximum time:     20.84 ms (55.16% GC)

So if you want the data back on the CPU again for whatever comes next, we're now quite a bit faster in plain julia than a GPU implementation.

Notes:

  • I ran this with JULIA_NUM_THREADS=8. Yes, this now supports threads. Single-threaded, it exhibits a median of 13 ms, so it's still a considerable improvement over what we have in Images. (Spoiler: it uses a fancy cache-efficient tiled implementation, using tricks developed in TiledIteration.)
  • You have to be running a really recent master to get decent performance. Specifically, you need "Add a more efficient implementation of in(::CartesianIndex, ::CartesianRange)" (JuliaLang/julia#18277), which should be backported to julia-0.5 (though possibly not until julia-0.5.1; if needed, we can put in a version check and add the missing method here).
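The threaded strategy mentioned in the first bullet can be sketched roughly as follows. This is a hypothetical simplification (the helper names tile_ranges and fir_tile! are invented, and the real TiledIteration-based code tiles for cache locality, not merely by thread count): each thread filters its own contiguous block of output rows.

```julia
# Illustrative sketch only: split the output into row blocks, one per thread,
# and let each thread run a small 1-D FIR filter over its block.
function tile_ranges(n, ntiles)
    step = cld(n, ntiles)
    [i:min(i + step - 1, n) for i in 1:step:n]
end

# Horizontal [-1, 0, 1] derivative on interior columns, restricted to `rows`
function fir_tile!(out, img, rows)
    for i in rows, j in 2:size(img, 2)-1
        out[i, j] = img[i, j+1] - img[i, j-1]
    end
    out
end

function threaded_filter(img)
    out = zeros(eltype(img), size(img))
    blocks = tile_ranges(size(img, 1), Threads.nthreads())
    Threads.@threads for b in eachindex(blocks)
        fir_tile!(out, img, blocks[b])
    end
    out
end
```

Because the row blocks are disjoint, no two threads write the same output element, so no locking is needed.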

timholy and others added 21 commits August 20, 2016 07:48
A good test for the eltype generality of our operations
Now it's possible to construct kernel-tuples with a mix of TS and AbstractArray kernels, and things Just Work.
No actual tests changed
Previously, filter cascades would result in undefined values on the edges of intermediate states. This tracks the edge size at each stage of the cascade to ensure that no invalid values are used.
From an efficiency standpoint it's better to do the TriggsSdika kernels first.
Taken from Images pull request 544
Switches to returning ReshapedVectors from the factored kernels, and adds optimized methods. It's not clear we strictly need all the optimized methods, so some of these may go away sometime.
The performance is very good; at this point the main avenue for improvement is to implement lazy padding.
This means that padding is no longer necessary for pure-FIR filters (and pure-IIR filters).
The most complex part of this was getting the indices-handling correct.
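To illustrate the "padding is no longer necessary" point in the commits above: one way to avoid copying the image into a padded buffer is to restrict the output to indices where the kernel fits entirely. The sketch below is a generic "valid-region" filter, not the package's actual mechanism (which produces full-size output via index arithmetic on the boundary):

```julia
# Illustrative only: filtering without a padded copy, by shrinking the output
# to the "valid" region where the whole kernel overlaps the image.
function filter_valid(img, kern)
    kh, kw = size(kern)
    out = zeros(eltype(img), size(img, 1) - kh + 1, size(img, 2) - kw + 1)
    for j in 1:size(out, 2), i in 1:size(out, 1)
        s = zero(eltype(img))
        for kj in 1:kw, ki in 1:kh
            s += kern[ki, kj] * img[i + ki - 1, j + kj - 1]
        end
        out[i, j] = s
    end
    out
end
```

Every access `img[i + ki - 1, j + kj - 1]` stays in bounds by construction, so no boundary values ever need to be synthesized.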
mronian commented Aug 30, 2016

Awesome \m/
