
Almost finished #5

Merged
merged 21 commits
Aug 31, 2016
Conversation

@timholy timholy commented Aug 30, 2016

This isn't quite done yet, but I can't resist the opportunity for a status update. It has taken longer than expected, but I decided to take the opportunity to really push the architecture and see how flexible and performant I could make this. It turns out that quite a lot could be done, and the results are impressive if I say so myself 😄. One of the motivations was benchmarking some of @mronian's code, which pointed out that imgradients (and really, imfilter) was the bottleneck. So aside from being an exercise in testing the architecture, it seemed there might be real-world performance benefits to be had.

This isn't finished yet, so this is uglier than it will be, but check this out:

julia> img = rand(1000,1000);

julia> r = CPUThreads(Algorithm.FIR())
ComputationalResources.CPUThreads{ImagesFiltering.Algorithm.FIR}(ImagesFiltering.Algorithm.FIR())

julia> function my_sobel(r, img)
           kernel = KernelFactors.sobel()
           imfilter(r, img, kernel[1]), imfilter(r, img, kernel[2])
       end

julia> @benchmark my_sobel($r, $img)
BenchmarkTools.Trial: 
  samples:          782
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  15.72 mb
  allocs estimate:  4442
  minimum time:     5.11 ms (0.00% GC)
  median time:      6.00 ms (9.29% GC)
  mean time:        6.39 ms (13.94% GC)
  maximum time:     112.49 ms (94.52% GC)

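As an aside (not part of the PR itself), the reason factored kernels like KernelFactors.sobel() pay off is that a separable 2-D kernel can be applied as two cheap 1-D passes. Here is a minimal plain-Julia sketch of that idea; the names conv1d and sobel_x are illustrative only, not ImageFiltering's API:

```julia
# Sketch of the separable-filtering idea behind KernelFactors.sobel().
# A 3x3 Sobel kernel factors into a smoothing vector and a derivative vector,
# each applied along one dimension.
const smooth = [1.0, 2.0, 1.0]   # triangular smoothing
const deriv  = [-1.0, 0.0, 1.0]  # central-difference derivative

# Apply a length-3 kernel `v` along dimension `dim`, interior pixels only
function conv1d(img, v, dim)
    out = zeros(eltype(img), size(img))
    h, w = size(img)
    if dim == 1
        for j in 1:w, i in 2:h-1
            out[i, j] = v[1]*img[i-1, j] + v[2]*img[i, j] + v[3]*img[i+1, j]
        end
    else
        for j in 2:w-1, i in 1:h
            out[i, j] = v[1]*img[i, j-1] + v[2]*img[i, j] + v[3]*img[i, j+1]
        end
    end
    out
end

# Horizontal gradient: smooth down columns, then differentiate across rows
sobel_x(img) = conv1d(conv1d(img, smooth, 1), deriv, 2)
```

Two 1-D passes cost O(2k) multiplies per pixel instead of O(k^2) for the equivalent 2-D convolution, which is where much of the speedup comes from.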
Compare to Images:

julia> @benchmark imgradients($img, $("sobel"))
BenchmarkTools.Trial: 
  samples:          146
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  61.36 mb
  allocs estimate:  219
  minimum time:     29.41 ms (6.81% GC)
  median time:      35.32 ms (5.72% GC)
  mean time:        34.42 ms (7.85% GC)
  maximum time:     124.72 ms (72.91% GC)

And compare to a GPU implementation (note this moves data to and from the GPU, because for me @benchmark otherwise crashes with a "Device out of memory" error):

julia> using ArrayFire

julia> af_sobel(img) = map(A->convert(Array, A), sobel(AFArray(img)))
af_sobel (generic function with 1 method)

julia> @benchmark af_sobel($img)
BenchmarkTools.Trial: 
  samples:          97
  evals/sample:     1
  time tolerance:   15.00%
  memory tolerance: 1.00%
  memory estimate:  15.26 mb
  allocs estimate:  93
  minimum time:     8.00 ms (0.00% GC)
  median time:      10.41 ms (2.65% GC)
  mean time:        10.31 ms (4.01% GC)
  maximum time:     20.84 ms (55.16% GC)

So if you want the data back on the CPU again for whatever comes next, we're now quite a bit faster in plain julia than a GPU implementation.

Notes:

  • I ran this with JULIA_NUM_THREADS=8. Yes, this now supports threads. Single-threaded, it exhibits a median of 13 ms, so it's still a considerable improvement over what we have in Images. (Spoiler: it uses a fancy cache-efficient tiled implementation, using tricks developed in TiledIteration.)
  • You have to be running a really recent master to get decent performance. Specifically, you need "Add a more efficient implementation of in(::CartesianIndex, ::CartesianRange)" (JuliaLang/julia#18277), which should be backported to julia-0.5 (though possibly not until julia-0.5.1; if needed, we can put in a version check and add the missing method here).
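The threaded strategy mentioned in the first bullet can be sketched roughly as follows. This is a hypothetical simplification (the helper names tile_ranges and fir_tile! are invented, and the real TiledIteration-based code tiles for cache locality, not merely by thread count): each thread filters its own contiguous block of output rows.

```julia
# Illustrative sketch only: split the output into row blocks, one per thread,
# and let each thread run a small 1-D FIR filter over its block.
function tile_ranges(n, ntiles)
    step = cld(n, ntiles)
    [i:min(i + step - 1, n) for i in 1:step:n]
end

# Horizontal [-1, 0, 1] derivative on interior columns, restricted to `rows`
function fir_tile!(out, img, rows)
    for i in rows, j in 2:size(img, 2)-1
        out[i, j] = img[i, j+1] - img[i, j-1]
    end
    out
end

function threaded_filter(img)
    out = zeros(eltype(img), size(img))
    blocks = tile_ranges(size(img, 1), Threads.nthreads())
    Threads.@threads for b in eachindex(blocks)
        fir_tile!(out, img, blocks[b])
    end
    out
end
```

Because the row blocks are disjoint, no two threads write the same output element, so no locking is needed.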

timholy and others added 21 commits August 20, 2016 07:48
A good test for the eltype generality of our operations
Now it's possible to construct kernel-tuples with a mix of TS and AbstractArray kernels, and things Just Work.
No actual tests changed
Previously, filter cascades would result in undefined values on the edges of intermediate states. This tracks the edge size at each stage of the cascade to ensure that no invalid values are used.
From an efficiency standpoint it's better to do the TriggsSdika kernels first.
Taken from Images pull request 544
Switches to returning ReshapedVectors from the factored kernels, and adds optimized methods. It's not clear we strictly need all the optimized methods, so some of these may go away sometime.
The performance is very good; at this point the main avenue for improvement is to implement lazy padding.
This means that padding is no longer necessary for pure-FIR filters (and pure-IIR filters).
The most complex part of this was getting the indices-handling correct.
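To illustrate the "padding is no longer necessary" point in the commits above: one way to avoid copying the image into a padded buffer is to restrict the output to indices where the kernel fits entirely. The sketch below is a generic "valid-region" filter, not the package's actual mechanism (which produces full-size output via index arithmetic on the boundary):

```julia
# Illustrative only: filtering without a padded copy, by shrinking the output
# to the "valid" region where the whole kernel overlaps the image.
function filter_valid(img, kern)
    kh, kw = size(kern)
    out = zeros(eltype(img), size(img, 1) - kh + 1, size(img, 2) - kw + 1)
    for j in 1:size(out, 2), i in 1:size(out, 1)
        s = zero(eltype(img))
        for kj in 1:kw, ki in 1:kh
            s += kern[ki, kj] * img[i + ki - 1, j + kj - 1]
        end
        out[i, j] = s
    end
    out
end
```

Every access `img[i + ki - 1, j + kj - 1]` stays in bounds by construction, so no boundary values ever need to be synthesized.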
mronian commented Aug 30, 2016

Awesome \m/
