Support for RGB colors #67
In terms of data layout, I think it is likely that a "struct of arrays" representation would work best. Preferably, rather than creating three separate arrays, the last axis of the array would be of dim 3, where the dims correspond to the red, green, and blue channels. That'd make it cheap to load separate vectors of each channel's values. This would require:
"2." could be as easy as making the existing operations on "3." is the most difficult. The most important aspect would be to allow it to model a vector as occupying multiple registers. A vector of Something else to consider would be supporting the julia> using LoopVectorization, BenchmarkTools
julia> Af32 = rand(Float32, 256, 256);
julia> Bf32 = rand(Float32, 256, 256);
julia> Au8 = rand(UInt8, 256, 256);
julia> Bu8 = rand(UInt8, 256, 256);
julia> Cf32 = similar(Af32); Cu8 = similar(Au8);
julia> function AmulBavx2!(C, A, B)
           @avx for m ∈ 1:size(A,1), n ∈ 1:size(B,2)
               C[m,n] = zero(eltype(C))
               for k ∈ 1:size(A,2)
                   C[m,n] += A[m,k] * B[k,n]
               end
           end
       end
AmulBavx2! (generic function with 1 method)
julia> @benchmark AmulBavx2!($Cf32, $Af32, $Bf32)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 741.575 μs (0.00% GC)
median time: 748.385 μs (0.00% GC)
mean time: 758.588 μs (0.00% GC)
maximum time: 2.121 ms (0.00% GC)
--------------
samples: 6572
evals/sample: 1
julia> @benchmark AmulBavx2!($Cu8, $Au8, $Bu8)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 1.366 ms (0.00% GC)
median time: 1.400 ms (0.00% GC)
mean time: 1.404 ms (0.00% GC)
maximum time: 1.735 ms (0.00% GC)
--------------
samples: 3556
evals/sample: 1

Simple elementwise multiplication and especially addition were faster with the 8-bit integers. The case for elementwise multiplication was despite the fact that there is no assembly instruction for multiplying vectors of 8-bit integers (not with AVX2 or AVX512BW):

julia> u = ntuple(Val(32)) do _
           Core.VecElement(rand(UInt8))
       end
(VecElement{UInt8}(0x43), VecElement{UInt8}(0xf1), VecElement{UInt8}(0x25), VecElement{UInt8}(0x21), VecElement{UInt8}(0xf6), VecElement{UInt8}(0x8e), VecElement{UInt8}(0xc1), VecElement{UInt8}(0x0d), VecElement{UInt8}(0xa4), VecElement{UInt8}(0x5f), VecElement{UInt8}(0xc0), VecElement{UInt8}(0x49), VecElement{UInt8}(0x6c), VecElement{UInt8}(0x98), VecElement{UInt8}(0xe6), VecElement{UInt8}(0xcb), VecElement{UInt8}(0x7a), VecElement{UInt8}(0xdb), VecElement{UInt8}(0x83), VecElement{UInt8}(0x83), VecElement{UInt8}(0x0f), VecElement{UInt8}(0xed), VecElement{UInt8}(0xe3), VecElement{UInt8}(0x92), VecElement{UInt8}(0xef), VecElement{UInt8}(0xed), VecElement{UInt8}(0x62), VecElement{UInt8}(0x42), VecElement{UInt8}(0x6d), VecElement{UInt8}(0x04), VecElement{UInt8}(0xf9), VecElement{UInt8}(0x93))
julia> u2 = ntuple(Val(32)) do _
           Core.VecElement(rand(UInt8))
       end
(VecElement{UInt8}(0xd9), VecElement{UInt8}(0x12), VecElement{UInt8}(0x69), VecElement{UInt8}(0x0c), VecElement{UInt8}(0xb4), VecElement{UInt8}(0x73), VecElement{UInt8}(0xa9), VecElement{UInt8}(0x3d), VecElement{UInt8}(0x8e), VecElement{UInt8}(0x64), VecElement{UInt8}(0x83), VecElement{UInt8}(0xa6), VecElement{UInt8}(0x49), VecElement{UInt8}(0xde), VecElement{UInt8}(0x95), VecElement{UInt8}(0x28), VecElement{UInt8}(0x27), VecElement{UInt8}(0x78), VecElement{UInt8}(0xc0), VecElement{UInt8}(0xd1), VecElement{UInt8}(0xfe), VecElement{UInt8}(0xbb), VecElement{UInt8}(0xbf), VecElement{UInt8}(0x4b), VecElement{UInt8}(0x21), VecElement{UInt8}(0x72), VecElement{UInt8}(0xf6), VecElement{UInt8}(0xa8), VecElement{UInt8}(0x58), VecElement{UInt8}(0x13), VecElement{UInt8}(0x09), VecElement{UInt8}(0xe2))
julia> using SIMDPirates
julia> vmul(u, u2)
(VecElement{UInt8}(0xcb), VecElement{UInt8}(0xf2), VecElement{UInt8}(0x2d), VecElement{UInt8}(0x8c), VecElement{UInt8}(0xf8), VecElement{UInt8}(0xca), VecElement{UInt8}(0x69), VecElement{UInt8}(0x19), VecElement{UInt8}(0xf8), VecElement{UInt8}(0x1c), VecElement{UInt8}(0x40), VecElement{UInt8}(0x56), VecElement{UInt8}(0xcc), VecElement{UInt8}(0xd0), VecElement{UInt8}(0xde), VecElement{UInt8}(0xb8), VecElement{UInt8}(0x96), VecElement{UInt8}(0xa8), VecElement{UInt8}(0x40), VecElement{UInt8}(0xf3), VecElement{UInt8}(0xe2), VecElement{UInt8}(0x1f), VecElement{UInt8}(0x5d), VecElement{UInt8}(0xc6), VecElement{UInt8}(0xcf), VecElement{UInt8}(0x8a), VecElement{UInt8}(0x2c), VecElement{UInt8}(0x50), VecElement{UInt8}(0x78), VecElement{UInt8}(0x4c), VecElement{UInt8}(0xc1), VecElement{UInt8}(0xc6))

yields

#julia> @code_native debuginfo=:none vmul(u, u2)
.text
vpunpckhbw %ymm0, %ymm0, %ymm2 # ymm2 = ymm0[8,8,9,9,10,10,11,11,12,12,13,13,14,14,15,15,24,24,25,25,26,26,27,27,28,28,29,29,30,30,31,31]
vpunpckhbw %ymm0, %ymm1, %ymm3 # ymm3 = ymm1[8],ymm0[8],ymm1[9],ymm0[9],ymm1[10],ymm0[10],ymm1[11],ymm0[11],ymm1[12],ymm0[12],ymm1[13],ymm0[13],ymm1[14],ymm0[14],ymm1[15],ymm0[15],ymm1[24],ymm0[24],ymm1[25],ymm0[25],ymm1[26],ymm0[26],ymm1[27],ymm0[27],ymm1[28],ymm0[28],ymm1[29],ymm0[29],ymm1[30],ymm0[30],ymm1[31],ymm0[31]
vpmullw %ymm2, %ymm3, %ymm2
movabsq $.rodata.cst32, %rax
vmovdqa (%rax), %ymm3
vpand %ymm3, %ymm2, %ymm2
vpunpcklbw %ymm0, %ymm0, %ymm0 # ymm0 = ymm0[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7,16,16,17,17,18,18,19,19,20,20,21,21,22,22,23,23]
vpunpcklbw %ymm0, %ymm1, %ymm1 # ymm1 = ymm1[0],ymm0[0],ymm1[1],ymm0[1],ymm1[2],ymm0[2],ymm1[3],ymm0[3],ymm1[4],ymm0[4],ymm1[5],ymm0[5],ymm1[6],ymm0[6],ymm1[7],ymm0[7],ymm1[16],ymm0[16],ymm1[17],ymm0[17],ymm1[18],ymm0[18],ymm1[19],ymm0[19],ymm1[20],ymm0[20],ymm1[21],ymm0[21],ymm1[22],ymm0[22],ymm1[23],ymm0[23]
vpmullw %ymm0, %ymm1, %ymm0
vpand %ymm3, %ymm0, %ymm0
vpackuswb %ymm2, %ymm0, %ymm0
retq
nopw %cs:(%rax,%rax)
nopl (%rax)

versus the Float32 case, where a vector multiply is a single vmulps instruction. Which does better is likely task dependent. If the 8-bit integers require a lot of multiplications and fixed-point-related re-normalizing, the floating-point numbers will probably be faster. If there are more additions/subtractions and memory moves, the 8-bit integers will have an advantage.

If CPUs with the AVX512VNNI instruction set become more common, it may be worth supporting them as well, if some kernels can deal with the fact that they horizontally sum some elements. That is, when performing a fused multiply-add on 8-bit integers, it would compute

for i in 1:16
dest[i] = a[4i-3]*b[4i-3] + a[4i-2]*b[4i-2] + a[4i-1]*b[4i-1] + a[4i]*b[4i]
end

I have such a chip and could run tests.
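To make the layout concrete, here is a minimal sketch (the function and variable names are illustrative, not an existing API) of channel-wise code on such an m×n×3 array; each channel slice is contiguous, so each load is a clean vector of one channel's values:

using LoopVectorization

# A and B are m×n×3: [:, :, 1] holds red, [:, :, 2] green, [:, :, 3] blue.
# Scaling each channel independently touches contiguous memory, so the
# loop vectorizes without any gather/shuffle work.
function scale_channels!(B, A, sr, sg, sb)
    @avx for j in axes(A, 2), i in axes(A, 1)
        B[i, j, 1] = sr * A[i, j, 1]
        B[i, j, 2] = sg * A[i, j, 2]
        B[i, j, 3] = sb * A[i, j, 3]
    end
    return B
end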
Interesting. We got away from this layout because I had the impression it would be better to have all the values localized, but it would be easy to get back to. I presume the core problem is that having the fastest dimension be of size 3 is not ideal from a vectorization standpoint? Currently you can do this by combining views:

julia> rgbchannels = rand(128, 128, 3);
julia> using ImageCore
julia> img = colorview(RGB, PermutedDimsArray(rgbchannels, (3, 1, 2)));
julia> println(summary(img))
128×128 reshape(reinterpret(RGB{Float64}, PermutedDimsArray(::Array{Float64,3}, (3, 1, 2))), 128, 128) with eltype RGB{Float64}
julia> img[1, 1]
RGB{Float64}(0.5340850453425747,0.1268892532095427,0.4547422783495081)

EDIT: I should clarify that the original rgbchannels array is still there at the bottom of the wrapper stack:

julia> println(summary(img.parent.parent.parent))
128×128×3 Array{Float64,3}

Do you need something more "dedicated" or does this solve the problem? Note it's 3 "orthogonal" manipulations: an index permutation via PermutedDimsArray, an eltype change via reinterpret, and a reshape to drop the channel dimension. But if the nesting is trouble, we used to have a dedicated wrapper type for this and could bring it back.
This prompts me to realize the above should probably be 4 layers of nesting: we probably need something to say to the compiler that dimension 3 of rgbchannels has a fixed size of 3.
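A minimal sketch of what that fourth layer might look like (the wrapper name and design are hypothetical, not an existing package type):

# hypothetical wrapper asserting that the third dimension has static size 3,
# so the compiler can unroll across the channel dimension
struct StaticRGBDims{T,A<:AbstractArray{T,3}} <: AbstractArray{T,3}
    data::A
    function StaticRGBDims(a::AbstractArray{T,3}) where {T}
        size(a, 3) == 3 || throw(ArgumentError("dimension 3 must have size 3"))
        return new{T,typeof(a)}(a)
    end
end
Base.size(a::StaticRGBDims) = (size(a.data, 1), size(a.data, 2), 3)
Base.@propagate_inbounds Base.getindex(a::StaticRGBDims, i::Int, j::Int, k::Int) =
    a.data[i, j, k]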
Sorry to keep rethinking this, but for ImageFiltering's purposes we could just craft a tile buffer in whatever raw numeric layout works best and copy into it. If we go that way (which seems pretty attractive), I suspect this issue could be closed now. To finally address one of your other points,
For most mathematics you really need to convert to floating point anyway.
That might be the biggest problem. If this is ever not the case, the struct-of-arrays layout has an advantage. Consider the complex multiplication c = a::Complex * b::Complex:
c.re == a.re * b.re - a.im * b.im
c.im == a.re * b.im + a.im * b.re

If our inputs are two SIMD vectors holding <re, im> pairs, the scalar code above turns into:

are = a[1]
arevec = vbroadcast(are)
aim = a[2]
aimvec = vbroadcast(aim)
bimre = shuffle(b, [2,1])
c = arevec * b + <-1,1> * aimvec * bimre # elementwise ops

Here, we spent a lot of effort reorganizing the data so that we could put it in a format amenable to vectorization. We'd also need some more shuffling to extend this to vectors of length 4, 8, or 16; instead of broadcasts, we'd probably use shuffles for that. If, on the other hand, we had a struct-of-arrays layout, with vectors of width W holding the real parts and the imaginary parts separately, we could simply calculate

c.re == a.re * b.re - a.im * b.im
c.im == a.re * b.im + a.im * b.re

where each operand is now a whole vector of real or imaginary parts. When/if arrays of structs are supported, my plan is actually to use shuffles after loading them to put them in the above format, as I think that is much simpler than the first example (although it is obviously worse than not needing the shuffles in the first place).

As an example more directly applicable to colors: if we were multiplying different colors by different constants, the scalar definition is per channel,

*(c::Number, a::Linear3) = base_color_type(a)(c*comp1(a), c*comp2(a), c*comp3(a))

so if we wanted to multiply multiple colors by multiple constants, a struct-of-arrays layout reduces this to three independent channel-wise multiplies. I don't yet know what sorts of algorithms and operations are done with these colors, but as long as we're capable of doing this kind of channel-wise arithmetic, that should cover a lot of ground.
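A sketch of the struct-of-arrays version as actual runnable code (plain Base Julia with @simd so it stands on its own; all names are illustrative):

# struct-of-arrays complex multiply: real and imaginary parts live in
# separate arrays, so every operation below is a straight elementwise op
# with no shuffles or broadcasts
function soa_cmul!(cre, cim, are, aim, bre, bim)
    @inbounds @simd for i in eachindex(cre, cim, are, aim, bre, bim)
        cre[i] = are[i] * bre[i] - aim[i] * bim[i]
        cim[i] = are[i] * bim[i] + aim[i] * bre[i]
    end
    return nothing
end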
How are these operations expressed? However, if we want to support using the color abstractions, it's more complicated. That is, if we want to support writing operations using colors, we'd need to map each color-level operation onto the underlying channel arrays.
If we're just working directly on the raw channel arrays, none of that machinery is needed.
Manually unrolling loops across those 3 dimensions would likely be best if there are any interactions across them, like there are in the complex number case; see the sketch below.
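For example, a sketch of a cross-channel operation unrolled across the three channel slices of a channels-last buffer (function name and the use of Rec. 601 luma weights are just illustrative choices):

# channels-last m×n×3 Float32 buffer; the three channel slices interact
# per pixel, so the channel loop is unrolled by hand
function luma!(y::AbstractMatrix{Float32}, a::AbstractArray{Float32,3})
    npix = size(a, 1) * size(a, 2)
    A = reshape(a, npix, 3)
    @inbounds @simd for i in 1:npix
        y[i] = 0.299f0 * A[i, 1] + 0.587f0 * A[i, 2] + 0.114f0 * A[i, 3]
    end
    return y
end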
If I understand you correctly, that may be the most cache-friendly approach. Optimized matrix multiplication libraries do that sort of copying even when they don't need to apply any transformations, as it prevents TLB and cache misses. If the allocation is a concern, you could define a global buffer (per thread) that is reused.
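For instance, a sketch of that kind of reusable tile copy (the function name and signature are illustrative, not ImageFiltering API; red/green/blue are the ColorTypes accessors):

using ImageCore  # for RGB and the red/green/blue accessors

# copy a tile of RGB pixels into a raw m×n×3 Float32 buffer that can be
# reused across tiles, paying the conversion cost once per element
function fill_channels!(buf::AbstractArray{Float32,3}, tile::AbstractMatrix{<:RGB})
    @inbounds for j in axes(tile, 2), i in axes(tile, 1)
        c = tile[i, j]
        buf[i, j, 1] = red(c)
        buf[i, j, 2] = green(c)
        buf[i, j, 3] = blue(c)
    end
    return buf
end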
Hmm. As a comment, the product of the largest 16-bit and 8-bit unsigned integers is still well within the range of integers exactly representable by Float32:

julia> typemax(UInt16) * typemax(UInt8)
0xff01
julia> Float32(ans)
65281.0f0
julia> 1 << 23
8388608
julia> Float32(ans) - big(1) << 23
Floating point multiplication is much faster than integer multiplication, so I suspect it would be worth promoting to Float32 before multiplying by (Float32-versions of) the 16-bit integers, and the calculation will still be exact.
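One subtlety in the first REPL line above: the product promotes to UInt16 and wraps, so 0xff01 is the product modulo 2^16. Widening confirms the full product still sits below 2^24, the bound up to which Float32 represents every integer exactly:

julia> widemul(typemax(UInt16), typemax(UInt8))
0x00feff01

julia> Int(ans) < 1 << 24
true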
Instead of specializing for RGB colors, how about a general interface for any fixed-size struct of homogeneous fields?
I don't know if this is directly related, but I'm currently experimenting with new color types.
LoopVectorization should now support this.
That'd be a nice workaround until the switch to an abstract interpreter (which is now #3 on the list of major changes I have planned for LoopVectorization), but we'd have to come up with a way to implement this that avoids world-age errors. That is, the operation mappings have to be in the form of type information so that they can be passed into the @generated function.
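A minimal sketch of that kind of type-level encoding (the names OpMap and apply are hypothetical): because the operation arrives as a type parameter, the @generated body specializes per operation without any eval in a newer world.

# encode an operation as a type parameter rather than a value, so it is
# visible to the @generated function at expansion time
struct OpMap{F} end
OpMap(f::Function) = OpMap{typeof(f)}()

@generated function apply(::OpMap{F}, x, y) where {F}
    op = F.instance            # recover the singleton function from its type
    return :(($op)(x, y))
end

For example, apply(OpMap(+), 2, 3) returns 5, with the mapping resolved entirely through dispatch.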
Aside from #66, I think one of the last obstacles to using this in ImageFiltering is the handling of color array elements, for example RGB(0.8f0, 0.3f0, 0.5f0) (which is just a struct of 3 Float32s). Through TiledIteration it's possible to convert the data into a suitable format, for example RGB{N0f8} can be converted to RGB{Float32} because TileBuffer takes an element type as an argument. (I note that's really old syntax; with modern Julia I should change it to TileBuffer{T}(undef, tileaxs).) Consequently it will be converted before it is used; because convolution/correlation uses each element multiple times, this conversion is well worth doing if it gets us better performance and allows us to use this package. (Another benefit of copying is that we can guarantee stridedness.)

So the question is, what kind of format should I convert it to? Should we define explicit support for RGB, or would it be better to convert to an SVector{3} or similar? I kind of favor the latter since it would presumably be more generic.

I suspect this will alter the loop unrolling/SIMD packing, once I understand what representation you'd like here.