Abstraction penalty: Wrapper types lead to much less efficient code #13104

Closed
eschnett opened this issue Sep 13, 2015 · 10 comments
Labels: compiler:codegen (Generation of LLVM IR and native code), performance (Must go faster), potential benchmark (Could make a good benchmark in BaseBenchmarks)

Comments

@eschnett
Contributor

(This is Julia release-0.4 with LLVM 3.6.2; LLVM 3.3 is similar but slightly worse.)

I find that introducing a trivial wrapper type around Float64 leads to much less efficient code. This is an example:

immutable F
    elt::Float64
end
import Base.+
+(a::F, b::F) = F(a.elt + b.elt)

I would expect this type F to be as efficient as Float64. Unfortunately it isn't, as this test shows:

function vadd!(rs,xs,ys)
    @inbounds @simd for i in eachindex(rs)
        rs[i] = xs[i] + ys[i]
    end
end
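
As an aside (this setup is mine, not part of the original report), a minimal way to exercise the function on real data before inspecting its code could be:

n = 1000
xs = [F(rand()) for i in 1:n]
ys = [F(rand()) for i in 1:n]
rs = Array(F, n)       # uninitialized output buffer (0.4-era constructor)
vadd!(rs, xs, ys)      # sanity-check that the wrapper arithmetic works on actual arrays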

Julia produces the following code for this example:

julia> code_native(vadd!, (Vector{F}, Vector{F}, Vector{F}))
[...]
Source line: 3
L48:    movq    (%rdx), %rax
    vmovsd  (%rax,%rdi,8), %xmm0
    movq    (%rsi), %rax
    vaddsd  (%rax,%rdi,8), %xmm0, %xmm0
    movq    (%r8), %rax
    vmovsd  %xmm0, (%rax,%rdi,8)
Source line: 74
    incq    %rdi
Source line: 75
    cmpq    %rdi, %rcx
    jne L48
[...]

When I generate code for Vector{Float64} instead, the double indirections (reloading %rax on every loop iteration) are not present, and the loop is vectorized. I wonder what causes this and how it can be addressed.

The presence of the double indirections suggests that this may be an aliasing problem: maybe LLVM cannot prove that the three arrays and the contents of the wrapper type F don't alias, and therefore cannot hoist the respective loads out of the loop?
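
One way to probe that hypothesis (a sketch, not something tried in this report) is to hoist the data pointers out of the loop by hand; if this variant loses the per-iteration reloads, aliasing is the likely culprit:

function vadd_hoisted!(rs, xs, ys)
    # Hypothetical variant: cache the data pointers so there is nothing to reload.
    # The arrays must stay rooted for the duration of the loop (on newer Julia one
    # would wrap this in GC.@preserve rs xs ys).
    pr, px, py = pointer(rs), pointer(xs), pointer(ys)
    @inbounds @simd for i in 1:length(rs)
        unsafe_store!(pr, unsafe_load(px, i) + unsafe_load(py, i), i)
    end
end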

@simonster
Member

With julia -O (which enables the LLVM basic AA pass) the array pointer loads appear to be hoisted, although the loop is still not vectorized (on LLVM 3.3).

@tkelman tkelman added the performance Must go faster label Sep 13, 2015
@mlubin
Member

mlubin commented Sep 13, 2015

CC @jrevels

@eschnett
Contributor Author

I confirm that julia -O optimizes the loads and stores, and the loop carries no obvious overhead. This is the case for both LLVM 3.3 and LLVM 3.6.2.

As you say, the loop is not vectorized. I also notice that LLVM chooses questionable addressing modes: there are 4 integer arithmetic instructions alongside the 3 floating-point instructions, but I cannot tell whether this is actually slower on my CPU.

@jrevels
Member

jrevels commented Sep 14, 2015

This is something that caught us off guard when testing ForwardDiff.jl's wrapper number types as well.

@eschnett
Contributor Author

LLVM 3.7 doesn't help.

@jrevels ... how did you address it?

@jrevels
Member

jrevels commented Sep 14, 2015

@eschnett I mainly didn't address it, unfortunately - at least not the problems you're pointing out with additional loads. In our case, the performance hit due to some inlining-related problems we were facing dwarfed the additional loads, so while I took note of it, it wasn't a focus at the time.

@eschnett eschnett added the compiler:codegen Generation of LLVM IR and native code label Sep 14, 2015
@eschnett
Contributor Author

@jrevels Okay. Well, the native code generated by julia -O looks harmless enough, meaning that LLVM should be able to vectorize the bitcode that produced it. So either the annotations added by @simd somehow get lost, or all that's missing is running another LLVM pass (or running them in a different order).
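
A quick way to check the first possibility (a sketch, not something done in this thread) is to look at the IR for the loop metadata that @simd is supposed to attach:

code_llvm(vadd!, (Vector{F}, Vector{F}, Vector{F}))
# If the @simd annotation survives, the loop back-edge branch should carry !llvm.loop
# metadata and the loads/stores should carry llvm.mem.parallel_loop_access (exact
# metadata names vary between LLVM versions); if it is absent, the annotation was
# dropped before the vectorizer ran.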

@simonster
Member

The LLVM IR constructs an aggregate (%32 = insertvalue %F undef, double %31, 0) which may be what spooks the vectorizer. LLVM 3.7 seems to be capable of removing the aggregate, but the vectorizer seems to be broken in general (#13106).

@yuyichao
Contributor

yuyichao commented May 6, 2016

Is this still a problem? (Somehow I don't see vectors in code_llvm, but I do see SIMD instructions in code_native...)

With code_native I get:

Source line: 3
L112:
        vmovupd -96(%rdi), %ymm0
        vmovupd -64(%rdi), %ymm1
        vmovupd -32(%rdi), %ymm2
        vmovupd (%rdi), %ymm3
Source line: 1
        vaddpd  -96(%rax), %ymm0, %ymm0
        vaddpd  -64(%rax), %ymm1, %ymm1
        vaddpd  -32(%rax), %ymm2, %ymm2
        vaddpd  (%rax), %ymm3, %ymm3
        vmovupd %ymm0, -96(%rcx)
        vmovupd %ymm1, -64(%rcx)
        vmovupd %ymm2, -32(%rcx)
        vmovupd %ymm3, (%rcx)
Source line: 74
        addq    $128, %rdi
        addq    $128, %rax
        addq    $128, %rcx
        addq    $-16, %r11
        jne     L112
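
Beyond reading the assembly, a rough timing comparison against plain Float64 (again a sketch; the vector length is arbitrary) can confirm that the wrapper no longer costs anything:

n  = 10^7
xs = [F(rand()) for i in 1:n]; ys = [F(rand()) for i in 1:n]; rs = Array(F, n)
fx = rand(n); fy = rand(n); fr = Array(Float64, n)
vadd!(rs, xs, ys); vadd!(fr, fx, fy)   # compile both specializations first
@time vadd!(rs, xs, ys)                # wrapped elements
@time vadd!(fr, fx, fy)                # plain Float64; timings should be essentially identical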

@vtjnash
Member

vtjnash commented May 6, 2016

lgtm also

@vtjnash vtjnash closed this as completed May 6, 2016
@tkelman tkelman added the potential benchmark Could make a good benchmark in BaseBenchmarks label May 7, 2016