
Make f(args...) as efficient as f(args) #11248

Closed
yuyichao opened this issue May 12, 2015 · 11 comments

Labels: compiler:codegen (Generation of LLVM IR and native code), performance (Must go faster)

@yuyichao (Contributor)

It seems that for both f(args...) and f(args) (called with a tuple), the function is specialized to take a tuple argument. However, in the first case the arguments are always boxed into jl_value_t*'s before being passed in, whereas in the second case (at least after the tuple type change) no boxing or GC frame is necessary.

From the output of the example code below, it seems that the function is already specialized on the types of the arguments, although the specialization somehow still uses the generic jl_value_t* type for both input and output.

Example code

@noinline f1(args...) = args  # vararg method: receives args packed as a tuple
@noinline f2(args) = args     # single explicit tuple argument

g1() = f1(1, 2, 3)
g2() = f2((1, 2, 3))

function time_func(f::Function, args...)
    println(f)
    f(args...)  # warm up so compilation is not timed
    gc()
    @time for i in 1:1000000
        f(args...)
    end
    gc()
end

@code_llvm g1()
@code_llvm f1(1, 2, 3)

@code_llvm g2()
@code_llvm f2((1, 2, 3))

time_func(g1)
time_func(g2)

Output

define [3 x i64] @julia_g1_44267() {
top:
  %0 = alloca [5 x %jl_value_t*], align 8
  %.sub = getelementptr inbounds [5 x %jl_value_t*]* %0, i64 0, i64 0
  %1 = getelementptr [5 x %jl_value_t*]* %0, i64 0, i64 2
  %2 = bitcast [5 x %jl_value_t*]* %0 to i64*
  store i64 6, i64* %2, align 8
  %3 = getelementptr [5 x %jl_value_t*]* %0, i64 0, i64 1
  %4 = bitcast %jl_value_t** %3 to %jl_value_t***
  %5 = load %jl_value_t*** @jl_pgcstack, align 8
  store %jl_value_t** %5, %jl_value_t*** %4, align 8
  store %jl_value_t** %.sub, %jl_value_t*** @jl_pgcstack, align 8
  %6 = getelementptr [5 x %jl_value_t*]* %0, i64 0, i64 3
  %7 = getelementptr [5 x %jl_value_t*]* %0, i64 0, i64 4
  store %jl_value_t* inttoptr (i64 139909651832960 to %jl_value_t*), %jl_value_t** %1, align 8
  store %jl_value_t* inttoptr (i64 139909651833008 to %jl_value_t*), %jl_value_t** %6, align 8
  store %jl_value_t* inttoptr (i64 139909651833056 to %jl_value_t*), %jl_value_t** %7, align 8
  %8 = call %jl_value_t* @julia_f1_44268(%jl_value_t* inttoptr (i64 139909684716976 to %jl_value_t*), %jl_value_t** %1, i32 3)
  %9 = bitcast %jl_value_t* %8 to [3 x i64]*
  %10 = load [3 x i64]* %9, align 8
  %11 = load %jl_value_t*** %4, align 8
  store %jl_value_t** %11, %jl_value_t*** @jl_pgcstack, align 8
  ret [3 x i64] %10
}

define %jl_value_t* @julia_f1_44268(%jl_value_t*, %jl_value_t**, i32) {
top:
  %3 = alloca [3 x %jl_value_t*], align 8
  %.sub = getelementptr inbounds [3 x %jl_value_t*]* %3, i64 0, i64 0
  %4 = getelementptr [3 x %jl_value_t*]* %3, i64 0, i64 2
  %5 = bitcast [3 x %jl_value_t*]* %3 to i64*
  store i64 2, i64* %5, align 8
  %6 = getelementptr [3 x %jl_value_t*]* %3, i64 0, i64 1
  %7 = bitcast %jl_value_t** %6 to %jl_value_t***
  %8 = load %jl_value_t*** @jl_pgcstack, align 8
  store %jl_value_t** %8, %jl_value_t*** %7, align 8
  store %jl_value_t** %.sub, %jl_value_t*** @jl_pgcstack, align 8
  store %jl_value_t* null, %jl_value_t** %4, align 8
  %9 = call %jl_value_t* @jl_f_tuple(%jl_value_t* null, %jl_value_t** %1, i32 %2)
  store %jl_value_t* %9, %jl_value_t** %4, align 8
  %10 = load %jl_value_t*** %7, align 8
  store %jl_value_t** %10, %jl_value_t*** @jl_pgcstack, align 8
  ret %jl_value_t* %9
}

define [3 x i64] @julia_g2_44275() {
top:
  %0 = call [3 x i64] @julia_f2_44276([3 x i64] [i64 1, i64 2, i64 3])
  ret [3 x i64] %0
}

define [3 x i64] @julia_f2_44276([3 x i64]) {
top:
  ret [3 x i64] %0
}
g1
elapsed time: 0.193401883 seconds (61 MB allocated, 14.06% gc time in 2 pauses with 0 full sweep)
g2
elapsed time: 0.025389954 seconds (30 MB allocated, 1.64% gc time in 1 pauses with 0 full sweep)

(Related to #11244)

@yuyichao (Contributor, Author)

Also, if f1 is called with f1(1, 2, 3) or f1("1", "2", "3"), two identical functions are generated (I suppose this would not be the case if something non-trivial were done in the function...).

@vtjnash (Member) commented May 12, 2015

Jeff mentioned to me recently that, following the recent tuple change, it is now possible to enable this.

The general tradeoff is that you don't want to generate a new optimized function for every possible argument list you encounter, since the codegen time would end up dominating the runtime of that function. But there is probably a happy medium where args... gets specialized to have at least one direct argument and at least 3 function arguments.
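To make that tradeoff concrete, here is a minimal sketch (h is a hypothetical function, not from this thread) of why fully specializing every splatted call is costly: each distinct argument-type combination is its own compile.

# Hypothetical illustration of the compile-count blowup described above:
# Julia specializes on argument types, so every new type combination at a
# vararg call site triggers a fresh codegen pass.
h(args...) = length(args)

h(1)             # specialization for Tuple{Int64}
h(1, 2.0)        # specialization for Tuple{Int64,Float64}
h(:a, 1, 2.0)    # specialization for Tuple{Symbol,Int64,Float64}
# Emitting a fully optimized native function for each of these can cost
# more than the calls themselves ever save.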

@yuyichao (Contributor, Author)

@vtjnash I agree there's probably a tradeoff between the efficiency of the generated code and how much of it to generate.

As mentioned above, though, the point of this issue is that the methods are already specialized. If you call @code_llvm on f1 with different argument types, you will notice that it actually generates a different function each time (although, as I said, in this trivial case they are identical in content).

IMHO, what needs to change is probably the calling convention for vararg functions that can be specialized.
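In terms of the IR printed above, the suggested change would look roughly like this (a paraphrase of the signatures @code_llvm already showed; not real declaration syntax):

# Current convention for the specialized vararg method (from the IR above):
#     %jl_value_t* @julia_f1(%jl_value_t*, %jl_value_t**, i32)
# i.e. boxed arguments, a GC frame, and a jl_f_tuple call to build the tuple.
#
# Desired convention, matching what f2 already gets:
#     [3 x i64] @julia_f1([3 x i64])
# i.e. the inferred Tuple{Int64,Int64,Int64} passed unboxed.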

@yuyichao (Contributor, Author)

Also, I'm not talking about making f(args...) the same as defining f(arg1), f(arg1, arg2), f(arg1, arg2, arg3), ... (which would also be nice, BTW). I'm only talking about passing the packed argument tuple in a more specialized form to the method that is already specialized.
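For contrast, a quick sketch of the fixed-arity alternative being set aside here (f and its methods are illustrative only, not from the thread):

# The "also nice" alternative: hand-written fixed-arity methods, each with
# its own fully specialized signature, so nothing is boxed at the call.
f(a1)         = (a1,)
f(a1, a2)     = (a1, a2)
f(a1, a2, a3) = (a1, a2, a3)
# This issue asks for something narrower: keep the single vararg method
# f(args...) but pass its already-inferred argument tuple unboxed.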

@vtjnash (Member) commented May 12, 2015

ah, good call:

julia> @code_typed f1(1, 2, 3)
1-element Array{Any,1}:
 :($(Expr(:lambda, Any[:(args::Any...)], Any[Any[],Any[Any[:args,Tuple{Int64,Int64,Int64},0]],Any[],Any[]], :(begin 
        $(Expr(:meta, :noinline)) # none, line 1:
        return args::Tuple{Int64,Int64,Int64}
    end::Tuple{Int64,Int64,Int64}))))

if you want to take a shot at this, take a look at to_function in codegen.cpp, specifically the va and specsig variables, and add code to handle the case where the vararg is a bitstype.

@yuyichao (Contributor, Author)

@vtjnash Thanks for the pointer. Probably not right now, but I'll keep it on my list and will probably try it sometime (e.g. after a thesis defence), unless someone else implements it (or something better) before I get to it.

@timholy (Member) commented May 13, 2015

This would certainly be a big deal. Currently a good number of @generated functions in Base are that way only to avoid the splatting penalty.

@yuyichao (Contributor, Author)

@timholy I didn't know about that. Do you have an example? Not that I think it should be the long-term solution, but I'm interested to see what the current workaround is (and I have thought about using a @generated function to work around it a little bit, but didn't figure out how).

@timholy (Member) commented May 13, 2015

#10337 has extensive discussion. See also

# The @generated function here is just to work around the performance hit
# of splatting
@generated function getindex(A::Array, I::Union(Real,AbstractVector)...)
    N = length(I)
    Isplat = Expr[:(I[$d]) for d = 1:N]
    quote
        checkbounds(A, $(Isplat...))
        _getindex(A, to_index($(Isplat...)))
    end
end
(as one of quite a few examples).
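To see why this sidesteps the splat, here is what the quoted body above would produce for a three-index call (hand-expanded for illustration):

# Generated body for N = 3: Isplat == [:(I[1]), :(I[2]), :(I[3])], so the
# splice $(Isplat...) yields explicit indexing expressions and no runtime
# splat of I remains in the method body.
quote
    checkbounds(A, I[1], I[2], I[3])
    _getindex(A, to_index(I[1], I[2], I[3]))
end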

@ihnorton added the performance (Must go faster) and compiler:codegen (Generation of LLVM IR and native code) labels on May 13, 2015
@JeffBezanson (Member)

dup of #5402

@yuyichao (Contributor, Author)

Closing as a dup of #5402.
