
@test / @testset can have a significant overhead on 0.5 compared to 0.4 #18077

Closed
KristofferC opened this issue Aug 17, 2016 · 16 comments
Labels
compiler:inference Type inference compiler:latency Compiler latency testsystem The unit testing framework and Test stdlib

Comments

@KristofferC
Member

KristofferC commented Aug 17, 2016

We noticed that the tests in https://github.com/KristofferC/ContMechTensors.jl got a lot slower on 0.5 than on 0.4.

On 0.4 the first run takes ~50 seconds and the second takes 1.2 seconds.
On 0.5-rc1+1 the first run takes ~140 seconds and the second takes ~20 seconds.

Removing all the @testset blocks and changing @test to @assert brings the second run on 0.5 down to ~2 seconds. If I only remove the @testset blocks, the second run takes ~14 seconds on 0.5. Profiling shows that most of the time in type inference is spent in the call chain typeinf_frame -> stupdate! -> issubstate. It seems that both @test and @testset now cause type inference to do much more work.
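A minimal sketch of the kind of comparison involved (illustrative workload only, not the actual ContMechTensors test suite):

```julia
# Illustrative comparison (hypothetical workload): the same checks run once
# through @testset/@test and once as bare @assert, so the difference in the
# second timing is mostly Test-machinery overhead.
using Test  # `using Base.Test` on 0.5/0.6

@time @testset "with Test machinery" begin
    for i in 1:1000
        @test i + 1 == 1 + i
    end
end

@time for i in 1:1000       # plain assertions, no Test machinery
    @assert i + 1 == 1 + i
end
```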

@JeffBezanson
Member

JeffBezanson commented Aug 17, 2016

This seems to be due to @test in 0.5 generating many in-line try-catch blocks, which drive inference a bit crazy. It's slow the second time because we're running inference on the top-level blocks containing the test statements. Here are some options I see:

  • In 25c3659 I implemented a heuristic to bail out of type inference if there are too many try blocks. We could try reinstating that.
  • We could try running inference less often (or not at all) on top-level blocks. That will make top-level code even slower (when it does a lot of actual computation), but it tends to be slow already (e.g. global variables) so this might not be a big deal.
  • Make inference of try blocks smarter somehow, e.g. sparser.
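Schematically, the problem is that each @test at top level expands to its own inline try/catch in the enclosing block. A simplified sketch of the try/catch shape involved (illustrative names, not the real Base.Test expansion):

```julia
# Simplified sketch of why a file full of `@test`s is hard on inference:
# each test becomes a try/catch. The names here are illustrative, not the
# actual Base.Test internals; the real 0.5 expansion inlined this at top
# level rather than isolating it in a small function like this.
function run_one_test(f)
    try
        f() === true ? :pass : :fail
    catch
        :error
    end
end

results = [run_one_test(() -> 1 + 1 == 2),
           run_one_test(() -> error("boom"))]
# results is [:pass, :error]
```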

@StefanKarpinski
Member

Accepted as a "real issue" that should be addressed somehow. Not sure if we can target 0.5.x or if this needs to be fixed on 0.6, but optimistically tagging as 0.5.x.

@ViralBShah
Member

We do need to see if we can either fix this in 0.5.x or at least provide a reasonable recipe to work around it.

@KristofferC
Member Author

KristofferC commented Oct 6, 2016

FWIW, putting the expression from the @test macro in an anonymous function made the tests in the OP go from 105 seconds to 80 seconds. 25c3659 did nothing as far as I could see.
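The workaround can be sketched as a macro that evaluates the tested expression inside a closure instead of splicing it inline (hypothetical macro and names; the real change was inside the Base.Test @test expansion):

```julia
# Hypothetical sketch of the workaround: evaluate the tested expression
# inside an anonymous function, so inference sees a small function body
# instead of an inline expression in a top-level block full of try/catch.
macro test_wrapped(ex)
    quote
        f = () -> $(esc(ex))
        f() || error("test failed: ", $(string(ex)))
        true
    end
end

@test_wrapped 1 + 1 == 2   # passes, returns true
```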

@maleadt
Member

maleadt commented Oct 6, 2016

Does this have anything to do with it, and/or is it known, or am I doing something silly?

using Base.Test

let
    @noinline child1() = return nothing
    parent1() = child1()
    println(code_typed(parent1, ()))
    code_llvm(parent1, ())
end

@testset "foobar" begin
    @noinline child2() = return nothing
    parent2() = child2()
    println(code_typed(parent2, ()))
    code_llvm(parent2, ())
end
Any[CodeInfo(:(begin 
        return $(Expr(:invoke, MethodInstance for (::#child1#1)(), :((Core.getfield)(#self#,:child1)::#child1#1)))
    end))=>Void]

define void @julia_parent1_63282() #0 !dbg !5 {
top:
  call void @julia_child1_63283() #0
  ret void
}


Any[CodeInfo(:(begin 
        return ((Core.getfield)((Core.getfield)(#self#,:child2),:contents))()
    end))=>Any]

define %jl_value_t* @julia_parent2_63305(%jl_value_t*, %jl_value_t**, i32) #0 !dbg !5 {
top:
  ...
  store %jl_value_t* inttoptr (i64 139710928829064 to %jl_value_t*), %jl_value_t** %13, align 8
  %17 = call %jl_value_t* @jl_f_getfield(%jl_value_t* null, %jl_value_t** %5, i32 2)
  store %jl_value_t* %17, %jl_value_t** %6, align 8
  %18 = call %jl_value_t* @jl_apply_generic(%jl_value_t** %6, i32 1)
  ...
}

cc @andreasnoack

@maleadt
Member

maleadt commented Oct 20, 2016

So I reduced the above to a try...end block breaking static dispatch, but the lowered code is different, so I'm not sure whether this should be an inference or a lowering optimization (I couldn't find an existing issue).

let
    @noinline child1() = return nothing
    parent1() = child1()    # static dispatch
    println(code_lowered(parent1, ()))
end

try
    @noinline child2() = return nothing
    parent2() = child2()    # dynamic dispatch
    println(code_lowered(parent2, ()))
end

Anyway, for a quick test similar to @KristofferC's, I wrapped the testset's code block in an anonymous function, which reduced e.g. the math tests' runtime and memory usage significantly:

Test: | Total time (s): | GC time (s): | Percent in gc: | Allocated (MB): | RSS (MB):
math  |   41.19         |    0.77      |    1.87        | 1729.61         |  374.57  
math  |   42.87         |    0.76      |    1.78        | 1729.53         |  376.22  

Test: | Total time (s): | GC time (s): | Percent in gc: | Allocated (MB): | RSS (MB):
math  |   28.57         |    0.76      |    2.68        | 1669.47         |  336.85  
math  |   28.88         |    0.75      |    2.61        | 1669.52         |  337.71  

Maybe we should just go with something like this for now?

diff --git a/base/test.jl b/base/test.jl
index 9b37e87..0a481b4 100644
--- a/base/test.jl
+++ b/base/test.jl
@@ -705,6 +705,7 @@ function testset_beginend(args, tests)
     # finally removing the testset and giving it a chance to take
     # action (such as reporting the results)
     quote
+        (()->begin
         ts = $(testsettype)($desc; $options...)
         push_testset(ts)
         try
@@ -716,6 +717,7 @@ function testset_beginend(args, tests)
         end
         pop_testset()
         finish(ts)
+        end)()
     end
 end

@@ -777,6 +779,7 @@ function testset_forloop(args, testloop)
         end
     end
     quote
+        (()->begin
         arr = Array{Any,1}(0)
         first_iteration = true
         local ts
@@ -790,6 +793,7 @@ function testset_forloop(args, testloop)
             end
         end
         arr
+        end)()
     end
 end

@StefanKarpinski StefanKarpinski added help wanted Indicates that a maintainer wants help on an issue or pull request and removed help wanted Indicates that a maintainer wants help on an issue or pull request labels Oct 27, 2016
@aviks
Member

aviks commented Jul 10, 2017

This is much better on 0.6, still terrible on 0.5.

@JeffBezanson
Member

Ok to close this now?

@KristofferC
Member Author

Let me run a comparison on 0.7 with @test vs @assert.

@KristofferC
Member Author

KristofferC commented Jan 17, 2018

Yes, there is no longer any significant slowdown of @testset + @test compared to just @assert.

Although not related to this issue, there is a significant compilation regression on 0.7 compared to 0.6. These are timings from testsets on 0.6 and 0.7:

0.6

 ─────────────────────────────────────────────────────────────────────────────────────
                                              Time                   Allocations
                                      ──────────────────────   ───────────────────────
           Tot / % measured:               60.8s / 89.9%           1.98GiB / 90.5%

 Section                      ncalls     time   %tot     avg     alloc   %tot      avg
 ─────────────────────────────────────────────────────────────────────────────────────
 simple math                       1    11.3s  20.7%   11.3s    361MiB  19.7%   361MiB
 constructors                      1    11.1s  20.2%   11.1s    331MiB  18.0%   331MiB
 tensor ops                        1    7.31s  13.4%   7.31s    217MiB  11.8%   217MiB
   double contraction              9    7.29s  13.3%   810ms    217MiB  11.8%  24.1MiB
 constrct Arr                      1    7.24s  13.2%   7.24s    306MiB  16.7%   306MiB
 norm, trace, det, inv, eig        1    6.83s  12.5%   6.83s    197MiB  10.7%   197MiB
 tensor identities                 1    3.52s  6.43%   3.52s    213MiB  11.6%   213MiB
 promotion/conversion              1    2.43s  4.45%   2.43s    112MiB  6.12%   112MiB
 constrct func                     1    1.96s  3.58%   1.96s   21.7MiB  1.19%  21.7MiB
 base vectors                      1    1.08s  1.98%   1.08s   32.5MiB  1.77%  32.5MiB
 diagm, one                        1    977ms  1.79%   977ms   25.2MiB  1.37%  25.2MiB
 indexing                          1    959ms  1.75%   959ms   16.4MiB  0.89%  16.4MiB
 exceptions                        1   1.16ms  0.00%  1.16ms   8.22KiB  0.00%  8.22KiB
 ─────────────────────────────────────────────────────────────────────────────────────

0.7:

 ─────────────────────────────────────────────────────────────────────────────────────
                                              Time                   Allocations
                                      ──────────────────────   ───────────────────────
           Tot / % measured:                102s / 93.2%           6.91GiB / 98.4%

 Section                      ncalls     time   %tot     avg     alloc   %tot      avg
 ─────────────────────────────────────────────────────────────────────────────────────
 simple math                       1    17.8s  18.6%   17.8s   1.22GiB  18.0%  1.22GiB
 constructors                      1    16.4s  17.2%   16.4s   1.11GiB  16.3%  1.11GiB
 tensor ops                        1    15.0s  15.7%   15.0s    901MiB  12.9%   901MiB
   double contraction              9    14.9s  15.6%   1.66s    898MiB  12.9%   100MiB
 constrct Arr                      1    12.9s  13.5%   12.9s   1.19GiB  17.4%  1.19GiB
 norm, trace, det, inv, eig        1    9.74s  10.2%   9.74s    635MiB  9.12%   635MiB
 promotion/conversion              1    7.74s  8.11%   7.74s    719MiB  10.3%   719MiB
 tensor identities                 1    5.31s  5.56%   5.31s    478MiB  6.87%   478MiB
 constrct func                     1    3.95s  4.14%   3.95s    307MiB  4.41%   307MiB
 diagm, one                        1    2.82s  2.95%   2.82s    134MiB  1.93%   134MiB
 indexing                          1    2.32s  2.43%   2.32s    102MiB  1.46%   102MiB
 base vectors                      1    1.51s  1.58%   1.51s   83.0MiB  1.19%  83.0MiB
 exceptions                        1   50.2ms  0.05%  50.2ms    889KiB  0.01%   889KiB

So the time to run the tests is up by ~80% and allocations are up ~3.5×.

@helgee
Contributor

helgee commented Jan 28, 2018

I just tried adding @testset blocks to ERFA.jl. The file contains 1537 tests in 226 testsets and the performance overhead is huge.

~14x slower on v0.6.2 vs. ~21x slower on master (although I have not fixed all deprecations yet).

0.6.2

               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: https://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.6.2 (2017-12-13 18:08 UTC)
 _/ |\__'_|_|_|\__'_|  |
|__/                   |  x86_64-apple-darwin17.3.0

julia> @time Pkg.test("ERFA")
INFO: Testing ERFA
WARNING: year outside range(1000:3000)
INFO: ERFA tests passed
 12.574542 seconds (4.92 M allocations: 258.743 MiB, 1.19% gc time)

julia> @time Pkg.test("ERFA")
INFO: Testing ERFA
        WARNING: year outside range(1000:3000)
Test Summary: | Pass  Total
ERFA          | 1311   1311
INFO: ERFA tests passed
178.497327 seconds (413 allocations: 22.969 KiB)

Master

               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: https://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.7.0-DEV.3590 (2018-01-27 08:16 UTC)
 _/ |\__'_|_|_|\__'_|  |  Commit eea727c0c (1 day old master)
|__/                   |  x86_64-apple-darwin17.4.0

julia> @time Pkg.test("ERFA")
[ Info: Testing ERFA
WARNING: year outside range(1000:3000)
[ Info: ERFA tests passed
 15.921993 seconds (8.73 M allocations: 486.941 MiB, 2.40% gc time)

julia> @time Pkg.test("ERFA")
[ Info: Testing ERFA
WARNING: year outside range(1000:3000)
Test Summary: | Pass  Total
ERFA          | 1311   1311
[ Info: ERFA tests passed
348.921708 seconds (760 allocations: 35.781 KiB)

@maleadt
Member

maleadt commented Jan 28, 2018

FWIW, the repro I posted above still generates horrible code:

   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: https://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.7.0-DEV.3413 (2018-01-15 17:38 UTC)
 _/ |\__'_|_|_|\__'_|  |  master/5e2ff127f8* (fork: 2372 commits, 193 days)
|__/                   |  x86_64-pc-linux-gnu

julia> using Test


julia> let
           @noinline child1() = return nothing
           parent1() = child1()
           println(code_typed(parent1, ()))
           code_llvm(parent1, ())
       end

Any[CodeInfo(:(begin
      return
  end))=>Nothing]

; Function parent1
; Location: REPL[2]:3
define void @julia_parent1_57576() {
top:
  ret void
}


julia> @testset "foobar" begin
           @noinline child2() = return nothing
           parent2() = child2()
           println(code_typed(parent2, ()))
           code_llvm(parent2, ())
       end

Any[CodeInfo(:(begin
      SSAValue(0) = (Core.getfield)(#self#, :child2)::Core.Box
      SSAValue(1) = (Core.isdefined)(SSAValue(0), :contents)::Bool
      unless SSAValue(1) goto 5
      goto 8
      5: 
      NewvarNode(:(child2@_3))
      child2@_3
      8: 
      SSAValue(2) = (Core.getfield)(SSAValue(0), :contents)
      SSAValue(3) = (SSAValue(2))()
      return SSAValue(3)
  end))=>Any]

; Function parent2
; Location: REPL[3]:3
define nonnull %jl_value_t addrspace(10)* @japi1_parent2_57626(%jl_value_t addrspace(10)*, %jl_value_t addrspace(10)**, i32) #0 {
top:
  ...
  br i1 %19, label %err, label %pass

pass:                                             ; preds = %top
  ...
  %21 = call nonnull %jl_value_t addrspace(10)* @jl_apply_generic(%jl_value_t addrspace(10)** nonnull %3, i32 1)
  ...
  ret %jl_value_t addrspace(10)* %21

err:                                              ; preds = %top
  call void @jl_undefined_var_error(%jl_value_t addrspace(12)* addrspacecast (%jl_value_t* inttoptr (i64 140091673833888 to %jl_value_t*) to %jl_value_t addrspace(12)*))
  unreachable
}

Or, using BenchmarkTools:

julia> let
           @noinline child1() = return nothing
           parent1() = child1()
           display(@benchmark $parent1())
       end

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     0.015 ns (0.00% GC)
  median time:      0.020 ns (0.00% GC)
  mean time:        0.020 ns (0.00% GC)
  maximum time:     0.073 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000


julia> @testset "foobar" begin
           @noinline child2() = return nothing
           parent2() = child2()
           display(@benchmark $parent2())
       end

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     13.524 ns (0.00% GC)
  median time:      13.663 ns (0.00% GC)
  mean time:        13.732 ns (0.00% GC)
  maximum time:     28.546 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     998

@JeffBezanson JeffBezanson modified the milestones: 0.5.x, 1.0.x May 21, 2018
@JeffBezanson
Member

Hoping to target compiler performance for 1.0 and 1.0.x.

@JeffBezanson JeffBezanson added compiler:latency Compiler latency and removed help wanted Indicates that a maintainer wants help on an issue or pull request labels Sep 6, 2018
@KristofferC KristofferC removed the performance Must go faster label Oct 23, 2018
@vtjnash
Member

vtjnash commented Apr 18, 2019

I'm not sure what's actionable here. @maleadt's example generates good code now, and the original package (ContMechTensors.jl) doesn't load for me on master.

@KristofferC
Member Author

I'll try it again at some time.

@helgee
Contributor

helgee commented Apr 19, 2019

I retested my example above on the current master and got:

  • 3.7 seconds (without @testset)
  • 5.0 seconds (with @testset)

LGTM 🎉 Thanks y'all!
