Own `sqrt` and `log` returning `NaN` for "correct" multi-thread behaviour #1781

DanielDoehring · 2023-12-18T14:47:30Z

Motivation: See #1766

Inspiration for implementation: https://discourse.julialang.org/t/fastest-sqrt-and-log-with-negative-check/107575

For the moment, I replaced only those sqrt and log calls whose argument can turn negative. I am not sure whether we want to use the custom implementation sqrt_ if it really is faster (for whatever reason).
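
To illustrate the idea, here is a minimal sketch of a NaN-returning square root. The name `sqrt_nan` is hypothetical, and this is not necessarily how `sqrt_` is implemented in this PR; it only contrasts the throwing behaviour of `Base.sqrt` with NaN propagation.

```julia
# Hypothetical helper (assumption, not the PR's code): return NaN instead of
# throwing a DomainError when the argument is negative.
sqrt_nan(x::Real) = x < zero(x) ? oftype(x, NaN) : sqrt(x)

# Base.sqrt throws for a negative real argument ...
threw = try
    sqrt(-1.0)
    false
catch err
    err isa DomainError
end

# ... while the sketch returns NaN, which the caller can detect with isnan.
result = sqrt_nan(-1.0)

println("Base.sqrt threw DomainError: ", threw)
println("sqrt_nan(-1.0) is NaN:       ", isnan(result))
```

With this behaviour, a negative argument shows up as a NaN in the solution instead of an exception raised from inside a threaded loop.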

Making sure we do not lose (too much) performance:

Example derived from examples/tree_2d_dgsem/elixir_euler_blast_wave_amr.jl with surface_flux = flux_hllc :

Main:

 ──────────────────────────────────────────────────────────────────────────────────────
               Trixi.jl                       Time                    Allocations      
                                     ───────────────────────   ────────────────────────
          Tot / % measured:               66.3s /  97.3%           1.42GiB /  98.7%    

 Section                     ncalls     time    %tot     avg     alloc    %tot      avg
 ──────────────────────────────────────────────────────────────────────────────────────
 rhs!                         9.28k    60.5s   93.8%  6.52ms    183KiB    0.0%    20.2B
   volume integral            9.28k    47.3s   73.3%  5.10ms    174KiB    0.0%    19.2B
     blended DG-FV            9.28k    40.5s   62.7%  4.36ms     0.00B    0.0%    0.00B
     pure DG                  9.28k    5.43s    8.4%   585μs     0.00B    0.0%    0.00B
     blending factors         9.28k    1.27s    2.0%   137μs   64.1KiB    0.0%    7.08B
     ~volume integral~        9.28k    144ms    0.2%  15.5μs    110KiB    0.0%    12.1B
   interface flux             9.28k    7.97s   12.4%   859μs     0.00B    0.0%    0.00B
   mortar flux                9.28k    1.69s    2.6%   183μs     0.00B    0.0%    0.00B
   surface integral           9.28k    1.36s    2.1%   147μs     0.00B    0.0%    0.00B
   prolong2interfaces         9.28k    1.18s    1.8%   127μs     0.00B    0.0%    0.00B
   prolong2mortars            9.28k    367ms    0.6%  39.6μs     0.00B    0.0%    0.00B
   Jacobian                   9.28k    352ms    0.5%  37.9μs     0.00B    0.0%    0.00B
   reset ∂u/∂t                9.28k    261ms    0.4%  28.1μs     0.00B    0.0%    0.00B
   ~rhs!~                     9.28k   23.3ms    0.0%  2.51μs   9.33KiB    0.0%    1.03B
   prolong2boundaries         9.28k   1.92ms    0.0%   207ns     0.00B    0.0%    0.00B
   boundary flux              9.28k    195μs    0.0%  21.0ns     0.00B    0.0%    0.00B
   source terms               9.28k    175μs    0.0%  18.9ns     0.00B    0.0%    0.00B
 AMR                            371    3.75s    5.8%  10.1ms   1.40GiB  100.0%  3.88MiB
   refine                       371    1.87s    2.9%  5.03ms    472MiB   32.8%  1.27MiB
     mesh                       364    1.63s    2.5%  4.47ms   6.26MiB    0.4%  17.6KiB
       refine_unbalanced!       364    1.57s    2.4%  4.32ms    262KiB    0.0%     738B
       rebalance!               480   48.3ms    0.1%   101μs   1.57MiB    0.1%  3.34KiB
       ~mesh~                   364   6.23ms    0.0%  17.1μs   4.44MiB    0.3%  12.5KiB
     solver                     364    238ms    0.4%   653μs    465MiB   32.3%  1.28MiB
     ~refine~                   371   1.61ms    0.0%  4.34μs    713KiB    0.0%  1.92KiB
   coarsen                      371    1.82s    2.8%  4.89ms    937MiB   65.1%  2.52MiB
     mesh                       371    1.49s    2.3%  4.03ms   2.84MiB    0.2%  7.84KiB
     solver                     371    223ms    0.3%   601μs    511MiB   35.5%  1.38MiB
     ~coarsen~                  371   98.8ms    0.2%   266μs    423MiB   29.4%  1.14MiB
   indicator                    371   59.6ms    0.1%   161μs   13.6MiB    0.9%  37.6KiB
   ~AMR~                        371   11.7ms    0.0%  31.6μs   15.9MiB    1.1%  43.9KiB
 calculate dt                 1.86k    260ms    0.4%   140μs     0.00B    0.0%    0.00B
 initial condition AMR            1    365μs    0.0%   365μs    260KiB    0.0%   260KiB
   AMR                            1    364μs    0.0%   364μs    259KiB    0.0%   259KiB
     indicator                    1    312μs    0.0%   312μs    128KiB    0.0%   128KiB
     ~AMR~                        1   51.0μs    0.0%  51.0μs    131KiB    0.0%   131KiB
     coarsen                      1    266ns    0.0%   266ns     64.0B    0.0%    64.0B
     refine                       1    154ns    0.0%   154ns     64.0B    0.0%    64.0B
   ~initial condition AMR~        1   1.11μs    0.0%  1.11μs      752B    0.0%     752B
 ──────────────────────────────────────────────────────────────────────────────────────

NaNSqrt & NaNLog:

 ──────────────────────────────────────────────────────────────────────────────────────
               Trixi.jl                       Time                    Allocations      
                                     ───────────────────────   ────────────────────────
          Tot / % measured:               66.1s /  97.3%           1.42GiB /  98.7%    

 Section                     ncalls     time    %tot     avg     alloc    %tot      avg
 ──────────────────────────────────────────────────────────────────────────────────────
 rhs!                         9.28k    60.3s   93.8%  6.50ms    183KiB    0.0%    20.2B
   volume integral            9.28k    47.3s   73.6%  5.10ms    174KiB    0.0%    19.2B
     blended DG-FV            9.28k    40.5s   63.0%  4.36ms     0.00B    0.0%    0.00B
     pure DG                  9.28k    5.43s    8.4%   585μs     0.00B    0.0%    0.00B
     blending factors         9.28k    1.27s    2.0%   137μs   64.1KiB    0.0%    7.08B
     ~volume integral~        9.28k    143ms    0.2%  15.4μs    110KiB    0.0%    12.1B
   interface flux             9.28k    7.77s   12.1%   837μs     0.00B    0.0%    0.00B
   mortar flux                9.28k    1.65s    2.6%   178μs     0.00B    0.0%    0.00B
   surface integral           9.28k    1.38s    2.1%   149μs     0.00B    0.0%    0.00B
   prolong2interfaces         9.28k    1.17s    1.8%   126μs     0.00B    0.0%    0.00B
   prolong2mortars            9.28k    367ms    0.6%  39.5μs     0.00B    0.0%    0.00B
   Jacobian                   9.28k    356ms    0.6%  38.4μs     0.00B    0.0%    0.00B
   reset ∂u/∂t                9.28k    262ms    0.4%  28.2μs     0.00B    0.0%    0.00B
   ~rhs!~                     9.28k   20.1ms    0.0%  2.17μs   9.33KiB    0.0%    1.03B
   prolong2boundaries         9.28k   1.61ms    0.0%   173ns     0.00B    0.0%    0.00B
   boundary flux              9.28k    326μs    0.0%  35.1ns     0.00B    0.0%    0.00B
   source terms               9.28k    174μs    0.0%  18.7ns     0.00B    0.0%    0.00B
 AMR                            371    3.74s    5.8%  10.1ms   1.40GiB  100.0%  3.88MiB
   refine                       371    1.85s    2.9%  5.00ms    472MiB   32.8%  1.27MiB
     mesh                       364    1.60s    2.5%  4.40ms   6.26MiB    0.4%  17.6KiB
       refine_unbalanced!       364    1.55s    2.4%  4.26ms    262KiB    0.0%     738B
       rebalance!               480   47.0ms    0.1%  98.0μs   1.57MiB    0.1%  3.34KiB
       ~mesh~                   364   6.12ms    0.0%  16.8μs   4.44MiB    0.3%  12.5KiB
     solver                     364    250ms    0.4%   687μs    465MiB   32.3%  1.28MiB
     ~refine~                   371   1.67ms    0.0%  4.49μs    713KiB    0.0%  1.92KiB
   coarsen                      371    1.81s    2.8%  4.89ms    937MiB   65.1%  2.52MiB
     mesh                       371    1.47s    2.3%  3.96ms   2.84MiB    0.2%  7.84KiB
     solver                     371    263ms    0.4%   708μs    511MiB   35.5%  1.38MiB
     ~coarsen~                  371   83.4ms    0.1%   225μs    423MiB   29.4%  1.14MiB
   indicator                    371   59.7ms    0.1%   161μs   13.6MiB    0.9%  37.6KiB
   ~AMR~                        371   13.8ms    0.0%  37.3μs   15.9MiB    1.1%  43.9KiB
 calculate dt                 1.86k    258ms    0.4%   139μs     0.00B    0.0%    0.00B
 initial condition AMR            1    341μs    0.0%   341μs    260KiB    0.0%   260KiB
   AMR                            1    340μs    0.0%   340μs    259KiB    0.0%   259KiB
     indicator                    1    287μs    0.0%   287μs    128KiB    0.0%   128KiB
     ~AMR~                        1   52.7μs    0.0%  52.7μs    131KiB    0.0%   131KiB
     refine                       1    252ns    0.0%   252ns     64.0B    0.0%    64.0B
     coarsen                      1    191ns    0.0%   191ns     64.0B    0.0%    64.0B
   ~initial condition AMR~        1    947ns    0.0%   947ns      752B    0.0%     752B
 ──────────────────────────────────────────────────────────────────────────────────────

Verification using BenchmarkTools (I repeated these a couple of times):

julia> x = rand(10^4)
julia> @btime sqrt.(x)
 12.788 μs (4 allocations: 78.20 KiB)
 
julia> @btime Trixi.sqrt_.(x)
 6.534 μs (4 allocations: 78.20 KiB)
 
julia> @btime log.(x)
 33.395 μs (4 allocations: 78.20 KiB)

julia> @btime Trixi.log_.(x)
 33.763 μs (4 allocations: 78.20 KiB)

Not sure what is going on with sqrt_, but log_ is marginally slower (about 0.3 to 0.5 μs per 10000 floats), as one might expect.

github-actions · 2023-12-18T14:47:47Z

Review checklist

This checklist is meant to assist creators of PRs (to let them know what reviewers will typically look for) and reviewers (to guide them in a structured review process). Items do not need to be checked explicitly for a PR to be eligible for merging.

Purpose and scope

The PR has a single goal that is clear from the PR title and/or description.
All code changes represent a single set of modifications that logically belong together.
No more than 500 lines of code are changed or there is no obvious way to split the PR into multiple PRs.

Code quality

The code can be understood easily.
Newly introduced names for variables etc. are self-descriptive and consistent with existing naming conventions.
There are no redundancies that can be removed by simple modularization/refactoring.
There are no leftover debug statements or commented code sections.
The code adheres to our conventions and style guide, and to the Julia guidelines.

Documentation

New functions and types are documented with a docstring or top-level comment.
Relevant publications are referenced in docstrings (see example for formatting).
Inline comments are used to document longer or unusual code sections.
Comments describe intent ("why?") and not just functionality ("what?").
If the PR introduces a significant change or new feature, it is documented in NEWS.md.

Testing

The PR passes all tests.
New or modified lines of code are covered by tests.
New or modified tests run in less than 10 seconds.

Performance

There are no type instabilities or memory allocations in performance-critical parts.
If the PR intent is to improve performance, before/after time measurements are posted in the PR.

Verification

The correctness of the code was verified using appropriate tests.
If new equations/methods are added, a convergence test has been run and the results
are posted in the PR.

Created with ❤️ by the Trixi.jl community.

src/equations/compressible_euler_2d.jl

src/auxiliary/math.jl

ranocha

Thanks a lot for the initial investigation!

Could you please report some performance numbers from elixirs with and without bounds checking?
How do these full elixir runs vary when executing them multiple times?
Could you please post some benchmarks like @benchmark Trixi.rhs!(...)?
Benchmarks like x = rand(10^4); @btime sqrt.(x) are not really meaningful for us since we don't perform such uniform operations on vectors. Benchmarking Trixi.rhs! would be better, I think.

src/auxiliary/math.jl

src/solvers/dgsem_tree/dg_2d_compressible_euler.jl

DanielDoehring · 2023-12-19T11:15:50Z

Some reports on @benchmark Trixi.rhs!

examples/tree_2d_dgsem/elixir_euler_blast_wave_amr.jl with surface_flux = flux_hllc

Custom implementation:

1 Thread:

BenchmarkTools.Trial: 2000 samples with 5 evaluations.
 Range (min … max):  7.880 ms …  11.458 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     8.562 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   8.565 ms ± 180.060 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                              ▁▃▅▇▇█▅▂
  ▂▂▁▂▂▁▂▁▂▂▂▂▂▂▂▂▂▂▂▃▃▃▄▄▄▅▆▇█████████▆▄▄▃▃▃▂▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  7.88 ms         Histogram: frequency by time        9.13 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

4 Threads:

BenchmarkTools.Trial: 2000 samples with 5 evaluations.
 Range (min … max):  2.503 ms …   4.925 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.709 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.741 ms ± 161.206 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

           ▂▇█▇▁
  ▂▂▁▂▂▂▂▃▅█████▇▇▆▅▄▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▁▁▁▁▁▂▂▁▂▂▁▁▂▂▂▁▂▂▂ ▃
  2.5 ms          Histogram: frequency by time         3.5 ms <

 Memory estimate: 3.73 KiB, allocs estimate: 9.

Standard sqrt and log :

1 Thread:

BenchmarkTools.Trial: 2000 samples with 5 evaluations.
 Range (min … max):  8.083 ms …  11.872 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     8.670 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   8.724 ms ± 244.216 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                    ▃▆▆█▅▃ ▁
  ▂▂▁▂▂▂▂▂▂▂▂▂▃▄▄▅▆▇█████████▆▆▄▄▄▃▄▃▃▃▃▃▂▃▂▂▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  8.08 ms         Histogram: frequency by time        9.65 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

4 Threads:

BenchmarkTools.Trial: 2000 samples with 5 evaluations.
 Range (min … max):  2.449 ms …  4.265 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.676 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.686 ms ± 89.104 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                     ▁▆█▆▆▅▄▃
  ▂▁▁▁▂▂▂▁▂▂▂▂▂▂▂▃▃▅▇█████████▆▅▃▂▂▂▂▁▂▂▂▂▂▂▁▂▂▁▁▁▂▂▂▁▂▁▂▂▂▂ ▃
  2.45 ms        Histogram: frequency by time        3.02 ms <

 Memory estimate: 3.73 KiB, allocs estimate: 9.

tree_3d_dgsem/elixir_mhd_ec.jl with conservative surface flux flux_hlle and initial_refinement_level = 4:

Custom sqrt, log :

1 Thread:

BenchmarkTools.Trial: 1000 samples with 3 evaluations.
 Range (min … max):  50.370 ms … 70.366 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     52.456 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   52.799 ms ±  1.403 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

             ▁▃▇█▆▃▂
  ▂▁▁▂▂▁▂▂▂▃▅███████████▇▅▄▄▃▃▃▃▃▃▂▃▂▂▂▃▂▂▃▂▂▂▂▁▂▂▂▁▂▂▂▁▁▁▂▁▂ ▃
  50.4 ms         Histogram: frequency by time          58 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

8 Threads:

BenchmarkTools.Trial: 1000 samples with 3 evaluations.
 Range (min … max):  14.900 ms …  19.075 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     16.890 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   16.838 ms ± 431.002 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                               ▂▁▄▂▂ ▁▄ ▆█▇▆▅▃▅▁
  ▂▁▁▁▁▁▁▁▁▂▁▂▂▁▂▃▁▁▄▂▄▃▄▄▄▅▄▇██████▇████████████▇▆▆▄▃▂▂▃▃▂▂▂▂ ▄
  14.9 ms         Histogram: frequency by time           18 ms <

 Memory estimate: 1.41 KiB, allocs estimate: 5.

Standard sqrt log:

1 Thread:

BenchmarkTools.Trial: 1000 samples with 3 evaluations.
 Range (min … max):  50.199 ms … 66.471 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     51.948 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   52.097 ms ±  1.086 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

               ▂▃█▇█▄▄
  ▂▂▂▂▂▂▂▂▃▃▄▄▆████████▅▆▄▃▃▃▂▂▂▂▃▂▂▂▁▂▂▂▂▂▂▂▁▁▁▂▂▂▁▁▁▂▁▁▁▁▁▂ ▃
  50.2 ms         Histogram: frequency by time        56.3 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

8 Threads:

BenchmarkTools.Trial: 1000 samples with 3 evaluations.
 Range (min … max):  14.270 ms …  18.180 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     16.018 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   15.950 ms ± 373.664 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                        ▁▁▃▄▄▇▇█▇▃▂▁
  ▂▁▁▁▁▁▁▁▁▁▂▁▂▁▁▂▂▃▃▃▃▃▃▄▅▄▄▄▄▆▅▄▆▅▇▇▆▆█████████████▆▆▅▄▃▃▂▂▃ ▄
  14.3 ms         Histogram: frequency by time         16.7 ms <

 Memory estimate: 1.41 KiB, allocs estimate: 5.

…Math

src/auxiliary/math.jl

Co-authored-by: Hendrik Ranocha <ranocha@users.noreply.github.com>

DanielDoehring · 2024-01-09T15:27:42Z

Implementation of log:
https://github.com/JuliaLang/julia/blob/c0c676b3c8af3078f5cbc7da03acb8eff09f6c1d/base/special/log.jl#L261-L297

Implementation of sqrt:
https://github.com/JuliaLang/julia/blob/c0c676b3c8af3078f5cbc7da03acb8eff09f6c1d/base/math.jl#L690-L693

DanielDoehring · 2024-01-10T12:47:39Z

@ranocha Maybe I found something that suits our needs:

As with sqrt_llvm, we could call an LLVM implementation of the log via

log_(x::Float64) = ccall("llvm.log.f64", llvmcall, Float64, (Float64, ), x)
log_(x::Float32) = ccall("llvm.log.f32", llvmcall, Float32, (Float32, ), x)

which actually returns NaN or NaN32 if called with negative arguments.
(Taken from JuliaLang/julia#8869 (comment) )

To keep algorithmic differentiation working, we would additionally provide the generic fallback

log_(x::Real) = x < zero(x) ? oftype(x, NaN) : Base.log(x)
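
The three methods can be tried together in a small self-contained script. This is only a sketch: the name `demo_log` is hypothetical (chosen to avoid clashing with the `log_` of this PR), and `BigFloat` merely stands in for any non-hardware `Real` type, such as the dual numbers used by AD packages.

```julia
# Hardware floats: call the LLVM intrinsic, which yields NaN for negative input.
demo_log(x::Float64) = ccall("llvm.log.f64", llvmcall, Float64, (Float64,), x)
demo_log(x::Float32) = ccall("llvm.log.f32", llvmcall, Float32, (Float32,), x)

# Generic fallback for every other Real type: explicit sign check, then Base.log.
demo_log(x::Real) = x < zero(x) ? oftype(x, NaN) : Base.log(x)

a = demo_log(-1.0)        # dispatches to the Float64 method
b = demo_log(-1.0f0)      # dispatches to the Float32 method
c = demo_log(big"-1.0")   # dispatches to the generic fallback (BigFloat)
d = demo_log(2.0)         # regular positive argument

println((isnan(a), isnan(b), isnan(c)))
println(d ≈ log(2.0))
```

The most specific method wins, so `Float64` and `Float32` arguments never reach the branch in the fallback.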

Repeating the benchmarks from above:

examples/tree_2d_dgsem/elixir_euler_blast_wave_amr.jl with surface_flux = flux_hllc

t0 = tspan[1]
u0 = sol.u[2]
du = similar(u0)

using BenchmarkTools
b = @benchmarkable Trixi.rhs!(du, u0, semi, t0) evals=5 samples=2000 seconds=120
run(b)

Custom sqrt, log :

1 Thread:

BenchmarkTools.Trial: 2000 samples with 5 evaluations.
 Range (min … max):  8.090 ms …  11.248 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     8.781 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   8.819 ms ± 234.715 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                     ▂▄▇█▆▅▃▂▂▂▃▅▃▂▁                           
  ▂▁▂▁▁▁▂▂▃▂▂▃▂▂▂▃▄▆▇████████████████▅▄▄▄▄▃▄▃▃▃▃▃▂▂▂▂▃▂▂▂▂▂▂▂ ▄
  8.09 ms         Histogram: frequency by time        9.64 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

4 Threads:

BenchmarkTools.Trial: 2000 samples with 5 evaluations.
 Range (min … max):  2.368 ms …  13.217 ms  ┊ GC (min … max): 0.00% … 14.48%
 Time  (median):     2.734 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.796 ms ± 361.907 μs  ┊ GC (mean ± σ):  0.03% ±  0.32%

           ▂▅██▁                                               
  ▂▁▂▂▃▃▃▄▇█████▆▄▄▃▃▃▃▂▂▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▁▂▂▂▂▂▂▂▂▁▁▁▂▂ ▃
  2.37 ms         Histogram: frequency by time         4.2 ms <

 Memory estimate: 3.73 KiB, allocs estimate: 9.

Base sqrt, log :

1 thread:

BenchmarkTools.Trial: 2000 samples with 5 evaluations.
 Range (min … max):  8.238 ms …  10.269 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     8.764 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   8.788 ms ± 209.589 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                  ▁▃▅█▆▅▆▆▃▄▆▅▅▄▄▁▁▁                           
  ▁▁▁▁▁▁▁▁▁▃▃▅▅▆█▇███████████████████▇▆▄▂▃▃▃▃▂▂▂▂▂▁▁▂▂▃▁▂▂▁▁▁ ▄
  8.24 ms         Histogram: frequency by time        9.49 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

4 threads:

BenchmarkTools.Trial: 2000 samples with 5 evaluations.
 Range (min … max):  2.454 ms …  13.782 ms  ┊ GC (min … max): 0.00% … 13.19%
 Time  (median):     2.803 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.866 ms ± 349.961 μs  ┊ GC (mean ± σ):  0.03% ±  0.30%

      ▁ ▂▃▄▆▇███▇▆▄▄▂▂▂▁▁▁                                    ▂
  ▅▆▆▇████████████████████▇██▇█▇▆█▆▇▆▇▇▆▇█▆▅▅▅▄▅▅▁▅▅▅▄▆▅▅▄▅▅▅ █
  2.45 ms      Histogram: log(frequency) by time      4.03 ms <

 Memory estimate: 3.73 KiB, allocs estimate: 9.

tree_3d_dgsem/elixir_mhd_ec.jl with conservative surface flux flux_hlle and initial_refinement_level = 4:

t0 = tspan[1]
u0 = sol.u[2]
du = similar(u0)

using BenchmarkTools
b = @benchmarkable Trixi.rhs!(du, u0, semi, t0) evals=5 samples=2000 seconds=120
run(b)

Custom sqrt, log :

1 thread:

BenchmarkTools.Trial: 1000 samples with 3 evaluations.
 Range (min … max):  51.155 ms … 67.260 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     53.327 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   53.673 ms ±  1.674 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

       ▁▃▅▅█▇▆▂▁▄▅ ▂▂                                          
  ▂▃▄▄▇██████████████▇▆▇▅▆▄▅▄▃▃▃▃▂▃▂▁▃▂▂▂▃▂▃▁▂▂▁▁▃▁▁▁▁▁▂▁▁▁▂▂ ▄
  51.2 ms         Histogram: frequency by time        61.1 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.
 
 8 threads: 
 
BenchmarkTools.Trial: 1000 samples with 3 evaluations.
 Range (min … max):  14.807 ms …  19.546 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     16.817 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   16.778 ms ± 373.889 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                      ▂▂ ▂▃▆▃█▆▆▂▂              
  ▂▁▂▁▁▂▁▁▁▁▁▁▁▂▂▂▂▃▂▃▁▃▂▂▂▃▃▂▃▆▅▅▅▆█▇██▇█████████▇▇▅▄▄▄▃▃▃▂▃▂ ▄
  14.8 ms         Histogram: frequency by time         17.7 ms <

 Memory estimate: 1.41 KiB, allocs estimate: 5.

Base sqrt, log :

1 thread:

BenchmarkTools.Trial: 1000 samples with 3 evaluations.
 Range (min … max):  50.650 ms … 66.282 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     53.364 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   53.588 ms ±  1.332 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

              ▃▃▇█▆▅▇▄▃▁                                       
  ▂▁▁▁▂▁▃▂▂▄▅▇███████████▇▇▆▅▅▃▄▃▄▃▃▃▂▂▂▂▂▁▂▁▁▂▁▁▁▁▁▂▁▁▁▂▁▁▁▂ ▄
  50.6 ms         Histogram: frequency by time        59.6 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

8 threads:

BenchmarkTools.Trial: 1000 samples with 3 evaluations.
 Range (min … max):  15.194 ms …  19.291 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     16.770 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   16.745 ms ± 412.283 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                              ▁▂▃▃▅█▆▄▄▃▂                       
  ▂▂▁▁▁▂▃▂▁▁▂▃▃▃▃▃▃▃▄▄▄▃▆▄▅▇▇█████████████▆▇▄▃▃▃▃▃▂▃▂▂▂▁▂▁▂▂▂▂ ▄
  15.2 ms         Histogram: frequency by time           18 ms <

 Memory estimate: 1.41 KiB, allocs estimate: 5.

These look almost identical to me, which I would consider a success.
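
For a quick numerical check, the medians reported above can be turned into custom/Base ratios. The numbers below are copied from the benchmark output in this comment; the labels and variable names are only illustrative.

```julia
# Median Trixi.rhs! times in ms: (label, custom sqrt/log, Base sqrt/log)
medians = [
    ("Euler 2D, 1 thread",  8.781,  8.764),
    ("Euler 2D, 4 threads", 2.734,  2.803),
    ("MHD 3D, 1 thread",    53.327, 53.364),
    ("MHD 3D, 8 threads",   16.817, 16.770),
]

# A ratio below 1 means the custom implementation had the smaller median.
ratios = [custom / baseline for (_, custom, baseline) in medians]

for ((label, _, _), r) in zip(medians, ratios)
    println(rpad(label, 22), round(r; digits = 3))
end

# Largest relative deviation of any median ratio from 1
max_dev = maximum(abs.(ratios .- 1))
println("max deviation: ", round(100 * max_dev; digits = 1), " %")
```

All four ratios stay within about 3 % of 1, which is of the same order as the spread visible in the histograms.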

DanielDoehring · 2024-02-02T11:39:10Z

@DanielDoehring Could you please run the automated benchmarks on this branch as described in https://trixi-framework.github.io/Trixi.jl/stable/performance/#Automated-benchmarking? You should make sure to use a workstation for this that doesn't run other expensive stuff. And you should be prepared to wait a few hours until everything finishes.

I'll see what I can do. Unfortunately, for reliable performance measurements I would need to block an entire node of our compute cluster for multiple hours, which might take quite some time to get scheduled. Alternatively, I can run this as a non-exclusive job, at the expense of possibly less reliable results.

DanielDoehring · 2024-02-05T16:57:34Z

Unfortunately, I get an error when (presumably) executing the benchmarks of the main branch:

ERROR: ArgumentError: Package Trixi not found in current path.
- Run `import Pkg; Pkg.add("Trixi")` to install the Trixi package.
Stacktrace:
 [1] macro expansion
   @ Base ./loading.jl:1766 [inlined]
 [2] macro expansion
   @ Base ./lock.jl:267 [inlined]
 [3] __require(into::Module, mod::Symbol)
   @ Base ./loading.jl:1747
 [4] #invoke_in_world#3
   @ Base ./essentials.jl:921 [inlined]
 [5] invoke_in_world
   @ Base ./essentials.jl:918 [inlined]
 [6] require(into::Module, mod::Symbol)
   @ Base ./loading.jl:1740

with standard output:

PkgBenchmark: creating benchmark tuning file /rwthfs/rz/cluster/home/git/Trixi.jl/benchmark/tune.json...
(1/28) tuning "tree_2d_dgsem/elixir_euler_vortex_mortar.jl"...
  (1/4) tuning "p3_rhs!"...
  done (took 15.714234339 seconds)
  (2/4) tuning "p7_rhs!"...
  done (took 76.10260231 seconds)
  (3/4) tuning "p7_analysis"...
  done (took 21.453400237 seconds)
  (4/4) tuning "p3_analysis"...
  done (took 15.056647826 seconds)
done (took 131.906980299 seconds)
(2/28) tuning "tree_3d_dgsem/elixir_mhd_ec.jl"...
  (1/4) tuning "p3_rhs!"...
  done (took 29.225925116 seconds)
  (2/4) tuning "p7_rhs!"...
  done (took 375.223802844 seconds)
  (3/4) tuning "p7_analysis"...
  done (took 61.658343698 seconds)
  (4/4) tuning "p3_analysis"...
  done (took 16.326966469 seconds)
done (took 485.785440819 seconds)
(3/28) tuning "structured_3d_dgsem/elixir_euler_ec.jl"...
  (1/4) tuning "p3_rhs!"...
  done (took 21.638970102 seconds)
  (2/4) tuning "p7_rhs!"...
  done (took 145.468492825 seconds)
  (3/4) tuning "p7_analysis"...
  done (took 49.897601131 seconds)
  (4/4) tuning "p3_analysis"...
  done (took 14.41875533 seconds)
done (took 235.043929672 seconds)
(4/28) tuning "tree_3d_dgsem/elixir_euler_ec.jl"...
  (1/4) tuning "p3_rhs!"...
  done (took 84.156037201 seconds)
  (2/4) tuning "p7_rhs!"...
  done (took 1140.237702335 seconds)
  (3/4) tuning "p7_analysis"...
  done (took 317.161217237 seconds)
  (4/4) tuning "p3_analysis"...
  done (took 38.72742346 seconds)
done (took 1583.759410713 seconds)
(5/28) tuning "unstructured_2d_dgsem/elixir_euler_wall_bc.jl"...
  (1/4) tuning "p3_rhs!"...
  done (took 12.216714106 seconds)
  (2/4) tuning "p7_rhs!"...
  done (took 22.011360655 seconds)
  (3/4) tuning "p7_analysis"...
  done (took 12.402302703 seconds)
  (4/4) tuning "p3_analysis"...
  done (took 11.527143089 seconds)
done (took 61.965998148 seconds)
(6/28) tuning "tree_3d_dgsem/elixir_euler_shockcapturing.jl"...
  (1/4) tuning "p3_rhs!"...
  done (took 91.256684659 seconds)
  (2/4) tuning "p7_rhs!"...
  done (took 1200.799535653 seconds)
  (3/4) tuning "p7_analysis"...
  done (took 268.97470469 seconds)
  (4/4) tuning "p3_analysis"...
  done (took 32.808439139 seconds)
done (took 1597.80109128 seconds)
(7/28) tuning "tree_2d_dgsem/elixir_advection_amr_nonperiodic.jl"...
  (1/4) tuning "p3_rhs!"...
  done (took 10.464582383 seconds)
  (2/4) tuning "p7_rhs!"...
  done (took 16.305267609 seconds)
  (3/4) tuning "p7_analysis"...
  done (took 18.016344489 seconds)
  (4/4) tuning "p3_analysis"...
  done (took 12.809092958 seconds)
done (took 60.668745842 seconds)
(8/28) tuning "benchmark/elixir_2d_euler_vortex_p4est.jl"...
  (1/4) tuning "p3_rhs!"...
  done (took 11.264458026 seconds)
  (2/4) tuning "p7_rhs!"...
  done (took 30.218253739 seconds)
  (3/4) tuning "p7_analysis"...
  done (took 19.14552961 seconds)
  (4/4) tuning "p3_analysis"...
  done (took 13.664613786 seconds)
done (took 78.138848087 seconds)
(9/28) tuning "tree_3d_dgsem/elixir_advection_extended.jl"...
  (1/4) tuning "p3_rhs!"...
  done (took 18.862721663 seconds)
  (2/4) tuning "p7_rhs!"...
  done (took 204.710356465 seconds)
  (3/4) tuning "p7_analysis"...
  done (took 155.833574637 seconds)
  (4/4) tuning "p3_analysis"...
  done (took 21.683521389 seconds)
done (took 404.458885476 seconds)
(10/28) tuning "structured_2d_dgsem/elixir_advection_extended.jl"...
  (1/4) tuning "p3_rhs!"...
  done (took 12.846023065 seconds)
  (2/4) tuning "p7_rhs!"...
  done (took 22.663477303 seconds)
  (3/4) tuning "p7_analysis"...
  done (took 18.438284368 seconds)
  (4/4) tuning "p3_analysis"...
  done (took 15.56976736 seconds)
done (took 72.939499736 seconds)
(11/28) tuning "tree_2d_dgsem/elixir_advection_extended.jl"...
  (1/4) tuning "p3_rhs!"...
  done (took 9.233539848 seconds)
  (2/4) tuning "p7_rhs!"...
  done (took 17.373932421 seconds)
  (3/4) tuning "p7_analysis"...
  done (took 11.447854133 seconds)
  (4/4) tuning "p3_analysis"...
  done (took 12.105896462 seconds)
done (took 53.381192204 seconds)
(12/28) tuning "tree_2d_dgsem/elixir_euler_ec.jl"...
  (1/4) tuning "p3_rhs!"...
  done (took 18.572986465 seconds)
  (2/4) tuning "p7_rhs!"...
  done (took 108.93184312 seconds)
  (3/4) tuning "p7_analysis"...
  done (took 27.673408382 seconds)
  (4/4) tuning "p3_analysis"...
  done (took 19.16687415 seconds)
done (took 177.845071313 seconds)
(13/28) tuning "structured_2d_dgsem/elixir_euler_ec.jl"...
  (1/4) tuning "p3_rhs!"...
  done (took 9.990776535 seconds)
  (2/4) tuning "p7_rhs!"...
  done (took 29.410013647 seconds)
  (3/4) tuning "p7_analysis"...
  done (took 17.955411267 seconds)
  (4/4) tuning "p3_analysis"...
  done (took 12.113499793 seconds)
done (took 72.332992949 seconds)
(14/28) tuning "latency"...
  (1/5) tuning "polydeg_3"...
PkgBenchmark: Running benchmarks...

The script I execute is

using PkgBenchmark, Trixi

results = judge(Trixi,
             BenchmarkConfig(juliacmd=`$(Base.julia_cmd()) --project=. --check-bounds=no --threads=2`), # target
             BenchmarkConfig(juliacmd=`$(Base.julia_cmd()) --project=. --check-bounds=no --threads=2`, id="main") # baseline
       )

#export_markdown(pkgdir(Trixi, "benchmark", "results.md"), results)
export_markdown("results.md", results)

while I also tried

using PkgBenchmark, Trixi

results = judge(Trixi,
             BenchmarkConfig(juliacmd=`$(Base.julia_cmd()) --check-bounds=no --threads=2`), # target
             BenchmarkConfig(juliacmd=`$(Base.julia_cmd()) --check-bounds=no --threads=2`, id="main") # baseline
       )

#export_markdown(pkgdir(Trixi, "benchmark", "results.md"), results)
export_markdown("results.md", results)

I installed Trixi in dev mode from my fork of Trixi and switched to the branch under test.

ranocha · 2024-02-06T13:37:58Z

Did you install the development version of Trixi.jl also in the benchmark project as done in

Trixi.jl/.github/workflows/benchmark.yml

Lines 44 to 47 in 14151e6

    
      - name: Install dependencies
        run: julia --project=benchmark/ -e 'using Pkg; Pkg.develop(PackageSpec(path=pwd())); Pkg.instantiate()'
      - name: Run benchmarks
        run: julia --project=benchmark/ --color=yes benchmark/run_benchmarks.jl

in our GitHub action? I think the docs should be improved to describe this step in more detail (or at all 😅).

DanielDoehring · 2024-02-06T14:19:21Z

No - I will give this a try 👍

ranocha · 2024-02-07T06:22:45Z

I just found a problem in the benchmarks config. You need to update your local main branch and the branch of this PR.
You should also run it with Julia 1.9 or delete the --check-bounds=no specification for Julia 1.10.
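
One possible way to handle both Julia versions in the same script is to add the flag only where it is wanted. This is a sketch under assumptions: the variable names are made up, and the `--project=.` and `--threads=2` arguments are simply taken from the script posted above.

```julia
# Collect the flags for the benchmark subprocess.
flags = ["--threads=2"]

# Assumption: pass --check-bounds=no only on Julia versions before 1.10.
if VERSION < v"1.10-"
    push!(flags, "--check-bounds=no")
end

# Interpolating a vector into a command expands it into separate arguments.
juliacmd = `$(Base.julia_cmd()) --project=. $flags`

println(juliacmd)
```

The resulting `juliacmd` could then be passed as the `juliacmd` keyword of `BenchmarkConfig` for both target and baseline.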

ranocha · 2024-02-07T07:17:59Z

I'm running some stuff locally. It looks like the benchmarks setup is a bit bit-rotten...

DanielDoehring · 2024-02-08T09:45:03Z

Hm, I still get the

ERROR: ArgumentError: Package Trixi not found in current path.
- Run `import Pkg; Pkg.add("Trixi")` to install the Trixi package.

error, even after instantiating the package in the benchmarks directory on both the main and the NaNMath branch.

ranocha · 2024-02-21T12:56:07Z

Here is what I get on one of our servers:

1 thread

ID	time ratio	memory ratio
`["benchmark/elixir_2d_euler_vortex_tree.jl", "p3_rhs!"]`	0.95 (5%) ✅	1.00 (1%)
`["p4est_2d_dgsem/elixir_advection_extended.jl", "p3_rhs!"]`	0.93 (5%) ✅	1.00 (1%)
`["structured_2d_dgsem/elixir_mhd_ec.jl", "p3_rhs!"]`	0.95 (5%) ✅	1.00 (1%)
`["structured_3d_dgsem/elixir_mhd_ec.jl", "p7_analysis"]`	0.95 (5%) ✅	1.00 (1%)
`["tree_2d_dgsem/elixir_euler_vortex_mortar.jl", "p3_rhs!"]`	0.94 (5%) ✅	1.00 (1%)
`["tree_2d_dgsem/elixir_mhd_ec.jl", "p7_analysis"]`	0.94 (5%) ✅	1.00 (1%)
`["tree_3d_dgsem/elixir_mhd_ec.jl", "p7_analysis"]`	0.93 (5%) ✅	1.00 (1%)

2 threads

ID	time ratio	memory ratio
`["benchmark/elixir_2d_euler_vortex_structured.jl", "p3_rhs!"]`	0.91 (5%) ✅	1.00 (1%)
`["benchmark/elixir_2d_euler_vortex_unstructured.jl", "p3_rhs!"]`	1.17 (5%) ❌	1.00 (1%)
`["structured_2d_dgsem/elixir_euler_source_terms_nonperiodic.jl", "p3_rhs!"]`	1.09 (5%) ❌	1.00 (1%)
`["structured_3d_dgsem/elixir_advection_nonperiodic_curved.jl", "p3_rhs!"]`	0.88 (5%) ✅	1.00 (1%)
`["structured_3d_dgsem/elixir_euler_source_terms_nonperiodic_curved.jl", "p3_rhs!"]`	1.10 (5%) ❌	1.00 (1%)
`["structured_3d_dgsem/elixir_mhd_ec.jl", "p3_analysis"]`	0.95 (5%) ✅	1.00 (1%)
`["structured_3d_dgsem/elixir_mhd_ec.jl", "p7_analysis"]`	0.94 (5%) ✅	1.00 (1%)
`["tree_2d_dgsem/elixir_advection_amr_nonperiodic.jl", "p3_rhs!"]`	1.43 (5%) ❌	1.00 (1%)
`["tree_2d_dgsem/elixir_advection_amr_nonperiodic.jl", "p7_analysis"]`	0.89 (5%) ✅	1.00 (1%)
`["tree_2d_dgsem/elixir_euler_vortex_mortar_shockcapturing.jl", "p3_rhs!"]`	0.95 (5%) ✅	1.00 (1%)
`["tree_2d_dgsem/elixir_mhd_ec.jl", "p7_analysis"]`	0.94 (5%) ✅	1.00 (1%)
`["tree_3d_dgsem/elixir_advection_extended.jl", "p3_rhs!"]`	0.88 (5%) ✅	1.00 (1%)
`["tree_3d_dgsem/elixir_mhd_ec.jl", "p3_analysis"]`	0.95 (5%) ✅	1.00 (1%)
`["tree_3d_dgsem/elixir_mhd_ec.jl", "p7_analysis"]`	0.94 (5%) ✅	1.00 (1%)

It would be interesting to see results from another server/run.

DanielDoehring · 2024-02-22T08:00:30Z

1 Thread

Results

A ratio greater than 1.0 denotes a possible regression (marked with ❌), while a ratio less
than 1.0 denotes a possible improvement (marked with ✅). Only significant results - results
that indicate possible regressions or improvements - are shown below (thus, an empty table means that all
benchmark results remained invariant between builds).
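The classification described above can be sketched as follows. This is an illustration under assumptions, not the code used by the benchmark tooling: the function name `classify_ratio` is hypothetical, the ratio is taken to be target time divided by baseline time, and the `(5%)` in the tables is taken to be a symmetric tolerance around 1.0.

```julia
# Hypothetical helper (not part of Trixi.jl or the benchmark scripts): classify a
# ratio of target to baseline with a tolerance, here the 5% shown in the tables.
function classify_ratio(ratio::Real; tolerance::Real = 0.05)
    if isnan(ratio) || ratio - tolerance > 1
        return :regression      # target is slower than baseline beyond the tolerance
    elseif ratio + tolerance < 1
        return :improvement     # target is faster than baseline beyond the tolerance
    else
        return :invariant       # within tolerance; such rows are not listed
    end
end

for ratio in (0.80, 1.00, 1.43)
    println(ratio, " => ", classify_ratio(ratio))
end
```

Note that the ratios printed in the tables are rounded to two digits, so a displayed value such as 0.95 or 1.05 cannot be re-classified reliably from the table alone.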

ID	time ratio	memory ratio
`["benchmark/elixir_2d_euler_vortex_p4est.jl", "p3_analysis"]`	0.87 (5%) ✅	1.00 (1%)
`["benchmark/elixir_2d_euler_vortex_p4est.jl", "p3_rhs!"]`	0.80 (5%) ✅	1.00 (1%)
`["benchmark/elixir_2d_euler_vortex_structured.jl", "p3_analysis"]`	1.06 (5%) ❌	1.00 (1%)
`["benchmark/elixir_2d_euler_vortex_tree.jl", "p3_rhs!"]`	0.89 (5%) ✅	1.00 (1%)
`["benchmark/elixir_2d_euler_vortex_unstructured.jl", "p3_rhs!"]`	1.06 (5%) ❌	1.00 (1%)
`["latency", "default_example"]`	1.05 (5%) ❌	1.00 (1%)
`["latency", "euler_2d"]`	1.11 (5%) ❌	1.00 (1%)
`["latency", "polydeg_3"]`	0.90 (5%) ✅	1.00 (1%)
`["latency", "polydeg_7"]`	1.08 (5%) ❌	1.00 (1%)
`["p4est_2d_dgsem/elixir_advection_extended.jl", "p7_analysis"]`	1.24 (5%) ❌	1.00 (1%)
`["structured_2d_dgsem/elixir_advection_extended.jl", "p7_analysis"]`	0.94 (5%) ✅	1.00 (1%)
`["structured_2d_dgsem/elixir_advection_nonperiodic.jl", "p3_analysis"]`	1.06 (5%) ❌	1.00 (1%)
`["structured_2d_dgsem/elixir_euler_ec.jl", "p3_analysis"]`	0.89 (5%) ✅	1.00 (1%)
`["structured_2d_dgsem/elixir_euler_ec.jl", "p3_rhs!"]`	0.95 (5%) ✅	1.00 (1%)
`["structured_2d_dgsem/elixir_euler_ec.jl", "p7_analysis"]`	0.87 (5%) ✅	1.00 (1%)
`["structured_2d_dgsem/elixir_euler_source_terms_nonperiodic.jl", "p3_analysis"]`	1.15 (5%) ❌	1.00 (1%)
`["structured_2d_dgsem/elixir_euler_source_terms_nonperiodic.jl", "p7_analysis"]`	1.05 (5%) ❌	1.00 (1%)
`["structured_2d_dgsem/elixir_mhd_ec.jl", "p3_analysis"]`	1.10 (5%) ❌	1.00 (1%)
`["structured_2d_dgsem/elixir_mhd_ec.jl", "p7_analysis"]`	1.11 (5%) ❌	1.00 (1%)
`["structured_3d_dgsem/elixir_advection_nonperiodic_curved.jl", "p7_rhs!"]`	1.08 (5%) ❌	1.00 (1%)
`["structured_3d_dgsem/elixir_euler_ec.jl", "p3_analysis"]`	0.85 (5%) ✅	1.00 (1%)
`["structured_3d_dgsem/elixir_euler_ec.jl", "p3_rhs!"]`	0.86 (5%) ✅	1.00 (1%)
`["structured_3d_dgsem/elixir_mhd_ec.jl", "p3_analysis"]`	0.95 (5%) ✅	1.00 (1%)
`["structured_3d_dgsem/elixir_mhd_ec.jl", "p7_analysis"]`	0.93 (5%) ✅	1.00 (1%)
`["tree_2d_dgsem/elixir_advection_amr_nonperiodic.jl", "p3_rhs!"]`	0.70 (5%) ✅	1.00 (1%)
`["tree_2d_dgsem/elixir_advection_amr_nonperiodic.jl", "p7_analysis"]`	0.87 (5%) ✅	1.00 (1%)
`["tree_2d_dgsem/elixir_advection_amr_nonperiodic.jl", "p7_rhs!"]`	0.86 (5%) ✅	1.00 (1%)
`["tree_2d_dgsem/elixir_advection_extended.jl", "p3_analysis"]`	1.34 (5%) ❌	1.00 (1%)
`["tree_2d_dgsem/elixir_advection_extended.jl", "p3_rhs!"]`	1.09 (5%) ❌	1.00 (1%)
`["tree_2d_dgsem/elixir_advection_extended.jl", "p7_analysis"]`	0.86 (5%) ✅	1.00 (1%)
`["tree_2d_dgsem/elixir_advection_extended.jl", "p7_rhs!"]`	1.10 (5%) ❌	1.00 (1%)
`["tree_2d_dgsem/elixir_euler_ec.jl", "p3_analysis"]`	1.30 (5%) ❌	1.00 (1%)
`["tree_2d_dgsem/elixir_euler_vortex_mortar.jl", "p3_analysis"]`	0.86 (5%) ✅	1.00 (1%)
`["tree_2d_dgsem/elixir_euler_vortex_mortar.jl", "p7_rhs!"]`	0.87 (5%) ✅	1.00 (1%)
`["tree_2d_dgsem/elixir_euler_vortex_mortar_shockcapturing.jl", "p3_analysis"]`	1.16 (5%) ❌	1.00 (1%)
`["tree_2d_dgsem/elixir_mhd_ec.jl", "p7_analysis"]`	0.94 (5%) ✅	1.00 (1%)
`["tree_3d_dgsem/elixir_advection_extended.jl", "p3_analysis"]`	1.30 (5%) ❌	1.00 (1%)
`["tree_3d_dgsem/elixir_advection_extended.jl", "p7_analysis"]`	0.93 (5%) ✅	1.00 (1%)
`["tree_3d_dgsem/elixir_euler_ec.jl", "p3_analysis"]`	0.83 (5%) ✅	1.00 (1%)
`["tree_3d_dgsem/elixir_euler_ec.jl", "p7_analysis"]`	0.87 (5%) ✅	1.00 (1%)
`["tree_3d_dgsem/elixir_euler_ec.jl", "p7_rhs!"]`	0.95 (5%) ✅	1.00 (1%)
`["tree_3d_dgsem/elixir_euler_shockcapturing.jl", "p3_analysis"]`	0.85 (5%) ✅	1.00 (1%)
`["tree_3d_dgsem/elixir_euler_shockcapturing.jl", "p7_analysis"]`	0.88 (5%) ✅	1.00 (1%)
`["tree_3d_dgsem/elixir_mhd_ec.jl", "p3_analysis"]`	0.83 (5%) ✅	1.00 (1%)
`["tree_3d_dgsem/elixir_mhd_ec.jl", "p7_analysis"]`	0.77 (5%) ✅	1.00 (1%)
`["tree_3d_dgsem/elixir_mhd_ec.jl", "p7_rhs!"]`	0.92 (5%) ✅	1.00 (1%)
`["unstructured_2d_dgsem/elixir_euler_wall_bc.jl", "p3_analysis"]`	0.92 (5%) ✅	1.00 (1%)
`["unstructured_2d_dgsem/elixir_euler_wall_bc.jl", "p3_rhs!"]`	0.94 (5%) ✅	1.00 (1%)
`["unstructured_2d_dgsem/elixir_euler_wall_bc.jl", "p7_rhs!"]`	0.94 (5%) ✅	1.00 (1%)

Benchmark Group List

Here's a list of all the benchmark groups executed by this job:

["benchmark/elixir_2d_euler_vortex_p4est.jl"]
["benchmark/elixir_2d_euler_vortex_structured.jl"]
["benchmark/elixir_2d_euler_vortex_tree.jl"]
["benchmark/elixir_2d_euler_vortex_unstructured.jl"]
["latency"]
["p4est_2d_dgsem/elixir_advection_extended.jl"]
["p4est_3d_dgsem/elixir_advection_basic.jl"]
["structured_2d_dgsem/elixir_advection_extended.jl"]
["structured_2d_dgsem/elixir_advection_nonperiodic.jl"]
["structured_2d_dgsem/elixir_euler_ec.jl"]
["structured_2d_dgsem/elixir_euler_source_terms_nonperiodic.jl"]
["structured_2d_dgsem/elixir_mhd_ec.jl"]
["structured_3d_dgsem/elixir_advection_nonperiodic_curved.jl"]
["structured_3d_dgsem/elixir_euler_ec.jl"]
["structured_3d_dgsem/elixir_euler_source_terms_nonperiodic_curved.jl"]
["structured_3d_dgsem/elixir_mhd_ec.jl"]
["tree_2d_dgsem/elixir_advection_amr_nonperiodic.jl"]
["tree_2d_dgsem/elixir_advection_extended.jl"]
["tree_2d_dgsem/elixir_euler_ec.jl"]
["tree_2d_dgsem/elixir_euler_vortex_mortar.jl"]
["tree_2d_dgsem/elixir_euler_vortex_mortar_shockcapturing.jl"]
["tree_2d_dgsem/elixir_mhd_ec.jl"]
["tree_3d_dgsem/elixir_advection_extended.jl"]
["tree_3d_dgsem/elixir_euler_ec.jl"]
["tree_3d_dgsem/elixir_euler_mortar.jl"]
["tree_3d_dgsem/elixir_euler_shockcapturing.jl"]
["tree_3d_dgsem/elixir_mhd_ec.jl"]
["unstructured_2d_dgsem/elixir_euler_wall_bc.jl"]

Julia versioninfo

Target

Julia Version 1.9.4
Commit 8e5136fa297 (2023-11-14 08:46 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
      "Rocky Linux release 8.9 (Green Obsidian)"
  uname: Linux 4.18.0-513.11.1.el8_9.x86_64 #1 SMP Wed Jan 10 22:58:54 UTC 2024 x86_64 x86_64
  CPU: Intel(R) Xeon(R) Platinum 8160 CPU @ 2.10GHz: 
                 speed         user         nice          sys         idle          irq
       #1-48  2100 MHz  190265762 s       7123 s    3499305 s  335295439 s    1378880 s
  Memory: 187.07468032836914 GB (163619.5 MB free)
  Uptime: 1.10867135e6 sec
  Load Avg:  19.06  19.59  18.4
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, skylake-avx512)
  Threads: 1 on 48 virtual cores

Baseline

Julia Version 1.9.4
Commit 8e5136fa297 (2023-11-14 08:46 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
      "Rocky Linux release 8.9 (Green Obsidian)"
  uname: Linux 4.18.0-513.11.1.el8_9.x86_64 #1 SMP Wed Jan 10 22:58:54 UTC 2024 x86_64 x86_64
  CPU: Intel(R) Xeon(R) Platinum 8160 CPU @ 2.10GHz: 
                 speed         user         nice          sys         idle          irq
       #1-48  2100 MHz  190719005 s       7130 s    3522736 s  335948290 s    1382233 s
  Memory: 187.07468032836914 GB (174674.2109375 MB free)
  Uptime: 1.11103335e6 sec
  Load Avg:  10.19  11.36  16.52
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, skylake-avx512)
  Threads: 1 on 48 virtual cores

DanielDoehring · 2024-02-22T08:01:53Z

2 Threads

Results

A ratio greater than 1.0 denotes a possible regression (marked with ❌), while a ratio less
than 1.0 denotes a possible improvement (marked with ✅). Only significant results - results
that indicate possible regressions or improvements - are shown below (thus, an empty table means that all
benchmark results remained invariant between builds).

ID	time ratio	memory ratio
`["benchmark/elixir_2d_euler_vortex_tree.jl", "p7_rhs!"]`	1.06 (5%) ❌	1.00 (1%)
`["benchmark/elixir_2d_euler_vortex_unstructured.jl", "p7_rhs!"]`	1.06 (5%) ❌	1.00 (1%)
`["latency", "mhd_2d"]`	0.93 (5%) ✅	1.00 (1%)
`["latency", "polydeg_3"]`	0.86 (5%) ✅	1.00 (1%)
`["structured_2d_dgsem/elixir_advection_extended.jl", "p7_analysis"]`	0.94 (5%) ✅	1.00 (1%)
`["structured_2d_dgsem/elixir_advection_extended.jl", "p7_rhs!"]`	1.08 (5%) ❌	1.00 (1%)
`["structured_2d_dgsem/elixir_euler_ec.jl", "p7_analysis"]`	0.95 (5%) ✅	1.00 (1%)
`["structured_2d_dgsem/elixir_euler_ec.jl", "p7_rhs!"]`	0.95 (5%) ✅	1.00 (1%)
`["structured_2d_dgsem/elixir_mhd_ec.jl", "p7_rhs!"]`	1.10 (5%) ❌	1.00 (1%)
`["structured_3d_dgsem/elixir_advection_nonperiodic_curved.jl", "p3_rhs!"]`	1.08 (5%) ❌	1.00 (1%)
`["structured_3d_dgsem/elixir_mhd_ec.jl", "p3_analysis"]`	0.95 (5%) ✅	1.00 (1%)
`["structured_3d_dgsem/elixir_mhd_ec.jl", "p7_analysis"]`	0.94 (5%) ✅	1.00 (1%)
`["tree_2d_dgsem/elixir_advection_amr_nonperiodic.jl", "p3_analysis"]`	0.95 (5%) ✅	1.00 (1%)
`["tree_2d_dgsem/elixir_advection_extended.jl", "p3_analysis"]`	0.94 (5%) ✅	1.00 (1%)
`["tree_2d_dgsem/elixir_advection_extended.jl", "p7_analysis"]`	0.94 (5%) ✅	1.00 (1%)
`["tree_2d_dgsem/elixir_euler_ec.jl", "p3_analysis"]`	0.94 (5%) ✅	1.00 (1%)
`["tree_2d_dgsem/elixir_euler_ec.jl", "p3_rhs!"]`	1.17 (5%) ❌	1.00 (1%)
`["tree_2d_dgsem/elixir_euler_ec.jl", "p7_analysis"]`	0.94 (5%) ✅	1.00 (1%)
`["tree_2d_dgsem/elixir_euler_vortex_mortar.jl", "p7_rhs!"]`	1.09 (5%) ❌	1.00 (1%)
`["tree_2d_dgsem/elixir_mhd_ec.jl", "p7_analysis"]`	0.95 (5%) ✅	1.00 (1%)
`["tree_3d_dgsem/elixir_advection_extended.jl", "p7_rhs!"]`	1.08 (5%) ❌	1.00 (1%)
`["tree_3d_dgsem/elixir_euler_ec.jl", "p7_rhs!"]`	1.07 (5%) ❌	1.00 (1%)
`["tree_3d_dgsem/elixir_euler_shockcapturing.jl", "p3_rhs!"]`	0.93 (5%) ✅	1.00 (1%)
`["tree_3d_dgsem/elixir_mhd_ec.jl", "p7_analysis"]`	0.94 (5%) ✅	1.00 (1%)
`["tree_3d_dgsem/elixir_mhd_ec.jl", "p7_rhs!"]`	0.85 (5%) ✅	1.00 (1%)
`["unstructured_2d_dgsem/elixir_euler_wall_bc.jl", "p3_rhs!"]`	1.06 (5%) ❌	1.00 (1%)

Benchmark Group List

Here's a list of all the benchmark groups executed by this job:

["benchmark/elixir_2d_euler_vortex_p4est.jl"]
["benchmark/elixir_2d_euler_vortex_structured.jl"]
["benchmark/elixir_2d_euler_vortex_tree.jl"]
["benchmark/elixir_2d_euler_vortex_unstructured.jl"]
["latency"]
["p4est_2d_dgsem/elixir_advection_extended.jl"]
["p4est_3d_dgsem/elixir_advection_basic.jl"]
["structured_2d_dgsem/elixir_advection_extended.jl"]
["structured_2d_dgsem/elixir_advection_nonperiodic.jl"]
["structured_2d_dgsem/elixir_euler_ec.jl"]
["structured_2d_dgsem/elixir_euler_source_terms_nonperiodic.jl"]
["structured_2d_dgsem/elixir_mhd_ec.jl"]
["structured_3d_dgsem/elixir_advection_nonperiodic_curved.jl"]
["structured_3d_dgsem/elixir_euler_ec.jl"]
["structured_3d_dgsem/elixir_euler_source_terms_nonperiodic_curved.jl"]
["structured_3d_dgsem/elixir_mhd_ec.jl"]
["tree_2d_dgsem/elixir_advection_amr_nonperiodic.jl"]
["tree_2d_dgsem/elixir_advection_extended.jl"]
["tree_2d_dgsem/elixir_euler_ec.jl"]
["tree_2d_dgsem/elixir_euler_vortex_mortar.jl"]
["tree_2d_dgsem/elixir_euler_vortex_mortar_shockcapturing.jl"]
["tree_2d_dgsem/elixir_mhd_ec.jl"]
["tree_3d_dgsem/elixir_advection_extended.jl"]
["tree_3d_dgsem/elixir_euler_ec.jl"]
["tree_3d_dgsem/elixir_euler_mortar.jl"]
["tree_3d_dgsem/elixir_euler_shockcapturing.jl"]
["tree_3d_dgsem/elixir_mhd_ec.jl"]
["unstructured_2d_dgsem/elixir_euler_wall_bc.jl"]

Julia versioninfo

Target

Julia Version 1.9.4
Commit 8e5136fa297 (2023-11-14 08:46 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
      "Rocky Linux release 8.9 (Green Obsidian)"
  uname: Linux 4.18.0-513.11.1.el8_9.x86_64 #1 SMP Wed Jan 10 22:58:54 UTC 2024 x86_64 x86_64
  CPU: Intel(R) Xeon(R) Platinum 8160 CPU @ 2.10GHz: 
                 speed         user         nice          sys         idle          irq
       #1-48  2100 MHz  190740522 s       7143 s    3523861 s  336936464 s    1382380 s
  Memory: 187.07468032836914 GB (181579.1796875 MB free)
  Uptime: 1.1131404e6 sec
  Load Avg:  1.31  1.36  2.89
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, skylake-avx512)
  Threads: 2 on 48 virtual cores

Baseline

Julia Version 1.9.4
Commit 8e5136fa297 (2023-11-14 08:46 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
      "Rocky Linux release 8.9 (Green Obsidian)"
  uname: Linux 4.18.0-513.11.1.el8_9.x86_64 #1 SMP Wed Jan 10 22:58:54 UTC 2024 x86_64 x86_64
  CPU: Intel(R) Xeon(R) Platinum 8160 CPU @ 2.10GHz: 
                 speed         user         nice          sys         idle          irq
       #1-48  2100 MHz  190771163 s       7148 s    3525368 s  337942610 s    1382565 s
  Memory: 187.07468032836914 GB (181894.94140625 MB free)
  Uptime: 1.11530533e6 sec
  Load Avg:  1.36  1.52  1.86
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, skylake-avx512)
  Threads: 2 on 48 virtual cores

ranocha

Thanks for running the benchmarks, too. As far as I understand, no benchmark shows a regression in both runs of either comparison (the same number of threads on your server and mine, or the same server with a different number of threads). Thus, I assume that there are no serious performance regressions in this PR.
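One way to make this comparison concrete is to intersect the sets of regressions from two runs. The sketch below uses the rows marked ❌ in the two 2-thread tables posted in this thread; the variable names are illustrative, and matching on the full (elixir, benchmark) pair is an assumption about how "the same benchmark" is meant.

```julia
# Rows marked ❌ in the 2-thread table posted on 2024-02-21.
regressions_run_a = Set([
    ("benchmark/elixir_2d_euler_vortex_unstructured.jl", "p3_rhs!"),
    ("structured_2d_dgsem/elixir_euler_source_terms_nonperiodic.jl", "p3_rhs!"),
    ("structured_3d_dgsem/elixir_euler_source_terms_nonperiodic_curved.jl", "p3_rhs!"),
    ("tree_2d_dgsem/elixir_advection_amr_nonperiodic.jl", "p3_rhs!"),
])

# Rows marked ❌ in the 2-thread table posted on 2024-02-22.
regressions_run_b = Set([
    ("benchmark/elixir_2d_euler_vortex_tree.jl", "p7_rhs!"),
    ("benchmark/elixir_2d_euler_vortex_unstructured.jl", "p7_rhs!"),
    ("structured_2d_dgsem/elixir_advection_extended.jl", "p7_rhs!"),
    ("structured_2d_dgsem/elixir_mhd_ec.jl", "p7_rhs!"),
    ("structured_3d_dgsem/elixir_advection_nonperiodic_curved.jl", "p3_rhs!"),
    ("tree_2d_dgsem/elixir_euler_ec.jl", "p3_rhs!"),
    ("tree_2d_dgsem/elixir_euler_vortex_mortar.jl", "p7_rhs!"),
    ("tree_3d_dgsem/elixir_advection_extended.jl", "p7_rhs!"),
    ("tree_3d_dgsem/elixir_euler_ec.jl", "p7_rhs!"),
    ("unstructured_2d_dgsem/elixir_euler_wall_bc.jl", "p3_rhs!"),
])

# Benchmarks that regressed in both runs; an empty set supports the reading
# that the individual regressions are noise rather than a systematic slowdown.
shared_regressions = intersect(regressions_run_a, regressions_run_b)
println(isempty(shared_regressions) ? "no shared regressions" : shared_regressions)
```

Matching on the elixir name alone would be stricter: `benchmark/elixir_2d_euler_vortex_unstructured.jl` appears in both tables, but under different benchmark keys (`p3_rhs!` and `p7_rhs!`).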

Thanks a lot! This is nearly ready to merge - I just have a minor comment.

DanielDoehring · 2024-02-22T13:42:44Z

Thanks for running the benchmarks, too. As far as I understand, no benchmark shows a regression in both runs of either comparison (the same number of threads on your server and mine, or the same server with a different number of threads). Thus, I assume that there are no serious performance regressions in this PR.

I ran another test to make sure: between both of my runs on the same system, there are no shared elixirs with increased runtime, in either the single-threaded or the multi-threaded case.
Additionally, as already observed, there are no elixirs whose runtime increased in both my second run and the run you posted.

ranocha

So we're ready to go from your point of view?

DanielDoehring · 2024-02-22T14:14:16Z

Yes!

I plan to file an issue/PR in the NaNMath.jl repo to showcase our implementation, as it is probably more efficient than the one currently provided by the package.
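To illustrate the idea named in the PR title, here is a minimal sketch of `sqrt` and `log` variants that return `NaN` for negative real arguments instead of throwing a `DomainError`, as `Base.sqrt` and `Base.log` do. The names `sqrt_or_nan` and `log_or_nan` are hypothetical, and this is not the implementation merged in `src/auxiliary/math.jl`, which may differ, in particular in how it is tuned for performance.

```julia
# Hypothetical names, for illustration only. The explicit check replaces the
# DomainError thrown by Base.sqrt / Base.log with a NaN of the matching float type.
sqrt_or_nan(x::Real) = x < zero(x) ? oftype(float(x), NaN) : sqrt(x)
log_or_nan(x::Real) = x < zero(x) ? oftype(float(x), NaN) : log(x)

println(sqrt_or_nan(4.0))    # regular argument: same result as sqrt
println(sqrt_or_nan(-1.0))   # negative argument: NaN instead of an exception
println(log_or_nan(-1.0f0))  # Float32 input keeps its type
```

With such functions, an invalid state shows up as `NaN` in the result rather than as an exception raised in the middle of a loop.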

DanielDoehring added 6 commits December 12, 2023 17:24

Introduce NaNMath for unsafe sqrt and log

562d79e

performance measurements

291b916

implement log myself

0b68235

Try out different log implementation

dc839f7

remove NaNMath, own implementation

9f3be79

remove unrelated

f6de171

DanielDoehring commented Dec 18, 2023

src/equations/compressible_euler_2d.jl (outdated, resolved)

DanielDoehring added 4 commits December 18, 2023 15:48

Update src/equations/compressible_euler_2d.jl

829c992

Merge branch 'main' into NaNMath

7ee3742

NaNSqrt for quasi 1d CEE

83e8b1e

fmt

9931e04

DanielDoehring commented Dec 18, 2023

src/auxiliary/math.jl (outdated, resolved)

DanielDoehring commented Dec 18, 2023

src/auxiliary/math.jl (outdated, resolved)

Update src/auxiliary/math.jl

72fdb4a

DanielDoehring commented Dec 18, 2023

src/auxiliary/math.jl (outdated, resolved)

Update src/auxiliary/math.jl

ea9d394

DanielDoehring commented Dec 18, 2023

src/auxiliary/math.jl (outdated, resolved)

Update src/auxiliary/math.jl

8941e03

ranocha requested changes Dec 18, 2023

src/auxiliary/math.jl (outdated, resolved)

src/solvers/dgsem_tree/dg_2d_compressible_euler.jl (outdated, resolved)

ranocha added the breaking label Dec 18, 2023

DanielDoehring added 2 commits December 19, 2023 12:17

for comparison

cac51a3

Merge branch 'NaNMath' of github.com:DanielDoehring/Trixi.jl into NaNMath

e9e633b

ranocha reviewed Dec 19, 2023

src/auxiliary/math.jl (outdated, resolved)

src/auxiliary/math.jl (outdated, resolved)

DanielDoehring and others added 2 commits December 19, 2023 17:11

Update src/auxiliary/math.jl

ead190b

Co-authored-by: Hendrik Ranocha <ranocha@users.noreply.github.com>

Update src/auxiliary/math.jl

d1131e6

Co-authored-by: Hendrik Ranocha <ranocha@users.noreply.github.com>

DanielDoehring changed the title ~~Own sqrt and log returning NaN for correct multi-thread behaviour~~ Own sqrt and log returning NaN for "correct" multi-thread behaviour Jan 9, 2024

Merge branch 'main' into NaNMath

65d1b6d

fix benchmarks configuration

a920e51

DanielDoehring added 2 commits February 7, 2024 09:07

Merge branch 'main' into NaNMath

e6cf76b

Merge branch 'main' into NaNMath

2ca4984

ranocha and others added 2 commits February 21, 2024 11:06

Merge branch 'main' into NaNMath

fef217d

Merge branch 'main' into NaNMath

74d9566

JoshuaLampert reviewed Feb 22, 2024

Project.toml (resolved)

ranocha added 2 commits February 22, 2024 13:53

skip UUIDs in downgrade CI job

c5eee35

Merge branch 'main' into NaNMath

6c17cc8

ranocha reviewed Feb 22, 2024

src/auxiliary/math.jl (outdated, resolved)

Update src/auxiliary/math.jl

3d0e108

Co-authored-by: Hendrik Ranocha <ranocha@users.noreply.github.com>

JoshuaLampert reviewed Feb 22, 2024

Project.toml (outdated, resolved)

DanielDoehring and others added 2 commits February 22, 2024 14:43

Update Project.toml

fc22360

Co-authored-by: Joshua Lampert <51029046+JoshuaLampert@users.noreply.github.com>

Merge branch 'main' into NaNMath

0229017

ranocha approved these changes Feb 22, 2024

ranocha merged commit 029ddea into trixi-framework:main Feb 23, 2024
26 of 34 checks passed

DanielDoehring deleted the NaNMath branch February 23, 2024 08:25

DanielDoehring mentioned this pull request Feb 25, 2024

NaN check callback #1854

Closed
