Own sqrt and log returning NaN for "correct" multi-thread behaviour #1781

Merged
merged 75 commits into from
Feb 23, 2024


DanielDoehring
Contributor

Motivation: See #1766

Inspiration for implementation: https://discourse.julialang.org/t/fastest-sqrt-and-log-with-negative-check/107575

For the moment, I replaced only those sqrt and log calls whose argument can turn negative. I am not sure whether we want to keep the custom implementation sqrt_, even if it really is faster (for whatever reason).
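For context, Base's sqrt and log throw a DomainError for negative real arguments, whereas the replacements discussed here return NaN. A minimal sketch of that difference, using the hypothetical names nan_sqrt and nan_log (these are assumptions for illustration, not the names or the implementation used in this PR), could look like this:

```julia
# Hypothetical helpers for illustration; not the implementation from this PR.
# They return NaN of the argument's own floating-point type instead of throwing.
nan_sqrt(x::AbstractFloat) = x < zero(x) ? oftype(x, NaN) : sqrt(x)
nan_log(x::AbstractFloat) = x < zero(x) ? oftype(x, NaN) : log(x)

# Base.sqrt throws a DomainError for a negative real argument.
threw = try
    sqrt(-1.0)
    false
catch err
    err isa DomainError
end

println("Base.sqrt(-1.0) threw a DomainError: ", threw)
println("nan_sqrt(-1.0)  = ", nan_sqrt(-1.0))
println("nan_log(-1.0f0) = ", nan_log(-1.0f0))
println("nan_sqrt(4.0)   = ", nan_sqrt(4.0))
```

The explicit comparison costs one branch per call, which is why the timings that follow check that the overall run time does not suffer.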

Making sure we do not lose (too much) performance:

Example derived from examples/tree_2d_dgsem/elixir_euler_blast_wave_amr.jl with surface_flux = flux_hllc :

Main:

 ──────────────────────────────────────────────────────────────────────────────────────
               Trixi.jl                       Time                    Allocations      
                                     ───────────────────────   ────────────────────────
          Tot / % measured:               66.3s /  97.3%           1.42GiB /  98.7%    

 Section                     ncalls     time    %tot     avg     alloc    %tot      avg
 ──────────────────────────────────────────────────────────────────────────────────────
 rhs!                         9.28k    60.5s   93.8%  6.52ms    183KiB    0.0%    20.2B
   volume integral            9.28k    47.3s   73.3%  5.10ms    174KiB    0.0%    19.2B
     blended DG-FV            9.28k    40.5s   62.7%  4.36ms     0.00B    0.0%    0.00B
     pure DG                  9.28k    5.43s    8.4%   585μs     0.00B    0.0%    0.00B
     blending factors         9.28k    1.27s    2.0%   137μs   64.1KiB    0.0%    7.08B
     ~volume integral~        9.28k    144ms    0.2%  15.5μs    110KiB    0.0%    12.1B
   interface flux             9.28k    7.97s   12.4%   859μs     0.00B    0.0%    0.00B
   mortar flux                9.28k    1.69s    2.6%   183μs     0.00B    0.0%    0.00B
   surface integral           9.28k    1.36s    2.1%   147μs     0.00B    0.0%    0.00B
   prolong2interfaces         9.28k    1.18s    1.8%   127μs     0.00B    0.0%    0.00B
   prolong2mortars            9.28k    367ms    0.6%  39.6μs     0.00B    0.0%    0.00B
   Jacobian                   9.28k    352ms    0.5%  37.9μs     0.00B    0.0%    0.00B
   reset ∂u/∂t                9.28k    261ms    0.4%  28.1μs     0.00B    0.0%    0.00B
   ~rhs!~                     9.28k   23.3ms    0.0%  2.51μs   9.33KiB    0.0%    1.03B
   prolong2boundaries         9.28k   1.92ms    0.0%   207ns     0.00B    0.0%    0.00B
   boundary flux              9.28k    195μs    0.0%  21.0ns     0.00B    0.0%    0.00B
   source terms               9.28k    175μs    0.0%  18.9ns     0.00B    0.0%    0.00B
 AMR                            371    3.75s    5.8%  10.1ms   1.40GiB  100.0%  3.88MiB
   refine                       371    1.87s    2.9%  5.03ms    472MiB   32.8%  1.27MiB
     mesh                       364    1.63s    2.5%  4.47ms   6.26MiB    0.4%  17.6KiB
       refine_unbalanced!       364    1.57s    2.4%  4.32ms    262KiB    0.0%     738B
       rebalance!               480   48.3ms    0.1%   101μs   1.57MiB    0.1%  3.34KiB
       ~mesh~                   364   6.23ms    0.0%  17.1μs   4.44MiB    0.3%  12.5KiB
     solver                     364    238ms    0.4%   653μs    465MiB   32.3%  1.28MiB
     ~refine~                   371   1.61ms    0.0%  4.34μs    713KiB    0.0%  1.92KiB
   coarsen                      371    1.82s    2.8%  4.89ms    937MiB   65.1%  2.52MiB
     mesh                       371    1.49s    2.3%  4.03ms   2.84MiB    0.2%  7.84KiB
     solver                     371    223ms    0.3%   601μs    511MiB   35.5%  1.38MiB
     ~coarsen~                  371   98.8ms    0.2%   266μs    423MiB   29.4%  1.14MiB
   indicator                    371   59.6ms    0.1%   161μs   13.6MiB    0.9%  37.6KiB
   ~AMR~                        371   11.7ms    0.0%  31.6μs   15.9MiB    1.1%  43.9KiB
 calculate dt                 1.86k    260ms    0.4%   140μs     0.00B    0.0%    0.00B
 initial condition AMR            1    365μs    0.0%   365μs    260KiB    0.0%   260KiB
   AMR                            1    364μs    0.0%   364μs    259KiB    0.0%   259KiB
     indicator                    1    312μs    0.0%   312μs    128KiB    0.0%   128KiB
     ~AMR~                        1   51.0μs    0.0%  51.0μs    131KiB    0.0%   131KiB
     coarsen                      1    266ns    0.0%   266ns     64.0B    0.0%    64.0B
     refine                       1    154ns    0.0%   154ns     64.0B    0.0%    64.0B
   ~initial condition AMR~        1   1.11μs    0.0%  1.11μs      752B    0.0%     752B
 ──────────────────────────────────────────────────────────────────────────────────────

NaNSqrt & NaNLog:

 ──────────────────────────────────────────────────────────────────────────────────────
               Trixi.jl                       Time                    Allocations      
                                     ───────────────────────   ────────────────────────
          Tot / % measured:               66.1s /  97.3%           1.42GiB /  98.7%    

 Section                     ncalls     time    %tot     avg     alloc    %tot      avg
 ──────────────────────────────────────────────────────────────────────────────────────
 rhs!                         9.28k    60.3s   93.8%  6.50ms    183KiB    0.0%    20.2B
   volume integral            9.28k    47.3s   73.6%  5.10ms    174KiB    0.0%    19.2B
     blended DG-FV            9.28k    40.5s   63.0%  4.36ms     0.00B    0.0%    0.00B
     pure DG                  9.28k    5.43s    8.4%   585μs     0.00B    0.0%    0.00B
     blending factors         9.28k    1.27s    2.0%   137μs   64.1KiB    0.0%    7.08B
     ~volume integral~        9.28k    143ms    0.2%  15.4μs    110KiB    0.0%    12.1B
   interface flux             9.28k    7.77s   12.1%   837μs     0.00B    0.0%    0.00B
   mortar flux                9.28k    1.65s    2.6%   178μs     0.00B    0.0%    0.00B
   surface integral           9.28k    1.38s    2.1%   149μs     0.00B    0.0%    0.00B
   prolong2interfaces         9.28k    1.17s    1.8%   126μs     0.00B    0.0%    0.00B
   prolong2mortars            9.28k    367ms    0.6%  39.5μs     0.00B    0.0%    0.00B
   Jacobian                   9.28k    356ms    0.6%  38.4μs     0.00B    0.0%    0.00B
   reset ∂u/∂t                9.28k    262ms    0.4%  28.2μs     0.00B    0.0%    0.00B
   ~rhs!~                     9.28k   20.1ms    0.0%  2.17μs   9.33KiB    0.0%    1.03B
   prolong2boundaries         9.28k   1.61ms    0.0%   173ns     0.00B    0.0%    0.00B
   boundary flux              9.28k    326μs    0.0%  35.1ns     0.00B    0.0%    0.00B
   source terms               9.28k    174μs    0.0%  18.7ns     0.00B    0.0%    0.00B
 AMR                            371    3.74s    5.8%  10.1ms   1.40GiB  100.0%  3.88MiB
   refine                       371    1.85s    2.9%  5.00ms    472MiB   32.8%  1.27MiB
     mesh                       364    1.60s    2.5%  4.40ms   6.26MiB    0.4%  17.6KiB
       refine_unbalanced!       364    1.55s    2.4%  4.26ms    262KiB    0.0%     738B
       rebalance!               480   47.0ms    0.1%  98.0μs   1.57MiB    0.1%  3.34KiB
       ~mesh~                   364   6.12ms    0.0%  16.8μs   4.44MiB    0.3%  12.5KiB
     solver                     364    250ms    0.4%   687μs    465MiB   32.3%  1.28MiB
     ~refine~                   371   1.67ms    0.0%  4.49μs    713KiB    0.0%  1.92KiB
   coarsen                      371    1.81s    2.8%  4.89ms    937MiB   65.1%  2.52MiB
     mesh                       371    1.47s    2.3%  3.96ms   2.84MiB    0.2%  7.84KiB
     solver                     371    263ms    0.4%   708μs    511MiB   35.5%  1.38MiB
     ~coarsen~                  371   83.4ms    0.1%   225μs    423MiB   29.4%  1.14MiB
   indicator                    371   59.7ms    0.1%   161μs   13.6MiB    0.9%  37.6KiB
   ~AMR~                        371   13.8ms    0.0%  37.3μs   15.9MiB    1.1%  43.9KiB
 calculate dt                 1.86k    258ms    0.4%   139μs     0.00B    0.0%    0.00B
 initial condition AMR            1    341μs    0.0%   341μs    260KiB    0.0%   260KiB
   AMR                            1    340μs    0.0%   340μs    259KiB    0.0%   259KiB
     indicator                    1    287μs    0.0%   287μs    128KiB    0.0%   128KiB
     ~AMR~                        1   52.7μs    0.0%  52.7μs    131KiB    0.0%   131KiB
     refine                       1    252ns    0.0%   252ns     64.0B    0.0%    64.0B
     coarsen                      1    191ns    0.0%   191ns     64.0B    0.0%    64.0B
   ~initial condition AMR~        1    947ns    0.0%   947ns      752B    0.0%     752B
 ──────────────────────────────────────────────────────────────────────────────────────

Verification using BenchmarkTools (I repeated these a couple of times):

julia> x = rand(10^4)
julia> @btime sqrt.(x)
 12.788 μs (4 allocations: 78.20 KiB)
 
julia> @btime Trixi.sqrt_.(x)
 6.534 μs (4 allocations: 78.20 KiB)
 
julia> @btime log.(x)
 33.395 μs (4 allocations: 78.20 KiB)

julia> @btime Trixi.log_.(x)
 33.763 μs (4 allocations: 78.20 KiB)

Not sure what is going on with sqrt_, but log_ is marginally slower (0.3 to 0.5 μs per 10^4 floats), as one might expect.
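To put these numbers in perspective, the short calculation below converts the single set of @btime timings shown above into a speed-up factor and a per-element overhead. Only the four timings are taken from the output above; the variable names are made up for this sketch.

```julia
# Timings copied from the @btime output above (microseconds, 10^4 elements each).
t_sqrt_base, t_sqrt_custom = 12.788, 6.534
t_log_base, t_log_custom = 33.395, 33.763
n = 10^4

sqrt_speedup = t_sqrt_base / t_sqrt_custom              # > 1 means sqrt_ was faster
log_overhead_us = t_log_custom - t_log_base             # extra time per 10^4 elements
log_overhead_ns_per_elem = 1000 * log_overhead_us / n   # converted to ns per element
log_overhead_percent = 100 * log_overhead_us / t_log_base

println("sqrt_ speed-up factor:        ", round(sqrt_speedup, digits = 2))
println("log_ overhead per 10^4 calls: ", round(log_overhead_us, digits = 3), " μs")
println("log_ overhead per element:    ", round(log_overhead_ns_per_elem, digits = 4), " ns")
println("log_ relative overhead:       ", round(log_overhead_percent, digits = 1), " %")
```

For this particular run the log_ overhead falls inside the 0.3 to 0.5 μs range quoted above.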

Contributor

Review checklist

This checklist is meant to assist creators of PRs (to let them know what reviewers will typically look for) and reviewers (to guide them in a structured review process). Items do not need to be checked explicitly for a PR to be eligible for merging.

Purpose and scope

  • The PR has a single goal that is clear from the PR title and/or description.
  • All code changes represent a single set of modifications that logically belong together.
  • No more than 500 lines of code are changed or there is no obvious way to split the PR into multiple PRs.

Code quality

  • The code can be understood easily.
  • Newly introduced names for variables etc. are self-descriptive and consistent with existing naming conventions.
  • There are no redundancies that can be removed by simple modularization/refactoring.
  • There are no leftover debug statements or commented code sections.
  • The code adheres to our conventions and style guide, and to the Julia guidelines.

Documentation

  • New functions and types are documented with a docstring or top-level comment.
  • Relevant publications are referenced in docstrings (see example for formatting).
  • Inline comments are used to document longer or unusual code sections.
  • Comments describe intent ("why?") and not just functionality ("what?").
  • If the PR introduces a significant change or new feature, it is documented in NEWS.md.

Testing

  • The PR passes all tests.
  • New or modified lines of code are covered by tests.
  • New or modified tests run in less than 10 seconds.

Performance

  • There are no type instabilities or memory allocations in performance-critical parts.
  • If the PR intent is to improve performance, before/after time measurements are posted in the PR.

Verification

  • The correctness of the code was verified using appropriate tests.
  • If new equations/methods are added, a convergence test has been run and the results
    are posted in the PR.

Created with ❤️ by the Trixi.jl community.

Member

@ranocha ranocha left a comment

Thanks a lot for the initial investigation!

  • Could you please report some performance numbers from elixirs with and without bounds checking?
  • How do these full elixir runs vary when executing them multiple times?
  • Could you please post some benchmarks like @benchmark Trixi.rhs!(...)?
  • Benchmarks like x = rand(10^4); @btime sqrt.(x) are not really meaningful for us since we don't perform such uniform operations on vectors. Benchmarking Trixi.rhs! would be better, I think.
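The last point can be illustrated with a toy kernel in which sqrt is called once per node, in between other arithmetic and memory accesses. The function max_wave_speed and its data are hypothetical stand-ins and not Trixi.jl code. The timing uses only Base's @elapsed, so it is much coarser than a BenchmarkTools measurement.

```julia
# Hypothetical stand-in for a pointwise solver kernel; not Trixi.jl code.
# sqrt appears once per node, surrounded by other arithmetic and memory
# traffic, so its cost is only a fraction of the loop body.
function max_wave_speed(rho, v, p; gamma = 1.4)
    lambda_max = zero(eltype(rho))
    for i in eachindex(rho, v, p)
        c = sqrt(gamma * p[i] / rho[i])   # local sound speed of an ideal gas
        lambda_max = max(lambda_max, abs(v[i]) + c)
    end
    return lambda_max
end

n = 10^4
rho = 1.0 .+ rand(n)   # strictly positive densities
v = randn(n)
p = 1.0 .+ rand(n)     # strictly positive pressures

max_wave_speed(rho, v, p)                      # warm-up call to exclude compilation
elapsed = @elapsed max_wave_speed(rho, v, p)   # coarse timing with Base only
println("one sweep over ", n, " nodes took ", elapsed, " s")
```

In such a loop, a change in the cost of sqrt shows up only in proportion to its share of the loop body, which is why timing the full right-hand side is the more relevant measurement.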

@DanielDoehring
Contributor Author

Some reports on @benchmark Trixi.rhs!

examples/tree_2d_dgsem/elixir_euler_blast_wave_amr.jl with surface_flux = flux_hllc

Custom implementation:

1 Thread:

BenchmarkTools.Trial: 2000 samples with 5 evaluations.
 Range (min … max):  7.880 ms …  11.458 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     8.562 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   8.565 ms ± 180.060 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                              ▁▃▅▇▇█▅▂
  ▂▂▁▂▂▁▂▁▂▂▂▂▂▂▂▂▂▂▂▃▃▃▄▄▄▅▆▇█████████▆▄▄▃▃▃▂▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  7.88 ms         Histogram: frequency by time        9.13 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

4 Threads:

BenchmarkTools.Trial: 2000 samples with 5 evaluations.
 Range (min … max):  2.503 ms …   4.925 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.709 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.741 ms ± 161.206 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

           ▂▇█▇▁
  ▂▂▁▂▂▂▂▃▅█████▇▇▆▅▄▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▁▁▁▁▁▂▂▁▂▂▁▁▂▂▂▁▂▂▂ ▃
  2.5 ms          Histogram: frequency by time         3.5 ms <

 Memory estimate: 3.73 KiB, allocs estimate: 9.

Standard sqrt and log :

1 Thread:

BenchmarkTools.Trial: 2000 samples with 5 evaluations.
 Range (min … max):  8.083 ms …  11.872 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     8.670 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   8.724 ms ± 244.216 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                    ▃▆▆█▅▃ ▁
  ▂▂▁▂▂▂▂▂▂▂▂▂▃▄▄▅▆▇█████████▆▆▄▄▄▃▄▃▃▃▃▃▂▃▂▂▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  8.08 ms         Histogram: frequency by time        9.65 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

4 Threads:

BenchmarkTools.Trial: 2000 samples with 5 evaluations.
 Range (min … max):  2.449 ms …  4.265 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.676 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.686 ms ± 89.104 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                     ▁▆█▆▆▅▄▃
  ▂▁▁▁▂▂▂▁▂▂▂▂▂▂▂▃▃▅▇█████████▆▅▃▂▂▂▂▁▂▂▂▂▂▂▁▂▂▁▁▁▂▂▂▁▂▁▂▂▂▂ ▃
  2.45 ms        Histogram: frequency by time        3.02 ms <

 Memory estimate: 3.73 KiB, allocs estimate: 9.


tree_3d_dgsem/elixir_mhd_ec.jl
with conservative surface flux flux_hlle and initial_refinement_level = 4:

Custom sqrt, log :

1 Thread:

BenchmarkTools.Trial: 1000 samples with 3 evaluations.
 Range (min … max):  50.370 ms … 70.366 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     52.456 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   52.799 ms ±  1.403 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

             ▁▃▇█▆▃▂
  ▂▁▁▂▂▁▂▂▂▃▅███████████▇▅▄▄▃▃▃▃▃▃▂▃▂▂▂▃▂▂▃▂▂▂▂▁▂▂▂▁▂▂▂▁▁▁▂▁▂ ▃
  50.4 ms         Histogram: frequency by time          58 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

8 Threads:

BenchmarkTools.Trial: 1000 samples with 3 evaluations.
 Range (min … max):  14.900 ms …  19.075 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     16.890 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   16.838 ms ± 431.002 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                               ▂▁▄▂▂ ▁▄ ▆█▇▆▅▃▅▁
  ▂▁▁▁▁▁▁▁▁▂▁▂▂▁▂▃▁▁▄▂▄▃▄▄▄▅▄▇██████▇████████████▇▆▆▄▃▂▂▃▃▂▂▂▂ ▄
  14.9 ms         Histogram: frequency by time           18 ms <

 Memory estimate: 1.41 KiB, allocs estimate: 5.

Standard sqrt log:

1 Thread:

BenchmarkTools.Trial: 1000 samples with 3 evaluations.
 Range (min … max):  50.199 ms … 66.471 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     51.948 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   52.097 ms ±  1.086 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

               ▂▃█▇█▄▄
  ▂▂▂▂▂▂▂▂▃▃▄▄▆████████▅▆▄▃▃▃▂▂▂▂▃▂▂▂▁▂▂▂▂▂▂▂▁▁▁▂▂▂▁▁▁▂▁▁▁▁▁▂ ▃
  50.2 ms         Histogram: frequency by time        56.3 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

8 Threads:

BenchmarkTools.Trial: 1000 samples with 3 evaluations.
 Range (min … max):  14.270 ms …  18.180 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     16.018 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   15.950 ms ± 373.664 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                        ▁▁▃▄▄▇▇█▇▃▂▁
  ▂▁▁▁▁▁▁▁▁▁▂▁▂▁▁▂▂▃▃▃▃▃▃▄▅▄▄▄▄▆▅▄▆▅▇▇▆▆█████████████▆▆▅▄▃▃▂▂▃ ▄
  14.3 ms         Histogram: frequency by time         16.7 ms <

 Memory estimate: 1.41 KiB, allocs estimate: 5.

DanielDoehring and others added 2 commits December 19, 2023 17:11
Co-authored-by: Hendrik Ranocha <ranocha@users.noreply.github.com>
Co-authored-by: Hendrik Ranocha <ranocha@users.noreply.github.com>
@DanielDoehring DanielDoehring changed the title Own sqrt and log returning NaN for correct multi-thread behaviour Own sqrt and log returning NaN for "correct" multi-thread behaviour Jan 9, 2024
@DanielDoehring
Contributor Author

DanielDoehring commented Jan 10, 2024

@ranocha Maybe I found something that suits our needs:

As with sqrt_llvm, we could call an LLVM implementation of log via

log_(x::Float64) = ccall("llvm.log.f64", llvmcall, Float64, (Float64, ), x)
log_(x::Float32) = ccall("llvm.log.f32", llvmcall, Float32, (Float32, ), x)

which actually returns NaN or NaN32 when called with a negative argument.
(Taken from JuliaLang/julia#8869 (comment) )

To keep algorithmic differentiation working, we would additionally provide the generic fallback

log_(x::Real) = x < zero(x) ? oftype(x, NaN) : Base.log(x)
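As a quick check of the generic fallback on its own, the sketch below applies it to two number types that the Float64 and Float32 methods do not cover. The fallback is copied from the line above but renamed to the hypothetical log_generic so that it stands alone. Dual-number types from AD packages are expected to take the same path, assuming they subtype Real and can represent NaN.

```julia
# Generic fallback as proposed above, renamed to the hypothetical
# `log_generic` so that the sketch does not depend on the llvmcall methods.
log_generic(x::Real) = x < zero(x) ? oftype(x, NaN) : Base.log(x)

x_big = log_generic(big"-2.0")     # BigFloat input
x_f16 = log_generic(Float16(-2))   # Float16 input

println(typeof(x_big), ": isnan = ", isnan(x_big))
println(typeof(x_f16), ": isnan = ", isnan(x_f16))
println("log_generic(1.0) = ", log_generic(1.0))
```

Because oftype converts NaN to the type of x, the result keeps the input type. The fallback only makes sense for types that can represent NaN: a negative integer argument fails in the conversion instead of returning NaN.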

Repeating the benchmarks from above:

examples/tree_2d_dgsem/elixir_euler_blast_wave_amr.jl with surface_flux = flux_hllc

t0 = tspan[1]
u0 = sol.u[2]
du = similar(u0)

using BenchmarkTools
b = @benchmarkable Trixi.rhs!(du, u0, semi, t0) evals=5 samples=2000 seconds=120
run(b)

Custom sqrt, log :

1 Thread:

BenchmarkTools.Trial: 2000 samples with 5 evaluations.
 Range (min … max):  8.090 ms …  11.248 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     8.781 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   8.819 ms ± 234.715 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                     ▂▄▇█▆▅▃▂▂▂▃▅▃▂▁                           
  ▂▁▂▁▁▁▂▂▃▂▂▃▂▂▂▃▄▆▇████████████████▅▄▄▄▄▃▄▃▃▃▃▃▂▂▂▂▃▂▂▂▂▂▂▂ ▄
  8.09 ms         Histogram: frequency by time        9.64 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

4 Threads:

BenchmarkTools.Trial: 2000 samples with 5 evaluations.
 Range (min … max):  2.368 ms …  13.217 ms  ┊ GC (min … max): 0.00% … 14.48%
 Time  (median):     2.734 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.796 ms ± 361.907 μs  ┊ GC (mean ± σ):  0.03% ±  0.32%

           ▂▅██▁                                               
  ▂▁▂▂▃▃▃▄▇█████▆▄▄▃▃▃▃▂▂▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▁▂▂▂▂▂▂▂▂▁▁▁▂▂ ▃
  2.37 ms         Histogram: frequency by time         4.2 ms <

 Memory estimate: 3.73 KiB, allocs estimate: 9.

Base sqrt, log :

1 thread:

BenchmarkTools.Trial: 2000 samples with 5 evaluations.
 Range (min … max):  8.238 ms …  10.269 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     8.764 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   8.788 ms ± 209.589 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                  ▁▃▅█▆▅▆▆▃▄▆▅▅▄▄▁▁▁                           
  ▁▁▁▁▁▁▁▁▁▃▃▅▅▆█▇███████████████████▇▆▄▂▃▃▃▃▂▂▂▂▂▁▁▂▂▃▁▂▂▁▁▁ ▄
  8.24 ms         Histogram: frequency by time        9.49 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

4 threads:

BenchmarkTools.Trial: 2000 samples with 5 evaluations.
 Range (min … max):  2.454 ms …  13.782 ms  ┊ GC (min … max): 0.00% … 13.19%
 Time  (median):     2.803 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.866 ms ± 349.961 μs  ┊ GC (mean ± σ):  0.03% ±  0.30%

      ▁ ▂▃▄▆▇███▇▆▄▄▂▂▂▁▁▁                                    ▂
  ▅▆▆▇████████████████████▇██▇█▇▆█▆▇▆▇▇▆▇█▆▅▅▅▄▅▅▁▅▅▅▄▆▅▅▄▅▅▅ █
  2.45 ms      Histogram: log(frequency) by time      4.03 ms <

 Memory estimate: 3.73 KiB, allocs estimate: 9.


tree_3d_dgsem/elixir_mhd_ec.jl
with conservative surface flux flux_hlle and initial_refinement_level = 4:

t0 = tspan[1]
u0 = sol.u[2]
du = similar(u0)

using BenchmarkTools
b = @benchmarkable Trixi.rhs!(du, u0, semi, t0) evals=5 samples=2000 seconds=120
run(b)

Custom sqrt, log :

1 thread:

BenchmarkTools.Trial: 1000 samples with 3 evaluations.
 Range (min … max):  51.155 ms … 67.260 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     53.327 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   53.673 ms ±  1.674 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

       ▁▃▅▅█▇▆▂▁▄▅ ▂▂                                          
  ▂▃▄▄▇██████████████▇▆▇▅▆▄▅▄▃▃▃▃▂▃▂▁▃▂▂▂▃▂▃▁▂▂▁▁▃▁▁▁▁▁▂▁▁▁▂▂ ▄
  51.2 ms         Histogram: frequency by time        61.1 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.
 
 8 threads: 
 
BenchmarkTools.Trial: 1000 samples with 3 evaluations.
 Range (min … max):  14.807 ms …  19.546 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     16.817 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   16.778 ms ± 373.889 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                      ▂▂ ▂▃▆▃█▆▆▂▂              
  ▂▁▂▁▁▂▁▁▁▁▁▁▁▂▂▂▂▃▂▃▁▃▂▂▂▃▃▂▃▆▅▅▅▆█▇██▇█████████▇▇▅▄▄▄▃▃▃▂▃▂ ▄
  14.8 ms         Histogram: frequency by time         17.7 ms <

 Memory estimate: 1.41 KiB, allocs estimate: 5.
 

Base sqrt, log :

1 thread:

BenchmarkTools.Trial: 1000 samples with 3 evaluations.
 Range (min … max):  50.650 ms … 66.282 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     53.364 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   53.588 ms ±  1.332 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

              ▃▃▇█▆▅▇▄▃▁                                       
  ▂▁▁▁▂▁▃▂▂▄▅▇███████████▇▇▆▅▅▃▄▃▄▃▃▃▂▂▂▂▂▁▂▁▁▂▁▁▁▁▁▂▁▁▁▂▁▁▁▂ ▄
  50.6 ms         Histogram: frequency by time        59.6 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

8 threads:

BenchmarkTools.Trial: 1000 samples with 3 evaluations.
 Range (min … max):  15.194 ms …  19.291 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     16.770 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   16.745 ms ± 412.283 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                              ▁▂▃▃▅█▆▄▄▃▂                       
  ▂▂▁▁▁▂▃▂▁▁▂▃▃▃▃▃▃▃▄▄▄▃▆▄▅▇▇█████████████▆▇▄▃▃▃▃▃▂▃▂▂▂▁▂▁▂▂▂▂ ▄
  15.2 ms         Histogram: frequency by time           18 ms <

 Memory estimate: 1.41 KiB, allocs estimate: 5.

These look almost identical to me, which I would consider a success.

@DanielDoehring
Contributor Author

@DanielDoehring Could you please run the automated benchmarks on this branch as described in https://trixi-framework.github.io/Trixi.jl/stable/performance/#Automated-benchmarking? You should make sure to use a workstation for this that doesn't run other expensive stuff. And you should be prepared to wait a few hours until everything finishes.

I'll see what I can do. Unfortunately, for reliable performance measurements I would need to block an entire node of our compute cluster for multiple hours, and such a job might take quite some time to get scheduled. Alternatively, I can run this as a non-exclusive job, at the expense of possibly less reliable results.

@DanielDoehring
Contributor Author

@DanielDoehring Could you please run the automated benchmarks on this branch as described in https://trixi-framework.github.io/Trixi.jl/stable/performance/#Automated-benchmarking? You should make sure to use a workstation for this that doesn't run other expensive stuff. And you should be prepared to wait a few hours until everything finishes.

Unfortunately, I get an error when (presumably) executing the benchmarks of the main branch:

ERROR: ArgumentError: Package Trixi not found in current path.
- Run `import Pkg; Pkg.add("Trixi")` to install the Trixi package.
Stacktrace:
 [1] macro expansion
   @ Base ./loading.jl:1766 [inlined]
 [2] macro expansion
   @ Base ./lock.jl:267 [inlined]
 [3] __require(into::Module, mod::Symbol)
   @ Base ./loading.jl:1747
 [4] #invoke_in_world#3
   @ Base ./essentials.jl:921 [inlined]
 [5] invoke_in_world
   @ Base ./essentials.jl:918 [inlined]
 [6] require(into::Module, mod::Symbol)
   @ Base ./loading.jl:1740

with standard output:

PkgBenchmark: creating benchmark tuning file /rwthfs/rz/cluster/home/git/Trixi.jl/benchmark/tune.json...
(1/28) tuning "tree_2d_dgsem/elixir_euler_vortex_mortar.jl"...
  (1/4) tuning "p3_rhs!"...
  done (took 15.714234339 seconds)
  (2/4) tuning "p7_rhs!"...
  done (took 76.10260231 seconds)
  (3/4) tuning "p7_analysis"...
  done (took 21.453400237 seconds)
  (4/4) tuning "p3_analysis"...
  done (took 15.056647826 seconds)
done (took 131.906980299 seconds)
(2/28) tuning "tree_3d_dgsem/elixir_mhd_ec.jl"...
  (1/4) tuning "p3_rhs!"...
  done (took 29.225925116 seconds)
  (2/4) tuning "p7_rhs!"...
  done (took 375.223802844 seconds)
  (3/4) tuning "p7_analysis"...
  done (took 61.658343698 seconds)
  (4/4) tuning "p3_analysis"...
  done (took 16.326966469 seconds)
done (took 485.785440819 seconds)
(3/28) tuning "structured_3d_dgsem/elixir_euler_ec.jl"...
  (1/4) tuning "p3_rhs!"...
  done (took 21.638970102 seconds)
  (2/4) tuning "p7_rhs!"...
  done (took 145.468492825 seconds)
  (3/4) tuning "p7_analysis"...
  done (took 49.897601131 seconds)
  (4/4) tuning "p3_analysis"...
  done (took 14.41875533 seconds)
done (took 235.043929672 seconds)
(4/28) tuning "tree_3d_dgsem/elixir_euler_ec.jl"...
  (1/4) tuning "p3_rhs!"...
  done (took 84.156037201 seconds)
  (2/4) tuning "p7_rhs!"...
  done (took 1140.237702335 seconds)
  (3/4) tuning "p7_analysis"...
  done (took 317.161217237 seconds)
  (4/4) tuning "p3_analysis"...
  done (took 38.72742346 seconds)
done (took 1583.759410713 seconds)
(5/28) tuning "unstructured_2d_dgsem/elixir_euler_wall_bc.jl"...
  (1/4) tuning "p3_rhs!"...
  done (took 12.216714106 seconds)
  (2/4) tuning "p7_rhs!"...
  done (took 22.011360655 seconds)
  (3/4) tuning "p7_analysis"...
  done (took 12.402302703 seconds)
  (4/4) tuning "p3_analysis"...
  done (took 11.527143089 seconds)
done (took 61.965998148 seconds)
(6/28) tuning "tree_3d_dgsem/elixir_euler_shockcapturing.jl"...
  (1/4) tuning "p3_rhs!"...
  done (took 91.256684659 seconds)
  (2/4) tuning "p7_rhs!"...
  done (took 1200.799535653 seconds)
  (3/4) tuning "p7_analysis"...
  done (took 268.97470469 seconds)
  (4/4) tuning "p3_analysis"...
  done (took 32.808439139 seconds)
done (took 1597.80109128 seconds)
(7/28) tuning "tree_2d_dgsem/elixir_advection_amr_nonperiodic.jl"...
  (1/4) tuning "p3_rhs!"...
  done (took 10.464582383 seconds)
  (2/4) tuning "p7_rhs!"...
  done (took 16.305267609 seconds)
  (3/4) tuning "p7_analysis"...
  done (took 18.016344489 seconds)
  (4/4) tuning "p3_analysis"...
  done (took 12.809092958 seconds)
done (took 60.668745842 seconds)
(8/28) tuning "benchmark/elixir_2d_euler_vortex_p4est.jl"...
  (1/4) tuning "p3_rhs!"...
  done (took 11.264458026 seconds)
  (2/4) tuning "p7_rhs!"...
  done (took 30.218253739 seconds)
  (3/4) tuning "p7_analysis"...
  done (took 19.14552961 seconds)
  (4/4) tuning "p3_analysis"...
  done (took 13.664613786 seconds)
done (took 78.138848087 seconds)
(9/28) tuning "tree_3d_dgsem/elixir_advection_extended.jl"...
  (1/4) tuning "p3_rhs!"...
  done (took 18.862721663 seconds)
  (2/4) tuning "p7_rhs!"...
  done (took 204.710356465 seconds)
  (3/4) tuning "p7_analysis"...
  done (took 155.833574637 seconds)
  (4/4) tuning "p3_analysis"...
  done (took 21.683521389 seconds)
done (took 404.458885476 seconds)
(10/28) tuning "structured_2d_dgsem/elixir_advection_extended.jl"...
  (1/4) tuning "p3_rhs!"...
  done (took 12.846023065 seconds)
  (2/4) tuning "p7_rhs!"...
  done (took 22.663477303 seconds)
  (3/4) tuning "p7_analysis"...
  done (took 18.438284368 seconds)
  (4/4) tuning "p3_analysis"...
  done (took 15.56976736 seconds)
done (took 72.939499736 seconds)
(11/28) tuning "tree_2d_dgsem/elixir_advection_extended.jl"...
  (1/4) tuning "p3_rhs!"...
  done (took 9.233539848 seconds)
  (2/4) tuning "p7_rhs!"...
  done (took 17.373932421 seconds)
  (3/4) tuning "p7_analysis"...
  done (took 11.447854133 seconds)
  (4/4) tuning "p3_analysis"...
  done (took 12.105896462 seconds)
done (took 53.381192204 seconds)
(12/28) tuning "tree_2d_dgsem/elixir_euler_ec.jl"...
  (1/4) tuning "p3_rhs!"...
  done (took 18.572986465 seconds)
  (2/4) tuning "p7_rhs!"...
  done (took 108.93184312 seconds)
  (3/4) tuning "p7_analysis"...
  done (took 27.673408382 seconds)
  (4/4) tuning "p3_analysis"...
  done (took 19.16687415 seconds)
done (took 177.845071313 seconds)
(13/28) tuning "structured_2d_dgsem/elixir_euler_ec.jl"...
  (1/4) tuning "p3_rhs!"...
  done (took 9.990776535 seconds)
  (2/4) tuning "p7_rhs!"...
  done (took 29.410013647 seconds)
  (3/4) tuning "p7_analysis"...
  done (took 17.955411267 seconds)
  (4/4) tuning "p3_analysis"...
  done (took 12.113499793 seconds)
done (took 72.332992949 seconds)
(14/28) tuning "latency"...
  (1/5) tuning "polydeg_3"...
PkgBenchmark: Running benchmarks...

The script I execute is

using PkgBenchmark, Trixi

results = judge(Trixi,
             BenchmarkConfig(juliacmd=`$(Base.julia_cmd()) --project=. --check-bounds=no --threads=2`), # target
             BenchmarkConfig(juliacmd=`$(Base.julia_cmd()) --project=. --check-bounds=no --threads=2`, id="main") # baseline
       )

#export_markdown(pkgdir(Trixi, "benchmark", "results.md"), results)
export_markdown("results.md", results)

while I also tried

using PkgBenchmark, Trixi

results = judge(Trixi,
             BenchmarkConfig(juliacmd=`$(Base.julia_cmd()) --check-bounds=no --threads=2`), # target
             BenchmarkConfig(juliacmd=`$(Base.julia_cmd()) --check-bounds=no --threads=2`, id="main") # baseline
       )

#export_markdown(pkgdir(Trixi, "benchmark", "results.md"), results)
export_markdown("results.md", results)

I installed Trixi in dev mode from my fork and switched to the branch to be tested.

@ranocha
Member

ranocha commented Feb 6, 2024

Did you install the development version of Trixi.jl also in the benchmark project as done in

- name: Install dependencies
  run: julia --project=benchmark/ -e 'using Pkg; Pkg.develop(PackageSpec(path=pwd())); Pkg.instantiate()'
- name: Run benchmarks
  run: julia --project=benchmark/ --color=yes benchmark/run_benchmarks.jl

in our GitHub action? I think the docs should be improved to describe this step in more detail (or at all 😅).

@DanielDoehring
Contributor Author

Did you install the development version of Trixi.jl also in the benchmark project as done in

- name: Install dependencies
  run: julia --project=benchmark/ -e 'using Pkg; Pkg.develop(PackageSpec(path=pwd())); Pkg.instantiate()'
- name: Run benchmarks
  run: julia --project=benchmark/ --color=yes benchmark/run_benchmarks.jl

No - I will give this a try 👍

@ranocha
Member

ranocha commented Feb 7, 2024

I just found a problem in the benchmarks config. You need to update your local main branch and the branch of this PR.
You should also run it with Julia 1.9 or delete the --check-bounds=no specification for Julia 1.10.

@ranocha
Member

ranocha commented Feb 7, 2024

I'm running some stuff locally. It looks like the benchmarks setup is a bit bit-rotten...

@DanielDoehring
Contributor Author

Did you install the development version of Trixi.jl also in the benchmark project as done in

- name: Install dependencies
  run: julia --project=benchmark/ -e 'using Pkg; Pkg.develop(PackageSpec(path=pwd())); Pkg.instantiate()'
- name: Run benchmarks
  run: julia --project=benchmark/ --color=yes benchmark/run_benchmarks.jl

in our GitHub action? I think the docs should be improved to describe this step in more detail (or at all 😅).

Hm, I still get the

ERROR: ArgumentError: Package Trixi not found in current path.
- Run `import Pkg; Pkg.add("Trixi")` to install the Trixi package.

error, even after instantiating the package in the benchmark directory on both the main and NaNMath branches.

@ranocha
Member

ranocha commented Feb 21, 2024

Here is what I get on one of our servers:

1 thread

| ID | time ratio | memory ratio |
|---|---|---|
| `["benchmark/elixir_2d_euler_vortex_tree.jl", "p3_rhs!"]` | 0.95 (5%) ✅ | 1.00 (1%) |
| `["p4est_2d_dgsem/elixir_advection_extended.jl", "p3_rhs!"]` | 0.93 (5%) ✅ | 1.00 (1%) |
| `["structured_2d_dgsem/elixir_mhd_ec.jl", "p3_rhs!"]` | 0.95 (5%) ✅ | 1.00 (1%) |
| `["structured_3d_dgsem/elixir_mhd_ec.jl", "p7_analysis"]` | 0.95 (5%) ✅ | 1.00 (1%) |
| `["tree_2d_dgsem/elixir_euler_vortex_mortar.jl", "p3_rhs!"]` | 0.94 (5%) ✅ | 1.00 (1%) |
| `["tree_2d_dgsem/elixir_mhd_ec.jl", "p7_analysis"]` | 0.94 (5%) ✅ | 1.00 (1%) |
| `["tree_3d_dgsem/elixir_mhd_ec.jl", "p7_analysis"]` | 0.93 (5%) ✅ | 1.00 (1%) |

2 threads

| ID | time ratio | memory ratio |
|---|---|---|
| `["benchmark/elixir_2d_euler_vortex_structured.jl", "p3_rhs!"]` | 0.91 (5%) ✅ | 1.00 (1%) |
| `["benchmark/elixir_2d_euler_vortex_unstructured.jl", "p3_rhs!"]` | 1.17 (5%) ❌ | 1.00 (1%) |
| `["structured_2d_dgsem/elixir_euler_source_terms_nonperiodic.jl", "p3_rhs!"]` | 1.09 (5%) ❌ | 1.00 (1%) |
| `["structured_3d_dgsem/elixir_advection_nonperiodic_curved.jl", "p3_rhs!"]` | 0.88 (5%) ✅ | 1.00 (1%) |
| `["structured_3d_dgsem/elixir_euler_source_terms_nonperiodic_curved.jl", "p3_rhs!"]` | 1.10 (5%) ❌ | 1.00 (1%) |
| `["structured_3d_dgsem/elixir_mhd_ec.jl", "p3_analysis"]` | 0.95 (5%) ✅ | 1.00 (1%) |
| `["structured_3d_dgsem/elixir_mhd_ec.jl", "p7_analysis"]` | 0.94 (5%) ✅ | 1.00 (1%) |
| `["tree_2d_dgsem/elixir_advection_amr_nonperiodic.jl", "p3_rhs!"]` | 1.43 (5%) ❌ | 1.00 (1%) |
| `["tree_2d_dgsem/elixir_advection_amr_nonperiodic.jl", "p7_analysis"]` | 0.89 (5%) ✅ | 1.00 (1%) |
| `["tree_2d_dgsem/elixir_euler_vortex_mortar_shockcapturing.jl", "p3_rhs!"]` | 0.95 (5%) ✅ | 1.00 (1%) |
| `["tree_2d_dgsem/elixir_mhd_ec.jl", "p7_analysis"]` | 0.94 (5%) ✅ | 1.00 (1%) |
| `["tree_3d_dgsem/elixir_advection_extended.jl", "p3_rhs!"]` | 0.88 (5%) ✅ | 1.00 (1%) |
| `["tree_3d_dgsem/elixir_mhd_ec.jl", "p3_analysis"]` | 0.95 (5%) ✅ | 1.00 (1%) |
| `["tree_3d_dgsem/elixir_mhd_ec.jl", "p7_analysis"]` | 0.94 (5%) ✅ | 1.00 (1%) |

It would be interesting to see results from another server/run.

@DanielDoehring
Contributor Author

1 Thread

Results

A ratio greater than 1.0 denotes a possible regression (marked with ❌), while a ratio less
than 1.0 denotes a possible improvement (marked with ✅). Only significant results - results
that indicate possible regressions or improvements - are shown below (thus, an empty table means that all
benchmark results remained invariant between builds).
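The classification rule described above can be sketched as follows. `classify_ratio` is a hypothetical helper written for illustration, not a function from PkgBenchmark; its default tolerance of 5% mirrors the "(5%)" shown next to each time ratio.

```julia
# Sketch of the rule from the report header (hypothetical helper, assumed
# tolerance of 5% taken from the "(5%)" annotation in the tables).
function classify_ratio(ratio::Real; tolerance::Real = 0.05)
    if ratio > 1 + tolerance
        return :regression      # would be marked with ❌
    elseif ratio < 1 - tolerance
        return :improvement     # would be marked with ✅
    else
        return :invariant       # would not be listed in the table
    end
end

for r in (1.17, 0.88, 1.02)
    println(r, " => ", classify_ratio(r))
end
```

The ratios in the tables are printed with two decimals, so an entry shown as 0.95 presumably lies just below the threshold before rounding.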

| ID | time ratio | memory ratio |
|---|---|---|
| `["benchmark/elixir_2d_euler_vortex_p4est.jl", "p3_analysis"]` | 0.87 (5%) ✅ | 1.00 (1%) |
| `["benchmark/elixir_2d_euler_vortex_p4est.jl", "p3_rhs!"]` | 0.80 (5%) ✅ | 1.00 (1%) |
| `["benchmark/elixir_2d_euler_vortex_structured.jl", "p3_analysis"]` | 1.06 (5%) ❌ | 1.00 (1%) |
| `["benchmark/elixir_2d_euler_vortex_tree.jl", "p3_rhs!"]` | 0.89 (5%) ✅ | 1.00 (1%) |
| `["benchmark/elixir_2d_euler_vortex_unstructured.jl", "p3_rhs!"]` | 1.06 (5%) ❌ | 1.00 (1%) |
| `["latency", "default_example"]` | 1.05 (5%) ❌ | 1.00 (1%) |
| `["latency", "euler_2d"]` | 1.11 (5%) ❌ | 1.00 (1%) |
| `["latency", "polydeg_3"]` | 0.90 (5%) ✅ | 1.00 (1%) |
| `["latency", "polydeg_7"]` | 1.08 (5%) ❌ | 1.00 (1%) |
| `["p4est_2d_dgsem/elixir_advection_extended.jl", "p7_analysis"]` | 1.24 (5%) ❌ | 1.00 (1%) |
| `["structured_2d_dgsem/elixir_advection_extended.jl", "p7_analysis"]` | 0.94 (5%) ✅ | 1.00 (1%) |
| `["structured_2d_dgsem/elixir_advection_nonperiodic.jl", "p3_analysis"]` | 1.06 (5%) ❌ | 1.00 (1%) |
| `["structured_2d_dgsem/elixir_euler_ec.jl", "p3_analysis"]` | 0.89 (5%) ✅ | 1.00 (1%) |
| `["structured_2d_dgsem/elixir_euler_ec.jl", "p3_rhs!"]` | 0.95 (5%) ✅ | 1.00 (1%) |
| `["structured_2d_dgsem/elixir_euler_ec.jl", "p7_analysis"]` | 0.87 (5%) ✅ | 1.00 (1%) |
| `["structured_2d_dgsem/elixir_euler_source_terms_nonperiodic.jl", "p3_analysis"]` | 1.15 (5%) ❌ | 1.00 (1%) |
| `["structured_2d_dgsem/elixir_euler_source_terms_nonperiodic.jl", "p7_analysis"]` | 1.05 (5%) ❌ | 1.00 (1%) |
| `["structured_2d_dgsem/elixir_mhd_ec.jl", "p3_analysis"]` | 1.10 (5%) ❌ | 1.00 (1%) |
| `["structured_2d_dgsem/elixir_mhd_ec.jl", "p7_analysis"]` | 1.11 (5%) ❌ | 1.00 (1%) |
| `["structured_3d_dgsem/elixir_advection_nonperiodic_curved.jl", "p7_rhs!"]` | 1.08 (5%) ❌ | 1.00 (1%) |
| `["structured_3d_dgsem/elixir_euler_ec.jl", "p3_analysis"]` | 0.85 (5%) ✅ | 1.00 (1%) |
| `["structured_3d_dgsem/elixir_euler_ec.jl", "p3_rhs!"]` | 0.86 (5%) ✅ | 1.00 (1%) |
| `["structured_3d_dgsem/elixir_mhd_ec.jl", "p3_analysis"]` | 0.95 (5%) ✅ | 1.00 (1%) |
| `["structured_3d_dgsem/elixir_mhd_ec.jl", "p7_analysis"]` | 0.93 (5%) ✅ | 1.00 (1%) |
| `["tree_2d_dgsem/elixir_advection_amr_nonperiodic.jl", "p3_rhs!"]` | 0.70 (5%) ✅ | 1.00 (1%) |
| `["tree_2d_dgsem/elixir_advection_amr_nonperiodic.jl", "p7_analysis"]` | 0.87 (5%) ✅ | 1.00 (1%) |
| `["tree_2d_dgsem/elixir_advection_amr_nonperiodic.jl", "p7_rhs!"]` | 0.86 (5%) ✅ | 1.00 (1%) |
| `["tree_2d_dgsem/elixir_advection_extended.jl", "p3_analysis"]` | 1.34 (5%) ❌ | 1.00 (1%) |
| `["tree_2d_dgsem/elixir_advection_extended.jl", "p3_rhs!"]` | 1.09 (5%) ❌ | 1.00 (1%) |
| `["tree_2d_dgsem/elixir_advection_extended.jl", "p7_analysis"]` | 0.86 (5%) ✅ | 1.00 (1%) |
| `["tree_2d_dgsem/elixir_advection_extended.jl", "p7_rhs!"]` | 1.10 (5%) ❌ | 1.00 (1%) |
| `["tree_2d_dgsem/elixir_euler_ec.jl", "p3_analysis"]` | 1.30 (5%) ❌ | 1.00 (1%) |
| `["tree_2d_dgsem/elixir_euler_vortex_mortar.jl", "p3_analysis"]` | 0.86 (5%) ✅ | 1.00 (1%) |
| `["tree_2d_dgsem/elixir_euler_vortex_mortar.jl", "p7_rhs!"]` | 0.87 (5%) ✅ | 1.00 (1%) |
| `["tree_2d_dgsem/elixir_euler_vortex_mortar_shockcapturing.jl", "p3_analysis"]` | 1.16 (5%) ❌ | 1.00 (1%) |
| `["tree_2d_dgsem/elixir_mhd_ec.jl", "p7_analysis"]` | 0.94 (5%) ✅ | 1.00 (1%) |
| `["tree_3d_dgsem/elixir_advection_extended.jl", "p3_analysis"]` | 1.30 (5%) ❌ | 1.00 (1%) |
| `["tree_3d_dgsem/elixir_advection_extended.jl", "p7_analysis"]` | 0.93 (5%) ✅ | 1.00 (1%) |
| `["tree_3d_dgsem/elixir_euler_ec.jl", "p3_analysis"]` | 0.83 (5%) ✅ | 1.00 (1%) |
| `["tree_3d_dgsem/elixir_euler_ec.jl", "p7_analysis"]` | 0.87 (5%) ✅ | 1.00 (1%) |
| `["tree_3d_dgsem/elixir_euler_ec.jl", "p7_rhs!"]` | 0.95 (5%) ✅ | 1.00 (1%) |
| `["tree_3d_dgsem/elixir_euler_shockcapturing.jl", "p3_analysis"]` | 0.85 (5%) ✅ | 1.00 (1%) |
| `["tree_3d_dgsem/elixir_euler_shockcapturing.jl", "p7_analysis"]` | 0.88 (5%) ✅ | 1.00 (1%) |
| `["tree_3d_dgsem/elixir_mhd_ec.jl", "p3_analysis"]` | 0.83 (5%) ✅ | 1.00 (1%) |
| `["tree_3d_dgsem/elixir_mhd_ec.jl", "p7_analysis"]` | 0.77 (5%) ✅ | 1.00 (1%) |
| `["tree_3d_dgsem/elixir_mhd_ec.jl", "p7_rhs!"]` | 0.92 (5%) ✅ | 1.00 (1%) |
| `["unstructured_2d_dgsem/elixir_euler_wall_bc.jl", "p3_analysis"]` | 0.92 (5%) ✅ | 1.00 (1%) |
| `["unstructured_2d_dgsem/elixir_euler_wall_bc.jl", "p3_rhs!"]` | 0.94 (5%) ✅ | 1.00 (1%) |
| `["unstructured_2d_dgsem/elixir_euler_wall_bc.jl", "p7_rhs!"]` | 0.94 (5%) ✅ | 1.00 (1%) |

Benchmark Group List

Here's a list of all the benchmark groups executed by this job:

  • ["benchmark/elixir_2d_euler_vortex_p4est.jl"]
  • ["benchmark/elixir_2d_euler_vortex_structured.jl"]
  • ["benchmark/elixir_2d_euler_vortex_tree.jl"]
  • ["benchmark/elixir_2d_euler_vortex_unstructured.jl"]
  • ["latency"]
  • ["p4est_2d_dgsem/elixir_advection_extended.jl"]
  • ["p4est_3d_dgsem/elixir_advection_basic.jl"]
  • ["structured_2d_dgsem/elixir_advection_extended.jl"]
  • ["structured_2d_dgsem/elixir_advection_nonperiodic.jl"]
  • ["structured_2d_dgsem/elixir_euler_ec.jl"]
  • ["structured_2d_dgsem/elixir_euler_source_terms_nonperiodic.jl"]
  • ["structured_2d_dgsem/elixir_mhd_ec.jl"]
  • ["structured_3d_dgsem/elixir_advection_nonperiodic_curved.jl"]
  • ["structured_3d_dgsem/elixir_euler_ec.jl"]
  • ["structured_3d_dgsem/elixir_euler_source_terms_nonperiodic_curved.jl"]
  • ["structured_3d_dgsem/elixir_mhd_ec.jl"]
  • ["tree_2d_dgsem/elixir_advection_amr_nonperiodic.jl"]
  • ["tree_2d_dgsem/elixir_advection_extended.jl"]
  • ["tree_2d_dgsem/elixir_euler_ec.jl"]
  • ["tree_2d_dgsem/elixir_euler_vortex_mortar.jl"]
  • ["tree_2d_dgsem/elixir_euler_vortex_mortar_shockcapturing.jl"]
  • ["tree_2d_dgsem/elixir_mhd_ec.jl"]
  • ["tree_3d_dgsem/elixir_advection_extended.jl"]
  • ["tree_3d_dgsem/elixir_euler_ec.jl"]
  • ["tree_3d_dgsem/elixir_euler_mortar.jl"]
  • ["tree_3d_dgsem/elixir_euler_shockcapturing.jl"]
  • ["tree_3d_dgsem/elixir_mhd_ec.jl"]
  • ["unstructured_2d_dgsem/elixir_euler_wall_bc.jl"]

Julia versioninfo

Target

Julia Version 1.9.4
Commit 8e5136fa297 (2023-11-14 08:46 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
      "Rocky Linux release 8.9 (Green Obsidian)"
  uname: Linux 4.18.0-513.11.1.el8_9.x86_64 #1 SMP Wed Jan 10 22:58:54 UTC 2024 x86_64 x86_64
  CPU: Intel(R) Xeon(R) Platinum 8160 CPU @ 2.10GHz: 
                 speed         user         nice          sys         idle          irq
       #1-48  2100 MHz  190265762 s       7123 s    3499305 s  335295439 s    1378880 s
  Memory: 187.07468032836914 GB (163619.5 MB free)
  Uptime: 1.10867135e6 sec
  Load Avg:  19.06  19.59  18.4
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, skylake-avx512)
  Threads: 1 on 48 virtual cores

Baseline

Julia Version 1.9.4
Commit 8e5136fa297 (2023-11-14 08:46 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
      "Rocky Linux release 8.9 (Green Obsidian)"
  uname: Linux 4.18.0-513.11.1.el8_9.x86_64 #1 SMP Wed Jan 10 22:58:54 UTC 2024 x86_64 x86_64
  CPU: Intel(R) Xeon(R) Platinum 8160 CPU @ 2.10GHz: 
                 speed         user         nice          sys         idle          irq
       #1-48  2100 MHz  190719005 s       7130 s    3522736 s  335948290 s    1382233 s
  Memory: 187.07468032836914 GB (174674.2109375 MB free)
  Uptime: 1.11103335e6 sec
  Load Avg:  10.19  11.36  16.52
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, skylake-avx512)
  Threads: 1 on 48 virtual cores

@DanielDoehring
Contributor Author

2 Threads

Results

A ratio greater than 1.0 denotes a possible regression (marked with ❌), while a ratio less
than 1.0 denotes a possible improvement (marked with ✅). Only significant results - results
that indicate possible regressions or improvements - are shown below (thus, an empty table means that all
benchmark results remained invariant between builds).

| ID | time ratio | memory ratio |
|---|---|---|
| `["benchmark/elixir_2d_euler_vortex_tree.jl", "p7_rhs!"]` | 1.06 (5%) ❌ | 1.00 (1%) |
| `["benchmark/elixir_2d_euler_vortex_unstructured.jl", "p7_rhs!"]` | 1.06 (5%) ❌ | 1.00 (1%) |
| `["latency", "mhd_2d"]` | 0.93 (5%) ✅ | 1.00 (1%) |
| `["latency", "polydeg_3"]` | 0.86 (5%) ✅ | 1.00 (1%) |
| `["structured_2d_dgsem/elixir_advection_extended.jl", "p7_analysis"]` | 0.94 (5%) ✅ | 1.00 (1%) |
| `["structured_2d_dgsem/elixir_advection_extended.jl", "p7_rhs!"]` | 1.08 (5%) ❌ | 1.00 (1%) |
| `["structured_2d_dgsem/elixir_euler_ec.jl", "p7_analysis"]` | 0.95 (5%) ✅ | 1.00 (1%) |
| `["structured_2d_dgsem/elixir_euler_ec.jl", "p7_rhs!"]` | 0.95 (5%) ✅ | 1.00 (1%) |
| `["structured_2d_dgsem/elixir_mhd_ec.jl", "p7_rhs!"]` | 1.10 (5%) ❌ | 1.00 (1%) |
| `["structured_3d_dgsem/elixir_advection_nonperiodic_curved.jl", "p3_rhs!"]` | 1.08 (5%) ❌ | 1.00 (1%) |
| `["structured_3d_dgsem/elixir_mhd_ec.jl", "p3_analysis"]` | 0.95 (5%) ✅ | 1.00 (1%) |
| `["structured_3d_dgsem/elixir_mhd_ec.jl", "p7_analysis"]` | 0.94 (5%) ✅ | 1.00 (1%) |
| `["tree_2d_dgsem/elixir_advection_amr_nonperiodic.jl", "p3_analysis"]` | 0.95 (5%) ✅ | 1.00 (1%) |
| `["tree_2d_dgsem/elixir_advection_extended.jl", "p3_analysis"]` | 0.94 (5%) ✅ | 1.00 (1%) |
| `["tree_2d_dgsem/elixir_advection_extended.jl", "p7_analysis"]` | 0.94 (5%) ✅ | 1.00 (1%) |
| `["tree_2d_dgsem/elixir_euler_ec.jl", "p3_analysis"]` | 0.94 (5%) ✅ | 1.00 (1%) |
| `["tree_2d_dgsem/elixir_euler_ec.jl", "p3_rhs!"]` | 1.17 (5%) ❌ | 1.00 (1%) |
| `["tree_2d_dgsem/elixir_euler_ec.jl", "p7_analysis"]` | 0.94 (5%) ✅ | 1.00 (1%) |
| `["tree_2d_dgsem/elixir_euler_vortex_mortar.jl", "p7_rhs!"]` | 1.09 (5%) ❌ | 1.00 (1%) |
| `["tree_2d_dgsem/elixir_mhd_ec.jl", "p7_analysis"]` | 0.95 (5%) ✅ | 1.00 (1%) |
| `["tree_3d_dgsem/elixir_advection_extended.jl", "p7_rhs!"]` | 1.08 (5%) ❌ | 1.00 (1%) |
| `["tree_3d_dgsem/elixir_euler_ec.jl", "p7_rhs!"]` | 1.07 (5%) ❌ | 1.00 (1%) |
| `["tree_3d_dgsem/elixir_euler_shockcapturing.jl", "p3_rhs!"]` | 0.93 (5%) ✅ | 1.00 (1%) |
| `["tree_3d_dgsem/elixir_mhd_ec.jl", "p7_analysis"]` | 0.94 (5%) ✅ | 1.00 (1%) |
| `["tree_3d_dgsem/elixir_mhd_ec.jl", "p7_rhs!"]` | 0.85 (5%) ✅ | 1.00 (1%) |
| `["unstructured_2d_dgsem/elixir_euler_wall_bc.jl", "p3_rhs!"]` | 1.06 (5%) ❌ | 1.00 (1%) |

Benchmark Group List

Here's a list of all the benchmark groups executed by this job:

  • ["benchmark/elixir_2d_euler_vortex_p4est.jl"]
  • ["benchmark/elixir_2d_euler_vortex_structured.jl"]
  • ["benchmark/elixir_2d_euler_vortex_tree.jl"]
  • ["benchmark/elixir_2d_euler_vortex_unstructured.jl"]
  • ["latency"]
  • ["p4est_2d_dgsem/elixir_advection_extended.jl"]
  • ["p4est_3d_dgsem/elixir_advection_basic.jl"]
  • ["structured_2d_dgsem/elixir_advection_extended.jl"]
  • ["structured_2d_dgsem/elixir_advection_nonperiodic.jl"]
  • ["structured_2d_dgsem/elixir_euler_ec.jl"]
  • ["structured_2d_dgsem/elixir_euler_source_terms_nonperiodic.jl"]
  • ["structured_2d_dgsem/elixir_mhd_ec.jl"]
  • ["structured_3d_dgsem/elixir_advection_nonperiodic_curved.jl"]
  • ["structured_3d_dgsem/elixir_euler_ec.jl"]
  • ["structured_3d_dgsem/elixir_euler_source_terms_nonperiodic_curved.jl"]
  • ["structured_3d_dgsem/elixir_mhd_ec.jl"]
  • ["tree_2d_dgsem/elixir_advection_amr_nonperiodic.jl"]
  • ["tree_2d_dgsem/elixir_advection_extended.jl"]
  • ["tree_2d_dgsem/elixir_euler_ec.jl"]
  • ["tree_2d_dgsem/elixir_euler_vortex_mortar.jl"]
  • ["tree_2d_dgsem/elixir_euler_vortex_mortar_shockcapturing.jl"]
  • ["tree_2d_dgsem/elixir_mhd_ec.jl"]
  • ["tree_3d_dgsem/elixir_advection_extended.jl"]
  • ["tree_3d_dgsem/elixir_euler_ec.jl"]
  • ["tree_3d_dgsem/elixir_euler_mortar.jl"]
  • ["tree_3d_dgsem/elixir_euler_shockcapturing.jl"]
  • ["tree_3d_dgsem/elixir_mhd_ec.jl"]
  • ["unstructured_2d_dgsem/elixir_euler_wall_bc.jl"]

Julia versioninfo

Target

Julia Version 1.9.4
Commit 8e5136fa297 (2023-11-14 08:46 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
      "Rocky Linux release 8.9 (Green Obsidian)"
  uname: Linux 4.18.0-513.11.1.el8_9.x86_64 #1 SMP Wed Jan 10 22:58:54 UTC 2024 x86_64 x86_64
  CPU: Intel(R) Xeon(R) Platinum 8160 CPU @ 2.10GHz: 
                 speed         user         nice          sys         idle          irq
       #1-48  2100 MHz  190740522 s       7143 s    3523861 s  336936464 s    1382380 s
  Memory: 187.07468032836914 GB (181579.1796875 MB free)
  Uptime: 1.1131404e6 sec
  Load Avg:  1.31  1.36  2.89
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, skylake-avx512)
  Threads: 2 on 48 virtual cores

Baseline

Julia Version 1.9.4
Commit 8e5136fa297 (2023-11-14 08:46 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
      "Rocky Linux release 8.9 (Green Obsidian)"
  uname: Linux 4.18.0-513.11.1.el8_9.x86_64 #1 SMP Wed Jan 10 22:58:54 UTC 2024 x86_64 x86_64
  CPU: Intel(R) Xeon(R) Platinum 8160 CPU @ 2.10GHz: 
                 speed         user         nice          sys         idle          irq
       #1-48  2100 MHz  190771163 s       7148 s    3525368 s  337942610 s    1382565 s
  Memory: 187.07468032836914 GB (181894.94140625 MB free)
  Uptime: 1.11530533e6 sec
  Load Avg:  1.36  1.52  1.86
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, skylake-avx512)
  Threads: 2 on 48 virtual cores

Member

@ranocha ranocha left a comment

Thanks for running the benchmarks, too. As far as I understand, no benchmark shows a regression in both runs of either comparison (the same number of threads on your and my server, or the same server with a different number of threads). Thus, I assume that there are no serious performance regressions in this PR.

Thanks a lot! This is nearly ready to merge - I just have a minor comment.

src/auxiliary/math.jl Outdated Show resolved Hide resolved
Co-authored-by: Hendrik Ranocha <ranocha@users.noreply.github.com>
Project.toml Outdated Show resolved Hide resolved
@DanielDoehring
Contributor Author

Thanks for running the benchmarks, too. As far as I understand, no benchmark shows a regression in both runs of either comparison (the same number of threads on your and my server, or the same server with a different number of threads). Thus, I assume that there are no serious performance regressions in this PR.

I ran another test to make sure: no elixir shows an increased runtime in both of my runs on the same system, in either the single-threaded or the multi-threaded case.
Additionally, as already observed, no elixir shows an increased runtime in both my second run and the run you posted.
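The cross-check described above amounts to intersecting the sets of regressed benchmarks from two runs. As a sketch, the following uses the entries marked ❌ in the two 2-thread tables posted earlier in this thread (not the second run mentioned here, whose data was not posted); the variable names are illustrative assumptions.

```julia
# Regressions (❌) for 2 threads, copied from the two tables posted in this thread.
# Variable names are illustrative only.
regressions_ranocha = Set([
    ("benchmark/elixir_2d_euler_vortex_unstructured.jl", "p3_rhs!"),
    ("structured_2d_dgsem/elixir_euler_source_terms_nonperiodic.jl", "p3_rhs!"),
    ("structured_3d_dgsem/elixir_euler_source_terms_nonperiodic_curved.jl", "p3_rhs!"),
    ("tree_2d_dgsem/elixir_advection_amr_nonperiodic.jl", "p3_rhs!"),
])

regressions_daniel = Set([
    ("benchmark/elixir_2d_euler_vortex_tree.jl", "p7_rhs!"),
    ("benchmark/elixir_2d_euler_vortex_unstructured.jl", "p7_rhs!"),
    ("structured_2d_dgsem/elixir_advection_extended.jl", "p7_rhs!"),
    ("structured_2d_dgsem/elixir_mhd_ec.jl", "p7_rhs!"),
    ("structured_3d_dgsem/elixir_advection_nonperiodic_curved.jl", "p3_rhs!"),
    ("tree_2d_dgsem/elixir_euler_ec.jl", "p3_rhs!"),
    ("tree_2d_dgsem/elixir_euler_vortex_mortar.jl", "p7_rhs!"),
    ("tree_3d_dgsem/elixir_advection_extended.jl", "p7_rhs!"),
    ("tree_3d_dgsem/elixir_euler_ec.jl", "p7_rhs!"),
    ("unstructured_2d_dgsem/elixir_euler_wall_bc.jl", "p3_rhs!"),
])

# A benchmark is suspicious only if it regressed in both runs.
shared = intersect(regressions_ranocha, regressions_daniel)
println(isempty(shared) ? "no shared regressions" : shared)
```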

DanielDoehring and others added 2 commits February 22, 2024 14:43
Co-authored-by: Joshua Lampert <51029046+JoshuaLampert@users.noreply.github.com>
Member

@ranocha ranocha left a comment

So we're ready to go from your point of view?

@DanielDoehring
Contributor Author

Yes!

I plan to file an issue/PR in the NaNMath.jl repo to showcase our implementation, as it is probably more efficient than the one currently provided by the package.
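For readers unfamiliar with the idea, here is a minimal sketch of `sqrt` and `log` variants that return `NaN` for negative arguments instead of throwing a `DomainError`. The names `nan_sqrt` and `nan_log` are hypothetical, and this simple branch-based version is neither the implementation merged in this PR nor the one in NaNMath.jl.

```julia
# Hypothetical helpers for illustration; not taken from Trixi.jl or NaNMath.jl.
# Base.sqrt and Base.log throw a DomainError for negative real arguments;
# these variants return NaN of the matching floating-point type instead.
nan_sqrt(x::Real) = x < zero(x) ? oftype(float(x), NaN) : sqrt(x)
nan_log(x::Real) = x < zero(x) ? oftype(float(x), NaN) : log(x)

println(nan_sqrt(4.0))   # non-negative input behaves like Base.sqrt
println(nan_sqrt(-1.0))  # negative input yields NaN, no exception
println(nan_log(-1.0))   # negative input yields NaN, no exception
```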

@ranocha ranocha merged commit 029ddea into trixi-framework:main Feb 23, 2024
26 of 34 checks passed
@DanielDoehring DanielDoehring deleted the NaNMath branch February 23, 2024 08:25