Overhaul display of a Trial #217

Merged 1 commit into JuliaCI:master on Jun 25, 2021

Conversation

@tecosaur (Collaborator) commented May 28, 2021

Use layout inspired by hyperfine, with an added histogram. Closes #215.

Sample:

Initial version

image

image

@mcabbott commented Jun 1, 2021

This looks neat and seems easy to read. I like the colour showing mean/median, and I like that the units are clear at a glance. Going horizontally gets enough bins without making you scroll too much.

I have suggestions. Perhaps printing "Histogram: frequency by time" every time is a bit much. Perhaps printing the min/max twice is also unnecessary. It would be nicer to have mean & median on an equal footing, instead of median being tacked on at the right somewhere. So I'd propose that it look like this:

BenchmarkTools.Trial:
 10_000 samples with 968 evaluations.
 Time: median:     90.902 ns               (0.00% GC)
       mean ± σ:   94.936 ns ±  47.797 ns  (2.78% ±  5.03% GC)

          ▁▃▅▆▇█▇▆▅▂▁                                          
  ▂▂▃▃▄▅▆▇███████████▇▆▄▄▃▃▂▂▂▂▂▂▂▂▂▂▂▁▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂
  min: 77.7 ns (0.00% GC)                max: 137 ns (87.94% GC)

 Memory estimate: 160 bytes, allocations estimate: 1.

Perhaps min/max need not be coloured, if their position makes it obvious (or perhaps white not grey?) Perhaps grayscale plus two colours is enough. I'd also say that the standard deviation should not share the mean's colour -- it's data about the grey bars, not the green one.

Edit -- now I see there are two maxima, one is a cutoff to make the plot show the interesting part. Then I'm less sure but change my suggestion to something like this (with the [137 ns] in light grey?) with some marker for the broken axis?

BenchmarkTools.Trial:
 10_000 samples with 968 evaluations.
 Time: median:     90.902 ns               (0.00% GC)
       mean ± σ:   94.936 ns ±  47.797 ns  (2.78% ±  5.03% GC)

          ▁▃▅▆▇█▇▆▅▂▁                                          
  ▂▂▃▃▄▅▆▇███████████▇▆▄▄▃▃▂▂▂▂▂▂▂▂▂▂▂▁▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ... ▄
  min: 77.7 ns (0.00% GC)                      [137 ns]
                                    max: 954.823 ns (87.94% GC)
 Memory estimate: 1 allocation, 160 bytes.

In the screenshot above, I think min … max seems a bit confusing right next to mean ± σ; it's a sort-of pun between ± and …, and there are too many colours to scan quickly. I would prefer min and max to be at the left & right ends of the graph, so it's less of a puzzle to know where to look.

@tecosaur (Collaborator, Author) commented Jun 1, 2021

Thanks for the feedback @mcabbott !

Perhaps printing "Histogram: frequency by time" every time is a bit much

The reason I have this is that it's not always "frequency by time". From the suite of expressions I benchmarked, I found cases where a plain frequency vs. time histogram was not very helpful, so, conditional on a little heuristic, "log(frequency) by time" is sometimes used instead. If you have a suggestion that is more succinct but still captures this, that would be great.
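For anyone curious, a heuristic of that kind could look roughly like the sketch below. The function name, the median-based test, and the factor of 10 are all illustrative guesses, not the PR's actual code.

using Statistics: median

# Pick a y-scale for the histogram: fall back to log-scaled frequencies
# when one bin dwarfs the typical bin (sketch only, not the PR's heuristic).
function histogram_yscale(bincounts::Vector{Int})
    nonempty = filter(>(0), bincounts)
    isempty(nonempty) && return :linear
    return maximum(nonempty) > 10 * median(nonempty) ? :log : :linear
end

# histogram_yscale([1, 2, 3, 50, 3, 2, 1])  # => :log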

Perhaps printing the min/max twice is also unnecessary

The max isn't printed twice. Because long tails seem common but are not indicative of the performance characteristics of the expression, I chop off the top 1% of times when creating the histogram. You can see this in the screenshot, where the max time is ~710 ns but the max on the graph is 140 ns. Not doing this squishes what I usually consider to be the region of interest.
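For concreteness, the truncation described above could be done along these lines; this is only a sketch, with the 99th-percentile cutoff taken from the 1% figure in the comment, not the PR's actual implementation.

using Statistics: quantile

# Drop the slowest ~1% of samples before binning, so a long tail doesn't
# squash the region of interest in the histogram (sketch only).
function histogram_times(times::Vector{Float64})
    cutoff = quantile(times, 0.99)
    return filter(<=(cutoff), times)
end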

It would be nicer to have mean & median on an equal footing, instead of median being tacked on at the right somewhere

I think the colouring gives it the same prominence, but this seems reasonable.

I hope this makes the motivation behind some of my choices a bit more clear :)


Edit: Regarding your new suggestion, I must admit that this:
image
looks visually messy to me, and a cleaner way of doing this has yet to come to mind.

Some sort of a broken axis could be a decent idea though, but then I suddenly need to cram a lot more information onto one line:

  • Min value
  • Min GC
  • Is frequency logged
  • Max of graph
  • Max value
  • Max GC

and I suspect that might just be too much.

@mcabbott commented Jun 1, 2021

Ah, if it's sometimes log, then perhaps that needs to be noted. Can you add an example above maybe?

I guess my complaint there is about saying "time" twice, and labelling a histogram "histogram" every time. But here's a possible idea; it doesn't seem too crammed to put min & max on the same line although I agree that 6 things would probably be too much. (I picture [] in light grey, as you had it.)

  ▂▂▃▃▄▅▆▇███████████▇▆▄▄▃▃▂▂▂▂▂▂▂▂▂▂▂▁▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ... ▄
                               [linear axis, to 137 ns]
  min: 77.7 ns (0.00% GC)             max: 954.823 ns (87.94% GC)

My main complaint about the existing printing is that I never remember what order min/max/median/mean get printed; I think mean ± σ would help, and moving min/max would make them super-obvious.

@tecosaur (Collaborator, Author) commented Jun 1, 2021

I've just realised that your approach would be misleading, as min/max GC don't necessarily correspond to min/max time 🙁.

@tecosaur (Collaborator, Author) commented Jun 5, 2021

Are any of the JuliaCI people planning on taking a look at this? Perhaps @Roger-luo, @vtjnash, or @vchuravy ?

@tecosaur (Collaborator, Author) commented Jun 5, 2021

Ah oops, I assumed that Roger-luo and vtjnash were part of JuliaCI based on recent commits, but I see that that isn't the case. Sorry for the bogus ping.

@mcabbott commented Jun 5, 2021

min/max GC don't necessarily correspond to min/max time

This is a question I meant to raise too. What exactly is that GC percentage printed next to mean? There are several possibilities, is one more useful than the others? Or might it be clearer to omit that?

I would argue that the current printing certainly suggests that the GC percentages for min/max/median belong to the same runs as the times shown:

  minimum time:     248.763 ns (0.00% GC)
  median time:      262.463 ns (0.00% GC)
  mean time:        370.344 ns (28.55% GC)
  maximum time:     14.132 μs (97.67% GC)

If this isn't now the case, perhaps it should be... or perhaps the printing should separate these if in fact it's the median GC time, as a percentage (or median GC percentage?) not the GC percentage of the median-time run.

I don't see anything about this point in the documentation.

@tecosaur (Collaborator, Author) commented Jun 5, 2021

Good point about it not being clear. How about this?

image

I don't see anything about this point in the documentation.

I don't think it's documented, but I noticed when working with this that the time and GC results lists are both sorted.

@mcabbott commented Jun 5, 2021

but I noticed when working with this that the time and GC results lists are both sorted

That would usually be true for all the options I mentioned, right?

@tecosaur (Collaborator, Author) commented Jun 5, 2021

That would usually be true for all the options I mentioned, right?

What do you mean?

@tecosaur (Collaborator, Author) commented Jun 7, 2021

Pushed the change in the screenshot above.

@codecov bot commented Jun 7, 2021

Codecov Report

Merging #217 (bfa8691) into master (134675b) will decrease coverage by 9.85%.
The diff coverage is 0.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #217      +/-   ##
==========================================
- Coverage   85.63%   75.77%   -9.86%     
==========================================
  Files           6        6              
  Lines         738      834      +96     
==========================================
  Hits          632      632              
- Misses        106      202      +96     
Impacted Files Coverage Δ
src/trials.jl 48.68% <0.00%> (-22.47%) ⬇️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 134675b...bfa8691.

@Arkoniak commented Jun 7, 2021

Before this PR is merged, can I ask you please to take a pause and think it through again?

In my opinion, such a visual change is a big thing and shouldn't be taken lightly, despite the best intentions, for accessibility reasons.

First of all, you are using lots of colors. While colors can be very expressive, they can also be very distracting. The good thing about the current design is that it is color neutral, which means that while it doesn't give the best possible visuals, it doesn't offend anyone.

Secondly, there have been cases where colorschemes broke when extra colors were introduced, for example in Julia 1.6's logging (JuliaLang/julia#38730). One can argue that that is a problem of the colorscheme itself, but there were also reports from people who had trouble because of their eyes or monitors.

I think there are multiple ways to solve this problem, for example:

  1. Add a colorized keyword to the @benchmark macro, with a default value of false.
  2. Instead of hardcoded values, use some sort of colorscheme (:median_color instead of :blue) backed by a dictionary where one can set up whatever colors they want; by default all colors would be :white (see the sketch after this list).
  3. Instead of overriding the @benchmark macro, add a separate @colorized_hyperfine macro.
  4. Use white by default, but detect whether OhMyREPL.jl is loaded and use colors in that case.
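A minimal sketch of what option 2 could look like; every name here (BENCHMARK_COLORS, :median_color, benchcolor) is hypothetical, not an existing BenchmarkTools API.

# Hypothetical user-configurable colorscheme; all defaults are :white,
# so output stays color-neutral unless the user opts in.
const BENCHMARK_COLORS = Dict{Symbol,Symbol}(
    :median_color => :white,
    :mean_color   => :white,
    :minmax_color => :white,
)

benchcolor(key::Symbol) = get(BENCHMARK_COLORS, key, :white)

# The display code would then use e.g.
#   printstyled(io, str; color = benchcolor(:median_color))
# and a user who wants colors could opt in from startup.jl:
#   BENCHMARK_COLORS[:median_color] = :blue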

For the colorscheme or colorized variants, one could add the possibility of setting them globally through options or environment variables, so those who want the colored version can add the necessary lines to their startup.jl.

Just to clarify, it is nice to have good-looking output and it should be kept as a feature, but it should be opt-in. Colors shouldn't be a distraction or an obstacle to fulfilling the main purpose of BenchmarkTools.jl, i.e. benchmarking code.

@tecosaur (Collaborator, Author) commented Jun 7, 2021

@Arkoniak I appreciate your concern, but the only colours used here are 8-color terminal codes. These are set by the terminal/user and should be readable at all times; my take is that this is the responsibility of the terminal/user. If I were using 256-color codes, I would be inclined to agree that the use of colour would warrant more thought. This is why, after trying a shaded gradient with the histogram, I reverted to solid 'white'. The lone exception here is :light_black, but that seems to be addressed by Base.text_colors[:light_black] = Base.text_colors[:white] should it be an issue.
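For reference, that workaround would live in a user's startup.jl, something like:

# ~/.julia/config/startup.jl
# Remap :light_black to :white so low-contrast text stays readable.
# This pokes at Base internals, so treat it as a personal tweak, not an API.
Base.text_colors[:light_black] = Base.text_colors[:white]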

How "distracting" you want terminal colours to be is indeed a personal preference, and it can be addressed by users changing the base terminal colours to match their preference --- as they already do.

@carstenbauer

I like colors if they are (1) customizable and (2) highlight the most important things. In that regard I would ask: do we really need to highlight mean, median, min, and max, i.e. all of them? Afaik, min and median are the most important ones, so why not just highlight those with colors? The full information is still there. One might argue that if you highlight/color everything, you don't highlight anything anymore.

@tecosaur (Collaborator, Author) commented Jun 8, 2021

The colours here aren't currently customisable, but if there's much demand I can make them so.
Regarding

The full information is still there. One might argue that If you highlight / color everything you don't highlight anything anymore.

I think there's an important point to be made here. If you emphasise everything you emphasise nothing, but colour can also be used for categorisation, as I do here. Which number corresponds to max? The one that's the same colour. That's not emphasis, that's categorisation. If you're looking at multiple benchmarks and wanting to compare a particular parameter, using different colours makes it easier to visually jump to what you're looking for. Hence, IMO the current level of colouring is warranted.

I think it's also worth bearing in mind that once/if this PR is accepted, it's not as if this feature is locked in stone. Tweaks (like colour customisation) can always be added later.

@rafaqz commented Jun 8, 2021

The colors are great. But light_black should be reserved for text that carries no important information, or a lot of people won't be able to see it.

@tecosaur (Collaborator, Author) commented Jun 8, 2021

Hmmm, maybe I should just make the GC text white then

@rafaqz commented Jun 8, 2021

And probably the timings underneath each end of the histogram...

I just did the colorscheme in DimensionalData.jl and light black does look good, but I only used it for additional comments, brackets, etc. that won't be missed if you can't see them. Here the divider, brackets, and colons are non-essential, and probably "Histogram: frequency by time" too; people will figure that out.

@tecosaur (Collaborator, Author) commented Jun 8, 2021

That seems like a pretty reasonable approach, I'll give that a shot 👍

@tecosaur (Collaborator, Author) commented Jun 8, 2021

Implemented 🙂

image

@rafaqz commented Jun 8, 2021

Looks great! But there's still that one dark 80.3 ns ;)

I'm already using this; it's a vast improvement for benchmarking. Thanks.

@tecosaur (Collaborator, Author) commented Jun 8, 2021

I made the 80.3 ns dark because it's just the minimum value, but the top end isn't the maximum, it's truncated: so I've made that end non-dark because it's new information, and left the left/min end alone because it repeats information from above.

@rafaqz commented Jun 8, 2021

IDK, but when I read it, that's the value I read, not the one in the middle.

@tecosaur (Collaborator, Author) commented Jun 8, 2021

Fair enough. Here's a variant which does that and applies grey to the parentheses and colons:

image

@carstenbauer commented Jun 9, 2021

FWIW: This is how things look for me (I was wondering because @tecosaur's @benchmark is purple rather than orange):

image

(I'd think that my terminal color scheme is pretty "standard", i.e. no drastic color changes.)

@tecosaur (Collaborator, Author) commented Jun 9, 2021

Thanks @crstnbr. It's nice to see how things look for other people. The purple @benchmark is just because of the OhMyREPL.jl theme I've set. I think I'm pretty happy with the current state of affairs. Now if I could just get a maintainer to take a look at this... 😔

@fredrikekre (Member)

Perhaps it would be worth keeping the old format around for the 2-arg signature (show(::IO, ::Trial))? Then you would still get this pretty printing in e.g. the REPL, but if you print or string-interpolate the variable you get the old output.
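A minimal sketch of the split being suggested; the FakeTrial type is illustrative, not BenchmarkTools' actual internals.

struct FakeTrial              # stand-in for BenchmarkTools.Trial
    times::Vector{Float64}
end

# 2-arg show: used by print(), string interpolation, and logging,
# so this is where a terse (or the old) output would live.
Base.show(io::IO, t::FakeTrial) =
    print(io, "FakeTrial(", length(t.times), " samples)")

# 3-arg show for MIME"text/plain": used by the REPL's rich display,
# so this is where the new histogram-style output would live.
function Base.show(io::IO, ::MIME"text/plain", t::FakeTrial)
    println(io, "FakeTrial with ", length(t.times), " samples:")
    print(io, "  min ", minimum(t.times), " ns, max ", maximum(t.times), " ns")
end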

@pfitzseb commented Jun 9, 2021

What do you think about keeping the text ordering roughly the same, so that Range is the first entry? That's usually the most important one when only glancing at the output, no?

@tecosaur (Collaborator, Author)

@vchuravy I just implemented C without red, so unless there's anything else you'd like to see, I'd say this is ready to merge 🙂.

@vchuravy vchuravy merged commit dea246d into JuliaCI:master Jun 25, 2021
@DilumAluthge (Member)

Would be good to get a new release.

@StefanKarpinski

The new output format is so nice! Bravo 👏🏻

@tecosaur (Collaborator, Author)

Glad you like it 😁 I love how Julia seems much more enthusiastic about making a rich terminal experience than most other languages I've used ❤️.

@giordano (Member)

I've actually been trying to make a PR, but I've been unable to build Julia master with GCC 11 🙁.

Uh? JuliaLang/julia#40703 was fixed before you posted the message here.

@tecosaur (Collaborator, Author)

I saw; the fact of the matter is that I wasn't able to build it though, and I have all of the dependencies.

cmcaine added a commit to cmcaine/BenchmarkTools.jl that referenced this pull request Jul 9, 2021
@KristofferC (Contributor)

So now we have had this for a while and I am curious what people think. Personally, I think the new output is really cool, but the noise-to-signal ratio is in most cases very high. This makes me doubt that having this as the default display is a good idea. Something terser that would fit on one line would be great, and then there could be an option to expand it into the richer printing.

@pfitzseb

Isn't that basically @btime though? Or, rather, what part of the output is the signal for you? Something like

julia> b = @benchmark sort(data) setup=(data=rand(10))
BenchmarkTools.Trial: 10000 samples with 973 evaluations.
 67.764 ns, 0.00% GC (median 85.830 ns, 0.00% GC), 1 allocs for 160 bytes

julia> verbose(b)
BenchmarkTools.Trial: 10000 samples with 973 evaluations.
 Range (min … max):  67.764 ns … 793.284 ns  ┊ GC (min … max): 0.00% … 81.37%
 Time  (median):     85.830 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   92.337 ns ±  34.990 ns  ┊ GC (mean ± σ):  1.83% ±  4.91%

    ▁▄▅▆▇███▇▇▆▅▄▄▄▂▂▁▁▁                                       ▃
  ▂▄██████████████████████▇██▇█▇▇▇▇▇▇▆▇▇▇▇▇▇▇███▇█▇▆▇▆▇▇▅▆▅▅▆▅ █
  67.8 ns       Histogram: log(frequency) by time       183 ns <

 Memory estimate: 160 bytes, allocs estimate: 1.

@KristofferC (Contributor) commented Sep 13, 2021

Yes, something like that. A ±σ could fit there as well.

@tecosaur (Collaborator, Author)

I'm quite happy to say that the new histogram display, while showing more information, is more compact than the previous display, to the point that it takes fewer lines. Regarding,

Something terser that would fit on one line would be great

It sounds like this is more an argument for expanding @btime than contracting @benchmark. It's currently just time (allocations: bytes) but you could suggest something like time (min time-max time, allocations: bytes) etc.

@timholy (Member) commented Sep 16, 2021

I like the status quo. The histogram makes it crystal-clear what's happening, and gets people thinking more deeply about performance measurement.

I'd support an io = IOContext(stdout, :compact=>true) for a more compact display.
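A sketch of how a show method could honor that flag, assuming an illustrative FakeTrial type rather than BenchmarkTools' actual internals:

using Statistics: median

struct FakeTrial              # stand-in for BenchmarkTools.Trial
    times::Vector{Float64}
end

function Base.show(io::IO, ::MIME"text/plain", t::FakeTrial)
    if get(io, :compact, false)
        # terse one-liner when the IOContext carries :compact => true
        print(io, length(t.times), " samples, median ", median(t.times), " ns")
    else
        # ... the full histogram display would go here ...
        print(io, "FakeTrial with ", length(t.times), " samples")
    end
end

# Usage:
# show(IOContext(stdout, :compact => true), MIME"text/plain"(), FakeTrial(rand(100)))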

@KristofferC (Contributor)

The histogram makes it crystal-clear what's happening, and gets people thinking more deeply about performance measurement.

In the vast majority of cases, it shows literally noise (your computer doing various background tasks), which is a bad thing to show because you are likely to misjudge the runtime of your code based on external "pollution". I have yet to see a single comment about the histogram result; in fact, I have not seen a single comment about almost any of the new numbers.
All I've seen from people doing e.g. presentations with the new output is them having trouble finding the relevant data in the sea of numbers. And I can't read benchmarks on Slack anymore because the output is so wide that it line-breaks and turns the ASCII art into garbage.

Just as a reference, for people who are interested, https://youtu.be/vrfYLlR8X8k?t=915 (and about 7 minutes forward) is a bit of an interesting discussion of timing code from the guy who used to do a lot of performance work at facebook.

@timholy (Member) commented Sep 16, 2021

So then why not just use @btime?

My question is often "is the difference in performance between two implementations meaningful, or noise?" That's the whole reason I just submitted #255, with a real-world demo in JuliaLang/julia#42227 (comment). You could (correctly) make the argument that if I'm going to the effort to supply an IOContext then I could turn on a more verbose-mode printing. I don't have strong objections to that view, although elsewhere we typically are verbose unless :compact => true. Even without that pretty-printing across runs, often you can get by looking at the standard deviation (I would in fact propose we add the absolute deviation to the median). So yes, I do look at the numbers.
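For example, that kind of statistic can already be computed from a Trial's raw samples; a sketch, with t.times holding the per-sample times in nanoseconds, and no particular printing format being proposed here.

using BenchmarkTools, Statistics

t = @benchmark sum(rand(100))

# Median absolute deviation of the sample times, the sort of
# "absolute deviation to the median" mentioned above (unscaled).
med = median(t.times)
mad = median(abs.(t.times .- med))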

For reference where seeing the raw data can be helpful: https://stackoverflow.com/questions/69164027/unreliable-results-from-timeit-cache-issue which came up in working on https://github.com/JuliaImages/image_benchmarks. And the video is useful, thanks. I think it actually provides fodder for the view that we should show the histogram: he points out that it is a bit complicated, that there are circumstances like networking where the minimum is not the only measure you're interested in.

I should make it clear that I can live with any outcome; I'm mostly glad this is available, and if we switch to a less verbose printing I will turn it on manually.

@StefanKarpinski

I feel the same way as @timholy here. If brevity is so crucial, why not use @btime? The @benchmark macro has always been about getting more detailed information and the new version with histogram takes up fewer lines than the old version without it did.

@Moelf (Contributor) commented Sep 17, 2021

If @btime could show allocations, compilation time, and GC % when they're not 0, that would be enough.

Basically, I think the idea is that sometimes people want to communicate a benchmark of a real-life workload (i.e. with allocations, GC time, etc.) in a non-screenshot format, potentially in a limited-width setting like a Slack thread, without it wrapping into 30 lines of black-and-white tofu.

I'm all for keeping everything in @benchmark, just re-organized so the sender can selectively copy-paste without needing to manually edit across multiple lines, as outlined in #236.

@timholy (Member) commented Sep 18, 2021

Not to beat a dead horse, but yesterday I got this histogram:

         █             ▃      ▃ ▃ ▃  ▃ ▃█    ▃ █ ▃ ▃  █▃▃█      
  ▇▁▁▁▁▁▇█▁▁▁▇▁▇▁▇▁▇▁▇▁█▁▁▇▇▇▁█▇█▇█▁▇█▁██▁▁▁▁█▇█▇█▇█▁▇████▇▁▁▇ ▁
  68 ms           Histogram: frequency by time          102 ms <

It instantly told me that I had some kind of "jackpot event" (of low quantal count and thus high variability) dominating the runtime, which in this case proved to be thread scheduling. Took about half a second to diagnose thanks to the histogram.

@chriselrod commented Sep 18, 2021

I think histograms are worth looking at whenever using Threads.@spawn or Threads.@threads, e.g.

julia> @benchmark fetch(Threads.@spawn 1+1)
BenchmarkTools.Trial: 7468 samples with 10 evaluations.
 Range (min … max):  10.931 μs … 360.687 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     66.520 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   66.620 μs ±   5.501 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                                        ▅█▆▁
  ▂▁▂▂▁▁▂▁▁▁▁▁▂▂▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▂▂▁▂▁▂▁▁▂▂▁▂▂▁▁▂▁▁▁▁▂▂▂▄▇████▆▃ ▃
  10.9 μs         Histogram: frequency by time         70.8 μs <

 Memory estimate: 480 bytes, allocs estimate: 5.

The minimum time isn't representative at all.

If all we ever look at is a single number summary, I think the median is better than the minimum.
If the minimum is representative, then the median will be very close.
If the minimum is not representative...then of course we shouldn't be focusing on it.

@mcabbott commented Sep 18, 2021

There is space on the line to make @btime print a bit more detail; maybe we should do that too? I'd argue that omitting the minimum would be weird, and that if you want one more time it should be the mean, not the median --- they agree in this @spawn example, but disagree in cases with many fast runs & a few slower ones, like #237 (comment).

Edit: this is now #258. The hope would be that it could be used in more of the places where people now use @benchmark, and hence save pixels overall.

@chriselrod

I agree. While as a single-number summary the mean can be biased by a single slow run, given two numbers I think it's a better addition to the minimum than the median.

@KristofferC (Contributor) commented Sep 22, 2021

If brevity is so crucial, why not use btime?

Unfortunately, I spend a lot more time reading others' results than my own. This is my life now:

image

:p

Again, I know distributions have their uses, all I am talking about is what to have as the default.

@ericphanson (Contributor)

I think many folks used to report the @btime results before-- I don't remember seeing much of the old printing before the histograms. Post-histograms, folks seem to be reporting the results of @benchmark more.

So I feel like the issue is that users switched which macro they use to report results (probably because of the fancier printing), rather than what the default output of @benchmark is.

@timholy (Member) commented Sep 23, 2021

This is my life now:

Agreed, that's really awful. But I agree with @ericphanson that the reason more people are using @benchmark now is because of the histogram. That certainly describes me. You still haven't answered why people can't just use @btime. Can we focus on that question? What would it take to make that acceptable advice? You could just say, "sorry, I can't read that. Please quote the results of @btime" and your suffering is over.

Comment on lines +370 to +375
julia> std(t)
BenchmarkTools.TrialEstimate:
time: 25.161 μs
gctime: 23.999 μs (95.38%)
memory: 16.36 KiB
allocs: 19

Cross-link to #146 for reexporting std too.


Successfully merging this pull request may close these issues.

A more compact, colourful, output style.