[RFC/WIP] Tools for measuring cycles and cpu_times and tricking out LLVM #92

Draft
wants to merge 5 commits into
base: master
from

Conversation

vchuravy
Member

@vchuravy vchuravy commented Dec 9, 2017

I recently started exploring options for more precise and low-level benchmarking tools.
As it stands, this PR is not ready to be included in BenchmarkTools, but it should provide a starting point for discussion.

  1. clobber() and escape()
    Two methods to prevent certain compiler optimisations at the LLVM level (see https://youtu.be/nXaxk27zwlk?t=2441).
    clobber() is a memory barrier that forces the compiler to flush all writes to memory. escape() prevents
    LLVM from optimising a value away by faking a store of it. escape() is not quite done, since it can't handle boxed values,
    and it would be easier to write if we could depend on LLVM.jl.

  2. bench_start() and bench_end()
    Inspired by https://github.com/dterei/gotsc and https://www.intel.com/content/www/us/en/embedded/training/ia-32-ia-64-benchmark-code-execution-paper.html
    Since CPUs can do speculative execution, reordering, and a bunch of other shenanigans, this is a very careful series of instructions that tries to prevent as much of that
    as possible and thus should give as precise an estimate as possible of the number of cycles a block of code takes to run. These instructions are not completely noise-free,
    since we are still running in user space, and the current implementation is x86_64 only (and requires a series of processor features). It is also tricky to convert cycles
    to time spent. If we use this method it should be opt-in, and we need to measure its variance and overhead.

  3. getProcessTime() and getThreadTime()
    I got curious and looked into what google/benchmark uses for time measurement, and it turns out they actually measure two things:
    run time and CPU time, where the latter is the time a process actually spends being run. The current implementation is Linux only but can be extended to all platforms we
    care about. For run time measurement they use http://en.cppreference.com/w/cpp/chrono/high_resolution_clock. Currently we are using uv_hrtime from libuv.
    Both uv_hrtime and the C++ timer will, under Unix, fall back to clock_gettime(CLOCK_MONOTONIC, ...), similar to my implementation of getProcessTime.
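To make the run time vs. CPU time distinction in item 3 concrete, here is a minimal sketch of reading both clocks through `clock_gettime`. It is not the code from this PR: the names `TimeSpec`, `clock_ns`, and `measure` are hypothetical, the clock ids are the Linux values from `<time.h>`, and the struct layout assumes 64-bit Linux.

```julia
# Linux-only sketch. Clock ids are the Linux values from <time.h> (assumption:
# other platforms use different ids or do not provide these clocks at all).
const CLOCK_MONOTONIC          = Cint(1)
const CLOCK_PROCESS_CPUTIME_ID = Cint(2)
const CLOCK_THREAD_CPUTIME_ID  = Cint(3)

# Mirrors `struct timespec` on 64-bit Linux, where time_t and long are both 64 bits.
struct TimeSpec
    tv_sec::Clong
    tv_nsec::Clong
end

# Read the given clock and return its value in nanoseconds.
function clock_ns(clock_id::Cint)
    ts = Ref{TimeSpec}()
    ret = ccall(:clock_gettime, Cint, (Cint, Ref{TimeSpec}), clock_id, ts)
    ret == 0 || error("clock_gettime failed for clock id $clock_id")
    return ts[].tv_sec * 1_000_000_000 + ts[].tv_nsec
end

# Hypothetical helper: run `f` once and report elapsed wall time and process CPU time.
function measure(f)
    wall0 = clock_ns(CLOCK_MONOTONIC)
    cpu0  = clock_ns(CLOCK_PROCESS_CPUTIME_ID)
    f()
    cpu1  = clock_ns(CLOCK_PROCESS_CPUTIME_ID)
    wall1 = clock_ns(CLOCK_MONOTONIC)
    return (wall_ns = wall1 - wall0, cpu_ns = cpu1 - cpu0)
end
```

With this sketch, `measure(() -> sleep(0.1))` would be expected to report a wall time of roughly 100 ms but a much smaller CPU time, since a sleeping process is not being run. Swapping in `CLOCK_THREAD_CPUTIME_ID` would give the per-thread variant.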

What should we do?
I think taking a lead from google/benchmark and measuring CPU time in addition to run time would be a good first actionable item. I am much
less sure about what to do with 1. and 2., and whether they are useful for BenchmarkTools.jl. That needs further evaluation, which I currently don't have time for.

end

"""
getProcessTime()
Member

This isn't a very Julian name, both in that it starts with "get" and that it's in camel case. I'd just call it processtime(). Likewise for getThreadTime, I'd call that threadtime().

@chethega

It is also tricky to convert cycles to time spent. If we use this method it should be opt-in, and we need to measure its variance and overhead.

Cycles spent is an extremely relevant metric in itself, often far more relevant than time. So I'd say: measure and report both, as well as the implied measured frequency.
This can serve as a reality check for users (if the reported frequency differs a lot from the official frequency, then we probably have a lot of measurement error). Also, when interpreting results, every relevant resource is normally counted in clock cycles anyway (instruction costs, cache-miss penalties, memory fetches, branch mispredicts, etc.). Say you do some computation with N logical steps; then you always want to count how many operations per cycle you achieve, and this tells you roughly how good your code is (large number: few bookkeeping instructions, good use of memory and ILP; small number: go figure out the problem).

Converting cycles to nanoseconds is bad; if any conversion makes sense, then it is nanoseconds -> cycles. By reporting measured frequency, the user is also empowered to spot problems like frequency drop due to AVX2, etc (some CPUs scale down frequency when some vector instructions are used).
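As a small illustration of the two derived numbers mentioned above, the arithmetic might look like the sketch below. The function names are hypothetical, and the inputs are assumed to come from a cycle counter and a wall-clock timer taken around the same region.

```julia
# Implied frequency in GHz: cycles per nanosecond is numerically the same as GHz.
implied_frequency_ghz(cycles, elapsed_ns) = cycles / elapsed_ns

# Logical operations completed per clock cycle.
ops_per_cycle(n_ops, cycles) = n_ops / cycles
```

For example, 3_000_000 cycles measured over 1_000_000 ns imply a frequency of 3.0 GHz; if the CPU is nominally clocked well away from that, the measurement is suspect or the frequency was scaled during the run.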

@vchuravy
Member Author

Do you know of any way to measure cycles in a platform-portable way, e.g. something that works for ARM and PPC?

Originally I went forward with #94 since CPU time is an important measure as well (how much time did we actually spend in the program, as opposed to sleeping or in the kernel).
I agree that cycle benchmarking has its place and is an important tool, but I am not convinced that a general framework such as BenchmarkTools is the right place for it (maybe we need a LowlevelBenchmarkTools package).
When measuring cycles you want to tightly control the code executed before and after the region of interest, and anything that does so introduces overhead that will throw off any other timing measurements.

Anyway, I won't have time to work on either, so I would be happy if someone could pick this up and bring it to a conclusion.

@vchuravy
Member Author

One of the things that brought me back to this PR is that https://perf.rust-lang.org/ defaults to instructions and cycles,
as does http://llvm-compile-time-tracker.com/

But maybe the better pathway is to use LinuxPerf.jl to build that infrastructure.

3 participants