RFC: benchmarking infrastructure. #3
Conversation
Sounds good to me. FWIW, I have a complementary RFC incoming that focuses on selecting a corpus of benchmark programs, how to avoid measurement bias, and how to do a sound statistical analysis of results. I don't think we need to block on separating the benchmark server from the web server, but it should be high priority in my opinion. Regarding security, we should have an allow list of users who can trigger benchmark runs for PRs. These users should be able to comment
If any of the above becomes impractical or difficult, an intermediate
design-point could involve a web interface that allows approved users
(authenticated in some way) to request a run on a given git commit hash; this
+1 to this feature; it's especially useful for finding which commit introduced a regression via bisection.
that the integration should not present unforeseen problems.

# Open questions
[open-questions]: #open-questions
From my experience maintaining arewefastyet, another pain point has been availability of the service. Machines would sometimes stop running the benchmarks for Some Reason (e.g. a timeout, not enough disk space on one machine, a network failure, ...). Then a person needs to look into it. This can lower availability and, in particular, increase latency (if only some people in a given time zone have access to some machines, the system may be unavailable for half a day at a time). It would be worth thinking about this ahead of time, and/or stating what the availability expectations are.
I agree that maintaining continuous benchmarking, while great, does cost resources and will have hiccups. Those are fine, I think, but it's something we'll want to go into with eyes open.
Yes, I think the most important thing here is that it won't necessarily gate PRs: if there's a performance-sensitive change going in and the benchmarking infrastructure is down, maybe we wait until we have numbers, but otherwise it seems like something we can reasonably treat as optional when it's unavailable.
👍 Thanks for writing this up!
One small concern I have is about where all of this is going to live in terms of benchmarking code. We don't want to get into a situation where we need to compare two commits but APIs changed between them, so we can't compare the two easily. The only answer I can see to this, though, is to store all of the benchmarks in the wasmtime repository itself so we can update APIs and examples as the API changes, but I'm not sure how feasible this will be over time if we want larger benchmarks. In any case it might be worth thinking a bit about the stable interface between the benchmark runner and wasmtime? (It may also be too early to think about this.)
accepted/benchmark-infrastructure.md (outdated)
this host for running benchmarks as well, if all stakeholders agree.

We will need an x86-64 host as well; the best option is likely to rent a
low-cost dedicated host. Details of hosting choice and funding are TBD.
FWIW this is what rust-lang/rust does, and I think it's worked out relatively well for them in terms of reliability and cost.
accepted/benchmark-infrastructure.md (outdated)
order to serve this UI. Alternately, if hosting performance becomes an issue,
we could design a static/pregenerated-page workflow wherein the runners upload
data to a git repository and static data is served from GitHub Pages (or
another static host).
For a dynamic website, one of my concerns would be managing deployment and such, but one possibility: I believe Heroku has a free tier which should be more than sufficient for this purpose?
I think we can get pretty far with a static GitHub Pages frontend, with JS to switch between static graph images that were pre-rendered by the benchmark analysis.
@cfallin @alexcrichton ... Not sure I understand the question correctly (specifically, "static data"?). There is a viewer (nuxt.js) that loads a JSON file that serves as the database. Separately, there is another HTML file that does a simple plot of the latest data (it can be extended to allow selecting any history of data, or any specific data). What was attempted before was collecting data on one machine and then scp-ing this updated JSON database/history file to GitHub Pages, which served the same viewer. I don't remember the issues with that, but I can try to duplicate that effort again and go from there.
Oh there's one other thing I thought of just now, which is that we should decide up-front what platforms we would like to benchmark. You mentioned that we happen to have an aarch64 machine, and we can probably get an x86_64 one as well, but there's also a question of whether to include macOS and Windows with what will presumably be Linux benchmarking. The downside to this is that if we have more than one target to benchmark that's multiple physical machines we're managing, and it also makes collection a bit harder since we need to coordinate across systems. That being said though I think we would get a lot of value from benchmarking three targets:
Money permitting, I think having at least aarch64/x86_64 will be very beneficial to the two main backends, and once we have two, the infrastructure to add a third ideally won't be the worst; but we'd have to do the Windows implementation work.
The platform question is an important one; thanks for bringing it up! Diversity of evaluation platforms is certainly important; I had had CPU architecture diversity front-of-mind, but I imagine there will be plenty of unique OS-specific performance concerns in runtime work (WASI bindings, async runtime, exception support, etc.). Perhaps we can "lazily instantiate" platforms as needed -- and keep portability in mind as much as possible as we develop the benchmark runner. Out of curiosity, how difficult was Windows-specific perf testing infra to build in your past experience (Rust) -- is it a MinGW/Cygwin "mostly still Linux" setup, or is there a native Windows runner daemon of some sort?
@abrown and I have talked a bit about the interface between Wasmtime and the runner. The runner would open a dylib that has an expected interface (this is similar to what sightglass currently does, but we would want a slightly different interface). This interface will consist of three functions, one for each thing that we'll measure: compilation, instantiation, and execution. When comparing multiple commits, we would build one of these dylibs for each commit.

Additionally, there is the question of the interface between the Wasm file and Wasmtime. This should basically just be WASI. But, to avoid measuring the I/O used to load (for example) an OpenCV model when all we really care about is the compute portion of the program, we should define a way for the benchmark program to mark off the region we actually want to measure.

Finally, it should be noted that we don't want to take every sample from the same process, because of the risk of introducing measurement bias, so we will need to split the runner so that it has a parent process that occasionally scraps the process that loads the dylib.

If I were to put this into pseudocode (and ignoring that we should randomize the order of benchmarks executed, as detailed in #4), then the runner would look like this:

struct Config {
    // Which commits we are measuring.
    commits_to_measure: Vec<Commit>,
    // The benchmark programs to run.
    benchmarks: Vec<Wasm>,
    // The number of processes to spread samples across.
    num_process_executions: usize,
    // The number of samples to take from each process. So total number of
    // samples for each benchmark program is
    // `num_process_executions * num_child_iterations`.
    num_child_iterations: usize,
}

fn parent_runner(config: Config) {
    for commit in &config.commits_to_measure {
        let dylib = build_dylib(commit);
        for wasm in &config.benchmarks {
            for _ in 0..config.num_process_executions {
                let child = spawn_child_runner(&dylib, wasm, config.num_child_iterations);
                child.wait();
            }
        }
    }
}

And for good measure, here is a diagrammy thing that details the interfaces between each component:

[diagram: interfaces between the parent runner, the child runner, the benchmark dylib, and the Wasm program]
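To make the dylib side of that interface a bit more concrete, here is a minimal sketch of the three entry points the child runner would dlopen and time. The function names and signatures are assumptions for illustration only, not a settled API:

// Hypothetical dylib interface: one exported entry point per measured phase.
// The child runner dlopens this library and times each call; state produced
// by one phase (e.g. the compiled module) would be stashed for the next.

use std::slice;

#[no_mangle]
pub extern "C" fn bench_compile(wasm_ptr: *const u8, wasm_len: usize) {
    // Compile the Wasm module with the engine under test.
    let _wasm = unsafe { slice::from_raw_parts(wasm_ptr, wasm_len) };
    // ... engine-specific compilation goes here ...
}

#[no_mangle]
pub extern "C" fn bench_instantiate() {
    // Instantiate the module produced by `bench_compile`.
}

#[no_mangle]
pub extern "C" fn bench_execute() {
    // Run the instantiated module's entry point (e.g. its WASI `_start`).
}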
@alexcrichton, I think what @fitzgen and I discussed would mean that the benchmarks can live anywhere. My preference would be that all benchmarks that we provide are buildable with a
@fitzgen I have more thoughts on this since we last talked: we also discussed using the Wasm C API as the way to control Wasmtime, and I am leaning toward using that because it eliminates a step. If we use compile/instantiate/execute, we have to rewrap Wasmtime as this dylib, but if we use the C API as the interface, we could either use the output of

I also have a selfish reason for using the C API: I can then reuse this tool internally to do performance analysis on other Wasm engines. As @fitzgen has pointed out previously, we probably don't want to trigger an engine-vs-engine benchmark war but, to me, that is solvable by not publicizing head-to-head results, not by restricting the tool. (I'm working on such a tool at the moment, so I'm interested in all of your thoughts!)
@fitzgen and @abrown what y'all are saying makes sense, and what I'm thinking is that this dylib basically must live in the wasmtime repository for us to maintain it, because the implementation of the dylib may change over time as we tweak the

I think we need to be careful though, because what we choose as an API between the benchmark runner and wasmtime itself dictates what we can and cannot benchmark. For example, I don't think using the existing C API is good for benchmarking calling into wasm. In Rust this would idiomatically use the

If we only want to benchmark the compilation and execution of primarily the wasm itself, then it's probably not so bad though. Instantiation is a little iffy since we would ideally account for the whole cost of instantiation with the cost of creating

@cfallin I agree, yeah, that as long as we have 2 platforms out of the gate to benchmark on, it should be easy enough to add more platforms in the future. Unfortunately AFAIK rust-lang/rust doesn't do any benchmarking on Windows, and I myself don't know how to do anything off the top of my head other than measuring wall-time. I don't think that running a server or implementing a daemon will be too hard though; as long as we stick to the Rust ecosystem, there shouldn't be any portability concerns for stuff we're doing. The only thing I worry about is how to measure the same metrics across platforms. On Linux we can use
This makes sense to me.
Agreed that designing the interface requires some care. (Note: I am personally less interested in micro-benchmarks. They have their place, but I don't think we need to continuously measure them over time as much as we need to do that for overall compilation/instantiation/execution, which have more action at a distance.) I do think, however, that we can largely enable this sort of thing not at the child runner <--> dylib interface level but at the Wasm interface level. For example, if we want to micro-benchmark Wasm->host calls, we can provide a

@abrown rather than swapping engines at the C API level, you can always reimplement the dylib for different engines and, since the proposed dylib interface is so simple, it should be straightforward. I'm thinking less than a hundred lines of code, really. You could also make an implementation of the dylib interface that uses the Wasm C API; then you would only have to write the dylib implementation once, rather than for each engine, and could swap out engines at the C API level again.
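As a sketch of what enabling a micro-benchmark at the Wasm interface level could look like, here is a toy snippet that instantiates a module whose exported function just calls a do-nothing host import in a loop, so the measured execution phase is dominated by Wasm->host call overhead. It assumes the wasmtime and anyhow crates with a recent wasmtime Rust API (which postdates this discussion), and the `bench.nop` import name is made up purely for illustration; none of this is part of the proposed interface:

use anyhow::Result;
use wasmtime::{Engine, Linker, Module, Store};

fn main() -> Result<()> {
    // A benchmark module that imports a no-op host function and calls it in a
    // tight loop; the `bench.nop` import is hypothetical.
    let wat = r#"
        (module
          (import "bench" "nop" (func $nop))
          (func (export "run") (param $iters i64)
            (local $i i64)
            (loop $again
              (call $nop)
              (local.set $i (i64.add (local.get $i) (i64.const 1)))
              (br_if $again (i64.lt_u (local.get $i) (local.get $iters))))))
    "#;

    let engine = Engine::default();
    let module = Module::new(&engine, wat)?;
    let mut store = Store::new(&engine, ());

    // Provide the do-nothing host function that the module imports; this is
    // what the Wasm->host call measurement exercises.
    let mut linker = Linker::new(&engine);
    linker.func_wrap("bench", "nop", || {})?;

    let instance = linker.instantiate(&mut store, &module)?;
    let run = instance.get_typed_func::<i64, ()>(&mut store, "run")?;

    // A runner would time this call (or collect richer metrics around it).
    run.call(&mut store, 1_000_000)?;
    Ok(())
}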
When using wasmtime as a dylib, you should probably set
I think that's a good point, actually, about micro-benchmarks not being too useful in a regression-testing framework! With that in mind I think you're right that the precise API we have is less important than I might have thought, and what you're saying so far makes sense and seems plenty flexible.
@alexcrichton, yeah, I had only really considered whole Wasm programs, and I agree with @fitzgen's comments above; I will add that if we do want to do micro-benchmarks for some reason, though, I have split the recording part of sightglass out separately, so we should be able to measure custom blocks of Rust code accessing Wasmtime directly.

@fitzgen, I think that if Wasmtime provides an easy way of building the benchmarking dylib, e.g.,
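On measuring custom blocks of Rust code: as a rough illustration of the shape that takes, here is a sketch using only std wall-clock timing (the actual split-out sightglass recorder has its own API and records richer metrics; the names below are placeholders):

use std::time::Instant;

// Run a closure and report how long it took; a stand-in for whatever the
// recorder actually collects (cycles, instructions retired, etc.).
fn measure<T>(label: &str, f: impl FnOnce() -> T) -> T {
    let start = Instant::now();
    let result = f();
    println!("{label}: {:?}", start.elapsed());
    result
}

fn main() {
    // Any block of Rust code that drives Wasmtime directly could go here.
    let sum = measure("sum of 0..1_000_000", || (0u64..1_000_000).sum::<u64>());
    println!("result = {sum}");
}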
Motion to finalize with a disposition to merge

Given that discussion here has quiesced and the benchmarking work is well underway (and the sibling RFC #4 that describes the benchmark suite infra has merged), I'm proposing that we merge this RFC. I believe that all feedback has been incorporated into the updated RFC.

Stakeholders sign-off

Tagging all employees of BA-affiliated companies who have committed to the Wasmtime or Lucet repos in the last three months[1], plus anyone who has given feedback on this PR as a stakeholder.

Fastly
Intel
IBM
Mozilla

[1]: Cheatsheet for my own later reference:
Entering Final Call Period
The FCP will end on Mon Mar 8.
(Sorry, copy/paste error with FCP end date; fixed to Mar 8.)
The Final Call Period has elapsed with no other comments or feedback, so I will now merge this RFC. Thanks all!
This RFC proposes establishing a benchmarking infrastructure for Cranelift/wasmtime.
rendered