benchmark: add improvement and t-test in Node #12585

Closed

Conversation

@jseijas commented on Apr 22, 2017

Added the capability to calculate the improvement, the p-value and the
confidence directly from Node, for those who want to check the
performance impact of their changes without installing R.

This is not a replacement for the R scripts, which are still needed for the
scatter plots and for more reliable results. But it is a good way to
avoid the R dependency when you only want to measure the improvement
of your changes. Important note: the p-value calculated by R and the
one calculated by the Student's t-test implementation in Node are not
equal, but they are similar enough in scale, so the confidence should
be the same. This is because it is not known exactly how the
Student's t-test is implemented in R, and also because of the approximation
of the gamma function.

The compare.js script works as before, but there is a new parameter,
--stats. When this parameter is added, the CSV lines are not shown in
the console; once all the jobs have been processed, the stats are
calculated and printed to the console in the same format as the R script
produces.
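A sketch of how this could be invoked (the --old/--new flags and the category filter already exist in compare.js; --stats is the flag added by this PR, and the binary paths are just placeholders):

```
# Run the string_decoder benchmarks against two builds and print the
# improvement/confidence table directly, without piping CSV into R.
node benchmark/compare.js --old ./node-old --new ./node-new --stats string_decoder
```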

Checklist
  • make -j4 test (UNIX), or vcbuild test (Windows) passes
  • commit message follows commit guidelines
Affected core subsystem(s)

@nodejs-github-bot added the benchmark (Issues and PRs related to the benchmark subsystem) label on Apr 22, 2017
@vsemozhetbyt (Contributor)

cc @nodejs/benchmarking, @nodejs/performance

@mscdex (Contributor) commented on Apr 22, 2017

pinging @AndreasMadsen as they provided the original R scripts

@AndreasMadsen (Member) left a comment

Implementing a t-test in JavaScript is no easy task; I did it in https://github.com/AndreasMadsen/ttest and it too has its limitations.

If we are to do this, I would like to see it as a CLI tool that a CSV stream can be piped to, just like the R script. Keeping data collection and data processing separate has always been a good strategy for me, and I have always regretted it when I didn't do it. For example, I could easily see this resulting in some x/0 or log(0) error.
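For reference, the existing R-based flow keeps those two stages separate, roughly like this (the binary paths are placeholders):

```
# Stage 1: data collection - compare.js emits CSV.
node benchmark/compare.js --old ./node-old --new ./node-new string_decoder > compare.csv
# Stage 2: data processing - the CSV is piped into the R script.
cat compare.csv | Rscript benchmark/compare.R
```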

I'm wondering if this is worth the effort; I know we are getting a CI server for benchmarking, and installing R on that server will not be a problem. This will take time to develop correctly, and I don't want you to spend time on something that soon becomes redundant.

While you do say that this is not meant as a replacement for R, I fear that people will take it as such.

this.df = this.left.size + this.right.size - 2;
this.dfhalf = this.df / 2;
this.mean = this.left.mean - this.right.mean;
this.commonVariance = ((this.left.size - 1) * this.left.variance +
@AndreasMadsen (Member) commented on Apr 22, 2017

Don't assume equal variance; my investigations show that this is very rare. Use the Welch t-test.

@jseijas (Author)

The Welch t-test is more applicable when the samples have different sizes. In our case we always have the same size, so we can assume that Student's t-test is good enough, with less computation time. In my experience, for samples with the same sample size, Student's t-test is very reliable... but that's OK, I can implement Welch.
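For reference, a minimal sketch of the Welch statistic and its Welch–Satterthwaite degrees of freedom, assuming `left`/`right` objects with `mean`, `variance` and `size` fields as in the snippet above; it only covers the statistic, not the p-value:

```
// Welch's t statistic: does not assume equal variances.
function welch(left, right) {
  const seLeft = left.variance / left.size;
  const seRight = right.variance / right.size;
  const se = seLeft + seRight;

  // t statistic for the difference of the two means.
  const t = (left.mean - right.mean) / Math.sqrt(se);

  // Welch–Satterthwaite approximation of the degrees of freedom.
  const df = (se * se) / (
    (seLeft * seLeft) / (left.size - 1) +
    (seRight * seRight) / (right.size - 1)
  );

  return { t, df };
}
```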

};

TTest.prototype.log1p = function(n) {
return Math.log(n + 1);
@AndreasMadsen (Member)

This is not numerically stable.
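A side note on what a stable version could look like: ES2015 already ships a built-in for this, so the helper could simply delegate to it (a sketch, not code from this PR):

```
// Math.log1p(n) computes log(1 + n) accurately even for tiny n, where
// Math.log(n + 1) loses precision because 1 + n rounds toward 1.
TTest.prototype.log1p = function(n) {
  return Math.log1p(n);
};
```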

};

TTest.prototype.logGamma = function(n) {
return Math.log(Math.abs(this.gamma(n)));
@AndreasMadsen (Member)

This is not numerically stable.
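For reference, a stable approach is to compute log-gamma directly instead of going through gamma(); a minimal sketch using the widely published Lanczos (g = 5) coefficients, valid for x > 0, not code from this PR:

```
// Log-gamma via the Lanczos approximation. Avoids computing gamma(x)
// itself, which overflows for x > ~171 and loses precision for large x.
const LANCZOS = [
  76.18009172947146, -86.50532032941677, 24.01409824083091,
  -1.231739572450155, 0.1208650973866179e-2, -0.5395239384953e-5
];

function logGamma(x) {
  let y = x;
  let tmp = x + 5.5;
  tmp -= (x + 0.5) * Math.log(tmp);
  let ser = 1.000000000190015;
  for (let j = 0; j < 6; j++) {
    ser += LANCZOS[j] / ++y;
  }
  return -tmp + Math.log(2.5066282746310005 * ser / x);
}
```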

sum += data[i];
}
// Calculate the mean.
var mean = sum / data.length;
@AndreasMadsen (Member)

This is not numerically stable; it might not be an issue.
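A sketch of what a stabler one-pass mean could look like (a Welford-style running mean; not code from this PR):

```
// Updates the mean incrementally, avoiding the large intermediate sum
// that sum / data.length accumulates.
function runningMean(data) {
  let mean = 0;
  for (let i = 0; i < data.length; i++) {
    mean += (data[i] - mean) / (i + 1);
  }
  return mean;
}
```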

@jseijas (Author)

That can only happen if the developer runs compare.js without passing sets, and this case is already covered by the CLI. But we can add a check.

progress.startQueue(kStartOfQueue);
}

const outputs = [];

function calculateStats() {
@AndreasMadsen (Member)

Where is this function used?

@jseijas (Author)

Already solved. It is called from the recursion, when the last job has ended.

@jseijas (Author) commented on Apr 23, 2017

@AndreasMadsen I can make the suggested changes. As I said, this is not a replacement for R, but a way of involving more developers in benchmarking without having to install R. Given that Python is already used in other parts of the Node cycle, perhaps a Python pandas implementation could also be done.

Also, I did this in the past to have a lighter CI and to be able to check benchmarks inside Docker, to be sure of the performance in the environment where the code will be deployed.

If you think this does not make sense, then we can cancel the PR and that's all. On the other hand, if you think it makes sense, then I will continue with the development.

I ran the test with some modules (util, querystring, ...) and the mean, improvements and confidence were calculated exactly, but not the p-value. But in my case as a developer, I only need to know the improvement and that this improvement has *** to know that I can rely on it; there is no need for the complete p-value.

@AndreasMadsen (Member) commented on Apr 24, 2017

If you think this does not make sense, then we can cancel the PR and that's all. On the other hand, if you think it makes sense, then I will continue with the development.

I think a more accessible t-test makes sense, although it would be nice to know how big an issue the current solution is.

Just know that I spent a week on https://github.com/AndreasMadsen/ttest and it still has issues, to the point where I wouldn't be comfortable using it in https://github.com/nodejs/node. https://github.com/AndreasMadsen/ttest is also quite a bit of code; we shouldn't maintain that much code if accessibility is not a big issue.

An alternative solution could be to add a native binding that implements the t-cdf function. I think C++ has all the math functions built in. A Python solution could also work.

I ran the test with some modules (util, querystring, ...) and the mean, improvements and confidence were calculated exactly, but not the p-value. But in my case as a developer, I only need to know the improvement and that this improvement has *** to know that I can rely on it; there is no need for the complete p-value.

The thing about numerical stability is that you often don't notice the issue; the implementation might work well for 95% of the cases, but for the remaining 5% you get incredibly wrong results.

The p-value should at least be within 1e-6 error; any higher than that and something is logically wrong, it is not just a matter of numerical precision. I would suspect most of the error is caused by the equal-variance assumption.


edit: also, did you implement betacf yourself? If not, we need to make sure we are not violating some copyright constraints.

@jseijas (Author) commented on Apr 26, 2017

@AndreasMadsen From my point of view we must focus on the real application, which is to make the benchmarks accessible to the maximum number of developers, avoiding the installation of a third-party tool like R. Once this is done, we can expect more PRs that include benchmark info, and we can ask contributors to run the benchmarks without asking them to install R. The benchmark results they can provide that way, even without a totally reliable p-value, are good enough to say "LGTM" or not, because the most important inputs for that decision are the mean (which is 100% correct) and the confidence about the improvement.

Given a change that can affect performance, I think the performance group will check its impact using the R script anyway, even if the contributor adds his own benchmark results to the commit. So that is something that will never change. But you can get a first "smell" in an easy way before that happens.

On the other hand, I proposed moving from R to Python pandas because Python is already being used, and the migration could be easy.

About betacf: it is a Node translation of the Fortran 77 code from Numerical Recipes (Cambridge University Press). But after taking into account that the number of samples is always the same for both old and new, betacf and incbeta can be mathematically simplified, because a = b always holds.

@TimothyGu (Member)

I'm a stats noob. How does this differ from the R script? Can the R script be rewritten in JS or Python or C++?

@jseijas (Author) commented on Apr 26, 2017

@TimothyGu It differs a lot, because R is a great tool for statistics, with statistical functions out of the box, and of course the plots. If you want complete benchmark information and its evolution, then you need a tool like R.

The R scripts could be rewritten in Python pandas, IMO without too much trouble. But not in JS or C++, because those are not statistics-focused languages, and the algorithms are very complicated to implement with a reliable error.

This PR is not about replacing R; it is about recognizing that not everyone is an expert in stats, R or Python pandas, and that having something more oriented to the developer can be a win, because it can engage more of them to include benchmarks in their PRs, instead of having @mscdex run them over and over again, even when the improvement is negative.

But that's OK, we can assume that we live in an ideal world where every developer that opens a PR has R installed on their machine, and all of them understand the meaning of the p-value. Or we can accept that this ideal world does not exist and try to bring more tools to developers outside the data-science box.

@AndreasMadsen (Member)

@mscdex In your experience, how big of a problem is it that R is not installed?


In my experience R is not the problem; the problem is education and time.

  • Few people understand that they can't just make decisions by eyeballing the numbers. This turns out to be a fault of the biological brain.

  • Running the benchmarks once takes a very long time; running them 60 times can appear unreasonable to some.

R is preinstalled on Mac, as easy to install on Linux as a Python package, and on Windows it is not super difficult, and the benchmarks themselves are already problematic there because of wrk. I would think that anyone who has the skills to optimize Node.js will have the skills to install R. If they don't install R, it is because they don't think statistics is important in the first place, but this is purely a belief.

@jseijas (Author) commented on Apr 26, 2017

@AndreasMadsen Well, you don't have to convince me, I'm already an R fan, although I'm moving to Python because I prefer it for TensorFlow.
I didn't know that it is preinstalled on Mac; on mine I installed it with brew.
But understand me: nobody who is not involved in data science looks at the p-value; people tend to focus on the improvement/confidence. So a Node t-test seems good enough for this.

But as I said, if you don't see the point, we can close the PR and all is done! :)

But I would like to propose, outside of this PR, moving from R to Python, because you would be reducing the number of tools required for the Node ecosystem. And Python should give you results and plots as good as the R ones.

@AndreasMadsen (Member)

But understand me: nobody who is not involved in data science looks at the p-value; people tend to focus on the improvement/confidence. So a Node t-test seems good enough for this.

This is the point you keep bringing up; I have already addressed it:

The thing about numerical stability is that you often don't notice the issue; the implementation might work well for 95% of the cases, but for the remaining 5% you get incredibly wrong results.

The p-value should at least be within 1e-6 error; any higher than that and something is logically wrong, it is not just a matter of numerical precision. I would suspect most of the error is caused by the equal-variance assumption.

If there is something in this that you find confusing or wrong, then please refer to that directly.


But I would like to propose, outside of this PR, moving from R to Python, because you would be reducing the number of tools required for the Node ecosystem.

Let's try that.

@mcollina (Member) left a comment

I'm 👎 on removing the CSV step. We should really have a CSV step so that we can run things at different times. Otherwise debugging these scripts is extremely hard.

I will leave the stats to @AndreasMadsen.

@BridgeAR (Member)

There has been no update for a long time and I feel this will not land overall due to the -1, therefore I am going to close this. @jseijas, please feel free to reopen if you would like to follow up on this!
