benchmark: add improvement and t-test in Node #12585

Closed

Conversation

@jseijas commented on Apr 22, 2017

Added the capability to calculate the improvement, the p-value and the
confidence directly from Node, for those who want to check the
performance impact of their changes without installing R.

This is not a replacement for the R scripts, which are still needed for the
scatter plots and for more reliable results. But it is a good way to
avoid the R dependency when you only want to measure the improvement
of your changes. Important note: the p-value calculated by R and the
one calculated by the Student's t-test implementation in Node are not
equal, but they are similar enough in scale, so the confidence should
be the same. This is because it is not known exactly how the
Student's t-test is implemented in R, and also because of the approximation
of the gamma function.

The compare.js script works as before, but there is a new parameter,
--stats. When this parameter is added, the CSV lines are not shown in
the console; once all the jobs have been processed, the stats are
calculated and printed to the console in the same format as the R script
produces.
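A sketch of how this could be invoked (the --old/--new flags and the category filter already exist in compare.js; --stats is the flag added by this PR, and the binary paths are just placeholders):

```
# Run the string_decoder benchmarks against two builds and print the
# improvement/confidence table directly, without piping CSV into R.
node benchmark/compare.js --old ./node-old --new ./node-new --stats string_decoder
```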

Checklist
  • make -j4 test (UNIX), or vcbuild test (Windows) passes
  • commit message follows commit guidelines
Affected core subsystem(s)

@nodejs-github-bot added the benchmark (Issues and PRs related to the benchmark subsystem) label on Apr 22, 2017
@vsemozhetbyt (Contributor)

cc @nodejs/benchmarking, @nodejs/performance

@mscdex (Contributor) commented on Apr 22, 2017

pinging @AndreasMadsen as they provided the original R scripts

@AndreasMadsen (Member) left a comment

Implementing a t-test in JavaScript is no easy task; I did it in https://github.com/AndreasMadsen/ttest and it too has its limitations.

If we are to do this, I would like to see it as a CLI tool that a CSV stream can be piped to, just like the R script. Keeping data collection and data processing separate has always been a good strategy for me, and I have always regretted it when I didn't do it. For example, I could easily see this resulting in some x/0 or log(0) error.
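For reference, the existing R-based flow keeps those two stages separate, roughly like this (the binary paths are placeholders):

```
# Stage 1: data collection - compare.js emits CSV.
node benchmark/compare.js --old ./node-old --new ./node-new string_decoder > compare.csv
# Stage 2: data processing - the CSV is piped into the R script.
cat compare.csv | Rscript benchmark/compare.R
```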

I'm wondering if this is worth the effort; I know we are getting a CI server for benchmarking, and installing R on that server will not be a problem. This will take time to develop correctly, and I don't want you to spend time on something that soon becomes redundant.

While you do say that this is not meant as a replacement for R, I fear that people will take it as such.

this.df = this.left.size + this.right.size - 2;
this.dfhalf = this.df / 2;
this.mean = this.left.mean - this.right.mean;
this.commonVariance = ((this.left.size - 1) * this.left.variance +
@AndreasMadsen (Member) commented on Apr 22, 2017

Don't assume equal variance; my investigations show that this is very rare. Use the Welch t-test.

@jseijas (Author)

The Welch t-test is more applicable when the samples have different sizes. In our case we always have the same size, so we can assume that Student's t-test is good enough, with less computation time. In my experience, for samples with the same sample size, Student's t-test is very reliable... but that's OK, I can implement Welch.
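For reference, a minimal sketch of the Welch statistic and its Welch–Satterthwaite degrees of freedom, assuming `left`/`right` objects with `mean`, `variance` and `size` fields as in the snippet above; it only covers the statistic, not the p-value:

```
// Welch's t statistic: does not assume equal variances.
function welch(left, right) {
  const seLeft = left.variance / left.size;
  const seRight = right.variance / right.size;
  const se = seLeft + seRight;

  // t statistic for the difference of the two means.
  const t = (left.mean - right.mean) / Math.sqrt(se);

  // Welch–Satterthwaite approximation of the degrees of freedom.
  const df = (se * se) / (
    (seLeft * seLeft) / (left.size - 1) +
    (seRight * seRight) / (right.size - 1)
  );

  return { t, df };
}
```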

};

TTest.prototype.log1p = function(n) {
return Math.log(n + 1);
@AndreasMadsen (Member)

This is not numerically stable.
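A side note on what a stable version could look like: ES2015 already ships a built-in for this, so the helper could simply delegate to it (a sketch, not code from this PR):

```
// Math.log1p(n) computes log(1 + n) accurately even for tiny n, where
// Math.log(n + 1) loses precision because 1 + n rounds toward 1.
TTest.prototype.log1p = function(n) {
  return Math.log1p(n);
};
```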

};

TTest.prototype.logGamma = function(n) {
return Math.log(Math.abs(this.gamma(n)));
@AndreasMadsen (Member)

This is not numerically stable.
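For reference, a stable approach is to compute log-gamma directly instead of going through gamma(); a minimal sketch using the widely published Lanczos (g = 5) coefficients, valid for x > 0, not code from this PR:

```
// Log-gamma via the Lanczos approximation. Avoids computing gamma(x)
// itself, which overflows for x > ~171 and loses precision for large x.
const LANCZOS = [
  76.18009172947146, -86.50532032941677, 24.01409824083091,
  -1.231739572450155, 0.1208650973866179e-2, -0.5395239384953e-5
];

function logGamma(x) {
  let y = x;
  let tmp = x + 5.5;
  tmp -= (x + 0.5) * Math.log(tmp);
  let ser = 1.000000000190015;
  for (let j = 0; j < 6; j++) {
    ser += LANCZOS[j] / ++y;
  }
  return -tmp + Math.log(2.5066282746310005 * ser / x);
}
```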

sum += data[i];
}
// Calculate the mean.
var mean = sum / data.length;
@AndreasMadsen (Member)

This is not numerically stable; it might not be an issue.
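A sketch of what a stabler one-pass mean could look like (a Welford-style running mean; not code from this PR):

```
// Updates the mean incrementally, avoiding the large intermediate sum
// that sum / data.length accumulates.
function runningMean(data) {
  let mean = 0;
  for (let i = 0; i < data.length; i++) {
    mean += (data[i] - mean) / (i + 1);
  }
  return mean;
}
```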

@jseijas (Author)

That can only happen if the developer runs compare.js without passing sets, and this case is already covered by the CLI. But we can add a check.

progress.startQueue(kStartOfQueue);
}

const outputs = [];

function calculateStats() {
@AndreasMadsen (Member)

Where is this function used?

@jseijas (Author)

Already solved. It is called from the recursion, when the last job has ended.

@jseijas (Author) commented on Apr 23, 2017

@AndreasMadsen I can make the suggested changes. As I said, this is not a replacement for R, but a way of involving more developers in benchmarking without having to install R. Given that Python is already used in other parts of the Node cycle, perhaps a Python pandas implementation could also be done.

Also, I did this in the past to have a lighter CI and to be able to check benchmarks inside Docker, to be sure of the performance in the environment where the code will be deployed.

If you think this does not make sense, then we can cancel the PR and that's all. On the other hand, if you think it makes sense, then I will continue with the development.

I ran the test with some modules (util, querystring, ...) and the mean, improvements and confidence were calculated exactly, but not the p-value. But in my case as a developer, I only need to know the improvement and that this improvement has *** to know that I can rely on it; there is no need for the complete p-value.

@AndreasMadsen (Member) commented on Apr 24, 2017

If you think this does not make sense, then we can cancel the PR and that's all. On the other hand, if you think it makes sense, then I will continue with the development.

I think a more accessible t-test makes sense, although it would be nice to know how big an issue the current solution is.

Just know that I spent a week on https://github.com/AndreasMadsen/ttest and it still has issues, to the point where I wouldn't be comfortable using it in https://github.com/nodejs/node. https://github.com/AndreasMadsen/ttest is also quite a bit of code; we shouldn't maintain that much code if accessibility is not a big issue.

An alternative solution could be to add a native binding that implements the t-cdf function. I think C++ has all the math functions built in. A Python solution could also work.

I ran the test with some modules (util, querystring, ...) and the mean, improvements and confidence were calculated exactly, but not the p-value. But in my case as a developer, I only need to know the improvement and that this improvement has *** to know that I can rely on it; there is no need for the complete p-value.

The thing about numerical stability is that you often don't notice the issue; the implementation might work well for 95% of the cases, but for the remaining 5% you get incredibly wrong results.

The p-value should at least be within 1e-6 error; any higher than that and something is logically wrong, it is not just a matter of numerical precision. I would suspect most of the error is caused by the equal-variance assumption.


edit: also, did you implement betacf yourself? If not, we need to make sure we are not violating some copyright constraints.

@jseijas (Author) commented on Apr 26, 2017

@AndreasMadsen From my point of view we must focus on the real application, which is to make the benchmarks accessible to the maximum number of developers, avoiding the installation of a third-party tool like R. Once this is done, we can expect more PRs that include benchmark info, and we can ask contributors to run the benchmarks without asking them to install R. The benchmark results they can provide that way, even without a totally reliable p-value, are good enough to say "LGTM" or not, because the most important inputs for that decision are the mean (which is 100% correct) and the confidence about the improvement.

Given a change that can affect performance, I think the performance group will check its impact using the R script anyway, even if the contributor adds his own benchmark results to the commit. So that is something that will never change. But you can get a first "smell" in an easy way before that happens.

On the other hand, I proposed moving from R to Python pandas because Python is already being used, and the migration could be easy.

About betacf: it is a Node translation of the Fortran 77 code from Numerical Recipes (Cambridge University Press). But after taking into account that the number of samples is always the same for both old and new, betacf and incbeta can be mathematically simplified, because a = b always holds.

@TimothyGu (Member)

I'm a stats noob. How does this differ from the R script? Can the R script be rewritten in JS or Python or C++?

@jseijas (Author) commented on Apr 26, 2017

@TimothyGu It differs a lot, because R is a great tool for statistics, with statistical functions out of the box, and of course the plots. If you want complete benchmark information and its evolution, then you need a tool like R.

The R scripts could be rewritten in Python pandas, IMO without too much trouble. But not in JS or C++, because those are not statistics-focused languages, and the algorithms are very complicated to implement with a reliable error.

This PR is not about replacing R; it is about recognizing that not everyone is an expert in stats, R or Python pandas, and that having something more oriented to the developer can be a win, because it can engage more of them to include benchmarks in their PRs, instead of having @mscdex run them over and over again, even when the improvement is negative.

But that's OK, we can assume that we live in an ideal world where every developer that opens a PR has R installed on their machine, and all of them understand the meaning of the p-value. Or we can accept that this ideal world does not exist and try to bring more tools to developers outside the data-science box.

@AndreasMadsen (Member)

@mscdex In your experience, how big of a problem is it that R is not installed?


In my experience R is not the problem; the problem is education and time.

  • Few people understand that they can't just make decisions by eyeballing the numbers. This turns out to be a fault of the biological brain.

  • Running the benchmarks once takes a very long time; running them 60 times can appear unreasonable to some.

R is preinstalled on Mac, as easy to install on Linux as a Python package, and on Windows it is not super difficult, and the benchmarks themselves are already problematic there because of wrk. I would think that anyone who has the skills to optimize Node.js will have the skills to install R. If they don't install R, it is because they don't think statistics is important in the first place, but this is purely a belief.

@jseijas (Author) commented on Apr 26, 2017

@AndreasMadsen Well, you don't have to convince me, I'm already an R fan, although I'm moving to Python because I prefer it for TensorFlow.
I didn't know that it is preinstalled on Mac; on mine I installed it with brew.
But understand me: nobody who is not involved in data science looks at the p-value; people tend to focus on the improvement/confidence. So a Node t-test seems good enough for this.

But as I said, if you don't see the point, we can close the PR and all is done! :)

But I would like to propose, outside of this PR, moving from R to Python, because you would be reducing the number of tools required for the Node ecosystem. And Python should give you results and plots as good as the R ones.

@AndreasMadsen (Member)

But understand me: nobody who is not involved in data science looks at the p-value; people tend to focus on the improvement/confidence. So a Node t-test seems good enough for this.

This is the point you keep bringing up; I have already addressed it:

The thing about numerical stability is that you often don't notice the issue; the implementation might work well for 95% of the cases, but for the remaining 5% you get incredibly wrong results.

The p-value should at least be within 1e-6 error; any higher than that and something is logically wrong, it is not just a matter of numerical precision. I would suspect most of the error is caused by the equal-variance assumption.

If there is something in this that you find confusing or wrong, then please refer to that directly.


But I would like to propose, outside of this PR, moving from R to Python, because you would be reducing the number of tools required for the Node ecosystem.

Let's try that.

@mcollina (Member) left a comment

I'm 👎 on removing the CSV step. We should really have a CSV step so that we can run things at different times. Otherwise debugging these scripts is extremely hard.

I will leave the stats to @AndreasMadsen.

@BridgeAR (Member)

There has been no update for a long time and I feel this will not land overall due to the -1, therefore I am going to close this. @jseijas, please feel free to reopen if you would like to follow up on this!
