benchmark: add improvement and t-test in Node #12585
Conversation
Added the capability to calculate the improvement, the p-value, and the confidence directly from Node, for those who want to check the performance impact of their changes without installing R. This is not a replacement for the R scripts, which are still needed for the scatter plots and for more reliable results, but it is a good way to avoid the R dependency when you only want to measure the improvement of your changes. Important note: the p-value calculated by R and the one calculated by the Student's t-test implementation in Node are not equal, but they are similar enough in scale, so the confidence level should be the same. This is due to not knowing exactly how the Student's t-test is implemented in R, and also because of the approximation of the gamma function. The compare.js script works as always, but there is a new parameter, --stats. When this parameter is passed, the CSV lines are not shown in the console, and when all the jobs have been processed the stats are calculated and printed in the console in the same format as the R script produces.
cc @nodejs/benchmarking, @nodejs/performance
pinging @AndreasMadsen as they provided the original R scripts
Implementing a t-test in JavaScript is no easy task; I did it in https://github.com/AndreasMadsen/ttest and it too has its limitations.
If we are to do this, I would like to see it as a CLI tool that a CSV stream can be piped to, just like the R script. Keeping data collection and data processing separate has always been a good strategy for me, and I have always regretted it when I didn't do it. For example, I could easily see this resulting in some x/0 or log(0) error.
I'm wondering if this is worth the effort; I know we are getting a CI server for benchmarking, and installing R on that server will not be a problem. This will take time to develop correctly, and I don't want you to spend time on something that may soon become redundant.
While you do say that this is not meant as a replacement for R, I fear that people will take it as such.
this.df = this.left.size + this.right.size - 2;
this.dfhalf = this.df / 2;
this.mean = this.left.mean - this.right.mean;
this.commonVariance = ((this.left.size - 1) * this.left.variance +
Don't assume equal variance; my investigations show that this is very rare. Use the Welch t-test.
The Welch t-test is more applicable when the samples have different sizes. In our case we always have the same size, so we can assume the Student's t-test is good enough, with less computation time. In my experience, for samples with the same sample size the Student's t-test is very reliable... but that's OK, I can implement Welch.
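For reference, the Welch variant only changes the standard error and the degrees of freedom. A minimal sketch, not the PR's code — the `welch` helper and the `{ mean, variance, size }` sample shape are illustrative assumptions:

```javascript
// Hypothetical sketch of the Welch t statistic, assuming each sample
// is summarized as { mean, variance, size }. Not the PR's actual code.
function welch(left, right) {
  // Squared standard error of the difference of the means.
  const se2 = left.variance / left.size + right.variance / right.size;
  const t = (left.mean - right.mean) / Math.sqrt(se2);
  // Welch–Satterthwaite approximation of the degrees of freedom.
  const df = (se2 * se2) / (
    Math.pow(left.variance / left.size, 2) / (left.size - 1) +
    Math.pow(right.variance / right.size, 2) / (right.size - 1));
  return { t, df };
}
```

Note that when both samples have the same size and the same variance, the Welch degrees of freedom reduce to the Student value 2 * (n - 1), so the Welch form costs little extra and degrades gracefully to the equal-variance case.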
benchmark/t-test.js
Outdated
TTest.prototype.log1p = function(n) {
  return Math.log(n + 1);
};
This is not numerically stable.
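To illustrate the instability: for small `n`, computing `n + 1` rounds away the low bits of `n` before the logarithm is taken, while the built-in `Math.log1p` computes the same quantity accurately:

```javascript
// For tiny n, 1 + n rounds to exactly 1 in double precision, so
// Math.log(n + 1) returns 0 and all the information in n is lost.
const tiny = 1e-17;
console.log(Math.log(tiny + 1)); // 0
console.log(Math.log1p(tiny));   // ≈ 1e-17
```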
benchmark/t-test.js
Outdated
TTest.prototype.logGamma = function(n) {
  return Math.log(Math.abs(this.gamma(n)));
};
This is not numerically stable.
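One concrete problem: `gamma(n)` overflows to Infinity for `n` above roughly 170, so `Math.log(Math.abs(gamma(n)))` breaks long before `lgamma` needs to. A log-domain sketch based on the well-known Lanczos approximation (coefficients for g = 7), restricted to the positive arguments a t-test needs — illustrative only, not the PR's code:

```javascript
// Hypothetical lgamma sketch using the Lanczos approximation (g = 7).
// Intended for z >= 0.5, which covers the half-integer arguments a
// t-test implementation passes in.
const LANCZOS = [
  0.99999999999980993, 676.5203681218851, -1259.1392167224028,
  771.32342877765313, -176.61502916214059, 12.507343278686905,
  -0.13857109526572012, 9.9843695780195716e-6, 1.5056327351493116e-7
];

function lgamma(z) {
  z -= 1;
  let x = LANCZOS[0];
  for (let i = 1; i < LANCZOS.length; i++) x += LANCZOS[i] / (z + i);
  const t = z + 7.5;
  // log(sqrt(2*pi)) + (z + 0.5) * log(t) - t + log(x)
  return 0.5 * Math.log(2 * Math.PI) + (z + 0.5) * Math.log(t) - t + Math.log(x);
}

console.log(lgamma(5));   // ≈ Math.log(24)
console.log(lgamma(200)); // finite, while gamma(200) would overflow to Infinity
```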
  sum += data[i];
}
// Calculate the mean.
var mean = sum / data.length;
This is not numerically stable, might not be an issue.
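If it ever does become an issue, an incremental (Welford-style) update avoids accumulating one large sum. A minimal illustrative sketch, not the PR's code:

```javascript
// Incremental mean: each step nudges the running mean toward the next
// value, so no large intermediate sum is accumulated and precision
// degrades less when the values are large and numerous.
function streamingMean(data) {
  let mean = 0;
  for (let i = 0; i < data.length; i++) {
    mean += (data[i] - mean) / (i + 1);
  }
  return mean;
}
```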
That can only happen if the developer runs compare.js without passing sets, and this case is already covered by the CLI. But we can add a check.
  progress.startQueue(kStartOfQueue);
}

const outputs = [];

function calculateStats() {
Where is this function used?
Already solved. It is called in the recursion, when the last job has ended.
@AndreasMadsen I can make the suggested changes. As I said, this is not a replacement for R, but a way of involving more developers in benchmarking without having to install R. Given that Python is being used in other parts of the Node cycle, perhaps a Python pandas implementation could also be done. Also, I did this in the past to have a lighter CI, and to be able to check benchmarks inside Docker, to be sure of the performance in the environment that will be deployed. If you think this does not make sense, then we can cancel the PR and that's all. On the other hand, if you think it makes sense, I will continue with the development. I ran the test with some modules (util, querystring, ...) and the mean, improvements, and confidence were calculated exactly, but not the p-value. But in my case as a developer, I only need to know the improvement and that this improvement has *** to know that I can rely on it; I have no need of the complete p-value.
I think a more accessible t-test makes sense, although it would be nice to know how big an issue the current situation is. Just know that I spent a week on https://github.com/AndreasMadsen/ttest and it still has issues, to the point where I wouldn't be comfortable using it in https://github.com/nodejs/node. https://github.com/AndreasMadsen/ttest is also quite a bit of code; we shouldn't maintain that much code if accessibility is not a big issue. An alternative solution could be to add a native binding that implements the t-cdf function. I think C++ has all the math functions built in. A Python solution could also work.
The thing about numerical stability is that you often don't notice the issue: the implementation might work well for 95% of the cases, but for the remaining 5% you get incredibly wrong results. The p-value should at least be within
edit: also did you implement
@AndreasMadsen From my point of view we must focus on the real application. The application is to make the benchmarks accessible to the maximum number of developers, avoiding the installation of a third party like R. Once this is done, we can expect more PRs that contain info about the benchmarks, or we can ask contributors to run the benchmarks without asking them to install R. The benchmarks they can provide that way, even without a totally reliable p-value, are good enough to say "LGTM" or not, because the most important things for that are the mean (which is 100% correct) and the confidence about that improvement. Given a change that can affect performance, I think the performance group will check the performance of the change using the R script even if the contributor adds their own benchmark to the commit, so this is something that will never change. But you can have a first "smell" provided in an easy way before this happens. On the other hand, I proposed moving from R to Python pandas because Python is already being used, and the migration could be easy. About betacf: it is a Node translation of the Fortran 77 Numerical Recipes (Cambridge University Press). But taking into account that the number of samples is always the same for both old and new, betacf and incbeta can be mathematically simplified, because always a = b.
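For context, the standard betacf/incbeta route (which is what Numerical Recipes describes, not necessarily the PR's exact code) maps a t statistic to a two-sided p-value via the regularized incomplete beta function: p = I_x(df/2, 1/2) with x = df/(df + t²). A self-contained sketch following the Numerical Recipes continued-fraction scheme; the Lanczos `lgamma` helper is included only to keep the sketch runnable, and all names are illustrative:

```javascript
// Sketch of a t statistic -> two-sided p-value conversion using the
// regularized incomplete beta function (Numerical Recipes style).
// Illustrative only; not the PR's actual implementation.
const LANCZOS = [
  0.99999999999980993, 676.5203681218851, -1259.1392167224028,
  771.32342877765313, -176.61502916214059, 12.507343278686905,
  -0.13857109526572012, 9.9843695780195716e-6, 1.5056327351493116e-7
];

// Log-gamma via the Lanczos approximation (g = 7), for z >= 0.5.
function lgamma(z) {
  z -= 1;
  let x = LANCZOS[0];
  for (let i = 1; i < LANCZOS.length; i++) x += LANCZOS[i] / (z + i);
  const t = z + 7.5;
  return 0.5 * Math.log(2 * Math.PI) + (z + 0.5) * Math.log(t) - t + Math.log(x);
}

// Continued-fraction evaluation for the incomplete beta function.
function betacf(a, b, x) {
  const MAXIT = 200, EPS = 3e-8, FPMIN = 1e-300;
  const qab = a + b, qap = a + 1, qam = a - 1;
  let c = 1;
  let d = 1 - qab * x / qap;
  if (Math.abs(d) < FPMIN) d = FPMIN;
  d = 1 / d;
  let h = d;
  for (let m = 1; m <= MAXIT; m++) {
    const m2 = 2 * m;
    let aa = m * (b - m) * x / ((qam + m2) * (a + m2));
    d = 1 + aa * d; if (Math.abs(d) < FPMIN) d = FPMIN;
    c = 1 + aa / c; if (Math.abs(c) < FPMIN) c = FPMIN;
    d = 1 / d;
    h *= d * c;
    aa = -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2));
    d = 1 + aa * d; if (Math.abs(d) < FPMIN) d = FPMIN;
    c = 1 + aa / c; if (Math.abs(c) < FPMIN) c = FPMIN;
    d = 1 / d;
    const del = d * c;
    h *= del;
    if (Math.abs(del - 1) < EPS) break;
  }
  return h;
}

// Regularized incomplete beta function I_x(a, b).
function incbeta(a, b, x) {
  if (x <= 0) return 0;
  if (x >= 1) return 1;
  const bt = Math.exp(lgamma(a + b) - lgamma(a) - lgamma(b) +
                      a * Math.log(x) + b * Math.log(1 - x));
  // Use the continued fraction directly, or via the symmetry relation,
  // whichever converges faster.
  if (x < (a + 1) / (a + b + 2)) return bt * betacf(a, b, x) / a;
  return 1 - bt * betacf(b, a, 1 - x) / b;
}

// Two-sided p-value for a t statistic with df degrees of freedom.
function pValue(t, df) {
  return incbeta(df / 2, 0.5, df / (df + t * t));
}
```

Sanity checks worth wiring into a test: t = 0 must give p = 1, and a larger |t| must always give a smaller p for the same df.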
I'm a stats noob. How does this differ from the R script? Can the R script be rewritten in JS or Python or C++?
@TimothyGu It differs a lot, because R is a great tool for statistics, with statistical functions out of the box, and of course the plots. If you want complete benchmark information and evolution, then you need a tool like R. The R scripts could be rewritten in Python pandas, IMO without too much trouble, but not in JS or C++, because those are not stats-focused languages and the algorithms are very complicated to develop with a reliable error. This PR is not about replacing R; it is about taking into account that not everyone is an expert in stats, R, or Python pandas, and that having something more oriented to the developer can be a win, because it can engage more of them to include benchmarks in their PRs, instead of having @mscdex run them again and again, even when the improvement is negative. But that's OK: we can think that we live in an ideal world where every developer who opens a PR has R installed on their machine and understands the meaning of the p-value. Or we can accept that this ideal world does not exist and try to bring more tools to the developers outside the data-science box.
@mscdex In your experience, how big of a problem is it that R is not installed? In my experience R is not the problem; the problem is education and time.
R is preinstalled on Mac, as easy to install on Linux as a Python package, and on Windows it is not super difficult, and the benchmarks themselves are already problematic because of wrk. I would think that anyone who has the skills to optimize Node.js will have the skills to install R. If they don't install R, it is because they don't think statistics is important in the first place, but this is purely a belief.
@AndreasMadsen Well, you don't have to convince me, I'm already an R fan, though I'm moving to Python because I prefer it for TensorFlow. But as I said, if you don't see the point, we can remove the PR and all is done! :) I would like to propose, outside of this PR, moving from R to Python, because it would reduce the number of tools required for the Node ecosystem, and Python should give you results and plots as good as the R ones.
This is the point you keep bringing up, I have already addressed it:
If there is something in this that you find confusing or wrong then please refer to that directly.
Let's try that.
I'm 👎 on removing the CSV step. We should really have a CSV step so that we can run things at different times; otherwise, debugging these scripts is extremely hard.
I will leave to @AndreasMadsen for the stats.
There was no update for a long time and I feel this will not land overall due to the -1. Therefore I am going to close this. @jseijas please feel free to reopen if you would like to follow up on this!
Checklist

- make -j4 test (UNIX) or vcbuild test (Windows) passes

Affected core subsystem(s)