proposal: math: add Mean, Median, Mode, Variance, and StdDev #69264
Comments
In general the math package aims to provide the functions that are in the C++ standard library.
Thanks for the feedback! I get that the math package is meant to mirror the functions in C++'s standard library…
Can you do some detective work to see how people are dealing with this in open source Go now? Is there some go-stats package that has a million stars on GitHub? Are there ten libraries that are each imported five hundred times? Seeing that something has big demand already is important for bringing something that could be in a third-party library into the standard library. Otherwise this will just get closed with "write a third party library." Which has certainly happened to me more than once!
I’ve done some digging into how statistical functions are currently being handled in the Go community. While libraries like Gonum and others provide statistical methods, there's no single source of truth or dominant package in this space, and many are designed for more complex or specialized tasks. However, the basic statistical functions we're proposing—Mean, Median, Mode, Variance, and StdDev—are much simpler. By integrating these into the standard library, we'd eliminate the need for external dependencies for basic tasks, which is in line with Go's philosophy of having a strong standard library for common use cases. While third-party packages are an option, including these functions in the standard library would make these everyday calculations available without extra dependencies.
This is the part where we need to see evidence. Especially considering the existence of libraries like gonum: how often does the need arise for functions like those proposed, where you wouldn't need the extra functionality that other libraries provide?
For what it's worth, Python has a statistics package in its standard library: https://docs.python.org/3/library/statistics.html It would be nice to have a simple package everyone agrees on for common use cases, but that doesn't necessarily need to be in std.
These functions sound pretty simple, but I think there's actually a lot of subtlety here. For instance, what does …
In my experience, every time some numeric problem comes up, the gonum lib is suggested. They have a stats package: https://pkg.go.dev/gonum.org/v1/gonum@v0.15.1/stat
Yeah, so think about having its functionality in the Go std lib straight away!
The Gonum library is indeed often suggested for statistical and numerical work in Go, and it has a dedicated stat package. However, my proposal is focused on adding foundational statistical functions like Mean, Median, Mode, Variance, and StdDev.
IMHO these functions would be very useful in the standard library, even if (or indeed, because) the implementation requires some care. There are many "quick" uses of these basic stats operations in testing, benchmarking, and writing CL descriptions that shouldn't require a heavyweight dependency on a fully-featured third-party stats library. (I often end up moving data out of my Go program to the shell and running the github.com/nferraz/st command.) Another function I would like is Percentile(n, series), which reports the nth percentile value of a given series.
If it belongs in …
Here is a small experience report with existing stats packages: in some code I was using gonum's stats package, and a collaborator started using github.com/montanaflynn/stats as well, whose API returns an error (which I felt was annoying). Luckily, I caught the unnecessary dependency in code review. These are the types of things that can easily cause unnecessary dependencies to get added in projects. Hence, I think adding common statistics functions would be a great addition to the std.
It seems like a lot of developers will benefit from this!
Can I know the update on this proposal?
The proposal review committee will likely look at it this week. It usually takes a few rounds to reach a final decision.
OK, cool!
Can I know the update on this proposal, please?
Sorry, we didn't get to it last week, but perhaps we will this week.
Yes, please…
Some of the questions raised in the meeting were: …
Thanks for the feedback! I totally get the concerns, and here's my take: …

Overall, I think this keeps the package lightweight, practical, and easy to use, which should be the priority. Looking forward to hearing your thoughts!
I think the goal of limiting the scope would be to ensure that these (other than Percentile) are not potential additions. ;-) I agree that a slice result for Mode seems appropriate. Perhaps it should be called Modes.
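For concreteness, a minimal sketch of what a slice-returning Modes could look like (the name, signature, and NaN behavior here are assumptions, not anything agreed in this thread):

```go
import "slices"

// Modes returns the values that appear most frequently in x, in
// ascending order, or nil if x is empty. One of the subtleties noted
// earlier shows up immediately: since NaN != NaN, each NaN occupies
// its own map entry here.
func Modes(x []float64) []float64 {
	if len(x) == 0 {
		return nil
	}
	counts := make(map[float64]int, len(x))
	best := 0
	for _, v := range x {
		counts[v]++
		if counts[v] > best {
			best = counts[v]
		}
	}
	var modes []float64
	for v, n := range counts {
		if n == best {
			modes = append(modes, v)
		}
	}
	slices.Sort(modes) // map iteration order is random; sort for determinism
	return modes
}
```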
Can I know the status?
This is the status: …
Can I receive any updates, at least weekly?
@hemanth0525 I appreciate this issue is important to you. Please understand that we have over 700 proposals waiting for attention, as can be seen at https://github.com/orgs/golang/projects/17. It's not feasible for our small team to provide weekly updates for each separate proposal. You can track the proposal review activities at #33502.
@ianlancetaylor Yes, I get it. Thanks!
What's the scope?

The scope of this package should be fairly narrow. If you search for "basic descriptive statistics", basically all results include mean, median, mode, and standard deviation. Variance is also common. "Range" is pretty common, but that's easy to get with the existing slices.Min and slices.Max. The Python statistics package is an interesting example here (thanks @jimmyfrasche), as it aims to be a small collection of common operations. However, I think it actually goes too far. I was particularly surprised to see kernel density estimation in there, as I consider that, and especially picking good KDE parameters, a fairly advanced statistical method.

Which package?
Overall I'm leaning toward math/stats.

Operations

Quantile: I personally find myself wanting quantiles quite often, so this is certainly tempting. We should get a statistics expert to weigh in on which definition to use. I do think this should be "quantile" and not "percentile".

Variance and standard deviation: Are these for populations or do they apply sample correction? Do we provide both a population form and a sample-corrected form (this is what Python does)? If we're going to provide sample forms, which of the various corrections do we use?

Mode: I'm not completely convinced that we should include mode. If we do, I'd suggest only including "multimode", which returns a possibly-nil slice, as this is a total function, unlike mode.
Meaning the parameter should be in [0,1] not [0,100]? Or that one should provide lower and upper bounds for the portion of the CDF of interest?
I would think that population is more in line with the typical use of such a package, but it may be safer to provide both with distinct names, preventing casual use of the wrong one. The doc comments should provide clear examples of which one is appropriate.
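For what it's worth, the two forms differ only in the divisor: the population form divides by n, the sample-corrected (Bessel) form by n-1. A sketch with distinct names, in the spirit of the comment above (all names hypothetical):

```go
// sumSquaredDeviations is shared by both variance forms below.
// Empty input yields NaN in the callers (0/0).
func sumSquaredDeviations(x []float64) (ss float64, n int) {
	var sum float64
	for _, v := range x {
		sum += v
	}
	mean := sum / float64(len(x))
	for _, v := range x {
		d := v - mean
		ss += d * d
	}
	return ss, len(x)
}

// populationVariance divides by n.
func populationVariance(x []float64) float64 {
	ss, n := sumSquaredDeviations(x)
	return ss / float64(n)
}

// sampleVariance applies Bessel's correction and divides by n-1.
func sampleVariance(x []float64) float64 {
	ss, n := sumSquaredDeviations(x)
	return ss / float64(n-1)
}
```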
I agree; I proposed Percentile above, but "quantile" is the better term.
About the different ways to compute quantiles: R, which is very mainstream in statistics, implements 9 different quantile algorithms and lets the user choose. Documentation is at https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/quantile (I didn't check whether this is the same list of methods as in the Wikipedia article quoted above.)
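As one data point among those nine, R's default method (type 7) is linear interpolation between the two nearest order statistics. A sketch of that one definition, not a claim about which one Go should pick:

```go
import (
	"math"
	"slices"
)

// quantileType7 returns the q-th quantile of x (q in [0, 1]) using
// linear interpolation between closest ranks, R's default (type 7).
func quantileType7(x []float64, q float64) float64 {
	if len(x) == 0 {
		return math.NaN()
	}
	s := slices.Clone(x)
	slices.Sort(s)
	h := q * float64(len(s)-1) // fractional index into the sorted data
	lo := int(math.Floor(h))
	hi := int(math.Ceil(h))
	return s[lo] + (h-float64(lo))*(s[hi]-s[lo])
}
```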
I'm not sure about the proposed API. Specifically, it seems to me that these should arguably take iterators rather than slices. So to me, this API only really makes sense for small data series, where the cost of looping multiple times is negligible and/or you are fine with pre-allocating them. An API to remedy that is arguably too complex for the stdlib. Are we okay with that limitation? If so, should we still make the arguments iter.Seq[float64]?
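For a picture of the iterator flavor under discussion (signature assumed, not proposed anywhere in this thread):

```go
import "iter"

// meanSeq computes the arithmetic mean in one pass over any sequence.
// A slice can still be supplied via slices.Values(x).
func meanSeq(seq iter.Seq[float64]) float64 {
	var sum float64
	var n int
	for v := range seq {
		sum += v
		n++
	}
	return sum / float64(n) // NaN for an empty sequence
}
```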
Another design would be a single value with methods for all the various stats and whose factories take the slice or sequence (and any weights). That way it could do any sorting or failing on NaN upfront and cache any intermediate values required by multiple stats. Something like:

```go
stats, err := statistics.For(floats)
// handle err
fmt.Println(stats.Mean(), stats.Max(), stats.Median())
```
Though an …
There's a tantalizing idea here that perhaps one could just call fmt.Println(stats) to print a summary.
@adonovan it could not print percentiles, or just print quartiles, and you have to ask if you need something more specific. My main thought with the API is that it makes it clear that it's taking ownership. I'm guessing in most cases you want more than one stat at a time, so if it can cache some intermediary value that gets used for more than one stat, or speed things up by storing the numbers in a special order or data structure, that's a nice bonus. I don't know what the specific numerical methods used for stats are, but I imagine there could be some savings by caching the sum or just knowing if there's a +Inf in there somewhere.
To be clear: that is the opposite of what I was trying to point out :) It requires multiple passes to calculate both the Mean and the Stddev with the proposed API, regardless of whether they are given as an iter.Seq or a slice. And the promise has been that …
Sorry, long day, tired brain. I don't really have a strong feeling about iterator vs slice. My initial feeling was that the slice was simpler, but perhaps we should embrace iterators so that all sequences can be supplied with equal convenience.
That's true, but the point I was trying to make was that we are unlikely to be able to correctly anticipate the exact set of operations that we should compute in a single pass. Should it be mean, median, and the 90th percentile? What about 95th or 99th? And so on. So, I argue for separate operators, each taking an iterator.
It seems to me that this discussion about using iterators and collecting multiple results in a single pass is circling around the idea of a generic fold/reduce mechanism over iterators, with the statistics operations discussed here being a set of predefined combining functions to use with that mechanism. Someone who wants to compute multiple at once could then presumably write their own combining function that wraps multiple others and produces a struct or map type (with one field/element per inner function) as its result. I will say immediately that I'm not super convinced that such complexity is justified, but if we think that combining multiple operations over a single iterator is something we want to support, then I'd wonder what that would look like as a more general facility implemented elsewhere.

EDIT: After posting this I immediately found #61898, which proposes to add that sort of generic iterator adapter, including a Reduce.
I am glad that you said that immediately. ;-)
This reminds me of a certain Google interview question from years back: how do you estimate the median value of a long stream with only finite working store? Any loop over a sequence can be expressed as a wrapper around a call to Reduce, but it is often neither clearer nor more efficient to do so. We absolutely should not require users of the new stats package to hold such higher-order concepts in mind.
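To make that wrapper shape concrete (a sketch only; the actual Reduce in #61898 may have a different signature):

```go
import "iter"

// reduce folds seq into one value using f, starting from init.
func reduce[Acc, V any](f func(Acc, V) Acc, init Acc, seq iter.Seq[V]) Acc {
	acc := init
	for v := range seq {
		acc = f(acc, v)
	}
	return acc
}

// Mean written as a wrapper around reduce, illustrating the point:
// it works, but it is not obviously clearer than a plain loop.
func meanViaReduce(seq iter.Seq[float64]) float64 {
	type acc struct {
		sum float64
		n   int
	}
	a := reduce(func(a acc, v float64) acc {
		return acc{a.sum + v, a.n + 1}
	}, acc{}, seq)
	return a.sum / float64(a.n)
}
```

Compared with the direct loop shown earlier, the fold version carries an extra accumulator type just to say the same thing.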
I should've said that my main intention in my earlier comment was to respond to the ideas around calculating multiple of these functions at the same time over a given sequence, not to the original proposal for separate functions. Concretely what I was thinking about was: …
I intend the last item here to be an alternative to offering in this package any specialized API for calculating multiple aggregates together. In particular, an alternative to the statistics.For-style API discussed above. I'm proposing this only if there's consensus that supporting the use of multiple functions over a single sequence in only one pass is a requirement. If we can convince ourselves that it isn't a requirement, then I don't think this complexity is justified. I expect that the original proposal's functions, potentially but not necessarily recast as taking iterators, would suffice.
To clarify: methods could cache any intermediary calculations that other stats may need, so they don't need to be computed twice if you need two stats that depend on the same value. If, as part of storing and preparing the info, it could easily calculate and cache a few basic stats while it's at it, that's certainly a nice bonus—but that would be an implementation detail. Whether that makes sense in some part depends on what operations there will be (now and in the future), the methods for calculating them, and how many calculations can be shared between them. Though it could have multiple factories, one for slice and one for iter, so you could work easily with either without having to have a seq and a slice version of each operation.
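Roughly, the caching idea might look like this (all names hypothetical; the error handling from the earlier For sketch is omitted):

```go
import "slices"

// Stats takes ownership of the data and lazily caches intermediates
// (the sum, a sorted copy) that multiple methods can share.
type Stats struct {
	data   []float64
	sorted []float64 // built on the first Median/quantile call
	sum    float64
	hasSum bool
}

func For(x []float64) *Stats { return &Stats{data: x} }

func (s *Stats) Mean() float64 {
	if !s.hasSum {
		for _, v := range s.data {
			s.sum += v
		}
		s.hasSum = true
	}
	return s.sum / float64(len(s.data))
}

func (s *Stats) Median() float64 {
	if s.sorted == nil {
		s.sorted = slices.Clone(s.data)
		slices.Sort(s.sorted)
	}
	n := len(s.sorted)
	if n%2 == 1 {
		return s.sorted[n/2]
	}
	return (s.sorted[n/2-1] + s.sorted[n/2]) / 2
}
```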
I firmly believe it should not be a requirement and that such complexity is unwarranted. The goal for this package is to provide simple implementations of the most well known of all statistical functions. I imagine a typical usage will be to print a summary of results in a benchmarking scenario. The cost of computing the statistics will be insignificant.
I would recommend, if we include quantile, to do what both Python and R do, and accept a list of quantiles to be computed. I admit this is purely anecdotal, but I can't really recall a situation in which I had to compute a single quantile.
I think these should all take slices. Slices are faster and simpler. Just because we have iterators doesn't mean we should stop using slices in APIs: slices should still be the default, unless there's a good justification for using an iterator. In this case, if people have enough data that they must stream it, they should probably be using something more specialized.
Right.
I agree. Let's leave Mode out. It seems like, if we're going to have standard deviation and variance, we need both population and sample versions. It would be nice to have a stats expert weigh in on including both population and sample variance, and the question of which quantile definition to use. @adonovan is going to see about getting input from a stats expert, but any other experts should feel free to weigh in.

So, I believe that leaves us at the following API for package math/stats:

```go
func Mean(x []float64) float64
func Median(x []float64) float64
func Quantiles(x []float64, quantiles []float64) []float64
func SampleStdDev(x []float64) float64
func SampleVariance(x []float64) float64
func PopulationStdDev(x []float64) float64
func PopulationVariance(x []float64) float64
```

This leaves some open questions: …
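Assuming that API (and the math/stats placement) survives review, typical usage would be along these lines; the package path is hypothetical until the proposal is accepted:

```go
package main

import (
	"fmt"
	"math/stats" // hypothetical path; this package does not exist yet
)

func main() {
	x := []float64{3, 1, 4, 1, 5, 9, 2, 6}
	fmt.Println(stats.Mean(x))   // 3.875
	fmt.Println(stats.Median(x)) // 3.5
	fmt.Println(stats.Quantiles(x, []float64{0.5, 0.9, 0.99}))
	fmt.Println(stats.SampleStdDev(x))
}
```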
If we've settled on slices, then back to an earlier question: should they be generic, like `func Mean[Slice ~[]E, E ~float32 | ~float64](x Slice) E`, or at least `func Mean[Slice ~[]E, E ~float64](x Slice) E`? Also, could the quantiles param of Quantiles be …
We should not use generics; package math is float64-only. @adonovan is still trying to find out what specific algorithms we should be using.
Would package math be float64-only if it were designed today? It's entirely reasonable either way, imo. However, I'd hope math/bits would use generics if designed today (or v2'd). If I had a slice of float32, I'd have to copy it into a []float64 first.
Two comments:

1. The only time the difference between population and sample standard deviation matters in practice is when the sample size is very small. R, for example, always divides by (n-1) for both its base library var() and sd() functions (variance and standard deviation, respectively).

2. If one is computing the standard deviation, almost always one also wants the mean too. It makes me cringe to think I'd have to do two passes to get both; so much so that I would avoid using the standard library functions if it forced this. In place of SampleStdDev() and SampleVariance() (the latter seems redundant) I would just have a single MeanSd() func that returns both mean and sample standard deviation from a single pass. For example: `// MeanSd returns the mean and sample standard deviation …` I've provided a simple one-pass implementation of this here: https://github.com/glycerine/stats-go
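The standard way to do that single pass is Welford's algorithm, which also avoids the cancellation problems of the naive sum-of-squares formula. A sketch (not the code from the linked repository):

```go
import "math"

// meanSd returns the mean and sample standard deviation of x in a
// single pass using Welford's recurrence.
func meanSd(x []float64) (mean, sd float64) {
	var m, m2 float64 // running mean and sum of squared deviations
	for i, v := range x {
		d := v - m
		m += d / float64(i+1)
		m2 += d * (v - m)
	}
	if len(x) < 2 {
		return m, math.NaN()
	}
	return m, math.Sqrt(m2 / float64(len(x)-1))
}
```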
For quantile computation, it is hard to provide an efficient, exact, online algorithm. Almost always you need an online algorithm for your statistics to avoid O(n^2) recomputation. Therefore, most users are going to be better off using an online T-digest implementation like https://github.com/caio/go-tdigest with, for example, … Unless someone has a better algorithm or a clever way to get the exact quantiles online, approximations like T-digest seem like the practical answer. Or the standard library could bring in and polish one of the T-digest implementations for Quantile and CDF (cumulative distribution function) computation. That would also be nice.
Thanks for weighing in, @glycerine!
Thanks! This all makes sense and certainly simplifies things. So instead of

```go
func SampleStdDev(x []float64) float64
func SampleVariance(x []float64) float64
func PopulationStdDev(x []float64) float64
func PopulationVariance(x []float64) float64
```

we'd have just

```go
func MeanAndStdDev(x []float64) (mean, stddev float64)
```
My sense is that T-digests would be beyond the scope of a small descriptive stats standard package. We're not trying to replace serious stats packages, just cover really common needs. T-digests are great if you need online quantiles, but have their own cognitive overheads, especially around understanding how they're approximating. My sense is that the common need is that you have a simple slice of data and just want to get a few quantiles. That's certainly been true in my code. I'm also not overly concerned with the performance of these functions. It just has to be "good enough." That's why we're thinking Quantiles would accept a slice of quantiles to compute, because that generally allows for a lot of work sharing, and balances that with a simple API. (Side note: we could balance the performance needs a little more here by saying that Quantiles will be faster if you pass it sorted data, but that's not required.)
Please consider a struct for the API. Also, you could use an online variant like in https://www.johndcook.com/skewness_kurtosis.html
As a data point, at work we've created some …
We've discussed this above and it doesn't seem like the right trade-off for a simple descriptive stats API.
Again, it's not clear this is justified in this case. For more advanced stats needs, such as online computation, it's easy enough to pull in an external, more specialized package.
Description:

This proposal aims to enhance the Go standard library's math package (math/stats.go) by introducing several essential statistical functions. The proposed functions are: Mean, Median, Mode, Variance, and StdDev, and many more…
Motivation:

The inclusion of these statistical functions directly in the math package will offer Go developers robust tools for data analysis and statistical computation, enhancing the language's utility in scientific and financial applications. Currently, developers often rely on external libraries for these calculations, which adds dependencies and potential inconsistencies. Integrating these functions into the standard library will: …

Design:

The functions will be added to the existing math package, ensuring they are easy to use and integrate seamlessly with other mathematical operations. Detailed documentation and examples will be provided to illustrate their usage and edge case handling.

Examples:
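The original examples were not preserved here; a hypothetical sketch of the kind of usage the proposal describes, with the functions placed in package math as the proposal states (all signatures assumed):

```go
// Hypothetical usage of the proposed functions:
data := []float64{2, 4, 4, 4, 5, 5, 7, 9}
fmt.Println(math.Mean(data))     // 5
fmt.Println(math.Median(data))   // 4.5
fmt.Println(math.Mode(data))     // 4
fmt.Println(math.Variance(data)) // 4 (population form)
fmt.Println(math.StdDev(data))   // 2 (population form)
```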