proposal: math: add Mean, Median, Mode, Variance, and StdDev #69264

Open
hemanth0525 opened this issue Sep 4, 2024 · 68 comments

@hemanth0525

hemanth0525 commented Sep 4, 2024

Description:

This proposal aims to enhance the Go standard library's math package (math/stats.go) by introducing several essential statistical functions. The proposed functions are:

  • Mean: Calculates the average value of a data set.
  • Median: Determines the middle value when the data set is sorted.
  • Mode: Identifies the most frequently occurring value in a data set.
  • Variance: Measures the spread of the data set from the mean.
  • StdDev: Computes the standard deviation, providing a measure of data dispersion.
    ...and many more.

Motivation:

The inclusion of these statistical functions directly in the math package will offer Go developers robust tools for data analysis and statistical computation, enhancing the language's utility in scientific and financial applications. Currently, developers often rely on external libraries for these calculations, which adds dependencies and potential inconsistencies. Integrating these functions into the standard library will:

  • Provide Comprehensive Statistical Analysis: These functions will facilitate fundamental statistical measures, aiding in more thorough data analysis and better understanding of data distributions.
  • Ensure Reliable Behavior: Functions are designed to handle edge cases, such as empty slices, to maintain predictable and accurate results.
  • Optimize Performance and Accuracy: Implemented with efficient algorithms to balance performance with calculation accuracy.
  • Increase Utility: Reduces the need for third-party libraries, making statistical computation more accessible and consistent within the Go ecosystem.

Design:

The functions will be added to the existing math package, ensuring they are easy to use and integrate seamlessly with other mathematical operations. Detailed documentation and examples will be provided to illustrate their usage and edge case handling.

Examples:

  • Mean:
    mean := math.Mean([]float64{1, 2, 3, 4, 5})
  • Median:
    median := math.Median([]float64{1, 3, 3, 6, 7, 8, 9})
  • Mode:
    mode := math.Mode([]float64{1, 2, 2, 3, 4})
  • Variance:
    variance := math.Variance([]float64{1, 2, 3, 4, 5})
  • StdDev:
    stddev := math.StdDev([]float64{1, 2, 3, 4, 5})
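
For concreteness, a minimal sketch of how Median might be implemented. The name matches the proposal; the sort-a-copy strategy and the even-length averaging below are illustrative assumptions, not part of the proposal:

package stats

import "sort"

// Median returns the middle value of x after sorting a copy.
// For an even number of elements it averages the two middle values.
// Handling of empty input is deliberately left open in this sketch.
func Median(x []float64) float64 {
	s := make([]float64, len(x))
	copy(s, x)
	sort.Float64s(s)
	n := len(s)
	if n%2 == 1 {
		return s[n/2]
	}
	return (s[n/2-1] + s[n/2]) / 2
}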

@seankhliao seankhliao changed the title math: Implement Statistical Functions for Mean, Median, Mode, Variance, and StdDev proposal: math: Implement Statistical Functions for Mean, Median, Mode, Variance, and StdDev Sep 4, 2024
@gopherbot gopherbot added this to the Proposal milestone Sep 4, 2024
@seankhliao seankhliao changed the title proposal: math: Implement Statistical Functions for Mean, Median, Mode, Variance, and StdDev proposal: math: add Mean, Median, Mode, Variance, and StdDev Sep 4, 2024
@ianlancetaylor ianlancetaylor moved this to Incoming in Proposals Sep 4, 2024
@ianlancetaylor
Member

ianlancetaylor commented Sep 4, 2024

In general the math package aims to provide the functions that are in the C++ standard library <math>.

@hemanth0525
Author

Thanks for the feedback! I get that the math package is meant to mirror the functions in C++'s <cmath>, but I think adding some built-in stats functions could be a nice improvement. A lot of developers deal with stats regularly, so having these in the standard library could make things easier without stepping too far from the package’s core purpose. Happy to chat more about it if needed!

@earthboundkid
Contributor

Can you do some detective work to see how people are dealing with this in open source Go now? Is there some go-stats package that has a million stars on Github? Are there ten libraries that are each imported five hundred times? Seeing that something has big demand already is important for bringing something that could be in a third party library into the standard library. Otherwise this will just get closed with "write a third party library." Which has certainly happened to me more than once!

@hemanth0525
Author

I’ve done some digging into how statistical functions are currently being handled in the Go community. While libraries like Gonum and others provide statistical methods, there's no single source of truth or dominant package in this space, and many are designed for more complex or specialized tasks. However, the basic statistical functions we're proposing—like Mean, Median, Mode, Variance, and StdDev—are foundational for a wide range of applications, from simple data analysis to more advanced scientific and financial computations.

By integrating these into the standard library, we'd eliminate the need for external dependencies for basic tasks, which is in line with Go's philosophy of having a strong standard library for common use cases. While third-party packages are an option, including these functions in the math package would make Go more self-sufficient for everyday statistical needs, benefiting developers who want a simple, reliable way to compute these without resorting to third-party solutions.

@seankhliao
Member

for common use cases

This is the part where we need to see evidence, especially considering the existence of libraries like gonum: how often does the need arise for functions like those proposed, where you wouldn't need the extra functionality that other libraries provide?

@jimmyfrasche
Member

For what it's worth, python has a statistics package in its standard library: https://docs.python.org/3/library/statistics.html

It would be nice to have a simple package everyone agrees on for common use cases, but that doesn't necessarily need to be in std.

@randall77
Contributor

These functions sound pretty simple, but I think there's actually a lot of subtlety here. For instance, what does Mean do for rounding? Do we need to use Kahan's algorithm? What if the sum at some point rounds up to +Inf?
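
For reference, Kahan's algorithm carries a compensation term for the low-order bits lost at each addition. A minimal sketch of a compensated mean, purely for illustration (it returns NaN for an empty slice; the empty-slice question remains open):

package stats

// kahanMean computes the mean using Kahan compensated summation,
// which reduces rounding error versus naive accumulation.
func kahanMean(x []float64) float64 {
	var sum, c float64 // c carries the low-order bits lost by sum
	for _, v := range x {
		y := v - c
		t := sum + y
		c = (t - sum) - y // recovers the part of y that didn't fit in t
		sum = t
	}
	return sum / float64(len(x))
}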

@doggedOwl

Can you do some detective work to see how people are dealing with this in open source Go now? Is there some go-stats package that has a million stars on Github? Are there ten libraries that are each imported five hundred times? Seeing that something has big demand already is important for bringing something that could be in a third party library into the standard library. Otherwise this will just get closed with "write a third party library." Which has certainly happened to me more than once

In my experience, every time a numeric problem comes up, the gonum lib is suggested. They have a stats package: https://pkg.go.dev/gonum.org/v1/gonum@v0.15.1/stat

@hemanth0525
Author

Can you do some detective work to see how people are dealing with this in open source Go now? Is there some go-stats package that has a million stars on Github? Are there ten libraries that are each imported five hundred times? Seeing that something has big demand already is important for bringing something that could be in a third party library into the standard library. Otherwise this will just get closed with "write a third party library." Which has certainly happened to me more than once

In my experience, every time a numeric problem comes up, the gonum lib is suggested. They have a stats package: https://pkg.go.dev/gonum.org/v1/gonum@v0.15.1/stat

Yeah, so think about having its functionality in the Go std lib straight away!

@hemanth0525
Author

Gonum library is indeed often suggested for statistical and numerical work in Go, and it has a dedicated stat package. It’s a robust library that covers a wide range of statistical functions, and for more complex needs, it's definitely a go-to solution.

However, my proposal is focused on adding foundational statistical functions like Mean, Median, Mode, Variance, and StdDev,... directly into the standard library. These are basic but essential tools that many developers need in day-to-day tasks, and having them in the standard library could save developers from importing an entire external library like Gonum for simple calculations. I believe integrating these functions would make Go more self-sufficient, particularly for developers who need straightforward statistical calculations without additional dependencies.

@adonovan
Member

adonovan commented Sep 6, 2024

IMHO these functions would be very useful in the standard library, even if (or indeed, because) the implementation requires some care. There are many "quick" uses of these basic stats operations in testing, benchmarking, and writing CL descriptions that shouldn't require a heavyweight dependency on a fully-featured third-party stats library. (I often end up moving data out of my Go program to the shell and running the github.com/nferraz/st command.)

Another function I would like is Percentile(n, series), which reports the nth percentile value of a given series.

@jimmyfrasche
Member

If it belongs in std, it should probably be in a "math/stats" or "math/statistics" instead of directly in "math".

@meling

meling commented Sep 10, 2024

Here is a small experience report with existing stats packages: in some code I was using gonum's stats package, and a collaborator started using github.com/montanaflynn/stats as well, whose API returns an error (which I felt was annoying). Luckily, I caught the unnecessary dependency in code review.

These are the types of things that can easily cause unnecessary dependencies to get added in projects. Hence, I think adding common statistics functions would be a great addition to the std.

@hemanth0525
Author

It seems like a lot of developers would benefit from this!

@hemanth0525
Author

hemanth0525 commented Sep 22, 2024

Can I know the update on this proposal?

@adonovan
Member

The proposal review committee will likely look at it this week. It usually takes a few rounds to reach a final decision.

@hemanth0525
Author

The proposal review committee will likely look at it this week. It usually takes a few rounds to reach a final decision.

OK, cool!

@hemanth0525
Author

Can I know the update on this proposal, please?

@adonovan
Member

Sorry, we didn't get to it last week, but perhaps will this week.

@hemanth0525
Author

Yes, please.

@adonovan
Member

adonovan commented Oct 2, 2024

Some of the questions raised in the meeting were:

  • Which package should this live in? The scope of the math package aligns with the C++ math package, so it does not seem the appropriate home. Perhaps math/stats? But this might create a temptation to add a lot more statistical functions. Which leads to:
  • If we create a new package, what should be its scope? The proposed set of functions (including Percentile) is roughly the set of statistical functions that every high-school student knows, and perhaps that's the appropriate scope.
  • Should the functions be generic? Should we support the median of an integer series, say? Personally I'm not convinced it's necessary; users can convert integers to floats as needed. This package should make common problems (such as arise during testing and benchmarking) convenient, not aim for maximum generality or efficiency.
  • Is a single result sufficient for the Mode function? What is the mode of [1, 2]?

@hemanth0525
Author

hemanth0525 commented Oct 2, 2024

Thanks for the feedback! I totally get the concerns and here’s my take:

  1. Package Location: I agree that a new math/stats package makes sense. It keeps things organized and prevents the core math package from becoming too broad. We can start with the basics—mean, median, mode, variance, etc.—covering foundational stats functions that are universally useful.

  2. Scope: Let's keep it simple for now. The goal should be to provide common, practical functions that people need for everyday testing, benchmarking, and basic analytics. We don't need to cover advanced statistical methods yet: just the essentials. And yeah, potential add-ons could be Percentile, Quartiles, Geometric Mean, Harmonic Mean, Mean Absolute Deviation (MAD), Coefficient of Variation (CV), Cumulative Sum (Cumsum), Root Mean Square (RMS), Skewness, Kurtosis, Covariance, Correlation Coefficient, Z-Score, ...

  3. Generics: I don’t think we need generics here. Users can convert integers to floats if needed, and keeping it focused on simplicity will make the package more accessible.

  4. Mode Function: For cases like [1, 2], we can return nil or an empty slice [] if no mode exists, or return all modes in a slice when there’s more than one. That way, it’s clear and flexible.

Overall, I think this keeps the package lightweight, practical, and easy to use, which should be the priority. Looking forward to hearing your thoughts!

@adonovan
Member

adonovan commented Oct 2, 2024

And yeah potential addons would be Percentile, ...[long list]...

I think the goal of limiting the scope would be to ensure that these (other than Percentile) are not potential additions. ;-)

I agree that a slice result for Mode seems appropriate. Perhaps it should be called Modes.
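
A sketch of what a slice-returning Modes might look like, assuming the name and multiplicity discussed above; the body and tie behavior (every maximal value is returned, in unspecified order) are illustrative:

package stats

// Modes returns all values that occur most frequently in x,
// or nil if x is empty. For an input like [1, 2], every value
// ties and so every value is returned.
func Modes(x []float64) []float64 {
	counts := make(map[float64]int)
	best := 0
	for _, v := range x {
		counts[v]++
		if counts[v] > best {
			best = counts[v]
		}
	}
	var modes []float64
	for v, n := range counts {
		if n == best {
			modes = append(modes, v)
		}
	}
	return modes
}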

@hemanth0525
Author

Can I know the status?

@adonovan
Member

Can I know the status?

This is the status:

It can take anywhere from several weeks to months before a final decision is made.

@hemanth0525
Author

Can I receive any updates, at least weekly?

@ianlancetaylor
Member

@hemanth0525 I appreciate this issue is important to you. Please understand that we have over 700 proposals waiting for attention, as can be seen at https://github.com/orgs/golang/projects/17. It's not feasible for our small team to provide weekly updates for each separate proposal. You can track the proposal review activities at #33502.

@hemanth0525
Author

@ianlancetaylor Yes, I get it. Thanks!

@aclements
Member

What's the scope?

The scope of this package should be fairly narrow. If you search for "basic descriptive statistics", basically all results include mean, median, mode, and standard deviation. Variance is also common. "Range" is pretty common, but that's easy to get with the min and max built-ins. Most include some form of quantile/percentile/quartile.

The Python statistics package is an interesting example here (thanks @jimmyfrasche), as it aims to be a small collection of common operations. However, I think it actually goes too far. I was particularly surprised to see kernel density estimation in there, as I consider that, and especially picking good KDE parameters, a fairly advanced statistical method.

Which package?

math/stats could invite feature creep. On the other hand, it's scoped and purposeful. It's also easier to search for.

math currently follows the C library, but I'm not convinced that's very important (Go isn't C). However, everything in math operates on one or two float64s, so this would be a break from that. math already mixes together a few different fields (e.g., there's no math/trig), but that's probably just because it follows the C math library. It already has a few other sub-packages for different data types (math/cmplx) and specific fields (math/bits).

Overall I'm leaning toward math/stats.

Operations

Quantile: I personally find myself wanting quantiles quite often, so this is certainly tempting. We should get a statistics expert to weigh in on which definition to use. I do think this should be "quantile" and not "percentile".

Variance and standard deviation: Are these for populations or do they apply sample correction? Do we provide both a population form and a sample-corrected form (this is what Python does)? If we're going to provide sample forms, which of the various corrections do we use?

Mode: I'm not completely convinced that we should include mode. If we do, I'd suggest only including "multimode", which returns a possibly-nil slice, as this is a total function, unlike mode.

@adonovan
Member

Quantile: I personally find myself wanting quantiles quite often, so this is certainly tempting. We should get a statistics expert to weigh in on which definition to use. I do think this should be "quantile" and not "percentile".

Meaning the parameter should be in [0,1] not [0,100]? Or that one should provide lower and upper bounds for the portion of the CDF of interest?

Variance and standard deviation: Are these for populations or do they apply sample correction? Do we provide both a population form and a sample-corrected form (this is what Python does)? If we're going to provide sample forms, which of the various corrections do we use?

I would think that population is more in line with the typical use of such a package, but it may be safer to provide both with distinct names, preventing casual use of the wrong one. The doc comments should provide clear examples of which one is appropriate.

Mode: I'm not completely convinced that we should include mode. If we do, I'd suggest only including "multimode", which returns a possibly-nil slice, as this is a total function, unlike mode.

I agree; I proposed Modes([]float) []float to acknowledge its multiplicity up front.

@seehuhn

seehuhn commented Nov 6, 2024

About the different ways to compute quantiles: R, which is very mainstream in statistics, implements 9 different quantile algorithms and lets the user choose. Documentation is at https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/quantile . (I didn't check whether this is the same list of methods as in the Wikipedia article quoted above.)

@Merovius
Contributor

Merovius commented Nov 13, 2024

I'm not sure about the proposed API. Specifically, it seems to me that these should arguably take iter.Seq[float64] instead of []float64, from a pure API perspective. But if you need more than one of these outputs (which I would assume you commonly do), iter.Seq makes it clear that it's less efficient to iterate multiple times instead of having a single loop that does a bunch of accumulations. The same concern ultimately exists with slices; it's just less obvious.

So to me, this API only really makes sense for small data series, where the cost of looping multiple times is negligible and/or you are fine with pre-allocating them. An API to remedy that is arguably too complex for the stdlib.

Are we okay with that limitation? If so, should we still make the arguments iter.Seq?

@jimmyfrasche
Member

Another design would be a single value with methods for all the various stats, whose factories take the slice or sequence (and any weights). That way it could do any sorting or failing on NaN upfront and cache any intermediate values required by multiple stats.

Something like

stats, err := statistics.For(floats)
// handle err
fmt.Println(stats.Mean(), stats.Max(), stats.Median())

@adonovan
Member

Are we okay with that limitation [that the cost of looping multiple times is negligible and/or you are fine with pre-allocating them]? If so, should we still make the arguments iter.Seq?

Though an iter.Seq[float64] is the logical parameter type, I suspect it is not only less efficient (because of repeated passes) but also less convenient (because typically one has a slice already). Although an iterator would allow the caller to avoid materializing an array of float64 when they have some other data structure (such as an array of integers or structs), I suspect the work to define a one-off iterator over that data structure is probably still more than to create a []float64 slice from it. So, []float64 is probably more convenient. And as you point out, if multiple statistics are required, it may be more efficient too, but that's a secondary concern.

Another design would a single value with methods for all the various stats

There's a tantalizing idea here that perhaps one could just call fmt.Println(stats.For(series)) and obtain a nice string showing the mean, median, percentiles and so on, not unlike the convenience of fmt.Println(time.Since(t0)). But the Percentile operator requires an argument (0.9, 0.99, etc.). I think the API originally proposed is simpler.

@jimmyfrasche
Member

@adonovan it could omit percentiles, or print just the quartiles, and you would have to ask if you need something more specific.

My main thought with the API is that it makes it clear that it's taking ownership. I'm guessing in most cases you want more than one stat at a time, so if it can cache some intermediary value that gets used by more than one stat, or speed things up by storing the numbers in a special order or data structure, that's a nice bonus. I don't know what specific numerical methods are used for stats, but I imagine there could be some savings by caching the sum or just knowing if there's a +Inf in there somewhere.

@Merovius
Contributor

Merovius commented Nov 14, 2024

@adonovan

And as you point out, if multiple statistics are required, it may be more efficient too, but that's a secondary concern.

To be clear: That is the opposite of what I was trying to point out :) It requires multiple passes to calculate both the Mean and the Stddev with the proposed API. Regardless of whether they are given as an iter.Seq or as a []float64. If you don't use the API, you can do it in one pass.

And the promise has been that ranging over slices.Values(s) is just as fast as ranging over the []float64 directly, FWIW. So it is not less efficient in the special case that you already have a []float64.

@adonovan
Member

To be clear: That is the opposite of what I was trying to point out :)

Sorry, long day, tired brain.

I don't really have a strong feeling about iterator vs slice. My initial feeling was that the slice was simpler, but perhaps we should embrace iterators so that all sequences can be supplied with equal convenience.

It requires multiple passes to calculate both the Mean and the Stddev with the proposed API.

That's true, but the point I was trying to make was that we are unlikely to be able to correctly anticipate the exact set of operations that we should compute in a single pass. Should it be mean, median, and the 90th percentile? What about the 95th or 99th? And so on.

So, I argue for separate operators, each taking an iter.Seq[float64].

@apparentlymart

apparentlymart commented Nov 14, 2024

It seems to me that this discussion about using iterators and collecting multiple results in a single pass is circling around the idea of a generic fold/reduce mechanism over iterators, with the statistics operations discussed here being a set of predefined combining functions to use with that mechanism.

Someone who wants to compute multiple at once could then presumably write their own combining function that wraps multiple others and produces a struct or map type (with one field/element per inner function) as its result.

I will say immediately that I'm not super convinced that such complexity is justified, but if we think that combining multiple operations over a single iterator is something we want to support then I'd wonder what that would look like as a more general facility implemented in package iter, or similar.

EDIT: After posting this I immediately found #61898 which proposes to add Reduce and Reduce2 helpers to a new experimental package. Would it be possible to implement some or all of these statistical operators as functions that could be passed to xiter.Reduce?
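
For a concrete picture, here is a sketch under the assumption that a Reduce with the shape proposed in #61898 exists. Reduce below is a local stand-in rather than a real xiter API, and the accumulator type is invented for illustration:

package stats

import "iter"

// Reduce is a local stand-in for the proposed xiter.Reduce.
func Reduce[Sum, V any](f func(Sum, V) Sum, sum Sum, seq iter.Seq[V]) Sum {
	for v := range seq {
		sum = f(sum, v)
	}
	return sum
}

// meanAcc is an invented accumulator; a combined mean/stddev version
// would carry more fields, one per statistic.
type meanAcc struct {
	sum float64
	n   int
}

// Mean expresses the mean as a single reduction over the sequence.
func Mean(seq iter.Seq[float64]) float64 {
	acc := Reduce(func(a meanAcc, v float64) meanAcc {
		return meanAcc{a.sum + v, a.n + 1}
	}, meanAcc{}, seq)
	return acc.sum / float64(acc.n)
}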

@adonovan
Member

I will say immediately that I'm not super convinced that such complexity is justified.

I am glad that you said that immediately. ;-)

I immediately found #61898 which proposes to add Reduce and Reduce2 helpers to a new experimental package. Would it be possible to implement some or all of these statistical operators as functions that could be passed to xiter.Reduce?

This reminds me of a certain Google interview question from years back: how do you estimate the median value of a long stream with only finite working store?

Any loop over a sequence can be expressed as a wrapper around a call to Reduce, but it is often neither clearer nor more efficient to do so. We absolutely should not require users of the new stats package to hold such higher-order concepts in mind.

@apparentlymart

apparentlymart commented Nov 14, 2024

I should've said that my main intention in my earlier comment was to respond to the ideas around calculating multiple of these functions at the same time over a given sequence, not to the original proposal for separate functions.

Concretely what I was thinking about was:

  • Expose each of these functions as something that can be passed to a "reduce" function.
  • For each of them, also offer a simpler wrapper like in the original proposal that is designed for the simple case of calculating only one function for a given sequence, wrapping a call to the "reduce" function.
  • Anyone who wants to, for example, calculate both mean and standard deviation at the same time would do that by directly calling "reduce" with a function that wraps both the mean and standard deviation functions and returns something like struct { Mean, StdDev float64 } with the results of both functions.

I intend the last item here to be an alternative to offering in this package any specialized API for calculating multiple aggregates together. In particular, an alternative to the statistics.For and others like it.

I'm proposing this only if there's consensus that supporting the use of multiple functions over a single sequence in only one pass is a requirement. If we can convince ourselves that it isn't a requirement then I don't think this complexity is justified. I expect that the original proposal's functions, potentially but not necessarily recast as taking iter.Seq instead, should be sufficient at least in the common case.

@jimmyfrasche
Member

To clarify, statistics.For would not (necessarily) calculate any statistics upfront; it would just prepare and store all the information needed to calculate the values.

Methods could cache any intermediary calculations that other stats may need so they don't need to be computed twice if you need two stats that depend on the same value.

If, as part of storing and preparing the info, it could easily calculate and cache a few basic stats while it's at it, that's certainly a nice bonus—but that would be an implementation detail.

Whether that makes sense depends in part on what operations there will be (now and in the future), the methods for calculating them, and how many calculations can be shared between them. It could have multiple factories, though, one for slices and one for iterators, so you could work easily with either without needing a seq and a slice version of each operation.

@adonovan
Member

I'm proposing this only if there's consensus that supporting the use of multiple functions over a single sequence in only one pass is a requirement.

I firmly believe it should not be a requirement and that such complexity is unwarranted. The goal for this package is to provide simple implementations of the most well known of all statistical functions. I imagine a typical usage will be to print a summary of results in a benchmarking scenario. The cost of computing the statistics will be insignificant.

@CAFxX
Contributor

CAFxX commented Nov 21, 2024

Quantile: I personally find myself wanting quantiles quite often, so this is certainly tempting. We should get a statistics expert to weigh in on which definition to use. I do think this should be "quantile" and not "percentile".

I would recommend, if we include quantile, to do what both Python and R do, and accept a list of quantiles to be computed. I admit this is purely anecdotal, but I can't really recall a situation in which I had to compute a single quantile.

@aclements
Member

I think these should all take slices. Slices are faster and simpler. Just because we have iterators doesn't mean we should stop using slices in APIs: slices should still be the default, unless there's a good justification for using an iterator. In this case, if people have enough data that they must stream it, they should probably be using something more specialized.

Meaning the parameter should be in [0,1] not [0,100]?

Right.

I would recommend, if we include quantile, to do what both Python and R do, and accept a list of quantiles to be computed.

I agree.

Let's leave Mode/Modes out for now. Especially on floating point numbers, these seem like asking for trouble given how easy it is to wind up with nearly equal floating point numbers. We could consider mode over integers, but then it doesn't fit as well into the rest of the API. It's a new package, so let's start with a narrow scope.

It seems like, if we're going to have standard deviation and variance, we need both population and sample versions. gonum.org/v1/gonum/stat calls these StdDev, PopStdDev, Variance, and PopVariance. I'm inclined to be more explicit and put Sample in the names of the sample versions. We could also provide just one of standard deviation and variance, since each can trivially be computed from the other, but I suspect it's common enough that people look for one or the other that we might as well provide the convenience of both.

It would be nice to have a stats expert weigh in on including both population and sample variance, and the question of which quantile definition to use. @adonovan is going to see about getting input from a stats expert, but any other experts should feel free to weigh in.

So, I believe that leaves us at the following API for package math/stats:

func Mean(x []float64) float64
func Median(x []float64) float64
func Quantiles(x []float64, quantiles []float64) []float64
func SampleStdDev(x []float64) float64
func SampleVariance(x []float64) float64
func PopulationStdDev(x []float64) float64
func PopulationVariance(x []float64) float64

This leaves some open questions:

  • What should these functions do when given an empty slice? As @meling pointed out, adding an error result would be pretty inconvenient. Other reasonable options are to return NaN or to panic. I'm slightly inclined toward panicking because of the way NaNs infect other operations and because I think NaN should be used to indicate there was a NaN in the argument.
  • What should these functions do when the slice contains NaN, Inf, or -Inf? I think if there's any NaN, the result should be NaN (this is another reason for not returning NaN if the slice is empty). For Median and Quantiles, Inf and -Inf should be sorted accordingly as samples. For Mean, if the input contains both Inf and -Inf, the result should be NaN; if it contains just one of the two, the result should be Inf or -Inf, respectively. And for StdDev and Variance, if the input contains either Inf or -Inf, the result should be Inf. (A sketch of Mean under these rules follows below.)
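
A sketch of Mean under exactly the rules above. Panicking on empty input is the stated leaning, not a decision, and IEEE-754 arithmetic supplies the NaN and infinity behavior:

package stats

// Mean panics on an empty slice and otherwise lets IEEE-754
// arithmetic propagate NaN and infinities as described above.
// (Per randall77's earlier point, an intermediate sum of finite
// values can still overflow to Inf; the rounding strategy is open.)
func Mean(x []float64) float64 {
	if len(x) == 0 {
		panic("stats: Mean of empty slice")
	}
	var sum float64
	for _, v := range x {
		sum += v // NaN propagates; Inf + -Inf yields NaN
	}
	return sum / float64(len(x))
}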

@jimmyfrasche
Member

If we've settled on slices then back to an earlier question, should they be generic like

func Mean[Slice ~[]E, E ~float32 | ~float64](x Slice) E

or at least

func Mean[Slice ~[]E, E ~float64](x Slice) E

?

Also, could the quantiles param of Quantiles be ...float64 with a reasonable default if left off?

@rsc
Contributor

rsc commented Dec 4, 2024

We should not use generics; package math is float64-only. @adonovan is still trying to find out what specific algorithms we should be using.

@jimmyfrasche
Member

Would package math be float64-only if it were designed today? It's entirely reasonable either way, imo.

However, I'd hope math/bits would use generics if designed today (or v2'd).

If I had a slice of type Temp float64, I'd hate to have to make a copy to get the mean temperature, so package slices seems as appropriate a precedent as package math here.
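
For illustration, the generic signature floated above would let such a defined type flow through without copying; the body is an assumption:

package stats

// Mean sketches the generic form: a defined slice type such as
// []Temp (where Temp's underlying type is float64) can be passed
// directly, and the result keeps the element type.
func Mean[Slice ~[]E, E ~float64](x Slice) E {
	var sum E
	for _, v := range x {
		sum += v
	}
	return sum / E(len(x))
}

With type Temp float64, Mean([]Temp{20.5, 21.0}) would return a Temp directly, no copy required.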

@glycerine

Two comments:

  1. To provide better guidance for statistically naive users, I would suggest omitting the PopulationStdDev and PopulationVariance functions.

The only time the difference between population and sample standard deviation
matters is when the sample size is very small and then you should be using the SampleStdDev and SampleVariance anyway.

R, for example, always divides by (n-1) for both its base library var() and sd() functions (Variance and standard deviation, respectively).

  2. Like one commenter above, I'm also bothered by the suggested API forcing two passes through the data when only one will do. It seems a poor example for the standard library to provide an API which forces algorithmic inefficiency.

If one is computing the standard deviation, almost always one also wants the mean too. It makes me cringe to think I'd have to do two passes to get both; so much so that I would avoid using the standard library functions if it forced this.

In place of SampleStdDev() and SampleVariance() (the latter seems redundant), I would just have a single MeanSd() func that returns both the mean and the sample standard deviation from a single pass. For example:

// MeanSd returns the mean and sample standard
// deviation from a single pass through the observations in x.
func MeanSd(x []float64) (mean, stddev float64)

I've provided a simple one-pass implementation of this here: https://github.com/glycerine/stats-go
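
For reference, one standard way to realize that single-pass signature is Welford's online algorithm; the sketch below is illustrative and not taken from the linked repository:

package stats

import "math"

// MeanSd returns the mean and sample standard deviation of x in a
// single pass, using Welford's online update for numerical stability.
func MeanSd(x []float64) (mean, stddev float64) {
	var m, m2 float64 // running mean and sum of squared deviations
	for i, v := range x {
		delta := v - m
		m += delta / float64(i+1)
		m2 += delta * (v - m)
	}
	if len(x) < 2 {
		return m, 0
	}
	return m, math.Sqrt(m2 / float64(len(x)-1)) // sample correction: n-1
}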

@glycerine

For quantile computation, it is hard to provide an efficient, exact, online
implementation, in the sense that exact computation usually requires
using/storing all the data.

Almost always you need an online algorithm for your statistics to avoid O(n^2) total work over updates, since you are typically reporting statistics regularly. An approximation or estimate of the quantiles is also usually sufficient.

Therefore, most users are going to be better off using an online T-digest implementation like https://github.com/caio/go-tdigest with, for example,
a compress setting of 100 (which gives a 1000x space reduction). Although this is an approximation of the quantiles, the tail accuracy is still very good, and the space savings make it very worthwhile.

Unless someone has a better algorithm or a clever way to get the exact
quantiles without needing to retain all the data over time, I would recommend
leaving the Quantile() function out of the standard library and pointing people at a T-digest implementation instead.

Or the standard library could bring in and polish one of the T-digest implementations for Quantile and CDF (cumulative distribution function) computation. That would also be nice.

@aclements
Member

Thanks for weighing in, @glycerine!

The only time the difference between population and sample standard deviation
matters is when the sample size is very small and then you should be using the SampleStdDev and SampleVariance anyway.

...

If one is computing the standard deviation, almost always one also wants the mean too. It makes me cringe to think I'd have to do two passes to get both; so much so that I would avoid using the standard library functions if it forced this.

Thanks! This all makes sense and certainly simplifies things.

So instead of

func SampleStdDev(x []float64) float64
func SampleVariance(x []float64) float64
func PopulationStdDev(x []float64) float64
func PopulationVariance(x []float64) float64

we'd have just

func MeanAndStdDev(x []float64) (mean, stddev float64)

For quantile computation, it is hard to provide an efficient, exact, online
implementation, in the sense that exact computation usually requires
using/storing all the data.

...

Or the standard library could bring in and polish one of the T-digest implementations for Quantile and CDF (cumulative distribution function) computation. That would also be nice.

My sense is that T-digests would be beyond the scope of a small descriptive stats standard package. We're not trying to replace serious stats packages, just cover really common needs. T-digests are great if you need online quantiles, but have their own cognitive overheads, especially around understanding how they're approximating.

My sense is that the common need is that you have a simple slice of data and just want to get a few quantiles. That's certainly been true in my code.

I'm also not overly concerned with the performance of these functions. It just has to be "good enough." That's why we're thinking Quantiles would accept a slice of quantiles to compute: that generally allows for a lot of work sharing while keeping the API simple. (Side note: we could balance the performance needs a little more here by saying that Quantiles will be faster if you pass it sorted data, but that's not required.)
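
As an illustration of that work sharing, a sketch of Quantiles that sorts a copy once and then answers each request by linear interpolation of order statistics (R's default definition, type 7). The interpolation choice and edge-case handling here are assumptions, since the definition question is still open:

package stats

import (
	"math"
	"sort"
)

// Quantiles sorts a copy of x once, then computes each requested
// quantile q in [0, 1] by linear interpolation between the two
// nearest order statistics. Empty x and out-of-range q are not
// handled in this sketch.
func Quantiles(x []float64, quantiles []float64) []float64 {
	s := make([]float64, len(x))
	copy(s, x)
	sort.Float64s(s)
	out := make([]float64, len(quantiles))
	for i, q := range quantiles {
		pos := q * float64(len(s)-1)
		lo := int(math.Floor(pos))
		hi := int(math.Ceil(pos))
		out[i] = s[lo] + (pos-float64(lo))*(s[hi]-s[lo])
	}
	return out
}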

@arnehormann
Contributor

Please consider a struct for the API. Also, you could use an online variant like the one in https://www.johndcook.com/skewness_kurtosis.html
I strongly suspect it has sign errors in the operator+ implementation for skewness and kurtosis, and it could use min and max; otherwise it's great.
Primary sources are referenced in the linked article.

@tmaxmax

tmaxmax commented Dec 16, 2024

As a data point, at work we've created StdDev, Mean, and Median helpers for some OCR code. Interestingly, every place we use StdDev we also use Mean, which is inefficient and redundant. I'd be in favour of a MeanStdDev function, should these statistics functions be introduced.

@aclements
Member

Please consider a struct for the api.

We've discussed this above and it doesn't seem like the right trade-off for a simple descriptive stats API.

Also, you could use an online variant like in https://www.johndcook.com/skewness_kurtosis.html

Again, it's not clear this is justified in this case. For more advanced stats needs, such as online computation, it's easy enough to pull in an external, more specialized package.
