Weighted variance, standard deviation, covariance & correlation. #53

Closed
lindahua opened this issue Apr 4, 2014 · 19 comments

@lindahua
Contributor

lindahua commented Apr 4, 2014

I am going to add these functionalities to the package soon.

One question needs to be decided: should we apply a bias correction to the scale, as we do in the unweighted case? For example:

m = mean(x, w)
# shall we do:
var(x, w) = sum(w .* abs2(x - m)) / (sum(w) - 1)
# or do:
var(x, w) = sum(w .* abs2(x - m)) / sum(w)
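For illustration, the two candidate scalings can be written out numerically (a Python sketch with hypothetical helper names, not the package API):

```python
def weighted_mean(x, w):
    # weighted arithmetic mean
    return sum(wi * xi for wi, xi in zip(w, x)) / sum(w)

def wvar_corrected(x, w):
    # divide by sum(w) - 1, mirroring the unweighted n - 1 correction
    m = weighted_mean(x, w)
    return sum(wi * (xi - m) ** 2 for wi, xi in zip(w, x)) / (sum(w) - 1)

def wvar_uncorrected(x, w):
    # divide by sum(w): the plain weighted second moment about the mean
    m = weighted_mean(x, w)
    return sum(wi * (xi - m) ** 2 for wi, xi in zip(w, x)) / sum(w)

# With unit weights these reduce to the ordinary corrected/uncorrected variance.
```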
@johnmyleswhite
Member

Does the correction by subtracting 1 produce an unbiased estimator?

@nalimilan
Member

Wikipedia offers a detailed explanation of the problem:
https://en.wikipedia.org/wiki/Weighted_arithmetic_mean#Weighted_sample_variance

In summary, the unbiased estimator is only defined if the weights represent an integer number of cases, which is what @lindahua's first corrected formula above does. But it is quite common to have other types of weights, e.g. inverse-variance weights or sampling weights; the latter are even sometimes expressed as integers (like case weights).

NumPy does not offer weighted variance at all, and MATLAB doesn't mention any correction when weights are used. So I'd say that by default you'd better return the uncorrected version (the second one), but support the corrected version via the same keyword argument as for the unweighted variance, with a warning if the weights are not integers, and clear documentation of the fact that the weights need to be case/repeat weights.
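To make the frequency-weight case concrete: with integer weights, the corrected formula matches the ordinary corrected variance of the sample with each observation repeated weight-many times (a Python sketch for illustration, not the package API):

```python
from statistics import variance

def wvar_freq(x, w):
    # corrected weighted variance: divide by sum(w) - 1
    sw = sum(w)
    m = sum(wi * xi for wi, xi in zip(w, x)) / sw
    return sum(wi * (xi - m) ** 2 for wi, xi in zip(w, x)) / (sw - 1)

x = [1.0, 2.0, 4.0]
w = [2, 1, 3]
# repeat each observation w[i] times: [1, 1, 2, 4, 4, 4]
expanded = [xi for xi, wi in zip(x, w) for _ in range(wi)]
# wvar_freq(x, w) agrees with variance(expanded)
```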

@StefanKarpinski
Contributor

If the default is to correct in the non-weighted case, it seems both more correct and more consistent to correct for weighted variance as well, the only trouble being to figure out the right correction. Can we just figure out what the correction ought to be in the non-integer case? I somehow doubt that this group of people can't figure out the right way to do this correction. For that matter, @dmbates might just know it off the top of his head.

@StefanKarpinski
Contributor

Ah, I see that the corrected weighted variance is not well defined when the weights don't represent sample counts. So yeah, it's not just a matter of not knowing, but that it can't be known.

@nalimilan
Member

A possibly interesting solution could be to define different kinds of weights as different Julia types, so that one would write var(x, caseweights(w)), just as one already writes mean(x, weights(w)) to create a WeightVec object. Case/repeat weights would use the correction by default, but other types of weights wouldn't. CaseWeightVec (or any other name) would inherit from the more general AbstractWeightVec type, but would allow making more assumptions when applicable.

This may look like overkill just to compute a variance, but I'm thinking about the potential benefits for more complex methods like regression models. In R it's always difficult to know exactly what kind of weights a specific modeling function expects, and getting it wrong is dangerous. If the meaning of the weights were specified via the type system, everything would be clear, and functions could accept different types of weights and still do the correct computations (or raise an error).
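A rough sketch of the dispatch idea (Python for illustration; all names here are hypothetical, not the actual StatsBase API):

```python
class AbstractWeights:
    def __init__(self, values):
        self.values = list(values)

class FrequencyWeights(AbstractWeights):
    """Case/repeat weights: counts of observations, so the n - 1 style
    correction is well defined."""

class AnalyticWeights(AbstractWeights):
    """Inverse-variance weights: no count interpretation, no correction."""

def var(x, w):
    # dispatch on the weight type to pick the scaling
    sw = sum(w.values)
    m = sum(wi * xi for wi, xi in zip(w.values, x)) / sw
    ss = sum(wi * (xi - m) ** 2 for wi, xi in zip(w.values, x))
    if isinstance(w, FrequencyWeights):
        return ss / (sw - 1)  # corrected: weights count cases
    return ss / sw            # uncorrected fallback
```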

@johnmyleswhite
Member

In theory, I like the idea of using weight types to distinguish different cases. The question becomes: are there just a few cases we need to support and can we put them into StatsBase so that other stats packages will support them? If the types become canonical, they'll be great. If they're not, they'll just seem verbose.

My reading of the Wikipedia article is that the weighted case can always be made unbiased if you know the sample size. Is that right? If so, that seems to argue for having three definitions:

(1) The uncorrected, biased estimator defined when using just weights.

(2) A corrected, unbiased estimator that requires that you specify the sample size explicitly.

(3) A corrected, biased estimator that just hopes that the sample size is the sum of the weights.
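The three definitions could be sketched like this (Python for illustration; reading "corrected" as scaling the biased estimator by n / (n - 1) is an assumption on my part):

```python
def wvar_biased(x, w):
    # (1) uncorrected: divide by sum(w)
    sw = sum(w)
    m = sum(wi * xi for wi, xi in zip(w, x)) / sw
    return sum(wi * (xi - m) ** 2 for wi, xi in zip(w, x)) / sw

def wvar_unbiased(x, w, n):
    # (2) corrected with an explicitly supplied sample size n
    return wvar_biased(x, w) * n / (n - 1)

def wvar_hope(x, w):
    # (3) hope the sample size is just the sum of the weights
    return wvar_unbiased(x, w, sum(w))
```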

@StefanKarpinski
Contributor

Is there a situation where weight vectors make sense without a sample size? I guess if you're describing something that's not derived from a sample then it makes sense – but in that case correction is actually wrong. Perhaps the important distinction here is whether the data represents a sample or an ideal. Computing the variance of a sample requires knowing the sample size and should be corrected, whereas for an ideal distribution the sample size doesn't even make sense.

@nalimilan
Member

@johnmyleswhite What do you mean by "a few cases we need to support"? A few types of weights? A few methods?

@StefanKarpinski Yeah, that's why I suggested that the generic WeightVec would not make any assumptions regarding the sample, while other more specific types would allow making the required assumptions.

@johnmyleswhite
Member

One place where weighted variances that might not have a well-defined sample size come up is EM for Gaussian mixture models.

Milan, I was thinking of a few types of weights. CountWeights, ArbitraryWeights, etc. If that list is short and generally useful, using the type system seems reasonable.

@nalimilan
Member

nalimilan commented Apr 5, 2014

I think you can find a good list here: http://www.stata.com/help.cgi?weight and http://books.google.fr/books?id=L96ludyhFBsC&pg=PP17&lpg=PP17

  • What they call "probability weights" are also called "sampling weights". The only assumption which can be made with them is that they represent the inverse of the sampling probability, therefore their sum is the size of the target population. (More information is usually provided with surveys to compute statistics taking into account the survey design.)
  • "Frequency weights" are also called "case weights" or "repeat weights".
  • "Importance weights" should be the fallback type where others do not apply.
  • "Analytic weights" are also called "precision weights" or "inverse variance weights".

I think that covers all cases. One may add "replicate weights" to the list: these simply reflect resampling replicates (e.g. bootstrap). I'm not sure they deserve their own type since they are actually just frequency weights (though sometimes combined with sampling weights, i.e. an observation appearing twice in the replicate will have its weight doubled -- such weights are sometimes shipped with survey data for privacy reasons when details about the survey design cannot be made public.)

@StefanKarpinski
Contributor

I'm concerned that such a fine-grained classification is too fussy and not user friendly.

@johnmyleswhite
Member

I'm very sympathetic to that concern, but I've also heard my coworkers complain many times about how hard it is to know what kinds of weights a function in R expects as input. Using the type system here could remove that kind of uncertainty.

@lindahua
Contributor Author

lindahua commented Apr 6, 2014

Frankly, I did not realize that there are so many kinds of weights with subtly different meanings.

In machine learning, weights usually come from applications where samples are associated with confidences, inverse variances, or assignment probabilities (e.g. EM). Correction is thus not really necessary in such cases.

I raised this issue because I feel that statisticians may have a deeper understanding here. Is there any literature related to this discussion?

@nalimilan
Member

@StefanKarpinski It's not user-friendly, but it's just how the world works. ;-) Seriously, anybody dealing with weights will have to check exactly what type of weights they have. And while, for example, I would personally only need sampling weights, @lindahua appears to be more familiar with inverse-variance weights. That said, people working only with arbitrary weights with no precise meaning could still specify them as the most generic type, and they would get errors to prevent them from doing things that require more assumptions where applicable (like the corrected variance).

@lindahua What kind of literature are you looking for? Something giving details about each type of weights and their use? http://books.google.fr/books?id=L96ludyhFBsC&pg=PP15 (and the rest of the chapter, as well as the whole book) is a good reference for sampling weights.

While using the type system to convey information about the kind of weights sounds logical, I realized it does not fit well the case where the weights are stored as a column in a matrix or DataFrame, together with the variables -- which I think is the most common use case. If you do so, you lose the type, since the weights must be a vector, not an arbitrary object, to fit into a matrix or DataFrame. So you would have to specify the type of the weights each time you call the function.

That's not a criticism of the current interface as long as its goal is to take arbitrary vectors; but we may want to find a better mechanism for people working with DataFrames, where you'd be able to specify the type of the weights only once (to be stored as an attribute of the DataFrame). If we are to create a few types corresponding to different kinds of weights, we should keep this in mind and check how it would fit with a DataFrames interface. Else it's probably not worth creating these types, if another competing mechanism has to be implemented, making the whole system too complex.

@nalimilan
Member

This page provides a detailed summary of how (corrected) weighted variance is computed in Stata for analytic and sampling weights, with detailed formulas (see bottom of the page): http://www.stata.com/support/faqs/statistics/weights-and-summary-statistics/

In their terminology, aweights are analytic/inverse variance weights, fweights are frequency weights, pweights are probability/sampling weights, and iweights are importance/arbitrary weights.
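As I read that page, the aweight convention amounts to rescaling the weights to sum to the number of observations n and then dividing by n - 1. A hedged sketch (Python; my paraphrase of the description, not a verified port of Stata's formulas):

```python
def avar(x, w):
    # analytic-weight variance, following the convention described above
    n = len(x)
    scale = n / sum(w)
    wn = [wi * scale for wi in w]  # normalized weights now sum to n
    m = sum(wi * xi for wi, xi in zip(wn, x)) / n
    return sum(wi * (xi - m) ** 2 for wi, xi in zip(wn, x)) / (n - 1)

# With equal weights this reduces to the ordinary corrected variance,
# regardless of the weights' overall scale.
```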

@lindahua
Contributor Author

lindahua commented Apr 6, 2014

Thanks @nalimilan. I will look at the Stata page.

@simonster
Member

@lindahua Are you working on this? If not, I may give it a shot.

@lindahua
Contributor Author

@simonster Please go ahead and give it a shot. I am not working on this issue right now.

@lindahua lindahua added this to the version 0.5 milestone Jun 1, 2014
@lindahua lindahua modified the milestones: version 0.5, version 0.6 Jun 22, 2014
@lindahua
Contributor Author

I have added weighted covariance.

Currently, it scales by inv(sum(wv)) without correction. I think this is a reasonable default behavior (documented), as it is possible that sum(wv) < 1.
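The described default, sketched in Python for illustration (the actual implementation is Julia code in StatsBase):

```python
def wcov(x, y, w):
    # uncorrected weighted covariance: scale by 1 / sum(w)
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    return sum(wi * (xi - mx) * (yi - my)
               for wi, xi, yi in zip(w, x, y)) / sw
```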

Please feel free to reopen if there is a better idea about how this should be implemented.
