Weighted variance, standard deviation, covariance & correlation. #53

Closed
lindahua opened this issue Apr 4, 2014 · 19 comments

@lindahua
Contributor

lindahua commented Apr 4, 2014

I am going to add these functionalities to the package soon.

One question needs to be decided: should we apply a bias correction to the scale, as we do in the unweighted case? For example:

m = mean(x, w)
# shall we do:
var(x, w) = sum(w .* abs2(x - m)) / (sum(w) - 1)
# or do:
var(x, w) = sum(w .* abs2(x - m)) / sum(w)
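For illustration, the two candidate scalings can be written out numerically (a Python sketch with hypothetical helper names, not the package API):

```python
def weighted_mean(x, w):
    # weighted arithmetic mean
    return sum(wi * xi for wi, xi in zip(w, x)) / sum(w)

def wvar_corrected(x, w):
    # divide by sum(w) - 1, mirroring the unweighted n - 1 correction
    m = weighted_mean(x, w)
    return sum(wi * (xi - m) ** 2 for wi, xi in zip(w, x)) / (sum(w) - 1)

def wvar_uncorrected(x, w):
    # divide by sum(w): the plain weighted second moment about the mean
    m = weighted_mean(x, w)
    return sum(wi * (xi - m) ** 2 for wi, xi in zip(w, x)) / sum(w)

# With unit weights these reduce to the ordinary corrected/uncorrected variance.
```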
@johnmyleswhite
Member

Does the correction by subtracting 1 produce an unbiased estimator?

@nalimilan
Member

Wikipedia offers a detailed explanation of the problem:
https://en.wikipedia.org/wiki/Weighted_arithmetic_mean#Weighted_sample_variance

In summary, the unbiased estimator is only defined if the weights represent an integer number of cases, which is what @lindahua's first corrected formula above does. But it is quite common to have other types of weights, e.g. inverse-variance weights or sampling weights; the latter are even sometimes expressed as integers (like case weights).

NumPy does not offer weighted variance at all, and MATLAB doesn't mention any correction when weights are used. So I'd say that by default you'd better return the uncorrected version (the second one), but support the corrected version via the same keyword argument as for the unweighted variance, with a warning if the weights are not integers, and clear documentation of the fact that the weights need to be case/repeat weights.
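To make the frequency-weight case concrete: with integer weights, the corrected formula matches the ordinary corrected variance of the sample with each observation repeated weight-many times (a Python sketch for illustration, not the package API):

```python
from statistics import variance

def wvar_freq(x, w):
    # corrected weighted variance: divide by sum(w) - 1
    sw = sum(w)
    m = sum(wi * xi for wi, xi in zip(w, x)) / sw
    return sum(wi * (xi - m) ** 2 for wi, xi in zip(w, x)) / (sw - 1)

x = [1.0, 2.0, 4.0]
w = [2, 1, 3]
# repeat each observation w[i] times: [1, 1, 2, 4, 4, 4]
expanded = [xi for xi, wi in zip(x, w) for _ in range(wi)]
# wvar_freq(x, w) agrees with variance(expanded)
```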

@StefanKarpinski
Contributor

If the default is to correct in the non-weighted case, it seems both more correct and more consistent to correct for weighted variance as well, the only trouble being to figure out the right correction. Can we just figure out what the correction ought to be in the non-integer case? I somehow doubt that this group of people can't figure out the right way to do this correction. For that matter, @dmbates might just know it off the top of his head.

@StefanKarpinski
Contributor

Ah, I see that the corrected weighted variance is not well defined when the weights don't represent sample counts. So yeah, it's not just a matter of not knowing, but that it can't be known.

@nalimilan
Member

A possibly interesting solution could be to define different kinds of weights as different Julia types, so that one would write var(x, caseweights(w)), just as one already writes mean(x, weights(w)) to create a WeightVec object. Case/repeat weights would use the correction by default, but other types of weights wouldn't. CaseWeightVec (or any other name) would inherit from the more general AbstractWeightVec type, but would allow making more assumptions when applicable.

This may look like overkill just to compute a variance, but I'm thinking about the potential benefits for more complex methods like regression models. In R it's always difficult to know exactly what kind of weights a specific modeling function expects, and getting it wrong is dangerous. If the meaning of the weights were specified via the type system, everything would be clear, and functions could accept different types of weights and still do the correct computations (or raise an error).
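A rough sketch of the dispatch idea (Python for illustration; all names here are hypothetical, not the actual StatsBase API):

```python
class AbstractWeights:
    def __init__(self, values):
        self.values = list(values)

class FrequencyWeights(AbstractWeights):
    """Case/repeat weights: counts of observations, so the n - 1 style
    correction is well defined."""

class AnalyticWeights(AbstractWeights):
    """Inverse-variance weights: no count interpretation, no correction."""

def var(x, w):
    # dispatch on the weight type to pick the scaling
    sw = sum(w.values)
    m = sum(wi * xi for wi, xi in zip(w.values, x)) / sw
    ss = sum(wi * (xi - m) ** 2 for wi, xi in zip(w.values, x))
    if isinstance(w, FrequencyWeights):
        return ss / (sw - 1)  # corrected: weights count cases
    return ss / sw            # uncorrected fallback
```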

@johnmyleswhite
Member

In theory, I like the idea of using weight types to distinguish different cases. The question becomes: are there just a few cases we need to support and can we put them into StatsBase so that other stats packages will support them? If the types become canonical, they'll be great. If they're not, they'll just seem verbose.

My reading of the Wikipedia article is that the weighted case can always be made unbiased if you know the sample size. Is that right? If so, that seems to argue for having three definitions:

(1) The uncorrected, biased estimator defined when using just weights.

(2) A corrected, unbiased estimator that requires that you specify the sample size explicitly.

(3) A corrected, biased estimator that just hopes that the sample size is the sum of the weights.
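The three definitions could be sketched like this (Python for illustration; reading "corrected" as scaling the biased estimator by n / (n - 1) is an assumption on my part):

```python
def wvar_biased(x, w):
    # (1) uncorrected: divide by sum(w)
    sw = sum(w)
    m = sum(wi * xi for wi, xi in zip(w, x)) / sw
    return sum(wi * (xi - m) ** 2 for wi, xi in zip(w, x)) / sw

def wvar_unbiased(x, w, n):
    # (2) corrected with an explicitly supplied sample size n
    return wvar_biased(x, w) * n / (n - 1)

def wvar_hope(x, w):
    # (3) hope the sample size is just the sum of the weights
    return wvar_unbiased(x, w, sum(w))
```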

@StefanKarpinski
Contributor

Is there a situation where weight vectors make sense without a sample size? I guess if you're describing something that's not derived from a sample then it makes sense – but in that case correction is actually wrong. Perhaps the important distinction here is whether the data represents a sample or an ideal. Computing the variance of a sample requires knowing the sample size and should be corrected, whereas for an ideal distribution the sample size doesn't even make sense.

@nalimilan
Member

@johnmyleswhite What do you mean by "a few cases we need to support"? A few types of weights? A few methods?

@StefanKarpinski Yeah, that's why I suggested that the generic WeightVec would not make any assumptions regarding the sample, while other more specific types would allow making the required assumptions.

@johnmyleswhite
Member

One place where weighted variances that might not have a well-defined sample size come up is EM for Gaussian mixture models.

Milan, I was thinking of a few types of weights. CountWeights, ArbitraryWeights, etc. If that list is short and generally useful, using the type system seems reasonable.

@nalimilan
Member

nalimilan commented Apr 5, 2014

I think you can find a good list here: http://www.stata.com/help.cgi?weight and http://books.google.fr/books?id=L96ludyhFBsC&pg=PP17&lpg=PP17

  • What they call "probability weights" are also called "sampling weights". The only assumption which can be made with them is that they represent the inverse of the sampling probability, therefore their sum is the size of the target population. (More information is usually provided with surveys to compute statistics taking into account the survey design.)
  • "Frequency weights" are also called "case weights" or "repeat weights".
  • "Importance weights" should be the fallback type where others do not apply.
  • "Analytic weights" are also called "precision weights" or "inverse variance weights".

I think that covers all cases. One may add "replicate weights" to the list: these simply reflect resampling replicates (e.g. bootstrap). I'm not sure they deserve their own type since they are actually just frequency weights (though sometimes combined with sampling weights, i.e. an observation appearing twice in the replicate will have its weight doubled -- such weights are sometimes shipped with survey data for privacy reasons when details about the survey design cannot be made public.)

@StefanKarpinski
Contributor

I'm concerned that such a fine-grained classification is too fussy and not user friendly.

@johnmyleswhite
Member

I'm very sympathetic to that concern, but I've also heard my coworkers complain many times about how hard it is to know what kinds of weights a function in R expects as input. Using the type system here could remove that kind of uncertainty.

@lindahua
Contributor Author

lindahua commented Apr 6, 2014

Frankly, I did not realize that there are so many kinds of weights with subtly different meanings.

In machine learning, weights usually come from applications where samples are associated with confidences, inverse variances, or assignment probabilities (e.g. EM). Correction is thus not really necessary in such cases.

I raised this issue because I feel that statisticians may have a deeper understanding here. Is there any literature related to this discussion?

@nalimilan
Member

@StefanKarpinski It's not user-friendly, but it's just how the world works. ;-) Seriously, anybody dealing with weights will have to check exactly what type of weights they have. And while, for example, I would personally only need sampling weights, @lindahua appears to be more familiar with inverse-variance weights. That said, people working only with arbitrary weights with no precise meaning could still specify them as the most generic type, and they would get errors to prevent them from doing things that require more assumptions where applicable (like the corrected variance).

@lindahua What kind of literature are you looking for? Something giving details about each type of weights and their use? http://books.google.fr/books?id=L96ludyhFBsC&pg=PP15 (and the rest of the chapter, as well as the whole book) is a good reference for sampling weights.

While using the type system to convey information about the kind of weights sounds logical, I realized it does not fit well the case where the weights are stored as a column in a matrix or DataFrame, together with the variables -- which I think is the most common use case. If you do so, you lose the type, since the weights must be a vector, not an arbitrary object, to fit into a matrix or DataFrame. So you would have to specify the type of the weights each time you call the function.

That's not a criticism of the current interface as long as its goal is to take arbitrary vectors; but we may want to find a better mechanism for people working with DataFrames, where you'd be able to specify the type of the weights only once (to be stored as an attribute of the DataFrame). If we are to create a few types corresponding to different kinds of weights, we should keep this in mind and check how it would fit with a DataFrames interface. Else it's probably not worth creating these types, if another competing mechanism has to be implemented, making the whole system too complex.

@nalimilan
Member

This page provides a detailed summary of how (corrected) weighted variance is computed in Stata for analytic and sampling weights, with detailed formulas (see bottom of the page): http://www.stata.com/support/faqs/statistics/weights-and-summary-statistics/

In their terminology, aweights are analytic/inverse variance weights, fweights are frequency weights, pweights are probability/sampling weights, and iweights are importance/arbitrary weights.
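As I read that page, the aweight convention amounts to rescaling the weights to sum to the number of observations n and then dividing by n - 1. A hedged sketch (Python; my paraphrase of the description, not a verified port of Stata's formulas):

```python
def avar(x, w):
    # analytic-weight variance, following the convention described above
    n = len(x)
    scale = n / sum(w)
    wn = [wi * scale for wi in w]  # normalized weights now sum to n
    m = sum(wi * xi for wi, xi in zip(wn, x)) / n
    return sum(wi * (xi - m) ** 2 for wi, xi in zip(wn, x)) / (n - 1)

# With equal weights this reduces to the ordinary corrected variance,
# regardless of the weights' overall scale.
```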

@lindahua
Contributor Author

lindahua commented Apr 6, 2014

Thanks @nalimilan. I will look at the Stata page.

@simonster
Member

@lindahua Are you working on this? If not, I may give it a shot.

@lindahua
Contributor Author

@simonster Please go ahead and give it a shot. I am not working on this issue right now.

@lindahua lindahua added this to the version 0.5 milestone Jun 1, 2014
@lindahua lindahua modified the milestones: version 0.5, version 0.6 Jun 22, 2014
@lindahua
Contributor Author

I have added weighted covariance.

Currently, it scales by inv(sum(wv)) without correction. I think this is a reasonable default behavior (documented), as it is possible that sum(wv) < 1.
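The described default, sketched in Python for illustration (the actual implementation is Julia code in StatsBase):

```python
def wcov(x, y, w):
    # uncorrected weighted covariance: scale by 1 / sum(w)
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    return sum(wi * (xi - mx) * (yi - my)
               for wi, xi, yi in zip(w, x, y)) / sw
```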

Please feel free to reopen if there is a better idea about how this should be implemented.
