Weighted variance, standard deviation, covariance & correlation. #53
Does the correction by subtracting 1 produce an unbiased estimator?
Wikipedia offers a detailed explanation of the problem. In summary, the unbiased estimator is only defined when the weights represent an integer number of cases, which is what @lindahua's first corrected formula above does. But it is quite common to have other types of weights, e.g. inverse-variance weights or sampling weights; the latter are even sometimes expressed as integers (like case weights). Numpy does not offer weighted variance at all, and MATLAB doesn't mention correction when weights are used. So I'd say that by default you'd better return the uncorrected version (the second one), but support the corrected version via the same keyword argument as for the unweighted variance, with a warning if the weights are not integers and clear documentation of the fact that the weights need to be case/repeat weights.
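For concreteness, the case-weight version of the corrected estimator can be sketched in Python (illustrative only, since the package under discussion is in Julia; the function name is made up for the example). With integer case weights, the corrected estimator matches the ordinary unbiased variance of the expanded sample:

```python
import statistics

def weighted_var_freq(x, w):
    """Corrected weighted variance, assuming w are integer case weights."""
    W = sum(w)  # total number of cases
    m = sum(wi * xi for wi, xi in zip(w, x)) / W
    ss = sum(wi * (xi - m) ** 2 for wi, xi in zip(w, x))
    return ss / (W - 1)  # Bessel's correction over the total case count

x = [1.0, 2.0, 4.0]
w = [2, 3, 1]
expanded = [1.0, 1.0, 2.0, 2.0, 2.0, 4.0]  # each x[i] repeated w[i] times
# statistics.variance applies the usual n-1 correction to the expanded sample
assert abs(weighted_var_freq(x, w) - statistics.variance(expanded)) < 1e-12
```

This equivalence is exactly what breaks down for non-integer weights: there is no expanded sample to compare against.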
If the default is to correct in the non-weighted case, it seems both more correct and more consistent to correct for weighted variance as well, the only trouble being to figure out the right correction. Can we just figure out what the correction ought to be in the non-integer case? I somehow doubt that this group of people can't figure out the right way to do this correction. For that matter, @dmbates might just know it off the top of his head.
Ah, I see that the corrected weighted variance is not well defined when the weights don't represent sample counts. So yeah, it's not just a matter of not knowing, but that it can't be known.
A possibly interesting solution could be to define different kinds of weights as different Julia types. This may look like overkill just to compute the variance, but I'm thinking about the potential benefits for more complex methods like regression models. In R it's always difficult to know exactly what kind of weights a specific modeling function expects, and it is dangerous if you get it wrong. If the meaning of the weights was specified using the type system, everything would be clear, and functions could accept different types of weights and still do the correct computations (or print an error).
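To illustrate the type-based idea (sketched here in Python for brevity, even though the proposal is about Julia types; all names below are hypothetical), weight vectors could be tagged with their meaning, and the variance routine could refuse a correction it cannot justify:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class FrequencyWeights:
    """Integer case counts: observation i appears values[i] times."""
    values: List[float]

@dataclass
class AnalyticWeights:
    """e.g. inverse-variance weights; carry no case-count meaning."""
    values: List[float]

def wvar(x, w, corrected=True):
    v = w.values
    W = sum(v)
    m = sum(wi * xi for wi, xi in zip(v, x)) / W
    ss = sum(wi * (xi - m) ** 2 for wi, xi in zip(v, x))
    if not corrected:
        return ss / W           # biased estimator: always well defined
    if isinstance(w, FrequencyWeights):
        return ss / (W - 1)     # unbiased: total weight is the sample size
    raise ValueError("corrected variance is undefined for analytic weights")
```

The point of the sketch is the dispatch, not the formulas: the same call is safe for frequency weights and an explicit error for analytic ones.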
In theory, I like the idea of using weight types to distinguish the different cases. The question becomes: are there just a few cases we need to support, and can we put them into StatsBase so that other stats packages will support them? If the types become canonical, they'll be great. If they're not, they'll just seem verbose. My reading of the Wikipedia article is that the weighted case can always be made unbiased if you know the sample size. Is that right? If so, that seems to argue for having three definitions: (1) the uncorrected, biased estimator defined using just the weights; (2) a corrected, unbiased estimator that requires you to specify the sample size explicitly; (3) a corrected, possibly biased estimator that just hopes the sample size is the sum of the weights.
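One reading of these three definitions, as a Python sketch (illustrative only; function names are made up, and whether the n/(n-1) factor is the right correction for a given weight type is exactly what the thread is debating):

```python
def wvar_uncorrected(x, w):
    # (1) biased estimator: normalize by the total weight
    W = sum(w)
    m = sum(wi * xi for wi, xi in zip(w, x)) / W
    return sum(wi * (xi - m) ** 2 for wi, xi in zip(w, x)) / W

def wvar_corrected(x, w, n=None):
    # (2) rescale by n/(n-1) when the true sample size n is supplied;
    # (3) otherwise hope that sum(w) is the sample size, which reduces
    #     to the usual ss / (sum(w) - 1) frequency-weight formula
    W = sum(w)
    n = W if n is None else n
    return wvar_uncorrected(x, w) * n / (n - 1)
```

Note that (3) coincides with the frequency-weight estimator when the weights really are integer case counts, and silently gives a wrong correction otherwise.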
Is there a situation where weighted vectors make sense without a sample size? I guess if you're describing something that's not derived from a sample then it makes sense, but in that case correction is actually wrong. Perhaps the important distinction here is whether the data represents a sample or an ideal. Computing the variance of a sample requires knowing the sample size and should be corrected, whereas for an ideal distribution the sample size doesn't even make sense.
@johnmyleswhite What do you mean by "a few cases we need to support"? A few types of weights? A few methods? @StefanKarpinski Yeah, that's why I suggested that the generic
One place where weighted variances without a well-defined sample size come up is EM for Gaussian mixture models. Milan, I was thinking of a few types of weights: CountWeights, ArbitraryWeights, etc. If that list is short and generally useful, using the type system seems reasonable.
I think you can find a good list here: http://www.stata.com/help.cgi?weight and http://books.google.fr/books?id=L96ludyhFBsC&pg=PP17&lpg=PP17
I think that covers all cases. One might add "replicate weights" to the list: these simply reflect resampling replicates (e.g. bootstrap). I'm not sure they deserve their own type, since they are really just frequency weights (though sometimes combined with sampling weights, i.e. an observation appearing twice in the replicate will have its weight doubled; such weights are sometimes shipped with survey data for privacy reasons, when details about the survey design cannot be made public).
I'm concerned that such a fine-grained classification is too fussy and not user-friendly.
I'm very sympathetic to that concern, but I've also heard my coworkers complain many times about how hard it is to know what kinds of weights a function in R expects as input. Using the type system here could remove that kind of uncertainty.
Frankly, I did not realize that there were so many kinds of weights with subtly different meanings. In machine learning, weights usually come from applications where samples are associated with confidences, inverse variances, or assignment probabilities (e.g. EM). Correction is thus not really necessary in such cases. I raised this issue as I feel that statisticians may have a deeper understanding here. Is there any literature related to this discussion?
@StefanKarpinski It's not user-friendly, but it's just how the world works. ;-) Seriously, anybody dealing with weights has to check exactly what type of weights they have. And while, for example, I would personally only need sampling weights, @lindahua appears to be more familiar with inverse-variance weights. That said, people working only with arbitrary weights with no precise meaning could still specify them as the most generic type, and they would get errors to prevent them from doing things that require stronger assumptions where applicable (like the corrected variance). @lindahua What kind of literature are you looking for? Something giving details about each type of weights and their use? http://books.google.fr/books?id=L96ludyhFBsC&pg=PP15 (and the rest of the chapter, as well as the whole book) is a good reference for sampling weights. While using the type system to convey information about the type of weights sounds logical, I realized it does not fit well the case where weights are stored in a matrix or That's not a criticism of the current interface as long as its goal is to take arbitrary vectors; but we may want to find a better mechanism for people working with
This page provides a detailed summary of how the (corrected) weighted variance is computed in Stata for analytic and sampling weights, with detailed formulas (see the bottom of the page): http://www.stata.com/support/faqs/statistics/weights-and-summary-statistics/ In their terminology,
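From my reading of that Stata FAQ (worth double-checking against the page itself), analytic weights are first rescaled to sum to the number of observations n, after which the usual n-1 correction is applied. A hedged Python sketch of that convention (function name made up):

```python
def stata_aweight_var(x, w):
    """Corrected variance with analytic weights, Stata-style (my reading of
    the FAQ): rescale weights to sum to n, then apply the n-1 correction."""
    n = len(x)
    W = sum(w)
    v = [wi * n / W for wi in w]  # rescaled weights now sum to n
    m = sum(vi * xi for vi, xi in zip(v, x)) / n
    return sum(vi * (xi - m) ** 2 for vi, xi in zip(v, x)) / (n - 1)
```

A notable consequence of the rescaling is that the result is invariant to multiplying all weights by a constant, unlike the frequency-weight formula, where the absolute magnitude of the weights carries meaning.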
Thanks @nalimilan. I will look at the Stata page. |
@lindahua Are you working on this? If not, I may give it a shot. |
@simonster Please go ahead and give it a shot. I am not working on this issue right now.
I have added weighted covariance. Currently, it scales by Please feel free to reopen if there is a better idea of how this should be implemented.
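The exact scaling chosen did not survive in the comment above; an uncorrected weighted covariance that scales by the total weight, consistent with the uncorrected variance discussed earlier, could be sketched like this in Python (illustrative only, name made up):

```python
def wcov(x, y, w):
    """Uncorrected weighted covariance: scale by the total weight sum(w)."""
    W = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / W
    my = sum(wi * yi for wi, yi in zip(w, y)) / W
    return sum(wi * (xi - mx) * (yi - my)
               for wi, xi, yi in zip(w, x, y)) / W
```

With this convention, `wcov(x, x, w)` reduces to the uncorrected weighted variance, and a weighted correlation follows by dividing by the two weighted standard deviations.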
I am going to add these functionalities to the package soon.
One question needs to be decided: should we apply the same scale correction as in the unweighted case?