Adding support for different weight vector types #250

rofinn · 2017-04-25T19:44:39Z

This PR is intended to address concerns raised in issues #53 and #249

Requirements:

nalimilan

Thanks for doing this! I've made a few comments, without having reviewed everything yet.

nalimilan · 2017-04-25T21:18:20Z

src/weights.jl

+        bias::Int
+    end
+
+    immutable ProbabilityWeights{S<:Real, T<:Real, V<:RealVector} <: AbstractWeights{S, T, V}


It turns out "sampling weights" is a much more common expression than "probability weights". Maybe we should use the former term?

I don't have very strong feelings either way, but sampling just seemed more ambiguous to me.

I've heard both. In my line of work I don't come across these kinds of weights much so I can't speak to the prominence of either term overall, but I think "probability weights" is what I learned to call them in grad school.

nalimilan · 2017-04-25T21:20:19Z

src/weights.jl

 If omitted, `wsum` is computed.
 """
-function WeightVec{S<:Real, V<:RealVector}(vs::V, s::S=sum(vs))
-    return WeightVec{S, eltype(vs), V}(vs, s)
+function Weights{S<:Real, V<:RealVector}(vs::V, s::S=sum(vs); corrected::Bool=true)


Wouldn't it make more sense to decide whether to apply the correction when calling std? For most functions (like mean), the correction does not apply and the results will be the same whatever the kind of weights you have. It seems a bit weird to require people to decide about correction in advance.

Yeah, I guess that would fit with base better. I was initially thinking that it might make sense to precompute the value, but we only use it in a few places so it probably wouldn't cost us much.

I've made this change in my next commit, but it ended up being a lot of little changes.

nalimilan · 2017-04-25T21:26:46Z

src/weights.jl

 """
-weights(vs::RealVector) = WeightVec(vs)
-weights(vs::RealArray) = WeightVec(vec(vs))
+    frequency(vs)


frequency and exponential are really too broad terms to be claimed for weights. We should find more specific names, like fweights and expweights. One (small) advantage is that it would mirror Stata (which also has pweights and aweights).

I tend to prefer full word function names. I'd be fine with keeping frequency and exponential, but only exporting the aliases of pweights, eweights, fweights. That way folks can choose to use the full word versions, but we won't be cluttering up the global namespace on them.

I don't think we should have two ways to access the same function. IMO we should choose one interface and stick to it. I'm with Milan on this one; I don't think we should use frequency and exponential for these, as those names are too general. freqweights and expweights seem reasonable enough names to me.

Looks like scipy also uses the aweights and fweights names https://docs.scipy.org/doc/numpy/reference/generated/numpy.cov.html

Actually, why not just use the type constructors rather than separate functions?

Iono, cause FrequencyWeights seems long...?

Yeah, I suppose so. Perhaps it's best then to follow suit with Stata and SciPy and use aweights and fweights.

nalimilan · 2017-04-25T21:50:11Z

src/weights.jl

 end

+# TODO: constructor for ProbabilityWeights, but I'm not familiar with how bias correction works with these
+# types of weights or if bias correction even makes sense.
+# https://en.wikipedia.org/wiki/Inverse_probability_weighting


See http://www.stata.com/support/faqs/statistics/weights-and-summary-statistics/ for a summary regarding analytical weights (aweights) and probability weights (pweights). Wikipedia has interesting data on frequency weights and analytical/precision weights: https://en.wikipedia.org/wiki/Weighted_arithmetic_mean#Weighted_sample_variance

I'll try to find more references.

As it stands, my understanding is that these weights are intended to provide inferential statistics about the population (vs sample)? Also, I guess analytical/precision weights = reliability weights on the wiki page?

As it stands, my understanding is that these weights are intended to provide inferential statistics about the population (vs sample)?

Yes, probability/sampling weights reflect the fact that some population groups were over/under-sampled, so individuals do not all represent the same number of people in the target population.

Also, I guess analytical/precision weights = reliability weights on the wiki page?

Right. The most explicit name is "inverse-variance weights", and this is what the link given on Wikipedia mentions.

nalimilan · 2017-04-25T21:51:22Z

src/weights.jl

-    immutable WeightVec{S<:Real, T<:Real, V<:AbstractVector{T}} <: AbstractVector{T}
+    abstract AbstractWeights{S<:Real, T<:Real, V<:AbstractVector{T}} <: AbstractVector{T}
+
+    immutable Weights{S<:Real, T<:Real, V<:AbstractVector{T}} <: AbstractWeights{S, T, V}


I'm not sure we still need this. Do we expect people to need kinds of weights which cannot have their own type?

Well, I get the impression that the number of weight kinds should be kept relatively small, and folks might still want to dispatch based on the storage of those weights (e.g., NullableArray vs Vector).

What I meant is that the Weights family might not be needed since all use cases should fall into a well-defined weights type.

Oh, yeah, I've renamed Weights to AnalyticWeights in the next commit to be more specific. I've also added an aweights method and made weights an alias to aweights, but maybe I should just deprecate the weights method in favour of making folks think more about what type of weights they want to use?

Yes, let's deprecate it, as that's the best way to make people discover the new weight types and choose the appropriate one for them. If it turns out some people need other weight types, they will let us know.

ararslan · 2017-04-25T23:04:17Z

src/deprecates.jl

+@deprecate _moment3(v::RealArray, m::Real, wv::AbstractWeights) _moment3(v, wv, m)
+@deprecate _moment4(v::RealArray, m::Real, wv::AbstractWeights) _moment4(v, wv, m)
+@deprecate _momentk(v::RealArray, k::Int, m::Real, wv::AbstractWeights) _momentk(v, k, wv, m)
+@deprecate moment(v::RealArray, k::Int, m::Real, wv::AbstractWeights) moment(v, k, wv, m)


If WeightVec is going away it should have a deprecation, and if it isn't then these deprecations should maintain the existing signatures. (They've probably been deprecated long enough that we can just remove them, but that's an aside.)

Yeah, I've deprecated WeightVec and I was contemplating deprecating weights in favour of aweights to be more specific.

Ideally, the deprecation warning should mention all weight types. I'm also not sure we can call aweights as a replacement: since the current cov does not apply any correction, the deprecation should do the same. Maybe we need to keep the WeightVec type (deprecated) to avoid breaking packages?

That seems reasonable and would also help notify people about the other changes occurring with weights without introducing breaking changes.

How do folks want to handle internal calls to weights(x) if it (and WeightVec) are deprecated?

We leave the code as is and accept that functions like wsample, wmedian, etc. will throw deprecation warnings when they need to create a WeightVec from an array.

Change the default behaviour to creating AnalyticWeights which may break existing behaviour for some people.

AFAIK the kind of weights doesn't matter for wsample or wmedian, it only makes a difference for var and similar functions. IOW the type of weights created for the former functions is an implementation detail, we can choose any type we want (though I would choose frequency weights, as they are simpler to understand).

Yeah, I guess wsample and wmedian weren't good examples. Alright, I'll switch that over to avoid the warnings. Thanks.

ararslan · 2017-04-25T23:06:28Z

src/weights.jl

@@ -2,43 +2,112 @@
 ###### Weight vector #####

 if VERSION < v"0.6.0-dev.2123"
-    immutable WeightVec{S<:Real, T<:Real, V<:RealVector} <: RealVector{T}
+    abstract AbstractWeights{S<:Real, T<:Real, V<:RealVector} <: RealVector{T}


Should be @compat abstract type ... end. Also why is this being conditioned on the version? Looks like that predates this PR, but I don't see why we would need to condition for these definitions.

AbstractWeights{S<:Real, T<:Real, V<:AbstractVector{T}} <: AbstractVector{T} isn't valid syntax cause 0.5 doesn't support triangular dispatch.

Oh dang you're right. 😞

ararslan · 2017-04-25T23:10:30Z

src/weights.jl

+# Arguments
+* `n::Integer`: the desired length of the `Weights`
+* `λ::Real`: is a smoothing factor or rate paremeter between 0 .. 1.
+    As this value approaches 0 the resulting weights will be almost equal(),


Why the parentheses after "equal"? Also should remove "is" on the previous line, "paremeter" -> "parameter", and ".." -> "and".

ararslan · 2017-04-25T23:13:49Z

src/weights.jl

+    while values closer to 1 will put higher weight on the end elements of the vector.
+"""
+function exponential(n::Integer, λ::Real=0.99)
+    @assert 0 <= λ <= 1 && n > 0


Typically for argument checking it's better to use condition || throw(ArgumentError("...")). In this case,

n > 0 || throw(ArgumentError("cannot construct weights of length < 1"))
0 <= λ <= 1 || throw(ArgumentError("smoothing factor must be between 0 and 1"))

That provides more descriptive error messages for the user, plus it's better to use specifically-typed exceptions where possible.
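A minimal sketch of that checking pattern, written in current Julia 1.x syntax rather than the 0.5/0.6 syntax of the diff. Both function names are hypothetical and exist only for illustration; the error messages are the ones suggested above.

```julia
# Hypothetical helper illustrating `condition || throw(ArgumentError(...))`.
function check_eweights_args(n::Integer, λ::Real)
    n > 0 || throw(ArgumentError("cannot construct weights of length < 1"))
    0 <= λ <= 1 || throw(ArgumentError("smoothing factor must be between 0 and 1"))
    return nothing
end

# Hypothetical helper: returns true if `f()` throws an ArgumentError, false otherwise.
function throws_argument_error(f)
    try
        f()
    catch err
        return err isa ArgumentError
    end
    return false
end
```

For example, `throws_argument_error(() -> check_eweights_args(0, 0.5))` should be true. One further reason to prefer this over `@assert`: the `@assert` docstring warns that assertions may be disabled at some optimization levels, whereas an explicit `throw` always runs.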

ararslan · 2017-04-25T23:15:31Z

src/weights.jl

 """
-wmedian(v::RealVector, w::RealVector) = median(v, weights(w))
-wmedian{W<:Real}(v::RealVector, w::WeightVec{W}) = median(v, w)
+wmedian(v::RealVector, w::RealVector) = median(v, weights(w, false))


If the weights API is changing to require a boolean parameter, the existing API should have a formal deprecation.

nalimilan · 2017-04-26T09:29:31Z

I have found more references for the variance correction to apply to different types of weights.

Regarding sampling/probability weights, a good reference is svyvar from the R survey package. What it does (in the simple case where there is no stratification) is simply to apply the Bessel correction with n equal to the number of non-zero weights:

n = count(!iszero, w)
sw = sum(w)
xbar = sum(w .* x)/sw
sum(w .* (x - xbar).^2)/sw * n/(n - 1)

(Of course this is just a Julia illustration of the algorithm. This code should be optimized to avoid making copies.)
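As a rough idea of what the copy-free version could look like, here is a two-pass loop that computes the same quantity as the illustration above. This is a sketch in current Julia 1.x syntax; `pweights_var` is a hypothetical name, not a function from this PR, and it accumulates in `Float64` for simplicity.

```julia
# Hypothetical name; mirrors the svyvar-style formula quoted above without temporaries.
function pweights_var(x::AbstractVector{<:Real}, w::AbstractVector{<:Real})
    length(x) == length(w) || throw(DimensionMismatch("x and w must have the same length"))
    # First pass: sum of weights, weighted sum, and number of non-zero weights.
    sw = 0.0
    swx = 0.0
    n = 0
    for i in eachindex(x, w)
        sw += w[i]
        swx += w[i] * x[i]
        n += !iszero(w[i])
    end
    xbar = swx / sw
    # Second pass: weighted sum of squared deviations from the weighted mean.
    ss = 0.0
    for i in eachindex(x, w)
        ss += w[i] * abs2(x[i] - xbar)
    end
    # Bessel correction with n equal to the number of non-zero weights.
    return ss / sw * n / (n - 1)
end
```

With all weights equal to one this reduces to the usual corrected sample variance, which is a convenient sanity check.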

A good reference for analytical/precision weights is wtd.var from the Hmisc R package. For these weights, we should use the same formula as wtd.var when passed normwt=TRUE, i.e. normalize the weights so that they sum to the vector length (or maybe to the number of non-zero weights for consistency with above). Then the basic formula is:

n = length(w) # Could be count(!iszero, w) instead
w = w * n/sum(w)
sw = sum(w) # This is now equal to n, but maybe we should support non-normalized weights?
xbar = sum(w .* x)/sw
sum(w .* (x - xbar).^2)/(sw - sum(w.^2)/sw)

This corresponds to the Wikipedia formula, and indeed it was written by the same person. Regarding the actual implementation, we could normalize the weights when creating the vector if that's always needed.
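The analytical-weights formula can also be sketched as a loop without temporaries. This is only an illustration in current Julia 1.x syntax; `aweights_var` is a hypothetical name. Note that the final expression is unchanged if every weight is multiplied by the same constant (numerator and denominator both scale linearly), so this sketch skips the normalization step.

```julia
# Hypothetical name; implements sum(w .* (x .- xbar).^2) / (sw - sum(w.^2)/sw) with loops.
function aweights_var(x::AbstractVector{<:Real}, w::AbstractVector{<:Real})
    length(x) == length(w) || throw(DimensionMismatch("x and w must have the same length"))
    sw = 0.0    # sum of weights
    sw2 = 0.0   # sum of squared weights
    swx = 0.0   # weighted sum of x
    for i in eachindex(x, w)
        sw += w[i]
        sw2 += abs2(w[i])
        swx += w[i] * x[i]
    end
    xbar = swx / sw
    ss = 0.0
    for i in eachindex(x, w)
        ss += w[i] * abs2(x[i] - xbar)
    end
    return ss / (sw - sw2 / sw)
end
```

With unit weights the denominator becomes `n - 1`, so the result matches the ordinary corrected variance.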

Another interesting reference (for analytical/precision weights) is SAS:
http://support.sas.com/documentation/cdl/en/proc/61895/HTML/default/viewer.htm#a002473731.htm#a000068934

nalimilan · 2017-04-26T15:07:35Z

src/weights.jl

+        bias::Real
+    end
+
+    immutable FrequencyWeights{S<:Integer, T<:Integer, V<:IntegerVector} <: AbstractWeights{S, T, V}


I don't think we should restrict frequency weights to integers. While this seems obvious at first, formulas generalize to non-integer weights, and other software generally accepts arbitrary values. This can be useful in some cases (e.g. sometimes you need to split observations between groups). Also this will be more practical if your variable is stored as a float for some reason.
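To make the point concrete, here is a sketch of the usual corrected variance for frequency weights, which divides by `sum(w) - 1`. Nothing in the arithmetic requires integer weights. The names `fweights_var` and `sample_var` are hypothetical, and the code uses current Julia 1.x syntax.

```julia
# Hypothetical name; corrected variance for frequency weights: divide by sum(w) - 1.
function fweights_var(x::AbstractVector{<:Real}, w::AbstractVector{<:Real})
    length(x) == length(w) || throw(DimensionMismatch("x and w must have the same length"))
    sw = 0.0
    swx = 0.0
    for i in eachindex(x, w)
        sw += w[i]
        swx += w[i] * x[i]
    end
    xbar = swx / sw
    ss = 0.0
    for i in eachindex(x, w)
        ss += w[i] * abs2(x[i] - xbar)
    end
    return ss / (sw - 1)
end

# Plain corrected sample variance, used only for comparison.
function sample_var(x::AbstractVector{<:Real})
    n = length(x)
    m = sum(x) / n
    return sum(abs2(xi - m) for xi in x) / (n - 1)
end
```

With integer weights `[2, 1, 3]` on `[1.0, 2.0, 5.0]`, the result agrees with `sample_var` applied to the expanded data `[1.0, 1.0, 2.0, 5.0, 5.0, 5.0]`; passing `[2.0, 1.0, 3.0]` or fractional weights goes through the same code path.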

rofinn · 2017-04-26T22:12:19Z

Sorry about the code churn in these commits folks. Turns out moving the corrected option to the stats functions required changing a lot of code. Here is a summary of the changes to help keep things organized.

Changes:

Reverts many test changes from the last commit
Changed a bunch of test cases to use corrected=false for now
This included a lot of little changes to method definition and some resulting formatting changes where necessary
Added a few test cases for the corrected variances (I just used the code @nalimilan posted above)
Added bias correction for ProbabilityWeights.
Deprecated WeightVec
Added a macro for easier creation of weight types.
Renamed weight creation function to fweights, pweights, etc.

ararslan · 2017-04-26T22:32:29Z

src/weights.jl

+end
+
+"""
+    `@weights name`


Backticks shouldn't be used for indented blocks like this, since Markdown code formatting is applied by virtue of being indented. So this will actually show up in the docstring and docs like

`@weights name`

ararslan · 2017-04-26T22:33:50Z

src/weights.jl

+    bias(w::ProbabilityWeights, [corrected])
+
+```math
+\fraction{n}{∑w × (n - 1)}


I think this should be \frac{n}{(n - 1) \sum w}

(Similarly \frac and LaTeX commands rather than Unicode elsewhere)

I'm fine with just using LaTeX commands, but the julia docs recommend using unicode characters.

Use Unicode characters rather than their LaTeX escape sequence, i.e. α = 1 rather than \\alpha = 1.

Oh, weird. Anyway, should still be \frac rather than \fraction, though I guess with the backslash escaped.

I don't think that remark applies to \sum, as it does more than just printing the ∑ character: it also uses the right size for the symbol. "characters" in this context mostly means "letters" AFAICT.

ararslan · 2017-04-26T22:34:46Z

test/cov.jl

@@ -88,62 +88,62 @@ end
 # weighted covariance

 if VERSION < v"0.5.0-dev+679"
-    @test cov(X, wv1)           ≈ S1w ./ sum(wv1)
-    @test cov(X, wv2; vardim=2) ≈ S2w ./ sum(wv2)
+    @test cov(X, wv1; corrected=false)           ≈ S1w ./ sum(wv1)


Having to change all of the tests suggests that this is a breaking change. To keep from breaking users' code, we should probably make corrected=false the default.

I'm fine with doing that for this PR, but it would be nice to switch it back to corrected=true in a later release?

It's tricky to change things like this, since we'd need some kind of deprecation to avoid suddenly and silently breaking others' code. I think @nalimilan has gone through such a process recently; perhaps he could be of help here.

As I just noted in a comment above, I think the best deprecation path is to deprecate WeightVec but keep it around so that existing code does not use the correction, and point to new weights types in the deprecation warning.

@nalimilan To clarify, you'd like var, std and cov to always return the uncorrected version for WeightVecs even if corrected=true?

The corrected argument wouldn't be supported at all for WeightVec, or at least an error would be raised if corrected=true is passed.

nalimilan · 2017-04-27T12:30:01Z

src/StatsBase.jl

+    AbstractWeights,   # the abstract type to represent any weight vector
+    AnalyticWeights,   # the default type for representing a analytic/precision/reliability weight vectors
+    FrequencyWeights,  # the type for representing a frequency weight vectors
+    ProbabilityWeights,# the type for representing a probability/sampling weight vectors


Space before #. Also align with the rest of the block.

nalimilan · 2017-04-27T12:31:51Z

src/moments.jl

-Base.varm!(R::AbstractArray, A::RealArray, wv::WeightVec, M::RealArray, dim::Int) =
-    scale!(_wsum_centralize!(R, @functorize(abs2), A, values(wv), M, dim, true), inv(sum(wv)))
+function Base.varm!(R::AbstractArray, A::RealArray, wv::AbstractWeights, M::RealArray, dim::Int; corrected=true)
+    scale!(


Put the first argument on this line, and the closing parenthesis on the same line as the second argument. Same below.

BTW, @functorize is no longer needed with Julia 0.5 and above.

Again, maybe a style guide would be helpful. The argument I've heard for spreading args (for long function calls) across multiple lines is that it minimizes the size of your diffs later on. I'm not saying that's the right approach, but given that there are multiple styles floating around and the julia style guide doesn't discuss them, maybe it's worth clarifying.

Everyone has a different preferred style; it's mostly about being consistent within a particular package. That said, I'd love for the style guide in the manual to be more specific about things.

This kind of thing is worth clarifying, but I think it's pretty clear from the existing code base of JuliaStats packages that this style is against the convention.

nalimilan · 2017-04-27T12:34:13Z

src/moments.jl

@@ -15,10 +15,12 @@ whereas it's `length(x)-1` in `Base.varm`. The impact is that this is not a
 weighted estimate of the population variance based on the sample; it's the weighted
 variance of the sample.
 """
-Base.varm(v::RealArray, wv::WeightVec, m::Real) = _moment2(v, wv, m)
+function Base.varm(v::RealArray, wv::AbstractWeights, m::Real; corrected=true)


FWIW, you can also add a line break while keeping the short form without function... end.

I'm aware that the short form supports line breaks, but I think that's harder for people to parse. The function ... end syntax with proper indentation provides a more consistent read through. I tend to have similar views on ternaries that cover multiple lines (particularly if they're nested). I'm fine with keeping to the short form if that's the desired style for this repo, but maybe we should add a style guide to make that clearer?

I don't really care. There's a short style guide in Julia's CONTRIBUTING.md, but I don't think it's mentioned.

nalimilan · 2017-04-27T12:35:01Z

src/moments.jl

    n = length(v)
    s = 0.0
    w = values(wv)
    for i = 1:n
        @inbounds z = v[i] - m
        @inbounds s += (z * z) * w[i]
    end
-    s / sum(wv)
+
+    result = s * bias(wv, corrected)


result isn't needed.

Agreed. Sorry, this is leftover from some debugging I was doing (removed the print statements, but forgot to reverse that).

nalimilan · 2017-04-27T12:36:35Z

src/moments.jl

    n = length(v)
    s = 0.0
    for i = 1:n
        @inbounds z = v[i] - m
        s += z * z
    end
-    s / n
+    s * bias(n, corrected)


Multiplying by a bias is backwards. Maybe call that function cov_correction, moment2_correction or something like that? It's also good to choose a specific name.

I agree that bias isn't really descriptive of what it represents anymore. Would something like bias_factor, scale_factor, bias_coef, scale_coef, etc. make more sense? The nice thing about keeping a relatively general name is that we can use the same function for var, std and cov.

var and std are in some way based on covariance, so cov should be fine.

nalimilan · 2017-04-27T13:03:11Z

src/weights.jl

+@weights AnalyticWeights
+
+"""
+    AnalyticWeights(vs, [wsum])


wsum=sum(vs), without brackets.

nalimilan · 2017-04-27T13:04:28Z

src/weights.jl

 end

 """
+    aweights(vs)
+
+Construct a `AnalyticWeights` type from a given array.


type -> vector.

Maybe add something like "See the documentation for that type for more information"?

nalimilan · 2017-04-27T13:06:21Z

src/weights.jl

+Construct a `AnalyticWeights` type from a given array.
+"""
+aweights(vs::RealVector) = AnalyticWeights(vs)
+aweights(vs::RealArray) = AnalyticWeights(vec(vs))


Is this method really needed? I don't think we perform this kind of conversion automatically in general.

nalimilan · 2017-04-27T13:11:18Z

src/weights.jl

+    n > 0 || throw(ArgumentError("cannot construct weights of length < 1"))
+    0 <= λ <= 1 || throw(ArgumentError("smoothing factor must be between 0 and 1"))
+    w0 = map(i -> λ * (1 - λ)^(1 - i), 1:n)
+    return weights(w0)


Shouldn't this be ExponentialWeights? But why do we need a specific weights type: don't they fall into one of the above families? Or maybe they should just use a generic weights type (without any correction)?

nalimilan · 2017-04-27T13:23:23Z

src/weights.jl

+function bias(w::AbstractWeights, corrected=true)
+    s = sum(w)
+    if corrected
+        return inv(s * (1 - sum(normalize(values(w), 1) .^ 2)))


normalize and .^2 are going to create copies of the weights, which should be avoided. Formula should be adapted so that no allocation happens at all. Maybe you'll need a loop for that.
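One possible allocation-free rewrite, sketched in current Julia 1.x syntax. It assumes non-negative weights, so that the 1-norm used by `normalize(values(w), 1)` equals `sum(w)`. Under that assumption `s * (1 - sum((w/s).^2))` simplifies to `s - sum(w.^2)/s`, which needs only two running sums. `analytic_correction` is a hypothetical name and takes a plain vector instead of a weights type.

```julia
# Hypothetical name; computes inv(s * (1 - sum((w ./ s).^2))) with a single loop,
# assuming all weights are non-negative.
function analytic_correction(w::AbstractVector{<:Real})
    s = 0.0    # sum of weights
    s2 = 0.0   # sum of squared weights
    for wi in w
        s += wi
        s2 += abs2(wi)
    end
    return inv(s - s2 / s)
end
```

For unit weights of length `n` this gives `1 / (n - 1)`, the usual Bessel factor.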

JeffreySarnoff · 2017-04-28T01:02:12Z

I just found WeightVec and the much cleaner reworking above. I will use the new abstraction and its support to hold the vector of weights used with rolling, windowed stats. Usually, these weights are normalized in some way and then used. It would be pleasant to be able to do the following (should this not be possible yet):

my_weights = new_abstraction_above( my_weights_as_values_in_sequence )

my_normalized_weights = normalize(my_weights)
my_normalized_weights = normalize(my_weights, p_norm = 2)

normalize!(my_weights)
normalize!(my_weights, p_norm = 1.618)

without reaching inside your type to access the sum

rofinn

This still isn't ready yet, but I thought I should push what I have and point out a couple spots where I could use a second opinion. @nalimilan and @ararslan Thanks for bearing with this PR, I know it's been a lot of work.

rofinn · 2017-04-28T04:37:02Z

src/StatsBase.jl

+    AnalyticWeights,    # the default type for representing a analytic/precision/reliability weight vectors
+    FrequencyWeights,   # the type for representing a frequency weight vectors
+    ProbabilityWeights, # the type for representing a probability/sampling weight vectors
+    ExponentialWeights, # the type for representing exponential weights


Forgot to delete ExponentialWeights from export.

rofinn · 2017-04-28T04:38:32Z

src/deprecates.jl

@@ -43,3 +37,28 @@ findat(a::AbstractArray, b::AbstractArray) = findat!(Array{Int}(size(b)), a, b)

 @deprecate df(obj::StatisticalModel) dof(obj)
 @deprecate df_residual(obj::StatisticalModel) dof_residual(obj)
+
+@weights WeightVec


Not sure if this WeightVec should go here or in weights.jl

If it's deprecated it should live in this file, just needs @deprecate as appropriate

I'm guessing using depwarn inside the WeightVec and weights methods is the correct approach here since we want to deprecate the functionality rather than the signature?

Yes. There's also @deprecate_binding to deprecate WeightVec itself.

Doesn't @deprecate_binding only work if you're just changing the name of the binding?

Yes, but it's needed to print a warning when people use WeightVec directly. You can deprecate it in favor of FrequencyWeights, even if that's not fully correct it's better than nothing.

rofinn · 2017-04-28T04:40:28Z

src/weights.jl

+(ie: [Bessel's correction](https://en.wikipedia.org/wiki/Bessel's_correction)),
+otherwise it will return ``\\frac{1}{n}``.
+"""
+cfactor(n::Integer, corrected=false) = 1 / (n - Int(corrected))


cfactor (for correction factor) seemed like the best choice for this function now.

I'm not a fan of this name either: "factor" really is secondary here, what matters is 1) that it's a correction, 2) that it applies to var/cov/std.

varcorrection, maybe biascorrection?

varcorrection seems a little long, maybe cvar?

That's even less descriptive than cfactor, IMO.

Alright, varcorrection it is then. I don't really care about the name since we're not exporting it.

Wait, what about varden (variance denominator)? I feel like the fact that it takes a corrected argument should cover that it may provide bias correction. I'd need to slightly change what's returned, but that might be more understandable.

I like varcorrection better. Or maybe bessel_correction?

rofinn · 2017-04-28T04:41:52Z

src/weights.jl

+``\\frac{1}{\sum w}``
+"""
+cfactor(wv::AbstractWeights, ::Type{Val{false}}) = 1 / sum(wv)
+cfactor(wv::AbstractWeights, ::Type{Val{true}}) =


I'd understand if folks are opposed to the Type{Val{true}}, but this means subtypes only need to implement the correction condition.

Val is really not what you want here: since the value of corrected will only be known at runtime, you're forcing the call to cfactor to go via dispatch at runtime, while it would have been inlined if you used ::Bool.

Again, I don't think we should provide a fallback correction for AbstractWeights: it's unlikely to be valid, which is worse than raising an error. Creating new weight types shouldn't be a frequent need; people can afford the cost of defining this simple method if it applies.

Oops, I had missed that you throw an error. But still I think it's better to repeat the uncorrected equation in each method than to use Val.
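A sketch of the `Bool`-based alternative, in current Julia 1.x syntax. `DemoFrequencyWeights` is a hypothetical stand-in, not the type from this PR; the name `varcorrection` follows the naming discussion in this review; the weighted correction shown is the frequency-weights one (`1 / (sum(w) - 1)`), chosen only because it is the simplest.

```julia
# Hypothetical stand-in for a weights type; not the StatsBase definition.
struct DemoFrequencyWeights
    values::Vector{Float64}
    sum::Float64
end
DemoFrequencyWeights(vs::Vector{Float64}) = DemoFrequencyWeights(vs, sum(vs))

# Each concrete type spells out both branches with a plain Bool argument,
# so no runtime dispatch on Val{true}/Val{false} is needed.
function varcorrection(w::DemoFrequencyWeights, corrected::Bool)
    if corrected
        return 1 / (w.sum - 1)
    else
        return 1 / w.sum
    end
end

# Unweighted counterpart, following the `1 / (n - Int(corrected))` line in the diff.
varcorrection(n::Integer, corrected::Bool) = 1 / (n - Int(corrected))
```

The cost is a repeated `1 / w.sum` line per weight type, which seems a fair price for keeping the call inlinable.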

rofinn · 2017-04-28T04:47:49Z

test/scalarstats.jl

-@test zscore(a)    ≈ zscore(a, mean(a), std(a))
-@test zscore(a, 1) ≈ zscore(a, mean(a,1), std(a,1))
-@test zscore(a, 2) ≈ zscore(a, mean(a,2), std(a,2))
+@test zscore(a)    ≈ zscore(a, mean(a), std(a; corrected=false))


The corrected=false is because mean_and_std now has corrected=false. In order to support all of the existing behaviour we may need to mix and match which methods have corrected=true.

I don't understand: why not adapt mean_and_std to default to corrected=true? To avoid breaking existing code, we could keep corrected=false for a while, but print a warning when corrected isn't specified so that we can make the switch later.

I'm clearly missing something here. How am I supposed to check if a keyword argument is set?

NOTE: We could add that warning if we chose to make corrected a regular (vs keyword) argument. We'd just need to have a function which doesn't take the corrected argument, so that it can print a warning and call the function with the default setting? I initially thought making corrected a keyword argument was the best way to stay consistent with base, but var, std and cov aren't even consistent with each other in base, so it might make more sense to do whatever is best for our use case.

std(
...
std(A::AbstractArray, region; corrected, mean) in Base at statistics.jl:263
julia> cov(
cov(x::AbstractArray{T,1} where T, corrected::Bool) in Base at statistics.jl:346
...

You can just have corrected=nothing to detect whether a keyword argument has been left to its default. Then you just need to replace its value with false if it's equal to nothing after printing the warning using Base.depwarn.
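A minimal sketch of that approach in current Julia 1.x syntax (where the type of `nothing` is `Nothing`; on the Julia versions this PR targets it was `Void`). `demo_var` is a hypothetical function and the warning text is made up for illustration. `Base.depwarn(msg, funcsym)` is the call mentioned above; whether the message is actually printed depends on the `--depwarn` setting.

```julia
# Hypothetical function name; only the handling of `corrected` matters here.
function demo_var(x::AbstractVector{<:Real}; corrected::Union{Bool,Nothing}=nothing)
    if corrected === nothing
        # The caller did not pass `corrected`: warn, then fall back to the old default.
        Base.depwarn("`corrected` will default to `true` in a future release; " *
                     "pass `corrected=false` explicitly to keep the current behaviour.",
                     :demo_var)
        corrected = false
    end
    n = length(x)
    m = sum(x) / n
    return sum(abs2(xi - m) for xi in x) / (n - Int(corrected))
end
```

Callers that pass `corrected=true` or `corrected=false` never hit the warning, so only code relying on the default is nudged to make a choice.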

nalimilan · 2017-04-28T08:32:52Z

src/StatsBase.jl

-    wmedian,     # weighted median
-    wquantile,   # weighted quantile
+    AbstractWeights,    # the abstract type to represent any weight vector
+    AnalyticWeights,    # the default type for representing a analytic/precision/reliability weight vectors


Singular "vector". "an analytic". Since these lines are wider than 92 chars, make them a bit shorter e.g. by removing "the", "default", and "for representing"/"to represent".

nalimilan · 2017-04-28T08:34:38Z

src/StatsBase.jl

+    FrequencyWeights,   # the type for representing a frequency weight vectors
+    ProbabilityWeights, # the type for representing a probability/sampling weight vectors
+    ExponentialWeights, # the type for representing exponential weights
+    weights,            # alias for aweights


Shouldn't this be removed?

Don't we still want to export it (even if it's deprecated) to avoid breaking functionality?

@deprecate automatically exports for this reason.

nalimilan · 2017-04-28T08:37:22Z

src/cov.jl

-    scattermat(x::DenseMatrix, wv::WeightVec, vardim::Int=1) =
-        scattermatm(x, Base.mean(x, wv, vardim), wv, vardim)
+## weighted cov
+function Base.covm(x::DenseMatrix, mean, wv::AbstractWeights, vardim::Int=1, corrected::Bool=false)


This line is wider than 92 chars, either remove function or split arguments on multiple lines. Since this file seems to be using the short form for one-line method definitions, it would be more consistent to use it at least here.

nalimilan · 2017-04-28T08:38:23Z

src/cov.jl

@@ -58,66 +58,41 @@ cov


 """
-    mean_and_cov(x, [wv::WeightVec]; vardim=1) -> (mean, cov)
+    mean_and_cov(x, [wv::AbstractWeights]; vardim=1) -> (mean, cov)


Need to document corrected. Same for mean_and_std and mean_and_var, var, varm, std, stdm, cov, covm.

Yeah, I was holding off updating these docstrings until I was more comfortable with the behaviour. Since folks don't seem to have an issue with the corrected=false keyword for all these functions, I'll update that.

nalimilan · 2017-04-28T08:44:38Z

src/hist.jl

@@ -249,14 +247,15 @@ function append!{T,N}(h::AbstractHistogram{T,N}, vs::NTuple{N,AbstractVector})
    end
    h
 end
+


This is unrelated, and it's not justified since there's no line break before the last append! definition either.

Yeah, that was a typo from fixing conflicts when I rebased with master.

nalimilan · 2017-04-28T09:31:58Z

src/weights.jl

+    eweights(n, [λ])
+
+Constructs an `AnalyticWeights` vector with a desired length `n` and smoothing factor `λ`,
+where each element is set to ``λ * (1 - λ)^(1 - i)``.


"element in position ``i`` "

nalimilan · 2017-04-28T09:33:11Z

src/weights.jl


+# Arguments


As noted in the guidelines, the "arguments" section isn't recommended, except when there are many arguments. Here, the description of n is redundant with what is said above. You can just keep the last sentence about λ in the main description or in a separate paragraph.

nalimilan · 2017-04-28T09:34:32Z

src/weights.jl

-Base.getindex(wv::WeightVec, i) = getindex(wv.values, i)
-Base.size(wv::WeightVec) = size(wv.values)
+"""
+    eweights(n, [λ])


λ=0.99, and drop the brackets. Or maybe drop the default value, as it seems quite arbitrary and unlikely to be exactly what people need?

Yeah, I guess the 0.99 is pretty arbitrary.

nalimilan · 2017-04-28T09:35:43Z

src/weights.jl

+    n > 0 || throw(ArgumentError("cannot construct weights of length < 1"))
+    0 <= λ <= 1 || throw(ArgumentError("smoothing factor must be between 0 and 1"))
+    w0 = map(i -> λ * (1 - λ)^(1 - i), 1:n)
+    aweights(w0)


Are these really analytical weights? The docs should mention this if that's the case. Can you explain what exponential weights are used for?

I often use exponential weights to describe the relative importance of observations in temporal data. For example, I often want to get summary statistics about some lookback data, but putting greater value on more recent observations (as they better reflect the "current" state of the system). Since these types of weights definitely aren't probability or frequency weights, I got the impression that analytic or reliability weights best fit this. I'm fine with removing this as a niche use case, but my understanding is that exponential weights are pretty common when working with temporal data.

Hmm, these aren't strictly exponential weights. Maybe this means we should keep a generic Weights type... Feel free to reintroduce it (to replace WeightVec), though I'm not sure what's the best name for them.
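For readers following along, here is a sketch of what the formula in the diff, `λ * (1 - λ)^(1 - i)`, produces. It uses current Julia 1.x syntax; `demo_eweights` is a hypothetical name and returns a plain vector rather than any weights type. As an assumption of this sketch only, `λ` is restricted to the open interval (0, 1): the diff allows the endpoints, but `λ = 0` gives all-zero weights and `λ = 1` raises zero to a negative power.

```julia
# Hypothetical helper; returns a plain vector of floats, not a weights type.
function demo_eweights(n::Integer, λ::Real)
    n > 0 || throw(ArgumentError("cannot construct weights of length < 1"))
    # Assumption of this sketch: strict bounds, unlike the `0 <= λ <= 1` check in the diff.
    0 < λ < 1 || throw(ArgumentError("smoothing factor must be strictly between 0 and 1"))
    λf = float(λ)
    # Element in position i is λ * (1 - λ)^(1 - i); each is 1/(1 - λ) times the previous.
    return [λf * (1 - λf)^(1 - i) for i in 1:n]
end
```

The weights grow geometrically with the index, which is what puts more emphasis on the most recent observations when the data are ordered oldest to newest.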

nalimilan · 2017-04-28T09:38:06Z

test/moments.jl

+# AnalyticWeights
+@test var(x, aweights(ones(10)); corrected=true) ≈ var(x)
+
+w = aweights(rand(10))


Rather than applying the formulas, this should use fixed values and hardcode the result. Then I can check that it's correct in other software if you like.

Then I can check that it's correct in other software if you like

That's a great idea, I'd really appreciate that! What kind of software were you thinking of testing against? This PR is probably big enough as it is, but I was wondering if it would make sense to automate testing against R using RCall.

I was thinking of the survey and Hmisc R packages, of Stata and maybe SAS. I wouldn't worry about using RCall, we can just hardcode the results for a few cases.

nalimilan · 2017-04-28T09:40:08Z

@JeffreySarnoff AbstractWeights should probably implement setindex! so that they can be modified if needed. @rofinn just added support for getindex, you could open another PR for setindex!.
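One thing a `setindex!` method would have to handle is the cached sum. A rough sketch in current Julia 1.x syntax, using a hypothetical `DemoWeights` type that is not the definition from this PR:

```julia
# Hypothetical mutable weights type that caches its sum; not the StatsBase definition.
mutable struct DemoWeights <: AbstractVector{Float64}
    values::Vector{Float64}
    sum::Float64
end
DemoWeights(vs::Vector{Float64}) = DemoWeights(vs, sum(vs))

Base.size(w::DemoWeights) = size(w.values)
Base.getindex(w::DemoWeights, i::Int) = w.values[i]
Base.sum(w::DemoWeights) = w.sum

# Keep the cached sum in step with the stored values on every write.
function Base.setindex!(w::DemoWeights, v::Real, i::Int)
    w.sum += v - w.values[i]
    w.values[i] = v
    return w
end
```

Updating the sum incrementally keeps writes O(1), at the price of possible floating-point drift after many updates; recomputing the sum on each write would be the simpler, slower alternative.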

ararslan · 2017-04-28T19:29:19Z

I just want to say thanks so much for your hard work and perseverance here, @rofinn. You're doing fantastic work and I'm excited to see this get merged.

rofinn · 2017-04-29T05:58:04Z

Is it possible to turn deprecation warnings on and off during testing in Julia? It would be nice to check that I haven't broken any existing behaviour without producing a bunch of deprecation warnings during testing.

nalimilan · 2017-04-29T20:15:29Z

> Is it possible to turn deprecation warnings on and off during testing in Julia? It would be nice to check that I haven't broken any existing behaviour without producing a bunch of deprecation warnings during testing.

julia --depwarn=no is what you need.

nalimilan · 2017-04-29T20:18:45Z

src/cov.jl

@@ -44,80 +37,73 @@ that the data are centered and hence there's no need to subtract the mean.
 When `vardim = 1`, the variables are considered columns with observations in rows;
 when `vardim = 2`, variables are in rows with observations in columns.
 """
-function scattermat end


This line was actually correct since the docstring documents several methods, not just the following one. Same below.

Right, but is that line even necessary (calling help on the method name still works the same)? Also, the other line is just cov vs function cov end, what would the preference be to maintain consistency?

cov refers to an existing object; function cov end creates a function with 0 methods. Typically the latter is only used for forward-declaring functions in order to apply docstrings. I think it's usually preferable to use the former when the function has already been defined.

In general it doesn't really make a big difference whether you attach the docstring to a method or to the function, but it matters in some cases: the docstring for the function will (or should) always appear first, and when a link to the source is provided (in the online manual) it wouldn't be really correct to point to a particular method.

Ah, okay, to clarify, the preference moving forward is that we want the docstring attached to function <name> end with all other methods below that?

No, just keep the existing organization. Always better to avoid making changes unrelated to the PR, else it's very hard to review. For example, moving scattermat_zm makes it very hard to see where its code changed.

My point was that the docstring should be attached to the method whose signature matches that shown in the first line. If that method doesn't exist, then attach the docstring to the function itself.

> No, just keep the existing organization. Always better to avoid making changes unrelated to the PR, else it's very hard to review. For example, moving scattermat_zm makes it very hard to see where its code changed.

Fair point. I do tend to get carried away when I'm editing code.
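To make the `function <name> end` pattern from this thread concrete, here is a minimal sketch. `rescale` is a hypothetical function invented for the illustration and is not part of StatsBase.

```julia
"""
    rescale(x, k)

Multiply `x` by `k`. Hypothetical function, used only to illustrate where a
docstring that covers several methods can live.
"""
function rescale end   # forward declaration: creates a function with zero methods

# The methods themselves follow and share the docstring attached above.
rescale(x::Number, k::Number) = x * k
rescale(x::AbstractArray, k::Number) = x .* k

n_methods = length(methods(rescale))   # counts the two methods defined above
```

Attaching the docstring to the zero-method declaration keeps it tied to the function as a whole rather than to one particular signature.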

nalimilan · 2017-04-29T20:20:36Z

src/cov.jl

-        mean == nothing ? scattermat_zm(x .- Base.mean(x, wv, vardim), wv, vardim) :
-        scattermat_zm(x .- mean, wv, vardim)
-    end
+* AnalyticWeights: ``\\frac{1}{\sum w - \sum {w^2} / \sum{w}^2}``


Formula is incorrect AFAICT.

I guess it should be \\frac{1}{\sum w - \sum {w^2} / \sum w}, which wouldn't be a problem if w was prenormalized, but still.

julia> cvar1(w) = sum(w) * (1 - sum(normalize(w, 1) .^ 2))
cvar1 (generic function with 1 method)

julia> cvar2(w) = sum(w) - (sum(w .^2) / sum(w))
cvar2 (generic function with 1 method)

julia> w = rand(10)
10-element Array{Float64,1}:
 0.608948
 0.980859
 0.656496
 0.994447
 0.912615
 0.654867
 0.0928164
 0.308836
 0.782151
 0.952136

julia> cvar1(w)
6.132433099237098

julia> cvar2(w)
6.132433099237098
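The agreement between `cvar1` and `cvar2` is not a numerical coincidence. Writing $W = \sum_i w_i$ and assuming non-negative weights with $W > 0$ (so that the 1-norm used by `normalize(w, 1)` equals $W$), the identity follows directly:

```latex
\begin{aligned}
\texttt{cvar1}(w)
  &= W \left( 1 - \sum_i \left( \frac{w_i}{W} \right)^2 \right) \\
  &= W - W \cdot \frac{\sum_i w_i^2}{W^2} \\
  &= W - \frac{\sum_i w_i^2}{W}
   = \texttt{cvar2}(w).
\end{aligned}
```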

rofinn · 2017-05-01T23:15:11Z

Alright, apart from a couple of remaining points about how to use @deprecate_binding and documenting groups of methods with function <name> end, I think that's all the code changes. I just need to update the documentation now, if folks are mostly alright with this? @nalimilan I've updated the test/moments.jl file to use @testset and hardcoded values for the corrected variances, but I'd really appreciate it if you could double check that those values make sense.

nalimilan

Thanks!

I have checked other software with the data you used in the tests, and here are the (positive) results:

AnalyticWeights: confirmed 0.0694434 with R's Hmisc::wtd.var and norm=TRUE.
FrequencyWeights: confirmed 0.054666 with R's Hmisc::wtd.var and norm=FALSE, with Stata's summarize x [iweight w] (since fweight does not accept non-integer weights), with SAS's proc means ... vardef=wdf, and with custom computation code.
ProbabilityWeights: confirmed 0.06628969 with R's survey::svyvar, with Stata's svy: mean x; estat sd and mean x [aweight=w] (which confusingly is known to give the correct estimation for population variance with... probability weights).

I still have many comments, but the core features are OK.
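As a rough illustration of how the three corrected variances differ, here is a sketch in plain Julia that does not use the PR's types. All function names are hypothetical. The analytic and probability factors follow the formulas discussed in this PR, and the frequency factor is assumed to be the usual `1 / (sum(w) - 1)`.

```julia
using Statistics

# Bias-correction factors (hypothetical names, see the assumptions above).
analytic_correction(w)    = 1 / (sum(w) - sum(abs2, w) / sum(w))
frequency_correction(w)   = 1 / (sum(w) - 1)
probability_correction(w) = length(w) / ((length(w) - 1) * sum(w))

# Corrected weighted variance: weighted sum of squared deviations from the
# weighted mean, multiplied by the chosen correction factor.
function weighted_var(x, w, correction)
    m = sum(w .* x) / sum(w)
    return sum(w .* (x .- m) .^ 2) * correction(w)
end

x = [1.0, 2.0, 4.0, 7.0, 11.0]
w = ones(length(x))

# With unit weights, all three corrections reduce to 1 / (n - 1).
results = [weighted_var(x, w, c) for c in
           (analytic_correction, frequency_correction, probability_correction)]
```

With unit weights every entry of `results` should match `var(x)`, which is the same sanity check as the `aweights(ones(10))` test in test/moments.jl. The three factors only diverge once the weights are non-uniform.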

nalimilan · 2017-05-01T12:03:05Z

test/weights.jl

-@test wquantile(data[1], weights(w), 0.5)   ≈  answer
+@test quantile(data[1], fweights(w), 0.5)    ≈  answer
+@test wquantile(data[1], fweights(w), [0.5]) ≈ [answer]
+@test wquantile(data[1], fweights(w), 0.5)   ≈  answer
 @test wquantile(data[1], w, [0.5])          ≈ [answer]


Preserve alignment.

nalimilan · 2017-05-01T12:04:24Z

test/weights.jl

-@test isa(weights([1, 2, 3]), WeightVec{Int})
-@test isa(weights([1., 2., 3.]), WeightVec{Float64})
-@test isa(weights([1 2 3; 4 5 6]), WeightVec{Int})
+@test isa(fweights([1, 2, 3]), AbstractWeights{Int})


Most of the tests in this file should be put inside a loop and run for all types of weights (AbstractWeights should be replaced with the specific type).

nalimilan · 2017-05-01T12:11:33Z

test/deprecates.jl

+
+@testset "StatsBase.Deprecates" begin
+
+@testset "Deprecates WeightVec and weights" begin


In general we don't test deprecated features, we just test their replacement. So just remove this file since you test the new types with a similar code below.

It seems like a good idea to test deprecated features as it ensures the deprecations don't break functionality. Adding these tests actually helped me catch a few issues.

Unfortunately, doing this will print lots of deprecation warnings, which will make the output hard to read, and deprecations from Base that need to be fixed won't be easy to spot. As long as you've tested it once locally, it's OK.

Would it make sense to keep the ENV["TEST_DEPRECATES"] behaviour around to make testing locally easier?

I'd rather not. Deprecated code is supposed to be removed relatively soon anyway.

Okay, I'll add an item to my checklist to remove it before dropping the "[WIP]" from this PR, but please just ignore that until then.

nalimilan · 2017-05-01T12:13:13Z

src/weights.jl



 """
    wquantile(v, w, p)

 Compute the `p`th quantile(s) of `v` with weights `w`, given as either a vector
-or a `WeightVec`.
+or a `AbstractWeights`.


"An AbstractWeights object/vector" (since vec is no longer in the name).

nalimilan · 2017-05-01T12:17:05Z

src/deprecates.jl

+
+Construct a `WeightVec` with weight values `vs` and sum of weights `wsum`.
+"""
+function WeightVec{S<:Real, V<:RealVector}(vs::V, s::S=sum(vs))


All code for WeightVec should be moved to deprecated.jl. That way it's easy to remove in the next version.

I'm confused, this is in deprecates.jl.

Sorry, I'm not sure how I missed that...

nalimilan · 2017-05-02T13:56:24Z

test/moments.jl

+    @test moment(x, 4, 4.0) ≈ sum((x .- 4).^4) / length(x)
+    @test moment(x, 5, 4.0) ≈ sum((x .- 4).^5) / length(x)
+
+    w = fweights([1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0])


Can you test other weight types too?

nalimilan · 2017-05-02T13:58:07Z

test/deprecates.jl

+    end
+end
+
+@testset "Covariance" begin


Should check corrected=true too, and for all weights types. That will be easier if you just hardcode the expected values, or call scattermat instead of copying the full formulas here.

nalimilan · 2017-05-02T13:59:20Z

test/runtests.jl

@@ -1,5 +1,13 @@
 using StatsBase

+opts = Base.JLOptions()


This shouldn't be needed either.

nalimilan · 2017-05-02T14:01:24Z

test/weights.jl

+    @testset "eweights" begin
+        λ = 0.2
+        wv = eweights(4, λ)
+        @test round(values(wv), 4) == [0.2, 0.25, 0.3125, 0.3906]


Copy the full precision numbers rather than rounding. That could help spot precision issues if somebody tweaks the formula.

Should also check the type of wv.

nalimilan · 2017-05-02T14:05:20Z

test/weights.jl


-## the sum and mean syntax
+    @testset "Sum" begin
+        @test sum([1.0, 2.0, 3.0], fweights([1.0, 0.5, 0.5])) ≈ 3.5


Can you put this inside a loop and test all weights types? Same below (and everywhere this applies).

I'll note that this causes tests to take significantly longer to run (particularly test/weights.jl). Would it make sense to reduce the median and quantile testsets slightly?

Unless some tests are really redundant, I'd rather have too many tests than too few of them.
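The loop being asked for could look roughly like the sketch below. To keep it self-contained it does not load StatsBase: `ToyWeights`, the `toy_*` constructors and `weighted_sum` are hypothetical stand-ins for the PR's weight types and the weighted `sum` method.

```julia
using Test

# Hypothetical stand-in for the weight types: a tagged wrapper around a vector.
struct ToyWeights{Kind}
    values::Vector{Float64}
end

toy_aweights(v) = ToyWeights{:analytic}(float.(v))
toy_fweights(v) = ToyWeights{:frequency}(float.(v))
toy_pweights(v) = ToyWeights{:probability}(float.(v))

# Stand-in for the weighted sum; it does not depend on the kind of weights.
weighted_sum(x, w::ToyWeights) = sum(x .* w.values)

# One testset per constructor, so a failure reports which weight type broke.
@testset "Sum ($f)" for f in (toy_aweights, toy_fweights, toy_pweights)
    @test weighted_sum([1.0, 2.0, 3.0], f([1.0, 0.5, 0.5])) ≈ 3.5
end
```

Putting the constructor name in the testset title keeps the extra runtime from being wasted: when one weight type regresses, the report points straight at it.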

nalimilan · 2017-05-03T13:03:03Z

src/weights.jl

+"""
+    varcorrection(w::ProbabilityWeights, corrected=false)
+
+``\\frac{n}{(n - 1) \sum w}`` where `n = length(w)`


While we're at it, we should use the more correct n = count(!iszero, w). The code and the other docstring will need to be adjusted. To test this, you could simply add an element with a zero weight to the data: the computed variance must remain the same.

As a future optimization, we could store the number of non-zero weights at construction, when the sum is computed, but that's not a priority for this PR.
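The suggested test can be sketched as follows, again in plain Julia with hypothetical function names rather than the PR's actual `varcorrection` method. The correction uses `n = count(!iszero, w)`, so an appended observation with zero weight should leave the variance unchanged.

```julia
# Probability-weight correction with n taken as the number of non-zero weights.
function pweight_correction(w)
    n = count(!iszero, w)
    return n / ((n - 1) * sum(w))
end

# Corrected weighted variance around the weighted mean.
function pweight_var(x, w)
    m = sum(w .* x) / sum(w)
    return sum(w .* (x .- m) .^ 2) * pweight_correction(w)
end

x = [2.0, 3.0, 5.0, 8.0]
w = [0.5, 1.5, 1.0, 2.0]

v_base   = pweight_var(x, w)
v_padded = pweight_var([x; 100.0], [w; 0.0])   # extra observation with zero weight
```

Here `v_base` and `v_padded` agree up to floating-point error. With `n = length(w)` they would not, because the padded vector would be treated as having five observations instead of four.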

rofinn · 2017-05-03T18:35:44Z

Alright, I think that addresses all the comments from the last review iteration. Please ignore the test/deprecates.jl and the corresponding changes in runtests.jl as I'm going to remove that once this PR is no longer a WIP.

nalimilan · 2017-05-06T22:26:45Z

Yeah, that's painful, sorry about that. Unfortunately, the new .rst docstrings need some more adjustments: some single backticks need to be changed to double (around Julia expressions), lists should use * rather than -, and the indent should be four spaces. At least this is what I understand looking at other files and at the RST syntax reference.

nalimilan · 2017-05-07T08:29:12Z

Thanks!

If you're still willing to work on weighting, it would be great to use the new types in GLM.jl, which interprets weights as frequency weights and should therefore only accept FrequencyWeights. Support for other types of weights shouldn't be hard to add, at least for probability weights.

rofinn · 2017-05-07T15:36:23Z

Awesome, thanks @nalimilan and @ararslan for putting so much work into reviewing this! Hopefully, my future PRs will be a bit smoother.

rofinn · 2017-05-07T15:38:47Z

@nalimilan I'll take a look at GLM.jl, but I might not have time to get that PR ready till later this week.

tkelman · 2017-05-18T19:15:31Z

src/moments.jl

+Base.std(v::RealArray, w::AbstractWeights; mean=nothing, corrected::DepBool=nothing) =
+    sqrt.(var(v, w; mean=mean, corrected=depcheck(:std, corrected)))
+
+Base.stdm(v::RealArray, m::RealArray, dim::Int; corrected::DepBool=nothing) =


guess this signature was here before, but it's a bit of piracy, isn't it?

Yeah, that's been there for a while, but I didn't notice when I updated it. We had a discussion on #248 about it and I think the plan was to move this method to Base Julia and add a version check, but I haven't gotten around to doing it yet.

matthieugomez · 2017-05-26T14:11:58Z

Great pull request! Thanks everyone for doing it. Just to be sure, are we sure about the plural form Weights (used in R) rather than the singular form Weight (used in SAS, Stata, Python)?

ararslan · 2017-05-26T17:56:10Z

Yes, because a vector of weights contains more than one weight. 😉

rofinn mentioned this pull request Apr 25, 2017

Switching over to using @testsets #251

Open

nalimilan reviewed Apr 25, 2017

ararslan reviewed Apr 25, 2017

nalimilan reviewed Apr 26, 2017

ararslan reviewed Apr 26, 2017

nalimilan reviewed Apr 27, 2017

rofinn force-pushed the weightvec-types branch from 080fcb0 to 6ab08f6 Compare April 28, 2017 04:33

rofinn commented Apr 28, 2017

nalimilan reviewed Apr 28, 2017

nalimilan mentioned this pull request Apr 28, 2017

wtd.var: normwt and documentation harrelfe/Hmisc#22

Open

rofinn force-pushed the weightvec-types branch from 6ab08f6 to 43a7c6f Compare April 29, 2017 05:50

nalimilan reviewed Apr 29, 2017

nalimilan reviewed May 2, 2017

nalimilan reviewed May 3, 2017

rofinn added 17 commits May 6, 2017 16:47

Added testing of all weights to test/cov.jl (1f01bc0)
Removed unnecessary 0 mean condition from var (bdec9e1)
Reverted changes to skewness and kurtosis. (c0f6488)
Updated docs to refer to AbstractWeighs vs WeightVec and included a brief description of different weight types. (a8624cd)
More random fixes. Mostly to docstrings. (745c419)
More doc fixes. (8d85af7)
Removed fweights from tests in favour of weights (to reduces PR size) or `f` where appropriate. (05a3cd7)
Moved description of different weight types in an Implementations sections and decided to only reference the `var`, `std` and `cov` docstrings. (2370595)
Moved Weights description later in the docs. (3f84e71)
Removed two argument example from weightvec docs. (8bcf448)
Moved description of weight vector benefits to the top of the file. (dedb155)
Not sure how much this helped, but tried to minimize the amount of `@testset` changes (`cov` was kind of a lost cause). (281654d)
Added comment about unsupported bias correction for the Weights type in `var`, `std` and `cov` docstrings. (5bdf58b)
Removed deprecation tests and corresponding hacks. (b991667)
Removed more deprecation test hacks and convert wv -> w in the appropriate docs. (926678e)
Missing depcheck on a stdm call. (85ace2a)
Updated the rst scalarstats and cov docs with the updated docstrings. (a0a2ad6)

rofinn force-pushed the weightvec-types branch from ea88b21 to a0a2ad6 Compare May 6, 2017 21:54

More rst docstring updates. (6087b7f)

nalimilan merged commit 6a78bce into JuliaStats:master May 7, 2017

rofinn deleted the weightvec-types branch May 7, 2017 15:36

tkelman reviewed May 18, 2017

nalimilan mentioned this pull request Oct 28, 2019

Import StatsBase into Statistics JuliaStats/Statistics.jl#2

Draft

21 tasks

nalimilan mentioned this pull request Jan 18, 2022

Weighted sem #754

Merged

