Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding support for different weight vector types #250

Merged
merged 50 commits into from
May 7, 2017

Conversation

rofinn
Copy link
Member

@rofinn rofinn commented Apr 25, 2017

This PR is intended to address concerns raised in issues #53 and #249

Requirements:

  • WeightVec -> AbstractWeights
  • Weights (aka default reliability weights)
  • FrequencyWeights (limited to vectors of integers)
  • ProbabilityWeights
  • Allow all weights to take a corrected argument which defaults to true.
  • Add an exponential function which creates a set of exponential Weights.
  • Update existing test cases to work with the appropriate types and test against both corrected and uncorrected versions of the weights
  • Deprecate API changes
  • Test deprecation changes to ensure we didn't actively break any functionality?
  • Check that we haven't reduced test coverage.
  • Remove deprecation tests.
  • Update docs

Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this! I've made a few comments, without having reviewed everything yet.

src/weights.jl Outdated
bias::Int
end

immutable ProbabilityWeights{S<:Real, T<:Real, V<:RealVector} <: AbstractWeights{S, T, V}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It turns out "sampling weights" is a much more common expression than "probability weights". Maybe we should use the former term?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have very strong feelings either way, but sampling just seemed more ambiguous to me.

Copy link
Member

@ararslan ararslan Apr 25, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've heard both. In my line of work I don't come across these kinds of weights much so I can't speak to the prominence of either term overall, but I think "probability weights" is what I learned to call them in grad school.

src/weights.jl Outdated
If omitted, `wsum` is computed.
"""
function WeightVec{S<:Real, V<:RealVector}(vs::V, s::S=sum(vs))
return WeightVec{S, eltype(vs), V}(vs, s)
function Weights{S<:Real, V<:RealVector}(vs::V, s::S=sum(vs); corrected::Bool=true)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't it make more sense to decide whether to apply the correction when calling std? For most functions (like mean), the correction does not apply and the results will be the same whatever the kind of weights you have. It seems a bit weird to require people to decide about correction in advance.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I guess that would fit with base better. I was initially thinking that it might make sense to precompute the value, but we only use it in a few places so it probably wouldn't cost us much.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've made this change in my next commit, but it ended up being a lot of little changes.

src/weights.jl Outdated
"""
weights(vs::RealVector) = WeightVec(vs)
weights(vs::RealArray) = WeightVec(vec(vs))
frequency(vs)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

frequency and exponential are really too broad terms to be claimed for weights. We should find more specific names, like fweights and expweights. One (smallà advantage is that it would mirror Stata (with also pweights and aweights).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tend to prefer full word function names. I'd be fine with keeping frequency and exponential, but only exporting the aliases of pweights, eweights, fweights. That way folks can choose to use the full word versions, but we won't be cluttering up the global namespace on them.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should have two ways to access the same function. IMO we should choose one interface and stick to it. I'm with Milan on this one; I don't think we should use frequency and exponential for these, as those names are too general. freqweights and expweights seem reasonable enough names to me.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like scipy also uses the aweights and fweights names https://docs.scipy.org/doc/numpy/reference/generated/numpy.cov.html

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, why not just use the type constructors rather than separate functions?

Copy link
Member Author

@rofinn rofinn Apr 25, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Iono, cause FrequencyWeights seems long...?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I suppose so. Perhaps it's best then to follow suit with Stata and SciPy and use aweights and fweights.

src/weights.jl Outdated
end

# TODO: constructor for ProbabilityWeights, but I'm not familiar with how bias correction works with these
# types of weights or if bias correction even makes sense.
# https://en.wikipedia.org/wiki/Inverse_probability_weighting
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See http://www.stata.com/support/faqs/statistics/weights-and-summary-statistics/ for a summary regarding analytical weights (aweights) and probability weights (pweights). Wikipedia has interesting data on frequency weights and analytical/precision weights: https://en.wikipedia.org/wiki/Weighted_arithmetic_mean#Weighted_sample_variance

I'll try to find more references.

Copy link
Member Author

@rofinn rofinn Apr 25, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As it stands, my understanding is that these weights are intended to provide inferential statistics about the population (vs sample)? Also, I guess analytical/precision weights = reliability weights on the wiki page?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As it stands, my understanding is that these weights are intended to provide inferential statistics about the population (vs sample)?

Yes, probability/sampling weights reflect the fact that some population groups were over/under-sampled, so individuals do not all represent the same number of people in the target population.

Also, I guess analytical/precision weights = reliability weights on the wiki page?

Right. The most explicit name is "inverse-variance weights", and this is what the link given on Wikipedia mentions.

src/weights.jl Outdated
immutable WeightVec{S<:Real, T<:Real, V<:AbstractVector{T}} <: AbstractVector{T}
abstract AbstractWeights{S<:Real, T<:Real, V<:AbstractVector{T}} <: AbstractVector{T}

immutable Weights{S<:Real, T<:Real, V<:AbstractVector{T}} <: AbstractWeights{S, T, V}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure we still need this. Do we expect people to need kinds of weights which cannot have their own type?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well I get the impression that the kinds of weights should be kept relatively small and folks might still want to dispatch based on the storage of those weights (e.g., NullableArray vs Vector).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I meant is that the Weights family might not be needed since all use cases should fall into a well-defined weights type.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, yeah, I've renamed Weights to AnalyticWeights in the next commit to be more specific. I've also added an aweights method and made weights an alias to aweights, but maybe I should just deprecate the weights method in favour of making folks think more about what type of weights they want to use?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, let's deprecate it, as that's the best way to make people discover the new weight types and choose the appropriate one for them. If it turns out some people need other weight types, they will let us know.

@deprecate _moment3(v::RealArray, m::Real, wv::AbstractWeights) _moment3(v, wv, m)
@deprecate _moment4(v::RealArray, m::Real, wv::AbstractWeights) _moment4(v, wv, m)
@deprecate _momentk(v::RealArray, k::Int, m::Real, wv::AbstractWeights) _momentk(v, k, wv, m)
@deprecate moment(v::RealArray, k::Int, m::Real, wv::AbstractWeights) moment(v, k, wv, m)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If WeightVec is going away it should have a deprecation, and if it isn't then these deprecations should maintain the existing signatures. (They've probably been deprecated long enough that we can just remove them, but that's an aside.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I've deprecated WeightVec and I was contemplating deprecating weights in favour of aweights to be more specific.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally, the deprecation warning should mention all weight types. I'm also not sure we can call aweights as a replacement, since the current cov does not apply any correction, the deprecation should do the same: maybe we need to keep the WeightsVec type (deprecating) to avoid breaking packages?

Copy link
Member Author

@rofinn rofinn Apr 27, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That seems reasonable and would also help notify people about the other changes occurring with weights without introducing breaking changes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do folks want to handle internal calls to weights(x) if it (and WeightVec) are deprecated?

  1. We leave the code as is and accept that functions like wsample, wmedian, etc will throw deprecation warnings when they need to creating a WeightVec from an array.
  2. Change the default behaviour to creating AnalyticWeights which may break existing behaviour for some people.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK the kind of weights doesn't matter for wsample or wmedian, it only makes a difference for var and similar functions. IOW the type of weights created for the former functions is an implementation detail, we can choose any type we want (though I would choose frequency weights, as they are simpler to understand).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I guess wsample and wmedian weren't good examples. Alright, I'll switch that over to avoid the warnings. Thanks.

src/weights.jl Outdated
@@ -2,43 +2,112 @@
###### Weight vector #####

if VERSION < v"0.6.0-dev.2123"
immutable WeightVec{S<:Real, T<:Real, V<:RealVector} <: RealVector{T}
abstract AbstractWeights{S<:Real, T<:Real, V<:RealVector} <: RealVector{T}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be @compat abstract type ... end. Also why is this being conditioned on the version? Looks like that predates this PR, but I don't see why we would need to condition for these definitions.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AbstractWeights{S<:Real, T<:Real, V<:AbstractVector{T}} <: AbstractVector{T} isn't valid syntax cause 0.5 doesn't support triangular dispatch.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh dang you're right. 😞

src/weights.jl Outdated
# Arguments
* `n::Integer`: the desired length of the `Weights`
* `λ::Real`: is a smoothing factor or rate paremeter between 0 .. 1.
As this value approaches 0 the resulting weights will be almost equal(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the parentheses after "equal"? Also should remove "is" on the previous line, "paremeter" -> "parameter", and ".." -> "and".

src/weights.jl Outdated
while values closer to 1 will put higher weight on the end elements of the vector.
"""
function exponential(n::Integer, λ::Real=0.99)
@assert 0 <= λ <= 1 && n > 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typically for argument checking it's better to use condition || throw(ArgumentError("...")). In this case,

n > 0 || throw(ArgumentError("cannot construct weights of length < 1"))
0 <= λ <= 1 || throw(ArgumentError("smoothing factor must be between 0 and 1"))

That provides more descriptive error messages for the user, plus it's better to use specifically-typed exceptions where possible.

src/weights.jl Outdated
"""
wmedian(v::RealVector, w::RealVector) = median(v, weights(w))
wmedian{W<:Real}(v::RealVector, w::WeightVec{W}) = median(v, w)
wmedian(v::RealVector, w::RealVector) = median(v, weights(w, false))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the weights API is changing to require a boolean parameter, the existing API should have a formal deprecation.

@nalimilan
Copy link
Member

I have found more references for the variance correction to apply to different types of weights.

Regarding sampling/probability weights, a good reference is svyvar from the R survey package. What it does (in the simple case where there is no stratification) is simply to apply the Bessel correction with n equal to the number of non-zero weights:

n = count(!iszero, w)
sw = sum(w)
xbar = sum(w .* x)/sw
sum(w .* (x - xbar).^2)/sw * n/(n - 1)

(Of course this is just a Julia illustration of the algorithm. This code should be optimized to avoid making copies.)

A good reference for analytical/precision weights is wtd.var from the Hmisc R package. For these weights, we should use the same formula as wtd.var when passed normwt=TRUE, i.e. normalize the weights so that they sum to the vector length (or maybe to the number of non-zero weights for consistency with above). Then the basic formula is:

n = length(w) # Could be count(!iszero, w) instead
w = w * n/sum(w)
sw = sum(w) # This is now equal to n, but maybe we should support non-normalized weights?
xbar = sum(w .* x)/sw
sum(w * (x - xbar).^2)/(sw - sum(w.^2)/sw)

This corresponds to the Wikipedia formula, and indeed it was written by the same person. Regarding the actual implementation, we could normalize the weights when creating the vector if that's always needed.

Another interesting reference (for analytical/precision weights) is SAS:
http://support.sas.com/documentation/cdl/en/proc/61895/HTML/default/viewer.htm#a002473731.htm#a000068934

src/weights.jl Outdated
bias::Real
end

immutable FrequencyWeights{S<:Integer, T<:Integer, V<:IntegerVector} <: AbstractWeights{S, T, V}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should restrict frequency weights to integers. While this seems obvious at first, formulas generalize to non-integer weights, and other software generally accepts arbitrary values. This can be useful in some cases (e.g. sometimes you need to split observations between groups). Also this will be more practical if your variable is stored as a float for some reason.

@rofinn
Copy link
Member Author

rofinn commented Apr 26, 2017

Sorry about the code churn in these commits folks. Turns out moving the corrected option to the stats functions required changing a lot of code. Here is a summary of the changes to help keep things organized.

Changes:

  • Reverts many test changes from the last commit
  • Changed a bunch of test cases to use corrected=false for now
  • This included a lot of little changes to method definition and some resulting formatting changes where necessary
  • Added a few test cases for the corrected variances ( I just used the code @nalimilan posted above )
  • Added bias correction for ProbabililtyWeights.
  • Deprecated WeightVec
  • Added a macro for easier creation of weight types.
  • Renamed weight creation function to fweights, pweights, etc.

src/weights.jl Outdated
end

"""
`@weights name`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Backticks shouldn't be used for indented blocks like this, since Markdown code formatting is applied by virtue of being indented. So this will actually show up in the docstring and docs like

`@weights name`

src/weights.jl Outdated
bias(w::ProbabilityWeights, [corrected])

```math
\fraction{n}{∑w × (n - 1)}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be \frac{n}{(n - 1) \sum w}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Similarly \frac and LaTeX commands rather than Unicode elsewhere)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with just using LaTeX commands, but the julia docs recommend using unicode characters.

Use Unicode characters rather than their LaTeX escape sequence, i.e. α = 1 rather than \\alpha = 1.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, weird. Anyway, should still be \frac rather than \fraction, though I guess with the backslash escaped.

Copy link
Member

@nalimilan nalimilan Apr 27, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that remark applies to \sum, as it does more than just printing the character: it also uses the right size for the symbol. "characters" in this context mostly means "letters" AFAICT.

test/cov.jl Outdated
@@ -88,62 +88,62 @@ end
# weighted covariance

if VERSION < v"0.5.0-dev+679"
@test cov(X, wv1) ≈ S1w ./ sum(wv1)
@test cov(X, wv2; vardim=2) ≈ S2w ./ sum(wv2)
@test cov(X, wv1; corrected=false) ≈ S1w ./ sum(wv1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having to change all of the tests suggests that this is a breaking change. To keep from breaking users' code, we should probably make corrected=false the default.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with doing that for this PR, but it would be nice to switch it back to corrected=true in a later release?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's tricky to change things like this, since we'd need some kind of deprecation to avoid suddenly and silently breaking others' code. I think @nalimilan has gone through such a process recently; perhaps he could be of help here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I just noted in a comment above, I think the best deprecation path is to deprecate WeightsVec but keep it around so that existing code does not use the correction, and point to new weights types in the deprecation warning.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nalimilan To clarify, you'd like var, std and cov to always return the uncorrected version for WeightVecs even if corrected=true?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The corrected argument wouldn't be supported at all for WeightVec, or at least and error would be raised if corrected=true is passed.

src/StatsBase.jl Outdated
AbstractWeights, # the abstract type to represent any weight vector
AnalyticWeights, # the default type for representing a analytic/precision/reliability weight vectors
FrequencyWeights, # the type for representing a frequency weight vectors
ProbabilityWeights,# the type for representing a probability/sampling weight vectors
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Space before #. Also align with the rest of the block.

src/moments.jl Outdated
Base.varm!(R::AbstractArray, A::RealArray, wv::WeightVec, M::RealArray, dim::Int) =
scale!(_wsum_centralize!(R, @functorize(abs2), A, values(wv), M, dim, true), inv(sum(wv)))
function Base.varm!(R::AbstractArray, A::RealArray, wv::AbstractWeights, M::RealArray, dim::Int; corrected=true)
scale!(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Put the first argument on this line, and the closing parenthesis on the same line as the second argument. Same below.

BTW, @functorize is no longer needed with Julia 0.5 and above.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, maybe a style guide would be helpful. The argument I've heard for spreading args (for long function calls) across multiple lines is that it minimizes the size of you diffs later on. I'm not saying that's the right approach, but given that there are multiple styles floating around and the julia style guide doesn't discuss them maybe it's worth clarifying.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everyone has a different preferred style; it's mostly about being consistent within a particular package. That said, I'd love for the style guide in the manual to be more specific about things.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This kind of thing is worth clarifying, but I think it's pretty clear from the existing code base of JuliaStats packages that this style is against the convention.

src/moments.jl Outdated
@@ -15,10 +15,12 @@ whereas it's `length(x)-1` in `Base.varm`. The impact is that this is not a
weighted estimate of the population variance based on the sample; it's the weighted
variance of the sample.
"""
Base.varm(v::RealArray, wv::WeightVec, m::Real) = _moment2(v, wv, m)
function Base.varm(v::RealArray, wv::AbstractWeights, m::Real; corrected=true)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, you can also add a line break while keeping the short form without function... end.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm aware that the short form support line breaks, but I think that's harder for people to parse. The function ... end syntax with proper indentation provides a more consistent read through. I tend to have similar views on ternaries that cover multiple lines (particularly if they're nested). I'm fine with keeping to the short form if that's the desired style for this repo, but maybe we should add a style guide to make that clearer?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really care. There's a short style guide in Julia's CONTRIBUTING.md, but I don't think it's mentioned.

src/moments.jl Outdated
n = length(v)
s = 0.0
w = values(wv)
for i = 1:n
@inbounds z = v[i] - m
@inbounds s += (z * z) * w[i]
end
s / sum(wv)

result = s * bias(wv, corrected)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

result isn't needed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. Sorry, this is leftover from some debugging I was doing (removed the print statements, but forgot to reverse that).

src/moments.jl Outdated
n = length(v)
s = 0.0
for i = 1:n
@inbounds z = v[i] - m
s += z * z
end
s / n
s * bias(n, corrected)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Multiplying by a bias is backwards. Maybe call that function cov_correction, moment2_correction or something like that? It's also good to choose a specific name.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that bias isn't really descriptive of what it represents anymore. Would something like bias_factor, scale_factor, bias_coef, scale_coef, etc make more sense? The nice thing about keeping a relatively general name is that we can use the same function call var, std and cov.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

var and std are in some way based on covariance, so cov should be fine.

src/weights.jl Outdated
@weights AnalyticWeights

"""
AnalyticWeights(vs, [wsum])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wsum=sum(vs), without brackets.

src/weights.jl Outdated
end

"""
aweights(vs)

Construct a `AnalyticWeights` type from a given array.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

type -> vector.

Maybe add something like "See the documentation for that type for more information"?

src/weights.jl Outdated
Construct a `AnalyticWeights` type from a given array.
"""
aweights(vs::RealVector) = AnalyticWeights(vs)
aweights(vs::RealArray) = AnalyticWeights(vec(vs))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this method really needed? I don't think we perform this kind of conversion automatically in general.

src/weights.jl Outdated
n > 0 || throw(ArgumentError("cannot construct weights of length < 1"))
0 <= λ <= 1 || throw(ArgumentError("smoothing factor must be between 0 and 1"))
w0 = map(i -> λ * (1 - λ)^(1 - i), 1:n)
return weights(w0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be ExponentialWeights? But why do we need a specific weights type: don't them enter into one of the above families? Or maybe they should just use a generic weights type (without any correction)?

src/weights.jl Outdated
function bias(w::AbstractWeights, corrected=true)
s = sum(w)
if corrected
return inv(s * (1 - sum(normalize(values(w), 1) .^ 2)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

normalize and .^2 are going to create copies of the weights, which should be avoided. Formula should be adapted so that no allocation happens at all. Maybe you'll need a loop for that.

@JeffreySarnoff
Copy link

I just found WeightVec and the much cleaner reworking above. I will use the new abstraction and its support to hold the vector of weights used with rolling, windowed stats. Usually, these weights are normalized in some way and then used. It would be pleasant to do (should this as yet not be).

my_weights = new_abstraction_above( my_weights_as_values_in_sequence )

my_normalized_weights = normalize(my_weights)
my_normalized_weights = normalize(my_weights, p_norm = 2)

normalize!(my_weights)
normalize!(my_weights, p_norm = 1.618)

without reaching inside your type to access the sum

Copy link
Member Author

@rofinn rofinn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This still isn't ready yet, but I thought I should push what I have and point out a couple spots where I could use a second opinion. @nalimilan and @ararslan Thanks for bearing with this PR, I know it's been a lot of work.

src/StatsBase.jl Outdated
AnalyticWeights, # the default type for representing a analytic/precision/reliability weight vectors
FrequencyWeights, # the type for representing a frequency weight vectors
ProbabilityWeights, # the type for representing a probability/sampling weight vectors
ExponentialWeights, # the type for representing exponential weights
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Forgot to delete ExponentialWeights from export.

@@ -43,3 +37,28 @@ findat(a::AbstractArray, b::AbstractArray) = findat!(Array{Int}(size(b)), a, b)

@deprecate df(obj::StatisticalModel) dof(obj)
@deprecate df_residual(obj::StatisticalModel) dof_residual(obj)

@weights WeightVec
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if this WeightVec should go here or in weights.jl

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's deprecated it should live in this file, just needs @deprecate as appropriate

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm guessing using depwarn inside the WeightVec and weights methods is the correct approach here since we want to deprecate the functionality rather than the signature?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. There's also @deprecate_binding to deprecate WeightVec itself.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't @deprecate_binding only work if you're just changing the name of the binding?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but it's needed to print a warning when people use WeightVec directly. You can deprecate it in favor of FrequencyWeights, even if that's not fully correct it's better than nothing.

src/weights.jl Outdated
(ie: [Bessel's correction](https://en.wikipedia.org/wiki/Bessel's_correction)),
otherwise it will return ``\\frac{1}{n}``.
"""
cfactor(n::Integer, corrected=false) = 1 / (n - Int(corrected))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cfactor (for correction factor) seemed like the best choice for this function now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a fan of this name either: "factor" really is secondary here, what matters is 1) that it's a correction, 2) that it applies to var/cov/std.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

varcorrection, maybe biascorrection?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

varcorrection seems a little long, maybe cvar?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's even less descriptive than cfactor, IMO.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, varcorrection it is then. I don't really care about the name since we're not exporting it.

Copy link
Member Author

@rofinn rofinn Apr 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait, what about varden (variance denominator)? I feel like the fact that it takes a corrected argument should cover that it may provide bias correction. I'd need to slightly change what's returned, but that might be more understandable.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like varcorrection better. Or maybe bessel_correction?

src/weights.jl Outdated
``\\frac{1}{\sum w}``
"""
cfactor(wv::AbstractWeights, ::Type{Val{false}}) = 1 / sum(wv)
cfactor(wv::AbstractWeights, ::Type{Val{true}}) =
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd understand if folks are opposed to the Type{Val{True}}, but this means subtypes only need to implement the correction condition.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Val is really not what you want here: since the value of corrected will only be known at runtime, you're forcing the call to cfactor to go via dispatch at runtime, while it would have been inlined if you used ::Bool.

Again, I don't think we should provide a fallback correction for AbstractWeight: it's unlikely to be valid, which is worse than raising an error. Creating new weight types shouldn't be a frequent need, people can afford the cost of defining this simple method if it applies.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, I had missed that you throw an error. But still I think it's better to repeat the uncorrected equation in each method than to use Val.

@test zscore(a) ≈ zscore(a, mean(a), std(a))
@test zscore(a, 1) ≈ zscore(a, mean(a,1), std(a,1))
@test zscore(a, 2) ≈ zscore(a, mean(a,2), std(a,2))
@test zscore(a) ≈ zscore(a, mean(a), std(a; corrected=false))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The corrected=false is because mean_and stdnow has corrected=false. In order to support all of the existing behaviour we may need to mix and match which methods have corrected=true.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand: why not adapt mean_and_std to default to corrected=true? To avoid breaking existing code, we could keep corrected=false for a while, but print a warning when corrected isn't specified so that we can make the switch later.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm clearly missing something here. How am I suppose to check if a keyword argument is set?

Copy link
Member Author

@rofinn rofinn Apr 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NOTE: We could add that warning if we chose to make corrected a regular (vs keyword) argument. We'd just need to have a function which doesn't take the corrected argument, so that it can print a warning and call the function with the default setting? I initially thought making corrected a keyword argument was the best way to stay consistent with base, but var, std and cov aren't even consistent with each other in base, so it might make more sense to do whatever is best for our use case.

std(
...
std(A::AbstractArray, region; corrected, mean) in Base at statistics.jl:263

julia> cov(
cov(x::AbstractArray{T,1} where T, corrected::Bool) in Base at statistics.jl:346
...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can just have corrected=nothing to detect whether a keyword argument has been left to its default. Then you just need to replace its value with false if it's equal to nothing after printing the warning using Base.depwarn.

src/StatsBase.jl Outdated
wmedian, # weighted median
wquantile, # weighted quantile
AbstractWeights, # the abstract type to represent any weight vector
AnalyticWeights, # the default type for representing a analytic/precision/reliability weight vectors
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Singular "vector". "an analytic". Since these lines are wider than 92 chars, make them a bit shorter e.g. by removing "the", "default", and "for representing"/"to represent".

src/StatsBase.jl Outdated
FrequencyWeights, # the type for representing a frequency weight vectors
ProbabilityWeights, # the type for representing a probability/sampling weight vectors
ExponentialWeights, # the type for representing exponential weights
weights, # alias for aweights
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be removed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't we still want to export it (even if it's deprecated) to avoid breaking functionality?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@deprecate automatically exports for this reason.

src/cov.jl Outdated
scattermat(x::DenseMatrix, wv::WeightVec, vardim::Int=1) =
scattermatm(x, Base.mean(x, wv, vardim), wv, vardim)
## weighted cov
function Base.covm(x::DenseMatrix, mean, wv::AbstractWeights, vardim::Int=1, corrected::Bool=false)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line is wider than 92 chars, either remove function or split arguments on multiple lines. Since this file seems to be using the short form for one-line method definitions, it would be more consistent to use it at least here.

src/cov.jl Outdated
@@ -58,66 +58,41 @@ cov


"""
mean_and_cov(x, [wv::WeightVec]; vardim=1) -> (mean, cov)
mean_and_cov(x, [wv::AbstractWeights]; vardim=1) -> (mean, cov)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to document corrected. Same for mean_and_std and mean_and_var, var, varm, std, stdm, cov, covm.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I was holding off updating these docstrings until I was more comfortable with the behaviour. Since, folks don't seem to have an issue with the corrected=false keyword for all these functions I'll update that.

src/hist.jl Outdated
@@ -249,14 +247,15 @@ function append!{T,N}(h::AbstractHistogram{T,N}, vs::NTuple{N,AbstractVector})
end
h
end

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is unrelated, and it's not justified since there's no line break before the last append! definition either.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that was a typo from fixing conflicts when I rebased with master.

src/weights.jl Outdated
eweights(n, [λ])

Constructs an `AnalyticWeights` vector with a desired length `n` and smoothing factor `λ`,
where each element is set to ``λ * (1 - λ)^(1 - i)``.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"element in position ``i`` "

src/weights.jl Outdated

# Arguments
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As noted in the guidelines, the "arguments" section isn't recommended, except when there are many arguments. Here, the description of n is redundant with what is said above. You can just keep the last sentence about λ in the main description or in a separate paragraph.

src/weights.jl Outdated
Base.getindex(wv::WeightVec, i) = getindex(wv.values, i)
Base.size(wv::WeightVec) = size(wv.values)
"""
eweights(n, [λ])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

λ=0.99, and drop the brackets. Or maybe drop the default value, as it seems quite arbitrary and unlikely to be exactly what people need?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I guess the 0.99 is pretty arbitrary.

src/weights.jl Outdated
n > 0 || throw(ArgumentError("cannot construct weights of length < 1"))
0 <= λ <= 1 || throw(ArgumentError("smoothing factor must be between 0 and 1"))
w0 = map(i -> λ * (1 - λ)^(1 - i), 1:n)
aweights(w0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these really analytical weights? The docs should mention this if that's the case. Can you explain what exponential weights are used for?

Copy link
Member Author

@rofinn rofinn Apr 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I often use exponential weights to describe the relative importance of observations in temporal data. For example, I often want to get summary statistics about some lookback data, but putting greater value on more recent observations (as they better reflect the "current" state of the system). Since, these types of weights definitely aren't probability or frequency weights I got the impression that analytic or reliability weights best fit this. I'm fine with removing this a niche use case, but my understanding is that exponential weights are pretty common when working with temporal data.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, these aren't strictly exponential weights. Maybe this means we should keep a generic Weights type... Feel free to reintroduce it (to replace WeightsVec), though I'm not sure what's the best name for them.

test/moments.jl Outdated
# AnalyticWeights
@test var(x, aweights(ones(10)); corrected=true) ≈ var(x)

w = aweights(rand(10))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than applying the formulas, this should use fixed values and hardcode the result. Then I can check that it's correct in other software if you like.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then I can check that it's correct in other software if you like

That's a great idea, I'd really appreciate that! What kind of software were you thinking of testing against? This PR is probably big enough as it is, but I was wondering if it would make sense to automate testing against R using RCall.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking of the survey and Hmisc R packages, of Stata and maybe SAS. I wouldn't worry about using RCall, we can just hardcode the results for a few cases.

@nalimilan
Copy link
Member

@JeffreySarnoff AbstractWeight should probably implement setindex! so that they can be modified if needed. @rofinn just added support for getindex, you could open another PR for setindex!.

@ararslan
Copy link
Member

I just want to say thanks so much for your hard work and perseverence here, @rofinn. You're doing fantastic work and I'm excited to see this get merged.

@rofinn
Copy link
Member Author

rofinn commented Apr 29, 2017

Is it possible to turn deprecation warnings on and off during testing in julia? It would be nice to check that I haven't broken any existing behaviour without producing a bunch of deprecation warning during testing.

@nalimilan
Copy link
Member

Is it possible to turn deprecation warnings on and off during testing in julia? It would be nice to check that I haven't broken any existing behaviour without producing a bunch of deprecation warning during testing.

julia --depwarn=no is what you need.

src/cov.jl Outdated
@@ -44,80 +37,73 @@ that the data are centered and hence there's no need to subtract the mean.
When `vardim = 1`, the variables are considered columns with observations in rows;
when `vardim = 2`, variables are in rows with observations in columns.
"""
function scattermat end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line was actually correct since the docstring documents several methods, not just the following one. Same below.

Copy link
Member Author

@rofinn rofinn Apr 30, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, but is that line even necessary (calling help on the method name still works the same)? Also, the other line is just cov vs function cov end, what would the preference be to maintain consistency?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cov refers to an existing object; function cov end creates a function with 0 methods. Typically the latter is only used for forward-declaring functions to applying docstrings. I think it's usually preferable to use the former when the function has already been defined.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general it doesn't really make a big difference whether you attach the docstring to a method or to the function, but it matters in some cases: the docstring for the function will (or should) always appear first, and when a link to the source is provided (in the online manual) it wouldn't be really correct to point to a particular method.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, okay, to clarify, the preference moving forward is that we want the docstring attached to function <name> end with all other methods below that?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, just keep the existing organization. Always better to avoid making changes unrelated to the PR, else it's very hard to review. For example, moving scattermat_zm makes it very hard to see where its code changed.

My point was that the docstring should be attached to the method whose signature matches that shown in the first line. If that method doesn't exist, then attach the docstring to the function itself.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, just keep the existing organization. Always better to avoid making changes unrelated to the PR, else it's very hard to review. For example, moving scattermat_zm makes it very hard to see where its code changed.

Fair point. I do tend to get carried away when I'm editing code.

src/cov.jl Outdated
mean == nothing ? scattermat_zm(x .- Base.mean(x, wv, vardim), wv, vardim) :
scattermat_zm(x .- mean, wv, vardim)
end
* AnalyticWeights: ``\\frac{1}{\sum w - \sum {w^2} / \sum{w}^2}``
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Formula is incorrect AFAICT.

Copy link
Member Author

@rofinn rofinn Apr 30, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it should be \\frac{1}{\sum w - \sum {w^2} / \sum w}, which wouldn't be a problem if w was prenormalized, but still.

julia> cvar1(w) = sum(w) * (1 - sum(normalize(w, 1) .^ 2))
cvar1 (generic function with 1 method)

julia> cvar2(w) = sum(w) - (sum(w .^2) / sum(w))
cvar2 (generic function with 1 method)

julia> w = rand(10)
10-element Array{Float64,1}:
 0.608948
 0.980859
 0.656496
 0.994447
 0.912615
 0.654867
 0.0928164
 0.308836
 0.782151
 0.952136

julia> cvar1(w)
6.132433099237098

julia> cvar2(w)
6.132433099237098

@rofinn
Copy link
Member Author

rofinn commented May 1, 2017

Alright, apart from a couple remaining points about how to use @deprecate_binding and documenting groups of methods with function <name> end that's all the code changes...? I think I just need to update the documentation now if folks are mostly alright with this? @nalimilan I've updated the tests/moments.jl file to use @testset and hard coded values for the corrected variances, but I'd really appreciate if you could double check those values make sense.

Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

I have check other software with the data you used in the tests, and here are the (positive) results:

  • AnalyticalWeights: Confirmed 0.0694434 with R's Hmisc::wtd.var and norm=TRUE.
  • FrequencyWeights: Confirmed 0.054666 with R's Hmisc::wtd.var and norm=FALSE, with Stata's summarize x [iweight w] (since fweight does not accept non-integer weights), and with SAS's proc means ... vardef=wdf and with custom computation code.
  • ProbabilityWeights: confirmed 0.06628969 with R's survey::svyvar, with Stata's svy: mean x; estat sd and mean x [aweight=w] (which confusingly is known to give the correct estimation for population variance with... probability weights).

I still have many comments, but the core features are OK.

test/weights.jl Outdated
@test wquantile(data[1], weights(w), 0.5) ≈ answer
@test quantile(data[1], fweights(w), 0.5) ≈ answer
@test wquantile(data[1], fweights(w), [0.5]) ≈ [answer]
@test wquantile(data[1], fweights(w), 0.5) ≈ answer
@test wquantile(data[1], w, [0.5]) ≈ [answer]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Preserve alignment.

test/weights.jl Outdated
@test isa(weights([1, 2, 3]), WeightVec{Int})
@test isa(weights([1., 2., 3.]), WeightVec{Float64})
@test isa(weights([1 2 3; 4 5 6]), WeightVec{Int})
@test isa(fweights([1, 2, 3]), AbstractWeights{Int})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of the tests in this file should be put inside a loop and run for all types of weights (AbstractWeights should be replaced with the specific type).


@testset "StatsBase.Deprecates" begin

@testset "Deprecates WeightVec and weights" begin
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general we don't test deprecated features, we just test their replacement. So just remove this file since you test the new types with a similar code below.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like a good idea to test deprecated features as it ensures the deprecations don't break functionality. Adding these tests actually helped me catch a few issues.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, doing this will print lots of deprecation warnings, which will make the output hard to read, and deprecations from Base that need to be fixed won't be easy to spot. As long as you've tested it once locally, it's OK.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to keep the ENV["TEST_DEPRECATES"] behaviour around to make testing locally easier?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather not. Deprecated code is supposed to be removed relatively soon anyway.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I'll add an item to my checklist to remove it before dropping the "[WIP]" from this PR, but please just ignore that until then.

src/weights.jl Outdated


"""
wquantile(v, w, p)

Compute the `p`th quantile(s) of `v` with weights `w`, given as either a vector
or a `WeightVec`.
or a `AbstractWeights`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"An AbstractWeights object/vector" (since vec is no longer in the name).


Construct a `WeightVec` with weight values `vs` and sum of weights `wsum`.
"""
function WeightVec{S<:Real, V<:RealVector}(vs::V, s::S=sum(vs))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All code for WeightVec should be moved to deprecated.jl. That way it's easy to remove in the next version.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused, this is in deprecates.jl.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I'm not sure how I missed that...

test/moments.jl Outdated
@test moment(x, 4, 4.0) ≈ sum((x .- 4).^4) / length(x)
@test moment(x, 5, 4.0) ≈ sum((x .- 4).^5) / length(x)

w = fweights([1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you test other weight types too?

end
end

@testset "Covariance" begin
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should check corrected=true too, and for all weights types. That will be easier if you just hardcode the expected values, or call scattermat instead of copying the full formulas here.

test/runtests.jl Outdated
@@ -1,5 +1,13 @@
using StatsBase

opts = Base.JLOptions()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This shouldn't be needed either.

test/weights.jl Outdated
@testset "eweights" begin
λ = 0.2
wv = eweights(4, λ)
@test round(values(wv), 4) == [0.2, 0.25, 0.3125, 0.3906]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy the full precision numbers rather than rounding. That could help spotting precision issues if somebody tweaks the formula.

Should also check the type of wv.

test/weights.jl Outdated

## the sum and mean syntax
@testset "Sum" begin
@test sum([1.0, 2.0, 3.0], fweights([1.0, 0.5, 0.5])) ≈ 3.5
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you put this inside a loop and test all weights types? Same below (and everywhere this applies).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll note that this causes tests to take significantly longer to run (particularly test/weights.jl). Would it make sense to reduce the median and quantile testsets slightly?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless some tests are really redundant, I'd rather have too many tests than too few of them.

src/weights.jl Outdated
"""
varcorrection(w::ProbabilityWeights, corrected=false)

``\\frac{n}{(n - 1) \sum w}`` where `n = length(w)`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While we're at it, we should use the more correct n = count(!iszero, w). The code and the other docstring will need to be adjusted. To test this, you could simply add an element with a zero weight to the data: the computed variance must remain the same.

As a future optimization, we could store the number of non-zero weights at construction, when the sum is computed, but that's not a priority for this PR.

@rofinn
Copy link
Member Author

rofinn commented May 3, 2017

Alright, I think that addresses all the comments from the last review iteration. Please ignore the test/deprecates.jl and the corresponding changes in runtests.jl as I'm going to remove that once this PR is no longer a WIP.

@nalimilan
Copy link
Member

Yeah, that's painful, sorry about that. Unfortunately, the new .rst docstrings need some more adjustments: some single backticks need to be changed to double (around Julia expressions), lists should use * rather than -, and the indent should be four spaces. At least this is what I understand looking at other files and at the RST syntax reference.

@nalimilan nalimilan merged commit 6a78bce into JuliaStats:master May 7, 2017
@nalimilan
Copy link
Member

Thanks!

If you're still willing to work on weighting, it would be great to use the new types in GLM.jl, which interprets weights as frequency weights and should therefore only accept FrequencyWeights. Support for other types of weights shouldn't be hard to add, at least for probability weights.

@rofinn
Copy link
Member Author

rofinn commented May 7, 2017

Awesome, thanks @nalimilan and @ararslan for putting so much work into reviewing this! Hopefully, my future PRs will be a bit smoother.

@rofinn rofinn deleted the weightvec-types branch May 7, 2017 15:36
@rofinn
Copy link
Member Author

rofinn commented May 7, 2017

@nalimilan I'll take a look at GLM.jl, but I might not have time to get that PR ready till later this week.

Base.std(v::RealArray, w::AbstractWeights; mean=nothing, corrected::DepBool=nothing) =
sqrt.(var(v, w; mean=mean, corrected=depcheck(:std, corrected)))

Base.stdm(v::RealArray, m::RealArray, dim::Int; corrected::DepBool=nothing) =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

guess this signature was here before, but it's a bit of piracy, isn't it?

Copy link
Member Author

@rofinn rofinn May 18, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that's been there for a while, but I didn't notice when I updated it. We had a discussion on #248 about it and I think the plan was to move this method to base julia and add a version check, but I haven't gotten around to doing it yet.

@matthieugomez
Copy link
Contributor

matthieugomez commented May 26, 2017

Great pull request! Thanks everyone for doing it. Just to be sure, are we sure about the plural form Weights (used in R) rather than the singular form Weight (used in SAS, Stata, Python)?

@ararslan
Copy link
Member

Yes, because a vector of weights contains more than one weight. 😉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants