Wrong results for weighted quantile #435
Comments
There are multiple ways to extend the base function But it is not obvious to me that the existing function has the wrong output. You reject that 3.5 is the median because you say: if there is only one value But this is not true with the definition of
using DataFrames
x = [1, 2, 3, 4]
df = DataFrame(plow = cumsum([0.25, 0.25, 0.25, 0.25]), phigh = reverse(cumsum(reverse([0.25, 0.25, 0.25, 0.25]))), x = x)
#>│ Row │ plow │ phigh │ x │
#>│ │ Float64 │ Float64 │ Int64 │
#>├─────┼─────────┼─────────┼───────┤
#>│ 1 │ 0.25 │ 1.0 │ 1 │
#>│ 2 │ 0.5 │ 0.75 │ 2 │
#>│ 3 │ 0.75 │ 0.5 │ 3 │
#>│ 4 │ 1.0 │ 0.25 │ 4 │
Julia says that the 0.49 quantile is 2.47 but, according to your definition, it should be 2. That being said, I think the thread shows two problems with the existing implementation:
|
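To make the 0.49 example concrete, here is a minimal sketch comparing Julia's default (type 7, linear interpolation) with a type 1 quantile (inverse of the empirical CDF). `quantile_type1` is a hypothetical helper written for illustration only; it is not part of Base or StatsBase.

```julia
using Statistics

x = [1, 2, 3, 4]
p = 0.49

# Default in Julia (type 7): linear interpolation between order statistics.
q7 = quantile(x, p)

# Hypothetical helper: type 1 quantile, i.e. the smallest order statistic
# whose empirical CDF value is at least p.
function quantile_type1(x, p)
    s = sort(x)
    n = length(s)
    k = clamp(ceil(Int, n * p), 1, n)
    return s[k]
end

q1 = quantile_type1(x, p)

println("type 7: ", q7)   # approximately 2.47
println("type 1: ", q1)   # 2
```

Type 1 only ever returns observed values, which is why it carries over to weights and to ordinal data more directly than an interpolating definition.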
Btw, note that the current definition of weighted median([1, 2, 3, 4], fweights([2, 1, 3, 2]))
# 2.5
median([1, 1, 2, 3, 3, 3, 4, 4])
# 3 |
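The unweighted side of this comparison can be reproduced without StatsBase by expanding each value according to its frequency weight. This is only a sketch of the "repeat each observation" interpretation of frequency weights, using Base and Statistics.

```julia
using Statistics

x = [1, 2, 3, 4]
counts = [2, 1, 3, 2]   # the frequency weights from the example above

# Repeat each value counts[i] times: [1, 1, 2, 3, 3, 3, 4, 4]
expanded = reduce(vcat, fill.(x, counts))

m = median(expanded)
println(expanded)
println(m)   # 3.0
```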
At the end of the day, quantile type 7 is just not straightforward to extend to weights. I would be happier to have quantile defined as type 1 or 3 in base. It would make it simpler to extend it to weights. It would also make it possible to use quantile with types that are only ordinal. JuliaLang/julia#27367 |
Using the same default as R and NumPy is nice, though. For ordinal data, it sounds OK to require people to choose a different method.
Can't we just have it call
Cf. previous discussion here: #313 (comment). I guess several generalizations are possible, we just need one which is reasonable (for some definition of that term). Does the current answer (7.0) fit that bill? |
x = rand(10_000_000)
w = rand(10_000_000)
@time wquantile(x, w, 0.5)
1.908000 seconds (44 allocations: 382.676 MiB, 2.88% gc time)
0.5001186970922967
julia> @time wmedian(x, w)
1.911641 seconds (27 allocations: 231.275 MiB, 2.02% gc time)
0.5001187598955092
|
For many algorithms there's an obvious generalization of frequency weights to non-integer values. For example, to compute the mean it doesn't matter whether the weight is integer or not. Generally speaking, it sounds simple to give a meaning to non-integer weights: having two observations with weight Regarding the weighted median, let's remove the separate implementation then. |
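As a small illustration of the point about the mean, the sketch below uses a hand-written `weighted_mean` (a hypothetical helper, not StatsBase's own method). With integer weights it agrees with the mean of the expanded sample, and the same formula accepts non-integer weights unchanged.

```julia
using Statistics

# Hypothetical helper for illustration: sum of w[i] * x[i] over sum of w[i].
weighted_mean(x, w) = sum(w .* x) / sum(w)

x = [1, 2, 3, 4]
counts = [2, 1, 3, 2]

# Integer weights: same result as repeating each observation counts[i] times.
expanded = reduce(vcat, fill.(x, counts))
same = weighted_mean(x, counts) == mean(expanded)
println(same)   # true, both are 21 / 8

# Non-integer weights: the formula needs no special handling.
println(weighted_mean(x, [0.5, 1.5, 2.0, 1.0]))
```

No comparable single formula exists for an interpolating quantile, which is where the disagreement in this thread comes from.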
But, in your example, it should then be interpreted as a probability weight. I am having a hard time thinking of a case where someone wants non frequency weights to be different than probability weights. From Stata:
|
I'm not saying they have to be different when computing quantiles. That's the case e.g. for inference or when computing the Bessel correction for variance, but not necessarily for descriptive statistics. |
While I am doing a pull request, why is |
I guess that's just an oversight. Do we have any idea why Julia uses a separate method for |
So apart from #436, does something need to be addressed here @matthieugomez? We could reject non-integer frequency weights, but do you have an example where they give a clearly incorrect result? |
I don't really know what a correct or incorrect result would mean in this
situation, so I prefer to return an error. This is what Stata does (not
sure whether other software has a concept of frequency weights).
…On Fri, Feb 1, 2019 at 10:33 AM Milan Bouchet-Valat < ***@***.***> wrote:
So apart from #436, does something need to be addressed here
@matthieugomez? We could reject non-integer frequency weights, but do
you have an example where they give a clearly incorrect result?
|
For example, wouldn't the rule I describe above make sense? "having two observations with weight |
I'm not sure this rule would work. Remember the example:
quantile([1, 2], fweights([2, 2]), [0.25]) = 1
# but
quantile([1, 2], fweights([1, 1]), [0.25]) > 1
which violates your rule. |
Does it really? What I suggest would only imply |
Oh ok I misunderstood your rule. But then I'm not sure how it generalizes beyond the case where the weights for each value happen to sum up to an integer. In any case, my point is just that I wrote this code under the wrong assumption that frequency weights were always integer. So until someone develops a new definition, which I'm not capable of, I think it is better to return an error. |
Fine with me. I suspect there's nothing special about integers in this algorithm, but better safe than sorry. |
Fixed by #436. |
Hi, I clearly lack subtlety about the various definitions of weighted quantiles (and I passed quickly over the above discussion as a result), but I thought I'd share another, more obvious example of what the current implementation is doing. The frequency weights seem to do what I'd expect:
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.FrequencyWeights([1000, 1, 1, 1]))
1.0
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.FrequencyWeights([1, 1000, 1, 1]))
2.0
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.FrequencyWeights([1, 1, 1000, 1]))
3.0
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.FrequencyWeights([1, 1, 1, 1000]))
4.0
unlike the ProbabilityWeights:
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.ProbabilityWeights([1000, 1, 1, 1]/1003))
2.500000000000056
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.ProbabilityWeights([1, 1000, 1, 1]/1003))
1.501
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.ProbabilityWeights([1, 1, 1000, 1]/1003))
2.5
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.ProbabilityWeights([1, 1, 1, 1000]/1003))
3.499
which seems to:
Worse, it is numerically inaccurate:
julia> eps = 1e-50 ;
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.ProbabilityWeights([1-3*eps, eps, eps, eps]))
4.0
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.ProbabilityWeights([eps, 1-3*eps, eps, eps]))
1.5
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.ProbabilityWeights([eps, eps, 1-3*eps, eps]))
2.5
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.ProbabilityWeights([eps, eps, eps, 1-3*eps]))
3.5
whereas setting epsilon to exactly zero yields the same (and for me, expected) result as FrequencyWeights:
julia> eps = 0.;
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.ProbabilityWeights([1-3*eps, eps, eps, eps]))
1.0
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.ProbabilityWeights([eps, 1-3*eps, eps, eps]))
2.0
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.ProbabilityWeights([eps, eps, 1-3*eps, eps]))
3.0
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.ProbabilityWeights([eps, eps, eps, 1-3*eps]))
4.0
IMO the above shows surprising results that may go beyond the difference between various definitions (especially that the first weight is ignored, and possibly the discontinuity at the limit when the weights tend toward being concentrated on one element, though OK, discontinuities are part of mathematics -- but they often don't help when analyzing real data). Anyway, for me the workaround will be to multiply my weights by a large number, convert to integers, and use FrequencyWeights instead of ProbabilityWeights. |
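The workaround described above could look roughly like the following. Both `to_counts` and `count_median` are hypothetical helpers: the first scales and rounds the weights (the scale of 10^6 is an arbitrary assumption), and the second is a hand-rolled cumulative-count median used here only so the sketch runs without StatsBase. It is not the algorithm StatsBase uses. In practice the resulting counts would be wrapped in `StatsBase.FrequencyWeights`, as in the session above.

```julia
# Hypothetical helper: turn probability weights into integer counts.
to_counts(w; scale = 10^6) = round.(Int, w .* scale)

# Hypothetical helper (illustration only): the smallest value whose
# cumulative count reaches at least half of the total count.
function count_median(x, counts)
    order = sortperm(x)
    total = sum(counts)
    acc = 0
    for i in order
        acc += counts[i]
        2 * acc >= total && return x[i]
    end
    return x[last(order)]
end

x = [1, 2, 3, 4]
w = [1, 1000, 1, 1] ./ 1003

counts = to_counts(w)
println(counts)
println(count_median(x, counts))   # 2
```

Rounding loses any weight smaller than half of `1 / scale`, so weights as small as the `1e-50` used above would become zero counts.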
From https://discourse.julialang.org/t/median-vs-50th-quantile-giving-different-answers/18414/4
but
so
0.5
is not a median.
cc @matthieugomez