Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New distribution: Maximum #1655

Closed
wants to merge 4 commits into from
Closed

New distribution: Maximum #1655

wants to merge 4 commits into from

Conversation

Vilin97
Copy link

@Vilin97 Vilin97 commented Dec 28, 2022

A bare bones implementation of the distribution of maximum of n iid random variables.

@codecov-commenter
Copy link

codecov-commenter commented Dec 28, 2022

Codecov Report

Base: 85.52% // Head: 85.46% // Decreases project coverage by -0.06% ⚠️

Coverage data is based on head (71dd3cc) compared to base (432a7f9).
Patch coverage: 41.66% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1655      +/-   ##
==========================================
- Coverage   85.52%   85.46%   -0.07%     
==========================================
  Files         130      131       +1     
  Lines        8153     8165      +12     
==========================================
+ Hits         6973     6978       +5     
- Misses       1180     1187       +7     
Impacted Files Coverage Δ
src/Distributions.jl 100.00% <ø> (ø)
src/univariates.jl 74.07% <ø> (ø)
src/univariate/continuous/maximum.jl 41.66% <41.66%> (ø)

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@Vilin97 Vilin97 marked this pull request as draft December 28, 2022 03:05
@sethaxen
Copy link
Contributor

I like the idea of having such a distribution. I imagine though that users would find an analogous Minimum or Median useful. Which makes me wonder if it would be better to add a general OrderStatistic(dist, n, i) distribution that takes a parameter n (sample size) and a parameter i (order statistic). Maximum would correspond to OrderStatistic(dist, n, n), while Minimum would be OrderStatistic(dist, n, 1), and Median would be OrderStatistic(dist, n, fld(n, 2)). One could even have a Quantile(dist, n, p) alias that is just OrderStatistic(dist, n, floor(Int, p*n)). Though I'm in general not a fan of things that look like constructors returning other objects.

@Vilin97
Copy link
Author

Vilin97 commented Dec 30, 2022

Thank you for your interest, @sethaxen! I agree that OrderStatistic would be a good generalization but the formulas for its cdf and pdf are considerably more complicated. AFAIK, there is no closed form formula for the inverse of the cdf. Therefore, I don't think it makes sense to implement OrderStatistic.

@sethaxen
Copy link
Contributor

sethaxen commented Dec 30, 2022

Thank you for your interest, @sethaxen! I agree that OrderStatistic would be a good generalization but the formulas for its cdf and pdf are considerably more complicated.

They seem complicated, but they're much simpler when written as a function of the PDF and CDF of the binomial distribution (Edit: probably need to double-check that the pdf works even when dist is discrete):

function orderstat_pdf(dist, n, r, x)
    d = Binomial(n - 1, cdf(dist, x))
    return n * pdf(d, r - 1) * pdf(dist, x)
end

function orderstat_cdf(dist, n, r, x)
    d = Binomial(n, cdf(dist, x))
    return cdf(d, n) - cdf(d, r - 1)
end

AFAIK, there is no closed form formula for the inverse of the cdf. Therefore, I don't think it makes sense to implement OrderStatistic.

Several other distributions in this package don't have closed-form expressions for the quantile function e.g.

function quantile(d::UnivariateMixture{Continuous}, p::Real)
ps = probs(d)
min_q, max_q = extrema(quantile(component(d, i), p) for (i, pi) in enumerate(ps) if pi > 0)
quantile_bisect(d, p, min_q, max_q)
end

There are several quantile algorithms in https://github.com/JuliaStats/Distributions.jl/blob/master/src/quantilealgs.jl that can be used here. There's likely a way to set absolute lower and upper bounds for the quantile for quantile_bisect (Edit: you could use the quantiles for the minimum and maximum as the extrema, but maybe tighter bounds are possible).

@sethaxen
Copy link
Contributor

Hi @Vilin97 are you interested in tackling this more general OrderStatistics distribution?

@Vilin97
Copy link
Author

Vilin97 commented Jan 27, 2023

Hi @Vilin97 are you interested in tackling this more general OrderStatistics distribution?

I'm not, sorry! I understand the desire for more generality but in this case it will come at the expense of less certainty, e.g. with quantiles. I will leave the implementation of orderStatistics to someone else.

Copy link
Contributor

@sethaxen sethaxen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On further thought, even if we have an OrderStatistic distribution, it makes sense to have independent Maximum and Minimum distributions. I'll tackle the other two once this is finished, but I have some suggestions:

@@ -0,0 +1,25 @@
"""
The maximum of n iid random variables with continuous univariate distribution
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see any reason to limit this to continuous distributions, since it's useful for discrete ones as well. Could you then move maximum.jl to be in the univariate/ directory?

Also, could you expand this docstring similar to others in this package, e.g. Binomial?

"""
The maximum of n iid random variables with continuous univariate distribution
"""
struct Maximum{D<:ContinuousUnivariateDistribution} <: ContinuousUnivariateDistribution
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
struct Maximum{D<:ContinuousUnivariateDistribution} <: ContinuousUnivariateDistribution
struct Maximum{D<:UnivariateDistribution,S<:ValueSupport} <: UnivariateDistribution{S}

struct Maximum{D<:ContinuousUnivariateDistribution} <: ContinuousUnivariateDistribution
dist::D
n::Int
Maximum{D}(dist, n) where {D<:ContinuousUnivariateDistribution} = new{D}(dist, n)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Maximum{D}(dist, n) where {D<:ContinuousUnivariateDistribution} = new{D}(dist, n)
function Maximum{D}(dist, n) where {D<:UnivariateDistribution}
new{D,value_support(D)}(dist, n)
end

Maximum{D}(dist, n) where {D<:ContinuousUnivariateDistribution} = new{D}(dist, n)
end

function Maximum(dist::D, n::Integer; check_args::Bool=true) where {D <: ContinuousUnivariateDistribution}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
function Maximum(dist::D, n::Integer; check_args::Bool=true) where {D <: ContinuousUnivariateDistribution}
function Maximum(dist::D, n::Integer; check_args::Bool=true) where {D <: UnivariateDistribution}

return Maximum{D}(dist, n)
end

rand(rng::AbstractRNG, d::Maximum{D}) where {D} = maximum([rand(rng, d.dist) for _ in 1:d.n])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need for the D keyword since we don't use it. Also, for continuous d we can use that Maximum(Uniform(0, 1), n) is the same as Beta(n, 1) to sample more efficiently. Might be able to do something similar in the discrete case, but we can at least use an iterator to avoid allocating a large array for large n:

Suggested change
rand(rng::AbstractRNG, d::Maximum{D}) where {D} = maximum([rand(rng, d.dist) for _ in 1:d.n])
rand(rng::AbstractRNG, d::Maximum) = maximum(rand(rng, d.dist) for _ in 1:d.n)
function rand(rng::AbstractRNG, d::Maximum{<:ContinuousUnivariateDistribution})
return quantile(d.dist, rand(rng, Beta(d.n, 1)))
end

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even better, since rand(Beta(n, 1) is equivalent to rand()^(1//n)

Suggested change
rand(rng::AbstractRNG, d::Maximum{D}) where {D} = maximum([rand(rng, d.dist) for _ in 1:d.n])
rand(rng::AbstractRNG, d::Maximum) = maximum(rand(rng, d.dist) for _ in 1:d.n)
function rand(rng::AbstractRNG, d::Maximum{<:ContinuousUnivariateDistribution})
return quantile(d.dist, rand(rng)^(1//d.n))
end


#### Evaluation

cdf(d::Maximum{D}, x::Real) where {D} = cdf(d.dist, x)^d.n
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need for D

Suggested change
cdf(d::Maximum{D}, x::Real) where {D} = cdf(d.dist, x)^d.n
cdf(d::Maximum, x::Real) = cdf(d.dist, x)^d.n

#### Evaluation

cdf(d::Maximum{D}, x::Real) where {D} = cdf(d.dist, x)^d.n
pdf(d::Maximum{D}, x::Real) where {D} = d.n*pdf(d.dist, x)*cdf(d.dist, x)^(d.n-1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need for D, and we can implement pdf for the discrete case as cdf(d, x) - cdf(d, xminus) where xminus is the largest element in the support less than x:

Suggested change
pdf(d::Maximum{D}, x::Real) where {D} = d.n*pdf(d.dist, x)*cdf(d.dist, x)^(d.n-1)
pdf(d::Maximum, x::Real) = d.n*pdf(d.dist, x)*cdf(d.dist, x)^(d.n-1)
function pdf(d::Maximum{<:DiscreteUnivariateDistribution}, x::Real)
p = cdf(d.dist, x)
n = d.n
return p^n - (p - pdf(d.dist, x))^n
end


cdf(d::Maximum{D}, x::Real) where {D} = cdf(d.dist, x)^d.n
pdf(d::Maximum{D}, x::Real) where {D} = d.n*pdf(d.dist, x)*cdf(d.dist, x)^(d.n-1)
logpdf(d::Maximum{D}, x::Real) where {D} = log(d.n)+logpdf(d.dist, x)+(d.n-1)*logcdf(d.dist, x)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If x is e.g. a Float32, taking log(d.n) will promote it to a Float64. We can also support discrete case as above:

Suggested change
logpdf(d::Maximum{D}, x::Real) where {D} = log(d.n)+logpdf(d.dist, x)+(d.n-1)*logcdf(d.dist, x)
function logpdf(d::Maximum, x::Real)
n = d.n
dist = d.dist
lp = logpdf(dist, x)+(n-1)*logcdf(dist, x)
return lp + log(oftype(lp, n))
end
function logpdf(d::Maximum{<:DiscreteUnivariateDistribution}, x::Real)
dist = d.dist
n = d.n
logp = logcdf(d.dist, x)
return n*logp + log1mexp(n*log1mexp(logpdf(d.dist, x) - logp))
end

Comment on lines +23 to +25
minimum(d::Maximum{D}) where {D} = minimum(d.dist)
maximum(d::Maximum{D}) where {D} = maximum(d.dist)
insupport(d::Maximum{D}, x::Real) where {D} = insupport(d.dist, x)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need for D:

Suggested change
minimum(d::Maximum{D}) where {D} = minimum(d.dist)
maximum(d::Maximum{D}) where {D} = maximum(d.dist)
insupport(d::Maximum{D}, x::Real) where {D} = insupport(d.dist, x)
minimum(d::Maximum) = minimum(d.dist)
maximum(d::Maximum) = maximum(d.dist)
insupport(d::Maximum, x::Real) = insupport(d.dist, x)

cdf(d::Maximum{D}, x::Real) where {D} = cdf(d.dist, x)^d.n
pdf(d::Maximum{D}, x::Real) where {D} = d.n*pdf(d.dist, x)*cdf(d.dist, x)^(d.n-1)
logpdf(d::Maximum{D}, x::Real) where {D} = log(d.n)+logpdf(d.dist, x)+(d.n-1)*logcdf(d.dist, x)
quantile(d::Maximum{D}, q::Real) where {D} = quantile(d.dist, q^(1/d.n))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need for D, and we can use // to avoid unnecessarily promoting to a Float64. Also, I'd have to think more about how if this needs to be changed to handle the discrete case.

Suggested change
quantile(d::Maximum{D}, q::Real) where {D} = quantile(d.dist, q^(1/d.n))
quantile(d::Maximum, q::Real) = quantile(d.dist, q^(1//d.n))

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems nothing special needs to be done, see #1655 (comment).

@sethaxen
Copy link
Contributor

...in this case it will come at the expense of less certainty, e.g. with quantiles.

Interestingly, https://epubs.siam.org/doi/10.1137/1.9780898719062.ch4, they point out that the pth quantile of the order statistic i for sample size n can be computed as quantile(d.dist, quantile(Beta(i, n-i+1), p)).

@Vilin97
Copy link
Author

Vilin97 commented Mar 9, 2023

In light of #1668 is this now obsolete, @sethaxen ? I have not had the time to address your comments but if it's not obsolete, I will address your comments within a week to make this merge-ready.

@sethaxen
Copy link
Contributor

In light of #1668 is this now obsolete, @sethaxen ?

I'm not certain. I just benchmarked, and pdf, cdf, and logpdf in this PR are all 1.5-3x faster than representing the distribution of the maximum using OrderStatistic, while quantile is ~20x faster. I don't know whether it's better to have a custom type for Maximum or just to special-case the maximum (and minimum) in these functions in OrderStatistic.

@Vilin97 Vilin97 closed this Jun 3, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants