Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add OrderStatistic and JointOrderStatistics distributions #1668

Merged
merged 46 commits into from
May 22, 2023
Merged
Show file tree
Hide file tree
Changes from 29 commits
Commits
Show all changes
46 commits
Select commit Hold shift + click to select a range
f74d4ac
Add OrderStatistic distribution
sethaxen Jan 30, 2023
d0eb010
Add JointOrderStatistics distribution
sethaxen Jan 30, 2023
d084189
Simplify implementation of logpdf
sethaxen Jan 31, 2023
4266f87
Bug fix
sethaxen Jan 31, 2023
f725eab
Add complementary methods
sethaxen Feb 5, 2023
57dd262
Avoid code duplication
sethaxen Feb 5, 2023
f610dea
Repair cquantile
sethaxen Feb 5, 2023
5325fd4
Add OrderStatistic tests
sethaxen Feb 5, 2023
c5d458e
Fix cquantile and add comments
sethaxen Feb 6, 2023
a3e8844
Actually repair cquantile function
sethaxen Feb 6, 2023
c78f614
Add more tests
sethaxen Feb 6, 2023
ac5f421
Remove suffix
sethaxen Feb 6, 2023
e59b9d7
Remove scrU
sethaxen Feb 6, 2023
0bcabf3
Use correction for multiple tests
sethaxen Feb 6, 2023
f9b2c09
Simplify code
sethaxen Feb 6, 2023
e5dd13a
Use correct distribution name
sethaxen Feb 6, 2023
f314d88
Use larger sample size
sethaxen Feb 6, 2023
5431c36
Ensure return type is constant
sethaxen Feb 9, 2023
2c3fa12
Swap i and j
sethaxen Feb 9, 2023
42c1ba7
Ensure eltype is returned
sethaxen Feb 9, 2023
6a6487f
Add JointOrderStatistics tests
sethaxen Feb 9, 2023
60d3b98
Add reference
sethaxen Feb 9, 2023
548acac
Link between docstrings
sethaxen Feb 9, 2023
606ab3c
Unify OrderStatistic tests
sethaxen Feb 9, 2023
23b48ff
More stringently compute number of checks
sethaxen Feb 9, 2023
edb05e2
Update testset name
sethaxen Feb 9, 2023
8a770a6
Increase number of draws
sethaxen Feb 9, 2023
cf5d188
Set seed in testset
sethaxen Feb 10, 2023
9e7fb7d
Merge branch 'master' into orderstats
sethaxen May 14, 2023
6d2a660
Add an order statistics docs page
sethaxen May 14, 2023
50bf3da
Merge branch 'master' into orderstats
sethaxen May 15, 2023
8c87e01
Rename rank variables to `rank` and `ranks`
sethaxen May 15, 2023
5e18eba
Make arg checking more efficient
sethaxen May 15, 2023
9a75518
Support tuple of ranks
sethaxen May 15, 2023
018b431
Apply suggestions from code review
sethaxen May 15, 2023
b6ddcba
Merge branch 'orderstats' of https://github.com/sethaxen/Distribution…
sethaxen May 15, 2023
56d028f
Use Fill
sethaxen May 15, 2023
8bd0bc2
Reduce chances of type-instability
sethaxen May 15, 2023
ebcabbc
Ensure type-stability and use GammaMTSampler
sethaxen May 15, 2023
433dfe1
Update docstrings
sethaxen May 15, 2023
f96f191
Apply suggestions from code review
sethaxen May 17, 2023
f9a424e
Add tests that check for empty ranks
sethaxen May 17, 2023
fa4dd6b
Add comment explaining choice of gamma sampler
sethaxen May 17, 2023
f888acd
Apply suggestions from code review
sethaxen May 18, 2023
d4a0c7e
Test for 0 density out of support
sethaxen May 21, 2023
2804662
Return -Inf for logpdf if out of support
sethaxen May 21, 2023
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions src/Distributions.jl
Original file line number Diff line number Diff line change
Expand Up @@ -106,6 +106,7 @@ export
InverseGaussian,
IsoNormal,
IsoNormalCanon,
JointOrderStatistics,
Kolmogorov,
KSDist,
KSOneSided,
Expand Down Expand Up @@ -141,6 +142,7 @@ export
Normal,
NormalCanon,
NormalInverseGaussian,
OrderStatistic,
Pareto,
PGeneralizedGaussian,
SkewedExponentialPower,
Expand Down
132 changes: 132 additions & 0 deletions src/multivariate/jointorderstatistics.jl
Original file line number Diff line number Diff line change
@@ -0,0 +1,132 @@
# Implementation based on chapters 2-4 of
# Arnold, Barry C., Narayanaswamy Balakrishnan, and Haikady Navada Nagaraja.
# A first course in order statistics. Society for Industrial and Applied Mathematics, 2008.

"""
JointOrderStatistics <: ContinuousMultivariateDistribution

The joint distribution of a subset of order statistics from a sample from a continuous univariate distribution.

JointOrderStatistics(
dist::ContinuousUnivariateDistribution,
n::Int,
r::AbstractVector{Int}=1:n;
check_args::Bool=true,
)

Construct the joint distribution of order statistics `r` from an IID sample of size `n` from `dist`.

The ``i``th order statistic of a sample is the ``i``th element of the sorted sample.
For example, the 1st order statistic is the sample minimum, while the ``n``th order
statistic is the sample maximum.

`r` must be a sorted vector of unique integers between 1 and `n`.
sethaxen marked this conversation as resolved.
Show resolved Hide resolved

For a single order statistic, use [`OrderStatistic`](@ref) instead.

## Examples

```julia
JointOrderStatistics(Normal(), 10) # Product(fill(Normal(), 10)) restricted to ordered vectors
JointOrderStatistics(Cauchy(), 10, [1, 10]) # joint distribution of the extrema
```
"""
struct JointOrderStatistics{D<:ContinuousUnivariateDistribution,R<:AbstractVector{Int}} <:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the continuous case, we can write down the PDF and not much else (except the mean and covariance for a handful of dists, for a future PR).

For the discrete case, the PDF is much more complicated and probably only tractable when length(r) <= 2. For a discrete dist whose support is a finite set, we can define a distribution of the order statistics for a sample without replacement, at least in the bivariate case. The PMF, mean, and covariance are known. I wonder if that should be a special case of this distribution.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if that should be a special case of this distribution.

Nah, seems complicated and not all that useful.

ContinuousMultivariateDistribution
dist::D
n::Int
r::R
function JointOrderStatistics(
dist::ContinuousUnivariateDistribution,
n::Int,
r::AbstractVector{Int}=1:n;
check_args::Bool=true,
)
@check_args(
JointOrderStatistics,
(n, n ≥ 1, "`n` must be a positive integer."),
(
r,
1 ≤ first(r) && last(r) ≤ n && issorted(r) && allunique(r),
sethaxen marked this conversation as resolved.
Show resolved Hide resolved
"`r` must be a sorted vector of unique integers between 1 and `n`.",
),
)
return new{typeof(dist),typeof(r)}(dist, n, r)
end
end

length(d::JointOrderStatistics) = length(d.r)
function insupport(d::JointOrderStatistics, x::AbstractVector)
return length(d) == length(x) && issorted(x) && all(Base.Fix1(insupport, d.dist), x)
sethaxen marked this conversation as resolved.
Show resolved Hide resolved
end
minimum(d::JointOrderStatistics) = fill(minimum(d.dist), length(d))
sethaxen marked this conversation as resolved.
Show resolved Hide resolved
maximum(d::JointOrderStatistics) = fill(maximum(d.dist), length(d))

params(d::JointOrderStatistics) = tuple(params(d.dist)..., d.n, d.r)
partype(d::JointOrderStatistics) = partype(d.dist)
Base.eltype(::Type{<:JointOrderStatistics{D}}) where {D} = Base.eltype(D)
sethaxen marked this conversation as resolved.
Show resolved Hide resolved

function logpdf(d::JointOrderStatistics, x::AbstractVector{<:Real})
n = d.n
r = d.r
lp = sum(Base.Fix1(logpdf, d.dist), x)
sethaxen marked this conversation as resolved.
Show resolved Hide resolved
T = eltype(lp)
sethaxen marked this conversation as resolved.
Show resolved Hide resolved
lp += loggamma(T(n + 1))
length(r) == n && return lp
i = 0
xᵢ = oftype(float(first(x)), -Inf)
sethaxen marked this conversation as resolved.
Show resolved Hide resolved
for (j, xⱼ) in zip(r, x)
lp += _marginalize_range(d.dist, n, i, j, xᵢ, xⱼ, T)
i = j
xᵢ = xⱼ
end
j = n + 1
xⱼ = oftype(xᵢ, Inf)
sethaxen marked this conversation as resolved.
Show resolved Hide resolved
lp += _marginalize_range(d.dist, n, i, j, xᵢ, xⱼ, T)
return lp
end

# given ∏ₖf(xₖ), marginalize all xₖ for i < k < j
function _marginalize_range(dist, n, i, j, xᵢ, xⱼ, T)
k = j - i - 1
k == 0 && return zero(T)
lpdiff = if i == 0
T(logcdf(dist, xⱼ))
elseif j == n + 1
T(logccdf(dist, xᵢ))
else
T(logdiffcdf(dist, xⱼ, xᵢ))
end
return k * lpdiff - loggamma(T(k + 1))
end

function _rand!(rng::AbstractRNG, d::JointOrderStatistics, x::AbstractVector{<:Real})
n = d.n
if n == length(d.r) # r == 1:n
# direct method, slower than inversion method for large `n` and distributions with
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems we should maybe use some heuristic for choosing one of the two algorithms, depending on n and possibly the type of distribution?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not certain what heuristic we could use here, since which strategy is faster strongly depends on the relative speed of the quantile function and the rand method of the wrapped distribution. Also n, since the direct method has an O(nlog(n)) sort step that for n large enough will make it slower than the exponential method, which is always O(n), though potentially with a large constant factor.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This motivated the idea of implementing two samplers and then just choosing the sampler based on a heuristic. But the idea of the user choosing a sampler themselves does not seem to be documented; are there any cases where a sampler is intended for direct usage and exported?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, no I'm not aware of such an example. Might be good enough for now.

# fast quantile function or that use inversion sampling
rand!(rng, d.dist, x)
sort!(x)
else
# use exponential generation method with inversion, where for gaps in the ranks, we
# use the fact that the sum Y of k IID variables xₘ ~ Exp(1) is Y ~ Gamma(k, 1).
# Lurie, D., and H. O. Hartley. "Machine-generation of order statistics for Monte
# Carlo computations." The American Statistician 26.1 (1972): 26-27.
# this is slow if length(d.r) is close to n and quantile for d.dist is expensive,
# but this branch is probably taken when length(d.r) is small or much smaller than n.
Copy link
Contributor Author

@sethaxen sethaxen Feb 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably maybe worth creating samplers for these two cases.

T = typeof(one(eltype(x)))
s = zero(one(T))
sethaxen marked this conversation as resolved.
Show resolved Hide resolved
i = 0
for (m, j) in zip(eachindex(x), d.r)
k = j - i
s += k > 1 ? rand(rng, Gamma(k, one(T); check_args=false)) : randexp(rng, T)
sethaxen marked this conversation as resolved.
Show resolved Hide resolved
i = j
x[m] = s
end
j = n + 1
k = j - i
s += k > 1 ? rand(rng, Gamma(k, one(T); check_args=false)) : randexp(rng, T)
x .= quantile.(d.dist, x ./ s)
end
return x
end
1 change: 1 addition & 0 deletions src/multivariates.jl
Original file line number Diff line number Diff line change
Expand Up @@ -112,6 +112,7 @@ end
for fname in ["dirichlet.jl",
"multinomial.jl",
"dirichletmultinomial.jl",
"jointorderstatistics.jl",
"mvnormal.jl",
"mvnormalcanon.jl",
"mvlognormal.jl",
Expand Down
106 changes: 106 additions & 0 deletions src/univariate/orderstatistic.jl
Original file line number Diff line number Diff line change
@@ -0,0 +1,106 @@
# Implementation based on chapters 2-4 of
# Arnold, Barry C., Narayanaswamy Balakrishnan, and Haikady Navada Nagaraja.
# A first course in order statistics. Society for Industrial and Applied Mathematics, 2008.

"""
OrderStatistic{D<:UnivariateDistribution,S<:ValueSupport} <: UnivariateDistribution{S}

The distribution of an order statistic from IID samples from a univariate distribution
sethaxen marked this conversation as resolved.
Show resolved Hide resolved

OrderStatistic(dist::UnivariateDistribution, n::Int, i::Int; check_args::Bool=true)

Construct the distribution of the `i`th order statistic from `n` independent samples from `dist`.

The ``i``th order statistic of a sample is the ``i``th element of the sorted sample.
For example, the 1st order statistic is the sample minimum, while the ``n``th order
statistic is the sample maximum.

If ``f`` is probability density (mass) function of `dist` with distribution function ``F``,
then the probability density function ``g`` of the order statistic for continuous
`dist` is
```math
g(x; n, i) = {n \\choose i} [F(x)]^{i-1} [1 - F(x)]^{n-i} f(x),
```
and the probability mass function ``g`` of the order statistic for discrete `dist` is
```math
g(x; n, i) = \\sum_{k=i}^n {n \\choose k} \\left( [F(x)]^k [1 - F(x)]^{n-k} - [F(x_-)]^k [1 - F(x_-)]^{n-k} \\right),
```
where ``x_-`` is the largest element in the support of `dist` less than ``x``.

For the joint distribution of a subset of order statistics, use
[`JointOrderStatistics`](@ref) instead.

## Examples

```julia
OrderStatistic(Cauchy(), 10, 1) # distribution of the sample minimum
OrderStatistic(DiscreteUniform(10), 10, 10) # distribution of the sample maximum
OrderStatistic(Gamma(1, 1), 11, 5) # distribution of the sample median
```
"""
struct OrderStatistic{D<:UnivariateDistribution,S<:ValueSupport} <:
UnivariateDistribution{S}
dist::D
n::Int
i::Int
sethaxen marked this conversation as resolved.
Show resolved Hide resolved
function OrderStatistic(
dist::UnivariateDistribution, n::Int, i::Int; check_args::Bool=true
)
@check_args(OrderStatistic, 1 ≤ i ≤ n)
return new{typeof(dist),value_support(typeof(dist))}(dist, n, i)
end
end

minimum(d::OrderStatistic) = minimum(d.dist)
maximum(d::OrderStatistic) = maximum(d.dist)
insupport(d::OrderStatistic, x::Real) = insupport(d.dist, x)

params(d::OrderStatistic) = tuple(params(d.dist)..., d.n, d.i)
partype(d::OrderStatistic) = partype(d.dist)
Base.eltype(::Type{<:OrderStatistic{D}}) where {D} = Base.eltype(D)
sethaxen marked this conversation as resolved.
Show resolved Hide resolved

# distribution of the ith order statistic from an IID uniform distribution, with CDF Uᵢₙ(x)
function _uniform_orderstatistic(d::OrderStatistic)
n = d.n
i = d.i
return Beta{Int}(i, n - i + 1)
end

function logpdf(d::OrderStatistic, x::Real)
b = _uniform_orderstatistic(d)
p = cdf(d.dist, x)
if value_support(typeof(d)) === Continuous
return logpdf(b, p) + logpdf(d.dist, x)
else
return logdiffcdf(b, p, p - pdf(d.dist, x))
end
end

for f in [:logcdf, :logccdf, :cdf, :ccdf]
sethaxen marked this conversation as resolved.
Show resolved Hide resolved
@eval begin
function $f(d::OrderStatistic, x::Real)
b = _uniform_orderstatistic(d)
return $f(b, cdf(d.dist, x))
end
end
end

for f in [:quantile, :cquantile]
sethaxen marked this conversation as resolved.
Show resolved Hide resolved
@eval begin
function $f(d::OrderStatistic, p::Real)
# since cdf is Fᵢₙ(x) = Uᵢₙ(Fₓ(x)), and Uᵢₙ is invertible and increasing, we
# have Fₓ(x) = Uᵢₙ⁻¹(Fᵢₙ(x)). then quantile function is
# Qᵢₙ(p) = inf{x: p ≤ Fᵢₙ(x)} = inf{x: Uᵢₙ⁻¹(p) ≤ Fₓ(x)} = Qₓ(Uᵢₙ⁻¹(p))
b = _uniform_orderstatistic(d)
return quantile(d.dist, $f(b, p))
end
end
end

function rand(rng::AbstractRNG, d::OrderStatistic)
# inverse transform sampling. Since quantile function is Qₓ(Uᵢₙ⁻¹(p)), we draw a random
# variable from Uᵢₙ and pass it through the quantile function of `d.dist`
T = eltype(d.dist)
b = _uniform_orderstatistic(d)
return T(quantile(d.dist, rand(rng, b)))
end
2 changes: 2 additions & 0 deletions src/univariates.jl
Original file line number Diff line number Diff line change
Expand Up @@ -729,3 +729,5 @@ end
for dname in continuous_distributions
include(joinpath("univariate", "continuous", "$(dname).jl"))
end

include(joinpath("univariate", "orderstatistic.jl"))
Loading