Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add OrderStatistic and JointOrderStatistics distributions #1668

Merged
merged 46 commits into from
May 22, 2023
Merged
Show file tree
Hide file tree
Changes from 40 commits
Commits
Show all changes
46 commits
Select commit Hold shift + click to select a range
f74d4ac
Add OrderStatistic distribution
sethaxen Jan 30, 2023
d0eb010
Add JointOrderStatistics distribution
sethaxen Jan 30, 2023
d084189
Simplify implementation of logpdf
sethaxen Jan 31, 2023
4266f87
Bug fix
sethaxen Jan 31, 2023
f725eab
Add complementary methods
sethaxen Feb 5, 2023
57dd262
Avoid code duplication
sethaxen Feb 5, 2023
f610dea
Repair cquantile
sethaxen Feb 5, 2023
5325fd4
Add OrderStatistic tests
sethaxen Feb 5, 2023
c5d458e
Fix cquantile and add comments
sethaxen Feb 6, 2023
a3e8844
Actually repair cquantile function
sethaxen Feb 6, 2023
c78f614
Add more tests
sethaxen Feb 6, 2023
ac5f421
Remove suffix
sethaxen Feb 6, 2023
e59b9d7
Remove scrU
sethaxen Feb 6, 2023
0bcabf3
Use correction for multiple tests
sethaxen Feb 6, 2023
f9b2c09
Simplify code
sethaxen Feb 6, 2023
e5dd13a
Use correct distribution name
sethaxen Feb 6, 2023
f314d88
Use larger sample size
sethaxen Feb 6, 2023
5431c36
Ensure return type is constant
sethaxen Feb 9, 2023
2c3fa12
Swap i and j
sethaxen Feb 9, 2023
42c1ba7
Ensure eltype is returned
sethaxen Feb 9, 2023
6a6487f
Add JointOrderStatistics tests
sethaxen Feb 9, 2023
60d3b98
Add reference
sethaxen Feb 9, 2023
548acac
Link between docstrings
sethaxen Feb 9, 2023
606ab3c
Unify OrderStatistic tests
sethaxen Feb 9, 2023
23b48ff
More stringently compute number of checks
sethaxen Feb 9, 2023
edb05e2
Update testset name
sethaxen Feb 9, 2023
8a770a6
Increase number of draws
sethaxen Feb 9, 2023
cf5d188
Set seed in testset
sethaxen Feb 10, 2023
9e7fb7d
Merge branch 'master' into orderstats
sethaxen May 14, 2023
6d2a660
Add an order statistics docs page
sethaxen May 14, 2023
50bf3da
Merge branch 'master' into orderstats
sethaxen May 15, 2023
8c87e01
Rename rank variables to `rank` and `ranks`
sethaxen May 15, 2023
5e18eba
Make arg checking more efficient
sethaxen May 15, 2023
9a75518
Support tuple of ranks
sethaxen May 15, 2023
018b431
Apply suggestions from code review
sethaxen May 15, 2023
b6ddcba
Merge branch 'orderstats' of https://github.com/sethaxen/Distribution…
sethaxen May 15, 2023
56d028f
Use Fill
sethaxen May 15, 2023
8bd0bc2
Reduce chances of type-instability
sethaxen May 15, 2023
ebcabbc
Ensure type-stability and use GammaMTSampler
sethaxen May 15, 2023
433dfe1
Update docstrings
sethaxen May 15, 2023
f96f191
Apply suggestions from code review
sethaxen May 17, 2023
f9a424e
Add tests that check for empty ranks
sethaxen May 17, 2023
fa4dd6b
Add comment explaining choice of gamma sampler
sethaxen May 17, 2023
f888acd
Apply suggestions from code review
sethaxen May 18, 2023
d4a0c7e
Test for 0 density out of support
sethaxen May 21, 2023
2804662
Return -Inf for logpdf if out of support
sethaxen May 21, 2023
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions docs/make.jl
Original file line number Diff line number Diff line change
Expand Up @@ -17,6 +17,7 @@ makedocs(
"reshape.md",
"cholesky.md",
"mixture.md",
"order_statistics.md",
"convolution.md",
"fit.md",
"extends.md",
Expand Down
16 changes: 16 additions & 0 deletions docs/src/order_statistics.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,16 @@
# Order Statistics

The $i$th [Order Statistic](https://en.wikipedia.org/wiki/Order_statistic) of a random sample of size $n$ from a univeriate distribution is the $i$th element after sorting in increasing order.
As a special case, the $i$th and $n$th order statistics are the minimum and maximum of the sample, while for odd $n$, the $\lceil \frac{n}{2} \rceil$th entry is the sample median.
sethaxen marked this conversation as resolved.
Show resolved Hide resolved

Given any univariate distribution and the sample size $n$, we can construct the distribution of its $i$th order statistic:

```@docs
OrderStatistic
```

If we are interested in more than one order statistic, for continuous univariate distributions we can also construct the joint distribution of order statistics:

```@docs
JointOrderStatistics
```
2 changes: 2 additions & 0 deletions src/Distributions.jl
Original file line number Diff line number Diff line change
Expand Up @@ -107,6 +107,7 @@ export
IsoNormal,
IsoNormalCanon,
JohnsonSU,
JointOrderStatistics,
Kolmogorov,
KSDist,
KSOneSided,
Expand Down Expand Up @@ -142,6 +143,7 @@ export
Normal,
NormalCanon,
NormalInverseGaussian,
OrderStatistic,
Pareto,
PGeneralizedGaussian,
SkewedExponentialPower,
Expand Down
156 changes: 156 additions & 0 deletions src/multivariate/jointorderstatistics.jl
Original file line number Diff line number Diff line change
@@ -0,0 +1,156 @@
# Implementation based on chapters 2-4 of
# Arnold, Barry C., Narayanaswamy Balakrishnan, and Haikady Navada Nagaraja.
# A first course in order statistics. Society for Industrial and Applied Mathematics, 2008.

"""
JointOrderStatistics <: ContinuousMultivariateDistribution

The joint distribution of a subset of order statistics from a sample from a continuous
univariate distribution.

JointOrderStatistics(
dist::ContinuousUnivariateDistribution,
n::Int,
ranks=Base.OneTo(n);
check_args::Bool=true,
)

Construct the joint distribution of order statistics for the specified `ranks` from an IID
sample of size `n` from `dist`.

The ``i``th order statistic of a sample is the ``i``th element of the sorted sample.
For example, the 1st order statistic is the sample minimum, while the ``n``th order
statistic is the sample maximum.

`ranks` must be a sorted vector or tuple of unique `Int`s between 1 and `n`.

For a single order statistic, use [`OrderStatistic`](@ref) instead.

## Examples

```julia
JointOrderStatistics(Normal(), 10) # Product(fill(Normal(), 10)) restricted to ordered vectors
JointOrderStatistics(Cauchy(), 10, 2:9) # joint distribution of all but the extrema
JointOrderStatistics(Cauchy(), 10, (1, 10)) # joint distribution of only the extrema
```
"""
struct JointOrderStatistics{
D<:ContinuousUnivariateDistribution,R<:Union{AbstractVector{Int},Tuple{Int,Vararg{Int}}}
} <: ContinuousMultivariateDistribution
dist::D
n::Int
ranks::R
function JointOrderStatistics(
dist::ContinuousUnivariateDistribution,
n::Int,
ranks::Union{AbstractVector{Int},Tuple{Int,Vararg{Int}}}=Base.OneTo(n);
check_args::Bool=true,
)
@check_args(
JointOrderStatistics,
(n, n ≥ 1, "`n` must be a positive integer."),
(
ranks,
_are_ranks_valid(ranks, n),
"`ranks` must be a sorted vector or tuple of unique integers between 1 and `n`.",
),
)
return new{typeof(dist),typeof(ranks)}(dist, n, ranks)
end
end

_islesseq(x, y) = isless(x, y) || isequal(x, y)

function _are_ranks_valid(ranks, n)
first(ranks) ≥ 1 || return false
last(ranks) ≤ n || return false
# combined with the above checks, this is equivalent to but faster than
# issorted(ranks) && allunique(ranks)
return issorted(ranks; lt=_islesseq)
sethaxen marked this conversation as resolved.
Show resolved Hide resolved
end
function _are_ranks_valid(ranks::AbstractRange, n)
first(ranks) ≥ 1 || return false
last(ranks) ≤ n || return false
return issorted(ranks) && allunique(ranks)
sethaxen marked this conversation as resolved.
Show resolved Hide resolved
end

length(d::JointOrderStatistics) = length(d.ranks)
function insupport(d::JointOrderStatistics, x::AbstractVector)
return length(d) == length(x) && issorted(x) && all(Base.Fix1(insupport, d.dist), x)
sethaxen marked this conversation as resolved.
Show resolved Hide resolved
end
minimum(d::JointOrderStatistics) = Fill(minimum(d.dist), length(d))
maximum(d::JointOrderStatistics) = Fill(maximum(d.dist), length(d))

params(d::JointOrderStatistics) = tuple(params(d.dist)..., d.n, d.ranks)
partype(d::JointOrderStatistics) = partype(d.dist)
Base.eltype(::Type{<:JointOrderStatistics{D}}) where {D} = Base.eltype(D)
sethaxen marked this conversation as resolved.
Show resolved Hide resolved
Base.eltype(d::JointOrderStatistics) = eltype(d.dist)

function logpdf(d::JointOrderStatistics, x::AbstractVector{<:Real})
n = d.n
ranks = d.ranks
lp = loglikelihood(d.dist, x)
T = typeof(lp)
lp += loggamma(T(n + 1))
length(ranks) == n && return lp
i = first(ranks)
xᵢ = first(x)
if i > 1 # _marginalize_range(d.dist, 0, i, -Inf, xᵢ, T)
lp += (i - 1) * logcdf(d.dist, xᵢ) - loggamma(T(i))
end
for (j, xⱼ) in Iterators.drop(zip(ranks, x), 1)
lp += _marginalize_range(d.dist, i, j, xᵢ, xⱼ, T)
i = j
xᵢ = xⱼ
end
if i < n # _marginalize_range(d.dist, i, n + 1, xᵢ, Inf, T)
lp += (n - i) * logccdf(d.dist, xᵢ) - loggamma(T(n - i + 1))
end
return lp
end

# given ∏ₖf(xₖ), marginalize all xₖ for i < k < j
function _marginalize_range(dist, i, j, xᵢ, xⱼ, T)
k = j - i - 1
k == 0 && return zero(T)
return k * T(logdiffcdf(dist, xⱼ, xᵢ)) - loggamma(T(k + 1))
end

function _rand!(rng::AbstractRNG, d::JointOrderStatistics, x::AbstractVector{<:Real})
n = d.n
if n == length(d.ranks) # ranks == 1:n
# direct method, slower than inversion method for large `n` and distributions with
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems we should maybe use some heuristic for choosing one of the two algorithms, depending on n and possibly the type of distribution?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not certain what heuristic we could use here, since which strategy is faster strongly depends on the relative speed of the quantile function and the rand method of the wrapped distribution. Also n, since the direct method has an O(nlog(n)) sort step that for n large enough will make it slower than the exponential method, which is always O(n), though potentially with a large constant factor.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This motivated the idea of implementing two samplers and then just choosing the sampler based on a heuristic. But the idea of the user choosing a sampler themselves does not seem to be documented; are there any cases where a sampler is intended for direct usage and exported?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, no I'm not aware of such an example. Might be good enough for now.

# fast quantile function or that use inversion sampling
rand!(rng, d.dist, x)
sort!(x)
else
# use exponential generation method with inversion, where for gaps in the ranks, we
# use the fact that the sum Y of k IID variables xₘ ~ Exp(1) is Y ~ Gamma(k, 1).
# Lurie, D., and H. O. Hartley. "Machine-generation of order statistics for Monte
# Carlo computations." The American Statistician 26.1 (1972): 26-27.
# this is slow if length(d.ranks) is close to n and quantile for d.dist is expensive,
# but this branch is probably taken when length(d.ranks) is small or much smaller than n.
T = typeof(one(eltype(x)))
s = zero(eltype(x))
i = 0
for (m, j) in zip(eachindex(x), d.ranks)
k = j - i
if k > 1
s += T(rand(rng, GammaMTSampler(Gamma{T}(T(k), T(1)))))
sethaxen marked this conversation as resolved.
Show resolved Hide resolved
else
s += randexp(rng, T)
end
i = j
x[m] = s
end
j = n + 1
k = j - i
if k > 1
s += T(rand(rng, GammaMTSampler(Gamma{T}(T(k), T(1)))))
else
s += randexp(rng, T)
end
x .= quantile.(d.dist, x ./ s)
end
return x
end
1 change: 1 addition & 0 deletions src/multivariates.jl
Original file line number Diff line number Diff line change
Expand Up @@ -112,6 +112,7 @@ end
for fname in ["dirichlet.jl",
"multinomial.jl",
"dirichletmultinomial.jl",
"jointorderstatistics.jl",
"mvnormal.jl",
"mvnormalcanon.jl",
"mvlognormal.jl",
Expand Down
108 changes: 108 additions & 0 deletions src/univariate/orderstatistic.jl
Original file line number Diff line number Diff line change
@@ -0,0 +1,108 @@
# Implementation based on chapters 2-4 of
# Arnold, Barry C., Narayanaswamy Balakrishnan, and Haikady Navada Nagaraja.
# A first course in order statistics. Society for Industrial and Applied Mathematics, 2008.

"""
OrderStatistic{D<:UnivariateDistribution,S<:ValueSupport} <: UnivariateDistribution{S}

The distribution of an order statistic from IID samples from a univariate distribution.

OrderStatistic(dist::UnivariateDistribution, n::Int, rank::Int; check_args::Bool=true)

Construct the distribution of the `rank` ``=i``th order statistic from `n` independent
samples from `dist`.

The ``i``th order statistic of a sample is the ``i``th element of the sorted sample.
For example, the 1st order statistic is the sample minimum, while the ``n``th order
statistic is the sample maximum.

If ``f`` is the probability density (mass) function of `dist` with distribution function
``F``, then the probability density function ``g`` of the order statistic for continuous
`dist` is
```math
g(x; n, i) = {n \\choose i} [F(x)]^{i-1} [1 - F(x)]^{n-i} f(x),
```
and the probability mass function ``g`` of the order statistic for discrete `dist` is
```math
g(x; n, i) = \\sum_{k=i}^n {n \\choose k} \\left( [F(x)]^k [1 - F(x)]^{n-k} - [F(x_-)]^k [1 - F(x_-)]^{n-k} \\right),
```
where ``x_-`` is the largest element in the support of `dist` less than ``x``.

For the joint distribution of a subset of order statistics, use
[`JointOrderStatistics`](@ref) instead.

## Examples

```julia
OrderStatistic(Cauchy(), 10, 1) # distribution of the sample minimum
OrderStatistic(DiscreteUniform(10), 10, 10) # distribution of the sample maximum
OrderStatistic(Gamma(1, 1), 11, 5) # distribution of the sample median
```
"""
struct OrderStatistic{D<:UnivariateDistribution,S<:ValueSupport} <:
UnivariateDistribution{S}
dist::D
n::Int
rank::Int
function OrderStatistic(
dist::UnivariateDistribution, n::Int, rank::Int; check_args::Bool=true
)
@check_args(OrderStatistic, 1 ≤ rank ≤ n)
return new{typeof(dist),value_support(typeof(dist))}(dist, n, rank)
end
end

minimum(d::OrderStatistic) = minimum(d.dist)
maximum(d::OrderStatistic) = maximum(d.dist)
insupport(d::OrderStatistic, x::Real) = insupport(d.dist, x)

params(d::OrderStatistic) = tuple(params(d.dist)..., d.n, d.rank)
partype(d::OrderStatistic) = partype(d.dist)
Base.eltype(::Type{<:OrderStatistic{D}}) where {D} = Base.eltype(D)
sethaxen marked this conversation as resolved.
Show resolved Hide resolved
Base.eltype(d::OrderStatistic) = eltype(d.dist)

# distribution of the ith order statistic from an IID uniform distribution, with CDF Uᵢₙ(x)
function _uniform_orderstatistic(d::OrderStatistic)
n = d.n
rank = d.rank
return Beta{Int}(rank, n - rank + 1)
end

function logpdf(d::OrderStatistic, x::Real)
b = _uniform_orderstatistic(d)
p = cdf(d.dist, x)
if value_support(typeof(d)) === Continuous
return logpdf(b, p) + logpdf(d.dist, x)
else
return logdiffcdf(b, p, p - pdf(d.dist, x))
end
end

for f in [:logcdf, :logccdf, :cdf, :ccdf]
sethaxen marked this conversation as resolved.
Show resolved Hide resolved
@eval begin
function $f(d::OrderStatistic, x::Real)
b = _uniform_orderstatistic(d)
return $f(b, cdf(d.dist, x))
end
end
end

for f in [:quantile, :cquantile]
sethaxen marked this conversation as resolved.
Show resolved Hide resolved
@eval begin
function $f(d::OrderStatistic, p::Real)
# since cdf is Fᵢₙ(x) = Uᵢₙ(Fₓ(x)), and Uᵢₙ is invertible and increasing, we
# have Fₓ(x) = Uᵢₙ⁻¹(Fᵢₙ(x)). then quantile function is
# Qᵢₙ(p) = inf{x: p ≤ Fᵢₙ(x)} = inf{x: Uᵢₙ⁻¹(p) ≤ Fₓ(x)} = Qₓ(Uᵢₙ⁻¹(p))
b = _uniform_orderstatistic(d)
return quantile(d.dist, $f(b, p))
end
end
end

function rand(rng::AbstractRNG, d::OrderStatistic)
# inverse transform sampling. Since quantile function is Qₓ(Uᵢₙ⁻¹(p)), we draw a random
# variable from Uᵢₙ and pass it through the quantile function of `d.dist`
T = eltype(d.dist)
b = _uniform_orderstatistic(d)
return T(quantile(d.dist, rand(rng, b)))
end
2 changes: 2 additions & 0 deletions src/univariates.jl
Original file line number Diff line number Diff line change
Expand Up @@ -730,3 +730,5 @@ end
for dname in continuous_distributions
include(joinpath("univariate", "continuous", "$(dname).jl"))
end

include(joinpath("univariate", "orderstatistic.jl"))
Loading