Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add OrderStatistic and JointOrderStatistics distributions #1668

Merged
merged 46 commits into from
May 22, 2023

Conversation

sethaxen
Copy link
Contributor

This PR will add two new distributions:

  • OrderStatistic: a univariate distribution representing the distribution of the rank ith draw in an IID sample of length n from some (continuous or discrete) univariate distribution dist.
  • JointOrderStatistics: a continuous multivariate distribution representing the joint distribution of the rank r draws in the same sample.

These have a number of uses. For example, they can represent transformations that have been performed to data (e.g. outlier removal and calculation of summary statistics). They also can be used to draw confidence intervals, e.g. of ECDF plots. For JointOrderStatistics, in the extreme case where r=1:n, i.e. when the distribution is over all ranks, it can be used in a PPL like Turing to restrict an IID array of parameters to the subset of arrays that are ordered.

Relates #1643 #1655, closes #1284

@codecov-commenter
Copy link

codecov-commenter commented Jan 30, 2023

Codecov Report

Patch coverage: 98.11% and project coverage change: -0.03 ⚠️

Comparison is base (9cf6a74) 85.92% compared to head (2804662) 85.89%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1668      +/-   ##
==========================================
- Coverage   85.92%   85.89%   -0.03%     
==========================================
  Files         139      142       +3     
  Lines        8376     8560     +184     
==========================================
+ Hits         7197     7353     +156     
- Misses       1179     1207      +28     
Impacted Files Coverage Δ
src/Distributions.jl 100.00% <ø> (ø)
src/multivariates.jl 44.82% <ø> (ø)
src/univariate/orderstatistic.jl 96.66% <96.66%> (ø)
src/multivariate/jointorderstatistics.jl 98.66% <98.66%> (ø)
src/univariates.jl 75.45% <100.00%> (+0.90%) ⬆️

... and 4 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

@sethaxen
Copy link
Contributor Author

I've only seen the expression of the density of joint order statistics 1:n or [i,j], so for completeness, here's the derivation for the more general case used in JointOrderStatistics:

Derivation

Given the density function $f_X$ of the sample distribution, with CDF $F_X$, the joint density for all order statistics is
$$f_{1\ldots n:n}(x_1, \ldots, x_n) = n! \prod_{k=1}^n f_X(x_k)$$

Suppose now we marginalize all $x_k$ for $i &lt; k &lt; j$, where $i$ and $j$ are integers in $1\ldots n$. The resulting density is
$$f_{1\ldots i,j\ldots n:n}(x_1, \ldots, x_i, x_j, \ldots x_n) = n! \left(\prod_{k=1}^i f_X(x_k)\right) \left(\prod_{k=j}^n f_X(x_k)\right) C_{ij}(x_i, x_j),$$
where
$$C_{ij}(x_i, x_j) = \int_{x_i}^{x_j} \cdots \int_{x_i}^{x_{i+3}}\int_{x_i}^{x_{i+2}} f_X(x_{i+1}) f_X(x_{i+2}) \cdots f_X(x_{j-1}) \mathrm{d}x_{i+1} \mathrm{d}x_{i+2} \cdots \mathrm{d}x_{j-1}.$$

Let $u_k = F_X(x_k) - F_X(x_i)$. Then $\mathrm{d}u_k = f_X(x_k) \mathrm{d}x_k$, and
$$C_{ij}(x_i, x_j) = \int_{0}^{u_j} \cdots \int_{0}^{u_{i+3}}\int_{0}^{u_{i+2}} \mathrm{d}u_{i+1} \mathrm{d}u_{i+2} \cdots \mathrm{d}u_{j-1} = \frac{u_j^{j-i-1}}{(j-i-1)!} = \frac{(F_X(x_j) - F_X(x_i))^{j-i-1}}{(j-i-1)!}$$

We can repeat this to get the joint density of any subset of order statistics.

Copy link
Contributor Author

@sethaxen sethaxen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't added them to the docs yet. Univariate distributions are currently separated into Continuous and Discrete categories, but OrderStatistic can be either. Plus, JointOrderStatistics and OrderStatistic are so closely related, it might make sense to give them their own docs page. Could even be a page devoted to distributions of statistics of samples, in case in the future someone adds asymptotic distributions of statistics like quantiles, etc.

Comment on lines 115 to 116
# this is slow if length(d.r) is close to n and quantile for d.dist is expensive,
# but this branch is probably taken when length(d.r) is small or much smaller than n.
Copy link
Contributor Author

@sethaxen sethaxen Feb 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably maybe worth creating samplers for these two cases.

JointOrderStatistics(Cauchy(), 10, [1, 10]) # joint distribution of the extrema
```
"""
struct JointOrderStatistics{D<:ContinuousUnivariateDistribution,R<:AbstractVector{Int}} <:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the continuous case, we can write down the PDF and not much else (except the mean and covariance for a handful of dists, for a future PR).

For the discrete case, the PDF is much more complicated and probably only tractable when length(r) <= 2. For a discrete dist whose support is a finite set, we can define a distribution of the order statistics for a sample without replacement, at least in the bivariate case. The PMF, mean, and covariance are known. I wonder if that should be a special case of this distribution.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if that should be a special case of this distribution.

Nah, seems complicated and not all that useful.

@sethaxen sethaxen marked this pull request as ready for review February 9, 2023 19:09
@sethaxen
Copy link
Contributor Author

Bump @devmotion

Copy link
Member

@devmotion devmotion left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the embarassingly long delay, I had completely forgotten this PR (again).

I made a few comments. I also wonder if we could add a comparison with existing R implementations, to exploit additionallu the existing test infrastructure and run standardized tests for evaluations and sampling?

Plus, JointOrderStatistics and OrderStatistic are so closely related, it might make sense to give them their own docs page

I agree, I think a separate page in the docs could be useful.

src/multivariate/jointorderstatistics.jl Outdated Show resolved Hide resolved
src/multivariate/jointorderstatistics.jl Outdated Show resolved Hide resolved
src/multivariate/jointorderstatistics.jl Show resolved Hide resolved
src/multivariate/jointorderstatistics.jl Outdated Show resolved Hide resolved
src/multivariate/jointorderstatistics.jl Outdated Show resolved Hide resolved
src/multivariate/jointorderstatistics.jl Outdated Show resolved Hide resolved
src/univariate/orderstatistic.jl Outdated Show resolved Hide resolved
src/univariate/orderstatistic.jl Show resolved Hide resolved
src/univariate/orderstatistic.jl Outdated Show resolved Hide resolved
src/multivariate/jointorderstatistics.jl Outdated Show resolved Hide resolved
@sethaxen
Copy link
Contributor Author

Sorry for the embarassingly long delay, I had completely forgotten this PR (again).

No problem!

I also wonder if we could add a comparison with existing R implementations, to exploit additionallu the existing test infrastructure and run standardized tests for evaluations and sampling?

Unfortunately I don't think there are any standard R implementations of distributions of order statistics against which we can compare.

I agree, I think a separate page in the docs could be useful.

Done!

@sethaxen sethaxen requested a review from devmotion May 17, 2023 16:32
src/multivariate/jointorderstatistics.jl Outdated Show resolved Hide resolved
src/multivariate/jointorderstatistics.jl Outdated Show resolved Hide resolved
src/multivariate/jointorderstatistics.jl Outdated Show resolved Hide resolved
function _rand!(rng::AbstractRNG, d::JointOrderStatistics, x::AbstractVector{<:Real})
n = d.n
if n == length(d.r) # r == 1:n
# direct method, slower than inversion method for large `n` and distributions with
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, no I'm not aware of such an example. Might be good enough for now.

src/multivariate/jointorderstatistics.jl Show resolved Hide resolved
@sethaxen sethaxen requested a review from devmotion May 18, 2023 17:00
Copy link
Member

@devmotion devmotion left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great PR, looks good to me @sethaxen!

I spotted only a few final things that, IMO, should be addressed before merging the PR. But I already approve now to indicate that it seems fine to me otherwise.

docs/src/order_statistics.md Outdated Show resolved Hide resolved
src/univariate/orderstatistic.jl Outdated Show resolved Hide resolved
src/univariate/orderstatistic.jl Outdated Show resolved Hide resolved
Co-authored-by: David Widmann <devmotion@users.noreply.github.com>
@sethaxen
Copy link
Contributor Author

Great PR, looks good to me @sethaxen!

Great, thanks for the review!

@sethaxen
Copy link
Contributor Author

I realized that when we evaluated pdf(::JointOrderStatistics, x) for unsorted x, an error would be raised by logdiffcdf, so I fixed that. This should now be ready to merge.

@devmotion
Copy link
Member

Great! I'll merge it within the next days, if there are no objections.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Adding order statistics sampling
3 participants