
Statistics from measures #14

Open · sethaxen opened this issue Aug 24, 2021 · 14 comments
@sethaxen (Member)

There are a number of properties we probably want to support computing from our measures:

  • mean
  • median
  • mode
  • std/var
  • cov
  • skewness/kurtosis/moment

Most of these will be unknown for our measures, but some of them are known and can be implemented. Some considerations:

Intrinsic or extrinsic?

The intrinsic (i.e. Riemannian) mean is a point on the manifold that minimizes the Riemannian variance of the measure. The extrinsic mean is the point in the embedding of the manifold that does the same. The intrinsic mean is guaranteed to be on the manifold but not necessarily in the support of the measure, while the extrinsic mean is generally only in the embedding. I suppose one could also want the point in the support of the measure that minimizes the variance, but I haven't seen this used. How do we support computing both intrinsic and extrinsic means? My idea is that we follow Manifolds.jl's lead and implement these functions with the manifold as the first argument, e.g.

Statistics.mean(d::SomeManifoldMeasure) = Statistics.mean(base_manifold(d), d) # default to intrinsic mean
Statistics.mean(::M, d::SomeManifoldMeasure{M}) where {M<:AbstractManifold} = ... # intrinsic mean
Statistics.mean(::Euclidean{𝔽}, d::SomeManifoldMeasure{<:AbstractManifold{𝔽}}) = ... # extrinsic mean
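
For reference, writing ν for the measure, d_M for the Riemannian distance on the manifold M, and ‖·‖_E for the distance of the embedding E, the two means above solve

    μ_intrinsic = argmin_{p ∈ M} ∫_M d_M(p, x)² dν(x)
    μ_extrinsic = argmin_{p ∈ E} ∫_M ‖p − x‖²_E dν(x)

i.e. the intrinsic mean is the Fréchet mean of ν.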

Defaults without ambiguity may be easier if all of our measures inherit an AbstractManifoldMeasure{M<:AbstractManifold} type. The interface could also be defined for other measures though, e.g. Dirichlet defined in MeasureTheory.jl.
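
A minimal sketch of that supertype idea (the accessor and field below are hypothetical; the point is the abstract type and the one-argument default):

using Statistics
using ManifoldsBase: AbstractManifold

abstract type AbstractManifoldMeasure{M<:AbstractManifold} end

# hypothetical accessor: concrete measures store their manifold in a field
import ManifoldsBase: base_manifold
base_manifold(d::AbstractManifoldMeasure) = d.manifold

# the default can then be written once for all measures, without ambiguity
Statistics.mean(d::AbstractManifoldMeasure) = Statistics.mean(base_manifold(d), d)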

Uniqueness

While even for distributions on Euclidean space the mode is not necessarily unique, for distributions on manifolds the intrinsic mean is often not unique either. For example, every point on the sphere minimizes the variance of the normalized Hausdorff measure. Similarly, the mean/mode of the Watson distribution on the sphere is always either a set of two antipodes or all points on a great circle. For these functions to be useful at all, I propose we return any mode (a point of maximal density) and any intrinsic mean.
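
As a hedged sketch of "return any intrinsic mean", take the sphere example above (the measure type here is hypothetical):

using Statistics
using Manifolds: Sphere, manifold_dimension

struct NormalizedHausdorff{M}  # hypothetical: the normalized uniform measure on M
    manifold::M
end

# every point on the sphere minimizes the variance, so deterministically
# return a fixed representative (the "north pole")
function Statistics.mean(M::Sphere, ::NormalizedHausdorff{<:Sphere})
    n = manifold_dimension(M) + 1  # ambient dimension
    return [1.0; zeros(n - 1)]
end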

MeasureTheory.jl doesn't implement these functions yet, but it's under discussion, see JuliaMath/MeasureTheory.jl#131.

Any thoughts, @kellertuer @mateuszbaran @cscherrer?

@kellertuer (Member)

In general I would prefer the intrinsic one – but then, I am always in favour of the intrinsic one.
The intrinsic mean on the support seems to be an interesting constrained optimisation problem on its own – I am actually working in that area currently, so if the measure has nice constraints, I might take a look, since the intrinsic mean is such a nice example itself.

I like the idea of the default (first signature), intrinsic, and extrinsic; though for full generality the first argument in the third signature could also be just some ::N where {N<:AbstractManifold} – sure, Euclidean is the most-used embedding, but you can also embed manifolds into other manifolds.
An example would be the SPD matrices, using their embedding into the symmetric matrices – maybe just because that gives you a symmetric matrix as a result, maybe because you have some fancy representation of points/tangents that is only defined on symmetric matrices (and saves half the memory).

Concerning uniqueness and returning all results: this might be a little hard if you cannot get from one result (e.g. computed by some optimisation algorithm for the mean) to all of them. To return all of them, you would have to use “tricky starting points” so that your optimisation algorithm converges to each result at least once.
So I agree that returning any point would be fine, but I would add “in a deterministic manner”, that is, if you call it with a measure twice, you get the same result.

@sethaxen (Member Author)

> The intrinsic mean on the support seems to be an interesting constrained optimisation problem on its own – I am actually working in that area currently, so if the measure has nice constraints, I might take a look, since the intrinsic mean is such a nice example itself.

I don't have any good examples right now. But it's also a bit relative: e.g. suppose we have a measure whose support is a manifold M, a submanifold of a Euclidean space, but since we don't have an implementation of M, for convenience we set the base manifold to Euclidean. Then the intrinsic mean is equivalent to the extrinsic one! Now we define M and set the base manifold to M. The extrinsic mean stays the same, but the intrinsic mean changes. So I think the proposed interface is actually a little more explicit without using the terms "intrinsic" and "extrinsic": the measure defines the support used to determine the variance, while the first argument defines the set over which we compute the minimizer.

> I like the idea of the default (first signature), intrinsic, and extrinsic; though for full generality the first argument in the third signature could also be just some ::N where {N<:AbstractManifold} – sure, Euclidean is the most-used embedding, but you can also embed manifolds into other manifolds.

I don't follow what the third argument would be doing here. Currently we require that all manifold measures carry in their type the manifold on which they are defined. The first argument is used to specify the constraint applied to the mean. The "intrinsic" mean is when the first argument and the base manifold of the measure are identical; the extrinsic mean is when the first argument is the result of get_embedding called on the base manifold of the measure. But one could in principle put other manifolds there. When there is an obvious preferred embedding of the base manifold into the manifold in the first argument, this could be implemented easily, but otherwise one might need to provide an embedding map, and that may be more advanced than we want to be right now.
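
To illustrate that reading (the measure d is hypothetical; get_embedding is from Manifolds.jl):

using Manifolds, Statistics

S = Sphere(2)
E = get_embedding(S)  # Euclidean(3)

# mean(S, d)  -> constraint set is the base manifold: the "intrinsic" mean
# mean(E, d)  -> constraint set is the embedding: the "extrinsic" mean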

> Concerning uniqueness and returning all results: this might be a little hard if you cannot get from one result (e.g. computed by some optimisation algorithm for the mean) to all of them. To return all of them, you would have to use “tricky starting points” so that your optimisation algorithm converges to each result at least once.

Yes, I don't think we should return all results. If we had a nice way to define arbitrary sets, then maybe, but we don't, so I think not.

> So I agree that returning any point would be fine, but I would add “in a deterministic manner”, that is, if you call it with a measure twice, you get the same result.

I agree! Similar to how log on the Sphere, when the two points are antipodal, deterministically returns one tangent vector from the ball of possibilities.

@kellertuer (Member)

> > The intrinsic mean on the support seems to be an interesting constrained optimisation problem on its own – I am actually working in that area currently, so if the measure has nice constraints, I might take a look, since the intrinsic mean is such a nice example itself.

> I don't have any good examples right now. But it's also a bit relative: e.g. suppose we have a measure whose support is a manifold M, a submanifold of a Euclidean space, but since we don't have an implementation of M, for convenience we set the base manifold to Euclidean. Then the intrinsic mean is equivalent to the extrinsic one! Now we define M and set the base manifold to M. The extrinsic mean stays the same, but the intrinsic mean changes. So I think the proposed interface is actually a little more explicit without using the terms "intrinsic" and "extrinsic": the measure defines the support used to determine the variance, while the first argument defines the set over which we compute the minimizer.

Oh, this was not a critique of the interface, just a remark that the constrained (to the support) problem might be interesting :)

> > I like the idea of the default (first signature), intrinsic, and extrinsic; though for full generality the first argument in the third signature could also be just some ::N where {N<:AbstractManifold} – sure, Euclidean is the most-used embedding, but you can also embed manifolds into other manifolds.

> I don't follow what the third argument would be doing here. Currently we require that all manifold measures carry in their type the manifold on which they are defined. The first argument is used to specify the constraint applied to the mean. The "intrinsic" mean is when the first argument and the base manifold of the measure are identical; the extrinsic mean is when the first argument is the result of get_embedding called on the base manifold of the measure. But one could in principle put other manifolds there. When there is an obvious preferred embedding of the base manifold into the manifold in the first argument, this could be implemented easily, but otherwise one might need to provide an embedding map, and that may be more advanced than we want to be right now.

Not a third argument; just that in the third signature the first argument could be a manifold other than Euclidean (though not necessarily the manifold from the measure). So the first (constraint) manifold can be the symmetric matrices – that is what I meant.

> > So I agree that returning any point would be fine, but I would add “in a deterministic manner”, that is, if you call it with a measure twice, you get the same result.

> I agree! Similar to how log on the Sphere, when the two points are antipodal, deterministically returns one tangent vector from the ball of possibilities.

Exactly.

@mateuszbaran (Member)

I don't really like the two-argument variants of mean: a measure knows its manifold, and the manifold knows its embedding, so there is not much point in providing the manifold separately. What about an AbstractStatisticType with IntrinsicStatistic and ExtrinsicStatistic as two possible concrete types? Then we'd have:

Statistics.mean(d::SomeManifoldMeasure, ::IntrinsicStatistic) = ...
Statistics.mean(d::SomeManifoldMeasure, ::ExtrinsicStatistic) = ...
Statistics.mean(d::SomeManifoldMeasure, ::ExtrinsicStatisticInADifferentEmbedding) = ...

etc.? It could also be a keyword argument.

AbstractStatisticType would apply to all of these functions.
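
A minimal sketch of what this trait hierarchy could look like (all names hypothetical, following the proposal above):

using ManifoldsBase: AbstractManifold

abstract type AbstractStatisticType end

struct IntrinsicStatistic <: AbstractStatisticType end
struct ExtrinsicStatistic <: AbstractStatisticType end

# an extrinsic statistic in an explicitly chosen embedding
struct ExtrinsicStatisticIn{N<:AbstractManifold} <: AbstractStatisticType
    embedding::N
end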

@kellertuer (Member)

...Then you could also do the extrinsic/intrinsic variants with an explicit EmbeddedManifold specification, that is, Sphere(n) (“only” an AbstractEmbeddedManifold) would be intrinsic, but EmbeddedManifold(Sphere(n), Euclidean(n+1)) would be extrinsic.
That would also resolve my point about different embeddings, since for the SPD matrices you would just specify EmbeddedManifold(SymmetricPositiveDefinite(n), Symmetric(n)).

Internally I would maybe first call get_manifold() and get_embedding() and dispatch on those, though.
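
A hedged sketch of that internal dispatch (SomeManifoldMeasure and _extrinsic_mean are hypothetical; base_manifold and get_embedding are from Manifolds.jl):

using Manifolds, Statistics

function Statistics.mean(M::EmbeddedManifold, d::SomeManifoldMeasure)
    # unwrap the explicit embedding specification and dispatch on its parts
    return _extrinsic_mean(base_manifold(M), get_embedding(M), d)
end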

@mateuszbaran (Member)

Yes, we could definitely encode the embedding into Symmetric as a constraint in an AbstractStatisticType. It would also be possible to include some "empirical" estimation methods as AbstractStatisticTypes (drawing samples and computing the statistic from them). I think this would be the most flexible approach.
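
A sketch of such an "empirical" statistic type, assuming the AbstractStatisticType above and hypothetical measure-side rand support; mean(M, pts) is Manifolds.jl's Riemannian mean of sample points:

using Random, Statistics
using Manifolds

struct EmpiricalStatistic{R<:Random.AbstractRNG} <: AbstractStatisticType
    rng::R
    n_samples::Int
end

function Statistics.mean(d::SomeManifoldMeasure, s::EmpiricalStatistic)
    pts = [rand(s.rng, d) for _ in 1:s.n_samples]  # draw samples from the measure
    return mean(base_manifold(d), pts)             # Riemannian mean of the samples
end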

@kellertuer (Member)

Oh, I meant to encode it in the manifold; then you would not need the AbstractStatisticType, if you specify the embedding by an explicit (i.e. not the abstract) embedded manifold.

@mateuszbaran (Member)

One of the drawbacks of EmbeddedManifold is that for two given manifolds, it only lets us do one embedding, while AbstractStatisticType could describe the desired embedding.

@kellertuer (Member)

Can you provide an example where EmbeddedManifold would not be enough?

Here are the cases I have in mind:

  • Statistics.mean(d::SomeManifoldMeasure) = Statistics.mean(base_manifold(d), d), i.e. the same as the next
  • Statistics.mean(::M, d::SomeManifoldMeasure{M}) where {M<:AbstractManifold}, i.e. the intrinsic one
  • Statistics.mean(::EmbeddedManifold{𝔽, M, N}, d::SomeManifoldMeasure{M}) where {M<:AbstractManifold, N<:AbstractManifold}, i.e. the extrinsic mean (of the manifold M in its embedding N)

so I do not get what you mean with “one” embedding – I can choose different N for sure (for example Symmetric(n) or Euclidean(n,n) as described before).

@mateuszbaran (Member)

Look at SpecialEuclideanInGeneralLinear: if I wanted to embed SE in GL via -affine_matrix instead of affine_matrix, I couldn't do it, because EmbeddedManifold couples the two manifolds to a single embedding. I could override the embedding, but I can't use two different embeddings in one Julia process without overriding.

@kellertuer (Member)

But then we should maybe, in the long run, allow an EmbeddedManifold to represent different embeddings between the same two manifolds?

@mateuszbaran (Member)

Yes, sure, that would make sense.

@sethaxen (Member Author)

I would prefer we try to keep the interface here as simple as possible. Supporting alternative embeddings would be nice, but I agree that it would be better to handle that at the Manifolds level and then use that machinery here.

A few general annoyances:

  • For @kellertuer's proposal, EmbeddedManifold is currently used to override the default embedding, so e.g. EmbeddedManifold(Sphere(n), Euclidean(n+1)) should behave the same as Sphere(n), right? But in this case, mean(::EmbeddedManifold{𝔽, M, N}, d::SomeManifoldMeasure{M}) where {M<:AbstractManifold, N<:AbstractManifold} would do something different from mean(::M, d::SomeManifoldMeasure{M}) where {M<:AbstractManifold}.
  • For @mateuszbaran's proposal, I'm not a huge fan of implementing lots of AbstractStatistic types to get different behavior of the statistics, especially when most users will just want one of two. But Manifolds also has a completely different way of specifying whether the extrinsic or intrinsic mean should be computed, and if we went that route, it would be nice to harmonize. E.g. perhaps the type of mean to be computed and the method used to compute it should be encoded in different arguments to mean in Manifolds.jl. Then we could reuse the same machinery here: if a user passed one of the mean estimation methods to our mean, it would draw samples and then estimate the mean from them with the provided method (see the sketch after this list).
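A hedged sketch of that harmonization (the measure-side method is hypothetical; AbstractEstimationMethod and mean(M, pts, method) are Manifolds.jl machinery):

using Statistics
using Manifolds

function Statistics.mean(d::SomeManifoldMeasure,
                         method::Manifolds.AbstractEstimationMethod;
                         n_samples=1_000)
    pts = [rand(d) for _ in 1:n_samples]        # sample from the measure
    return mean(base_manifold(d), pts, method)  # e.g. GradientDescentEstimation()
end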

@kellertuer (Member)

In principle it would behave the same; the point of specifying EmbeddedManifold(Sphere(n), Euclidean(n+1)) here is to distinguish this one (extrinsic) from Sphere(n) (an AbstractEmbeddedManifold that behaves the same but is not an EmbeddedManifold in the concrete, exactly-that-type sense).

So it is intentional that they behave differently here, because it is just by coincidence that Sphere(n) (intrinsic) methods are computed in the embedding (though we have intrinsic methods!).

Is that too confusing? I had hoped that this is an understandable distinction...
