
Kdm/univariateboundaryfix #457

Open
wants to merge 1 commit into master

Conversation

KrisDM
Contributor

KrisDM commented Jan 30, 2016

Fixes errors in the boundary conditions for univariate distributions.

These errors were partly caused by the fact that, for some distributions,
the boundary points of the support are not themselves in the support (in other words, the support is an open or half-open interval).

I solved this by allowing the support to be a closed, half-open or open interval.
The default is a closed interval (which corresponds to the old behaviour), meaning that only
those distributions with a half-open or open support interval needed to be
amended. The core of this functionality is in the file src/univariates.jl.
For half-open or open intervals, the macro @distr_boundaries specifies
which boundaries of the interval are open or closed. See
src/univariate/continuous/lognormal.jl for an example.
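
To illustrate the distinction (a minimal sketch with made-up helper names, not the macro machinery actually used in this PR):

using Distributions

# Illustrative helpers only, not this PR's API. The point: with an open lower
# boundary, the boundary point itself is excluded from the support.
closed_insupport(d::UnivariateDistribution, x::Real) = minimum(d) <= x <= maximum(d)   # old behaviour
open_lower_insupport(d::UnivariateDistribution, x::Real) = minimum(d) < x <= maximum(d)

closed_insupport(LogNormal(), 0.0)      # true under the old closed-interval assumption
open_lower_insupport(LogNormal(), 0.0)  # false: LogNormal's support is the open interval (0, Inf)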

Tests have been added to test/utils.jl and src/testutils.jl.

A second set of errors came from sloppy implementations of the pdf, logpdf,
cdf, ccdf, logcdf and logccdf functions of several distributions, which did not explicitly test whether the
presented values fall inside or outside the support. Tests that I added
in src/testutils.jl flagged, I believe, all instances of such errors,
and those errors have been fixed.
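
Roughly, the kind of guard these fixes add (a hedged sketch with a hypothetical function name, not the literal diff):

using Distributions

# Illustrative sketch only, not code from this PR: an explicit boundary check
# before evaluating the log-normal density formula.
function guarded_lognormal_pdf(d::LogNormal, x::Float64)
    x > 0.0 || return 0.0                      # 0.0 and negative x are outside the pdf's domain
    μ, σ = params(d)
    return exp(-(log(x) - μ)^2 / (2 * σ^2)) / (x * σ * sqrt(2π))
end

guarded_lognormal_pdf(LogNormal(), 0.0)   # 0.0 rather than NaN
guarded_lognormal_pdf(LogNormal(), -1.0)  # 0.0 rather than a DomainError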

@andreasnoack
Member

Thanks for doing this systematically. There are quite a few changes here, so it might take some time to review. I'll try to do it on Monday.

@KrisDM
Contributor Author

KrisDM commented Jan 30, 2016

Great, thanks.

Btw, it somehow seems to have lumped the commits I did for PR #455 into this PR. I don't know how that happened, as I created a new branch and a new PR for this. But there are no conflicts: this work fixes errors in univariates, whereas PR #455 adds a new multivariate distribution.

@andreasnoack
Member

@KrisDM Please remove the commits that are not related to this PR. I know it can be useful to work on top of your other branches, but it makes the review harder.

@KrisDM
Contributor Author

KrisDM commented Feb 1, 2016

How do I do that? I have the impression it automatically merged the first PR once I made the second PR. It's not something I did knowingly (except for the last commit, which I made in this branch because I thought the others had already been merged).

@andreasnoack
Member

My guess is that you started out from the branch that you used for preparing the other PRs.

There are probably several ways of fixing this, but I would first copy the changes to a different location. Then I'd reset this branch to origin/master, i.e.

git fetch origin
git reset --hard origin/master

and then add the relevant changes back to the branch, followed by

git push --force fork kdm/univariateboundaryfix

when you are done. If someone else has a better solution, feel free to suggest it.

@KrisDM
Contributor Author

KrisDM commented Feb 1, 2016

OK, thanks for the tips, will get to it later tonight.

@KrisDM
Contributor Author

KrisDM commented Feb 2, 2016

@andreasnoack I cleaned up the commits, squashing only the univariate-related commits into one, so if you don't see any other issues with my fix you can merge the PR straight away.

I ended up doing it slightly differently from your suggestions, using git rebase -i origin/master to separate the commits and then doing a git push --force to my branch.

You were right about the cause: my original error was not starting the new branch from master, but from the previously created branch where I had done the multivariate work.

@KrisDM KrisDM closed this Feb 2, 2016
@KrisDM KrisDM reopened this Feb 2, 2016
@KrisDM
Contributor Author

KrisDM commented Feb 2, 2016

Closed and reopened the PR to trigger a new Travis build (the previous one had failed for an external reason).

@andreasnoack
Member

@KrisDM I'm wondering if we can avoid the extra complexity you add here by using a few floating point tricks. It would be useful if you could give specific examples of the issues you are seeing, but a guess could be that you are e.g. trying to avoid the NaN in logpdf(Chi(1), 0.0). If that is the case, I think it is sufficient to have something like

function logpdf(d::Chi, x::Float64)
    ν = d.ν
    if insupport(d, x)
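        # the max is meant to handle the (ν - 1) * log(x) term, which gives NaN when ν == 1 and x == 0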
        return (1.0 - 0.5 * ν) * logtwo + max((ν - 1.0) * log(x), -Inf) - 0.5 * x^2 - lgamma(0.5 * ν)
    else
        return -Inf
    end
end

@KrisDM
Contributor Author

KrisDM commented Feb 4, 2016

@andreasnoack The issues were somewhat different for each of the distributions I made changes to. Sometimes it was about avoiding a NaN at the boundary point of the support, as you say, but that doesn't cover all cases. I also came across some domain errors being thrown.

I thought that this was a principled solution because in the mathematical formulation of distributions, some support intervals are closed, some are open, and some are half-open. In the Julia code, however, all support intervals were regarded as closed.

I don't think this adds unnecessary complexity. In terms of computation, probably none, as the compiler likely inlines everything away. Developers of the Distributions package who add a new distribution with a half-open or open support interval only need one macro call, following the same pattern as the existing @distr_support macro, and then need not worry about what happens at the boundary points. And finally, users of the Distributions package are entirely shielded.

In contrast, the fix you suggest would have to be applied separately to each affected function, and would be different for each of them. It adds at least one and possibly more operations per function (functions that may need to be called often) and makes the code less readable. In my experience, a fix like that always comes back to bite in the future, so I'm not a fan.

Fixes errors in the boundary conditions for univariate distributions.

These errors were partly caused by the fact that, for some distributions,
the boundary points of the support are not themselves in the support.

I solved this by allowing the support to be a closed, half-open or open interval.
The default is a closed interval (which corresponds to the old behaviour), meaning that only
those distributions with a half-open or open support interval needed to be
amended. The core of this functionality is in the file src/univariates.jl.
For half-open or open intervals, the macro @distr_boundaries specifies
which boundaries of the interval are open or closed. See
src/univariate/continuous/lognormal.jl for an example.

Tests have been added to test/utils.jl and src/testutils.jl.

A second set of errors came from sloppy implementations of the pdf, logpdf,
cdf, ccdf, logcdf and logccdf functions of specific distributions
that did not explicitly test whether the presented values fall inside or outside
the support. Tests that were added in src/testutils.jl flagged, I believe,
all instances of such errors, and those errors have been fixed.

I spotted 2 additional errors in the support() function in
src/univariates.jl.

There were no tests for this function, so I added them to
test_support() in src/testutils.jl.
@KrisDM
Contributor Author

KrisDM commented Feb 5, 2016

@andreasnoack I fixed the indentation, which was off in two places.

By chance I met the developer of Juno yesterday. I asked him to change the default to 4 spaces per tab :-).

@andreasnoack
Member

I'm still not sold on this change. I think this should just be handled in each definition of the special functions instead of introducing new types, functions and a macro. This package is already quite heavy in terms of compilation, and I don't think this change makes the code simpler to understand.

If the special functions don't return NaNs or throw, then you don't have to care in other places whether the interval includes the boundary point or not. It might take some work to go through, but it wouldn't be much more work than this PR.

I wouldn't worry about the max. It's just a single comparison in a function that calls lgamma and log, so it will be almost impossible to measure a difference.

Finally, there are some issues with the implementation here which will probably result in a serious slowdown. The reason is that the comparison function ends up being determined by a runtime value. One day the compiler might be smart enough to figure out that the support symbols are constant for a specific distribution, but right now you'll see things like

julia> @code_warntype insupport(Chisq(1), 2.0)
Variables:
  d::Distributions.Chisq
  x::Float64

Body:
  begin  # /Users/andreasnoack/.julia/v0.4/Distributions/src/univariates.jl, line 52:
      unless ((Distributions.comparator)((top(apply_type))(Distributions.Val,:closed)::Type{_<:Val{T}})::F)(0.0,x::Float64)::Any goto 0
      return ((Distributions.comparator)((top(apply_type))(Distributions.Val,:closed)::Type{_<:Val{T}})::F)(x::Float64,Distributions.Inf)::Any
      0: 
      return false
  end::Any

This is one of the reasons why I was talking about complexity.
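
To make the distinction concrete, here is a purely illustrative sketch (not the code in this PR) of a comparison chosen through a runtime value versus one fixed by the method signature:

# Illustrative only, not the PR's implementation.

# Runtime selection: which operator to use depends on a Symbol value.
comparator(::Type{Val{:closed}}) = <=
comparator(::Type{Val{:open}})   = <
lower_ok_runtime(kind::Symbol, lo::Float64, x::Float64) = comparator(Val{kind})(lo, x)

# Compile-time selection: the boundary kind is part of the method signature.
lower_ok(::Type{Val{:closed}}, lo::Float64, x::Float64) = lo <= x
lower_ok(::Type{Val{:open}},   lo::Float64, x::Float64) = lo <  x

lower_ok_runtime(:open, 0.0, 0.0)  # false; the operator is picked at run time
lower_ok(Val{:open}, 0.0, 0.0)     # false; the operator is fixed per method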

@KrisDM
Contributor Author

KrisDM commented Feb 6, 2016

@andreasnoack Fair point that the implementation has issues. I thought the compiler would be smart enough to figure it out. I'll see if I can fix it.

In general I think that, because the root cause of the problem lies with the insupport function, it needs to be fixed inside the insupport function, and not with a custom hack in each of the calculations, which is error-prone and less maintainable.

Another point: a large part of the complexity of my fix comes from wanting it to apply equally to the auxiliary RealInterval type, which isn't used anywhere inside the Distributions package. I don't see why Distributions exports this type at all. I checked a handful of downstream packages using Distributions and have seen no indication that anyone is using it, probably because it was never documented (neither the RealInterval type nor the function support(), which returns a RealInterval, is mentioned in the documentation). So if I am allowed to retire RealInterval and support(), we can get rid of some superfluous complexity in the package, and I'll be able to fix the insupport function far more easily with a uniform fix for all distributions.

What do you think?

@simonbyrne
Member

Sorry for getting to this so late, I've had other stuff on my plate.

Thanks for looking into this, but I have to agree with @andreasnoack: this seems a somewhat over-complicated solution to the problem. The whole open/closed distinction is somewhat meaningless with regard to distributions, since the only case where it can possibly make a difference is at +/-Inf (and even then, I don't really see any reason why Inf couldn't be in the support anyway).

@KrisDM
Contributor Author

KrisDM commented Feb 8, 2016

@simonbyrne Mmh, that's not true though. Quite a few distributions have a support interval with an open boundary at a finite point. E.g., the LogNormal has the open support interval (0, +Inf).

If I call insupport(LogNormal(),0.0), I would expect it to return false, not true as it does now.

There are several more. Another example, the Levy distribution, has the open support interval (mu, +Inf), with mu a parameter of the distribution.

As I said in my previous message, the solution would become a lot simpler if I am allowed to retire the undocumented but exported type RealInterval (defined at the top of univariates.jl, but used nowhere in the package). There is another Julia package implementing intervals, and it seems to me that the Distributions package shouldn't be exporting its own.

@simonbyrne
Member

See earlier discussion in #150 (comment)

@simonbyrne
Member

But I would agree that RealInterval isn't much use as it stands.

@KrisDM
Contributor Author

KrisDM commented Feb 8, 2016

#150 seems to be a somewhat different discussion.

Perhaps "support" has subtly different meanings in different branches of mathematics. Each Wikipedia page about a particular distribution defines the support as an open, half-open or closed interval, so it seems to be an accepted concept in mainstream probability theory.

These wiki pages about distributions link to a different wiki page about support from the one you posted in #150: https://en.wikipedia.org/wiki/Support_(mathematics)

If I understand that page correctly, then, unlike what you write in #150, there is a difference between uniform distributions with support (0.0, 1.0) and [0.0, 1.0]: the first should return pdf(0.0) = 0.0, while the second would return a value > 0.0.

I will propose another solution that retires RealInterval and fixes the insupport function for the specific distributions whose support is an open or half-open interval (as per their wiki definition). You can then see whether you like the solution or not. If you don't like that solution then so be it, but I have no interest in hacking each pdf, logpdf, cdf, ccdf, etc. function separately to get rid of the incorrect results. That's not right either.

@simonbyrne
Member

> These wiki pages about distributions link to a different wiki page about support from the one you posted in #150: https://en.wikipedia.org/wiki/Support_(mathematics)
>
> If I understand that page correctly, then, unlike what you write in #150, there is a difference between uniform distributions with support (0.0, 1.0) and [0.0, 1.0]: the first should return pdf(0.0) = 0.0, while the second would return a value > 0.0.

I don't see how you get that: the page you link to refers to the support of a function, not of a distribution (it does mention distributions and measures further down, but points out that there are subtleties involved).

@KrisDM
Contributor Author

KrisDM commented Feb 8, 2016

I agree it's not clear-cut, and that the support is indeed sometimes defined as a closure.

But then there seems to be a need to distinguish between those distributions whose pdf returns 0.0 at the boundary points and those whose pdf doesn't. Otherwise surely the support interval for the LogNormal wouldn't be defined as an open interval, as here: https://en.wikipedia.org/wiki/Log-normal_distribution (right-hand column).

It may be that those pages are actually wrong, or else it's one of the subtleties that the support page glosses over?

@simonbyrne
Member

I hesitate to say it, but I think the pages might be wrong. They seem to be treating it as "the support of the pdf", but pdfs themselves are only uniquely defined up to null sets. In other words, we can set the value of the pdf arbitrarily at the boundary points and not change the underlying distribution.

There are also more practical reasons for treating the intervals as closed. In the uniform (0, 1) case, we would have x = 1e-20 inside the support, but 1 - x outside it.
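
Concretely, in Float64 arithmetic:

x = 1e-20
x > 0.0        # true: x would lie inside an open (0, 1) support
1.0 - x < 1.0  # false: 1.0 - 1e-20 rounds to exactly 1.0, which would fall outside it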

@KrisDM
Contributor Author

KrisDM commented Feb 8, 2016

You may be right. On further reading, it seems there is a mix-up between "support of the distribution" (the set of values for which P(X = x) > 0) and the "domain of the pdf" (the values for which the function is defined). For some distributions these are the same, and for others they are not.

That mix-up is probably what gives rise to the confusion on the wiki pages, but it's also visible in the Julia code itself, whenever the insupport function is used inside the pdf calculation to determine what should actually be the domain. E.g., for the LogNormal the support is [0, +Inf], whereas the domain is x > 0.0.

So to be correct, we should not rely on the insupport function inside the calculation of the pdf, but instead do domain boundary checks, which may or may not agree with what the insupport function returns.

@KrisDM
Contributor Author

KrisDM commented Feb 8, 2016

In other words, it seems that pdf(LogNormal(),0.0) or pdf(LogNormal(),-1.0) should not return 0.0, but throw a DomainError?

@simonbyrne
Member

Not quite: the support isn't the set where P(X = x) > 0 (otherwise continuous distributions would have no support).

I think the general consensus we have reached is that we should avoid DomainErrors if possible, i.e. return pdf(d, x) == 0 for x outside the support. In the lognormal case it would also make sense to have pdf(d, 0) == 0, as that makes the pdf continuous, but this may not be the case for all distributions (e.g. Exponential).
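
Under that convention (an illustrative sketch of the intended behaviour, not a claim about what the package returned at the time):

using Distributions

pdf(LogNormal(), -1.0)      # 0.0: outside the support, no DomainError
pdf(LogNormal(), 0.0)       # 0.0: keeps the log-normal density continuous at the boundary
pdf(Exponential(2.0), 0.0)  # 0.5 == 1/θ: here the boundary value is positive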

@KrisDM
Contributor Author

KrisDM commented Feb 9, 2016

You're right, I meant pdf(x) > 0.0. I'll prep a fix when I have time. It seems the most valuable thing I did in the last fix was writing the tests that expose the boundary errors.

In the fix I propose not relying on insupport inside the calculations for distributions where the support and the domain of the pdf differ (the first always being a closed interval if we follow the strict definition of support as closure, the second not necessarily). Performing the correct boundary tests would be better than fiddling with the calculations.

@matbesancon
Member

What's the status on this one? It needs rebasing onto master, plus the fixes you mentioned in your last comment, @KrisDM.

@matbesancon
Member

ping @KrisDM :)

@KrisDM
Contributor Author

KrisDM commented Mar 29, 2019 via email

@aplavin aplavin mentioned this pull request Sep 1, 2024