Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CHANGE: Poisson distribution return type, change from float to integer #1093

Closed
ndebuhr opened this issue Jan 27, 2021 · 12 comments
Closed

CHANGE: Poisson distribution return type, change from float to integer #1093

ndebuhr opened this issue Jan 27, 2021 · 12 comments

Comments

@ndebuhr
Copy link
Contributor

ndebuhr commented Jan 27, 2021

Summary

As a discrete probability distribution, the intuitive return type for Poisson random variates would seemingly be an integer type covering the distribution's support. This output strategy is already employed by other discrete probability distributions in rand_distr, like the "Bernoulli-related" Binomial distribution, Geometric distribution, and Hypergeometric distribution.

Details

The Poisson distribution implementation would need to be revised, to return a u64 (if following the Binomial, Geometric, and Hypergeometric precedent), instead of the current Float return type - a breaking API change. In addition to the API change, the internal implementation/calculations would need some revisions.

Motivation

What is the motivation for this change?

Making this change could make the crate more intuitive and dependent programs cleaner. Fundamentally, Poisson random variates are unsigned integers, so representing them as signed floats seems unintuitive. To me, this seems like an API made to match an internal implementation, rather than an API to match real-world intuition.

Alternatives

The Poisson random variates can be either f64 or f32, but some basic type casting logic is nevertheless fairly straightforward. Any of the possible type casts require some thought around truncation, etc., but I don't see any major issues with the cast strategy.

Notes

Through some digging, I see that a u64 return type was the initial implementation #96, yet this was later modified c03e2c8. Are we calling this a "done deal" and settling on the float return type? If there's shared interest in revisiting this, I could take a stab at the PR.

@dhardy
Copy link
Member

dhardy commented Jan 27, 2021

See also #958.

@vks
Copy link
Collaborator

vks commented Jan 27, 2021

@dhardy I think most of the discussion is in #987.

As you noted, the reason we return an f64 is that this is the data type used internally. Forcing the output to be u64 would therefore be mostly cosmetic with the following disadvantages:

  • The returned number might not be representable as u64, introducing a panic or truncation we are currently avoiding.
  • The caller might want to have an f64 anyway, and would be forced to introduce a pointless conversion.

While I agree that we are leaking an implementation detail here, I'm not convinced the more intuitive API outweighs those disadvantages. (Note that in a way it is unavoidable to leak implementation details via the return type: There are many different ways to represent a number, and we either have to pick or make the code generic.)

@ndebuhr
Copy link
Contributor Author

ndebuhr commented Jan 27, 2021

Thanks for the replies! We are already making the code somewhat generic through the Poisson's Float use from num_traits. I wonder if we can refine/expand the generics implementation to get the best of both worlds. The Uniform distribution supports an extraordinarily wide variety of return types (and associated constructor argument types) - spanning both floats and integers. Is a similar strategy for Poisson an interesting possibility?

@dhardy
Copy link
Member

dhardy commented Jan 28, 2021

In #987 the argument against a u64 impl is that we shouldn't have potentially panicking code in the lib — but it's a weak argument since it is difficult to do better.

We could instead have an impl returning Result<u64, f64> — but I believe type inference will not work well (assuming we also keep the plain f64 impl).

So maybe the best thing is to add a u64 impl back (alongside the f64 impl) which panics when out of range. Considering we already have multiple impls (for F: Float) it shouldn't impact type inference. @vks?

@newpavlov
Copy link
Member

I also strongly believe that we should not have a spontaneously panicking impl. At the very least we should use something like let uint_res: u64 = float_res.try_into().unwrap_or(u64::MAX);. But IIRC TryInto<u64> is not implemented for f64, so we will have to write the conversion ourselves. Alternatively we could retry if a generated number falls out of u64 range.

@ndebuhr
Copy link
Contributor Author

ndebuhr commented Jan 29, 2021

That's a really interesting idea - rerunning generation, in the case of a number falling outside the u64 range. I would bet some of the other distributions are using acceptance-rejection algorithms for random variate generation - that wouldn't be exactly precedent, but pretty close. The Poisson generation efficiency for astronomically high values of lambda would be very low in this scenario - a situation that seems very unlikely to occur deliberately, but it could by accident. Do we also have to worry about super high lambda values leading to "silent failures" or "hangs" - where a massive number of variates are required for each u64 compliant variate? Some folks might think this sort of silent failure is worse than a spontaneously panicking impl.

@newpavlov
Copy link
Member

newpavlov commented Jan 29, 2021

We could limit lambda to a certain "reasonable" range, e.g. to values smaller than usize::MAX or we can round it up to 10^20, thus achieving guarantee that in the worst case at least half of samples will fall into the desired range (for 10^20 this probability will be smaller, but probably reasonable enough). The distribution constructor can panic on values outside this range, so by documenting this condition the panic will not be spontaneous anymore.

@dhardy
Copy link
Member

dhardy commented Jan 29, 2021

@newpavlov this is now the behaviour of as since 1.45.0

Out of range float to int conversions using as has been defined as a saturating conversion. This was previously undefined behaviour, but you can use the {f64, f32}::to_int_unchecked methods to continue using the current behaviour, which may be desirable in rare performance sensitive situations.

Alternatively we could retry if a generated number falls out of u64 range.

Why? That affects the distribution more. And if you're really generating u64 values that large frequently enough to observe too many u64::MAX values, then you should already know that you're pushing the limits of your numeric type (and this should be pretty rare given how large 2^64 - 1 is). So IMO this is a bad idea.

@vks
Copy link
Collaborator

vks commented Feb 4, 2021

We still need to make a decision. Personally, I would go with suggesting to use rng.sample(Poisson) as u64 (assuming Rust >= 1.45, so this is not undefined behavior), instead of implementing the distribution for u64 and possibly other integer types. This makes the caller responsible for how the lossy conversion should be done.

@ndebuhr
Copy link
Contributor Author

ndebuhr commented Feb 9, 2021

Is there enough discussion and consensus that I should close the issue? Is there anything valuable that I could contribute or help with here? It seems like there is no obviously best solution, and leaving the Poisson as float seems reasonable to me.

@vks
Copy link
Collaborator

vks commented Feb 9, 2021 via email

@ndebuhr
Copy link
Contributor Author

ndebuhr commented Feb 18, 2021

Thanks team. I've submitted a PR for rationale documentation.
rust-random/book#34

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants