
Add binomial and Poisson distributions #96

Merged: 3 commits into rust-random:master on Mar 18, 2018

Conversation

@fizyk20 (Contributor) commented Jan 22, 2016

This pull request adds the binomial and Poisson distributions using "step-by-step" generation for low expected values and the rejection method for higher ones.

The code is heavily based on the algorithms presented in "Numerical Recipes in C". It is obviously not just copy-pasted, and the algorithms are widely known, but I'm still not sure whether that causes any licensing problems.
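
For readers less familiar with the terms: the "step-by-step" approach for Binomial(n, p) simply sums n independent Bernoulli trials, which is only practical while the expected value n * p is small. A rough sketch of that idea (not necessarily the PR's exact code):

    use rand::Rng;

    // Sketch of the "step-by-step" method: sum n independent Bernoulli(p)
    // trials. Cheap when n * p is small; larger cases use a rejection method.
    fn binomial_step_by_step<R: Rng + ?Sized>(rng: &mut R, n: u64, p: f64) -> u64 {
        let mut successes = 0;
        for _ in 0..n {
            if rng.gen::<f64>() < p {
                successes += 1;
            }
        }
        successes
    }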

@alexcrichton added the F-new-int Functionality: new, within Rand label Jun 14, 2017
@dhardy mentioned this pull request Sep 11, 2017
@mkindahl commented

Not sure what dependencies are allowed by the library, but POSIX.1-2001 defines both the tgamma and lgamma functions for computing the gamma function and the natural log of the gamma function, respectively. Hence the log-gamma function might only need to be defined here if the intention is not to depend on POSIX.

@fizyk20 (Contributor, PR author) commented Oct 24, 2017

I don't think the crate should depend on POSIX; it's also supposed to work on Windows, after all. That said, we could conditionally compile to use the POSIX lgamma on systems that support it, but I'm not sure it's worth the effort.

Also, it looks like this PR might be made obsolete by the future changes in the API anyway, so... ;)
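
For reference, conditionally using the POSIX lgamma could look roughly like the sketch below. The extern declaration is written by hand for illustration (it is not a binding that exists in this PR), and some platforms may additionally require linking against libm:

    // Sketch only: bind the C math library's lgamma on POSIX-like targets.
    #[cfg(unix)]
    extern "C" {
        fn lgamma(x: f64) -> f64;
    }

    #[cfg(unix)]
    fn log_gamma_posix(x: f64) -> f64 {
        // Note: C's lgamma also writes the sign to the global `signgam`,
        // so it is not strictly thread-safe on all platforms.
        unsafe { lgamma(x) }
    }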

@dhardy (Member) commented Oct 24, 2017

I don't know about obsolete. There's been no discussion yet of whether distributions should be moved to a separate crate; having these distributions would still be useful.

@dhardy (Member) commented Mar 4, 2018

@fizyk20 are you happy to update this now? I think we could merge now.

  • https in headers please
  • impl Distribution not *Sample

@fizyk20 (Contributor, PR author) commented Mar 4, 2018

Awesome! Rebased, fixed and squashed. If any other changes are necessary, please let me know.

@fizyk20 changed the title from "Added binomial and Poisson distributions" to "Add binomial and Poisson distributions" Mar 4, 2018
@fizyk20 force-pushed the discrete branch 2 times, most recently from 2ced48c to b7c59c6, March 4, 2018 13:54
@dhardy (Member) commented Mar 4, 2018

Thanks! Are you force-pushing? (I was told the page was out of date twice.) Just let me know when this is ready for review.

@fizyk20 (Contributor, PR author) commented Mar 4, 2018

Yes, I was. I realised that rustfmt had run automatically and modified pretty much the whole crate, so I amended the commit to keep the changes minimal. Should be ready now 👍

@dhardy (Member) left a comment

Ah, you remembered my aversions to rustfmt 🤣

Just had a quick look at the API and docs; looks good! I'd still like to go over the maths, though I don't expect any problems.


/// Calculates ln(gamma(x)) (natural logarithm of the gamma
/// function) using the Lanczos approximation with g=5
pub fn log_gamma(x: f64) -> f64 {
@dhardy (Member) commented Mar 4, 2018

Would be good to see a little more doc on this.

Edit: sorry, it's not exported so this is fine.

Contributor (PR author) replied

Haha, I added more before I noticed the edit :P Well, it won't hurt, and may help in reading the code ;)

Member replied

Looks good anyway 👍
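
For reference, a Lanczos-style ln-gamma routine typically looks like the classic gammln from "Numerical Recipes in C"; the sketch below follows that reference and is not necessarily identical to the code added in this PR:

    /// Sketch: ln(gamma(x)) for x > 0 via a Lanczos approximation, following
    /// the well-known `gammln` routine from "Numerical Recipes in C".
    fn log_gamma_sketch(x: f64) -> f64 {
        // Lanczos series coefficients.
        const COEFFS: [f64; 6] = [
            76.18009172947146,
            -86.50532032941677,
            24.01409824083091,
            -1.231739572450155,
            0.1208650973866179e-2,
            -0.5395239384953e-5,
        ];
        let mut tmp = x + 5.5;
        tmp -= (x + 0.5) * tmp.ln();
        let mut series = 1.000000000190015;
        let mut denom = x;
        for &c in COEFFS.iter() {
            denom += 1.0;
            series += c / denom;
        }
        // 2.5066282746310005 is sqrt(2 * pi).
        -tmp + (2.5066282746310005 * series / x).ln()
    }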

}

impl Distribution<u64> for Binomial {
    fn sample<R: Rng>(&self, rng: &mut R) -> u64 {
Member commented

We're using R: Rng + ?Sized for now. (I'm not sure whether to change to R: RngCore + ?Sized; I'm actually surprised this compiles since it differs from the trait.)

Contributor (PR author) replied

I guess it wouldn't compile if some code tried to use it with an unsized R - but I'd expect it to notice the difference, too. Maybe it's something that could be reported to the compiler team?

/// `n`, `p`. Panics if `p <= 0` or `p >= 1`.
pub fn new(n: u64, p: f64) -> Binomial {
    assert!(p > 0.0, "Binomial::new called with `p` <= 0");
    assert!(p < 1.0, "Binomial::new called with `p` >= 1");
Member commented

Why quote p in the assert messages? I think better not to (also for λ later)

}

impl Distribution<u64> for Poisson {
    fn sample<R: Rng>(&self, rng: &mut R) -> u64 {
Member commented

R: Rng + ?Sized again

/// The Poisson distribution `Poisson(lambda)`.
///
/// This distribution has density function: `f(k) = lambda^k *
/// exp(-lambda) / k!` for `k >= 0`.
Member commented

Better not to split the code block over multiple lines IMO

@dhardy (Member) left a comment

Some comments; still needs more review of the maths (which are not trivial).

if expected < 25.0 {
    let mut lresult = 0.0;
    for _ in 0 .. self.n {
        if rng.gen::<f64>() < p {
Member commented

@pitdicker what do you think about this probability test? I've been wondering if we should add a dedicated Bernoulli distribution for more accurate sampling.

Contributor replied

I should first learn a lot before I can make any meaningful comments w.r.t. the distributions...

This single line is pretty much the Bernoulli distribution? It might be more generally useful than gen_weighted_bool.

Member replied

Yes, I think we should add a Bernoulli distribution, and it would be nice to have it reasonably accurate for small p.

Contributor (PR author) replied

If we can do better than just rng.gen::<f64>() < p, then sure, this is a good idea, otherwise I'm not really convinced that it makes sense to make it a separate distribution.
And as for doing better, that's a bit over my head, so unfortunately I won't be able to help...

Member replied

To fill you in @fizyk20: @pitdicker already did quite a bit of work implementing higher-precision floating-point sampling, since the default method uses the same precision over the whole 0-1 range even though the format can represent much finer values close to 0. However, we seem to have decided not to use that sampling method by default. There is also the fact that we use a small offset, which normally isn't an issue, but might matter for correct sampling of small probabilities.

Contributor replied

Almost available with Rng::gen_bool(p) from #308.
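
As a rough illustration of how a Bernoulli test can avoid the plain f64 comparison, one can compare a uniform u64 against a precomputed integer threshold. This is only a sketch of the idea (the helper name is made up) and not a claim about how gen_bool in #308 is implemented:

    use rand::Rng;

    // Sketch: Bernoulli(p) by comparing a uniform u64 against an integer
    // threshold derived from p.
    fn bernoulli<R: Rng + ?Sized>(rng: &mut R, p: f64) -> bool {
        assert!(p >= 0.0 && p <= 1.0);
        if p >= 1.0 {
            return true; // p * 2^64 would not fit in a u64
        }
        // Scale p into the 2^64 range; the cast truncates toward zero.
        let threshold = (p * 18_446_744_073_709_551_616.0) as u64; // p * 2^64
        rng.gen::<u64>() < threshold
    }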

lambda: lambda,
exp_lambda: (-lambda).exp(),
log_lambda: lambda.ln(),
magic_val: lambda * lambda.ln() - log_gamma(1.0 + lambda),
Member commented

Why call lambda.ln() twice? Also, the choice of method could be made here and stored in an enum; this reduces the number of parameters since exp_lambda and the last two parameters are not used simultaneously. Maybe also cache (2.0 * self.lambda).sqrt() since SQRT is slow?

Contributor (PR author) replied

Just an oversight, I wouldn't be surprised if the compiler optimised that, though.
And sure, caching the sqrt sounds reasonable.
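
A minimal sketch of what that caching could look like in the constructor; the sqrt_2lambda field and the exact struct layout are hypothetical, not taken from the PR, and log_gamma refers to the crate-internal helper shown in the diff above:

    // Sketch: compute lambda.ln() once and cache sqrt(2 * lambda) alongside
    // the other derived values.
    pub struct Poisson {
        lambda: f64,
        exp_lambda: f64,
        log_lambda: f64,
        sqrt_2lambda: f64,
        magic_val: f64,
    }

    impl Poisson {
        pub fn new(lambda: f64) -> Poisson {
            assert!(lambda > 0.0, "Poisson::new called with lambda <= 0");
            let log_lambda = lambda.ln();
            Poisson {
                lambda: lambda,
                exp_lambda: (-lambda).exp(),
                log_lambda: log_lambda,
                sqrt_2lambda: (2.0 * lambda).sqrt(),
                magic_val: lambda * log_lambda - log_gamma(1.0 + lambda),
            }
        }
    }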

fn sample<R: Rng + ?Sized>(&self, rng: &mut R) -> u64 {
    // using the algorithm from Numerical Recipes in C

    // for low expected values use the Knuth method
Member commented

Would it be better to use the inverse transform method for small samples, since it only requires 1 random sample? I don't know a lot about this topic unfortunately.
https://en.wikipedia.org/wiki/Poisson_distribution#Generating_Poisson-distributed_random_variables

Contributor (PR author) replied

To be honest, I don't know, my knowledge is also limited. Maybe it is a good idea, sampling just once sounds attractive. I mostly just ported the algorithm from "Numerical Recipes", but it is possible that something else could be better.

@dhardy (Member) commented Mar 8, 2018

Hmm; unless an expert in this area turns up (unlikely), perhaps the best we can do is implement some tests (e.g. plot a high-resolution histogram) and then say this is good enough for now.
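
For reference, the single-sample inverse transform method from the linked Wikipedia article looks roughly like the sketch below. It is only practical for small lambda, since exp(-lambda) underflows and the loop length grows with lambda:

    use rand::Rng;

    // Sketch of the inverse transform method for Poisson(lambda): draw one
    // uniform sample and walk the CDF until it exceeds that sample.
    fn poisson_inverse_transform<R: Rng + ?Sized>(rng: &mut R, lambda: f64) -> u64 {
        let mut k = 0u64;
        let mut p = (-lambda).exp(); // P(X = 0)
        let mut cdf = p;
        let u: f64 = rng.gen(); // single uniform sample in [0, 1)
        while u > cdf {
            k += 1;
            p *= lambda / k as f64; // P(X = k) from P(X = k - 1)
            cdf += p;
        }
        k
    }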

@pitdicker (Contributor) commented

One small thing still to add: benchmarks.

@dhardy mentioned this pull request Mar 11, 2018
@dhardy added the P-medium and D-review Do: needs review labels Mar 12, 2018

mod float;
mod integer;
mod log_gamma;
Member commented

This module also needs the cfg gate for std-only
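
For example, something like the following (assuming the feature flag is named std, as for the other std-only parts of the crate):

    // Sketch: compile the module only when the std feature is enabled.
    #[cfg(feature = "std")]
    mod log_gamma;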

@pitdicker (Contributor) commented

I have just gone over the code here and in "Numerical Recipes in C" side by side. It is a little different and uses much clearer function names, but it seems like a clean translation to me.

We should build up more expertise in the various distribution sampling methods, and we may want to pick different algorithms in the future. But that will take time; I think this PR is good to have in the meantime.

My result from plotting 100,000 samples of the binomial distribution (using a simple spreadsheet):

    #[test]
    fn test_binomial_distr() {
        let mut results = [0; 41];
//        let binomial = Binomial::new(20, 0.5);
//        let binomial = Binomial::new(20, 0.7);
        let binomial = Binomial::new(40, 0.5);
        let mut rng = ::test::rng(123);
        for _ in 0..100_000 {
            let sample = binomial.sample(&mut rng);
            if sample <= 40 { results[sample as usize] += 1 }
        }
        let sum = results.iter().sum::<u64>() as f64;
        for sample in results.iter() {
            println!("{}", *sample as f64 / sum);
        }
        panic!(); // fail deliberately so `cargo test` prints the histogram above
    }

(image: histogram of the binomial samples)

And the same for Poisson:

    #[test]
    fn test_poisson_distr() {
        let mut results = [0; 21];
//        let poisson = Poisson::new(1.0);
//        let poisson = Poisson::new(4.0);
        let poisson = Poisson::new(10.0);
        let mut rng = ::test::rng(123);
        for _ in 0..100_000 {
            let sample = poisson.sample(&mut rng);
            if sample <= 20 { results[sample as usize] += 1 }
        }
        let sum = results.iter().sum::<u64>() as f64;
        for sample in results.iter() {
            println!("{}", *sample as f64 / sum);
        }
        panic!(); // fail deliberately so `cargo test` prints the histogram above
    }

(image: histogram of the Poisson samples)

It is very primitive, but both look very plausible to me when compared to Wikipedia (Binomial, Poisson) 😄.

What is left to finally get this PR over the finish line (after 2+ years)?

  • std-only feature gate (this also fixes the CI error)
  • benchmarks
  • maybe use Rng::gen_bool(p)
  • I see some tests that are disabled for msvc; I suppose two years ago it was not completely reliable yet?

@dhardy What do you think of merging this PR, and I make a PR with those tiny fix-ups?

@pitdicker (Contributor) commented

The distributions are slow at the moment though:

test distr_uniform_f64          ... bench:       2,769 ns/iter (+/- 7) = 2889 MB/s (baseline)

test distr_binomial             ... bench:      87,178 ns/iter (+/- 373) = 91 MB/s
test distr_exp                  ... bench:       6,445 ns/iter (+/- 38) = 1241 MB/s
test distr_gamma_large_shape    ... bench:      17,451 ns/iter (+/- 152) = 458 MB/s
test distr_gamma_small_shape    ... bench:      76,848 ns/iter (+/- 2,983) = 104 MB/s
test distr_log_normal           ... bench:      23,258 ns/iter (+/- 431) = 343 MB/s
test distr_normal               ... bench:       6,196 ns/iter (+/- 46) = 1291 MB/s
test distr_poisson              ... bench:      31,369 ns/iter (+/- 330) = 255 MB/s

@dhardy (Member) commented Mar 18, 2018

Ok. I created a tracker: #310

@dhardy merged commit 2e3f2bf into rust-random:master Mar 18, 2018
@pitdicker (Contributor) commented

Actually I have a branch ready.

@pitdicker (Contributor) commented

🎉

@fizyk20 deleted the discrete branch March 19, 2018 15:16
pitdicker pushed a commit that referenced this pull request Apr 4, 2018
Add binomial and Poisson distributions