Uniform float improvements #1289
Conversation
Note: sample_single and sample_single_inclusive use different code paths for sampling.
Around 20% faster for f32, slightly less for f64.
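For context, a minimal usage sketch (assuming rand's 0.8-style public API, where `gen_range` with a half-open range dispatches to `sample_single` and an inclusive range to `sample_single_inclusive`):

```rust
use rand::Rng;

fn main() {
    let mut rng = rand::thread_rng();
    // Half-open range: goes through the sample_single code path.
    let x: f32 = rng.gen_range(0.0..10.0);
    // Inclusive range: goes through the sample_single_inclusive code path.
    let y: f32 = rng.gen_range(0.0..=10.0);
    println!("x = {x}, y = {y}");
}
```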
Closes #1270.

This branch was created to evaluate the mentioned paper: https://hal.science/hal-03282794/document

The paper ignores possible underflow, but notes the following potential issues for uniform FP sampling from a defined range:

- GSL's method, …
- The method using …

The paper defines method SESS (Same Exponent Same Sign), which we shall ignore due to the narrow circumstances in which it is applicable. The paper also defines method γsection for general applicability. This was implemented for evaluation as follows:

```rust
let (a, b) = (low, high);
// γ: the coarser of the ULP spacings at the two endpoints.
let g = (a.next_up() - a).max(b - b.next_down());
// s: the (possibly inexact) number of γ-steps spanning [a, b];
// eps tracks the rounding error of that computation.
let s = b / g - a / g;
let eps = if a.abs() <= b.abs() {
    -a / g - (s - b / g)
} else {
    b / g - (s + a / g)
};
let si = s.ceil() as $uty;
let hi = if s != si as $ty {
    si
} else {
    si + (eps > 0.0) as $uty
};
// Choose one of the hi + 1 grid points.
let k = (0..=hi).sample_single(rng).unwrap();
// Step from the endpoint of larger magnitude; k == hi maps to the other endpoint.
Ok(if a.abs() <= b.abs() {
    if k == hi {
        a
    } else {
        b - (k as $ty) * g
    }
} else {
    if k == hi {
        b
    } else {
        a + (k as $ty) * g
    }
})
```

Results are around 10 times slower.
In comparison, our current implementations benchmark at 1-1.5ns for distribution sampling and ~2ns for single sampling. The paper's implementation of SESS benchmarked at 10ns per value for both. Note also that #531 should (if I remember correctly) be able to offer more precise sampling with similar performance, thus I do not see much benefit in considering the γsection method further.
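For reference, a standalone sketch of the γsection code above, monomorphized for f64 (illustration only: the helper name `gamma_section` is mine, the integer draw uses rand's public `gen_range`, and `f64::next_up`/`next_down` from a recent toolchain stand in for whatever helpers the branch uses):

```rust
use rand::Rng;

/// Sketch of the γsection method for f64; not the branch's implementation.
fn gamma_section<R: Rng>(rng: &mut R, a: f64, b: f64) -> f64 {
    // γ: the coarser of the ULP spacings at the two endpoints.
    let g = (a.next_up() - a).max(b - b.next_down());
    // Number of γ-steps across [a, b], with eps tracking rounding error.
    let s = b / g - a / g;
    let eps = if a.abs() <= b.abs() {
        -a / g - (s - b / g)
    } else {
        b / g - (s + a / g)
    };
    let si = s.ceil() as u64;
    let hi = if s != si as f64 {
        si
    } else {
        si + (eps > 0.0) as u64
    };
    // Choose one of the hi + 1 grid points.
    let k = rng.gen_range(0..=hi);
    // Step from the endpoint of larger magnitude; k == hi maps to the other endpoint.
    if a.abs() <= b.abs() {
        if k == hi { a } else { b - (k as f64) * g }
    } else {
        if k == hi { b } else { a + (k as f64) * g }
    }
}

fn main() {
    let mut rng = rand::thread_rng();
    println!("{}", gamma_section(&mut rng, -1.5, 3.0));
}
```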
Are there any significant optimizations we can make for the [0,1) case?
I didn't look at that here, but I don't think so (we essentially just mask and bit-cast an integer, then subtract 1).
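For illustration, a minimal sketch of that mask-and-bit-cast approach for f32 (a sketch of the technique, not rand's exact source): keep 23 random bits as the mantissa on top of the exponent of 1.0 to get a value in [1, 2), then subtract 1 to land in [0, 1).

```rust
// Illustration of mask + bit-cast + subtract for f32 in [0, 1).
fn bits_to_f32_01(bits: u32) -> f32 {
    // Keep 23 random bits for the mantissa field.
    let mantissa = bits >> 9;
    // OR onto the bit pattern of 1.0f32, giving a value in [1.0, 2.0).
    let value = f32::from_bits(1.0f32.to_bits() | mantissa);
    // Shift down to [0.0, 1.0).
    value - 1.0
}

fn main() {
    for bits in [0u32, 0x8000_0000, u32::MAX] {
        println!("{bits:#010x} -> {}", bits_to_f32_01(bits));
    }
}
```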
Looks good! I only have some small questions about the benchmarks.
Also avoid using the internal API of the Uniform distribution.
sample_single_inclusive (~20% perf increase)