2x speed difference vs hand-unrolled #135

Open
yongqli opened this issue Jun 15, 2015 · 8 comments

@yongqli

yongqli commented Jun 15, 2015

Hi,

Based on this blog post, I've decided to benchmark nalgebra and I've found it to be about 2x slower than a hand-unrolled implementation. Any ideas why? I'm new to Rust, so it's entirely possible I'm doing something wrong.

https://gist.github.com/yongqli/7ba8ef0e06fbfaebd98f takes 9.186 s to run, so about 9.19 ms per 1 million iterations.

#[cfg(test)]
mod tests {
    extern crate nalgebra;
    extern crate test;

    use super::*;
    use nalgebra::*;

    #[bench]
    fn bench_4x4_mult(b: &mut test::Bencher) {
        b.iter(|| {
            let mut a = test::black_box(
                Mat4::new(1., 1., 1., 1.,
                          1., 2., 1., 1.,
                          1., 1., 4., 1.,
                          1., 1., 1., 1.,)
            );

            let b = test::black_box(new_identity::<Mat4<f64>>(4));

            for _ in 0..1_000_000 {
                // Mat4::inv_mut(&mut a);
                a = a * b;
            }
        });
    }
}

takes 22 ms according to cargo bench.
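For anyone who wants to reproduce a measurement like this without the nightly-only `#[bench]` attribute, here is a rough sketch that times the same `a = a * b` loop on a stable toolchain. It uses plain `[[f64; 4]; 4]` arrays rather than nalgebra types, and the names `mul_loop`, `repeated_mul` and `identity` are made up for illustration. A single `Instant` measurement is much noisier than `cargo bench`, so treat the number it prints as indicative only.

```rust
use std::hint::black_box;
use std::time::Instant;

type Mat4 = [[f64; 4]; 4];

/// Straightforward triple-loop 4x4 product (hypothetical helper, not nalgebra's code).
fn mul_loop(a: &Mat4, b: &Mat4) -> Mat4 {
    let mut out = [[0.0; 4]; 4];
    for i in 0..4 {
        for j in 0..4 {
            for k in 0..4 {
                out[i][j] += a[i][k] * b[k][j];
            }
        }
    }
    out
}

/// 4x4 identity matrix.
fn identity() -> Mat4 {
    let mut m = [[0.0; 4]; 4];
    for i in 0..4 {
        m[i][i] = 1.0;
    }
    m
}

/// Repeats `a = a * b` `iters` times, mirroring the loop in the benchmark above.
fn repeated_mul(mut a: Mat4, b: Mat4, iters: u32) -> Mat4 {
    for _ in 0..iters {
        a = mul_loop(&a, &b);
    }
    a
}

fn main() {
    // black_box keeps the optimizer from treating the inputs as constants.
    let a = black_box([
        [1.0, 1.0, 1.0, 1.0],
        [1.0, 2.0, 1.0, 1.0],
        [1.0, 1.0, 4.0, 1.0],
        [1.0, 1.0, 1.0, 1.0],
    ]);
    let b = black_box(identity());

    let start = Instant::now();
    let result = black_box(repeated_mul(a, b, 1_000_000));
    let elapsed = start.elapsed();

    println!("1,000,000 multiplications took {:?}", elapsed);
    println!("result[2][2] = {}", result[2][2]);
}
```

Build with `--release`; a debug build will mostly measure bounds checks and unoptimized loops.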

@sebcrozet
Member

I see. If this is due to the lack of manual unrolling, this is very unfortunate. Perhaps we could somehow perform this unrolling automatically using macros.

@yongqli
Author

yongqli commented Jun 21, 2015

I've been using this, which you might also find useful:

macro_rules! new_Mat3x3(
    ($f: expr) => (
        Mat3x3(
            [[($f)(0, 0), ($f)(0, 1), ($f)(0, 2)],
             [($f)(1, 0), ($f)(1, 1), ($f)(1, 2)],
             [($f)(2, 0), ($f)(2, 1), ($f)(2, 2)]]
        )
    )
);


...


impl Add for $Mat {
    type Output = $Mat;
    #[inline(always)]
    fn add(self, rhs: $Mat) -> $Mat {
        $new_Mat!(|i, j| self[i][j] + rhs[i][j])
    }
}

This unrolls the closure into the "shape" of the matrix.
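To try the pattern in isolation, here is a minimal self-contained sketch. The `Mat3x3` tuple struct wrapping `[[f64; 3]; 3]` and its `Index` impl are assumptions, since the original definitions are elided above; the outer macro that supplies `$Mat` and `$new_Mat` is replaced by a direct impl. The closure parameters are annotated as `usize` so the example does not rely on type inference at the call site.

```rust
use std::ops::{Add, Index};

// Assumed layout: a tuple struct around a row-major 3x3 array.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Mat3x3([[f64; 3]; 3]);

// Lets `m[i][j]` work, as used in the closure below.
impl Index<usize> for Mat3x3 {
    type Output = [f64; 3];
    fn index(&self, row: usize) -> &[f64; 3] {
        &self.0[row]
    }
}

// Pastes the closure expression once per element, with constant indices,
// so there is no loop left for the compiler to unroll.
macro_rules! new_Mat3x3(
    ($f: expr) => (
        Mat3x3(
            [[($f)(0, 0), ($f)(0, 1), ($f)(0, 2)],
             [($f)(1, 0), ($f)(1, 1), ($f)(1, 2)],
             [($f)(2, 0), ($f)(2, 1), ($f)(2, 2)]]
        )
    )
);

impl Add for Mat3x3 {
    type Output = Mat3x3;
    #[inline(always)]
    fn add(self, rhs: Mat3x3) -> Mat3x3 {
        new_Mat3x3!(|i: usize, j: usize| self[i][j] + rhs[i][j])
    }
}

fn main() {
    let a = Mat3x3([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]);
    let b = Mat3x3([[0.5; 3]; 3]);
    println!("{:?}", a + b);
}
```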

@yongqli
Author

yongqli commented Jun 21, 2015

Here's an example of matrix multiplication:

macro_rules! unroll_sum_4 (
    ($f: expr) => (
        ($f)(0) + ($f)(1) + ($f)(2) + ($f)(3)
    )
);

...

macro_rules! unroll_Mat4x4(
    ($f: expr) => (
        Mat4x4(
            [[($f)(0, 0), ($f)(0, 1), ($f)(0, 2), ($f)(0, 3)],
             [($f)(1, 0), ($f)(1, 1), ($f)(1, 2), ($f)(1, 3)],
             [($f)(2, 0), ($f)(2, 1), ($f)(2, 2), ($f)(2, 3)],
             [($f)(3, 0), ($f)(3, 1), ($f)(3, 2), ($f)(3, 3)]]
        )
    )
);

...

impl Mul<Mat4x4> for Mat4x4 {
    type Output = Mat4x4;
    #[inline(always)]
    fn mul(self, rhs: Mat4x4) -> Mat4x4 {
        unroll_Mat4x4!(|i, j| unroll_sum_4!(|k| self[i][k] * rhs[k][j]))
    }
}
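Filling in the elided parts gives something like the following runnable sketch. The `Mat4x4` struct, its `Index` impl and the `mul_loop` reference function are assumptions added for illustration; only the two macros and the `Mul` impl follow the snippet above, with `usize` annotations added to the closure parameters. The triple-loop version is there just to check that the unrolled product gives the same answer.

```rust
use std::ops::{Index, Mul};

// Assumed layout: a tuple struct around a row-major 4x4 array.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Mat4x4([[f64; 4]; 4]);

impl Index<usize> for Mat4x4 {
    type Output = [f64; 4];
    fn index(&self, row: usize) -> &[f64; 4] {
        &self.0[row]
    }
}

// Expands to f(0) + f(1) + f(2) + f(3): the inner dot product, unrolled.
macro_rules! unroll_sum_4 (
    ($f: expr) => (
        ($f)(0) + ($f)(1) + ($f)(2) + ($f)(3)
    )
);

// Expands to one call of the closure per matrix element.
macro_rules! unroll_Mat4x4(
    ($f: expr) => (
        Mat4x4(
            [[($f)(0, 0), ($f)(0, 1), ($f)(0, 2), ($f)(0, 3)],
             [($f)(1, 0), ($f)(1, 1), ($f)(1, 2), ($f)(1, 3)],
             [($f)(2, 0), ($f)(2, 1), ($f)(2, 2), ($f)(2, 3)],
             [($f)(3, 0), ($f)(3, 1), ($f)(3, 2), ($f)(3, 3)]]
        )
    )
);

impl Mul<Mat4x4> for Mat4x4 {
    type Output = Mat4x4;
    #[inline(always)]
    fn mul(self, rhs: Mat4x4) -> Mat4x4 {
        unroll_Mat4x4!(|i: usize, j: usize| unroll_sum_4!(|k: usize| self[i][k] * rhs[k][j]))
    }
}

// Reference implementation with ordinary loops (hypothetical helper).
fn mul_loop(a: &Mat4x4, b: &Mat4x4) -> Mat4x4 {
    let mut out = [[0.0; 4]; 4];
    for i in 0..4 {
        for j in 0..4 {
            for k in 0..4 {
                out[i][j] += a[i][k] * b[k][j];
            }
        }
    }
    Mat4x4(out)
}

fn main() {
    let a = Mat4x4([
        [1.0, 1.0, 1.0, 1.0],
        [1.0, 2.0, 1.0, 1.0],
        [1.0, 1.0, 4.0, 1.0],
        [1.0, 1.0, 1.0, 1.0],
    ]);
    println!("unrolled: {:?}", a * a);
    println!("looped:   {:?}", mul_loop(&a, &a));
}
```

Because each macro pastes the closure expression at every element, the generated code is 16 dot products of 4 terms each, written out with constant indices.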

@bluss

bluss commented Aug 23, 2015

@yongqli What kind of compilation flags did you use for C and Rust? This forum thread's later posts touch upon the issue of -ffast-math (which Rust lacks) and also the lack of unrolling. The lack of vectorization in floating-point reduction (accumulation) loops is explicitly documented by LLVM.
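A small illustration of why the reduction point matters: vectorizing an accumulation loop means summing in a different order, and floating-point addition is not associative, so without fast-math-style permission the compiler has to keep the sequential order. The sketch below simulates a two-lane reduction by hand; `sum_sequential` and `sum_two_lanes` are made-up names, and the input is chosen deliberately to make the difference visible.

```rust
/// Adds the values strictly left to right, as the source loop specifies.
fn sum_sequential(xs: &[f64]) -> f64 {
    let mut acc = 0.0;
    for &x in xs {
        acc += x;
    }
    acc
}

/// Simulates a 2-lane vectorized reduction: even and odd elements are
/// accumulated separately, then the two partial sums are combined.
fn sum_two_lanes(xs: &[f64]) -> f64 {
    let mut lanes = [0.0_f64; 2];
    for (i, &x) in xs.iter().enumerate() {
        lanes[i % 2] += x;
    }
    lanes[0] + lanes[1]
}

fn main() {
    // 1e16 + 1.0 rounds back to 1e16 in f64, so the order of additions matters.
    let xs = [1e16, 1.0, -1e16, 1.0];
    println!("sequential: {}", sum_sequential(&xs)); // prints 1
    println!("two lanes:  {}", sum_two_lanes(&xs)); // prints 2
}
```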

@milibopp
Collaborator

When issues like this come up, I always feel like they should be pushed to the compiler, as this will (hopefully) solve them in a more generally useful fashion.

@bluss

bluss commented Sep 10, 2015

The annotation for "fast" float semantics probably needs to be explicit. Maybe that's not the whole issue.

@bluss

bluss commented Dec 20, 2015

Here's the Rust issue on fast-math / imprecise float operations: rust-lang/rust/issues/21690

@yongqli
Author

yongqli commented Feb 17, 2017

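There's still a performance difference with the latest version of nalgebra. In the results below, nalgebra takes roughly 1.7x to 3.3x as long as the hand-unrolled version (i.e. it is about 0.7x to 2.3x slower).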

test tests::bench_4x4_mult_nalgebra   ... bench:      26,195 ns/iter (+/- 2,911)
test tests::bench_4x4_mult_unrolled   ... bench:       7,826 ns/iter (+/- 977)
test tests::bench_4x4_t_mult_nalgebra ... bench:      18,170 ns/iter (+/- 2,764)
test tests::bench_4x4_t_mult_unrolled ... bench:      10,420 ns/iter (+/- 1,630)

You can run it yourself by checking out https://github.com/yongqli/rust_linalgs_bench and running cargo bench
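As a quick sanity check on how large the gap is, the ratios can be computed directly from the ns/iter figures in the output above. The `ratio` helper is just an illustrative name.

```rust
/// How many times longer the first timing is than the second.
fn ratio(slow_ns: f64, fast_ns: f64) -> f64 {
    slow_ns / fast_ns
}

fn main() {
    // Timings (ns/iter) copied from the cargo bench output above.
    let mult = ratio(26_195.0, 7_826.0);
    let t_mult = ratio(18_170.0, 10_420.0);
    println!("4x4 mult:   nalgebra takes {:.2}x as long", mult); // 3.35x
    println!("4x4 t_mult: nalgebra takes {:.2}x as long", t_mult); // 1.74x
}
```

Given the reported +/- spreads, these ratios are only good to about one decimal place.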
