Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SIMD implementation for float returns index > final array index #8

Closed
jvdd opened this issue Sep 23, 2022 · 7 comments
Closed

SIMD implementation for float returns index > final array index #8

jvdd opened this issue Sep 23, 2022 · 7 comments
Assignees
Labels
bug Something isn't working

Comments

@jvdd
Copy link

jvdd commented Sep 23, 2022

First of all thanks for open-sourcing this amazing library! The SIMD implementation gives me huge speedups 🚀

I believe I stumbled upon a bug while I was benchmarking the code. It seems that for large arrays, the simd_f*.rs implementation (thus for float arrays) returns an index value that is outside the range of valid array indexes.

Reproducible example;

use argmm::ArgMinMax;
use argmm::generic::{simple_argmin, simple_argmax};

fn main() {
    let arr: Vec<i32> = (0..17_000_000).collect();

    // using scalar implementation
    let min = simple_argmin(&arr);
    let max = simple_argmax(&arr);
    println!("Scalar -- min: {}, max: {}", min, max);

    // using SIMD implementation
    let min = arr.argmin().unwrap();
    let max = arr.argmax().unwrap();
    println!("SIMD   -- min: {}, max: {}", min, max);

    let arr: Vec<f32> = arr.iter().map(|x| *x as f32).collect();

    // using scalar implementation
    let min = simple_argmin(&arr);
    let max = simple_argmax(&arr);
    println!("Scalar -- min: {}, max: {}", min, max);

    // using SIMD implementation
    let min = arr.argmin().unwrap();
    let max = arr.argmax().unwrap();
    println!("SIMD   -- min: {}, max: {}", min, max);
}

From my basic understanding I think this might stem from changing precision when casting the usize index to f32 in order to perform the SIMD operations? If so, would there be a way to fix this?

P.S.: I would love to help fix this issue (& understand properly why it persists), but my rust + SIMD knowledge is rather lacking...

@jvdd jvdd changed the title SSIMD implementation for float returns index > final array index SIMD implementation for float returns index > final array index Sep 23, 2022
@minimalrust
Copy link
Owner

Apologies, I've only just seen this issue. Firstly its great to hear you've found some use in the library, it was largely experimental but I saw some pretty impressive speedups myself.

I'll see what I can do over the next week to fix/explain whats causing the issue.

P.S I've been using an Apple M1 for most of the year so working on this has become a lot harder. But fear not, you've given me good reason to get back into it 😄

@minimalrust minimalrust self-assigned this Dec 13, 2022
@minimalrust minimalrust added the bug Something isn't working label Dec 13, 2022
@jvdd
Copy link
Author

jvdd commented Dec 13, 2022

Hi there! I tried fixing the bug but didn't have much luck. I'm still not sure what's causing it to persist :/

Anyway, I was really impressed by the speed of this library, so much so that it inspired me to start my own Rust project: https://github.com/jvdd/argminmax. This library focuses on optimizing the argminmax (1 function that performs both argmin & argmax) function with SIMD (sse, avx(2), avx512, neon) and runtime feature detection.

By the way, I mentioned your name and this repository in the description of the repo. I hope you don't mind!

@jvdd
Copy link
Author

jvdd commented Dec 13, 2022

I believe I have discovered the cause of the bug: an overflow issue.

As you may know, f32 can represent up to 1 << f32::MANTISSA_DIGITS (2**24 = 16.78M) integers exactly. However, when the length of the array exceeds this value, we start seeing rounding errors in the float values used to track the index in the SIMD vector.

I tackled this issue in my own library by implementing an "overflow-safe" outer loop over the SIMD loop. (The only limitation is that this outer loop becomes the bottleneck for 8-bit data types - to address this, I used horizontal SIMD operations to avoid the slower scalar code.)

You can find more details on this topic here: https://stackoverflow.com/a/3793950.
Let me know what you think!

@minimalrust
Copy link
Owner

Hi there! I tried fixing the bug but didn't have much luck. I'm still not sure what's causing it to persist :/

Anyway, I was really impressed by the speed of this library, so much so that it inspired me to start my own Rust project: https://github.com/jvdd/argminmax. This library focuses on optimizing the argminmax (1 function that performs both argmin & argmax) function with SIMD (sse, avx(2), avx512, neon) and runtime feature detection.

By the way, I mentioned your name and this repository in the description of the repo. I hope you don't mind!

I love this, thanks for keeping the idea going and I really appreciate the credit! I stuck with SSE instructions because apparently it put the least strain on your CPU, but I guess that should be left to the library user if they really want faster speed.

It's nice to see you've tackled the neon instruction set too, that was going to be my next challenge so I could get this working on the M1. Ultimately, I think it will be great once the portable SIMD library is stable, as a lot of this code could be simplified.

@minimalrust
Copy link
Owner

minimalrust commented Dec 22, 2022

I believe I have discovered the cause of the bug: an overflow issue.

As you may know, f32 can represent up to 1 << f32::MANTISSA_DIGITS (2**24 = 16.78M) integers exactly. However, when the length of the array exceeds this value, we start seeing rounding errors in the float values used to track the index in the SIMD vector.

I tackled this issue in my own library by implementing an "overflow-safe" outer loop over the SIMD loop. (The only limitation is that this outer loop becomes the bottleneck for 8-bit data types - to address this, I used horizontal SIMD operations to avoid the slower scalar code.)

You can find more details on this topic here: https://stackoverflow.com/a/3793950. Let me know what you think!

This is a really interesting find. I always assumed the precision issues were after the decimal point 😅 so thank you for clarifying that.

Here is a playground demonstrating the inaccuracy you've described. Unfortunately because we use __m128 types within the simd_f32 operations, we have to use f32s as indexes or we lose a lot of the speed up.

There is a possibility we could use __m128i types, but I'm not sure if we have access to the same operations, or the speed implications.

Having looked into your library more, I think you've done a great job in restricting the MAX_INDEX and not to mention the use of traits 👏🏿 , very impressive. Do you think you could add support for the Vec type? If thats something you had in mind, I think it would be appropriate to sunset my Rust package. I will make yours the preferred library in my readme ⭐.

@jvdd
Copy link
Author

jvdd commented Dec 28, 2022

Hi! Thx for the nice words and the insightful remarks :)

I created 2 issues concerning your remarks (& gave you credits for them).

I have just given given it some thought to use integer SIMD dtype for the index. I believe this should be possible (and thus reduce the bottleneck for f16 and perhaps f32 as well). The key is casting the mask to a SIMD dtype that can be used to blend the index SIMD dtype (e.g., _mm256_castps_si256)

@minimalrust
Copy link
Owner

Hi! Thx for the nice words and the insightful remarks :)

I created 2 issues concerning your remarks (& gave you credits for them).

I have just given given it some thought to use integer SIMD dtype for the index. I believe this should be possible (and thus reduce the bottleneck for f16 and perhaps f32 as well). The key is casting the mask to a SIMD dtype that can be used to blend the index SIMD dtype (e.g., _mm256_castps_si256)

You’ve got it! Thanks again for raising the original issue (and the new ones). I may be able to contribute to the changes too which would be nice.

P.S Happy new year 🎉

@minimalrust minimalrust closed this as not planned Won't fix, can't repro, duplicate, stale Jan 2, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants