SIMD implementation for float returns index > final array index #8

jvdd · 2022-09-23T17:52:23Z

First of all thanks for open-sourcing this amazing library! The SIMD implementation gives me huge speedups 🚀

I believe I stumbled upon a bug while I was benchmarking the code. It seems that for large arrays, the simd_f*.rs implementation (thus for float arrays) returns an index value that is outside the range of valid array indexes.

Reproducible example;

use argmm::ArgMinMax;
use argmm::generic::{simple_argmin, simple_argmax};

fn main() {
    let arr: Vec<i32> = (0..17_000_000).collect();

    // using scalar implementation
    let min = simple_argmin(&arr);
    let max = simple_argmax(&arr);
    println!("Scalar -- min: {}, max: {}", min, max);

    // using SIMD implementation
    let min = arr.argmin().unwrap();
    let max = arr.argmax().unwrap();
    println!("SIMD   -- min: {}, max: {}", min, max);

    let arr: Vec<f32> = arr.iter().map(|x| *x as f32).collect();

    // using scalar implementation
    let min = simple_argmin(&arr);
    let max = simple_argmax(&arr);
    println!("Scalar -- min: {}, max: {}", min, max);

    // using SIMD implementation
    let min = arr.argmin().unwrap();
    let max = arr.argmax().unwrap();
    println!("SIMD   -- min: {}, max: {}", min, max);
}

From my basic understanding I think this might stem from changing precision when casting the usize index to f32 in order to perform the SIMD operations? If so, would there be a way to fix this?

P.S.: I would love to help fix this issue (& understand properly why it persists), but my rust + SIMD knowledge is rather lacking...

The text was updated successfully, but these errors were encountered:

minimalrust · 2022-12-13T17:32:17Z

Apologies, I've only just seen this issue. Firstly its great to hear you've found some use in the library, it was largely experimental but I saw some pretty impressive speedups myself.

I'll see what I can do over the next week to fix/explain whats causing the issue.

P.S I've been using an Apple M1 for most of the year so working on this has become a lot harder. But fear not, you've given me good reason to get back into it 😄

jvdd · 2022-12-13T17:51:30Z

Hi there! I tried fixing the bug but didn't have much luck. I'm still not sure what's causing it to persist :/

Anyway, I was really impressed by the speed of this library, so much so that it inspired me to start my own Rust project: https://github.com/jvdd/argminmax. This library focuses on optimizing the argminmax (1 function that performs both argmin & argmax) function with SIMD (sse, avx(2), avx512, neon) and runtime feature detection.

By the way, I mentioned your name and this repository in the description of the repo. I hope you don't mind!

jvdd · 2022-12-13T18:06:53Z

I believe I have discovered the cause of the bug: an overflow issue.

As you may know, f32 can represent up to 1 << f32::MANTISSA_DIGITS (2**24 = 16.78M) integers exactly. However, when the length of the array exceeds this value, we start seeing rounding errors in the float values used to track the index in the SIMD vector.

I tackled this issue in my own library by implementing an "overflow-safe" outer loop over the SIMD loop. (The only limitation is that this outer loop becomes the bottleneck for 8-bit data types - to address this, I used horizontal SIMD operations to avoid the slower scalar code.)

You can find more details on this topic here: https://stackoverflow.com/a/3793950.
Let me know what you think!

minimalrust · 2022-12-21T06:50:57Z

Hi there! I tried fixing the bug but didn't have much luck. I'm still not sure what's causing it to persist :/

Anyway, I was really impressed by the speed of this library, so much so that it inspired me to start my own Rust project: https://github.com/jvdd/argminmax. This library focuses on optimizing the argminmax (1 function that performs both argmin & argmax) function with SIMD (sse, avx(2), avx512, neon) and runtime feature detection.

By the way, I mentioned your name and this repository in the description of the repo. I hope you don't mind!

I love this, thanks for keeping the idea going and I really appreciate the credit! I stuck with SSE instructions because apparently it put the least strain on your CPU, but I guess that should be left to the library user if they really want faster speed.

It's nice to see you've tackled the neon instruction set too, that was going to be my next challenge so I could get this working on the M1. Ultimately, I think it will be great once the portable SIMD library is stable, as a lot of this code could be simplified.

minimalrust · 2022-12-22T04:34:44Z

I believe I have discovered the cause of the bug: an overflow issue.

As you may know, f32 can represent up to 1 << f32::MANTISSA_DIGITS (2**24 = 16.78M) integers exactly. However, when the length of the array exceeds this value, we start seeing rounding errors in the float values used to track the index in the SIMD vector.

I tackled this issue in my own library by implementing an "overflow-safe" outer loop over the SIMD loop. (The only limitation is that this outer loop becomes the bottleneck for 8-bit data types - to address this, I used horizontal SIMD operations to avoid the slower scalar code.)

You can find more details on this topic here: https://stackoverflow.com/a/3793950. Let me know what you think!

This is a really interesting find. I always assumed the precision issues were after the decimal point 😅 so thank you for clarifying that.

Here is a playground demonstrating the inaccuracy you've described. Unfortunately because we use __m128 types within the simd_f32 operations, we have to use f32s as indexes or we lose a lot of the speed up.

There is a possibility we could use __m128i types, but I'm not sure if we have access to the same operations, or the speed implications.

Having looked into your library more, I think you've done a great job in restricting the MAX_INDEX and not to mention the use of traits 👏🏿 , very impressive. Do you think you could add support for the Vec type? If thats something you had in mind, I think it would be appropriate to sunset my Rust package. I will make yours the preferred library in my readme ⭐.

jvdd · 2022-12-28T09:00:40Z

Hi! Thx for the nice words and the insightful remarks :)

I created 2 issues concerning your remarks (& gave you credits for them).

I have just given given it some thought to use integer SIMD dtype for the index. I believe this should be possible (and thus reduce the bottleneck for f16 and perhaps f32 as well). The key is casting the mask to a SIMD dtype that can be used to blend the index SIMD dtype (e.g., _mm256_castps_si256)

minimalrust · 2023-01-02T02:41:47Z

Hi! Thx for the nice words and the insightful remarks :)

I created 2 issues concerning your remarks (& gave you credits for them).

I have just given given it some thought to use integer SIMD dtype for the index. I believe this should be possible (and thus reduce the bottleneck for f16 and perhaps f32 as well). The key is casting the mask to a SIMD dtype that can be used to blend the index SIMD dtype (e.g., _mm256_castps_si256)

You’ve got it! Thanks again for raising the original issue (and the new ones). I may be able to contribute to the changes too which would be nice.

P.S Happy new year 🎉

jvdd changed the title ~~SSIMD implementation for float returns index > final array index~~ SIMD implementation for float returns index > final array index Sep 23, 2022

minimalrust self-assigned this Dec 13, 2022

minimalrust added the bug Something isn't working label Dec 13, 2022

This was referenced Dec 28, 2022

♻️ use portable SIMD to simplify code jvdd/argminmax#8

Open

✨ implement ArgMinMax trait for other types jvdd/argminmax#9

Closed

minimalrust closed this as not planned Won't fix, can't repro, duplicate, stale Jan 2, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SIMD implementation for float returns index > final array index #8

SIMD implementation for float returns index > final array index #8

jvdd commented Sep 23, 2022

minimalrust commented Dec 13, 2022

jvdd commented Dec 13, 2022

jvdd commented Dec 13, 2022 •

edited

Loading

minimalrust commented Dec 21, 2022

minimalrust commented Dec 22, 2022 •

edited

Loading

jvdd commented Dec 28, 2022

minimalrust commented Jan 2, 2023

SIMD implementation for float returns index > final array index #8

SIMD implementation for float returns index > final array index #8

Comments

jvdd commented Sep 23, 2022

minimalrust commented Dec 13, 2022

jvdd commented Dec 13, 2022

jvdd commented Dec 13, 2022 • edited Loading

minimalrust commented Dec 21, 2022

minimalrust commented Dec 22, 2022 • edited Loading

jvdd commented Dec 28, 2022

minimalrust commented Jan 2, 2023

jvdd commented Dec 13, 2022 •

edited

Loading

minimalrust commented Dec 22, 2022 •

edited

Loading