SIMD implementation for float returns index > final array index #8
Comments
Apologies, I've only just seen this issue. Firstly, it's great to hear you've found some use in the library; it was largely experimental, but I saw some pretty impressive speedups myself. I'll see what I can do over the next week to fix/explain what's causing the issue. P.S. I've been using an Apple M1 for most of the year, so working on this has become a lot harder. But fear not, you've given me good reason to get back into it 😄
Hi there! I tried fixing the bug but didn't have much luck. I'm still not sure what's causing it to persist :/ Anyway, I was really impressed by the speed of this library, so much so that it inspired me to start my own Rust project: https://github.com/jvdd/argminmax. This library focuses on optimizing the argmin and argmax operations. By the way, I mentioned your name and this repository in the description of the repo. I hope you don't mind!
I believe I have discovered the cause of the bug: an overflow issue. As you may know, I tackled this issue in my own library by implementing an "overflow-safe" outer loop over the SIMD loop. (The only limitation is that this outer loop becomes the bottleneck for 8-bit data types - to address this, I used horizontal SIMD operations to avoid the slower scalar code.) You can find more details on this topic here: https://stackoverflow.com/a/3793950.
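To make the "overflow-safe" outer loop concrete, here is a minimal scalar sketch of the pattern (not argminmax's actual code; `CHUNK` and `argmin_chunked` are illustrative names, and the inner `min_by` stands in for the vectorised kernel). The inner loop only ever sees local indices below `CHUNK`, so even a narrow lane type can represent them exactly; the outer loop recombines them with exact `usize` arithmetic.

```rust
/// Illustrative sketch of an overflow-safe outer loop (names are hypothetical).
/// CHUNK is kept well below 2^24, the largest range of integers that f32 can
/// represent exactly, so per-chunk indices never lose precision.
const CHUNK: usize = 1 << 16;

fn argmin_chunked(data: &[f32]) -> Option<usize> {
    if data.is_empty() {
        return None;
    }
    let mut best_idx = 0usize;
    let mut best_val = data[0];
    for (chunk_no, chunk) in data.chunks(CHUNK).enumerate() {
        // Stand-in for the SIMD inner kernel: local indices stay < CHUNK.
        // (NaN handling is omitted for brevity; partial_cmp would panic.)
        let (local_idx, &local_val) = chunk
            .iter()
            .enumerate()
            .min_by(|a, b| a.1.partial_cmp(b.1).unwrap())
            .unwrap();
        if local_val < best_val {
            best_val = local_val;
            // Exact usize arithmetic: no float index ever exceeds CHUNK.
            best_idx = chunk_no * CHUNK + local_idx;
        }
    }
    Some(best_idx)
}
```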
I love this, thanks for keeping the idea going, and I really appreciate the credit! I stuck with SSE instructions because apparently they put the least strain on your CPU, but I guess that should be left to the library user if they really want faster speeds. It's nice to see you've tackled the overflow issue in your own library.
This is a really interesting find. I always assumed the precision issues were after the decimal point 😅 so thank you for clarifying that. Here is a playground demonstrating the inaccuracy you've described. Unfortunately, because we use __m128 types within the SIMD loop, the index has to be tracked as f32 as well. There is a possibility we could use __m128i types, but I'm not sure if we have access to the same operations, or the speed implications. Having looked into your library more, I think you've done a great job in restricting the index arithmetic to ranges where it stays exact.
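For reference, the inaccuracy is easy to reproduce in a couple of lines: above 2^24, `f32` can no longer represent every integer, so two consecutive indices collapse to the same float value (the helper name below is mine, for illustration only).

```rust
/// Returns true if index `i` and index `i + 1` become indistinguishable
/// once stored as f32 - i.e. the index has lost integer precision.
fn indices_collide(i: u32) -> bool {
    (i as f32) == ((i + 1) as f32)
}

fn main() {
    // Below 2^24 every integer index is exactly representable in f32.
    assert!(!indices_collide(100));
    // At 2^24 = 16_777_216, consecutive indices collapse: 16_777_217
    // rounds back down to 16_777_216.0.
    assert!(indices_collide(16_777_216));
    println!("indices collide at 2^24: {}", indices_collide(16_777_216)); // prints "indices collide at 2^24: true"
}
```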
Hi! Thanks for the nice words and the insightful remarks :) I created 2 issues concerning your remarks (& gave you credit for them). I have given it some thought, and I believe using an integer SIMD dtype for the index should be possible (and thus reduce the bottleneck for f16 and perhaps f32 as well). The key is casting the mask to a SIMD dtype that can be used to blend the index SIMD dtype (e.g., casting the f32 comparison mask to __m128i and using it to blend two __m128i index vectors).
You've got it! Thanks again for raising the original issue (and the new ones). I may be able to contribute to the changes too, which would be nice. P.S. Happy new year 🎉
First of all thanks for open-sourcing this amazing library! The SIMD implementation gives me huge speedups 🚀
I believe I stumbled upon a bug while I was benchmarking the code. It seems that for large arrays, the `simd_f*.rs` implementation (thus for float arrays) returns an index value that is outside the range of valid array indexes. Reproducible example:

From my basic understanding, I think this might stem from changing precision when casting the `usize` index to `f32` in order to perform the SIMD operations? If so, would there be a way to fix this?

P.S.: I would love to help fix this issue (& understand properly why it persists), but my Rust + SIMD knowledge is rather lacking...
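As a hedged illustration of why an `f32`-tracked index goes wrong (a scalar model written for this discussion, not the library's actual code): once the running index reaches 2^24, adding `1.0` no longer changes it, so the reported position diverges from the true one for very large arrays. (The exact out-of-range symptom likely also involves how the SIMD lanes are combined, which this scalar model does not capture.)

```rust
/// Scalar model of the failure mode: the running index is kept as f32,
/// mimicking what the f32 SIMD lanes effectively do.
fn argmin_f32_index(data: &[f32]) -> usize {
    let mut best_val = f32::INFINITY;
    let mut best_idx = 0.0f32; // index tracked in f32, as in the SIMD lanes
    let mut i = 0.0f32;
    for &v in data {
        if v < best_val {
            best_val = v;
            best_idx = i;
        }
        // Beyond 2^24, `i + 1.0` rounds back to `i`, so the counter
        // silently stops incrementing.
        i += 1.0;
    }
    best_idx as usize
}
```

For an array longer than 2^24 elements whose minimum sits near the end, this returns 16_777_216 instead of the true position.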