
Include rsqrt function for Vec{4,Float32} #57

Open
non-Jedi opened this issue Sep 10, 2019 · 12 comments

@non-Jedi
Contributor

llvm.x86.sse.rsqrt.ps corresponds to the _mm_rsqrt_ps intrinsic, which can greatly speed up calculations in some cases. It would be very convenient if this were included in SIMD.jl (see discussion here: JuliaPerf/BenchmarksGame.jl#44 (comment)). I see two barriers to this:

  1. rsqrt is not a function in Julia Base like the other functions defined in this package. To me, this seems like a non-issue, but it does expand the conceptual scope of this package.
  2. rsqrt could only be defined for Vec{4,Float32}. The current architecture of the package doesn't cleanly allow this. While llvmins could be defined for Float32 but not for Float64, there's no way of defining it for length 4 vectors but not for length 2. Perhaps this makes "(Also currently missing: Type conversions, reinterpretation that changes the vector size)" a prerequisite for including something like this.

Thoughts?
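For concreteness, calling the intrinsic directly from Julia might look roughly like this. This is a hand-rolled sketch using `Base.llvmcall`, not SIMD.jl code; it is x86/SSE-specific and will fail to compile on other architectures:

```julia
# Sketch: invoke llvm.x86.sse.rsqrt.ps (the LLVM intrinsic behind _mm_rsqrt_ps)
# on a 4 x Float32 vector. x86-only; the function name `rsqrt` is illustrative.
const Float32x4 = NTuple{4, Base.VecElement{Float32}}

@inline rsqrt(v::Float32x4) = Base.llvmcall(
    ("declare <4 x float> @llvm.x86.sse.rsqrt.ps(<4 x float>)",
     """%res = call <4 x float> @llvm.x86.sse.rsqrt.ps(<4 x float> %0)
        ret <4 x float> %res"""),
    Float32x4, Tuple{Float32x4}, v)
```

The hardware instruction guarantees only about 12 bits of precision (relative error at most 1.5 × 2⁻¹²), so results are close to, but not equal to, `1/sqrt`.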

@KristofferC
Collaborator

KristofferC commented Sep 10, 2019

Just so we know what we are getting into when starting to add intrinsics: https://software.intel.com/sites/landingpage/IntrinsicsGuide/. Start scrolling :)

@eschnett
Owner

In principle I'd be happy to add support for functions that are not in Base. However, I'd like to have a reasonably performing fall-back that works on all architectures (and for all vector types and sizes).

What I don't like about the rsqrt intrinsic is that it only calculates an approximation, i.e. it is much less accurate than 1/sqrt. Since I understand the need for performance, and since there's nothing wrong with SIMD.jl experimenting with things that might (or might not) be added to Base later, I'd also be happy to do that. However, the name should then be something like approx_rsqrt (please start bikeshedding).

The way to go about this would be:

  • add a function approx_rsqrt to SIMD
  • provide a generic implementation via 1/sqrt
  • provide a specific implementation for Intel, Float32, vector lengths 4 or 8, depending on what LLVM provides. I don't know how to check for this, but this should be determined automatically.

For bonus points:

  • provide an implementation for Intel and Float64 via explicit type conversions (no need to introduce generic mechanisms here)
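The generic fallback in the plan above could be a one-liner. This is only a sketch of the idea (the name `approx_rsqrt` is the one proposed above, not an existing SIMD.jl function), written here against plain tuples of `VecElement`, which is the representation SIMD.jl lowers to; the real version would dispatch on `SIMD.Vec`:

```julia
# Accurate, architecture-independent fallback for the proposed approx_rsqrt.
approx_rsqrt(x::AbstractFloat) = one(x) / sqrt(x)

# Elementwise form for a tuple-of-VecElement "vector".
approx_rsqrt(v::NTuple{N, Base.VecElement{T}}) where {N, T<:AbstractFloat} =
    ntuple(i -> Base.VecElement(one(T) / sqrt(v[i].value)), Val(N))
```

On architectures where LLVM provides the intrinsic, a specialized method would shadow this fallback for `Float32` vectors of the supported lengths.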

@non-Jedi
Contributor Author

provide a generic implementation via 1/sqrt

In my opinion, it would be better to only define the method for Vec{4,Float32}. This way, people don't mistakenly think they're getting the performance benefit of _mm_rsqrt_ps when using it with different Vec sizes or types. But if you'd prefer to have the fallback available for symmetry, I can understand that desire.

I suppose it's also worthwhile to provide versions for lengths 1-3 that just fill the remainder of the Vec with 1f0 and then drop the extra values on return.

@eschnett
Owner

Yes, I'd rather have a function that works anywhere. This will allow e.g. people running tests on any kind of machine. You could provide a mechanism to tell callers whether there is an efficient implementation, e.g. by overloading a function.

approx_is_fast(f) = false
approx_is_fast(::typeof(approx_rsqrt)) = true

Fallbacks for other vector lengths would also be a good idea. There could also be fallbacks for longer vectors that split them into 4-wide or 8-wide chunks.
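The trait sketch above, completed with a generic fallback definition so it runs end to end (all names are the proposed ones from this thread, not existing SIMD.jl API):

```julia
# Generic fallback: accurate but not using the fast hardware approximation.
approx_rsqrt(x::AbstractFloat) = one(x) / sqrt(x)

# Trait telling callers whether a fast approximate implementation exists.
# The generic default answers false; methods with a fast path opt in.
approx_is_fast(f) = false
approx_is_fast(::typeof(approx_rsqrt)) = true
```

A caller can then branch on `approx_is_fast(approx_rsqrt)` to decide, e.g., whether extra refinement steps are needed.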

@non-Jedi
Contributor Author

Are you comfortable taking a dependency on https://github.com/m-j-w/CpuId.jl? That would be much cleaner than reimplementing CPU-feature detection in a cross-platform way.

@eschnett
Owner

I think CpuId.jl only works for Intel CPUs.

Can you instead call LLVM with that intrinsic in a try block, and see if it fails? That would be the easiest way to check whether an intrinsic is supported.
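The try-based probe suggested here could look like this minimal sketch (the helper name `intrinsic_supported` is assumed; the idea is that compiling a call to an unsupported intrinsic raises an error, which the `try` block turns into a feature flag):

```julia
# Probe whether a candidate implementation compiles and runs on this machine.
# Calling it forces compilation; an unsupported LLVM intrinsic raises an error.
function intrinsic_supported(f, args...)
    try
        f(args...)
        return true
    catch
        return false
    end
end
```

The result would typically be computed once at package load time and used to select which method of `approx_rsqrt` to define.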

@smallnamespace

smallnamespace commented Sep 11, 2019

FYI, LLVM will automatically convert inverse-sqrt IR into rsqrt_ss (scalar float) and rsqrt_ps (4x float) assembly as long as SSE and fast-math are enabled, and it automatically includes one Newton-Raphson refinement step for roughly single-precision accuracy (it may lose a couple of ulps). So the intrinsic plus one refinement step isn't much less accurate, as long as the limited exponent range and the different handling of inf/0 are fine for the user.

Refining to double precision takes two NR steps; it's questionable whether that's favorable on Skylake and later, where sqrt throughput has improved. This SO post is a useful summary.

On the other hand, making rsqrt emit bare intrinsic to LLVM allows caller to control how much refinement occurs.
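The Newton-Raphson refinement being discussed is the standard iteration for y ≈ 1/sqrt(x): each step computes y ← y·(3/2 − x·y²/2), roughly doubling the number of correct bits. A minimal sketch:

```julia
# One Newton-Raphson refinement step for an approximation y of 1/sqrt(x).
nr_rsqrt_step(x, y) = y * (1.5f0 - 0.5f0 * x * y * y)
```

A caller getting the bare intrinsic result could apply this step zero, one, or two times depending on the accuracy they need, which is exactly the control the bare-intrinsic option provides.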

@eschnett Is there a preference between bare intrinsic or depending on LLVM fastmath conversion?

@eschnett
Owner

As mentioned above, mapping rsqrt to the bare intrinsic would be misleading due to its inaccuracy. I'd be happy with either or both of an rsqrt that is accurate (preferably letting LLVM handle the details) and an approx_rsqrt that maps to the bare intrinsic, with a fallback that works on all architectures.

@smallnamespace

Related: JuliaLang/julia#33220

If this is fixed, I believe SIMD.jl should automatically emit vectorized rsqrt for 4x floats?
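The fast-math route being referenced can be demonstrated in a few lines: with `@fastmath`, LLVM is permitted to lower `1/sqrt` to the rsqrt approximation plus refinement on x86 (whether it actually does depends on the target and Julia version; this sketch only shows the user-facing form):

```julia
# With fast-math enabled, LLVM may lower inv(sqrt(x)) to rsqrt + a
# Newton-Raphson step; without it, an exact sqrt and division are emitted.
fast_rsqrt(x::Float32) = @fastmath inv(sqrt(x))
```

Either way the result is accurate to roughly single precision, which is why letting the compiler handle it is attractive compared with exposing the raw 12-bit intrinsic.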

@eschnett
Owner

I'm not sure any more what the intent behind this PR is:
(1) Ensure that the expression 1/sqrt is vectorized efficiently, using e.g. the llvm.x86.sse.rsqrt.ps intrinsic with an additional Newton iteration
(2) Give people access to the raw llvm.x86.sse.rsqrt.ps intrinsic, so that they can calculate fast but inaccurate inverse square roots

Either are good goals, but we shouldn't mix up the discussions.

@smallnamespace

smallnamespace commented Sep 12, 2019

Ah yes, sorry for crossing the streams.

From the user's standpoint, it's probably(?) best to let Julia/LLVM do the optimization (with fast-math as the hint). Exposing the bare intrinsic as approx_rsqrt here is a stopgap.

Will take a crack at a PR soon.

@nlw0
Contributor

nlw0 commented Jan 16, 2021

I'm surprised the approximate rsqrt instruction could be emitted "automatically" even with the fast-math flag on. In my opinion it should be treated as a very special instruction, emitted only via intrinsics or via some special, dedicated high-level function that always uses it.

I do think it's OK to replace it with the accurate-but-slow version for the sake of compatibility.
