CpuMath Enhancement: Double-compute input elements in hardware intrinsics #836

Open
briancylui opened this issue Sep 5, 2018 · 0 comments
Labels
enhancement New feature or request P2 Priority of the issue for triage purpose: Needs to be fixed at some point. perf Performance and Benchmarking related up-for-grabs A good issue to fix if you are trying to contribute to the project

Comments

@briancylui
Contributor

Style changes needed to solve part of #823

Implementing "double-compute" is expected to make the hardware intrinsics more efficient.

Details (mostly from @tannergooding)

  • In src\Microsoft.ML.CpuMath\SseIntrinsics.cs and src\Microsoft.ML.CpuMath\AvxIntrinsics.cs, change the last loop of the existing 3-loop code pattern so that it does the following:
    1. Save the stored result (dstVector) from the last iteration of the vectorized code
    2. Move pDstCurrent back so that pDstCurrent + elementsPerIteration == pEnd
    3. Do a single vectorized iteration that covers the remaining elements
    4. Mix the saved result from the last iteration of the vectorized code with the result for the remaining elements
    5. Write the result

This generally results in more performant code, although the gain depends on the exact algorithm and the number of remaining elements.
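The five steps above can be sketched roughly as follows for an in-place scale. This is a minimal illustration and not the CpuMath code: the class name DoubleComputeTail, the method Scale, and the mask construction are assumptions made for the example. It assumes .NET Core 3.0 or later with unsafe code enabled, and it reloads the already-written values from memory where the steps talk about saving dstVector from the last iteration.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class DoubleComputeTail
{
    // Hypothetical helper: multiplies every element of dst by scale, in place.
    public static unsafe void Scale(float scale, Span<float> dst)
    {
        const int ElementsPerIteration = 4; // floats per 128-bit vector

        // Software path when SSE is missing or there is less than one full vector.
        if (!Sse.IsSupported || dst.Length < ElementsPerIteration)
        {
            for (int i = 0; i < dst.Length; i++) dst[i] *= scale;
            return;
        }

        fixed (float* pDst = dst)
        {
            float* pDstCurrent = pDst;
            float* pEnd = pDst + dst.Length;
            Vector128<float> scaleVector = Vector128.Create(scale);

            // Main vectorized loop over full 4-element blocks.
            while (pEnd - pDstCurrent >= ElementsPerIteration)
            {
                Vector128<float> dstVector = Sse.LoadVector128(pDstCurrent);
                Sse.Store(pDstCurrent, Sse.Multiply(dstVector, scaleVector));
                pDstCurrent += ElementsPerIteration;
            }

            int remainder = (int)(pEnd - pDstCurrent);
            if (remainder > 0)
            {
                // Steps 1-2: move back so the vector ends exactly at pEnd and
                // pick up the values there; the leading lanes are already scaled.
                pDstCurrent = pEnd - ElementsPerIteration;
                Vector128<float> saved = Sse.LoadVector128(pDstCurrent);

                // Step 3: one more vectorized iteration over the overlapping block.
                Vector128<float> computed = Sse.Multiply(saved, scaleVector);

                // Step 4: take "computed" only for lanes not processed before.
                int firstNew = ElementsPerIteration - remainder;
                Vector128<float> mask = Vector128.Create(
                    0 >= firstNew ? -1 : 0,
                    1 >= firstNew ? -1 : 0,
                    2 >= firstNew ? -1 : 0,
                    3 >= firstNew ? -1 : 0).AsSingle();
                Vector128<float> mixed = Sse.Or(
                    Sse.And(mask, computed),
                    Sse.AndNot(mask, saved)); // AndNot computes (~mask) & saved

                // Step 5: write the result.
                Sse.Store(pDstCurrent, mixed);
            }
        }
    }
}
```

The mix in step 4 matters for in-place operations: without it, the overlapping elements would be scaled twice.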

  • On handling unpadded parts in AVX intrinsics:

For some algorithms (like Sum), it is possible to “double-compute” a few elements in the beginning and end to have better overall performance. See the following pseudo-code:

if addr is not aligned
    tmp = unaligned load from addr
    tmp &= mask which zeroes elements after the first aligned address
    result = tmp
    move addr forward to the first aligned address

while addr is aligned and remaining bits >= 128
    result += aligned load
    addr += 128-bits

if any elements remain
    addr = endAddr - 128-bits
    tmp = unaligned load from addr
    tmp &= mask which zeroes elements already processed
    result += tmp

Sum the elements in result (using "horizontal add" or "shuffle and add")
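As a rough C# rendering of the pseudo-code above, the sketch below sums a span of floats with SSE. It is an illustration under assumptions, not the CpuMath implementation: the names DoubleComputeSum, Sum, and KeepLanes are hypothetical, the code requires .NET Core 3.0 or later with unsafe code enabled, and it falls back to a scalar loop when the input is shorter than one vector or the pointer is not 4-byte aligned (in which case a 16-byte boundary could never be reached).

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class DoubleComputeSum
{
    private const int Lanes = 4; // floats per 128-bit vector

    // Hypothetical helper: all-ones bits in lanes [start, end), zero elsewhere.
    private static Vector128<float> KeepLanes(int start, int end)
    {
        return Vector128.Create(
            (0 >= start && 0 < end) ? -1 : 0,
            (1 >= start && 1 < end) ? -1 : 0,
            (2 >= start && 2 < end) ? -1 : 0,
            (3 >= start && 3 < end) ? -1 : 0).AsSingle();
    }

    public static unsafe float Sum(ReadOnlySpan<float> src)
    {
        fixed (float* pSrc = src)
        {
            float* pCurrent = pSrc;
            float* pEnd = pSrc + src.Length;
            ulong address = (ulong)pSrc;

            // Software implementation for inputs the vector path cannot handle.
            if (!Sse.IsSupported || src.Length < Lanes || (address % sizeof(float)) != 0)
            {
                float total = 0;
                for (; pCurrent < pEnd; pCurrent++) total += *pCurrent;
                return total;
            }

            Vector128<float> result = Vector128<float>.Zero;

            // Unaligned start: load one vector, keep only the lanes that lie
            // before the first 16-byte boundary, then jump to that boundary.
            int misalignedBytes = (int)(address % 16);
            if (misalignedBytes != 0)
            {
                int head = (16 - misalignedBytes) / sizeof(float);
                Vector128<float> tmp = Sse.LoadVector128(pCurrent);
                result = Sse.And(tmp, KeepLanes(0, head));
                pCurrent += head;
            }

            // Aligned main loop.
            while (pEnd - pCurrent >= Lanes)
            {
                result = Sse.Add(result, Sse.LoadAlignedVector128(pCurrent));
                pCurrent += Lanes;
            }

            // Tail: reload the last full vector and zero the lanes already summed.
            int remaining = (int)(pEnd - pCurrent);
            if (remaining > 0)
            {
                Vector128<float> tmp = Sse.LoadVector128(pEnd - Lanes);
                result = Sse.Add(result, Sse.And(tmp, KeepLanes(Lanes - remaining, Lanes)));
            }

            // "Shuffle and add": fold the upper half onto the lower half,
            // then add lane 1 to lane 0.
            Vector128<float> folded = Sse.Add(result, Sse.MoveHighToLow(result, result));
            Vector128<float> lane1 = Sse.Shuffle(folded, folded, 0x55);
            return Sse.AddScalar(folded, lane1).ToScalar();
        }
    }
}
```

Because the lanes are accumulated separately, the floating-point rounding can differ slightly from a strict left-to-right scalar sum.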

So, your overall algorithm will probably look like:

if (Avx.IsSupported && (Length >= AvxLimit))
{
    // Process 256 bits at a time. There is a length limit because
    // 256-bit AVX instructions can cause the CPU to downclock.
    // The algorithm would be similar to the SSE pseudo-code.
}
else if (Sse.IsSupported && (Length >= SseLimit))
{
    // Pseudo-code algorithm given above

    // 128-bit instructions operate at full frequency and
    // don't downclock the CPU. We can only use them when
    // there are at least 128 bits of input, so we don't AV
    // (access violation).
}
else
{
    // Software Implementation
}

If you can’t “double-compute” for some reason, then you generally do the “software” processing for the beginning (to become aligned) and the end (to catch stray elements).
AvxLimit is generally a number that takes into account the “downclocking” that can occur with heavy 256-bit instruction usage.
SseLimit is generally 128 bits for algorithms where you can “double-compute”, and some profiled number for other algorithms.
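For comparison, the fallback without “double-compute” might look like the sketch below, which handles the unaligned beginning and the stray end in software and only the aligned middle with SSE. The class SoftwareEdges and the method SumOfSquares are hypothetical names chosen for illustration, and the choice of operation is arbitrary. It assumes .NET Core 3.0 or later with unsafe code enabled.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class SoftwareEdges
{
    // Hypothetical helper: sum of squares with scalar head and tail.
    public static unsafe float SumOfSquares(ReadOnlySpan<float> src)
    {
        float total = 0;

        fixed (float* pSrc = src)
        {
            float* pCurrent = pSrc;
            float* pEnd = pSrc + src.Length;

            // Software head: advance one element at a time until the
            // address is 16-byte aligned (or the input is exhausted).
            while (pCurrent < pEnd && ((ulong)pCurrent % 16) != 0)
            {
                float x = *pCurrent;
                total += x * x;
                pCurrent++;
            }

            if (Sse.IsSupported)
            {
                // Aligned vectorized middle.
                Vector128<float> acc = Vector128<float>.Zero;
                while (pEnd - pCurrent >= 4)
                {
                    Vector128<float> v = Sse.LoadAlignedVector128(pCurrent);
                    acc = Sse.Add(acc, Sse.Multiply(v, v));
                    pCurrent += 4;
                }

                // Spill the four partial sums and add them up.
                float* lanes = stackalloc float[4];
                Sse.Store(lanes, acc);
                total += lanes[0] + lanes[1] + lanes[2] + lanes[3];
            }

            // Software tail: whatever is left after the vector loop.
            while (pCurrent < pEnd)
            {
                float x = *pCurrent;
                total += x * x;
                pCurrent++;
            }
        }

        return total;
    }
}
```

Compared with the masked approach, this version can spend up to three scalar iterations at each end, which is where “double-compute” is expected to help.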

cc: @tannergooding since he suggested this approach.

@briancylui briancylui changed the title Double-compute input elements in hardware intrinsics CpuMath Enhancement: Double-compute input elements in hardware intrinsics Sep 6, 2018
@danmoseley danmoseley added the up-for-grabs A good issue to fix if you are trying to contribute to the project label Sep 6, 2018
@antoniovs1029 antoniovs1029 added enhancement New feature or request P2 Priority of the issue for triage purpose: Needs to be fixed at some point. perf Performance and Benchmarking related labels Jan 10, 2020
@Anipik Anipik removed their assignment Feb 19, 2020