Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update loops in CpuMath to be more efficient #1177

Merged
merged 4 commits into from
Oct 31, 2018

Conversation

jwood803
Copy link
Contributor

@jwood803 jwood803 commented Oct 6, 2018

Fixes issue #835

@Ivanidzo4ka
Copy link
Contributor

Did you run any benchmarks to make sure it bring speed improvements?

@danmoseley
Copy link
Member

@jwood803 the end to end benchmarks are documented here. They take a bit of time to run.

There are micro benchmarks for CPUMath specifically which might be best to start with. They are here

@danmoseley
Copy link
Member

danmoseley commented Oct 11, 2018

Oh, I just realized this is a duplicate of #994. That was not linked properly to #835.

As you see in #994, it is unfortunately not at all clear that this will actually help. See comment.

@tannergooding perhaps we should close #835 without action.

@@ -431,7 +431,7 @@ public static unsafe void AddScalarU(float scalar, Span<float> dst)

Vector128<float> scalarVector128 = Sse.SetAllVector128(scalar);

if (pDstCurrent + 4 <= pDstEnd)
if (pDstCurrent <= pDstEnd - 4)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

creating a temp for pDstEnd - 4 is probably better, as that would better ensure that each loop isn't doing pDstEnd - 4 before the comparison

@tannergooding
Copy link
Member

tannergooding commented Oct 11, 2018

@tannergooding perhaps we should close #835 without action.

I don't believe so, both PRs have the same problem, which is that each iteration of the loop may do a substraction/addition as part of the comparison.

As commented on #1177 (comment), creating a temp var pLoopEnd = pDstEnd - 4; and then using it in the comparison (while (pDstCurrent <= pLoopEnd)) will help ensure that the subtraction happens once, rather than once per loop iteration.

@jwood803
Copy link
Contributor Author

@danmosemsft Thanks a lot for letting me know about the benchmarks! I ran it two different times with the PredictionEngine benchmark. Below are the results. So far it looks like the changes provide a bit better performance.

Without changes:
image

With changes:
image

@@ -417,7 +417,7 @@ public static unsafe void AddScalarU(float scalar, Span<float> dst)

Vector128<float> scalarVector = Sse.SetAllVector128(scalar);

while (pDstCurrent + 4 <= pDstEnd)
while (pDstCurrent <= pDstEnd - 4)
Copy link
Member

@tannergooding tannergooding Oct 24, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you update the SSE case as well?

@@ -22,6 +22,8 @@ internal static class AvxIntrinsics

private const int Vector256Alignment = 32;

private const int destinationEnd = pDstEnd - 4;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is pDstEnd defined, I wouldn't expect this could be a constant....

@danmoseley
Copy link
Member

@tannergooding does this look OK now?

Nice perf improvement @jwood803 thank you!

@@ -417,6 +417,8 @@ public static unsafe void AddScalarU(float scalar, Span<float> dst)
{
float* pDstEnd = pdst + dst.Length;
float* pDstCurrent = pdst;
int destinationEnd = pDstEnd - 4;

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: stray newline here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be updated to remove the extra new line.

Copy link
Member

@tannergooding tannergooding left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@shauheen shauheen merged commit 71c9ff3 into dotnet:master Oct 31, 2018
@jwood803 jwood803 deleted the cpu-loops branch October 31, 2018 23:56
@eerhardt
Copy link
Member

eerhardt commented Nov 1, 2018

This was a bad change/merge

Error	CS0266	Cannot implicitly convert type 'float*' to 'int'. An explicit conversion exists (are you missing a cast?)	Microsoft.ML.CpuMath(netcoreapp3.0)	F:\git\machinelearning2\src\Microsoft.ML.CpuMath\SseIntrinsics.cs	758	Active
                float* pDstEnd = pdst + dst.Length;
                float* pDstCurrent = pdst;
                int destinationEnd = pDstEnd - 4;

pDstEnd is a float*. Subtracting 4 still gives you a float*, which is not convertable to int.

The Intrinsics/netcoreapp3.0 build is broken because of this.

@shauheen @danmosemsft - maybe now would be a good time to get a netcoreapp3.0 CI leg running.

I've opened #1495 for the build break.

@ghost ghost locked as resolved and limited conversation to collaborators Mar 28, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants