-
-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Minor improvement on the dataLength % 3
.
#114
Conversation
No change for x86, but save 2 `mov` instructions on x64.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Super 👍
Some nits. Otherwise LGTM.
source/gfoidl.Base64/Internal/Base64UrlEncoder/Base64UrlEncoder.Encode.cs
Outdated
Show resolved
Hide resolved
source/gfoidl.Base64/Internal/Base64UrlEncoder/Base64UrlEncoder.Encode.cs
Outdated
Show resolved
Hide resolved
source/gfoidl.Base64/Internal/Base64UrlEncoder/Base64UrlEncoder.Encode.cs
Outdated
Show resolved
Hide resolved
@@ -231,8 +231,30 @@ private static int GetNumBase64PaddingCharsAddedByEncode(int dataLength) | |||
// 0 -> 0 | |||
// 1 -> 2 | |||
// 2 -> 1 | |||
uint mod3 = FastMod3((uint)dataLength); | |||
return mod3 == 0 ? 0 : (int)(3 - mod3); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice if the JIT can emit cmov
here, so it gets branchless. Unfortunately there is no "trick" to force the JIT to do so, except some very obscure hacks (which we shouldn't use here).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am curious to know the obscure hacks...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not exactely a cmov
(which would be desired), but still a branchless version with some bit-twiddling and the knowledge that true / false
is represented as 1 / 0
-> snippet in sharplab
It the stack isn't used, it would be even better.
The main advantage will be when the branch-predictor can't work reliable due to "random" inputs, i.e. unknown patterns for mod3 == 0
. Then this approach may gain some perf.
Contrast this with clang, so it could be pretty concise branchless code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting: if the bool
is "casted" to byte
instead of int
the stack-usage goes away and code looks pretty good.
Can you give this a try with the above benchmark? If it has potential to be faster, we should go with such a approach (although state otherwise above 😉).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tried with the BranchlessSelect
casted as byte. The results are interesting:
Method | Value | Mean | Error | StdDev | Median | Ratio | RatioSD |
---|---|---|---|---|---|---|---|
Mod3 | 64 | 2.607 ns | 0.0991 ns | 0.2890 ns | 2.519 ns | 1.00 | 0.00 |
FastMod3 | 64 | 2.208 ns | 0.0491 ns | 0.1351 ns | 2.165 ns | 0.85 | 0.11 |
FastMod3_Branchless | 64 | 2.611 ns | 0.0502 ns | 0.0445 ns | 2.606 ns | 1.07 | 0.06 |
FastMod3_BranchlessReversed | 64 | 2.299 ns | 0.0066 ns | 0.0051 ns | 2.300 ns | 0.94 | 0.05 |
Mod3 | 65 | 2.357 ns | 0.0494 ns | 0.0462 ns | 2.350 ns | 1.00 | 0.00 |
FastMod3 | 65 | 2.108 ns | 0.0411 ns | 0.0365 ns | 2.094 ns | 0.89 | 0.02 |
FastMod3_Branchless | 65 | 2.622 ns | 0.0626 ns | 0.0615 ns | 2.607 ns | 1.11 | 0.04 |
FastMod3_BranchlessReversed | 65 | 2.313 ns | 0.0467 ns | 0.0437 ns | 2.298 ns | 0.98 | 0.02 |
Mod3 | 66 | 2.157 ns | 0.0346 ns | 0.0289 ns | 2.166 ns | 1.00 | 0.00 |
FastMod3 | 66 | 1.920 ns | 0.0396 ns | 0.0471 ns | 1.918 ns | 0.89 | 0.02 |
FastMod3_Branchless | 66 | 2.637 ns | 0.0588 ns | 0.0577 ns | 2.631 ns | 1.23 | 0.04 |
FastMod3_BranchlessReversed | 66 | 2.334 ns | 0.0617 ns | 0.0606 ns | 2.327 ns | 1.09 | 0.03 |
The 'branchless' is not as efficient as expected .
When reverting the branch BranchlessSelect(mod3 != 0, 3 - mod3, 0)
instead of BranchlessSelect(mod3 == 0, 0, 3 - mod3)
, the result are better but still under the branched one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the numbers. 👍
Some thoughts -- but I don't know what matches here...
The effect of cmov
et.al. depends on the distribution of taken / not taken branches, i.e. on how effecient the branch predictor can work. If the branch predictor does a good job, then the "branchy code" may be more efficient, as the pipeline can speculatively process the following steps, whereas with cmov
the rest of the pipeline needs to wait for the result of cmov
.
The code for the branchless version is quite a bit, so it may be more efficient just to use simple and compact code with a branch -- which can be predicted, as in the benchmarks where for constant input length the result will always be the same.
I'm pretty sure that there will be a benchmark constellation where the branchless version will be quite a bit faster.
In effect I think we should go with the code as is. It's easy and not hacky...and as the numbers show quite good. Furthermore I doubt a slight improvement here will show any effect on overall perf.
Once the runtime (JIT) will detect a pattern like return cond ? a : b
and emit better code (maybe with tiering) we'll gain the perf-win for free.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More or less just for fun I tried a vectorized approach to make it branchless. I show it here because you are curios 😉
Benchmark is laid out so that (hopefully) the branch-predictor won't do a good job.
| Method | Mean | Error | StdDev | Ratio | RatioSD |
|------------ |----------:|----------:|----------:|------:|--------:|
| Default | 1.0953 ns | 0.0213 ns | 0.0199 ns | 1.00 | 0.00 |
| Branchless | 0.8785 ns | 0.0136 ns | 0.0127 ns | 0.80 | 0.02 |
| Vectorized | 0.9657 ns | 0.0210 ns | 0.0186 ns | 0.88 | 0.02 |
| Vectorized1 | 0.8840 ns | 0.0156 ns | 0.0146 ns | 0.81 | 0.02 |
Benchmark-code
using System;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;
using BenchmarkDotNet.Attributes;
#if !DEBUG
using BenchmarkDotNet.Running;
#endif
namespace ConsoleApp4
{
class Program
{
static void Main(string[] args)
{
var bench = new Bench();
int mod3 = 0;
Console.WriteLine(bench.Default(mod3));
Console.WriteLine(bench.Branchless(mod3));
Console.WriteLine(bench.Vectorized(mod3));
Console.WriteLine(bench.Vectorized1(mod3));
Console.WriteLine();
mod3 = 1;
Console.WriteLine(bench.Default(mod3));
Console.WriteLine(bench.Branchless(mod3));
Console.WriteLine(bench.Vectorized(mod3));
Console.WriteLine(bench.Vectorized1(mod3));
Console.WriteLine();
mod3 = 2;
Console.WriteLine(bench.Default(mod3));
Console.WriteLine(bench.Branchless(mod3));
Console.WriteLine(bench.Vectorized(mod3));
Console.WriteLine(bench.Vectorized1(mod3));
#if !DEBUG
BenchmarkRunner.Run<Bench>();
#endif
}
}
public class Bench
{
[Benchmark(Baseline = true, OperationsPerInvoke = 3)]
public int Default()
{
int s = 0;
for (int i = 0; i < 3; ++i)
{
s += this.Default(i);
}
return s;
}
[Benchmark(OperationsPerInvoke = 3)]
public int Branchless()
{
int s = 0;
for (int i = 0; i < 3; ++i)
{
s += this.Branchless(i);
}
return s;
}
[Benchmark(OperationsPerInvoke = 3)]
public int Vectorized()
{
int s = 0;
for (int i = 0; i < 3; ++i)
{
s += this.Vectorized(i);
}
return s;
}
[Benchmark(OperationsPerInvoke = 3)]
public int Vectorized1()
{
int s = 0;
for (int i = 0; i < 3; ++i)
{
s += this.Vectorized1(i);
}
return s;
}
public int Default(int mod3) => mod3 == 0 ? 0 : 3 - mod3;
public int Branchless(int mod3) => BranchlessSelect(mod3 != 0, 3 - mod3, 0);
public int Vectorized(int mod3)
{
Vector128<int> vMod3 = Vector128.CreateScalarUnsafe(mod3);
Vector128<int> zero = Vector128<int>.Zero;
Vector128<int> valueIfTrue = Vector128.CreateScalarUnsafe(3 - mod3);
Vector128<int> eqZero = Sse2.CompareEqual(vMod3, zero);
Vector128<int> neqZero = Sse2.CompareEqual(eqZero, zero);
Vector128<int> result = Sse2.And(neqZero, valueIfTrue);
return Sse2.ConvertToInt32(result);
}
public int Vectorized1(int mod3)
{
Vector128<int> vMod3 = Vector128.CreateScalarUnsafe(mod3);
Vector128<int> zero = Vector128<int>.Zero;
Vector128<int> valueIfTrue = Vector128.CreateScalarUnsafe(3 - mod3);
Vector128<int> eqZero = Sse2.CompareEqual(vMod3, zero);
Vector128<int> result = Sse2.AndNot(eqZero, valueIfTrue);
return Sse2.ConvertToInt32(result);
}
[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static int BranchlessSelect(bool condition, int valueIfTrue, int valueIfFalse)
{
return ((-Unsafe.As<bool, byte>(ref condition)) & (valueIfTrue - valueIfFalse)) + valueIfFalse;
}
}
}
Use `Environment.Is64BitProcess` instead of sizeof(IntPtr) == 8 in unsafe context.
…r.Encode.cs Co-Authored-By: Günther Foidl <gue@korporal.at>
Co-Authored-By: Günther Foidl <gue@korporal.at>
Looks like CI has an assert failure in this test Base64/tests/gfoidl.Base64.Tests/Internal/Base64UrlEncoderTests/GetEncodedLength.cs Line 75 in 9e4584b
|
I will take a look at this issue |
The assert failure is caused by a test, which hits Base64/source/gfoidl.Base64/Internal/Base64UrlEncoder/Base64UrlEncoder.Encode.cs Line 245 in 2e74e9b
The fix is quite simple... |
source/gfoidl.Base64/Internal/Base64UrlEncoder/Base64UrlEncoder.Encode.cs
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM -- if you don't have anything else for this PR I'll merge it later today.
source/gfoidl.Base64/Internal/Base64UrlEncoder/Base64UrlEncoder.Encode.cs
Show resolved
Hide resolved
source/gfoidl.Base64/Internal/Base64UrlEncoder/Base64UrlEncoder.Encode.cs
Show resolved
Hide resolved
source/gfoidl.Base64/Internal/Base64UrlEncoder/Base64UrlEncoder.Encode.cs
Outdated
Show resolved
Hide resolved
@ycrumeyrolle thank you! |
Description
No change for x86, but save 3
mov
instructions on x64 with theGetEncodedLength()
method.This is based on the PR dotnet/runtime#406
No such improvement that in the dotnet runtime as it is a modulo with a constant.
Benchmarks