-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve Guid v7 performance on Unix #106525
Conversation
@jeffhandley what would it take to reopen this and get perf improved for this new API on non-Windows targets? |
@EgorBot -intel -arm64 -perf using System;
using System.Runtime.CompilerServices;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkRunner.Run<Bench>(args: args);
public class Bench
{
[Benchmark]
public Guid Foo() => Guid.CreateVersion7();
} |
The performance improvement on Linux is 1-2%: EgorBot/runtime-utils#93 (comment) . It is the kind of improvement that I would expect from this change. (Unfortunately, we do not a quick automated way to run a micro-benchmark on macOS that you run it on.) Could you please share the exact sources for the micro-benchmark that you have used? |
It was this, with the comments removed it looks identical to yours: using System;
using BenchmarkDotNet;
using BenchmarkDotNet.Attributes;
namespace uuidv7.net
{
// [SimpleJob(BenchmarkDotNet.Jobs.RuntimeMoniker.Net80)]
[SimpleJob(BenchmarkDotNet.Jobs.RuntimeMoniker.Net90)]
[MemoryDiagnoser]
public class Benchmarks
{
// [Benchmark(Baseline = true)]
// public Guid Original() => new Guid(UUIDv7_v1.Generate());
// [Benchmark]
// public Guid vcsjones() => new UUIDv7_v2().AsGuid();
// [Benchmark]
// public Guid yaakov() => new UUIDv7_v3().AsGuid();
// [Benchmark]
// public Guid yaakov_with_vcsjones_improved_fill() => new UUIDv7_v4().AsGuid();
[Benchmark(Baseline = true)]
public Guid runtime() => Guid.CreateVersion7();
// [Benchmark]
// public Guid faster() => new UUIDv7_v5().AsGuid();
// [Benchmark]
// public Guid faster_localsinit() => new UUIDv7_v6().AsGuid();
}
} Program.Main is just: var config = DefaultConfig.Instance;
var summary = BenchmarkRunner.Run<Benchmarks>(config, args); Command used to run the benchmark: dotnet run -c release -- --coreRun "/Users/yaakov/Developer/GitHub/dotnet/runtime/artifacts/bin/testhost/net9.0-osx-Release-arm64/shared/Microsoft.NETCore.App/9.0.0/corerun" Output as of today with RC1:
|
My understanding that you PR does:
It is really hard to imagine it being 3x faster unless mac has something special for smaller size for urandom.. |
huh yeah, that is interesting. I just restested on Asahi Fedora (arm64) and it's only about a 3.5% improvement there. Let me check macOS x64... |
That's not how we test changes. Typically, we build corerun for baseline/main and then corerun for changes, then, a benchmark is ran as |
Interesting, I actually can reproduce the same 3-5X difference on my Macbook Pro M2 Max 😕 Self-contained benchmark: using System.Reflection;
using System.Runtime.CompilerServices;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkRunner.Run<Bench>(args: args);
public unsafe class Bench
{
delegate void GetRandomBytesDelegate(byte* pbBuffer, int count);
static GetRandomBytesDelegate GetRandomBytes;
static Bench()
{
MethodInfo getRandomBytesMethod = typeof(object).Assembly.GetType("Interop")!.GetMethod("GetRandomBytes",
BindingFlags.NonPublic | BindingFlags.Static);
GetRandomBytes = (GetRandomBytesDelegate)getRandomBytesMethod!.CreateDelegate(typeof(GetRandomBytesDelegate));
}
[Benchmark]
public void GetRandom16()
{
byte* data = stackalloc byte[16];
GetRandomBytes(data, 16);
Consume(data);
}
[Benchmark]
public void GetRandom10()
{
byte* data = stackalloc byte[10];
GetRandomBytes(data, 10);
Consume(data);
}
[MethodImpl(MethodImplOptions.NoInlining)]
static void Consume(byte* _){}
}
|
I get a 5x difference on macOS arm64 M2 Pro with that particular microbenchmark. What the... 🫢 |
I've taken a quick look at it via XCode Instruments: So apparently, length 10 goes through Or perhaps it's some sort of a queue (cache) of random values and with length=16 we drain it too fast? PS: Length >= 12 is where it becomes slow. PS2: Ah, perhaps Length>=12 has to be a strong random because it's used as key/iv in various AES etc while something smaller is not |
I had a quick look through Apple Open Source and the not-so-open crypto library source but can't find a branch like that. It seems to be fairly recent - my Intel Mac with macOS 12 is also only a small perf gain like on Linux. |
I can clearly see the branch (>=12) from the Xcode Instruments. @stephentoub @jkotas @vcsjones @bartonjs so do we want to accept an apple-specific improvement like this? it's based on an internal implementation detail (may be changed in future) + it looks like the api gives us, potentially, less cryptographically secure random for <12 bytes - it's probably still important for guid generation? My personal opinion that if someone hits a bottle-neck in guid generation, they should consider some adhoc solutions like incremental guids etc. Since macOS is typically not used on back-ends, I presume it's unlikely anyone will notice this improvement. |
|
||
private static unsafe Guid CreateRandomizedPartialVersion7() | ||
{ | ||
Guid g; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it ok for UUIDv7 that you use a non-initialized variable here so the upper 6 bytes will be zero or stack garbage? (since corelib is compiled with SkipLocalsInit)
For better or worse, folks rely on the cryptographic nature of this data with guid. Do we know that this implementation actually impacts the quality of the csprng? That'd be concerning separate from this change. |
That is just my blind guess given that the implementation is fully private. It's just that under 12 bytes the bottle-neck is Although, it's likely just cached random (generated by the slow path) for quick access. |
Apple's CoreCrypto is "open" source. The reason for the difference at 12 is a line like this:
Where I don't think the < 12 is any "less" random. It is just required to be fresh for NIST compliance reasons. |
You can see the same performance difference in using System.Reflection;
using System.Runtime.CompilerServices;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Security.Cryptography;
BenchmarkRunner.Run<Bench>(args: args);
public class Bench
{
[Benchmark]
public void GetRandom16()
{
Span<byte> data = stackalloc byte[16];
RandomNumberGenerator.Fill(data);
Consume(data);
}
[Benchmark]
public void GetRandom10()
{
Span<byte> data = stackalloc byte[10];
RandomNumberGenerator.Fill(data);
Consume(data);
}
[MethodImpl(MethodImplOptions.NoInlining)]
static void Consume(Span<byte> _){}
} Gives
So I do not believe the change here is doing anything meaningfully worse off. You can just get a fast-path now. |
I personally don't think this is worth the extra complexity. The perf "gain" here is up to 200ns on the latest Apple devices, which is highly unlikely to be a bottleneck to your application, even if you're generating billions of these UUIDs (in which case there are better ways to minimize the overhead anyways) then the overhead of the network requests to serialize the database, write these GUIDs to disk, or anything similar will cause this minor overhead to be lost as noise in comparison. Additionally, it will be Apple specific and will not solve such an "issue" on Linux where its taking 350-470ns for the same operation to complete on various hardware (nearly 2x slower than the |
|
Yes, I can observe that.
I see a lock is taken regardless of the cache being used or not ( |
On top of locks, caches, and reseeding, a CSPRNG behavior is rarely "predicable" when you are talking about nanoseconds. For better or for worse since |
Based on @vcsjones ' notes, I don't have a principled disagreement with the change... but I agree with @tannergooding 's "I personally don't think this is worth the extra complexity." Sure, this change, and 5000 others like it, could add up to saving 1ms on some operation somewhere; but as someone who is always slogging through OS-specific partials and the like, I'd generally happily sacrifice 200ns in a non-bottleneck function to avoid OS-specific code and/or a #if. |
Very interesting outcome there. I did not expect this to come down to an internal switch in Apple's clopen-source RNG. I completely understand the complexity issue. Thanks all! |
Draft PR for discussion.
Seemingly just skipping 6/16 bytes of random generation is enough to speed up performance by almost 3x, and I really have no idea why.
Benchmarks summary:
Where
Job-MJUZQJ
is this PR and.NET 9.0
is the public Preview 7 bits.Fixes #106377.