Add native ARM64 popcount for sizeof(T) above 1 to TensorPrimitives #103214

Merged: 1 commit into dotnet:main on Jun 12, 2024


@neon-sunset neon-sunset commented Jun 9, 2024

I verified that the codegen for sizeof(T) == 1 in InvokeSpanIntoSpan.Vectorized128 is identical before and after the change, so whatever difference shows up in the Bytes numbers is most likely noise (from non-temporal stores? I found the numbers unstable even at 256 KiB of data).

For sizeof(T) == 8, I also tested going from Vector128 into Vector64x2 with vaddvq_u32 and back into Vector128, but it was slower than continuing to brute-force.
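For readers unfamiliar with the idiom, here is a minimal sketch of what a brute-force ARM64 popcount for 64-bit elements can look like: a per-byte CNT followed by a chain of pairwise widening adds. `PopCountSketch` and `PopCount64` are hypothetical names used only for illustration, and this is not necessarily the exact code merged in this PR. The scalar fallback is an addition so the sketch also runs on non-ARM hardware.

```csharp
using System;
using System.Numerics;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;

// Hypothetical helper for illustration; not the code merged in this PR.
public static class PopCountSketch
{
    // Returns the popcount of each of the two 64-bit lanes of x.
    public static Vector128<ulong> PopCount64(Vector128<ulong> x)
    {
        if (AdvSimd.IsSupported)
        {
            // CNT: popcount of each of the 16 bytes.
            Vector128<byte> perByte = AdvSimd.PopCount(x.AsByte());

            // Pairwise widening adds: 16 x u8 -> 8 x u16 -> 4 x u32 -> 2 x u64.
            // Stopping after the first or second step would give the
            // per-element counts for 16-bit or 32-bit elements instead.
            Vector128<ushort> per16 = AdvSimd.AddPairwiseWidening(perByte);
            Vector128<uint> per32 = AdvSimd.AddPairwiseWidening(per16);
            return AdvSimd.AddPairwiseWidening(per32);
        }

        // Scalar fallback (assumption: added so the sketch runs anywhere).
        return Vector128.Create(
            (ulong)BitOperations.PopCount(x.GetElement(0)),
            (ulong)BitOperations.PopCount(x.GetElement(1)));
    }
}
```

As a sanity check, passing `Vector128.Create(0xAAAAAAAAAAAAAAAAUL, 0xFFUL)` should yield lanes of 32 and 8, since those are the numbers of set bits in each input.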

Environment:

BenchmarkDotNet v0.13.12, macOS Sonoma 14.5 (23F79) [Darwin 23.5.0]
Apple M1 Pro, 1 CPU, 8 logical and 8 physical cores
.NET SDK 9.0.100-preview.6.24307.18
  [Host]     : .NET 9.0.0 (9.0.24.30702), Arm64 RyuJIT AdvSIMD
  DefaultJob : .NET 9.0.0 (9.0.24.30702), Arm64 RyuJIT AdvSIMD

Benchmark:

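using System.Linq;
using System.Numerics.Tensors;
using System.Runtime.InteropServices;
using BenchmarkDotNet.Attributes;

public class TensorPopCount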
{
    const int Length = 262_144;
    static readonly byte[] sink = new byte[Length];
    static readonly byte[] bytes = Enumerable.Repeat((byte)0b10101010, Length).ToArray();
    static readonly ushort[] shorts = MemoryMarshal.Cast<byte, ushort>(bytes).ToArray();
    static readonly uint[] ints = MemoryMarshal.Cast<byte, uint>(bytes).ToArray();
    static readonly ulong[] longs = MemoryMarshal.Cast<byte, ulong>(bytes).ToArray();

    [Benchmark]
    public void Bytes() => TensorPrimitives.PopCount<byte>(bytes, sink);

    [Benchmark]
    public void Shorts() => TensorPrimitives.PopCount(shorts, MemoryMarshal.Cast<byte, ushort>(sink));

    [Benchmark]
    public void Ints() => TensorPrimitives.PopCount(ints, MemoryMarshal.Cast<byte, uint>(sink));

    [Benchmark]
    public void Longs() => TensorPrimitives.PopCount(longs, MemoryMarshal.Cast<byte, ulong>(sink));
}

Current:

| Method | Mean | Error | StdDev |
|--------|-----------|-----------|-----------|
| Bytes | 5.689 us | 0.0738 us | 0.0690 us |
| Shorts | 19.246 us | 0.0215 us | 0.0190 us |
| Ints | 19.368 us | 0.0264 us | 0.0221 us |
| Longs | 71.893 us | 0.1026 us | 0.0960 us |

PR:

| Method | Mean | Error | StdDev |
|--------|----------|-----------|-----------|
| Bytes | 5.809 us | 0.1144 us | 0.1447 us |
| Shorts | 3.660 us | 0.0237 us | 0.0185 us |
| Ints | 4.297 us | 0.0463 us | 0.0387 us |
| Longs | 5.360 us | 0.0150 us | 0.0117 us |
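To make the comparison explicit, the speedup is the Current mean divided by the PR mean. A small sketch of that arithmetic follows; `Speedup`, `Results`, and `Ratio` are hypothetical names, and the numbers are copied from the two tables above.

```csharp
using System;

// Hypothetical helper for illustration only.
public static class Speedup
{
    // Mean times in microseconds, copied from the "Current" and "PR" tables.
    public static readonly (string Method, double Current, double Pr)[] Results =
    {
        ("Bytes", 5.689, 5.809),
        ("Shorts", 19.246, 3.660),
        ("Ints", 19.368, 4.297),
        ("Longs", 71.893, 5.360),
    };

    // Values above 1.0 mean the PR is faster than the current code.
    public static double Ratio(double current, double pr) => current / pr;
}
```

Iterating over `Speedup.Results` and calling `Speedup.Ratio` gives roughly 0.98x for Bytes (within the noise described above), 5.3x for Shorts, 4.5x for Ints, and 13.4x for Longs.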

@dotnet-policy-service bot added the community-contribution label Jun 9, 2024
Tagging subscribers to this area: @dotnet/area-system-numerics
See info in area-owners.md if you want to be subscribed.

@stephentoub stephentoub left a comment

LGTM. Thanks. @tannergooding ?

@tannergooding tannergooding merged commit f5ab8d0 into dotnet:main Jun 12, 2024
81 of 83 checks passed
@neon-sunset neon-sunset deleted the tensor-arm64-cnt branch June 12, 2024 19:21
@github-actions github-actions bot locked and limited conversation to collaborators Jul 13, 2024
Labels: area-System.Numerics, community-contribution