-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
JIT: Enable EVEX embedded broadcast in more places #109258
Conversation
I expected there might be a couple of diffs from this one. Wild guess is that spmi failed because the embedded encoding diffs would have something like |
@MihuBot -intel |
CC. @dotnet/jit-contrib for secondary review |
@BruceForstall PTAL for code review. |
/azp run runtime-coreclr superpmi-diffs |
Azure Pipelines successfully started running 1 pipeline(s). |
Ah, you fixed spmi 🎉 ASM diffs generated on windows x64linux arm64Diffs are based on 2,324,354 contexts (942,241 MinOpts, 1,382,113 FullOpts). No diffs found. DetailsContext information
linux x64Diffs are based on 2,352,092 contexts (925,689 MinOpts, 1,426,403 FullOpts). Overall (+39 bytes)
MinOpts (+12 bytes)
FullOpts (+27 bytes)
Example diffscoreclr_tests.run.linux.x64.checked.mch+0 (0.00%) : 1629.dasm - System.Number+Grisu3:GetCachedPowerForBinaryExponentRange(int,int,byref):System.Number+DiyFp (Instrumented Tier1)
+0 (0.00%) : 6145.dasm - System.Number:Dragon4(ulong,int,uint,ubyte,int,ubyte,System.Span`1[ubyte],byref):uint (Tier1)
+0 (0.00%) : 6305.dasm - testout1:Func_0_1_2(testout1+VT_0_1_2):long (MinOpts)
+0 (0.00%) : 556112.dasm - TestApp:test_3_5(double):double (FullOpts)
+2 (+0.16%) : 514325.dasm - Packet256Tracer:CreateDefaultScene():Scene (FullOpts)
+1 (+0.18%) : 505847.dasm - VectorMathTests.Program:TestEntryPoint():int (FullOpts)
libraries.pmi.linux.x64.checked.mch+0 (0.00%) : 9181.dasm - Microsoft.FSharp.Collections.ArrayModule:RandomChoiceBy[short](Microsoft.FSharp.Core.FSharpFunc`2[Microsoft.FSharp.Core.Unit,double],short[]):short (FullOpts)
+0 (0.00%) : 9185.dasm - Microsoft.FSharp.Collections.ArrayModule:RandomChoiceBy[long](Microsoft.FSharp.Core.FSharpFunc`2[Microsoft.FSharp.Core.Unit,double],long[]):long (FullOpts)
+0 (0.00%) : 9201.dasm - Microsoft.FSharp.Collections.ArrayModule:RandomSampleBy[System.__Canon](Microsoft.FSharp.Core.FSharpFunc`2[Microsoft.FSharp.Core.Unit,double],int,System.__Canon[]):System.__Canon[] (FullOpts)
+0 (0.00%) : 275964.dasm - System.Threading.Barrier:SignalAndWait(System.TimeSpan,System.Threading.CancellationToken):ubyte:this (FullOpts)
+0 (0.00%) : 276656.dasm - System.Text.RegularExpressions.RegexRunner:g__ConfigureTimeout|24_0(System.TimeSpan):this (FullOpts)
+0 (0.00%) : 279404.dasm - Microsoft.CodeAnalysis.Collections.SegmentedList`1[ubyte]:TrimExcess():this (FullOpts)
libraries_tests.run.linux.x64.Release.mch+0 (0.00%) : 27377.dasm - System.Threading.Tasks.Task:Delay(System.TimeSpan,System.Threading.CancellationToken):System.Threading.Tasks.Task (Instrumented Tier1)
+0 (0.00%) : 27389.dasm - System.Threading.Tasks.Task:Delay(System.TimeSpan,System.TimeProvider,System.Threading.CancellationToken):System.Threading.Tasks.Task (Instrumented Tier1)
+0 (0.00%) : 40657.dasm - Microsoft.CodeAnalysis.CSharp.Binder:DoUncheckedConversion(byte,Microsoft.CodeAnalysis.ConstantValue):System.Object (Instrumented Tier0)
+0 (0.00%) : 817256.dasm - System.Threading.PortableThreadPool:AdjustMaxWorkersActive():this (Tier1)
+6 (+4.26%) : 557002.dasm - System.Runtime.Intrinsics.Tests.Vectors.Vector256Tests:ConvertToInt32NativeTest():this (Tier0)
+6 (+4.35%) : 557115.dasm - System.Runtime.Intrinsics.Tests.Vectors.Vector128Tests:ConvertToInt32NativeTest():this (Tier0)
libraries_tests_no_tiered_compilation.run.linux.x64.Release.mch+0 (0.00%) : 805.dasm - Xunit.DelegatingLongRunningTestDetectionSink:ThreadWorker():this (FullOpts)
+0 (0.00%) : 4377.dasm - Microsoft.CodeAnalysis.Shared.TestHooks.AsynchronousOperationListenerProvider+NullOperationListener:Delay(System.TimeSpan,System.Threading.CancellationToken):System.Threading.Tasks.Task`1[ubyte]:this (FullOpts)
+0 (0.00%) : 23653.dasm - Tests.System.TimeProviderTests+TestExtensionsTaskFactory:WaitAsync(System.Threading.Tasks.Task,System.TimeSpan,System.TimeProvider,System.Threading.CancellationToken):System.Threading.Tasks.Task:this (FullOpts)
+6 (+0.85%) : 205695.dasm - System.Runtime.Intrinsics.Tests.Vectors.Vector128Tests:ConvertToInt32NativeTest():this (FullOpts)
+4 (+1.17%) : 205320.dasm - System.Runtime.Intrinsics.Tests.Vectors.Vector512Tests:Vector512SingleSumTest():this (FullOpts)
+4 (+1.32%) : 205212.dasm - System.Runtime.Intrinsics.Tests.Vectors.Vector512Tests:Vector512DoubleSumTest():this (FullOpts)
libraries.pmi.linux.x64.checked.mch+0 (0.00%) : 9181.dasm - Microsoft.FSharp.Collections.ArrayModule:RandomChoiceBy[short](Microsoft.FSharp.Core.FSharpFunc`2[Microsoft.FSharp.Core.Unit,double],short[]):short (FullOpts)
+0 (0.00%) : 9185.dasm - Microsoft.FSharp.Collections.ArrayModule:RandomChoiceBy[long](Microsoft.FSharp.Core.FSharpFunc`2[Microsoft.FSharp.Core.Unit,double],long[]):long (FullOpts)
+0 (0.00%) : 9201.dasm - Microsoft.FSharp.Collections.ArrayModule:RandomSampleBy[System.__Canon](Microsoft.FSharp.Core.FSharpFunc`2[Microsoft.FSharp.Core.Unit,double],int,System.__Canon[]):System.__Canon[] (FullOpts)
+0 (0.00%) : 275964.dasm - System.Threading.Barrier:SignalAndWait(System.TimeSpan,System.Threading.CancellationToken):ubyte:this (FullOpts)
+0 (0.00%) : 276656.dasm - System.Text.RegularExpressions.RegexRunner:g__ConfigureTimeout|24_0(System.TimeSpan):this (FullOpts)
+0 (0.00%) : 279404.dasm - Microsoft.CodeAnalysis.Collections.SegmentedList`1[ubyte]:TrimExcess():this (FullOpts)
smoke_tests.nativeaot.linux.x64.checked.mch+0 (0.00%) : 20750.dasm - System.Convert:ToInt32(double):int (FullOpts)
+0 (0.00%) : 20778.dasm - System.Collections.Hashtable:rehash(int):this (FullOpts)
+0 (0.00%) : 19151.dasm - System.Threading.ProcessorIdCache:ProcessorNumberSpeedCheck():ubyte (FullOpts)
+0 (0.00%) : 19711.dasm - System.Collections.HashHelpers:IsPrime(int):ubyte (FullOpts)
+0 (0.00%) : 21060.dasm - System.Number:Dragon4(ulong,int,uint,ubyte,int,ubyte,System.Span`1[ubyte],byref):uint (FullOpts)
+0 (0.00%) : 21156.dasm - System.Number+Grisu3:GetCachedPowerForBinaryExponentRange(int,int,byref):System.Number+DiyFp (FullOpts)
DetailsSize improvements/regressions per collection
PerfScore improvements/regressions per collection
Context information
jit-analyze outputcoreclr_tests.run.linux.x64.checked.mch
Detail diffs
libraries.pmi.linux.x64.checked.mch
Detail diffs
libraries_tests.run.linux.x64.Release.mch
Detail diffs
libraries_tests_no_tiered_compilation.run.linux.x64.Release.mch
Detail diffs
libraries.pmi.linux.x64.checked.mch
Detail diffs
smoke_tests.nativeaot.linux.x64.checked.mch
Detail diffs
osx arm64Diffs are based on 933,703 contexts (390,234 MinOpts, 543,469 FullOpts). MISSED contexts: 1 (0.00%) No diffs found. DetailsContext information
windows arm64Diffs are based on 383,129 contexts (21,722 MinOpts, 361,407 FullOpts). No diffs found. DetailsContext information
windows x64Diffs are based on 54,760 contexts (14 MinOpts, 54,746 FullOpts). Overall (+0 bytes)
FullOpts (+0 bytes)
Example diffsrealworld.run.windows.x64.checked.mch+0 (0.00%) : 1325.dasm - BepuPhysics.BatchCompressor:Compress(BepuUtilities.Memory.BufferPool,BepuUtilities.IThreadDispatcher,ubyte):this (FullOpts)
+0 (0.00%) : 2601.dasm - SixLabors.ImageSharp.Formats.Png.PngEncoderOptionsHelpers:CalculateBitDepth[SixLabors.ImageSharp.PixelFormats.Rgba32](SixLabors.ImageSharp.Formats.Png.PngEncoderOptions,SixLabors.ImageSharp.IndexedImageFrame`1[SixLabors.ImageSharp.PixelFormats.Rgba32]):ubyte (FullOpts)
+0 (0.00%) : 2725.dasm - SixLabors.ImageSharp.Formats.Gif.GifEncoderCore:Encode[SixLabors.ImageSharp.PixelFormats.Rgba32](SixLabors.ImageSharp.Image`1[SixLabors.ImageSharp.PixelFormats.Rgba32],System.IO.Stream,System.Threading.CancellationToken):this (FullOpts)
+0 (0.00%) : 6012.dasm - System.Security.Cryptography.X509Certificates.ChainPal:BuildChain(ubyte,System.Security.Cryptography.X509Certificates.ICertificatePal,System.Security.Cryptography.X509Certificates.X509Certificate2Collection,System.Security.Cryptography.OidCollection,System.Security.Cryptography.OidCollection,int,int,System.Security.Cryptography.X509Certificates.X509Certificate2Collection,int,System.DateTime,System.TimeSpan,ubyte):System.Security.Cryptography.X509Certificates.IChainPal (FullOpts)
+0 (0.00%) : 6308.dasm - System.DateTimeParse:DoStrictParse(System.ReadOnlySpan`1[ushort],System.ReadOnlySpan`1[ushort],int,System.Globalization.DateTimeFormatInfo,byref):ubyte (FullOpts)
+0 (0.00%) : 19380.dasm - Microsoft.Cci.FullMetadataWriter:Create(Microsoft.CodeAnalysis.Emit.EmitContext,Microsoft.CodeAnalysis.CommonMessageProvider,ubyte,ubyte,ubyte,ubyte,System.Threading.CancellationToken):Microsoft.Cci.MetadataWriter (FullOpts)
smoke_tests.nativeaot.windows.x64.checked.mch+0 (0.00%) : 17661.dasm - System.Convert:ToInt32(double):int (FullOpts)
+0 (0.00%) : 17869.dasm - System.Number:Dragon4(ulong,int,uint,ubyte,int,ubyte,System.Span`1[ubyte],byref):uint (FullOpts)
+0 (0.00%) : 16106.dasm - System.Threading.ProcessorIdCache:ProcessorNumberSpeedCheck():ubyte (FullOpts)
+0 (0.00%) : 17930.dasm - System.Number+Grisu3:GetCachedPowerForBinaryExponentRange(int,int,byref):System.Number+DiyFp (FullOpts)
+0 (0.00%) : 16747.dasm - System.Collections.HashHelpers:IsPrime(int):ubyte (FullOpts)
DetailsSize improvements/regressions per collection
PerfScore improvements/regressions per collection
Context information
jit-analyze outputrealworld.run.windows.x64.checked.mch
Detail diffs
smoke_tests.nativeaot.windows.x64.checked.mch
Detail diffs
|
FWIW, I think this is a net negative change code quality wise since the current broadcast logic only handles constant vectors. An aligned full-width vector load is cheaper than a broadcast, so I don't think you'd normally do it unless optimizing for code size. This does, however, make it possible to test embedded broadcast encoding of more instructions (I added it while testing the GFNI implementation for correctness) and makes it easier to improve centrally later. I'd like to look at containing normal data broadcasts, where containment actually saves a broadcast instruction -- unless someone else has that planned. |
That's not the case for embedded broadcasts. Both Intel and AMD document full-width memory loads and embedded broadcast memory loads as being the same cost, sites such as Additionally, it also helps cache locality and other factors, so it is typically viewed as a net win and is the preferred default. |
We already contain some of these as well, when they come from memory: For example: public Vector128<float> M(float* data)
=> Vector128<float>.Zero + Avx.BroadcastScalarToVector128(data); generates: L0000: vzeroupper
L0003: vxorps xmm0, xmm0, xmm0
L0007: vaddps xmm0, xmm0, [r8]{1to4}
L000d: vmovups [rdx], xmm0
L0011: mov rax, rdx
L0014: ret There's likely additional patterns that need to be recognized. For example, |
Good to know, thanks! |
Factored out the containment checks and included an attempt to use embedded broadcast where possible.