Here's a summary of what's new in the .NET Runtime in this preview release:
- ARM64 Code Generation now adds ability to store operations
- Code Layout - RyuJIT's block reordering algorithm with a simpler, more global approach
- Loop Optimizations for code size reduction and performance improvements
- Reduced Address Exposure through RyuJIT improvements to better track usage of local variable addreses.
- AVX10v1 Support, a new SIMD instruction set from Intel
- Hardware Intrinsic Code Generation
- Constant Folding for Floating Point and SIMD Operations
Runtime updates in .NET 9 Preview 6:
- What's new in the .NET Runtime in .NET 9 documentation
.NET 9 Preview 6:
ARM64 supports multiple instructions for loading and storing data. For example, the str
instruction stores data from a single register to memory, while the stp
instruction stores data from a pair of registers. Leveraging stp
in lieu of str
allows us to accomplish the same task with fewer store operations, improving execution time.
For example, consider the following snippet:
class Body { public double x, y, z, vx, vy, vz, mass; }
static void Advance(double dt, Body[] bodies)
{
foreach (Body b in bodies)
{
b.x += dt * b.vx;
b.y += dt * b.vy;
b.z += dt * b.vz;
}
}
In the loop body, we are updating the values of b.x
, b.y
, and b.z
; notice how these members are contiguous in Body
's definition.
At the assembly level, we could store to each member with a str
instruction, or we could use stp
to handle two of the stores (b.x
and b.y
, or b.y
and b.z
-- the two members we are storing to with stp
have to be next to each other in memory) with one instruction.
Shaving off one instruction may seem like a small improvement, but if the above loop runs for a nontrivial number of iterations, the performance gains can add up quickly.
In this example, if we want to use the stp
instruction to store to b.x
and b.y
simultaneously, RyuJIT needs to be able to determine that the computations b.y + (dt * b.vy)
and b.x + (dt * b.vx)
are independent of one another, and can be performed before storing to b.x
and b.y
. With this ordering, RyuJIT will generate one instruction for the stores to b.x
and b.y
.
The opportunity to use stp
may seem obvious, but the decision to use a specific assembly instruction is made late in RyuJIT's compilation phases. Before we get to that point, RyuJIT, like many compilers, builds an architecture-agnostic internal representation of your program's logic to simplify optimizations performed later during compilation.
Applying architecture-specific optimizations, such as using stp
instead of str
, requires RyuJIT to tweak this representation such that during code generation, RyuJIT will recognize the pattern needed to apply the assembly-level optimization.
RyuJIT already has the ability to transform its representation of contiguous loads to leverage the ldp
(analog of stp
for loading values) instruction on ARM64, and Preview 6 extends this ability to store operations.
Here's what the assembly for storing to b.x
and b.y
used to look like:
ldr x3, [x0, w1, UXTW #3]
ldp d16, d17, [x3, #0x08]
ldp d18, d19, [x3, #0x20]
fmul d18, d0, d18
fadd d16, d16, d18
str d16, [x3, #0x08] // Here, we were storing to b.x
fmul d16, d0, d19
fadd d16, d17, d16
str d16, [x3, #0x10] // Here, we were storing to b.y
And here's what it looks like now:
ldr x3, [x0, w1, UXTW #3]
ldp d16, d17, [x3, #0x08]
ldp d18, d19, [x3, #0x20]
fmul d18, d0, d18
fadd d16, d16, d18
fmul d18, d0, d19
fadd d17, d17, d18
stp d16, d17, [x3, #0x08] // Now, we're storing to b.x and b.y with one instruction
Check out dotnet/runtime #102133 for more details on the implementation of this optimization!
Compilers typically reason about a program's control flow using basic blocks, where each block is a chunk of code that can only be entered at the first instruction, and exited via the last instruction; this means if a block ends with a branch instruction, control flow transfers to another block. In methods with linear control flow (such as one without any loops, if-statements, etc.), determining a reasonable basic block order is trivial: we can just lay out the blocks in execution order. But as we introduce more branches into the program's control flow, we give the compiler more possible block orderings to consider.
Why is the order of basic blocks important? One goal of block reordering is to reduce the number of branch instructions in the generated code by maximizing fallthrough behavior, or the property that each basic block is followed by its most-likely successor, such that it can "fall into" its successor without needing a jump. Reducing code size is usually ideal, but producing a block ordering where every block falls into its most-likely successor frequently isn't possible.
Consider the below snippet:
// BB01
var random = new Random();
int number = random.Next(100);
if (number < 99)
{
// BB02
Console.WriteLine("We're in BB02");
}
else
{
// BB03
Console.WriteLine("We're in BB03");
}
// BB04
Console.WriteLine("We're in BB04 -- the successor of BB02 and BB03");
There are two possible paths of execution: [BB01, BB02, BB04], and [BB01, BB03, BB04]. Since both BB02 and BB03 are succeeded by BB04, we cannot create fallthrough for both of them, so we have to introduce a branch on one of the paths to improve code quality for the other path. Based on the if-statement's condition, we know [BB01, BB02, BB04] is the more likely path (and we can leverage profile data during compilation to determine this -- more on that later), so [BB01, BB02, BB04, BB03] is probably a better ordering than [BB01, BB03, BB04, BB02]. Not only does this ordering reduce the number of branches on the most-likely path of execution, but it also has the nice property of pushing BB03 -- the coldest (i.e. least-frequently-executed) block -- to the end of the method. This increases the likelihood of the instructions on the hot path sharing a line of the CPU's instruction cache, thus reducing the likelihood of cache misses during program execution. When reordering blocks, compilers frequently consider maximizing fallthrough behavior and hot code density in tandem.
RyuJIT already reorders blocks, but until recently, was limited by the flowgraph implementation it uses and wasn't able take full advantage of the profiling infrastructure in dynamic PGO. Over the last several months, we've refactored RyuJIT's flowgraph data structures to remove various restrictions around block ordering, and to ingrain execution likelihoods into every control flow change between blocks. We've also invested into ensuring profile data is propagated and maintained as we transform the method's flowgraph. This has enabled us to replace RyuJIT's block reordering algorithm with a simpler, more global approach.
Using execution likelihoods derived from profile data or heuristics, we first build a reverse post-order traversal of the method's basic blocks, greedily placing the most-likely successor of each block next. This traversal has the nice property of ensuring each block is visited before its successor, unless the block's successor is behind it in a loop.
Using the earlier example, the reverse post-order makes the following decisions:
- BB01's possible successors are BB02 and BB03. BB02 is more likely, so place that next.
- BB04 is reachable from BB02 and BB03, so place BB04 at the end of the method.
Thus, our initial ordering is [BB01, BB02, BB03, BB04]. The initial traversal avoided creating any backward jumps, hence the suboptimal decision to place BB04 after BB03 instead of BB02. To improve upon this, for each block that doesn't fall into its most-likely successor, we compare the profile data of this block with the block that does fall into the shared successor, and move the successor up if the former block is sufficiently hot, thus pushing cold blocks further down the method. For our current layout, BB02 is the only block that doesn't fall into its most-likely successor (BB04), and since profile data suggests BB02 will execute in lieu of BB03 the overwhelming majority of the time, we should switch the order of BB03 and BB04. Thus, we get [BB01, BB02, BB04, BB03] as our final ordering.
This is a broad overview of the changes we've made to RyuJIT's flowgraph implementation for .NET 9. To get a deeper dive into the specifics of this effort, as well as some of the improvements we have planned for future releases, check out our flowgraph modernization meta-issue, dotnet/runtime #93020. As we iterate upon RyuJIT's PGO story, you can expect our new block reordering implementation to further improve code layout over time.
Loops in .NET code frequently use counter variables to control the number of iterations. In the idiomatic for (int i = ...)
pattern, the counter variable typically increases.
Consider the following example:
for (int i = 0; i < 100; i++)
{
Foo();
}
On many architectures, it is more performant to decrement the loop's counter, like so:
for (int i = 100; i > 0; i--)
{
Foo();
}
For the first example, we need to emit an instruction to increment i
, followed by an instruction to perform the i < 100
comparison, followed by a conditional jump to continue the loop if the condition is still true -- that's three instructions in total. However, if we flip the counter's direction, we need one fewer instruction.
For example, on x64, we can use the dec
instruction to decrement i
; when i
reaches zero, the dec
instruction sets a CPU flag that can be used as the condition for a jump instruction immediately following the dec
.
Here's what the x64 assembly looks like for the first example:
G_M35517_IG02: ;; offset=0x0005
xor ebx, ebx // Set i to 0
G_M35517_IG03: ;; offset=0x0007
call [Foo()]
inc ebx // Increment i
cmp ebx, 100 // Compare i to 100
jl SHORT G_M35517_IG03 // Continue loop if (i < 100)
And for the second example:
G_M35517_IG02: ;; offset=0x0005
mov ebx, 100 // Set i to 100
G_M35517_IG03: ;; offset=0x000A
call [Foo()]
dec ebx // Decrement i
jne SHORT G_M35517_IG03 // Continue loop if (i != 0)
The code size reduction is small, but if the loop runs for a nontrivial number of iterations, the performance improvement can be significant. RyuJIT now recognizes when the direction of a loop's counter variable can be flipped without affecting the program's behavior, and does the transformation. To learn more about this work, check out dotnet/runtime #102261.
When the address of a local variable is used, it is considered "address-exposed," and RyuJIT must take extra precautions when optimizing the method. For example, suppose RyuJIT is optimizing a method which passes the address of a local variable in a call to another method. Since the callee may use the address to access the local variable, RyuJIT will avoid transforming the variable when optimizing to maintain correctness. Addressed-exposed locals can significantly inhibit RyuJIT's optimization potential.
Consider the following example:
public struct Awaitable
{
public int Opts;
public Awaitable(bool value)
{
Opts = value ? 1 : 2;
}
}
[MethodImpl(MethodImplOptions.NoInlining)]
public static int Test() => new Awaitable(false).Opts;
In Test
, we create an Awaitable
instance, and then access its Opts
member. By implicitly taking the address of the Awaitable
instance to access Opts
, the instance is marked as address-exposed, blocking multiple optimization opportunities.
Here's what the x64 assembly for Test
looks like:
G_M59043_IG01: ;; offset=0x0000
push rax
G_M59043_IG02: ;; offset=0x0001
xor eax, eax
mov dword ptr [rsp], eax
mov dword ptr [rsp], 2 // Set Opts to 2
mov eax, dword ptr [rsp] // Return Opts
G_M59043_IG03: ;; offset=0x0010
add rsp, 8
ret
; Total bytes of code: 21
RyuJIT can now better track the usage of local variables' addresses, and avoid unnecessary address exposure. In this example, the more precise address exposure detection allows RyuJIT to replace usages of the Awaitable
instance with its Opts
field, and then eliminate Opts
altogether by evaluating the constant expression value ? 1 : 2
, where value
is known to be false
.
Here's what the updated x64 assembly looks like:
G_M59043_IG02: ;; offset=0x0000
mov eax, 2 // Return 2
G_M59043_IG03: ;; offset=0x0005
ret
Check out dotnet/runtime #102808 for more details.
New APIs have been added for AVX10, a new SIMD instruction set by Intel. You'll be able to accelerate your .NET applications on AVX10-enabled hardware with vectorized operations using the new System.Runtime.Intrinsics.X86.Avx10v1
APIs. Check out dotnet/runtime #101938.
Many hardware intrinsic APIs expect users to pass constant values for certain parameters, as these constants are encoded directly into the intrinsic's underlying instruction, rather than being loaded into registers or accessed from memory. If a constant isn't provided, the intrinsic will be replaced with a call to a fallback implementation that is functionally-equivalent, but slower.
Consider the following example:
static byte Test1()
{
Vector128<byte> v = Vector128<byte>.Zero;
byte size = 1;
v = Sse2.ShiftRightLogical128BitLane(v, size);
return Sse41.Extract(v, 0);
}
The usage of size
in the call to Sse2.ShiftRightLogical128BitLane
can be substituted with the constant 1
, and under normal circumstances, RyuJIT is already capable of this substitution optimization.
But when determining whether to generate the accelerated or fallback code for Sse2.ShiftRightLogical128BitLane
, RyuJIT would detect that a variable is being passed instead of a constant, and prematurely decide against intrisifying the call.
This is what the generated code used to look like:
; Method Program:Test1():ubyte (FullOpts)
G_M000_IG01: ;; offset=0x0000
sub rsp, 72
G_M000_IG02: ;; offset=0x0004
vxorps xmm0, xmm0, xmm0
vmovaps xmmword ptr [rsp+0x20], xmm0
lea rdx, [rsp+0x20]
lea rcx, [rsp+0x30]
mov r8d, 1
call [System.Runtime.Intrinsics.X86.Sse2:ShiftRightLogical128BitLane(System.Runtime.Intrinsics.Vector128`1[ubyte],ubyte):System.Runtime.Intrinsics.Vector128`1[ubyte]]
vmovaps xmm0, xmmword ptr [rsp+0x30]
vpextrb eax, xmm0, 0
G_M000_IG03: ;; offset=0x0030
add rsp, 72
ret
; Total bytes of code: 53
RyuJIT now recognizes many more cases like this example, and substitutes the variable argument with its constant value, thus generating the accelerated code.
Here's what the new code generation looks like for the above example:
; Method Program:Test1():ubyte (FullOpts)
G_M11031_IG01: ;; offset=0x0000
G_M11031_IG02: ;; offset=0x0000
vxorps xmm0, xmm0, xmm0
vpsrldq xmm0, xmm0, 1
vpextrb eax, xmm0, 0
G_M11031_IG03: ;; offset=0x000F
ret
; Total bytes of code: 16
Constant folding is an existing optimization in RyuJIT where expressions that can be computed at compile-time are replaced with the constants they evaluate to, thus eliminating computations at runtime.
Examples of the many new capabilities:
- Floating-point binary operations, where one of the operands is a constant:
x + NaN
is now folded intoNaN
.x * 1.0
is now folded intox
.x + -0
is now folded intox
.
- Hardware intrinsics, for example, assuming
x
is aVector<T>
:x + Vector<T>.Zero
is now folded intox
.x & Vector<T>.Zero
is now folded intoVector<T>.Zero
.x & Vector<T>.AllBitsSet
is now folded intox
.
Check out dotnet/runtime #103206 and dotnet/runtime #103143 for a deeper dive.