Add option to change SVE vector length for current and children processes #101295

SwapnilGaikwad · 2024-04-19T14:36:43Z

Currently, coreclr assumes an SVE vector length of 128 bits, which limits the size of Vector<T> to 16 bytes.
On a platform offering a larger vector length, such as 256 bits, the instructions operate on the larger registers.
This leads to incorrect results, e.g., when using unzip even (the uzp1 instruction).
Here C# expects processing based on half of the vector (128 bits), while the actual result is based on the full vector (256 bits).

Add a DOTNET_MaxVectorLength=N flag, where N is the desired SVE vector length in bytes for the current execution.
Let M be the max/current vector length and N the vector length specified with the DOTNET_MaxVectorLength option; then V is the new vector length for the current execution.

If N < M, N % 16 == 0 (a valid vector length), V = N
If N < M, N % 16 != 0 (an invalid vector length), V = M
If N > M (N can be a valid or invalid length), V = M
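The three rules above collapse into a single check. A minimal sketch, assuming N and M are byte counts and N is nonzero; the name `ResolveVectorLength` is hypothetical and not part of the PR:

```cpp
#include <cstdint>

// Hypothetical helper illustrating the rules above; not code from the PR.
// requested (N) and maxLength (M) are SVE vector lengths in bytes.
// Assumes requested is nonzero; the rules above do not define that case.
inline uint32_t ResolveVectorLength(uint32_t requested, uint32_t maxLength)
{
    // Only a smaller request that is a multiple of 16 bytes changes the length.
    if (requested < maxLength && requested % 16 == 0)
    {
        return requested;
    }
    // Invalid requests, and requests at or above M, keep the current length.
    return maxLength;
}
```

For example, on a 32-byte (256-bit) machine a request of 16 yields 16, while requests of 24 or 64 both yield 32.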


dotnet-policy-service · 2024-04-19T14:37:09Z

Tagging subscribers to this area: @mangod9
See info in area-owners.md if you want to be subscribed.

SwapnilGaikwad · 2024-04-19T14:37:13Z

@kunalspathak @a74nh @dotnet/arm64-contrib @arch-arm64-sve

SwapnilGaikwad · 2024-04-19T14:38:13Z

This patch is required for #101294 while running on a system that offers SVE vector length greater than 128.

a74nh · 2024-04-19T14:47:42Z

This PR fixes the issues I've been seeing implementing AddAcross.

We've both been using V1 machines. On an N2 this PR is not required, and won't cause any effects.

src/coreclr/inc/clrconfigvalues.h

tannergooding · 2024-04-19T14:49:59Z

src/coreclr/vm/codeman.cpp

@@ -1525,6 +1528,18 @@ void EEJitManager::SetCpuInfo()

    if (((cpuFeatures & ARM64IntrinsicConstants_Sve) != 0) && CLRConfig::GetConfigValue(CLRConfig::EXTERNAL_EnableArm64Sve))
    {
+#if defined(TARGET_LINUX) && (defined(DEBUG) || defined(_DEBUG))


Why is this restricted to DEBUG only?

I also expect that we need a path for Windows and should likely treat SVE as unsupported if the vector length is larger and restricting the size (via prctl or equivalent on other OS) fails.

This option is added for development purposes only. It's primarily added to aid implementation of the API and its testing on the 256-bit SVE-enabled V1 system that we have access to. Once Vector<T> uses the entire vector length, it may become redundant and be removed.

I don't think it being for development purposes only is a good thing long term and it probably doesn't match up with the intent of DOTNET_MaxVectorTBitWidth.

We don't really want to disable SVE on hardware with 256-bit vectors just because a user has said they want 128-bit vectors, using the OS feature like prctl should be a much better option so that they can still get SVE usage in that scenario.

I also expect that we need a path for Windows and should likely treat SVE as unsupported if the vector length is larger and restricting the size (via prctl or equivalent on other OS) fails.

I will check with Windows team on what is the equivalent API.

Kunal asked me to share how this works in Windows (I developed the SVE support in Windows for the next release). We felt that being able to dynamically change the vector length at runtime would be messy, since there's no save/restore mechanism for these vector length changes, hence you could get into situations where for example some code calls into a library, the library decreases the VL for some reason, but then has a bug where under certain conditions it fails to restore the vector length, so after the caller gains control again it would be running with a decreased vector length indefinitely, etc.

We do recognize the need to change the vector length for various purposes (testing, perf, compat, etc.), so we did add a CreateProcess parameter that allows the vector length to be specified during process creation, but after a process has been created, its vector length cannot be changed. There are also registry keys that can be set to change the vector length on a per-process basis (IFEO settings), or for all processes in the system, but these registry-based methods are typically used for development and testing only.

By default, we'll use the highest VL supported by the system and the underlying hypervisor. So the above mechanisms will only be necessary when you want to decrease the VL for processes.

Happy to explore this topic more with you.

Thanks for the context @JasonLinMS.

I definitely agree that changing this during the lifetime of the process is messy/problematic and the intent in general is to not allow that for .NET either.

I think it's probably worth us having a short meeting (.NET, Windows, and Arm) to see if we can discuss things here and see if we can find something that generally works for everyone. CC. @jkotas

In the case of .NET, we would really only need the capability to set this once as part of our own startup before any user code has executed. We don't have any intent to change this dynamically (although there is the SME feature that makes this a little bit more complicated) and the ideal for user code is to write size agnostic algorithms so that it doesn't matter what size the hardware actually supports.

However, the official Arm64 SVE/Vector ABIs (https://github.com/ARM-software/abi-aa?tab=readme-ov-file#abi-for-the-arm-64-bit-architecture-with-sve-support) do define the ability to say a given API expects a particular size and there may be cases where a user needs to fix it themselves (potentially just for testing purposes or giving users a workaround for a bug). As such, the ability for a process to set the size for itself is beneficial, especially if that can be shared across all Arm64 capable operating systems.

Yes, a meeting to discuss this sounds good.

Once other feedback is addressed, this change can go in without us having to wait for Windows support.

I think this should also check the other requirements, such as that it should be in 128-bit increments only, that the maximum should be 2048, and such.
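The requested check could look roughly like the following. This is only a sketch: the function name is hypothetical, and the bounds reflect the SVE architecture's rule that vector lengths are multiples of 128 bits from 128 up to 2048 bits.

```cpp
#include <cstdint>

// Hypothetical validation helper; not code from the PR.
// SVE vector lengths are multiples of 128 bits, from 128 up to 2048 bits.
inline bool IsValidSveVectorBitWidth(uint32_t bits)
{
    constexpr uint32_t kMinBits = 128;
    constexpr uint32_t kMaxBits = 2048;
    return bits >= kMinBits && bits <= kMaxBits && (bits % 128) == 0;
}
```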

we did add a CreateProcess parameter that allows the vector length to be specified during process creation, but after a process has been created, its vector length cannot be changed.

This is a very sensible choice. In Linux, changing the vector length inside a library and not restoring it before returning is generally seen as undefined behaviour. OpenJDK doesn't want to trust that, and after calling external routines it inserts checks to confirm the vector length hasn't changed. Neither of which is ideal.
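The kind of post-call check described above can be sketched as follows. This is not OpenJDK's or coreclr's code; `readVectorLength` is a hypothetical stand-in for however a runtime reads the current vector length (for example a CNTB or RDVL helper).

```cpp
#include <cstdint>
#include <functional>

// Hypothetical sketch; not OpenJDK or coreclr code.
// readVectorLength returns the current SVE vector length in bytes.
inline bool CallAndVerifyVectorLength(const std::function<uint64_t()>& readVectorLength,
                                      const std::function<void()>& externalCall)
{
    const uint64_t before = readVectorLength();
    externalCall();
    // True when the callee left the vector length as it found it.
    return readVectorLength() == before;
}
```

A caller would treat a `false` result as a broken contract by the external routine, which is why the check costs something on every call.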

However, the official Arm64 SVE/Vector ABIs (https://github.com/ARM-software/abi-aa?tab=readme-ov-file#abi-for-the-arm-64-bit-architecture-with-sve-support) do define the ability to say a given API expects a particular size

The PCS isn't clear if the vector length must remain fixed. I've raised a bug against it here. It would be good to have a clear statement.

For coreclr, if changing the vector length is only for debugging then a wrapper script/binary that just launches coreclr with the correct vector length might be good enough. That would work for Windows and Linux. Ideally I'd still like to keep this PR's mechanism for use in Linux.
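On Linux, such a wrapper might look roughly like this. It is a sketch under assumptions: the function names are hypothetical, the `#define` fallbacks mirror the values in the Linux UAPI header `<linux/prctl.h>`, and the prctl call only succeeds on an arm64 kernel with SVE support.

```cpp
#include <cstdio>
#include <cstdlib>
#include <sys/prctl.h>
#include <unistd.h>

// Fallbacks for older headers; values match the Linux UAPI <linux/prctl.h>.
#ifndef PR_SVE_SET_VL
#define PR_SVE_SET_VL 50
#endif
#ifndef PR_SVE_VL_INHERIT
#define PR_SVE_VL_INHERIT (1 << 17)
#endif

// Builds the argument for prctl(PR_SVE_SET_VL, ...): the length in bytes plus
// PR_SVE_VL_INHERIT, so the setting is kept across execve().
inline unsigned long BuildSveSetVlArg(unsigned long vectorLengthBytes)
{
    return vectorLengthBytes | PR_SVE_VL_INHERIT;
}

// Hypothetical launcher: sets the vector length, then replaces this process
// with the target command. Returns only on failure.
inline int LaunchWithVectorLength(unsigned long vectorLengthBytes, char* const argv[])
{
    if (prctl(PR_SVE_SET_VL, BuildSveSetVlArg(vectorLengthBytes), 0UL, 0UL, 0UL) < 0)
    {
        perror("prctl(PR_SVE_SET_VL)");
        return EXIT_FAILURE;
    }
    execvp(argv[0], argv);
    perror("execvp");
    return EXIT_FAILURE;
}
```

Because the length is set before the target process starts, nothing in that process can have observed a different vector length.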

src/coreclr/vm/codeman.cpp

jkotas · 2024-04-24T12:42:16Z

src/coreclr/vm/codeman.cpp

+        int maxVectorLength = (maxVectorTBitWidth >> 3);
+
+        // Limit the SVE vector length to 'maxVectorLength' if the underlying hardware offers longer vectors.
+        if ((prctl(PR_SVE_GET_VL, 0,0,0,0) & PR_SVE_VL_LEN_MASK) > maxVectorLength)


This will set the SVE vector length for the current thread only. It won't set it for threads that have been started already. Is that correct?

How does it work for new threads? Do new threads inherit SVE vector length of the parent thread or do new threads inherit SVE vector length of the process?

Are we sure that none of the libraries that have been initialized by this point have cached the vector length?

https://www.man7.org/linux/man-pages/man2/prctl.2.html has an explicit warning to use this API only if you know what you are doing.

It seems that allowing the SVE vector length to be set only before process start is the only reliable option. Setting it here may break all sorts of random things. I am not convinced that we know what we are doing by setting it here.

Do you mean in the case of something like the C runtime or in the case of something else hosting the CLR?

Are you thinking the only valid thing for us to do here is to fail to launch for an SVE mismatch (AOT) and to just disable SVE usage otherwise (JIT)?

Right, this is obviously not compatible with hosting. (We have no reliable way to tell whether we are hosted.)

Even without hosting and external libraries outside our control in the picture, there are a number of our own threads created in the process by this point (PAL, EventPipe) that will have the wrong size configured. It is hard to guarantee that none of these threads is going to wander into managed code.

Are you thinking the only valid thing for us to do here is to fail to launch for an SVE mismatch (AOT)

I do not see a better option. It suggests that the design we are working with is questionable since it does not work well for AOT.

just disable SVE usage otherwise (JIT)

Yes, if the JIT is not able to accommodate the SVE length that the process was started with.

This will set the SVE vector length for the current thread only. It won't set it for threads that have been started already. Is that correct?

Correct. There shouldn't be any other threads running at this point? (it's possible this code is called much later than I expected or there are hosting scenarios I'm not familiar with)

How does it work for new threads? Do new threads inherit SVE vector length of the parent thread or do new threads inherit SVE vector length of the process?

It will use the parent thread vector length.

All other SVE state of a thread, including the currently configured vector length, the state of the PR_SVE_VL_INHERIT flag, and the deferred vector length (if any), is preserved across all syscalls, subject to the specific exceptions for execve() described in section 6.
In particular, on return from a fork() or clone(), the parent and new child process or thread share identical SVE configuration, matching that of the parent before the call.
sve.rst

Are we sure that none of the libraries that have been initialized by this point have cached the vector length?

Assuming this is happening early in the process, there should be no use of SVE at all yet. Vector length can be easily read using an instruction (eg CNT). But we can't guarantee what a library might do.

https://www.man7.org/linux/man-pages/man2/prctl.2.html has an explicit warning to use this API only if you know what you are doing.

It seems that allowing the SVE vector length to be set only before process start is the only reliable option. Setting it here may break all sorts of random things. I am not convinced that we know what we are doing by setting it here.

If this is only used for debugging/testing and never used in production, then I lean towards it's fine. Anything more then maybe not.

Are you thinking the only valid thing for us to do here is to fail to launch for an SVE mismatch (AOT) and to just disable SVE usage otherwise (JIT)?

Or should DOTNET_MaxVectorTBitWidth be X86 only?

If this is only used for debugging/testing and never used in production, then I lean towards it's fine. Anything more then maybe not.

In general these switches are primarily there for debugging/testing purposes. However, they also generally exist as a way to disable or limit intrinsic support if a library is found to have a blocking bug.

It's not great that we can't help set up the process to achieve success, but it's also not the end of the world and is something we can ideally give user guidance around.

Or should DOTNET_MaxVectorTBitWidth be X86 only?

I think it's fine for us to respect it still, that's functionally what AOT compiled for a particular SVE size would have to do after all. It's just a different way to disable SVE support.

The documented switches have to be reliable. DOTNET_MaxVectorTBitWidth is a documented switch.

It does not sound like this switch can be reliable. That means it should have a different name, and ideally be a debug-only switch. We do not want to be dealing with inscrutable crashes caused by different parts of the process being configured to different vector sizes.

It does not sound like this switch can be reliable.

It's still reliable and working as documented, even with this change in direction. The switch was intentionally named MaxVectorTBitWidth because such complications could exist. All that's changed is that instead of us setting sizeof(Vector<T>) based on min(SveLength, DOTNET_MaxVectorTBitWidth), we simply set it based on (SveLength > DOTNET_MaxVectorTBitWidth) ? 16 : SveLength.

So, this minor change in direction is really no different than us limiting the maximum bit width to 256 by default on x64 hardware or not considering 512-bits on certain first gen AVX512 hardware unless the users also opt into a hidden undocumented switch.

Which is to say, it still simply represents the maximum size a user wants to support (defaulting to 0 which means the system can decide). It can be smaller if the system doesn't support the size specified.
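To make the difference between the two formulas concrete, here is a small sketch. The function names are hypothetical, all lengths are expressed in bytes, and the switch is assumed to be nonzero (the default of 0, meaning the system decides, is left out).

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical helpers; lengths are in bytes and maxBytes is assumed nonzero.

// Earlier direction: size Vector<T> to the smaller of the two lengths.
inline uint32_t VectorTBytesClamped(uint32_t sveBytes, uint32_t maxBytes)
{
    return std::min(sveBytes, maxBytes);
}

// Revised direction: use the full SVE length, or fall back to 16 bytes.
inline uint32_t VectorTBytesAllOrNothing(uint32_t sveBytes, uint32_t maxBytes)
{
    return (sveBytes > maxBytes) ? 16 : sveBytes;
}
```

The two only disagree when the hardware length exceeds a limit that is itself above 16 bytes: with a 64-byte SVE length and a 32-byte limit, the first gives 32 and the second gives 16.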

As proposed in this PR, it does more than just setting the sizeof(Vector<T>). PR_SVE_SET_VL call makes it unreliable.

It impacts the global state of the thread and process in a way that may be incompatible with other components in the process. That is what makes it unreliable. It is guaranteed to be broken for CoreCLR hosting scenarios, and it may have issues without hosting too based on the documentation. It is very hard to audit what is loaded in the process.

I agree that it would be ok if the switch set sizeof(Vector<T>) only without calling PR_SVE_SET_VL.

Right. I should have clarified: given your input that we shouldn't take this PR to call prctl because it's unreliable, the alternative where we simply don't use SVE if it's larger than DOTNET_MaxVectorTBitWidth is fine and still in line with the currently documented behavior for that config switch.

If we provided anything around prctl (and it sounds like we're leaning towards no), it would need to be a separate undocumented switch, potentially debug only. -- I don't think we have the need to add that given our current testing needs and the known sizes (128 and 256) we'll want to support for existing SVE capable hardware (both consumer and server/cloud).

Thanks @tannergooding for clarifying some of the things offline, so it eventually boils down to:

    int VectorTLength = 0;
    int SystemVectorTLength = Get_System_VL();
    if (DOTNET_MaxVectorTBitWidth == 0)
    {
        // If we fix getVectorTByteLength() to return system length - then use system provided length
        // Otherwise use 128
        VectorTLength = SystemVectorTLength;
    }
    else if (DOTNET_MaxVectorTBitWidth >= SystemVectorTLength)
    {
        // For a 256-bit machine, if user provides DOTNET_MaxVectorTBitWidth=512, we will use the
        // maximum available length of 256-bits
        VectorTLength = SystemVectorTLength;
    }
    else if (DOTNET_MaxVectorTBitWidth < SystemVectorTLength)
    {
        // For a 256-bit machine, if user provides DOTNET_MaxVectorTBitWidth=128, we do not want
        // to update it using syscall because that has implications on already initialized components
        // as it might have stale vector length. In that case, disable SVE.
        //
        // If user really wants to downgrade the size, they can call `prctl` or windows `CreateProcess()`
        // to limit the size before launching dotnet process
        VectorTLength = 128;
        DisableSve();
    }

To implement Get_System_VL(), we could use prctl with PR_SVE_GET_VL, but it would be better to use CNTB or an equivalent instruction because that way it will be OS agnostic. We will need that anyway for getVectorTByteLength() when DOTNET_MaxVectorTBitWidth is not set.

With that said, there is no need to introduce a different environment variable that is DEBUG only to support the downgrade scenario.
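The decision summarized in this comment can be written as a small self-contained function. This is a sketch, not the PR's code: the struct and function names are hypothetical, widths are in bits, and it assumes getVectorTByteLength() has been fixed so that the system length can actually be used.

```cpp
#include <cstdint>

// Hypothetical result type; not a coreclr structure.
struct VectorTConfig
{
    uint32_t bitWidth;   // width used for Vector<T>, in bits
    bool     sveEnabled; // whether SVE stays enabled
};

// maxVectorTBitWidth: value of DOTNET_MaxVectorTBitWidth (0 means unset).
// systemBitWidth: SVE vector length the process was started with, in bits.
inline VectorTConfig ChooseVectorTConfig(uint32_t maxVectorTBitWidth, uint32_t systemBitWidth)
{
    // Unset, or a limit at or above the system length: use the system length.
    if (maxVectorTBitWidth == 0 || maxVectorTBitWidth >= systemBitWidth)
    {
        return { systemBitWidth, true };
    }
    // A smaller limit: do not change the vector length, disable SVE instead.
    return { 128, false };
}
```

On a 256-bit machine, a limit of 0 or 512 keeps SVE at 256 bits, while a limit of 128 disables SVE and leaves Vector<T> at 128 bits.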

kunalspathak

We need to set the system vector length if DOTNET_MaxVectorTBitWidth is not specified.

src/coreclr/vm/codeman.cpp

kunalspathak · 2024-05-01T05:02:22Z

src/coreclr/vm/codeman.cpp

+        {
+            // Enable SVE only when user specified vector length larger than or equal to the system
+            // vector length. When enabled, SVE would use full vector length available to the process.
+            // For a 256-bit machine, if user provides DOTNET_MaxVectorTBitWidth=128, disable SVE.


When DOTNET_MaxVectorTBitWidth is not specified, it is 0, and in that case we should use the full vector length the system offers (as the comment says). In that case, we should set CPUCompileFlags.Clear(InstructionSet_VectorT256); on a 256-bit machine and CPUCompileFlags.Clear(InstructionSet_VectorT256); and CPUCompileFlags.Clear(InstructionSet_VectorT512); on a 512-bit machine.

We wouldn't want to CPUCompileFlags.Clear, as that's implicit

Rather instead we would CPUCompileFlags.Set(InstructionSet_VectorT256); if the reported SVE length is 256-bits and CPUCompileFlags.Set(InstructionSet_VectorT512); if the reported SVE length is 512-bits.

They should be off unless otherwise set, but we double check that via a cleanup check anyways (which should be moved to be shared with Arm64):

runtime/src/coreclr/vm/codeman.cpp

Lines 1564 to 1573 in 22aa47e

    // Clean up mutually exclusive ISAs
    if (CPUCompileFlags.IsSet(InstructionSet_VectorT512))
    {
        CPUCompileFlags.Clear(InstructionSet_VectorT256);
        CPUCompileFlags.Clear(InstructionSet_VectorT128);
    }
    else if (CPUCompileFlags.IsSet(InstructionSet_VectorT256))
    {
        CPUCompileFlags.Clear(InstructionSet_VectorT128);
    }

-- I think we also need a TODO here explaining that we're artificially restricting the size to 128-bits for the time being, as the support for larger vector sizes hasn't been plumbed through for Arm64 yet.
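Putting the "set, then clean up" idea together, a standalone sketch might look like the following. The enum is a hypothetical stand-in for the JIT's instruction-set flags, and this shows the eventual behavior discussed here, not the 128-bit restriction the PR keeps for now.

```cpp
#include <cstdint>

// Hypothetical stand-ins for the JIT's instruction-set flags.
enum VectorTFlag : uint32_t
{
    VectorT128 = 1u << 0,
    VectorT256 = 1u << 1,
    VectorT512 = 1u << 2,
};

// Sets the flag matching the reported SVE length (in bits), then clears the
// narrower flags so that at most one VectorT size remains set.
inline uint32_t SelectVectorTFlags(uint32_t flags, uint32_t sveBitWidth)
{
    if (sveBitWidth == 512)
    {
        flags |= VectorT512;
    }
    else if (sveBitWidth == 256)
    {
        flags |= VectorT256;
    }

    // Clean up mutually exclusive ISAs.
    if (flags & VectorT512)
    {
        flags &= ~(VectorT256 | VectorT128);
    }
    else if (flags & VectorT256)
    {
        flags &= ~VectorT128;
    }
    return flags;
}
```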

We wouldn't want to CPUCompileFlags.Clear, as that's implicit

Yes, that's what I meant...if they are already enabled, then we should disable it or vice-versa.

I think we also need a TODO here explaining that we're artificially restricting the size to 128-bits for the time being,

Yes, I think that's what we should do and leave the change of setting Vector256, etc. for later PR, when we address some of the things in #101477 , specifically getVectorTByteLength().

src/coreclr/vm/codeman.cpp

tannergooding · 2024-05-02T14:07:35Z

src/coreclr/vm/codeman.h

+    inline UINT64 GetSystemVectorLength()
+    {
+        UINT64 size;
+        __asm__ __volatile__("cntb %0" : "=r"(size));
+        return size;
+    }


I'd expect this to fail for MSVC, which doesn't allow inline assembly?

Can we just call svcntb() on Windows instead, or is there potentially an official OS API for this?

I wasn't sure this was available for MSVC yet. If so, then great, that's easier.

Is there a min version of MSVC needed to build coreclr?

Hmmm, it might not be available yet, at least it's not available in the MSVC version CI is using. -- @kunalspathak might know better when that support is expected to land.

I expect we need some ifdef here regardless, since __asm is explicitly unsupported on MSVC for anything but 32-bit x86, and the general function should be configured to assert if called from Windows, as a safety measure. -- This function directly executing an SVE instruction unguarded is a bit dangerous and there isn't really a way to assert that it's being done safely. It's only safe in the current use-case since it's only called if the relevant support was queried from the OS.

at least it's not available in the MSVC version CI is using

yes, I checked around and they haven't added that support yet, so it will not be available anytime soon

Can we just call svcntb() on Windows instead

They do not and the guidance was to use whatever is exposed by the compiler, so we won't have it for at least the next few months.

So, I think in short-term, let us do it in asm guarded with if CPU-capability supports SVE.

The consideration is the msvc compiler does not support asm

Return a hardcoded vector length of 16-bytes until we find a suitable mechanism to retrieve it.

Just noting that we won't be able to actually ship in November in such a state. We will require some form of actual query of the hardware size from the OS to avoid any issues if it's run on hardware that has a larger length.

It might be overall simpler to just get such a fallback setup now so we don't need to worry about it, risk forgetting about it, or anything along those lines.

It might be overall simpler to just get such a fallback setup now

Agreed, but AIUI, for windows today:

Can't use inline asm, including encoding in hex

no OS API available

no SVE ACLE available

I'm not aware of any other method.

I think the solution is what @jkotas mentioned of emitting hex code or just write the assembly cntb in the .asm and .S file - https://github.com/dotnet/runtime/blob/main/src/coreclr/vm/arm64/asmhelpers.asm.

It looks like MASM v14.4 supports things and we can just define a function that does:

    rdvl x0, #1
    ret

Earlier versions may also support the functionality, but I don't have such earlier versions installed at the moment.

Updated the PR to test how the ci handles use of rdvl.

SwapnilGaikwad · 2024-05-03T16:09:13Z

Updated the PR to do the following
On Linux

Disable SVE when the user specifies vector length that's smaller than the available/OS vector length.
Avoid inline assembly and use ACLE (svcntb()) to retrieve the vector length.
- Use of svcntb() is available from clang-16. On older versions of clang, such as clang-14, including arm_sve.h errors out with SVE support not enabled on non-SVE systems.

On Windows

Return a hardcoded vector length of 16-bytes until we find a suitable mechanism to retrieve it.
Added a TODO to note the current limitation.
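A rough sketch of the split described in this update: use the ACLE intrinsic where the compiler has SVE enabled, otherwise return the hardcoded 16 bytes. The function name is hypothetical, and the `__ARM_FEATURE_SVE` gate is a compile-time check only; a real caller would still need to confirm at run time that the OS reports SVE support before executing an SVE instruction.

```cpp
#include <cstdint>

#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
#endif

// Hypothetical helper; not the PR's implementation.
inline uint64_t GetSveLengthInBytes()
{
#if defined(__ARM_FEATURE_SVE)
    // ACLE intrinsic: number of bytes in an SVE vector.
    return svcntb();
#else
    // Fallback when SVE is not enabled for this translation unit.
    return 16;
#endif
}
```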

src/coreclr/vm/arm64/asmhelpers.asm

src/coreclr/vm/arm64/asmhelpers.S

jkotas

We prefer standard C/C++ types for new code.

src/coreclr/vm/arm64/asmhelpers.S

src/coreclr/vm/arm64/asmhelpers.asm

src/coreclr/vm/codeman.h

SwapnilGaikwad · 2024-06-05T15:56:36Z

Hi @jkotas, do you have any suggestions to fix this build error - unknown opcode: rdvl, for Windows on Arm64 ?
Potentially any current places where we use a hex code for an instruction.

src/coreclr/vm/arm64/asmhelpers.asm

jkotas · 2024-06-05T16:31:33Z

do you have any suggestions to fix this build error - unknown opcode: rdvl, for Windows on Arm64 ?

I do not think we need to be creative: https://github.com/dotnet/runtime/pull/101295/files#r1628086182

src/coreclr/vm/arm64/asmhelpers.asm

kunalspathak · 2024-06-05T17:32:27Z

do you have any suggestions to fix this build error - unknown opcode: rdvl, for Windows on Arm64 ?

I do not think we need to be creative: https://github.com/dotnet/runtime/pull/101295/files#r1628086182

Currently, GetSveLengthFromOS() will return different values each time depending on the content of x0 for windows/arm64. Until we upgrade the masm on CI machine (not sure how frequently that happens), we should at least make it return 128 for windows/arm64.

kunalspathak

LGTM. Thanks for your contributions.

Add option to change SVE vector length for current and children processes.

1393c30

dotnet-issue-labeler bot added the area-VM-coreclr label Apr 19, 2024

dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label Apr 19, 2024

SwapnilGaikwad marked this pull request as ready for review April 19, 2024 14:36

SwapnilGaikwad mentioned this pull request Apr 19, 2024

Add support for Sve.UnzipEven/Odd & Sve.ZipHighLow #101294

Merged

tannergooding reviewed Apr 19, 2024

src/coreclr/inc/clrconfigvalues.h (outdated)

tannergooding reviewed Apr 19, 2024

kunalspathak added the arm-sve Work related to arm64 SVE/SVE2 support label Apr 21, 2024

kunalspathak self-requested a review April 21, 2024 18:08

This was referenced Apr 22, 2024

JIT ARM64-SVE: Add Count*BitElements #101188

Merged

Arm64 SVE: Size of vector is always 128bits #101433

Open

kunalspathak mentioned this pull request Apr 24, 2024

Arm64/Sve: Some misc items about SVE Vector Length #101477

Open

5 tasks

Use maxVectorTBitWidth to get desired SVE length

2c040a7

SwapnilGaikwad commented Apr 24, 2024

src/coreclr/vm/codeman.cpp (outdated)

jkotas reviewed Apr 24, 2024

SwapnilGaikwad added 2 commits April 30, 2024 17:37

Merge main

19dce8d

Use CNTB to determine current vector length

22aa47e

build-analysis bot mentioned this pull request Apr 30, 2024

System.Numerics.Tensors.Tests.SingleGenericTensorPrimitives.SpanScalarDestination_SpecialValues fails #101721

Closed

kunalspathak requested changes May 1, 2024
