Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[API Proposal]: VPCLMULQDQ Intrinsics #95772

Closed
saucecontrol opened this issue Dec 7, 2023 · 5 comments · Fixed by #109137
Closed

[API Proposal]: VPCLMULQDQ Intrinsics #95772

saucecontrol opened this issue Dec 7, 2023 · 5 comments · Fixed by #109137
Labels
api-approved API was approved in API review, it can be implemented area-System.Runtime.Intrinsics in-pr There is an active PR which will close this issue when it is merged
Milestone

Comments

@saucecontrol
Copy link
Member

saucecontrol commented Dec 7, 2023

Background and motivation

VPCLMULQDQ is supported by Intel in the Ice Lake and newer architectures, and by AMD in Zen 4. It allows for parallel pclmulqdq in Vector256 and Vector512 and is important for implementing vectorized CRC32 among other things.

API Proposal

namespace System.Runtime.Intrinsics.X86;

public abstract class Pclmulqdq : Sse2
{
    public abstract class V256
    {
        public static new bool IsSupported { get; }

        public static Vector256<long> CarrylessMultiply(Vector256<long> left, Vector256<long> right, [ConstantExpected] byte control);
        public static Vector256<ulong> CarrylessMultiply(Vector256<ulong> left, Vector256<ulong> right, [ConstantExpected] byte control);
    }

    public abstract class V512
    {
        public static new bool IsSupported { get; }

        public static Vector512<long> CarrylessMultiply(Vector512<long> left, Vector512<long> right, [ConstantExpected] byte control);
        public static Vector512<ulong> CarrylessMultiply(Vector512<ulong> left, Vector512<ulong> right, [ConstantExpected] byte control);
    }
}

API Usage

Examples of vectorized CRC32 implementations using the equivalent C intrinsics abound. One such example: https://github.com/corsix/fast-crc32/blob/main/sample_avx512_vpclmulqdq_crc32c_v4s5x3.c

Alternative Designs

The Pclmulqdq256 and Pclmulqdq512 classes could be nested under Pclmulqdq rather than being top-level classes inheriting from it. Since this ISA includes only a single instruction, that may be preferable.

A case could be made for making Avx the base of Pclulqdq256, as VEX encoding is required for vpclmulqdq. Likewise, Pclmulqdq512 could have Avx512F as a base given its requirement of EVEX encoding. However, the relationship will change with AVX10, where EVEX support will not imply 512-bit vector support.

Risks

N/A

@saucecontrol saucecontrol added the api-suggestion Early API idea and discussion, it is NOT ready for implementation label Dec 7, 2023
@ghost ghost added the untriaged New issue has not been triaged by the area owner label Dec 7, 2023
@ghost
Copy link

ghost commented Dec 7, 2023

Tagging subscribers to this area: @dotnet/area-system-runtime-intrinsics
See info in area-owners.md if you want to be subscribed.

Issue Details

Background and motivation

VPCLMULQDQ is supported by Intel in the Ice Lake and newer architectures, and by AMD in Zen 4. It allows for parallel pclmulqdq in Vector256 and Vector512 and is important for implementing vectorized CRC32 among other things.

API Proposal

namespace System.Runtime.Intrinsics.X86;

/// <summary>
/// This class provides access to Intel VPCLMULQDQ hardware instructions via intrinsics
/// </summary>
[Intrinsic]
[CLSCompliant(false)]
public abstract class Pclmulqdq256 : Pclmulqdq
{
    internal Pclmulqdq256() { }

    // This would depend on the VPCLMULQDQ CPUID bit for VEX encoding and VPCLMULQDQ + AVX512VL for EVEX
    public static new bool IsSupported { get => IsSupported; }

    [Intrinsic]
    public new abstract class X64 : Pclmulqdq.X64
    {
        internal X64() { }

        public static new bool IsSupported { get => IsSupported; }
    }

    /// <summary>
    /// __m256i _mm256_clmulepi64_si128 (__m256i a, __m256i b, const int imm8)
    ///   VPCLMULQDQ ymm1, ymm2, ymm3/m256, imm8
    /// </summary>
    public static Vector256<long> CarrylessMultiply(Vector256<long> left, Vector256<long> right, [ConstantExpected] byte control) => CarrylessMultiply(left, right, control);
    /// __m256i _mm256_clmulepi64_si128 (__m256i a, __m256i b, const int imm8)
    ///   VPCLMULQDQ ymm1, ymm2, ymm3/m256, imm8
    /// </summary>
    public static Vector256<ulong> CarrylessMultiply(Vector256<ulong> left, Vector256<ulong> right, [ConstantExpected] byte control) => CarrylessMultiply(left, right, control);
}


/// <summary>
/// This class provides access to Intel AVX-512 VPCLMULQDQ hardware instructions via intrinsics
/// </summary>
[Intrinsic]
[CLSCompliant(false)]
public abstract class Pclmulqdq512 : Pclmulqdq
{
    internal Pclmulqdq256() { }

    // This would depend on the VPCLMULQDQ + AVX512F CPUID bits
    public static new bool IsSupported { get => IsSupported; }

    [Intrinsic]
    public new abstract class X64 : Pclmulqdq.X64
    {
        internal X64() { }

        public static new bool IsSupported { get => IsSupported; }
    }

    /// <summary>
    /// __m512i _mm512_clmulepi64_si128 (__m512i a, __m512i b, const int imm8)
    ///   VPCLMULQDQ ymm1, ymm2, ymm3/m512, imm8
    /// </summary>
    public static Vector512<long> CarrylessMultiply(Vector512<long> left, Vector512<long> right, [ConstantExpected] byte control) => CarrylessMultiply(left, right, control);
    /// __m512i _mm512_clmulepi64_si128 (__m512i a, __m512i b, const int imm8)
    ///   VPCLMULQDQ ymm1, ymm2, ymm3/m512, imm8
    /// </summary>
    public static Vector512<ulong> CarrylessMultiply(Vector512<ulong> left, Vector512<ulong> right, [ConstantExpected] byte control) => CarrylessMultiply(left, right, control);
}

API Usage

Examples of vectorized CRC32 implementations using the equivalent C intrinsics abound. One such example: https://github.com/corsix/fast-crc32/blob/main/sample_avx512_vpclmulqdq_crc32c_v4s5x3.c

Alternative Designs

The Pclmulqdq256 and Pclmulqdq256 could be nested under Pclmulqdq rather than being top-level classes inheriting from it. Since this ISA includes only a single instruction, that may be preferable.

A case could be made for making Avx the base of Pclulqdq256, as VEX encoding is required for vpclmulqdq. Likewise, Pclulqdq512 could have Avx512F as a base given its requirement of EVEX encoding. However, the relationship will change with AVX10, where EVEX support will not imply 512-bit vector support.

Risks

N/A

Author: saucecontrol
Assignees: -
Labels:

api-suggestion, area-System.Runtime.Intrinsics

Milestone: -

@tannergooding tannergooding added api-ready-for-review API is ready for review, it is NOT ready for implementation and removed api-suggestion Early API idea and discussion, it is NOT ready for implementation untriaged New issue has not been triaged by the area owner labels Dec 8, 2023
@MichalPetryka
Copy link
Contributor

Pclmulqdq512 could maybe be changed to inherit from Pclmulqdq256 if it implies that that's supported too.

@bartonjs
Copy link
Member

bartonjs commented Jan 9, 2024

Video

Looks good as proposed.

namespace System.Runtime.Intrinsics.X86;

/// <summary>
/// This class provides access to Intel VPCLMULQDQ hardware instructions via intrinsics
/// </summary>
[Intrinsic]
[CLSCompliant(false)]
public abstract class Pclmulqdq256 : Pclmulqdq
{
    internal Pclmulqdq256() { }

    // This would depend on the VPCLMULQDQ CPUID bit for VEX encoding and VPCLMULQDQ + AVX512VL for EVEX
    public static new bool IsSupported { get => IsSupported; }

    [Intrinsic]
    public new abstract class X64 : Pclmulqdq.X64
    {
        internal X64() { }

        public static new bool IsSupported { get => IsSupported; }
    }

    /// <summary>
    /// __m256i _mm256_clmulepi64_si128 (__m256i a, __m256i b, const int imm8)
    ///   VPCLMULQDQ ymm1, ymm2, ymm3/m256, imm8
    /// </summary>
    public static Vector256<long> CarrylessMultiply(Vector256<long> left, Vector256<long> right, [ConstantExpected] byte control) => CarrylessMultiply(left, right, control);
    /// __m256i _mm256_clmulepi64_si128 (__m256i a, __m256i b, const int imm8)
    ///   VPCLMULQDQ ymm1, ymm2, ymm3/m256, imm8
    /// </summary>
    public static Vector256<ulong> CarrylessMultiply(Vector256<ulong> left, Vector256<ulong> right, [ConstantExpected] byte control) => CarrylessMultiply(left, right, control);
}


/// <summary>
/// This class provides access to Intel AVX-512 VPCLMULQDQ hardware instructions via intrinsics
/// </summary>
[Intrinsic]
[CLSCompliant(false)]
public abstract class Pclmulqdq512 : Pclmulqdq
{
    internal Pclmulqdq512() { }

    // This would depend on the VPCLMULQDQ + AVX512F CPUID bits
    public static new bool IsSupported { get => IsSupported; }

    [Intrinsic]
    public new abstract class X64 : Pclmulqdq.X64
    {
        internal X64() { }

        public static new bool IsSupported { get => IsSupported; }
    }

    /// <summary>
    /// __m512i _mm512_clmulepi64_si128 (__m512i a, __m512i b, const int imm8)
    ///   VPCLMULQDQ zmm1, zmm2, zmm3/m512, imm8
    /// </summary>
    public static Vector512<long> CarrylessMultiply(Vector512<long> left, Vector512<long> right, [ConstantExpected] byte control) => CarrylessMultiply(left, right, control);
    /// __m512i _mm512_clmulepi64_si128 (__m512i a, __m512i b, const int imm8)
    ///   VPCLMULQDQ zmm1, zmm2, zmm3/m512, imm8
    /// </summary>
    public static Vector512<ulong> CarrylessMultiply(Vector512<ulong> left, Vector512<ulong> right, [ConstantExpected] byte control) => CarrylessMultiply(left, right, control);
}

@bartonjs bartonjs added api-approved API was approved in API review, it can be implemented and removed api-ready-for-review API is ready for review, it is NOT ready for implementation labels Jan 9, 2024
@stephentoub stephentoub added this to the 10.0.0 milestone Jul 19, 2024
@saucecontrol
Copy link
Member Author

saucecontrol commented Oct 19, 2024

For consistency with the AVX10 surface (and #86952), this should probably be revised to

namespace System.Runtime.Intrinsics.X86;

public abstract class Pclmulqdq : Sse2
{
    public abstract class V256
    {
        public static new bool IsSupported { get; }

        public static Vector256<long> CarrylessMultiply(Vector256<long> left, Vector256<long> right, [ConstantExpected] byte control);
        public static Vector256<ulong> CarrylessMultiply(Vector256<ulong> left, Vector256<ulong> right, [ConstantExpected] byte control);
    }

    public abstract class V512
    {
        public static new bool IsSupported { get; }

        public static Vector512<long> CarrylessMultiply(Vector512<long> left, Vector512<long> right, [ConstantExpected] byte control);
        public static Vector512<ulong> CarrylessMultiply(Vector512<ulong> left, Vector512<ulong> right, [ConstantExpected] byte control);
    }
}

@tannergooding tannergooding added api-ready-for-review API is ready for review, it is NOT ready for implementation and removed api-approved API was approved in API review, it can be implemented labels Oct 21, 2024
@bartonjs
Copy link
Member

bartonjs commented Oct 22, 2024

Video

Looks good as proposed

namespace System.Runtime.Intrinsics.X86;

public abstract class Pclmulqdq : Sse2
{
    public abstract class V256
    {
        public static new bool IsSupported { get; }

        public static Vector256<long> CarrylessMultiply(Vector256<long> left, Vector256<long> right, [ConstantExpected] byte control);
        public static Vector256<ulong> CarrylessMultiply(Vector256<ulong> left, Vector256<ulong> right, [ConstantExpected] byte control);
    }

    public abstract class V512
    {
        public static new bool IsSupported { get; }

        public static Vector512<long> CarrylessMultiply(Vector512<long> left, Vector512<long> right, [ConstantExpected] byte control);
        public static Vector512<ulong> CarrylessMultiply(Vector512<ulong> left, Vector512<ulong> right, [ConstantExpected] byte control);
    }
}

@bartonjs bartonjs added api-approved API was approved in API review, it can be implemented and removed api-ready-for-review API is ready for review, it is NOT ready for implementation labels Oct 22, 2024
@dotnet-policy-service dotnet-policy-service bot added the in-pr There is an active PR which will close this issue when it is merged label Oct 23, 2024
@github-actions github-actions bot locked and limited conversation to collaborators Dec 22, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
api-approved API was approved in API review, it can be implemented area-System.Runtime.Intrinsics in-pr There is an active PR which will close this issue when it is merged
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants