
crypto.sha2: Use intrinsics for SHA-256 on x86-64 and AArch64 #13272

Merged 5 commits on Oct 29, 2022

Commits on Oct 28, 2022

  1. crypto.sha2: Use intrinsics for SHA-256 on x86-64 and AArch64

    There's probably plenty of room to optimize these further in the
    future, but for the moment this gives ~3x improvement on Intel
    x86-64 processors, ~5x on AMD, and ~10x on M1 Macs.
    
    These extensions are fairly new; most processors released before 2020 do
    not support them.
    
    AVX-512 is a slightly older alternative that we could use on Intel
    for a much bigger performance bump, but it's been fused off on
    Intel's latest hybrid architectures and it relies on computing
    independent SHA hashes in parallel. In contrast, these SHA intrinsics
    provide the usual single-threaded, single-stream interface, and should
    continue working on new processors.
    
    AArch64 also has SHA-512 intrinsics that we could take advantage of in
    the future. (A sketch of the unchanged calling interface follows this
    list of commits.)
    topolarity committed Oct 28, 2022 (10edb6d)
  2. std.crypto: SHA-256 Properly gate comptime conditional

    This feature detection must be done at comptime so that we avoid
    generating invalid assembly for the target. (A minimal illustration
    follows this list of commits.)
    topolarity committed Oct 28, 2022 (ee241c4)
  3. std.crypto: Optimize SHA-256 intrinsics for AMD x86-64

    This gets us most of the way back to the performance I had when
    I was using the LLVM intrinsics:
      - Intel(R) Core(TM) i7-1068NG7 CPU @ 2.30GHz:
           190.67 MB/s (w/o intrinsics) -> 1285.08 MB/s
      - AMD EPYC 7763 (VM) @ 2.45 GHz:
           240.09 MB/s (w/o intrinsics) -> 1360.78 MB/s
      - Apple M1:
           216.96 MB/s (w/o intrinsics) -> 2133.69 MB/s
    
    Minor changes to this source can swing performance from 400 MB/s to
    1400 MB/s or... 20 MB/s, depending on how it interacts with the
    optimizer. I have a sneaking suspicion that despite LLVM inheriting
    GCC's extremely strict inline assembly semantics, its passes are
    rather skittish around inline assembly (and its instruction cost
    models almost certainly can assume nothing about it).
    topolarity committed Oct 28, 2022 (4c1f71e)
  4. std.crypto: Add isComptime guard around intrinsics

    Comptime code can't execute assembly, so we need some way to force
    comptime callers onto the generic path. This should be replaced with
    whatever is eventually implemented for ziglang#868. (The intent is
    sketched after this list of commits.)

    I am seeing that computing the hash at comptime gives an incorrect
    result in stage1 and crashes stage2, so presumably this never worked
    correctly. I will follow up on that soon.
    topolarity committed Oct 28, 2022 (f9fe548)
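
The sketches below are editorial illustrations, not code from this PR. First, the caller-facing interface referenced in commit 1: usage of std.crypto.hash.sha2.Sha256 is unchanged, and the same single-stream calls transparently pick up the intrinsics on CPUs that have the SHA extensions.

```zig
const std = @import("std");
const Sha256 = std.crypto.hash.sha2.Sha256;

pub fn main() void {
    // One-shot hashing.
    var digest: [Sha256.digest_length]u8 = undefined;
    Sha256.hash("The quick brown fox jumps over the lazy dog", &digest, .{});

    // The same data fed through the streaming, single-stream interface.
    var hasher = Sha256.init(.{});
    hasher.update("The quick brown fox ");
    hasher.update("jumps over the lazy dog");
    var digest2: [Sha256.digest_length]u8 = undefined;
    hasher.final(&digest2);

    std.debug.print("digests match: {}\n", .{std.mem.eql(u8, &digest, &digest2)});
}
```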
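
For commit 2, a toy example of why the gate has to be comptime-known (the function and the asm here are illustrative, not taken from sha2.zig): the guarded branch contains an instruction that only exists on x86, so it must never be emitted when compiling for other targets.

```zig
const std = @import("std");
const builtin = @import("builtin");

const is_x86_64 = builtin.cpu.arch == .x86_64;

fn spinHint() void {
    if (comptime is_x86_64) {
        // This instruction only exists on x86; the comptime condition makes
        // sure it is never emitted when compiling for other targets.
        asm volatile ("pause");
    }
    // Other targets: nothing to do in this toy example.
}

pub fn main() void {
    spinHint();
    std.debug.print("is_x86_64 = {}\n", .{is_x86_64});
}
```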
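
For commit 4, a sketch of what the guard is aiming for (again, not the commit's code): once the comptime fallback works as intended, hashing at comptime should take the generic path and agree with the runtime result.

```zig
const std = @import("std");
const Sha256 = std.crypto.hash.sha2.Sha256;

test "SHA-256 at comptime matches the runtime result" {
    // Comptime evaluation cannot execute inline assembly, so this relies on
    // the guard routing comptime callers to the generic implementation.
    const comptime_digest = comptime blk: {
        @setEvalBranchQuota(100_000);
        var d: [Sha256.digest_length]u8 = undefined;
        Sha256.hash("abc", &d, .{});
        break :blk d;
    };

    var runtime_digest: [Sha256.digest_length]u8 = undefined;
    Sha256.hash("abc", &runtime_digest, .{});

    try std.testing.expectEqualSlices(u8, &comptime_digest, &runtime_digest);
}
```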

Commits on Oct 29, 2022

  1. std.crypto: Use featureSetHas to gate intrinsics

    This also fixes a bug where the feature gating was not taking effect
    at comptime due to ziglang#6768. (The shape of the gate is sketched
    below.)
    topolarity committed Oct 29, 2022 (67fa326)
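
For commit 5, a rough sketch of the shape of a featureSetHas-based gate (illustrative; the names and structure are not copied from sha2.zig):

```zig
const std = @import("std");
const builtin = @import("builtin");

// Comptime-known on every target; false wherever the extensions are absent.
const has_sha_extensions = switch (builtin.cpu.arch) {
    .x86_64 => std.Target.x86.featureSetHas(builtin.cpu.features, .sha),
    .aarch64 => std.Target.aarch64.featureSetHas(builtin.cpu.features, .sha2),
    else => false,
};

pub fn main() void {
    std.debug.print("SHA extensions available on this target: {}\n", .{has_sha_extensions});
}
```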