crypto.sha2: Use intrinsics for SHA-256 on x86-64 and AArch64 #13272

topolarity · 2022-10-23T07:05:18Z

There's probably plenty of room to optimize these further in the future, but for the moment this gives ~5-6x improvement on x86-64, and ~10x on M1 AArch64 Macs.

These extensions are very new. Most Intel processors prior to 2020 do not support them. AMD has supported them since Ryzen.

P.S. @kubkon I haven't fixed the CPU features for LLVM in stage2 yet, but at least this gives you something to play with until that's sorted 🙂

joachimschmidt557 · 2022-10-23T08:39:18Z

In future it would be nice if we use inline assembly instead of LLVM intrinsics to enable the self-hosted backends to use these features too.

topolarity · 2022-10-23T16:44:58Z

> In future it would be nice if we use inline assembly instead of LLVM intrinsics to enable the self-hosted backends to use these features too.

Good point! Switched to inline asm for this PR.

In general, I think inline assembly can interact very differently with the optimizer versus intrinsics, but in this case it's actually a 20% win for Intel x86-64: Intel is over 1 GB/s now (4x improvement vs. master)

topolarity · 2022-10-23T16:57:04Z

Eugh... Intel x86-64 loves the new inline assembly, but AMD hates it:

intrinsics: 66870 us (1568.08 MB/s)
inline asm: 251760 us (416.50 MB/s)

I'll go digging through the assembly and see if I can figure out why.

kubkon · 2022-10-23T17:02:22Z

> Eugh... Intel x86-64 loves the new inline assembly, but AMD hates it:
>
> intrinsics: 66870 us (1568.08 MB/s)
> inline asm: 251760 us (416.50 MB/s)
>
> I'll go digging through the assembly and see if I can figure out why.

What about arm64? Comparable, or ahead, or behind? If the latter, that's bad timing as I wanted to start testing it out in zld repo.

topolarity · 2022-10-23T17:03:51Z

> What about arm64? Comparable, or ahead, or behind? If the latter, that's bad timing as I wanted to start testing it out in zld repo.

arm64 on my M1 mac is about the same (<5% delta)

kubkon · 2022-10-23T17:04:45Z

> > What about arm64? Comparable, or ahead, or behind? If the latter, that's bad timing as I wanted to start testing it out in zld repo.
>
> arm64 on my M1 mac is about the same (<5% delta)

Awesome! Thanks for confirming!

kubkon · 2022-10-23T17:22:10Z

Is there any trick to building any of this locally? On my M1Pro I am getting:

error: <inline asm>:2:1: instruction requires: sha2
sha256h.4s v2, v1, v3

Jarred-Sumner · 2022-10-24T05:41:59Z

not sure if helpful but I have a script that benches SHA hashing comparing BoringSSL's implementation to Zig's implementation

This is for an old build of Zig without this PR's changes (0.10.0-dev.2822+b79884eaf), on a Ryzen machine:

[SHA256]

     zig: 3.292s
  boring: 518.223ms
     evp: 518.25ms
  evp in: 518.228ms

https://github.com/Jarred-Sumner/bun/blob/dea7cb14bdf0446fcb8a0750fe86b2056a7c3be0/src/sha.zig#L173-L230

andrewrk · 2022-10-24T06:30:19Z

@kubkon what does zig build-exe --show-builtin output for you?

kubkon · 2022-10-24T06:59:07Z

> @kubkon what does zig build-exe --show-builtin output for you?

pub const cpu: std.Target.Cpu = .{
    .arch = .aarch64,
    .model = &std.Target.aarch64.cpu.apple_a14,
    .features = std.Target.aarch64.featureSet(&[_]std.Target.aarch64.Feature{
        .aes,
        .aggressive_fma,
        .alternate_sextload_cvt_f32_pattern,
        .altnzcv,
        .am,
        .arith_bcc_fusion,
        .arith_cbz_fusion,
        .ccdp,
        .ccidx,
        .ccpp,
        .complxnum,
        .contextidr_el2,
        .crc,
        .crypto,
        .disable_latency_sched_heuristic,
        .dit,
        .dotprod,
        .el2vmsa,
        .el3,
        .flagm,
        .fp16fml,
        .fp_armv8,
        .fptoint,
        .fullfp16,
        .fuse_address,
        .fuse_adrp_add,
        .fuse_aes,
        .fuse_arith_logic,
        .fuse_crypto_eor,
        .fuse_csel,
        .fuse_literals,
        .jsconv,
        .lor,
        .lse,
        .lse2,
        .mpam,
        .neon,
        .nv,
        .pan,
        .pan_rwv,
        .pauth,
        .perfmon,
        .predres,
        .ras,
        .rcpc,
        .rcpc_immo,
        .rdm,
        .sb,
        .sel2,
        .sha2,
        .sha3,
        .specrestrict,
        .ssbs,
        .tlb_rmi,
        .tracev8_4,
        .uaops,
        .v8_1a,
        .v8_2a,
        .v8_3a,
        .v8_4a,
        .v8a,
        .vh,
        .zcm,
        .zcz,
        .zcz_gp,
    }),
};

kubkon

Stellar work! Thanks @topolarity!

daurnimator · 2022-10-25T04:07:32Z

Doesn't this make it impossible to do sha256 at comptime?

topolarity · 2022-10-25T15:23:43Z

> Doesn't this make it impossible to do sha256 at comptime?

Good point, I hadn't thought of that. Yeah, it breaks comptime evaluation for SHA-256.

Without #868 (or a related hack), the only way to fix that is to create a separate comptime method that users call at comptime. That means it's viral, too: Any function that calls SHA-256 must split into a comptime and runtime version, as well as their callers and so on.
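For context, "SHA-256 at comptime" refers to usage along the lines of the sketch below, where the compiler itself computes a digest. This is only an illustration, not code from the PR: the constant name `abc_digest` and the branch quota value are assumptions, and it presumes a standard library in which the comptime path falls back to the generic implementation.

```zig
const std = @import("std");
const Sha256 = std.crypto.hash.sha2.Sha256;

// Digest of "abc", evaluated by the compiler because container-level
// constants are initialized at compile time.
const abc_digest: [Sha256.digest_length]u8 = blk: {
    // Assumption: the default quota may be too small for the hash rounds,
    // so raise it generously.
    @setEvalBranchQuota(100_000);
    var out: [Sha256.digest_length]u8 = undefined;
    Sha256.hash("abc", &out, .{});
    break :blk out;
};

pub fn main() void {
    // The well-known SHA-256("abc") test vector starts with 0xba and ends with 0xad.
    std.debug.assert(abc_digest[0] == 0xba);
    std.debug.assert(abc_digest[Sha256.digest_length - 1] == 0xad);
    std.debug.print("first byte: 0x{x}\n", .{abc_digest[0]});
}
```

If the hash implementation reached inline assembly during this evaluation, the constant could not be computed, which is the breakage being discussed here.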

jedisct1 · 2022-10-25T18:08:14Z

For hash functions, comptime-support can indeed be an important thing to have.

That leaves us with either dynamic dispatch (which doesn't match the rest, and the plan to eventually do it), or, as you pointed out, a way to check if the function is called at comptime or not.

Doing this, even with a hack as a starter, sounds like a much better route than duplicating every function to have a comptime variant.

andrewrk · 2022-10-28T01:02:59Z

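pub fn isComptime() bool {
    // At runtime `t` is not comptime-known, so `x` gets the peer type of `a` and `b`: u16.
    // At comptime the condition is known, only `a` is considered, and `x` is a u8.
    var t = true;
    var a: u8 = 0;
    var b: u16 = 0;
    const x = if (t) a else b;
    return @TypeOf(x) == u8;
}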
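A minimal sketch of how such a check can steer a function between two paths. It does not reuse the helper above: it assumes a newer toolchain (Zig 0.11 or later) that provides the `@inComptime()` builtin, which was not available when this PR was written. The function `whichPath` and its return strings are hypothetical stand-ins for the generic and hardware-accelerated implementations.

```zig
const std = @import("std");

// Hypothetical stand-in for a function with a generic and an accelerated path.
fn whichPath() []const u8 {
    // @inComptime() is true only while the compiler evaluates this call
    // at compile time, where inline assembly cannot be executed.
    return if (@inComptime()) "generic" else "accelerated";
}

pub fn main() void {
    const at_comptime = comptime whichPath(); // forced compile-time evaluation
    const at_runtime = whichPath(); // ordinary runtime call
    std.debug.print("comptime: {s}, runtime: {s}\n", .{ at_comptime, at_runtime });
}
```

The same function body serves both callers, so nothing upstream has to be split into comptime and runtime variants.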

jedisct1 · 2022-10-28T13:06:11Z

That's a nice trick!

andrewrk · 2022-10-28T22:05:32Z

CMake Error: The following variables are used in this project, but they are set to NOTFOUND.
Please set them or make sure they are set and tested correctly in the CMake files:
ZSTD

Hmm not sure what's up with the CI failures - I don't understand why this would be happening in this branch but not master branch.

Are you sure this is rebased against master branch? In master branch I see this:

-- The C compiler identification is Clang 15.0.3
-- The CXX compiler identification is Clang 15.0.3

However in this CI run I see this:

-- The C compiler identification is Clang 15.0.0
-- The CXX compiler identification is Clang 15.0.0

Possibly indicating an old tarball.

There's probably plenty of room to optimize these further in the future, but for the moment this gives ~3x improvement on Intel x86-64 processors, ~5x on AMD, and ~10x on M1 Macs. These extensions are very new: most processors prior to 2020 do not support them. AVX-512 is a slightly older alternative that we could use on Intel for a much bigger performance bump, but it has been fused off on Intel's latest hybrid architectures, and it relies on computing independent SHA hashes in parallel. In contrast, these SHA intrinsics provide the usual single-threaded, single-stream interface, and should continue working on new processors. AArch64 also has SHA-512 intrinsics that we could take advantage of in the future.

This feature detection must be done at comptime so that we avoid generating invalid ASM for the target.

This gets us most of the way back to the performance I had when I was using the LLVM intrinsics:

- Intel Intel(R) Core(TM) i7-1068NG7 CPU @ 2.30GHz: 190.67 MB/s (w/o intrinsics) -> 1285.08 MB/s
- AMD EPYC 7763 (VM) @ 2.45 GHz: 240.09 MB/s (w/o intrinsics) -> 1360.78 MB/s
- Apple M1: 216.96 MB/s (w/o intrinsics) -> 2133.69 MB/s

Minor changes to this source can swing performance from 400 MB/s to 1400 MB/s or... 20 MB/s, depending on how it interacts with the optimizer. I have a sneaking suspicion that despite LLVM inheriting GCC's extremely strict inline assembly semantics, its passes are rather skittish around inline assembly (and almost certainly, its instruction cost models can assume nothing).

Comptime code can't execute assembly code, so we need some way to force comptime code to use the generic path. This should be replaced with whatever is implemented for ziglang#868, when that day comes. I am seeing that the result for the hash is incorrect in stage1 and crashes stage2, so presumably this never worked correctly. I will follow up on that soon.

topolarity · 2022-10-28T22:28:50Z

> Are you sure this is rebased against master branch?

Oops, I thought CI merged the PR branch into master automatically, but I see now that this doesn't apply changes to the CI script itself.

Just pushed a proper rebase - should be ready for review/merge assuming it passes.

lib/std/crypto/sha2.zig

This also fixes a bug where the feature gating was not taking effect at comptime due to ziglang#6768

andrewrk · 2022-10-29T00:18:14Z

Re: the CI failures, I think

!isComptime() and comptime std.Target.x86.featureSetHas(builtin.cpu.features, .sha)

has to be changed to

comptime std.Target.x86.featureSetHas(builtin.cpu.features, .sha) and !isComptime()

See #6768 for more details.

Once we get rid of stage1, we can make featureSetHas inline and then that comptime keyword can go away.
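A minimal sketch of comptime feature gating with `featureSetHas`, under the assumption that the goal is for the check to be fully resolved at compile time so the unsupported branch is never compiled for the target. The names `have_sha`, `roundGeneric`, `roundAccelerated` and `pickRound` are hypothetical and do not come from the PR; the real code gates inline assembly rather than returning strings.

```zig
const std = @import("std");
const builtin = @import("builtin");

// Comptime-known: does the build target advertise SHA-256 instructions?
// Container-level constants are evaluated at compile time, so no explicit
// `comptime` keyword is needed here.
const have_sha: bool = switch (builtin.cpu.arch) {
    .x86_64 => std.Target.x86.featureSetHas(builtin.cpu.features, .sha),
    .aarch64 => std.Target.aarch64.featureSetHas(builtin.cpu.features, .sha2),
    else => false,
};

// Hypothetical stand-ins for the two code paths.
fn roundGeneric() []const u8 {
    return "generic";
}

fn roundAccelerated() []const u8 {
    return "accelerated";
}

pub fn pickRound() []const u8 {
    // The condition is comptime-known, so only one branch is compiled
    // for any given target.
    return if (have_sha) roundAccelerated() else roundGeneric();
}

pub fn main() void {
    std.debug.print("sha feature: {}, path: {s}\n", .{ have_sha, pickRound() });
}
```

Building with different `-mcpu` settings should change which path is selected, and the features the compiler assumes can be inspected with `zig build-exe --show-builtin` as shown earlier in the thread.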

topolarity · 2022-10-29T00:39:09Z

Thanks for the review! Yep, #6768 took me by surprise.

Should be all fixed up now 👍

xcaptain · 2023-11-10T16:38:43Z

> > Doesn't this make it impossible to do sha256 at comptime?
>
> Good point, I hadn't thought of that. Yeah, it breaks comptime evaluation for SHA-256.
>
> Without #868 (or a related hack), the only way to fix that is to create a separate comptime method that users call at comptime. That means it's viral, too: Any function that calls SHA-256 must split into a comptime and runtime version, as well as their callers and so on.

I don't think it's necessary to add another comptime method. I made some changes like below and ran

zig test lib/std/std.zig --zig-lib-dir lib

The tests built and passed. The hash functions should already support running at compile time, so there is no need to add another one. Am I right?

topolarity force-pushed the sha2-intrinsics branch from e6afae7 to faebc31 Compare October 23, 2022 16:41

topolarity force-pushed the sha2-intrinsics branch from faebc31 to 3ed22de Compare October 23, 2022 16:46

kubkon approved these changes Oct 24, 2022

topolarity force-pushed the sha2-intrinsics branch from c73b079 to d197819 Compare October 24, 2022 17:43

topolarity added 4 commits October 28, 2022 15:21

std.crypto: SHA-256 Properly gate comptime conditional

ee241c4

topolarity force-pushed the sha2-intrinsics branch from cf2bec9 to f9fe548 Compare October 28, 2022 22:21

andrewrk requested changes Oct 29, 2022

std.crypto: Use featureSetHas to gate intrinsics

67fa326

andrewrk approved these changes Oct 29, 2022

andrewrk merged commit 20925b2 into ziglang:master Oct 29, 2022

InKryption mentioned this pull request Mar 16, 2023

Making (FixedBuffer)Allocator available at comptime #14931

Open
