Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

crypto.sha2: Use intrinsics for SHA-256 on x86-64 and AArch64 #13272

Merged
merged 5 commits into from
Oct 29, 2022

Conversation

topolarity
Copy link
Contributor

@topolarity topolarity commented Oct 23, 2022

There's probably plenty of room to optimize these further in the future, but for the moment this gives ~5-6x improvement on x86-64, and ~10x on M1 AArch64 Macs.

These extensions are very new. Most Intel processors prior to 2020 do not support them. AMD has supported them since Ryzen.

P.S. @kubkon I haven't fixed the CPU features for LLVM in stage2 yet, but at least this gives you something to play with until that's sorted 🙂

@joachimschmidt557
Copy link
Member

In future it would be nice if we use inline assembly instead of LLVM intrinsics to enable the self-hosted backends to use these features too.

@topolarity
Copy link
Contributor Author

topolarity commented Oct 23, 2022

In future it would be nice if we use inline assembly instead of LLVM intrinsics to enable the self-hosted backends to use these features too.

Good point! Switched to inline asm for this PR.

In general, I think inline assembly can interact very differently with the optimizer versus intrinsics, but in this case it's actually a 20% win for Intel x86-64: Intel is over 1 GB/s now (4x improvement vs. master)

@topolarity
Copy link
Contributor Author

Eugh... Intel x86-64 loves the new inline assembly, but AMD hates it:

intrinsics: 66870 us (1568.08 MB/s)
inline asm: 251760 us (416.50 MB/s)

I'll go digging through the assembly and see if I can figure out why.

@kubkon
Copy link
Member

kubkon commented Oct 23, 2022

Eugh... Intel x86-64 loves the new inline assembly, but AMD hates it:

intrinsics: 66870 us (1568.08 MB/s)
inline asm: 251760 us (416.50 MB/s)

I'll go digging through the assembly and see if I can figure out why.

What about arm64? Comparable, or ahead, or behind? If the latter, that's bad timing as I wanted to start testing it out in zld repo.

@topolarity
Copy link
Contributor Author

What about arm64? Comparable, or ahead, or behind? If the latter, that's bad timing as I wanted to start testing it out in zld repo.

arm64 on my M1 mac is about the same (<5% delta)

@kubkon
Copy link
Member

kubkon commented Oct 23, 2022

What about arm64? Comparable, or ahead, or behind? If the latter, that's bad timing as I wanted to start testing it out in zld repo.

arm64 on my M1 mac is about the same (<5% delta)

Awesome! Thanks for confirming!

@kubkon
Copy link
Member

kubkon commented Oct 23, 2022

Is there any trick to building any of this locally? On my M1Pro I am getting:

error: <inline asm>:2:1: instruction requires: sha2
sha256h.4s v2, v1, v3

@Jarred-Sumner
Copy link
Contributor

not sure if helpful but I have a script that benches SHA hashing comparing BoringSSL's implementation to Zig's implementation

This is for an old build of Zig without this PR's changes 0.10.0-dev.2822+b79884eaf on a Ryzen machine

[SHA256]

     zig: 3.292s
  boring: 518.223ms
     evp: 518.25ms
  evp in: 518.228ms

https://github.com/Jarred-Sumner/bun/blob/dea7cb14bdf0446fcb8a0750fe86b2056a7c3be0/src/sha.zig#L173-L230

@andrewrk
Copy link
Member

@kubkon what does zig build-exe --show-builtin output for you?

@kubkon
Copy link
Member

kubkon commented Oct 24, 2022

@kubkon what does zig build-exe --show-builtin output for you?

pub const cpu: std.Target.Cpu = .{                                                                                                                                                                                                                                                                                                                   
    .arch = .aarch64,                                                                                                                                                                                                                                                                                                                                
    .model = &std.Target.aarch64.cpu.apple_a14,                                                                                                                                                                                                                                                                                                      
    .features = std.Target.aarch64.featureSet(&[_]std.Target.aarch64.Feature{                                                                                                                                                                                                                                                                        
        .aes,                                                                                                                                                                                                                                                                                                                                        
        .aggressive_fma,                                                                                                                                                                                                                                                                                                                             
        .alternate_sextload_cvt_f32_pattern,
        .altnzcv,
        .am,
        .arith_bcc_fusion,
        .arith_cbz_fusion,
        .ccdp,
        .ccidx,
        .ccpp,
        .complxnum,
        .contextidr_el2,
        .crc,
        .crypto,
        .disable_latency_sched_heuristic,
        .dit,
        .dotprod,
        .el2vmsa,
        .el3,
        .flagm,
        .fp16fml,
        .fp_armv8,
        .fptoint,
        .fullfp16,
        .fuse_address,
        .fuse_adrp_add,
        .fuse_aes,
        .fuse_arith_logic,
        .fuse_crypto_eor,
        .fuse_csel,
        .fuse_literals,
        .jsconv,
        .lor,
        .lse,
        .lse2,
        .mpam,
        .neon,
        .nv,
        .pan,
        .pan_rwv,
        .pauth,
        .perfmon,
        .predres,
        .ras,
        .rcpc,
        .rcpc_immo,
        .rdm,
        .sb,
        .sel2,
        .sha2,
        .sha3,
        .specrestrict,
        .ssbs,
        .tlb_rmi,
        .tracev8_4,
        .uaops,
        .v8_1a,
        .v8_2a,
        .v8_3a,
        .v8_4a,
        .v8a,
        .vh,
        .zcm,
        .zcz,
        .zcz_gp,
    }),
};

Copy link
Member

@kubkon kubkon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stellar work! Thanks @topolarity!

@daurnimator
Copy link
Contributor

Doesn't this make it impossible to do sha256 at comptime?

@topolarity
Copy link
Contributor Author

Doesn't this make it impossible to do sha256 at comptime?

Good point, I hadn't thought of that. Yeah, it breaks comptime evaluation for SHA-256.

Without #868 (or a related hack), the only way to fix that is to create a separate comptime method that users call at comptime. That means it's viral, too: Any function that calls SHA-256 must split into a comptime and runtime version, as well as their callers and so on.

@jedisct1
Copy link
Contributor

For hash functions, comptime-support can indeed be an important thing to have.

That leaves us with either dynamic dispatch (which doesn't match the rest, and the plan to eventually do it), or, as you pointed out, a way to check if the function is called at comptime or not.

Doing this, even with a hack as a starter, sounds like a much better route that duplicating every function to have a comptine variant.

@andrewrk
Copy link
Member

andrewrk commented Oct 28, 2022

pub fn isComptime() bool {
    var t = true;
    var a: u8 = 0;
    var b: u16 = 0;
    const x = if (t) a else b;
    return @TypeOf(x) == u8;
}

@jedisct1
Copy link
Contributor

That's a nice trick!

@andrewrk
Copy link
Member

andrewrk commented Oct 28, 2022

CMake Error: The following variables are used in this project, but they are set to NOTFOUND.
Please set them or make sure they are set and tested correctly in the CMake files:
ZSTD

Hmm not sure what's up with the CI failures - I don't understand why this would be happening in this branch but not master branch.

Are you sure this is rebased against master branch? In master branch I see this:

-- The C compiler identification is Clang 15.0.3
-- The CXX compiler identification is Clang 15.0.3

However in this CI run I see this:

-- The C compiler identification is Clang 15.0.0
-- The CXX compiler identification is Clang 15.0.0

Possibly indicating an old tarball.

There's probably plenty of room to optimize these further in the
future, but for the moment this gives ~3x improvement on Intel
x86-64 processors, ~5x on AMD, and ~10x on M1 Macs.

These extensions are very new - Most processors prior to 2020 do
not support them.

AVX-512 is a slightly older alternative that we could use on Intel
for a much bigger performance bump, but it's been fused off on
Intel's latest hybrid architectures and it relies on computing
independent SHA hashes in parallel. In contrast, these SHA intrinsics
provide the usual single-threaded, single-stream interface, and should
continue working on new processors.

AArch64 also has SHA-512 intrinsics that we could take advantage
of in the future
This feature detection must be done at comptime so that we avoid
generating invalid ASM for the target.
This gets us most of the way back to the performance I had when
I was using the LLVM intrinsics:
  - Intel Intel(R) Core(TM) i7-1068NG7 CPU @ 2.30GHz:
       190.67 MB/s (w/o intrinsics) -> 1285.08 MB/s
  - AMD EPYC 7763 (VM) @ 2.45 GHz:
       240.09 MB/s (w/o intrinsics) -> 1360.78 MB/s
  - Apple M1:
       216.96 MB/s (w/o intrinsics) -> 2133.69 MB/s

Minor changes to this source can swing performance from 400 MB/s to
1400 MB/s or... 20 MB/s, depending on how it interacts with the
optimizer. I have a sneaking suspicion that despite LLVM inheriting
GCC's extremely strict inline assembly semantics, its passes are
rather skittish around inline assembly (and almost certainly, its
instruction cost models can assume nothing)
Comptime code can't execute assembly code, so we need some way to
force comptime code to use the generic path. This should be replaced
with whatever is implemented for ziglang#868, when that day comes.

I am seeing that the result for the hash is incorrect in stage1 and
crashes stage2, so presumably this never worked correctly. I will follow
up on that soon.
@topolarity
Copy link
Contributor Author

Are you sure this is rebased against master branch?

Oops, I thought CI merged the PR branch into master automatically, but I see now that this doesn't apply changes to the CI script itself.

Just pushed a proper rebase - should be ready for review/merge assuming it passes.

lib/std/crypto/sha2.zig Outdated Show resolved Hide resolved
lib/std/crypto/sha2.zig Outdated Show resolved Hide resolved
This also fixes a bug where the feature gating was not taking
effect at comptime due to ziglang#6768
@andrewrk
Copy link
Member

andrewrk commented Oct 29, 2022

Re: the CI failures, I think

!isComptime() and comptime std.Target.x86.featureSetHas(builtin.cpu.features, .sha)

has to be changed to

comptime std.Target.x86.featureSetHas(builtin.cpu.features, .sha) and !isComptime()

See #6768 for more details.

Once we get rid of stage1, we can make featureSetHas inline and then that comptime keyword can go away.

@topolarity
Copy link
Contributor Author

Thanks for the review! Yep, #6768 took me by surprise.

Should be all fixed up now 👍

@xcaptain
Copy link

Doesn't this make it impossible to do sha256 at comptime?

Good point, I hadn't thought of that. Yeah, it breaks comptime evaluation for SHA-256.

Without #868 (or a related hack), the only way to fix that is to create a separate comptime method that users call at comptime. That means it's viral, too: Any function that calls SHA-256 must split into a comptime and runtime version, as well as their callers and so on.

I don't think it's necessary to add another comptime method. I made some changes like below and run

zig test lib/std/std.zig --zig-lib-dir lib

image

The tests built and passed. The hash functions should already support running at comptile time, no need to add another one, am I right?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants