This PR updates the SHA256 implementation. It:

- cleanly separates message scheduling from the hashing rounds. Message scheduling depends only on the current block, not on previously hashed data, and can be parallelized to process 4x uint32 at a time. For known padding blocks, such as those that arise in Merkle trees (see Accelerate Merkle tree hashing #205), the schedule can be precomputed for significant speedups.
- uses SSSE3 (introduced in 2006 with the Core 2 Duo) to parallelize message scheduling. This speeds up SHA256 by 30%.

  Note: the SIMD rotate instruction `_mm_ror_epi32`, which would make translating SHA256 trivial, was only added with AVX512F+AVX512VL. It would only help on Skylake-X, the only architecture with AVX512VL but no hardware SHA (Cannon Lake never shipped), and the speedup would be limited to replacing shifts+xors with rotates.
- implements hardware SHA acceleration.
- reduces the size of the SHA256 context and simplifies its update flow.

Performance on small messages is better than that of OpenSSL and BLST.
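To illustrate the split between message scheduling and hashing rounds, here is a minimal portable C sketch. It is not the code in this PR: the names `sha256_schedule`, `sha256_rounds`, `sha256_pad64_schedule` and `sha256_64bytes` are hypothetical, and the scalar loops stand in for the SIMD and hardware paths. It assumes the Merkle-tree case of hashing exactly 64 bytes (two concatenated 32-byte nodes), where the second block is pure padding and its schedule never changes.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static const uint32_t SHA256_IV[8] = {
    0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a,
    0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19};

static const uint32_t K[64] = {
    0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5, 0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5,
    0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3, 0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174,
    0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc, 0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da,
    0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7, 0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967,
    0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13, 0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85,
    0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3, 0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070,
    0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5, 0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3,
    0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208, 0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2};

static uint32_t rotr(uint32_t x, unsigned n) { return (x >> n) | (x << (32 - n)); }

/* Step 1: message schedule. Reads only the 64-byte block, never the hash state. */
static void sha256_schedule(uint32_t w[64], const uint8_t block[64]) {
    for (int i = 0; i < 16; i++) {
        w[i] = ((uint32_t)block[4 * i] << 24) | ((uint32_t)block[4 * i + 1] << 16) |
               ((uint32_t)block[4 * i + 2] << 8) | (uint32_t)block[4 * i + 3];
    }
    for (int i = 16; i < 64; i++) {
        uint32_t s0 = rotr(w[i - 15], 7) ^ rotr(w[i - 15], 18) ^ (w[i - 15] >> 3);
        uint32_t s1 = rotr(w[i - 2], 17) ^ rotr(w[i - 2], 19) ^ (w[i - 2] >> 10);
        w[i] = w[i - 16] + s0 + w[i - 7] + s1;
    }
}

/* Step 2: the 64 rounds. Consumes an already expanded schedule and updates the state. */
static void sha256_rounds(uint32_t h[8], const uint32_t w[64]) {
    uint32_t a = h[0], b = h[1], c = h[2], d = h[3];
    uint32_t e = h[4], f = h[5], g = h[6], hh = h[7];
    for (int i = 0; i < 64; i++) {
        uint32_t t1 = hh + (rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)) +
                      ((e & f) ^ (~e & g)) + K[i] + w[i];
        uint32_t t2 = (rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)) +
                      ((a & b) ^ (a & c) ^ (b & c));
        hh = g; g = f; f = e; e = d + t1;
        d = c; c = b; b = a; a = t1 + t2;
    }
    h[0] += a; h[1] += b; h[2] += c; h[3] += d;
    h[4] += e; h[5] += f; h[6] += g; h[7] += hh;
}

/* Schedule of the padding block that follows any 64-byte message: 0x80, zeros, then
   the bit length (512) as a big-endian 64-bit integer. Compute once, reuse everywhere. */
static void sha256_pad64_schedule(uint32_t pad_w[64]) {
    uint8_t block[64] = {0};
    block[0] = 0x80;
    block[62] = 0x02; /* 512 = 0x0200 */
    sha256_schedule(pad_w, block);
}

/* Hash exactly 64 bytes, skipping schedule expansion for the padding block. */
static void sha256_64bytes(uint8_t out[32], const uint8_t msg[64], const uint32_t pad_w[64]) {
    assert(out != NULL && msg != NULL && pad_w != NULL);
    uint32_t h[8], w[64];
    memcpy(h, SHA256_IV, sizeof h);
    sha256_schedule(w, msg);
    sha256_rounds(h, w);     /* data block */
    sha256_rounds(h, pad_w); /* padding block, schedule precomputed */
    for (int i = 0; i < 8; i++) {
        out[4 * i] = (uint8_t)(h[i] >> 24);
        out[4 * i + 1] = (uint8_t)(h[i] >> 16);
        out[4 * i + 2] = (uint8_t)(h[i] >> 8);
        out[4 * i + 3] = (uint8_t)h[i];
    }
}
```

In this sketch a caller would run `sha256_pad64_schedule` once and pass the resulting table to every `sha256_64bytes` call, so each inner Merkle node pays for one schedule expansion instead of two. Because `sha256_schedule` never touches `h`, it is also the part that a SIMD path can vectorize independently of the rounds.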