Optimize a scan of non-state-changing bytes with SSSE3 instructions #58

Open
wants to merge 2 commits into
base: main

Commits on Oct 10, 2023

  1. Optimize a scan of non-state-changing bytes with SSE2 instructions

    This commit optimizes the scan of non-state-changing bytes using SSE2 instructions.
    
    The [_mm_cmpestri](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm_cmpestri) operation appears to be quite slow
    compared to an alternative approach that applies [_mm_shuffle_epi8](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm_shuffle_epi8)
    to the low and high nibbles of the input and bitwise-ANDs the two results, which yields 16 LUT bytes in one go (it also involves a handful of other SSE2 operations,
    all of which have good latency/throughput properties). The resulting 16 LUT bytes can then be analyzed (again vectorized) to get the index of the first byte (if any)
    that changes the state, that is, the first byte whose LUT value is zero.
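    
    As a rough illustration of the scan described above (not the code from this change), the sketch below processes one 16-byte block. The tables `A` and `B`, the choice of `\n`, `\r` and `:` as state-changing bytes, and all function names are assumptions made up for the example. The SSSE3 path is only compiled when the compiler defines `__SSSE3__` and relies on the GCC/Clang builtin `__builtin_ctz`; otherwise a scalar model of the same lookup is used.
    
    ```c
    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>
    #if defined(__SSSE3__)
    #include <tmmintrin.h> /* SSSE3: _mm_shuffle_epi8 */
    #endif
    
    /* Hypothetical tables: a byte is state-changing when it is '\n' (0x0A),
     * '\r' (0x0D) or ':' (0x3A). Bit 0 covers high nibble 0, bit 1 covers
     * high nibble 3, bit 2 covers every other high nibble. */
    static const uint8_t A[16] = {7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 4, 7, 7, 6, 7, 7};
    static const uint8_t B[16] = {1, 4, 4, 2, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
    
    /* Scalar model: index of the first state-changing byte in p[0..15], or 16. */
    int first_state_change_scalar(const uint8_t *p) {
        assert(p != NULL);
        for (int i = 0; i < 16; i++)
            if ((A[p[i] & 0x0F] & B[p[i] >> 4]) == 0) return i;
        return 16;
    }
    
    #if defined(__SSSE3__)
    /* Vectorized version of the same lookup for one 16-byte block. */
    int first_state_change_ssse3(const uint8_t *p) {
        assert(p != NULL);
        const __m128i nibble = _mm_set1_epi8(0x0F);
        __m128i v  = _mm_loadu_si128((const __m128i *)p);
        __m128i lo = _mm_and_si128(v, nibble);
        /* The 16-bit shift leaks bits between bytes; the mask removes them. */
        __m128i hi = _mm_and_si128(_mm_srli_epi16(v, 4), nibble);
        __m128i a  = _mm_shuffle_epi8(_mm_loadu_si128((const __m128i *)A), lo);
        __m128i b  = _mm_shuffle_epi8(_mm_loadu_si128((const __m128i *)B), hi);
        /* 0xFF in every lane where A[lo] & B[hi] is zero. */
        __m128i zero = _mm_cmpeq_epi8(_mm_and_si128(a, b), _mm_setzero_si128());
        int mask = _mm_movemask_epi8(zero);
        return mask ? __builtin_ctz((unsigned)mask) : 16;
    }
    #endif
    
    /* Uses the SSSE3 path when available, the scalar model otherwise. */
    int first_state_change(const uint8_t *p) {
    #if defined(__SSSE3__)
        return first_state_change_ssse3(p);
    #else
        return first_state_change_scalar(p);
    #endif
    }
    ```
    
    A return value of 16 means the whole block can be skipped; any smaller value is the offset at which the parser would have to leave the fast path.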
    
    The tricky part here is the following:
    
    ```
    Find arrays A, B (uint8_t[16]) such that for all i, j = 0..15:
    * (A[i] & B[j]) == 0 if LUT[i | (j << 4)] == 0
    * (A[i] & B[j]) != 0 if LUT[i | (j << 4)] != 0  // Note we don't need any specific non-zero value
    ```
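    
    The two conditions translate directly into a brute-force check over all 256 byte values. The sketch below is only an illustration: `tables_match_lut`, `build_example_lut` and the `EXAMPLE_A`/`EXAMPLE_B` tables are hypothetical names and data (state-changing bytes assumed to be `\n`, `\r` and `:`), not part of this change.
    
    ```c
    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>
    
    /* Returns 1 when A and B reproduce the zero/non-zero pattern of lut for
     * every (i, j) pair, 0 otherwise. */
    int tables_match_lut(const uint8_t A[16], const uint8_t B[16],
                         const uint8_t lut[256]) {
        assert(A != NULL && B != NULL && lut != NULL);
        for (int j = 0; j < 16; j++) {
            for (int i = 0; i < 16; i++) {
                int from_tables = (A[i] & B[j]) != 0;
                int from_lut = lut[i | (j << 4)] != 0;
                if (from_tables != from_lut) return 0;
            }
        }
        return 1;
    }
    
    /* Hypothetical example tables for a LUT that is zero only at
     * 0x0A ('\n'), 0x0D ('\r') and 0x3A (':'). */
    static const uint8_t EXAMPLE_A[16] = {7, 7, 7, 7, 7, 7, 7, 7,
                                          7, 7, 4, 7, 7, 6, 7, 7};
    static const uint8_t EXAMPLE_B[16] = {1, 4, 4, 2, 4, 4, 4, 4,
                                          4, 4, 4, 4, 4, 4, 4, 4};
    
    /* Fills lut with the hypothetical example described above. */
    void build_example_lut(uint8_t lut[256]) {
        for (int b = 0; b < 256; b++) lut[b] = 1;
        lut[0x0A] = 0;
        lut[0x0D] = 0;
        lut[0x3A] = 0;
    }
    ```
    
    A check of this kind is cheap enough to run against any candidate pair of tables, whatever tool produced them.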
    
    To find `A` and `B` satisfying the above conditions, the [Z3](https://github.com/Z3Prover/z3) library is used.
    The npm package that wraps Z3 for use from TypeScript is not particularly friendly to the author of this change, so another package (synckit)
    was required to handle the wrapper's async API.
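    
    For intuition only, here is a much simpler constructive alternative to a solver. It is not what this change does: it gives one bit to each distinct row pattern of the LUT (one row per high nibble), so it only succeeds when there are at most 8 distinct non-empty rows, whereas a solver can also find tables in which bits are shared. The name `derive_tables` is hypothetical.
    
    ```c
    #include <stdint.h>
    #include <string.h>
    
    /* Fills A and B so that (A[i] & B[j]) != 0 exactly when
     * lut[i | (j << 4)] != 0. Returns 0 on success, or -1 when the LUT has
     * more than 8 distinct non-empty rows (A and B are then incomplete). */
    int derive_tables(const uint8_t lut[256], uint8_t A[16], uint8_t B[16]) {
        uint16_t patterns[8];
        int n_patterns = 0;
    
        memset(A, 0, 16);
        memset(B, 0, 16);
        for (int j = 0; j < 16; j++) {
            /* Bit i of row is set when byte (j << 4) | i has a non-zero LUT. */
            uint16_t row = 0;
            for (int i = 0; i < 16; i++)
                if (lut[i | (j << 4)] != 0) row |= (uint16_t)(1u << i);
            if (row == 0) continue; /* B[j] stays 0: the whole row is zero */
    
            /* Reuse the bit of an identical earlier row if there is one. */
            int k = 0;
            while (k < n_patterns && patterns[k] != row) k++;
            if (k == n_patterns) {
                if (n_patterns == 8) return -1; /* out of bits in a uint8_t */
                patterns[n_patterns++] = row;
                for (int i = 0; i < 16; i++)
                    if (row & (1u << i)) A[i] |= (uint8_t)(1u << k);
            }
            B[j] = (uint8_t)(1u << k);
        }
        return 0;
    }
    ```
    
    Because `B[j]` has exactly one bit set, `A[i] & B[j]` tests whether low nibble `i` is allowed in the row pattern assigned to high nibble `j`.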
    
    With llhttp as the benchmark framework, this change yields the following improvements:
    
    ```
    Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz
    
    http: "seanmonstar/httparse" (C)
    BEFORE: 8192.00 mb | 1456.72 mb/s | 2172811.81 ops/sec | 5.62 s
    AFTER:  8192.00 mb | 1752.90 mb/s | 2614577.82 ops/sec | 4.67 s
    
    ~20% improvement
    
    http: "nodejs/http-parser" (C)
    BEFORE: 8192.00 mb | 1050.60 mb/s | 2118535.14 ops/sec | 7.80 s
    AFTER:  8192.00 mb | 1167.42 mb/s | 2354101.76 ops/sec | 7.02 s
    
    ~11% improvement
    ```
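    
    The percentages can be reproduced from the throughput columns. A minimal sketch, assuming the improvement is measured as the relative gain in mb/s (the helper name `improvement_percent` is hypothetical):
    
    ```c
    #include <assert.h>
    
    /* Relative throughput gain of `after` over `before`, in percent. */
    double improvement_percent(double before_mb_s, double after_mb_s) {
        assert(before_mb_s > 0.0);
        return (after_mb_s / before_mb_s - 1.0) * 100.0;
    }
    ```
    
    With the figures above, `improvement_percent(1456.72, 1752.90)` is about 20.3 and `improvement_percent(1050.60, 1167.42)` is about 11.1.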
    
    For messages with more header fields, the numbers might be even more convincing.
    ngrodzitski committed Oct 10, 2023
    5b7c3a9

Commits on Oct 11, 2023

  1. Fix SSE families

    The previous commit actually uses SSSE3 instructions, not SSE2.
    ngrodzitski committed Oct 11, 2023
    7602ea1