#1064: TLS performance improvements #1375
1. Changes in HPACK decoder to copy only Huffman-decoded and dynamically indexed headers; 2. Appropriate changes in HPACK-decoder/parser unit-tests.
Corrections as a result of HPACK decoder/encoder/parser unit-tests debugging.
TfwStrs often go in arrays, and there is a 1-byte alignment gap between TfwStr neighbours. `eolen` is always in [0,2], and 14 bits are enough for the HPACK index.
# Conflicts:
#	tempesta_fw/http_parser.c
#	tempesta_fw/http_sess.c
#	tempesta_fw/t/unit/test_hpack.c
#	tempesta_fw/t/unit/test_http_sticky.c
…ve duplication conflicts
…file is used in the source tree
Small improvement of mpi_fixup_used().
SECP 256 EC. 2. Remove unit test for assembly routines - now we test them within test_mpi_math.c
Bash script to run all the TLS unit tests.
is the same as the newer 186-3 D.2). The column-based calculations involve more than 50 additions/subtractions. This number can be reduced to 27 operations if full 8-byte registers are used (which is close to WolfSSL's Montgomery reduction, about 33 operations). However, the algorithm may produce a result that is negative or bigger than the modulus, so a comparison of the result is required, possibly followed by additions and subtractions. I tried to use AVX2 for the FIPS modulo, but the vector permutations are too complex for the instruction set, and we would need to employ too many SIMD operations. Moreover, unfortunately we can't process the whole 32-byte vector at once since we need to handle carry. All in all, AVX2 doesn't work well on small vectors with complex operations. I need more study on Montgomery vs FIPS modulo reductions.
1. MPI memory is allocated from memory pools with preallocated space. All MPI operations are constant size, so we know precisely how much memory a handshake of a particular type uses. Thus we do not need to check return codes from MPI memory allocation routines: they always succeed if a memory pool was successfully created.
2. Do not check public and private keys in the normal workflow; do this only for debug builds.
3. Small performance improvement of ttls_mpi_safe_cond_assign().
SECP 256 curve (ported from WolfSSL). 2. Do not call multiplication or squaring for MPIs with value 1. 3. Don't call modulus quasi reduction on MPIs of TlsEcpGrp->bits size - single addition or subtraction does the business faster. 4. Replace TlsEcpGrp->pbits and ->nbits by single ->bits.
Current perf profile:
8.57% [tempesta_tls] [k] ecp_mod_p256_x86_64
4.55% [tempesta_tls] [k] ttls_mpi_shift_r
4.04% [tempesta_tls] [k] ttls_mpi_sub_abs
2.63% [tempesta_tls] [k] ttls_mpi_safe_cond_assign
2.45% [tempesta_tls] [k] ttls_mpi_sub_mpi
2.30% [tempesta_tls] [k] ttls_mpi_inv_mod
2.28% [tempesta_tls] [k] ttls_mpi_cmp_mpi
2.22% [tempesta_tls] [k] mpi_mul_x86_64_4
Introduce __check_stack_addr() to properly check an address on the stack, either for the current user space process or for a softirq working on the same stack after an IRQ. This is a more reliable check: previously there could be 4 dynamically allocated pages just a little before the stack, so the assertion failed.
1. Fix test_rsa: RSA context can not be allocated on the stack pool since TlsRSACtx->RN will be freed at the end of ttls_mpi_exp_mod(). 2. Minor fixes and cleanups.
1. RSA fixes: 1.1. ttls_mpi_div_mpi() is used in RSA, so replace alloca() with our own stack and make it 16 pages in size; 1.2. free allocated memory on failed ttls_rsa_private() call; 2. Fix double free error on ttls_pk_parse_subpubkey() call failure; 3. Use more straightforward stack verification in Mpi Pool; 4. Some minor fixes for Tempesta FW unit tests.
Remove the lock and make the blinding values per-cpu. Calculate the initial blinding values on configuration phase.
…a into ak-bn-mem-profiles
Fix page leakage on errors and an unsigned comparison against -1 (which is never true) in ttls_handshake_server_hello() and ttls_handshake_finished(). Rename ttls_purge_io_ctx() to tfw_tls_purge_io_ctx() since it's not in Tempesta TLS (ttls_ prefix). Small cleanup and code simplification in TLS context cleanup.
code, the caller does this. This bug led to a crash after the previous fix of the leakage: now the page has no 'extra' reference. Fix ttls_write_certificate(): fill the first skb fragment immediately so as not to crash if we fail somewhere in the middle of the function and ttls_handshake_server_hello() tries to put an absent fragment page. Minor fixes in ttls_parse_certificate(), which isn't used so far. Simplify the ttls_write_certificate() code and make skb fragment usage more efficient by writing the certificate length in the configuration phase. A data offset was added to tfw_cfg_read_file() to place the 3 bytes of the length on the same page. Simplify ttls_handshake_finished(): do not reenter the FSM for the TTLS_SERVER_CHANGE_CIPHER_SPEC state; instead just save the state to mark the XFRM context ready and move to the next state.
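The commit note above about replacing the modulus quasi reduction with a single addition or subtraction can be illustrated with a minimal sketch. It assumes single-word operands that are already reduced (a, b < p) and a modulus below 2^63 so the sum cannot wrap; `mod_add()` and `mod_sub()` are hypothetical names for illustration, not Tempesta TLS functions, and the real code works on multi-limb MPIs.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical single-word illustration: for a, b < p the sum is below 2p,
 * so one conditional subtraction brings it back into [0, p).
 * Assumption: p < 2^63, so a + b cannot wrap around 64 bits.
 */
static uint64_t
mod_add(uint64_t a, uint64_t b, uint64_t p)
{
	uint64_t s;

	assert(p < (1ULL << 63) && a < p && b < p);
	s = a + b;		/* s < 2p */
	if (s >= p)
		s -= p;		/* single correction instead of a full reduction */
	return s;
}

/*
 * For a, b < p the difference lies in (-p, p), so one conditional
 * addition of p is enough.
 */
static uint64_t
mod_sub(uint64_t a, uint64_t b, uint64_t p)
{
	assert(p < (1ULL << 63) && a < p && b < p);
	return a >= b ? a - b : a + p - b;
}
```

For example, with p = 7, `mod_add(5, 6, 7)` yields 4 and `mod_sub(2, 5, 7)` yields 4. The branches here are data dependent, so this sketch is not constant time; it only shows why one correction step suffices for operands of the modulus size.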
With Tempesta configuration if RSA key is used (never reproduced with EC keys):
After running several attempts of tls-perf
Can't give more details: this happened once, occasionally.
 * %RCX - B->used (used directly for looping);
 * %R8 - A->used.
 */
ENTRY(mpi_sub_x86_64)
I get warning here:
tempesta/tls/.tmp_bignum_x86-64.o: warning: objtool: mpi_sub_x86_64()+0x6c: sibling call from callable instruction with modified stack frame
I see the warning each time I build the module and I have explored the code. The problem code is
.sub_small_b:
clc
ANNOTATE_RETPOLINE_SAFE
jmpq *%rbx
and it seems objtool doesn't like
jz .sub_small_b
pushq %r12
However, the register is pop'ed before the label. The code is fine. I didn't find a way to make objtool like the code, but there were recent fixes in the tool itself, so the warning will probably go away after #1049.
In the later #1064 assembly work we'll probably figure out how to adjust the code to keep pushing the register to the stack late, but avoid the warnings.
failures even with the RSP register check, and I don't see a reason to export the reliable task_stack_page() and irq_stack_ptr symbols from the kernel just for the assertion. The reason for the false failures is that the maximum stack size is large enough to span pages above the stack if RSP points to the beginning of the allowed stack space.
The call trace #1375 (comment) is quite weird. The script Given that the problem appeared on commit a5c5143, before the last memory leak and corruption fixes, I assume that the cause is corrupted memory.
TODO comments about further p256 optimization notes. Remove unused ECDH context blinding values. Remove outdated comment and static assert.
Looks good to me
#1064 point 1.
Introduce memory pools for the MPIs. An MPI memory profile is an MPI memory pool filled at configuration time, so that we can use a stream copy to quickly create a ready-to-use MPI image for a particular PK computation.
#1064 point 2
Reduce the number of public key checks and add assembly implementations for the core math.
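The memory profile idea from point 1 can be sketched roughly as follows. This is a minimal user space illustration under stated assumptions, not the PR's implementation: `MpiPool`, `Mpi`, `pool_alloc()`, `mpi_data()` and `profile_clone()` are hypothetical names, the pool is a fixed array of 64-bit limbs, and MPIs refer to their limbs by offset so that a copied image stays valid without pointer fixups.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define POOL_LIMBS	32	/* hypothetical fixed pool capacity */

typedef struct {
	size_t		used;			/* limbs handed out so far */
	uint64_t	limbs[POOL_LIMBS];	/* preallocated space */
} MpiPool;

/* An MPI addresses its limbs by offset, so a copied pool needs no fixups. */
typedef struct {
	size_t	off;	/* index of the first limb in MpiPool->limbs */
	size_t	n;	/* number of limbs */
} Mpi;

/*
 * Bump allocation from the preallocated space. Sizes are known in advance,
 * so exhaustion is treated as a programming error, not a runtime failure.
 */
static Mpi
pool_alloc(MpiPool *pool, size_t n)
{
	Mpi m = { pool->used, n };

	assert(pool->used + n <= POOL_LIMBS);
	pool->used += n;
	return m;
}

static uint64_t *
mpi_data(MpiPool *pool, Mpi m)
{
	return &pool->limbs[m.off];
}

/* "Stream copy": one memcpy() creates a ready-to-use working image. */
static void
profile_clone(MpiPool *dst, const MpiPool *profile)
{
	memcpy(dst, profile, sizeof(*dst));
}
```

In this sketch a profile would be filled once at configuration time (allocate the MPIs, store constants such as a modulus), and each computation would call `profile_clone()` into its own working pool and continue allocating temporaries from there. Because the capacity is sized for the known worst case, callers do not need to check allocation return codes.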
TODO
The PR is buggy and is to be fixed during the review.
Sorry for the large commits of mixed logical changes and cleanups.