#1064: TLS performance imporovements #1375

krizhanovsky · 2019-12-26T23:44:40Z

#1064 point 1.

Introduce memory pools for the MPIs. MPI memory profile is an MPI memory pool filled at configuration time, so that we can use stream copy to quickly create ready to use MPI image for a particular PK computation.

#1064 point 2

Reduce number of public key checkings and assembly implementations for the core math.

TODO

The PR is buggy and to be fixed during the review.

Sorry for the large commits of mixed logical changes and cleanups.

1. Changes in HPACK decoder to copy only Huffman-decoded and dynamically indexed headers; 2. Appropriate changes in HPACK-decoder/parser unit-tests.

…lity (#309).

…O' comments (#309).

Corrections as a result of HPACK decoder/encoder/parser unit-tests debugging.

…tion flow (#309).

…lts (#309).

…tion (#309).

TfwStrs often go in arrays, and there is a 1-byte alignment gap between TfwStr neighbours. `eolen` is always [0,2], 14-bits are enough for hpack index.

# Conflicts: # tempesta_fw/http_parser.c # tempesta_fw/http_sess.c # tempesta_fw/t/unit/test_hpack.c # tempesta_fw/t/unit/test_http_sticky.c

…ve duplication conflicts

…file is used in the source tree

Small improvement of mpi_fixup_used().

SECP 256 EC. 2. Remove unit test for assembly routines - now we test them within test_mpi_math.c

Bash script to run all the TLS unit tests.

is the same as the newer 186-3 D.2). The column based calculations involve more than 50 additions/subtractions. This number can be reduced to 27 operations if full 8-byte registers are used, which is close to WolfSSL's Montgomery reduction - about 33 operations). However, the algorithm may produce bigger than the modulo or negative result, so a comparison of the result is required and possibley following additions and subtractions. I tried to use AVX2 for the FIPS modulo, but the vector premutations are too complex for the instruction set we we need to empy too many SIMD operations. Moreover, unfortunately we can't process the whole 32 byte vector at once since we need to handle carry. All in all AVX2 doesn't work on the small vectors with complex operations. I need more study on Montgomery vs FIPS mudulo reductions. 1. MPI memory is allocated from memory pools with preallocated space. All MPI operations are constant size, so we know precisely how much memory a handshake of particular type uses. Thus we do not need to check return codes from MPI memory allocation routines - they're always succeed if a memory pool was successfully created. 2. Do not check public and private key in normal workflow, do this only for debug builds. 3. Small performance improvement of ttls_mpi_safe_cond_assign().

SECP 256 curve (ported from WolfSSL). 2. Do not call multiplication or squaring for MPIs with value 1. 3. Don't call modulus quasi reduction on MPIs of TlsEcpGrp->bits size - single addition or subtraction does the business faster. 4. Replace TlsEcpGrp->pbits and ->nbits by single ->bits.

Current perf profile: 8.57% [tempesta_tls] [k] ecp_mod_p256_x86_64 4.55% [tempesta_tls] [k] ttls_mpi_shift_r 4.04% [tempesta_tls] [k] ttls_mpi_sub_abs 2.63% [tempesta_tls] [k] ttls_mpi_safe_cond_assign 2.45% [tempesta_tls] [k] ttls_mpi_sub_mpi 2.30% [tempesta_tls] [k] ttls_mpi_inv_mod 2.28% [tempesta_tls] [k] ttls_mpi_cmp_mpi 2.22% [tempesta_tls] [k] mpi_mul_x86_64_4

Introduce __check_stack_addr() to properly check address on stack, either for a current user space process or softirq working on the same stack after IRQ. This is more reliable check since previously there could be 4 dynamically allocated pages just little bit before the stach, so the assertion failed.

1. Fix test_rsa: RSA context can not be allocated on the stack pool since TlsRSACtx->RN will be freed at the end of ttls_mpi_exp_mod(). 2. Minor fixes and cleanups.

1. RSA fixes: 1.1. ttls_mpi_div_mpi() is used in RSA, so replace alloca() with our own stack and make it 16 pages in size; 1.2. free allocated memory on failed ttls_rsa_private() call; 2. Fix double free error on ttls_pk_parse_subpubkey() call failure; 3. Use more straightforward stack verification in Mpi Pool; 4. Some minor fixes for Tempesta FW unit tests.

Remove the lock and make the blinding values per-cpu. Calculate the initial blinding values on configuration phase.

…a into ak-bn-mem-profiles

9c3b39d

Fix pages leakage on errors and unsgined comparison against -1 (which is never true) in in ttls_handshake_server_hello() and ttls_handshake_finished(). Rename ttls_purge_io_ctx() to tfw_tls_purge_io_ctx() since it's not itn Tempesta TLS (ttls_ prefix). Small cleanup and code simplification in TLS context cleanup.

code, the caller does this. This bug lead to crash after the previous fix of the leakage - now the page has no 'extra' reference. Fix ttls_write_certificate(): fill the first skb frament immediately to not to crash if we fail somewhere at the middle of the function and ttls_handshake_server_hello() tries to put absent fragment page. Minor fixes in ttls_parse_certificate() which isn't used so far. Simplify ttls_write_certificate() code and make skb fragments usage more efficient by writing the certificate length on configuration plase. Data offset was added to tfw_cfg_read_file() to place the 3 byte of the length on the same page. Simplify ttls_handshake_finished() do not reenter the FSM for TTLS_SERVER_CHANGE_CIPHER_SPEC state, instead just save the state to mark the XFRM context ready and move to the next state.

vankoven · 2020-04-24T13:29:46Z

With Tempesta configuration if RSA key is used (never reproduced with EC keys):

listen 80;
listen 443 proto=https;
cache 0;

tls_certificate /home/user/cert/tfw-root.crt;
tls_certificate_key /home/user/cert/tfw-root.key;

srv_group default {
    server 127.0.0.1:8080 conns_n=1;
}

block_action attack reply;

vhost default {
    proxy_pass default;
}

http_chain {
     -> default;
}

After running several attempts of tls-perf

./tls-perf -T 10  192.168.122.12 443

debian login: [193427.134548] [tempesta tls] Warning: None of the common ciphersuites is usable (e.g. no suitable certificate)
[193427.136826] [tempesta fw] Warning: Unrecognized TLS receive return code -22, drop packet
[193527.406298] [tempesta fw] Warning: Unrecognized TLS receive return code -28288, drop packet
[193527.409931] [tempesta fw] Warning: Unrecognized TLS receive return code -28288, drop packet
[193527.413461] ------------[ cut here ]------------
[193527.415771] WARNING: CPU: 1 PID: 16 at /home/user/qtc/release/tempesta/tls/bignum.c:1223 ttls_mpi_exp_mod+0x6b3/0x9e0 [tempesta_tls]
[193527.421587] Modules linked in: tempesta_fw(O) tempesta_db(O) tempesta_tls(O) tempesta_lib(O) sha256_ssse3 sha512_ssse3 sha512_generic ccm kvm_intel binfmt_misc snd_hda_codec_generic iTCO_wdt iTCO_vendor_support kvm snd_hda_intel irqbypass crct10dif_pclmul crc32_pclmul snd_hda_codec ghash_clmulni_intel snd_hda_core snd_hwdep snd_pcm snd_timer snd sg soundcore virtio_console virtio_balloon virtio_gpu ttm pcspkr evdev lpc_ich serio_raw drm_kms_helper mfd_core drm shpchp button ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 fscrypto ecb aesni_intel crypto_simd cryptd glue_helper aes_x86_64 crc32c_generic sr_mod cdrom virtio_blk virtio_net crc32c_intel psmouse ahci i2c_i801 libahci libata sym53c8xx ehci_pci scsi_transport_spi uhci_hcd ehci_hcd scsi_mod virtio_pci virtio_ring virtio usbcore usb_common
[193527.448064]  [last unloaded: tempesta_lib]
[193527.449526] CPU: 1 PID: 16 Comm: ksoftirqd/1 Tainted: G           O    4.14.0-tempesta-kmemleak-amd64 #1 Debian 4.14.32-tfw7-1
[193527.452856] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS ?-20191223_100556-anatol 04/01/2014
[193527.455129] task: ffff91cb415d3c00 task.stack: ffffa01d006b0000
[193527.456308] RIP: 0010:ttls_mpi_exp_mod+0x6b3/0x9e0 [tempesta_tls]
[193527.457894] RSP: 0018:ffffa01d006b3408 EFLAGS: 00010206
[193527.459354] RAX: 0000000000000000 RBX: ffff91caf5ac1010 RCX: 0000000000000021
[193527.461390] RDX: 0000000000000020 RSI: ffff91caf5ac1018 RDI: ffff91caf5ac1028
[193527.464588] RBP: ffffa01d006b34c8 R08: ffff91caf5ac1040 R09: 0000000000000008
[193527.468112] R10: 0000000000000148 R11: ffff91cb0d740010 R12: ffff91caf5ac1020
[193527.471518] R13: ffffffffffffffff R14: ffff91caf5ac1040 R15: ffff91caf5ac1018
[193527.474295] FS:  0000000000000000(0000) GS:ffff91cb65500000(0000) knlGS:0000000000000000
[193527.476345] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[193527.477465] CR2: 00007fbd3f3c9018 CR3: 000000004ac0a006 CR4: 00000000003606e0
[193527.478678] Call Trace:
[193527.479456]  ? ttls_mpool+0x18/0x110 [tempesta_tls]
[193527.480394]  ? ttls_mpi_sub_int+0x4f/0xc0 [tempesta_tls]
[193527.481289]  ttls_dhm_make_params+0xa0/0x210 [tempesta_tls]
[193527.482261]  ttls_handshake_server_hello+0xd7c/0x1080 [tempesta_tls]
[193527.483772]  ? ttls_update_checksum+0x96/0x160 [tempesta_tls]
[193527.486626]  ttls_handshake_server_step+0x378/0x1a80 [tempesta_tls]
[193527.489689]  ? shash_async_import+0x40/0x40
[193527.490587]  ttls_recv+0x147/0x680 [tempesta_tls]
[193527.491529]  ? ttls_decrypt+0x5d0/0x5d0 [tempesta_tls]
[193527.493000]  ss_skb_process+0x11b/0x140 [tempesta_fw]
[193527.495499]  ? ip_finish_output2+0x19f/0x3b0
[193527.497245]  tfw_tls_msg_process+0xc0/0x4d0 [tempesta_fw]
[193527.499405]  ? ip_output+0x71/0xe0
[193527.501001]  ? ip_queue_xmit+0x5c/0x3b0
[193527.502922]  __gfsm_fsm_exec+0x56/0x90 [tempesta_fw]
[193527.504117]  tfw_connection_recv+0x41/0x60 [tempesta_fw]
[193527.506733]  ? tfw_connection_send+0x30/0x30 [tempesta_fw]
[193527.508318]  ss_tcp_process_data+0x1ea/0x480 [tempesta_fw]
[193527.509647]  ? dev_hard_start_xmit+0x31/0x1f0
[193527.511152]  ss_tcp_data_ready+0x43/0x90 [tempesta_fw]
[193527.512477]  tcp_data_queue+0x4f5/0xc50
[193527.513134]  tcp_rcv_established+0x27c/0x570
[193527.513847]  tcp_v4_do_rcv+0x129/0x1d0
[193527.514474]  tcp_v4_rcv+0x947/0xa50
[193527.515295]  ip_local_deliver_finish+0x9a/0x1c0
[193527.517538]  ip_local_deliver+0x6b/0xe0
[193527.519457]  ? tcp_v4_early_demux+0x112/0x150
[193527.521606]  ? ip_rcv_finish+0x17a/0x400
[193527.523205]  ip_rcv+0x289/0x3c0
[193527.524626]  ? inet_del_offload+0x40/0x40
[193527.525545]  __netif_receive_skb_core+0x84f/0xb30
[193527.526328]  ? lock_timer_base+0x74/0x90
[193527.527340]  ? process_backlog+0xa3/0x160
[193527.528969]  process_backlog+0xa3/0x160
[193527.530352]  net_rx_action+0x28e/0x3f0
[193527.531587]  __do_softirq+0x10f/0x2a8
[193527.532913]  run_ksoftirqd+0x1c/0x40
[193527.533918]  smpboot_thread_fn+0x10e/0x160
[193527.535624]  kthread+0xff/0x130
[193527.536592]  ? sort_range+0x20/0x20
[193527.537851]  ? kthread_create_on_node+0x70/0x70
[193527.539564]  ret_from_fork+0x35/0x40
[193527.540512] Code: 45 b6 66 44 89 4d b0 66 44 89 55 b2 e8 f7 d9 ff ff 48 8b 8d 70 ff ff ff 41 bb 01 00 00 00 66 44 89 5d b0 49 89 cf e9 da fa ff ff <0f> 0b b9 f4 ff ff ff e9 d4 fe ff ff 48 8b 85 78 ff ff ff 48 8b 
[193527.545381] ---[ end trace 00f20e676f79f25b ]---

Can't give more details this happend once ocasionally.

ktest/asm/export.h

tls/mpool.c

vankoven · 2020-04-24T12:46:15Z

tls/bignum_x86-64.S

+ * %RCX	- B->used (used directly for looping);
+ * %R8	- A->used.
+ */
+ENTRY(mpi_sub_x86_64)


I get warning here:

tempesta/tls/.tmp_bignum_x86-64.o: warning: objtool: mpi_sub_x86_64()+0x6c: sibling call from callable instruction with modified stack frame

I see the warning each time I build the module and I have explored the code. The problem code is

.sub_small_b: clc ANNOTATE_RETPOLINE_SAFE jmpq *%rbx

and it seems objtool doesn't like

jz .sub_small_b pushq %r12

However, the register is pop'ed before the label. The code is fine. I didn't find a way to make objtool like the code, but there were fixes recently in the tool iteself, so probably the warning will go away after #1049

In later #1064 assembly work probably we'll figure out how to adjust the code to keep pushing the register to the stack lately, but avoid the warnings.

tls/mpool.c

tls/ttls.c

tls/bignum.c

failures even with the RSP register check and I don't see a reason to export reliable task_stack_page() and irq_stack_ptr symbols from the kernel just for the assertion. The reason for the false failures is that maximum stack size is large enough to span pages above the stack if RSP is pointing to begin of the allowed stack space.

krizhanovsky · 2020-04-24T16:31:24Z

The call trace #1375 (comment) is quite weird. tls-perf produces only ECC handshakes, so ClientHello step should fail and it does fail - Warning: None of the common ciphersuites is usable is thrown exactly on the step. Meantime the call trace is from the later ServerHello step.

The script while :; do ./tls-perf -T 10 192.168.100.4 443; sleep 1; done even w/o the sleep 1 can not reveal the problem.

Having that the problem has appeared on a5c5143 commit, before the last memory leak and corruption fixes, I assume that the problem is in the corrupted memory.

TODO comments about further p256 optimization notes. Remove unused ECDH context blinding values. Remove outdated comment and static assert.

vankoven

Looks good to me

aleksostapenko and others added 27 commits October 26, 2019 19:12

HTTP/2 Parser implementation (#309).

71d7b7a

HTTP/2 Parser implementation (#309):

dc71702

1. Changes in HPACK decoder to copy only Huffman-decoded and dynamically indexed headers; 2. Appropriate changes in HPACK-decoder/parser unit-tests.

HTTP/2 Parser implementation: some comments added for HPACK functiona…

b5658be

…lity (#309).

HTTP/2 Parser implementation: numerous corrections and cleanups (#309).

9a57b1f

HTTP/2 Parser implementation: removal of outdated (already done) 'TOD…

8f9a906

…O' comments (#309).

HTTP/2 Parser implementation (#309):

fb7c80f

Corrections as a result of HPACK decoder/encoder/parser unit-tests debugging.

HTTP/2 Parser implementation: redesign of HTTP/1.1=>HTTP/2 transforma…

e50d74d

…tion flow (#309).

HTTP/2 Parser implementation: corrections according to debugging resu…

c935484

…lts (#309).

HTTP/2 Parser implementation: corrections in internal response genera…

f2e01bd

…tion (#309).

bump copyrights

c692adb

Divide TfwStr flags and HPACK index members

dae477a

TfwStrs often go in arrays, and there is a 1-byte alignment gap between TfwStr neighbours. `eolen` is always [0,2], 14-bits are enough for hpack index.

Fix compile-time issues with 'stack frame non-standard' warnings

8849e87

Update code comments

ae7ce40

Update Linux kernel patch

e9bf486

Remove unneeded ss_skb_data() function

e8b1f87

Minor style fixups

266581d

Generate HPACK header on compiletime

45c9eba

Merge branch 'master' into ik-1369-merge-master

7b1cce0

# Conflicts: # tempesta_fw/http_parser.c # tempesta_fw/http_sess.c # tempesta_fw/t/unit/test_hpack.c # tempesta_fw/t/unit/test_http_sticky.c

Resolve merge issues

fd48c6e

Fail test if parsed string was composed incorrectly, don't crash

80d2d4a

port changes from avb-1202 to http2

b1e6a11

Set-cookie was added in two branches with different processing, resol…

0620349

…ve duplication conflicts

Add HPACK generator and generated file

d4c682d

Fix some small comments

c67a387

Hpack state machine table in no more generated, product of generated …

a5519f8

…file is used in the source tree

port method override detection feature to HTTP2

5c77db5

Fix pointer assignment

a7de0eb

krizhanovsky requested a review from i-rinat December 27, 2019 15:33

krizhanovsky assigned i-rinat Dec 27, 2019

krizhanovsky changed the title ~~MPI memory pools and profiles~~ #1064: MPI memory pools and profiles Dec 28, 2019

krizhanovsky added 14 commits April 11, 2020 20:06

Specializations for the most frequent subtraction operands.

b9a8798

Small improvement of mpi_fixup_used().

1. Assembly implementations of left and right shifts optimized for

897ceda

SECP 256 EC. 2. Remove unit test for assembly routines - now we test them within test_mpi_math.c

Small optimization for ttls_mpi_sub_int().

6b471fe

Bash script to run all the TLS unit tests.

Remove unused TlsEcpGrp.h

563b1cf

Fixes for RSA:

324043c

1. Fix test_rsa: RSA context can not be allocated on the stack pool since TlsRSACtx->RN will be freed at the end of ttls_mpi_exp_mod(). 2. Minor fixes and cleanups.

Fix a race on RSA blinding values reading.

9c3b39d

Remove the lock and make the blinding values per-cpu. Calculate the initial blinding values on configuration phase.

Merge branch 'ak-bn-mem-profiles' of github.com:tempesta-tech/tempest…

3d7a84d

…a into ak-bn-mem-profiles

Fix unit tests after the merge

1673b56

Remove the FIXME comment as the problem was fixed in

a5c5143

9c3b39d

This was referenced Apr 16, 2020

Verification of GCM tag failed: MAC check failed #1310

Closed

Memory leak on massive TLS handshakes #1382

Closed

krizhanovsky added 2 commits April 21, 2020 00:09

Add skb accounting. Quite useful for memory leaks hunting

539fb7c

Update the kernel patch for the previous commit

68d310c

krizhanovsky mentioned this pull request Apr 22, 2020

BUG at http_limits.c:406 #1404

Closed

krizhanovsky added 2 commits April 23, 2020 21:15

vankoven reviewed Apr 24, 2020

View reviewed changes

krizhanovsky added 2 commits April 24, 2020 19:56

Small fixes and cleanups according to the last code review

13b9fd8

Update Tempesta TLS module version and README.

cae2d3d

TODO comments about further p256 optimization notes. Remove unused ECDH context blinding values. Remove outdated comment and static assert.

vankoven approved these changes Apr 28, 2020

View reviewed changes

krizhanovsky merged commit bcfb458 into master Apr 28, 2020

krizhanovsky deleted the ak-bn-mem-profiles branch April 28, 2020 11:26

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

#1064: TLS performance imporovements #1375

#1064: TLS performance imporovements #1375

krizhanovsky commented Dec 26, 2019 •

edited

Loading

vankoven commented Apr 24, 2020

vankoven Apr 24, 2020

krizhanovsky Apr 24, 2020

krizhanovsky commented Apr 24, 2020

vankoven left a comment

#1064: TLS performance imporovements #1375

#1064: TLS performance imporovements #1375

Conversation

krizhanovsky commented Dec 26, 2019 • edited Loading

#1064 point 1.

#1064 point 2

TODO

vankoven commented Apr 24, 2020

vankoven Apr 24, 2020

Choose a reason for hiding this comment

krizhanovsky Apr 24, 2020

Choose a reason for hiding this comment

krizhanovsky commented Apr 24, 2020

vankoven left a comment

Choose a reason for hiding this comment

krizhanovsky commented Dec 26, 2019 •

edited

Loading