Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

#1064: TLS performance imporovements #1375

Merged
merged 154 commits into from
Apr 28, 2020
Merged

#1064: TLS performance imporovements #1375

merged 154 commits into from
Apr 28, 2020

Conversation

krizhanovsky
Copy link
Contributor

@krizhanovsky krizhanovsky commented Dec 26, 2019

#1064 point 1.

Introduce memory pools for the MPIs. MPI memory profile is an MPI memory pool filled at configuration time, so that we can use stream copy to quickly create ready to use MPI image for a particular PK computation.

#1064 point 2

Reduce number of public key checkings and assembly implementations for the core math.

TODO

The PR is buggy and to be fixed during the review.

Sorry for the large commits of mixed logical changes and cleanups.

aleksostapenko and others added 27 commits October 26, 2019 19:12
1. Changes in HPACK decoder to copy only Huffman-decoded and dynamically indexed headers;
2. Appropriate changes in HPACK-decoder/parser unit-tests.
Corrections as a result of HPACK decoder/encoder/parser unit-tests debugging.
TfwStrs often go in arrays, and there is a 1-byte alignment gap between
TfwStr neighbours. `eolen` is always [0,2], 14-bits are enough for
hpack index.
# Conflicts:
#	tempesta_fw/http_parser.c
#	tempesta_fw/http_sess.c
#	tempesta_fw/t/unit/test_hpack.c
#	tempesta_fw/t/unit/test_http_sticky.c
@krizhanovsky krizhanovsky requested a review from i-rinat December 27, 2019 15:33
@krizhanovsky krizhanovsky changed the title MPI memory pools and profiles #1064: MPI memory pools and profiles Dec 28, 2019
Small improvement of mpi_fixup_used().
   SECP 256 EC.

2. Remove unit test for assembly routines - now we test them within
   test_mpi_math.c
Bash script to run all the TLS unit tests.
is the same as the newer 186-3 D.2). The column based calculations
involve more than 50 additions/subtractions. This number can be reduced
to 27 operations if full 8-byte registers are used, which is close to
WolfSSL's Montgomery reduction - about 33 operations). However, the
algorithm may produce bigger than the modulo or negative result, so a
comparison of the result is required and possibley following additions
and subtractions. I tried to use AVX2 for the FIPS modulo, but the
vector premutations are too complex for the instruction set we we need
to empy too many SIMD operations. Moreover, unfortunately we can't
process the whole 32 byte vector at once since we need to handle carry.
All in all AVX2 doesn't work on the small vectors with complex operations.
I need more study on Montgomery vs FIPS mudulo reductions.

1. MPI memory is allocated from memory pools with preallocated space.
   All MPI operations are constant size, so we know precisely how much
   memory a handshake of particular type uses. Thus we do not need to
   check return codes from MPI memory allocation routines - they're
   always succeed if a memory pool was successfully created.

2. Do not check public and private key in normal workflow, do this only
   for debug builds.

3. Small performance improvement of ttls_mpi_safe_cond_assign().
   SECP 256 curve (ported from WolfSSL).

2. Do not call multiplication or squaring for MPIs with value 1.

3. Don't call modulus quasi reduction on MPIs of TlsEcpGrp->bits size -
   single addition or subtraction does the business faster.

4. Replace TlsEcpGrp->pbits and ->nbits by single ->bits.
Current perf profile:

     8.57%  [tempesta_tls]  [k] ecp_mod_p256_x86_64
     4.55%  [tempesta_tls]  [k] ttls_mpi_shift_r
     4.04%  [tempesta_tls]  [k] ttls_mpi_sub_abs
     2.63%  [tempesta_tls]  [k] ttls_mpi_safe_cond_assign
     2.45%  [tempesta_tls]  [k] ttls_mpi_sub_mpi
     2.30%  [tempesta_tls]  [k] ttls_mpi_inv_mod
     2.28%  [tempesta_tls]  [k] ttls_mpi_cmp_mpi
     2.22%  [tempesta_tls]  [k] mpi_mul_x86_64_4
Introduce __check_stack_addr() to properly check address on stack, either
for a current user space process or softirq working on the same stack
after IRQ. This is more reliable check since previously there could be
4 dynamically allocated pages just little bit before the stach, so the
assertion failed.
1. Fix test_rsa: RSA context can not be allocated on the stack pool since
   TlsRSACtx->RN will be freed at the end of ttls_mpi_exp_mod().

2. Minor fixes and cleanups.
1. RSA fixes:
1.1. ttls_mpi_div_mpi() is used in RSA, so replace alloca() with our own
     stack and make it 16 pages in size;
1.2. free allocated memory on failed ttls_rsa_private() call;

2. Fix double free error on ttls_pk_parse_subpubkey() call failure;

3. Use more straightforward stack verification in Mpi Pool;

4. Some minor fixes for Tempesta FW unit tests.
Remove the lock and make the blinding values per-cpu.
Calculate the initial blinding values on configuration phase.
Fix pages leakage on errors and unsgined comparison against -1 (which
is never true) in in ttls_handshake_server_hello() and
ttls_handshake_finished().

Rename ttls_purge_io_ctx() to tfw_tls_purge_io_ctx() since it's not
itn Tempesta TLS (ttls_ prefix).

Small cleanup and code simplification in TLS context cleanup.
code, the caller does this. This bug lead to crash after the previous
fix of the leakage - now the page has no 'extra' reference.

Fix ttls_write_certificate(): fill the first skb frament immediately
to not to crash if we fail somewhere at the middle of the function
and ttls_handshake_server_hello() tries to put absent fragment page.

Minor fixes in ttls_parse_certificate() which isn't used so far.

Simplify ttls_write_certificate() code and make skb fragments usage
more efficient by writing the certificate length on configuration
plase. Data offset was added to tfw_cfg_read_file() to place the
3 byte of the length on the same page.

Simplify ttls_handshake_finished() do not reenter the FSM for
TTLS_SERVER_CHANGE_CIPHER_SPEC state, instead just save the state
to mark the XFRM context ready and move to the next state.
@vankoven
Copy link
Contributor

With Tempesta configuration if RSA key is used (never reproduced with EC keys):

listen 80;
listen 443 proto=https;
cache 0;

tls_certificate /home/user/cert/tfw-root.crt;
tls_certificate_key /home/user/cert/tfw-root.key;

srv_group default {
    server 127.0.0.1:8080 conns_n=1;
}

block_action attack reply;

vhost default {
    proxy_pass default;
}

http_chain {
     -> default;
}

After running several attempts of tls-perf

./tls-perf -T 10  192.168.122.12 443 
debian login: [193427.134548] [tempesta tls] Warning: None of the common ciphersuites is usable (e.g. no suitable certificate)
[193427.136826] [tempesta fw] Warning: Unrecognized TLS receive return code -22, drop packet
[193527.406298] [tempesta fw] Warning: Unrecognized TLS receive return code -28288, drop packet
[193527.409931] [tempesta fw] Warning: Unrecognized TLS receive return code -28288, drop packet
[193527.413461] ------------[ cut here ]------------
[193527.415771] WARNING: CPU: 1 PID: 16 at /home/user/qtc/release/tempesta/tls/bignum.c:1223 ttls_mpi_exp_mod+0x6b3/0x9e0 [tempesta_tls]
[193527.421587] Modules linked in: tempesta_fw(O) tempesta_db(O) tempesta_tls(O) tempesta_lib(O) sha256_ssse3 sha512_ssse3 sha512_generic ccm kvm_intel binfmt_misc snd_hda_codec_generic iTCO_wdt iTCO_vendor_support kvm snd_hda_intel irqbypass crct10dif_pclmul crc32_pclmul snd_hda_codec ghash_clmulni_intel snd_hda_core snd_hwdep snd_pcm snd_timer snd sg soundcore virtio_console virtio_balloon virtio_gpu ttm pcspkr evdev lpc_ich serio_raw drm_kms_helper mfd_core drm shpchp button ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 fscrypto ecb aesni_intel crypto_simd cryptd glue_helper aes_x86_64 crc32c_generic sr_mod cdrom virtio_blk virtio_net crc32c_intel psmouse ahci i2c_i801 libahci libata sym53c8xx ehci_pci scsi_transport_spi uhci_hcd ehci_hcd scsi_mod virtio_pci virtio_ring virtio usbcore usb_common
[193527.448064]  [last unloaded: tempesta_lib]
[193527.449526] CPU: 1 PID: 16 Comm: ksoftirqd/1 Tainted: G           O    4.14.0-tempesta-kmemleak-amd64 #1 Debian 4.14.32-tfw7-1
[193527.452856] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS ?-20191223_100556-anatol 04/01/2014
[193527.455129] task: ffff91cb415d3c00 task.stack: ffffa01d006b0000
[193527.456308] RIP: 0010:ttls_mpi_exp_mod+0x6b3/0x9e0 [tempesta_tls]
[193527.457894] RSP: 0018:ffffa01d006b3408 EFLAGS: 00010206
[193527.459354] RAX: 0000000000000000 RBX: ffff91caf5ac1010 RCX: 0000000000000021
[193527.461390] RDX: 0000000000000020 RSI: ffff91caf5ac1018 RDI: ffff91caf5ac1028
[193527.464588] RBP: ffffa01d006b34c8 R08: ffff91caf5ac1040 R09: 0000000000000008
[193527.468112] R10: 0000000000000148 R11: ffff91cb0d740010 R12: ffff91caf5ac1020
[193527.471518] R13: ffffffffffffffff R14: ffff91caf5ac1040 R15: ffff91caf5ac1018
[193527.474295] FS:  0000000000000000(0000) GS:ffff91cb65500000(0000) knlGS:0000000000000000
[193527.476345] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[193527.477465] CR2: 00007fbd3f3c9018 CR3: 000000004ac0a006 CR4: 00000000003606e0
[193527.478678] Call Trace:
[193527.479456]  ? ttls_mpool+0x18/0x110 [tempesta_tls]
[193527.480394]  ? ttls_mpi_sub_int+0x4f/0xc0 [tempesta_tls]
[193527.481289]  ttls_dhm_make_params+0xa0/0x210 [tempesta_tls]
[193527.482261]  ttls_handshake_server_hello+0xd7c/0x1080 [tempesta_tls]
[193527.483772]  ? ttls_update_checksum+0x96/0x160 [tempesta_tls]
[193527.486626]  ttls_handshake_server_step+0x378/0x1a80 [tempesta_tls]
[193527.489689]  ? shash_async_import+0x40/0x40
[193527.490587]  ttls_recv+0x147/0x680 [tempesta_tls]
[193527.491529]  ? ttls_decrypt+0x5d0/0x5d0 [tempesta_tls]
[193527.493000]  ss_skb_process+0x11b/0x140 [tempesta_fw]
[193527.495499]  ? ip_finish_output2+0x19f/0x3b0
[193527.497245]  tfw_tls_msg_process+0xc0/0x4d0 [tempesta_fw]
[193527.499405]  ? ip_output+0x71/0xe0
[193527.501001]  ? ip_queue_xmit+0x5c/0x3b0
[193527.502922]  __gfsm_fsm_exec+0x56/0x90 [tempesta_fw]
[193527.504117]  tfw_connection_recv+0x41/0x60 [tempesta_fw]
[193527.506733]  ? tfw_connection_send+0x30/0x30 [tempesta_fw]
[193527.508318]  ss_tcp_process_data+0x1ea/0x480 [tempesta_fw]
[193527.509647]  ? dev_hard_start_xmit+0x31/0x1f0
[193527.511152]  ss_tcp_data_ready+0x43/0x90 [tempesta_fw]
[193527.512477]  tcp_data_queue+0x4f5/0xc50
[193527.513134]  tcp_rcv_established+0x27c/0x570
[193527.513847]  tcp_v4_do_rcv+0x129/0x1d0
[193527.514474]  tcp_v4_rcv+0x947/0xa50
[193527.515295]  ip_local_deliver_finish+0x9a/0x1c0
[193527.517538]  ip_local_deliver+0x6b/0xe0
[193527.519457]  ? tcp_v4_early_demux+0x112/0x150
[193527.521606]  ? ip_rcv_finish+0x17a/0x400
[193527.523205]  ip_rcv+0x289/0x3c0
[193527.524626]  ? inet_del_offload+0x40/0x40
[193527.525545]  __netif_receive_skb_core+0x84f/0xb30
[193527.526328]  ? lock_timer_base+0x74/0x90
[193527.527340]  ? process_backlog+0xa3/0x160
[193527.528969]  process_backlog+0xa3/0x160
[193527.530352]  net_rx_action+0x28e/0x3f0
[193527.531587]  __do_softirq+0x10f/0x2a8
[193527.532913]  run_ksoftirqd+0x1c/0x40
[193527.533918]  smpboot_thread_fn+0x10e/0x160
[193527.535624]  kthread+0xff/0x130
[193527.536592]  ? sort_range+0x20/0x20
[193527.537851]  ? kthread_create_on_node+0x70/0x70
[193527.539564]  ret_from_fork+0x35/0x40
[193527.540512] Code: 45 b6 66 44 89 4d b0 66 44 89 55 b2 e8 f7 d9 ff ff 48 8b 8d 70 ff ff ff 41 bb 01 00 00 00 66 44 89 5d b0 49 89 cf e9 da fa ff ff <0f> 0b b9 f4 ff ff ff e9 d4 fe ff ff 48 8b 85 78 ff ff ff 48 8b 
[193527.545381] ---[ end trace 00f20e676f79f25b ]---

Can't give more details this happend once ocasionally.

ktest/asm/export.h Outdated Show resolved Hide resolved
tls/mpool.c Outdated Show resolved Hide resolved
tls/mpool.c Outdated Show resolved Hide resolved
* %RCX - B->used (used directly for looping);
* %R8 - A->used.
*/
ENTRY(mpi_sub_x86_64)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get warning here:

tempesta/tls/.tmp_bignum_x86-64.o: warning: objtool: mpi_sub_x86_64()+0x6c: sibling call from callable instruction with modified stack frame

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see the warning each time I build the module and I have explored the code. The problem code is

.sub_small_b:
        clc
        ANNOTATE_RETPOLINE_SAFE
        jmpq    *%rbx

and it seems objtool doesn't like

        jz      .sub_small_b
        pushq   %r12

However, the register is pop'ed before the label. The code is fine. I didn't find a way to make objtool like the code, but there were fixes recently in the tool iteself, so probably the warning will go away after #1049

In later #1064 assembly work probably we'll figure out how to adjust the code to keep pushing the register to the stack lately, but avoid the warnings.

tls/mpool.c Outdated Show resolved Hide resolved
tls/mpool.c Outdated Show resolved Hide resolved
tls/mpool.c Outdated Show resolved Hide resolved
tls/mpool.c Outdated Show resolved Hide resolved
tls/ttls.c Outdated Show resolved Hide resolved
tls/bignum.c Outdated Show resolved Hide resolved
failures even with the RSP register check and I don't see a reason
to export reliable task_stack_page() and irq_stack_ptr symbols from
the kernel just for the assertion.

The reason for the false failures is that maximum stack size is large
enough to span pages above the stack if RSP is pointing to begin of
the allowed stack space.
@krizhanovsky
Copy link
Contributor Author

The call trace #1375 (comment) is quite weird. tls-perf produces only ECC handshakes, so ClientHello step should fail and it does fail - Warning: None of the common ciphersuites is usable is thrown exactly on the step. Meantime the call trace is from the later ServerHello step.

The script while :; do ./tls-perf -T 10 192.168.100.4 443; sleep 1; done even w/o the sleep 1 can not reveal the problem.

Having that the problem has appeared on a5c5143 commit, before the last memory leak and corruption fixes, I assume that the problem is in the corrupted memory.

TODO comments about further p256 optimization notes.

Remove unused ECDH context blinding values.
Remove outdated comment and static assert.
Copy link
Contributor

@vankoven vankoven left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me

@krizhanovsky krizhanovsky merged commit bcfb458 into master Apr 28, 2020
@krizhanovsky krizhanovsky deleted the ak-bn-mem-profiles branch April 28, 2020 11:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants