Redesign of TCP synchronous sending and data caching #391
krizhanovsky changed the title from "TCP synchronous sending" to "Redesign of TCP synchronous sending and data caching" on Jan 8, 2016.
krizhanovsky added commits that referenced this issue on Jan 9, 2016 (two commits), Jan 10, 2016, and Jan 15, 2016.
krizhanovsky added a commit that referenced this issue on Jul 3, 2018:
Some FSM DSL defines are moved to lib/fsm.h, http_limit.c ported to the new API. Address #391.12: ss_skb_alloc() extended with an argument for head room. Many cleanups again.
krizhanovsky added a commit that referenced this issue on Nov 29, 2018:
* Encrypt hash for server finished (missed functionality).
* Multiple fixes in handling scatter lists;
* Multiple fixes for IV handling in encryption and decryption code.
* Fix TLS record header and tag allocation in skb (linked with #391.11).
* Many cleanups and nicer debug and errors reporting.
Kernel:
* Fix TLS skb type handling to call sk_write_xmit() callback.
* Reserve room for TLS header in skb headroom.
* Reset TCP connection if we can not encrypt data on it instead of retransmit it in plaintext. This leads to warning similar to #984 - leave as TODO for now.
krizhanovsky added two more commits with the same messages that referenced this issue on Dec 25, 2018.
krizhanovsky modified the milestones: 0.8 TLS 1.3, 1.1 Network performance & scalability, 1.1 TBD (Network performance & scalability), 1.1 TDB (ML, QUIC, DoH etc.) on Feb 11, 2019.
krizhanovsky modified the milestones: 1.1 TBD (ML, QUIC, DoH etc.), 0.8 TLS 1.3 & Performance, 1.0 Stability - GA on Oct 13, 2019.
The current implementation of `ss_send()` and the cache logic are unsatisfactory. The following changes should be made; many of them are inspired by the Sandstorm Web server. The first points are referenced by #534.
Queuing
Initially the issue complained that too many queues are involved in proxying an HTTP request to a server socket (linked with #687):
- `TfwCliConn->seq_queue` is required to order server responses for pipelined requests;
- `TfwSrvConn->fwd_queue` and `TfwSrvConn->nip_queue` implement the `server_queue_size` and `server_forward_timeout` limits and are responsible for server failover in per-connection and per-server logic.
Per-server `TfwSrvConn` queues live longer than per-socket TCP send queues, and TCP send queues are required to provision ready-to-send data, so neither kind of queue can be eliminated. Moreover, TCP windows vary dynamically, so we have no idea how much data a socket can send at the moment we schedule data for transmission (the data must be queued for sending in the TCP control block). This also means that we have to live with the fact that TCP can adjust an skb in the transmission path, since it works with dynamically changing TCP connection parameters.
TODO
`tfw_cache_build_resp()` can produce a very large response. Many of the skbs are simply dropped by the Qdisc, and the others are sent by the TSQ tasklet. So there is no sense in doing large TDB scans; instead, the scans should be done in small portions when there is room in the Qdisc. Actually, the currently available room in `proxy_buffering` from HTTP message buffering and streaming #498 can be used for now as the amount of memory fetched from the cache. This also reduces bufferbloat. The response transfer rate is also a subject for HTTP QoS (HTTP QoS for asymmetric DDoS mitigation #488): if a client has a high QoS mark, it receives the data faster, i.e. we should send `min(client advertised TCP window size, QoS results)`. The response should be assembled with knowledge of the current TCB state of the client connection to avoid additional splitting at the TCP/IP layer, typically as a callback from `tcp_write_xmit()` or `tcp_transmit_skb()`, where we know the exact "right" size of the skb (i.e. we can trick the TCP/IP stack by queueing an empty skb with `skb->len` set to the maximum value, but fill the skb from the cache later via the callback, when we know how much data we're going to send), and/or on the `sk->sk_write_space()` hook. In general we should implement a pull strategy: when data is acknowledged, we pull the next portion of data from the cache. This step reduces skb transformations for cached responses. Proxied messages (requests and responses) cannot be optimized this way: it makes no difference when an skb is split, but we need to queue it anyway, and `tcp_write_xmit()` knows how much data can be sent at a particular point in time.
`MSG_MORE`/`skb->xmit_more` provide optimizations down to the NIC driver layer, so the flags must be implemented and used for sending data from the cache and through TLS. It also seems that the Nagle algorithm makes no sense in Tempesta, since we do our best to form skbs as large as possible, so the algorithm should be switched off.
RX softirq can do millions of TDB block transitions to process one request for a large stored file, which leads to significant packet drops for other clients. Moreover, concurrent requests for large files may amount to an efficient DoS. Large files should be transferred in smaller blocks in a work queue, as currently.
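The pull strategy and the small-portion transfer described above can be sketched roughly as follows. This is a user-space model, not Tempesta code: `struct pull_state`, `pull_next_chunk()` and `pull_on_ack()` are hypothetical names, and the QoS budget is assumed to be a plain byte count supplied by the caller.

```c
#include <stddef.h>

/* Hypothetical per-connection state; these names do not exist in Tempesta. */
struct pull_state {
	size_t resp_len;	/* total size of the cached response */
	size_t sent;		/* bytes already handed to the TCP layer */
	size_t acked;		/* bytes acknowledged by the client */
};

static size_t min_sz(size_t a, size_t b)
{
	return a < b ? a : b;
}

/*
 * Size of the next portion to fetch from the cache:
 * min(window room left after in-flight data, QoS budget, remaining data).
 */
size_t pull_next_chunk(const struct pull_state *st, size_t adv_wnd,
		       size_t qos_budget)
{
	size_t in_flight = st->sent - st->acked;
	size_t wnd_room = adv_wnd > in_flight ? adv_wnd - in_flight : 0;
	size_t remaining = st->resp_len - st->sent;

	return min_sz(min_sz(wnd_room, qos_budget), remaining);
}

/* On ACK: record progress, then pull the next portion from the cache. */
size_t pull_on_ack(struct pull_state *st, size_t newly_acked, size_t adv_wnd,
		   size_t qos_budget)
{
	size_t chunk;

	st->acked += newly_acked;
	if (st->acked > st->sent)
		st->acked = st->sent;	/* defensive clamp */

	chunk = pull_next_chunk(st, adv_wnd, qos_budget);
	st->sent += chunk;	/* real code would read `chunk` bytes from the cache here */
	return chunk;
}
```

With this shape the cache is never scanned further than the client can currently accept, so a large response is read in portions paced by acknowledgements.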
Ingress response pieces must be stored in the cache as soon as they arrive at the interface, while the data is still hot in CPU caches (RFC 7234, section 3.1, allows this behavior).
SKBs freed in `net_tx_action()` should just be reused for the next message to send (so NET_TX_SOFTIRQ should be interchanged with NET_RX_SOFTIRQ). Per-CPU skb caches should be introduced. Memory corruption when adding/changing Connection: header in an HTTP response #353 also requires a different skb linear data allocation, in pages, or simply reallocating and copying linear data on demand. Also see Do not generate static responses #163 (comment).
Copies in `ss_send()` should be eliminated. We copy skbs in `ss_send()` to be able to resend them if a server connection fails and to compare the URI when a response is received. Since we use paged data (and should use headerless skbs - check this!) and get() the pages, all TfwStr pointers from HttpReq to the skb pages remain the same after TCP/IP operations on the skbs. `tcp_ack()` doesn't seem to mangle skbs, but just updates their TCP control block or removes them from the send queue, so it seems we don't need real skb copies for ACK processing. Thus, once the previous point is implemented, so that we don't just free skbs, we can (1) just pass an skb to the SS/TCP/IP layer and (2) hook the moment when the split skbs are freed and reacquire them for possible further retransmission.
Also, after Enforce the correct order of responses. Handle non-idempotent requests. #660, Tempesta operates with HTTP queues, so functions like `__tfw_http_resp_fwd()`, which send many HTTP messages in one shot, should be rewritten in a scatter manner, i.e. calling SS only once for a whole message queue.
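The idea of reusing freed buffers through per-CPU caches, instead of returning them to the allocator, might look like the sketch below. It is a user-space model with `malloc()` standing in for the skb allocator; `buf_get()`, `buf_put()`, `NR_CPUS_SKETCH` and `CACHE_DEPTH` are all hypothetical names and values, not kernel or Tempesta API.

```c
#include <stdlib.h>

#define NR_CPUS_SKETCH	4	/* hypothetical fixed CPU count */
#define CACHE_DEPTH	8	/* max cached buffers per CPU */
#define BUF_SIZE	2048

struct buf {
	struct buf *next;
	unsigned char data[BUF_SIZE];
};

struct buf_cache {
	struct buf *head;
	unsigned int count;
};

/* One free list per CPU; each CPU touches only its own list. */
static struct buf_cache caches[NR_CPUS_SKETCH];

/* Take a buffer from this CPU's cache, falling back to the allocator. */
struct buf *buf_get(unsigned int cpu)
{
	struct buf_cache *c = &caches[cpu % NR_CPUS_SKETCH];
	struct buf *b = c->head;

	if (b) {
		c->head = b->next;
		c->count--;
		b->next = NULL;
		return b;
	}
	b = malloc(sizeof(*b));
	if (b)
		b->next = NULL;
	return b;
}

/* Return a transmitted buffer to the cache; free it only if the cache is full. */
void buf_put(unsigned int cpu, struct buf *b)
{
	struct buf_cache *c = &caches[cpu % NR_CPUS_SKETCH];

	if (!b)
		return;
	if (c->count >= CACHE_DEPTH) {
		free(b);
		return;
	}
	b->next = c->head;
	c->head = b;
	c->count++;
}
```

Because a CPU only ever uses its own list, the model needs no locking; a kernel version would additionally have to keep the owning context from being preempted or migrated while it manipulates the list.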
The instrumentation patch in Server failovering may cause crashes under load or during getting of perfstat #692 (comment) fully replaces the SLAB allocator with our own page allocator, which probably makes sense. At the least, better memory utilization will be achieved. Also, functions like `skb_split_inside_header()` can benefit from using skbs with `head_frag` to just manipulate page fragments instead of copying kmalloc()'ed areas.
The current `ss_skb_unroll_slow()` should be optimized somehow, preferably to avoid copies altogether, but at the least the consumed ingress skb must be reused instead of allocating a new skb.
[INVALID: doesn't make sense with Optimizer for HTTP messages adjustment #1103 and the updated Properly store and build HTTP headers #634] `tfw_cache_build_resp()` should keep the assembled `TfwHttpResp`, probably referenced by `TfwCacheEntry`, and return a copy of it: copying a list of skbs and the assembled `TfwHttpResp` is much faster than scanning TDB and assembling all the HTTP structures for subsequent adjustments.
[DONE in #391 fixes] RX softirq is also responsible for HTTP message retransmissions (requests to an upstream server or responses to a client), so we must lock 2 sockets. Given that requests and responses can arrive at the same time, 2 CPUs can try to lock the same sockets in different orders, leading to deadlock. Kernel threads using the SS interface also suffer from the issue (a socket can be locked by a softirq and a kernel thread, see Socket deadlock on ss_send() from thread context #337). This is also bad for performance, since different CPUs compete for the same sockets. Thus the transmission action must be scheduled to the proper CPU and performed from TX softirq.
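Scheduling a transmission to the CPU that owns the socket, rather than locking two sockets from whichever CPU received the message, could be modeled as below. This is a single-threaded sketch under loud assumptions: the `sock_id % NCPU` ownership rule, `tx_schedule()` and `tx_run()` are invented for illustration, and a real implementation would need a concurrency-safe queue.

```c
#define NCPU	4	/* hypothetical CPU count */
#define QLEN	16	/* per-CPU queue capacity, power of two */

struct tx_job {
	int sock_id;
	int msg_id;
};

struct tx_queue {
	struct tx_job jobs[QLEN];
	unsigned int head;	/* next slot to consume */
	unsigned int tail;	/* next slot to fill */
};

static struct tx_queue txq[NCPU];

/* Assumed ownership rule: a socket is always served by the same CPU. */
static unsigned int owner_cpu(int sock_id)
{
	return (unsigned int)sock_id % NCPU;
}

/* Queue a transmission on the owner CPU; returns that CPU or -1 if full. */
int tx_schedule(int sock_id, int msg_id)
{
	unsigned int cpu = owner_cpu(sock_id);
	struct tx_queue *q = &txq[cpu];

	if (q->tail - q->head == QLEN)
		return -1;
	q->jobs[q->tail % QLEN].sock_id = sock_id;
	q->jobs[q->tail % QLEN].msg_id = msg_id;
	q->tail++;
	return (int)cpu;
}

/* Run on `cpu` (e.g. from its TX path): pop one job; returns 1 if one was found. */
int tx_run(unsigned int cpu, struct tx_job *out)
{
	struct tx_queue *q = &txq[cpu % NCPU];

	if (q->head == q->tail)
		return 0;
	*out = q->jobs[q->head % QLEN];
	q->head++;
	return 1;
}
```

Since only the owner CPU ever touches a given socket in this model, the two-socket lock ordering problem does not arise; the producer only touches the destination queue.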
[DONE] Header writing must be optimized. `__alloc_skb()` provisions a substantial amount of memory, so the space must be used efficiently: I've seen cases where HTTP headers come in a separate page, while the skb's page was shared with other skbs. Firstly, a new header should be added by just moving the CRLF instead of inserting a new fragment. Next, `ss_skb_alloc_pages()` shouldn't allocate a headerless skb, but should instead fully use the first page with the allocated skb. The second point is done in Tempesta TLS performance optimizations #1037. A stronger optimization is proposed in Optimizer for HTTP messages adjustment #1103.
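Adding a header by moving the terminating CRLF, rather than inserting a new fragment, can be illustrated on a flat buffer with spare tail room. This is only a sketch on a plain `char` array, assuming the header block ends with an empty line; `hdr_append()` is a hypothetical helper, and real skb data would be spread over fragments.

```c
#include <string.h>

/*
 * Append "name: value" to a header block that ends with an empty line
 * ("\r\n\r\n"). `len` is the current length and `cap` the buffer capacity.
 * The new header overwrites the terminating CRLF, which is then written
 * again after it. Returns the new length, or 0 if the block is malformed
 * or there is not enough tail room.
 */
size_t hdr_append(char *buf, size_t len, size_t cap,
		  const char *name, const char *value)
{
	size_t nlen = strlen(name), vlen = strlen(value);
	size_t add = nlen + 2 + vlen + 2;	/* "name: value\r\n" */
	char *p;

	if (len < 4 || memcmp(buf + len - 4, "\r\n\r\n", 4) != 0)
		return 0;
	if (len > cap || add > cap - len)
		return 0;

	p = buf + len - 2;		/* start of the terminating CRLF */
	memcpy(p, name, nlen);
	p += nlen;
	memcpy(p, ": ", 2);
	p += 2;
	memcpy(p, value, vlen);
	p += vlen;
	memcpy(p, "\r\n\r\n", 4);	/* header's CRLF plus the moved terminator */
	return len + add;
}
```

No existing header bytes are moved; only the two-byte terminator is rewritten, which is why spare room at the end of the first page matters.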
All work is done in softirq, i.e. we spend more time processing each packet under heavy load, but we don't do anything while the system is idle. Small traffic bursts are mitigated by the dev queue, and Tempesta frees the queue more slowly than vanilla Linux. However, large traffic bursts lead to TCP buffer overflows in a normal system, and Tempesta behaves better there, since it processes TCP segments faster than a usual user-space process. Meanwhile a less loaded CPU just idles, although it could do some useful work, like evicting old entries from various caches. Also, more work in softirq means more data structures accessed and a larger memory footprint, so CPU cache starvation is more probable. Thus, asynchronous cleanups (e.g. eviction of old cache entries while the system has enough resources) and other non-crucial logic (e.g. traffic classification) should be done in separate kernel threads running on designated CPU cores to mitigate cache pressure. The asynchronous logic can be made synchronous under particular conditions, e.g. the garbage collection thread could be awakened if there is not enough memory, i.e. at system stress (VMM exhaustion, NIC packet drops, a detected attack where we should not pass the packets, etc.). Another example of synchronous garbage collection is evicting old/invalid entries as they are accessed during data structure scanning or updating (however, scanning and eviction require different lock types).
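The last variant, evicting old entries as they are encountered during a scan, can be sketched on a single hash bucket as below. This is a single-threaded illustration with hypothetical names (`struct entry`, `bucket_lookup_evict()`, `bucket_insert()`); it deliberately ignores the locking difference between scanning and eviction noted above.

```c
#include <stdlib.h>

struct entry {
	struct entry *next;
	int key;
	long expires;	/* entry is stale once expires <= now */
};

/* Push a new entry at the head of the bucket; returns 0 on success. */
int bucket_insert(struct entry **head, int key, long expires)
{
	struct entry *e = malloc(sizeof(*e));

	if (!e)
		return -1;
	e->key = key;
	e->expires = expires;
	e->next = *head;
	*head = e;
	return 0;
}

/*
 * Look up `key` and, as a side effect, unlink and free every expired
 * entry met on the way. An expired entry is never returned.
 */
struct entry *bucket_lookup_evict(struct entry **head, int key, long now)
{
	struct entry **pp = head, *e;

	while ((e = *pp) != NULL) {
		if (e->expires <= now) {
			*pp = e->next;	/* unlink the stale entry */
			free(e);
			continue;
		}
		if (e->key == key)
			return e;
		pp = &e->next;
	}
	return NULL;
}
```

The eviction cost is paid only on paths that are already walking the bucket, so no separate pass is needed while the system is under load.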
If a server connection is dropped and we have something to send to the server, we can and should send the data together with the final handshake ACK (somewhat related to TCP Fast Open #144).
There are several calls of `ss_skb_split()`, which allocates and copies memory. It seems we can just use offsets and lengths of the data chunks, plus take a reference to the skb, to avoid the `ss_skb_split()` calls.
[DONE: there are no Linux work queues anymore, we use our own `TfwRBQueue`] There is a fixed number of CPUs, equal to the number of softirqs, so we can fully utilize the CPUs with softirqs. Any queueing, like the current work queues, is a potential source of bufferbloat.
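Replacing a copying split with offsets, lengths and a reference on the shared buffer might look like the following sketch. It models the skb as a reference-counted byte buffer; `struct rbuf`, `struct chunk`, `rbuf_new()`, `rbuf_put()` and `chunk_split()` are hypothetical and are not the `ss_skb_split()` interface.

```c
#include <stdlib.h>
#include <string.h>

/* Reference-counted data buffer standing in for an skb. */
struct rbuf {
	unsigned int refs;
	size_t len;
	unsigned char data[];
};

/* A view into a buffer: no data of its own, just offset and length. */
struct chunk {
	struct rbuf *buf;
	size_t off;
	size_t len;
};

struct rbuf *rbuf_new(const void *src, size_t len)
{
	struct rbuf *b = malloc(sizeof(*b) + len);

	if (!b)
		return NULL;
	b->refs = 1;
	b->len = len;
	memcpy(b->data, src, len);
	return b;
}

/* Drop one reference; the buffer is freed with the last one. */
void rbuf_put(struct rbuf *b)
{
	if (b && --b->refs == 0)
		free(b);
}

/*
 * "Split" a chunk at `pos` without copying: `c` keeps [0, pos), `tail`
 * gets [pos, len) and takes its own reference on the shared buffer.
 */
int chunk_split(struct chunk *c, size_t pos, struct chunk *tail)
{
	if (pos > c->len)
		return -1;
	tail->buf = c->buf;
	tail->off = c->off + pos;
	tail->len = c->len - pos;
	c->len = pos;
	c->buf->refs++;
	return 0;
}
```

Each view must eventually call `rbuf_put()` on its buffer; the data stays valid until the last view is released, which is what makes the split free of allocation and copying.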