HTTP message buffering and streaming #498
Comments
I have a bunch of questions on the task:
Basically, yes, we can do some caching unconditionally. To protect against memory exhaustion attacks we provide Frang limits. Let's discuss the details in chat.
Not at all. The process is orthogonal to caching. For the cache we should behave the same way as now for skb chunks of data.
Yes, good point. This is the subject for #710 . I'll add the note to the issue. UPD. The important note from the chat discussion is that |
Visualization of the difference in packet processing between Tempesta and Nginx. [Figure panels: Tempesta; Nginx, default settings (buffering is on); Nginx, buffering is off]
Nginx log when the backend closes the connection in the middle of a transfer:
Client (curl) reaction:
It's better to refer to Nginx implementation, which is very mature in production. Simply put, the only difference is that we have to implement buffering at the sk/skb level. References: |
Let me give more clarification about the nginx proxy module.
Let me explain some critical steps in the diagram above. If
If |
A relevant discussion https://github.com/tempesta-tech/linux-5.10.35-tfw/pull/17/files#r1637865281 and the subject of the discussion #2108 , which I think should be fixed in this issue: we need to push for transmission not more (or not significantly more) data to a client connection and h2 stream than the connection or stream allows. Since we proxy data, we should backpressure upstream connections and limit the most aggressive clients with #488.
sysctl_tcp_auto_rbuf_rtt_thresh_us automatically adjusts the TCP rcv buffer per socket depending on the current conditions. Probably mostly affects #488.
Probably good to be done with #1902
General requirements
Tempesta must support two modes of operation: HTTP message buffering, as now, and streaming. In the current mode of operation all HTTP messages are buffered, i.e. we deliver a proxied request or response only when we have fully received it. In streaming mode, by contrast, each received skb must be forwarded to the client or server immediately.
Full buffering and streaming are the two edge modes, and an intermediate mode must be supported as well - partial message buffering based on the TCP receive buffer.
HTTP headers must always be buffered - we need the header to decide what we should do with the message and how to forward/cache/whatever to do with it.
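A minimal sketch of how the three modes could be told apart when new data arrives. The names `MessageState`, `should_forward_now` and the `limit` knob are hypothetical and not part of Tempesta; the sketch assumes that `None` stands for full buffering, 0 for streaming, and any positive value for partial buffering.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MessageState:
    headers_complete: bool = False   # the full header block has been parsed
    message_complete: bool = False   # the last byte of the message was received
    buffered_bytes: int = 0          # body bytes held and not yet forwarded


def should_forward_now(msg: MessageState, limit: Optional[int]) -> bool:
    """Decide whether the data held for a message may be forwarded.

    limit is a hypothetical knob: None = full buffering, 0 = streaming,
    N > 0 = hold at most N body bytes before forwarding starts.
    """
    if not msg.headers_complete:
        # Headers are always buffered: they are needed to decide what to do.
        return False
    if msg.message_complete:
        return True
    if limit is None:
        return False                 # full buffering: wait for the whole message
    return msg.buffered_bytes >= limit   # streaming (0) or partial buffering
```

Note that even in streaming mode nothing is forwarded until the headers are complete, which matches the requirement above.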
Configuration
The behavior must be controlled by new configuration options. Since Linux doesn't make per-socket sysctls available to a user, we have to introduce our own memory limits for server and client sockets separately.
client_mem <soft_limit> <hard_limit> - controls how much memory is used to store unanswered client requests and requests with linked responses which cannot be forwarded to a client. If soft_limit is zero, then streaming mode is used, i.e. each received skb is immediately forwarded. Otherwise the message is buffered, but for no more than soft_limit bytes. hard_limit is 2 * soft_limit by default, see the description in the Security operation section.
client_msg_buffering N - controls message buffering. Buffer only the first N bytes of requests when forwarding requests to backend servers. Longer messages are forwarded part by part and never fully assembled in Tempesta. If request headers are longer than N bytes, then they are still buffered, since the full set of headers is required to correctly serve and forward the request. The limit is applied per message and must not exceed the per-connection client_conn_buffering limit.
server_msg_buffering N - same as client_msg_buffering but for the server connections.
Previous attempts to implement the issue used client_rmem, very similar to the tcp_rmem sysctl; however, the current client_mem is very different because it also accounts for linked server responses.
TCP interactions
All ingress skbs are immediately evicted and ACKed from the TCP receive queue, so actually we don't use the TCP receive buffer for now. With the new enhancement we must account all HTTP data kept in Tempesta memory as residing in the TCP receive buffer of the socket on which we received the data, so TCP will advertise smaller receive windows. See ss_rcv_space_adjust() and tfw_cli_rmem_{reserve,release}() in #1183.
client_mem <soft_limit> <hard_limit> determines the TCP receive window, but it also accounts for responses. A simplified example for client_mem 10 20:
In proxy mode we have to slow down fetching data from the server TCP receive queue if we read a response for a slow client which can't read it at the same speed. Otherwise we can exhaust our RAM with many clients that are just somewhat slower than the servers (they do not necessarily have to be really slow!). This might lead to a HoL problem when a response pipelined by the server after the problem response stays in the queue for a very long time, and a good and quick client experiences significant delays. To cope with the issue, server_queue_size and a larger conns_n should be used for the server group (please add this to the Wiki!). Dynamically allocated server connections from #710 and server HTTP/2 #1125 are more robust solutions to the problem.
The following performance counters must be implemented for traceability of the feature (e.g. to debug the problem above):
HTTP streams
HTTP/2 (#309) and HTTP/3 (QUIC, #724) introduce flow control which can efficiently throttle clients, so it seems the TCP window adjustments make sense only for HTTP/1.1, and the issue highly depends on QUIC and HTTP/2. RFC 7540 5.2.2 starts right with the problem of this task - memory constraints and too-fast clients, which must be limited by means of WINDOW_UPDATE.
The security aspect of the issue is that clients can request quite large resources and announce very small windows (see RFC 7540 10.5), leading to memory exhaustion on our side (they can do the same with TCP & HTTP/1.1 for now).
At least the following things must be done in the issue:
Some streaming in the context of #488 (HTTP QoS for asymmetric DDoS mitigation): we should not keep in memory more data from the server response than a client announced in its window, i.e. we should announce a smaller TCP window for the server connection. This point is good to do in a generic way: we should handle the window from the TCP layer, HTTP/2 and, in the future, HTTP/3. Actually, this is a tradeoff between buffering and streaming modes, which must be decided by an administrator. #488 can determine malicious/slow clients and mitigate their impact though.
Honour the client-announced window and do not send more data than it specifies.
Announce the real HTTP/2 window according to the configured buffer size.
X-Accel-Buffering header processing must be implemented to let a client manage the buffering (e.g. Dropbox does this).
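The window-related items above can be illustrated with a small sketch. The function names and the `buffer_limit` parameter are hypothetical; the only protocol fact relied on is that HTTP/2 flow control applies at both the stream and the connection level (RFC 7540 5.2), so the smaller of the two windows is the effective one.

```python
def sendable_bytes(pending: int, stream_window: int, conn_window: int) -> int:
    """How many response bytes may be pushed to the client right now.

    The effective limit is the smaller of the stream and connection windows.
    A negative window (possible after a SETTINGS change) is treated as zero.
    """
    allowed = max(0, min(stream_window, conn_window))
    return min(pending, allowed)


def upstream_read_budget(buffer_limit: int, buffered: int) -> int:
    """How many more bytes we are willing to read from the server.

    Once data not yet consumed by the client fills buffer_limit, the budget
    drops to zero, which is the point where the upstream connection would be
    backpressured (e.g. by announcing a smaller receive window).
    """
    return max(0, buffer_limit - buffered)
```

For example, with 1000 bytes pending, a stream window of 300 and a connection window of 500, only 300 bytes would be pushed; the remaining 700 count against the buffer and reduce the upstream read budget.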
If we receive RST_STREAM frame in streaming mode, then we should reset our stream with the upstream as well and it store only the head of the transferred response in the cache.
Security operation
#995 gives an example of how a client can exhaust memory with a very first blocking request followed by many pipelined requests with large responses. So client_mem must account for all the memory spent on a client. If a client reaches the soft limit, a zero receive window is sent. However, server responses for already processed requests may continue to arrive, and if the hard_limit is reached, then the client connection must be dropped (we have no chance to send a normal error response in this case).
A malicious client may send data byte by byte in streaming mode to overload a backend. This scenario must be addressed by the implementation, e.g. by a configurable minimum buffer size: such small stream chunks are passed through only if an administrator explicitly allows 1-byte buffering or so. The other option is DDoS QoS reduction in the sense of automatic classification in #488.
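The soft and hard limit behaviour described above can be sketched as a small accounting object. `ClientMemAccount` and `Action` are hypothetical names, not Tempesta code, and the sketch assumes buffering mode, i.e. soft_limit greater than zero (a zero soft_limit selects streaming mode and is not modelled here).

```python
from enum import Enum


class Action(Enum):
    ACCEPT = "accept"            # keep receiving, advertise a non-zero window
    ZERO_WINDOW = "zero_window"  # soft limit reached: stop the client
    DROP = "drop"                # hard limit reached: drop the connection


class ClientMemAccount:
    """Tracks all memory held for one client: requests and linked responses."""

    def __init__(self, soft_limit, hard_limit=None):
        self.soft_limit = soft_limit
        # The text gives 2 * soft_limit as the default hard limit.
        self.hard_limit = 2 * soft_limit if hard_limit is None else hard_limit
        self.used = 0

    def reserve(self, nbytes):
        """Account nbytes of request or response data; return the action."""
        self.used += nbytes
        return self.state()

    def release(self, nbytes):
        """Data was forwarded or freed; return the resulting action."""
        self.used = max(0, self.used - nbytes)
        return self.state()

    def state(self):
        if self.used >= self.hard_limit:
            return Action.DROP
        if self.used >= self.soft_limit:
            return Action.ZERO_WINDOW
        return Action.ACCEPT

    def receive_window(self):
        """Window to advertise to the client; zero at or above the soft limit."""
        return max(0, self.soft_limit - self.used)
```

With a soft limit of 10 and a hard limit of 20, the client is stopped by a zero window once 10 units are accounted, while server responses that keep arriving can still push the total to 20, at which point the connection is dropped.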
Several implementation notes
These notes must be mirrored in the Wiki.
A streamed message occupies a server connection and we cannot schedule other requests to that connection, so using small buffers is undesirable.
Streamed requests cannot be resent on server connection failures.
Tricky cases
From #1183 :
In general, our design must be as simple as possible. Say both the requests go to different connections. Both of them can be streamed, or the first request may just require heavy operations on the server side while the 2nd request can be streamed immediately. As we receive server_msg_buffering bytes of response data, we link the data with TfwHttpResp (just as previously - received skbs are linked to the structure), the response is marked as incomplete and stays in the client seq_queue. Yes, the server connection is put on hold. If the server processing the first request is stuck, then the failover process takes place, both the requests will be freed and both the server connections must be reestablished. We also have #710 addressing the problem of held connections.
We need to forward the responses immediately to a client, just mimicking the server. Probably we should also close the connection. While the client request is not finished, TfwHttpReq should sit in the server forward queue and we should forward new skbs of the request to the server connection and response skbs to the client connection. Existing TfwHttpReq and TfwHttpResp descriptors should be used for the buffered skb forwarding.
Connection dropping or skb dropping can be good there. It's good to see how HAProxy, Nginx or Tengine behave in the same situations.
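A rough sketch of the ordering constraint discussed above, assuming that only the response at the head of the client seq_queue may be streamed while responses behind it are held. `Resp` and `forward_ready` are hypothetical stand-ins, not the real TfwHttpResp handling.

```python
from collections import deque


class Resp:
    """Stand-in for a response descriptor linked to received data chunks."""

    def __init__(self, name):
        self.name = name
        self.chunks = deque()   # received but not yet forwarded data
        self.complete = False   # set when the whole response has arrived


def forward_ready(seq_queue):
    """Return the (name, chunk) pairs that can be sent without reordering.

    seq_queue holds responses in the order of the client's requests. The head
    response may be sent chunk by chunk while still incomplete; responses
    behind it wait, even if they are already fully received.
    """
    sent = []
    while seq_queue:
        head = seq_queue[0]
        while head.chunks:
            sent.append((head.name, head.chunks.popleft()))
        if not head.complete:
            break               # head is still streaming: hold the rest
        seq_queue.popleft()
    return sent
```

This makes the head-of-line effect visible: a fully received second response is forwarded only after the first, streamed response completes.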
Related issues
This implementation must use the TCP receive buffer size to control how much data can be buffered (i.e. be in flight between the receive and send sockets). Meanwhile, #488 adjusts the TCP receive buffer size to change QoS. So this issue is the foundation for #488.
Appropriate testing issue: tempesta-tech/tempesta-test#87
See branch https://github.com/tempesta-tech/tempesta/tree/ik-streaming-failed and discussions in #1183