Mekhanik evgenii/1196 itog 2 #1973

EvgeniiMekhanik · 2023-09-07T09:18:54Z

No description provided.

krizhanovsky

I reviewed the scheduling algorithm only and it seems it has implementation bugs as well as design weakness - I think we can do better.

I'll review the modern research (e.g. https://www.usenix.org/system/files/nsdi22-paper-gao_peixuan.pdf considers several old approaches and introduces a new one), but so far at least the idea to link stream objects into the list and traverse them looks silly: spatial locality is just poor having that the stream objects are relatively large and not so many of them can fit in the same page (even using the slab allocator).

The first idea is that instead of dealing with real linked list nodes (especially double linked!) with 8 bytes for each pointer, we can move the scheduler metadata out from TfwStream and introduce a packed data structure, which can be placed on the same page with the whole scheduler data and all the TfwStreamSchedEntry nodes. The new data structure can be linked with only 1-2 byte offsets within the memory area. I.e. in the most trivial case we can leave the algorithm, but use a packed data structure with better spacial locality.

Maybe, having good spacial locality (need to estimate how many entities can fit single cache line) we can traverse the anchors list to find the right place for the current deficit.

It makes sense to skew the existing unfairness to that streams already being scheduled (some frames have been sent on them) should get slightly more priority than streams having higher weight to get lower average load time.

fw/http_stream_sched.h

fw/http_stream_sched.c

krizhanovsky · 2023-09-13T21:05:51Z

fw/http_stream_sched.c

+			tfw_h2_sched_stream_enqueue(stream, *parent);
+			entry = &stream->sched;
+		} else {
+			break;


Why do we break here? It seems stream is blocked and have not children, so why don't we try the next stream from the linked list? This should be either fixed or commented.

We can use BUG() here, add appropriate commit for new algorithm

fw/http_stream_sched.c

fw/http_stream.h

fw/http_stream_sched.c

krizhanovsky

Please address all the comments and discussions (verbal meetings as well) about the algorithm in the comments for the file.

A couple of more comments.

It'd be good to benchmark the current implementation (as H2O reference) with an optimized one.

fw/http_stream_sched.c

krizhanovsky · 2023-09-14T11:43:41Z

fw/http_stream_sched.c

+		*dep = &sched->root;
+}
+
+void


https://calendar.perfplanet.com/2018/http2-prioritization/ :

Chrome is the only browser to use the exclusive bit and it uses it on EVERY resource.

~~I hope it's an outdated story, but it'd be good to test Tempesta FW with the algorithm against Chrome.~~ https://tempesta-tech.com/knowledge-base/HTTP2-streams-prioritization/ clearly shows that this isn't an outdated story.

krizhanovsky · 2023-09-14T11:55:05Z

fw/http_stream_sched.c

+		 */
+		tfw_h2_stream_sched_remove(stream);
+		tfw_h2_change_stream_weight(stream, new_weight);
+		tfw_h2_add_stream_dep(stream, new_parent, excl);


This is a common pattern for the scheduler: pull a stream from the priority queue, recalculate it's deficit and reinsert the stream back according to the new deficit. If streams A and B have the same weight, then we just intermix them, i.e. both of the resources will be downloaded with the highest delay.

It seems only progressive JPEGs benefit from the behavior while all other resources, including WEBP or PNGs images, aren't. This should be a browser's job to assign different weight to different resources and the same weights to images only, but I'm not sure that browsers are able to do this. I found only https://calendar.perfplanet.com/2018/http2-prioritization/, but it's old and it seems for example than Firefox assigns the same weights to CSS.

Please test how Firefox and Chrome assign weights with the current implementation (also good to check resources like Instagram with many images or Gmail), document it and probably we'll need to make the algorithm aware about progressive JPEGs and intermix only their streams (it seems this is what Cloudflare does or did).

This code is called during stream reprioritization, because of stream priority frame

EvgeniiMekhanik · 2023-09-14T14:30:51Z

This PR also fixes #1884

const-t · 2023-09-13T12:57:52Z

fw/sock_clnt.c

+		cwnd_avail = (tp->snd_cwnd > tp->packets_out ?
+			tp->snd_cwnd - tp->packets_out : 0);
+		cwnd_avail_in_bytes = cwnd_avail * mss_now;


I have a doubts about calculating cwnd. As I understand this place must use tcp_cwnd_test() for calculation of cwnd, our version is very rough. Further we make a frames relying on cwnd and this is also a question for me, because cwnd it's not an exact size of data that can be pushed into write_queue, exact value is tcp window. For more accurate calculations we must use tcp_mss_split_point() that relies on cwnd. Maybe such rough calculations chosen intentionally, if so please describe it.

Rework this place, but you also not right here. tcp_cwnd_test returns min(halfcwnd, cwnd - in_flight) on each iteration of the loop for calculation count of bytes that can be sent per ONE loop iteration. Since we call this function on each loop iteration it returns 0 only when cwnd - in_flight became equal to zero! So we can use cwnd - in_flight for our purpose, but also we should take into account len of socket write queue, because we can push a lot of skbs on previous iteration if headers are big (we ignore cwnd when we make headers) or if there are some special frames were sent. We also cant use tcp_mss_split_point because it deal with skb. We can use it only if we make frame for each skb (use skb->len as a limit of frame length), but it i think, that it is bad, since we should use tls encryption for each skb in this case. So i think that rough calculation is better.

fw/http_frame.c

fw/http_stream.h

const-t · 2023-09-13T15:22:24Z

fw/http_frame.c

+#endif
+
+static int
+tfw_h2_make_frames_for_stream(TfwH2Ctx *ctx, TfwStream *stream,


I would prefer to rename this function. Now we also encode headers here.

const-t · 2023-09-13T15:35:02Z

fw/http_stream.h

- * @processed		- count of bytes, processed during prepare xmit
- * 			  callback;
- * @nskbs		- count of skbs processed during prepare xmit callback;
+ * @mark		- mark of the resp skb_head;


Looks like we can loose mark in tcp_write_xmit() -> tso_fragment(). Do we have tests for mark forwarded to backend?

Create new issue #1987 because this problem is not resolved in this patch

fw/http_stream.h

const-t · 2023-09-18T13:11:43Z

fw/http_frame.c

@@ -2548,6 +2561,8 @@ do {									\
 			stream->xmit.is_blocked = !stream->rem_wnd;
 			T_FSM_EXIT();
 		}
+
+		ADJUST_TMP_AVAILABLE_CWND(*cwnd_awail, tmp_cwnd_awail);


I can suggest do this on transition to next state from HTTP2_MAKE_HEADERS_FRAMES and HTTP2_MAKE_CONTINUATION_FRAMES, because check it every DATA frame which will be much more than HEADERS, not looks right.

fw/sock_clnt.c

fw/ss_skb.h

fw/sock.c

const-t · 2023-09-19T09:06:48Z

fw/http_frame.c

+
+		stream->xmit.state = HTTP2_RELEASE_RESPONSE;
+
+		T_FSM_NEXT();


I believe we can use T_FSM_JMP here and other appropriate places that do not switch on every transition.

There are a lot of changes in this commit: - All skbs which contain HEADERS or DATA are placed in the queue of the corresponding stream, instead of being added to the socket write queue. - In tcp_push_pending_frames() we call special socket callback, which is used to move skbs from stream with highest priority to the socket write queue. Count of moved skbs is determined by current TCP CWND. Delayed queuing of skb to the socket write queue allows higher priority data to be placed ahead of lower priority one. - We encode headers during the previos operation, because otherwise there can be a situation then we send header which is encoded using hpack dynamic table, before the same header which was not encoded and client can't decode it. - We make frames during the previous operation, when skb is not located in the socket write queue, so we can not care about sequence numbers and so on. - Rewrite HTTP2 framing to state machine. - We don't need skb flags anymore. - Move applying of new settings to ss_do_send, because new settings should be applyed just before ack to this settings is pushed to the socket write queue, otherwise new settings can be applyed and some frames with this new settings will be sent before ack to this settings. Closes #1946

Previously when we send RST stream to the client we immediately move stream to the closed queue. If client open a new stream we shrink this queue and delete streams from this queue, so there is a chance that we have no time to send RST stream to the client. Now we move such stream to the closed queue only after sending RST stream. Also make little fix in stream processing. We should move stream to HTTP2_STREAM_REM_CLOSED state when client don't close stream from it's side. We should process stream in HTTP2_STREAM_CLOSED state since we don't move client to closed queue immediately.

Since all hpack encoding was moved to the 'ss_do_send' function we don't need any additional locks, since 'ss_do_send' is always called under the socket lock.

HTTP2_PRIORITY frames are allowed for the idle streams. Moreover some clients use this ability to set weight and dependency for idle stream, so we should handle HTTP2_PRIORITY for such streams.

According to our investigation ebtree is the fastest data structure for stream prioritization, so we should port it to our project.

Implement stream scheduler to have ability to get the most priority stream. HTTP2 connection context contains stream scheduler and each stream also contains scheduler for its children, so we have a tree of schedulers. Each scheduler contains ebtree of children streams. This tree is used to find the most priority stream.

By default we use delayed queuing of skbs to socket write queue, that allows higher priority data to be placed ahead of lower priority data. Setting "latency_optimized_write" option to "false" disable this behaviour.

During working on broken tests, the following problems were identified and fixed: - we should zeroed skb->cb before pass skb to the kernel, because we use it for our purpose. - we should call `ss_skb_init_for_xmit` before setting mark and tls type to prevent it possible zeroed in this function. - HTTP2_STREAM_REM_CLOSED state is a state when we send RST_STREAM or END_STREAM flag to remote peer from any state other then open or local-half closed. (This state need to handle a situation when stream had been already closed on server side, but the client is not aware about that yet). Rework stream processing according this description. - STREAM_RECV_PROCESS set ctx state to HTTP2_IGNORE_FRAME_DATA in case when stream processing fails. But we don't take it into account in our state machine. Fix this problem (see changes for state machine HTTP2_RECV_DATA, HTTP2_RECV_HEADER, HTTP2_RECV_CONT).

- Codestyle fixed - Remove `tfw_h2_destroy_stream_send_queue` because we already iterate over all streams, when we close the connection and free all associated resources. So we don't need special function and should destroy stream send queue when we free all other resourses. - Change algorithm of calculation of available send window. Now we calculate sender and receiver window and select the smallest of them. - Do not change Conn_Closing value. - Do not setup sk_fill_write_queue for not HTTP2 connections. We use this function to make frames according current available window. For HTTP1 we don't need to make frames and don't need to call this function. - Calculate stream deficit using precomputed values from the table not using devision, because devision is very slow operation. - Remove unnecessary macros XMIT_FSM_JMP - Move comment about `tfw_http_msg_linear_transform` before function definition - Use memset instead of bzero_fast to zeroed skb->sb - Add comments

According review we decide to remove this option, because this option is true by default and there is no sense to disable implement ability to disable it.

Use BITMAP instead of array of bool values to save settings which we should apply.

Previously we send error response and close connection after sending all pending data. But we should not send responses in case of connection level error (when we close connection after error response). This patch fix this problem. Also decrease size and remove holes from some structures.

- Ajust counf of skbs, which were previously (during making frames) pushed to socket write queue. In `tcp_write_xmit` main loop `cong_win` is calculated on each loop iteration and if we calculate `cong_win` for making frames without taking into account previously pushed skbs we push more data into socket write queue then we can send. - Other frames (from any stream) MUST NOT occur between the HEADERS frame and any CONTINUATION frames that might follow. So add such frames to postponed list and send them later.

In case of TCP segmentation we can catch ENOMEM when we try to encrypt several skb. The reason is that we don't limit count of skb for encryption only the total len. But if the len of each skb is equal to 1, count of nents will be equal to TLS_MAX_PAYLOAD_SIZE (16384) and sgt list allocation fails. Probably a lot of memory allocation errors in TCP segmentation tests causes because of this problem.

Pass ss_action as a new argument to `sk_fill_write_queue`. Send TCP shutdown if ss_action is SS_SHUTDOWN (function is called from `ss_shutdown`) and there is no pending data in our scheduler and do not send it if it is called from `ss_do_close`.

Purge stream send queue and do not terminate connection in case of STREAM_FSM_RES_TERM_STREAM error. Also free stream->xmit.postponed skbs when connection is closed.

- Codestyle fixes. - Do not add stream to closed queue until all pending data will be sent. - Do not send END_STREAM in CONTINUATION frames because it is prohibited by RFC 9113.

- Do not split skb if it contains both HEADERS and DATA frames. This was the main problem: for small response which is located in one skb this split increase count of sending skbs twice! - Do not calculate snd_wnd for each frame. - Take into account current length of socket write queue during snd_wnd calculation.

Now Tempesta FW has ability to accept both HTTP1 and HTTP2 connections on the same port. When we create new connection we don't know real connection type, so we should set `sk_fill_write_queue` callback for all tls connections and just return 0 in this callback if it is not HTTP2 later.

We should adjust count of skb fragments even if `tail_frags` is equal to zero. Also remove unnecessary linear transformation during making frames.

EvgeniiMekhanik requested review from krizhanovsky and const-t September 7, 2023 09:18

EvgeniiMekhanik force-pushed the MekhanikEvgenii/1196-itog-2 branch 8 times, most recently from d121627 to 2c52de8 Compare September 12, 2023 09:47

krizhanovsky requested changes Sep 13, 2023

View reviewed changes

EvgeniiMekhanik force-pushed the MekhanikEvgenii/1196-itog-2 branch 2 times, most recently from eb9f218 to a97b4fa Compare September 14, 2023 14:06

krizhanovsky reviewed Sep 14, 2023

View reviewed changes

const-t requested changes Sep 18, 2023

View reviewed changes

const-t reviewed Sep 19, 2023

View reviewed changes

EvgeniiMekhanik marked this pull request as draft September 22, 2023 09:07

EvgeniiMekhanik force-pushed the MekhanikEvgenii/1196-itog-2 branch 7 times, most recently from d549c9c to bb97f30 Compare September 29, 2023 13:32

EvgeniiMekhanik force-pushed the MekhanikEvgenii/1196-itog-2 branch from bb97f30 to 8334f23 Compare September 29, 2023 15:02

EvgeniiMekhanik force-pushed the MekhanikEvgenii/1196-itog-2 branch 3 times, most recently from 0912d77 to 8d636f1 Compare October 16, 2023 12:16

EvgeniiMekhanik marked this pull request as ready for review October 16, 2023 12:17

EvgeniiMekhanik force-pushed the MekhanikEvgenii/1196-itog-2 branch from ba856b0 to 40209a9 Compare June 25, 2024 14:47

krizhanovsky mentioned this pull request Jun 25, 2024

fix(1346): fix control frames flood #2108

Open

EvgeniiMekhanik added 25 commits June 27, 2024 11:12

Update linux kernel patch

e272f97

Remove hpack table spin lock and guard

e79b8cb

Since all hpack encoding was moved to the 'ss_do_send' function we don't need any additional locks, since 'ss_do_send' is always called under the socket lock.

Implement handling of HTTP2_PRIORITY frames for idle streams

5c2c799

HTTP2_PRIORITY frames are allowed for the idle streams. Moreover some clients use this ability to set weight and dependency for idle stream, so we should handle HTTP2_PRIORITY for such streams.

Port ebtree to tempesta

40561b8

According to our investigation ebtree is the fastest data structure for stream prioritization, so we should port it to our project.

Implement "latency_optimized_write" option

8d1d28c

By default we use delayed queuing of skbs to socket write queue, that allows higher priority data to be placed ahead of lower priority data. Setting "latency_optimized_write" option to "false" disable this behaviour.

Codestyle fixes according review

8c07d7b

Remove latency_optimized_write option

6cc37ed

According review we decide to remove this option, because this option is true by default and there is no sense to disable implement ability to disable it.

Rework applying new settings according review

29ed953

Use BITMAP instead of array of bool values to save settings which we should apply.

Add comments about our scheduler algorithm

5217ce1

Move some http2 related logic to separate files

028567f

Add comments and codestyle fixes

5bb8bda

Do not terminate connection in case of STREAM_FSM_RES_TERM_STREAM

b84bf4b

Purge stream send queue and do not terminate connection in case of STREAM_FSM_RES_TERM_STREAM error. Also free stream->xmit.postponed skbs when connection is closed.

Fix according review

fd8769f

- Codestyle fixes. - Do not add stream to closed queue until all pending data will be sent. - Do not send END_STREAM in CONTINUATION frames because it is prohibited by RFC 9113.

Fix __extend_pgfrags function.

f355a6f

We should adjust count of skb fragments even if `tail_frags` is equal to zero. Also remove unnecessary linear transformation during making frames.

EvgeniiMekhanik force-pushed the MekhanikEvgenii/1196-itog-2 branch from 40209a9 to f355a6f Compare June 27, 2024 08:15

EvgeniiMekhanik merged commit 96fb92d into master Jun 27, 2024
1 check passed

EvgeniiMekhanik deleted the MekhanikEvgenii/1196-itog-2 branch June 27, 2024 14:09

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Mekhanik evgenii/1196 itog 2 #1973

Mekhanik evgenii/1196 itog 2 #1973

EvgeniiMekhanik commented Sep 7, 2023

krizhanovsky left a comment

krizhanovsky Sep 13, 2023

EvgeniiMekhanik Oct 17, 2023

krizhanovsky left a comment

krizhanovsky Sep 14, 2023 •

edited

Loading

krizhanovsky Sep 14, 2023 •

edited

Loading

EvgeniiMekhanik Oct 24, 2023

EvgeniiMekhanik commented Sep 14, 2023

const-t Sep 13, 2023

EvgeniiMekhanik Oct 17, 2023

const-t Sep 13, 2023

EvgeniiMekhanik Oct 16, 2023

const-t Sep 13, 2023

EvgeniiMekhanik Oct 16, 2023

const-t Sep 18, 2023

EvgeniiMekhanik Oct 16, 2023

const-t Sep 19, 2023

EvgeniiMekhanik Oct 16, 2023

Mekhanik evgenii/1196 itog 2 #1973

Mekhanik evgenii/1196 itog 2 #1973

Conversation

EvgeniiMekhanik commented Sep 7, 2023

krizhanovsky left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

krizhanovsky left a comment

Choose a reason for hiding this comment

krizhanovsky Sep 14, 2023 • edited Loading

Choose a reason for hiding this comment

krizhanovsky Sep 14, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

EvgeniiMekhanik commented Sep 14, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

krizhanovsky Sep 14, 2023 •

edited

Loading

krizhanovsky Sep 14, 2023 •

edited

Loading