Mekhanik evgenii/1196 itog 2 #1973
Conversation
Force-pushed d121627 to 2c52de8
I reviewed only the scheduling algorithm, and it seems to have implementation bugs as well as design weaknesses; I think we can do better.
I'll review the modern research (e.g. https://www.usenix.org/system/files/nsdi22-paper-gao_peixuan.pdf surveys several old approaches and introduces a new one), but so far at least the idea of linking stream objects into a list and traversing them looks poor: spatial locality suffers because the stream objects are relatively large, so not many of them fit in the same page (even with the slab allocator).
The first idea is that instead of dealing with real linked list nodes (especially doubly linked ones, with 8 bytes per pointer), we can move the scheduler metadata out of TfwStream and introduce a packed data structure, which can be placed on the same page as the whole scheduler data and all the TfwStreamSchedEntry nodes. The new data structure can be linked with only 1-2 byte offsets within the memory area. I.e. in the most trivial case we can keep the algorithm but use a packed data structure with better spatial locality.
Maybe, with good spatial locality (we need to estimate how many entries fit in a single cache line), we can traverse the anchors list to find the right place for the current deficit.
It also makes sense to skew the existing unfairness so that streams already being scheduled (some frames have been sent on them) get slightly more priority than streams with higher weight, to lower the average load time.
fw/http_stream_sched.c (outdated)
	tfw_h2_sched_stream_enqueue(stream, *parent);
	entry = &stream->sched;
} else {
	break;
Why do we break here? It seems the stream is blocked and has no children, so why don't we try the next stream from the linked list? This should be either fixed or commented.
We can use BUG() here; I'll add an appropriate commit for the new algorithm.
Force-pushed eb9f218 to a97b4fa
Please address all the comments and discussions (verbal meetings as well) about the algorithm in the comments for the file.
A couple more comments.
It'd be good to benchmark the current implementation (the H2O reference approach) against an optimized one.
	*dep = &sched->root;
}

void
From https://calendar.perfplanet.com/2018/http2-prioritization/ :
"Chrome is the only browser to use the exclusive bit and it uses it on EVERY resource."
I hoped this was an outdated story, but it'd be good to test Tempesta FW with this algorithm against Chrome: https://tempesta-tech.com/knowledge-base/HTTP2-streams-prioritization/ clearly shows that this isn't an outdated story.
fw/http_stream_sched.c (outdated)
 */
tfw_h2_stream_sched_remove(stream);
tfw_h2_change_stream_weight(stream, new_weight);
tfw_h2_add_stream_dep(stream, new_parent, excl);
This is a common pattern for the scheduler: pull a stream from the priority queue, recalculate its deficit, and reinsert the stream according to the new deficit. If streams A and B have the same weight, then we just intermix them, i.e. both resources are downloaded with the highest delay.
It seems only progressive JPEGs benefit from this behavior, while all other resources, including WEBP or PNG images, don't. It should be the browser's job to assign different weights to different resources and the same weight to images only, but I'm not sure browsers are able to do this. I found only https://calendar.perfplanet.com/2018/http2-prioritization/, but it's old, and it seems, for example, that Firefox assigns the same weights to CSS.
Please test how Firefox and Chrome assign weights with the current implementation (it's also good to check resources like Instagram, with many images, or Gmail), document it, and probably we'll need to make the algorithm aware of progressive JPEGs and intermix only their streams (it seems this is what Cloudflare does or did).
This code is called during stream reprioritization, triggered by a stream PRIORITY frame.
This PR also fixes #1884.
fw/sock_clnt.c (outdated)
cwnd_avail = (tp->snd_cwnd > tp->packets_out ?
	      tp->snd_cwnd - tp->packets_out : 0);
cwnd_avail_in_bytes = cwnd_avail * mss_now;
I have doubts about calculating cwnd. As I understand it, this place must use tcp_cwnd_test() for the cwnd calculation; our version is very rough. Further, we make frames relying on cwnd, and this is also a question for me, because cwnd is not the exact amount of data that can be pushed into the write queue; the exact value is the TCP window. For more accurate calculations we must use tcp_mss_split_point(), which relies on cwnd. Maybe such rough calculations were chosen intentionally; if so, please describe it.
I reworked this place, but you're also not right here. tcp_cwnd_test() returns min(halfcwnd, cwnd - in_flight) on each iteration of the loop, i.e. the count of bytes that can be sent per ONE loop iteration. Since we call this function on each loop iteration, it returns 0 only when cwnd - in_flight becomes equal to zero! So we can use cwnd - in_flight for our purpose, but we should also take into account the length of the socket write queue, because we can push a lot of skbs on a previous iteration if the headers are big (we ignore cwnd when we make headers) or if some special frames were sent. We also can't use tcp_mss_split_point(), because it deals with an skb. We could use it only if we made a frame for each skb (using skb->len as the limit of the frame length), but I think that would be bad, since we would have to run TLS encryption for each skb in that case. So I think that the rough calculation is better.
fw/http_frame.c (outdated)
#endif

static int
tfw_h2_make_frames_for_stream(TfwH2Ctx *ctx, TfwStream *stream,
I would prefer to rename this function, since now we also encode headers here.
fixed
fw/http_stream.h (outdated)
 * @processed	- count of bytes processed during the prepare xmit
 *		  callback;
 * @nskbs	- count of skbs processed during the prepare xmit callback;
 * @mark	- mark of the resp skb_head;
Looks like we can lose the mark in tcp_write_xmit() -> tso_fragment(). Do we have tests for the mark forwarded to the backend?
Created new issue #1987, because this problem is not resolved in this patch.
fw/http_frame.c (outdated)
@@ -2548,6 +2561,8 @@ do { \
	stream->xmit.is_blocked = !stream->rem_wnd;
	T_FSM_EXIT();
}

ADJUST_TMP_AVAILABLE_CWND(*cwnd_awail, tmp_cwnd_awail);
I suggest doing this on the transition to the next state from HTTP2_MAKE_HEADERS_FRAMES and HTTP2_MAKE_CONTINUATION_FRAMES, because checking it on every DATA frame, of which there will be many more than HEADERS frames, doesn't look right.
fixed
fw/http_frame.c (outdated)
stream->xmit.state = HTTP2_RELEASE_RESPONSE;

T_FSM_NEXT();
I believe we can use T_FSM_JMP here and in other appropriate places that do not switch on every transition.
fixed
Force-pushed d549c9c to bb97f30
Force-pushed bb97f30 to 8334f23
Force-pushed 0912d77 to 8d636f1
Force-pushed ba856b0 to 40209a9
There are a lot of changes in this commit:
- All skbs which contain HEADERS or DATA frames are placed in the queue of the corresponding stream instead of being added to the socket write queue.
- In tcp_push_pending_frames() we call a special socket callback, which is used to move skbs from the stream with the highest priority to the socket write queue. The count of moved skbs is determined by the current TCP CWND. Delayed queuing of skbs to the socket write queue allows higher-priority data to be placed ahead of lower-priority data.
- We encode headers during the previous operation, because otherwise there can be a situation where we send a header encoded using the HPACK dynamic table before the same header which was not encoded, and the client can't decode it.
- We make frames during the previous operation, when the skb is not yet in the socket write queue, so we don't have to care about sequence numbers and so on.
- Rewrite HTTP/2 framing as a state machine.
- We don't need skb flags anymore.
- Move the applying of new settings to ss_do_send, because new settings should be applied just before the ack to these settings is pushed to the socket write queue; otherwise the new settings can be applied and some frames using them will be sent before the ack to the settings.
Closes #1946
Previously, when we sent an RST_STREAM to the client, we immediately moved the stream to the closed queue. If the client opens a new stream, we shrink this queue and delete streams from it, so there was a chance that we had no time to send the RST_STREAM to the client. Now we move such a stream to the closed queue only after sending the RST_STREAM. Also make a little fix in stream processing: we should move a stream to the HTTP2_STREAM_REM_CLOSED state when the client doesn't close the stream from its side, and we should process streams in the HTTP2_STREAM_CLOSED state, since we no longer move streams to the closed queue immediately.
Since all HPACK encoding was moved to the ss_do_send function, we don't need any additional locks, because ss_do_send is always called under the socket lock.
HTTP/2 PRIORITY frames are allowed for idle streams. Moreover, some clients use this ability to set the weight and dependency for an idle stream, so we should handle PRIORITY frames for such streams.
According to our investigation, ebtree is the fastest data structure for stream prioritization, so we port it to our project.
Implement a stream scheduler to be able to get the highest-priority stream. The HTTP/2 connection context contains a stream scheduler, and each stream also contains a scheduler for its children, so we have a tree of schedulers. Each scheduler contains an ebtree of child streams. This tree is used to find the highest-priority stream.
By default we use delayed queuing of skbs to the socket write queue, which allows higher-priority data to be placed ahead of lower-priority data. Setting the "latency_optimized_write" option to "false" disables this behaviour.
While working on broken tests, the following problems were identified and fixed:
- We should zero skb->cb before passing an skb to the kernel, because we use it for our own purposes.
- We should call `ss_skb_init_for_xmit` before setting the mark and TLS type, to prevent them from possibly being zeroed in this function.
- HTTP2_STREAM_REM_CLOSED is the state entered when we send RST_STREAM or the END_STREAM flag to the remote peer from any state other than open or local half-closed. (This state is needed to handle the situation when a stream has already been closed on the server side, but the client is not aware of that yet.) Rework stream processing according to this description.
- STREAM_RECV_PROCESS sets the ctx state to HTTP2_IGNORE_FRAME_DATA when stream processing fails, but we didn't take this into account in our state machine. Fix this problem (see the changes for the HTTP2_RECV_DATA, HTTP2_RECV_HEADER, and HTTP2_RECV_CONT states).
- Codestyle fixed.
- Remove `tfw_h2_destroy_stream_send_queue`, because we already iterate over all streams when we close the connection and free all associated resources, so we don't need a special function and should destroy the stream send queue when we free all other resources.
- Change the algorithm for calculating the available send window: now we calculate the sender and receiver windows and select the smallest of them.
- Do not change the Conn_Closing value.
- Do not set up sk_fill_write_queue for non-HTTP/2 connections. We use this function to make frames according to the currently available window; for HTTP/1 we don't need to make frames and don't need to call this function.
- Calculate the stream deficit using precomputed values from a table rather than using division, because division is a very slow operation.
- Remove the unnecessary macro XMIT_FSM_JMP.
- Move the comment about `tfw_http_msg_linear_transform` before the function definition.
- Use memset instead of bzero_fast to zero skb->cb.
- Add comments.
According to the review, we decided to remove this option, because it is true by default and there is no sense in implementing the ability to disable it.
Use a bitmap instead of an array of bool values to record which settings we should apply.
Previously we sent an error response and closed the connection after sending all pending data. But we should not send responses in case of a connection-level error (when we close the connection after the error response). This patch fixes the problem. Also decrease the size of, and remove holes from, some structures.
- Adjust the count of skbs which were previously (during frame making) pushed to the socket write queue. In the `tcp_write_xmit` main loop, `cong_win` is calculated on each loop iteration, and if we calculate `cong_win` for frame making without taking previously pushed skbs into account, we push more data into the socket write queue than we can send.
- Other frames (from any stream) MUST NOT occur between a HEADERS frame and any CONTINUATION frames that might follow. So add such frames to a postponed list and send them later.
In the case of TCP segmentation we can hit ENOMEM when we try to encrypt several skbs. The reason is that we don't limit the count of skbs for encryption, only the total length. But if the length of each skb is equal to 1, the count of nents will be equal to TLS_MAX_PAYLOAD_SIZE (16384) and the sgt list allocation fails. Probably a lot of the memory allocation errors in the TCP segmentation tests are caused by this problem.
Pass ss_action as a new argument to `sk_fill_write_queue`. Send a TCP shutdown if ss_action is SS_SHUTDOWN (the function is called from `ss_shutdown`) and there is no pending data in our scheduler, and do not send it if the function is called from `ss_do_close`.
Purge the stream send queue and do not terminate the connection in case of a STREAM_FSM_RES_TERM_STREAM error. Also free stream->xmit.postponed skbs when the connection is closed.
- Codestyle fixes.
- Do not add a stream to the closed queue until all pending data has been sent.
- Do not send END_STREAM in CONTINUATION frames, because it is prohibited by RFC 9113.
- Do not split an skb if it contains both HEADERS and DATA frames. This was the main problem: for a small response located in one skb, this split doubles the count of skbs sent!
- Do not calculate snd_wnd for each frame.
- Take into account the current length of the socket write queue during the snd_wnd calculation.
Now Tempesta FW is able to accept both HTTP/1 and HTTP/2 connections on the same port. When we create a new connection we don't know the real connection type, so we should set the `sk_fill_write_queue` callback for all TLS connections and just return 0 in this callback if the connection turns out not to be HTTP/2.
We should adjust the count of skb fragments even if `tail_frags` is equal to zero. Also remove an unnecessary linear transformation during frame making.
Force-pushed 40209a9 to f355a6f