Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mekhanik evgenii/1196 itog 2 #1973

Merged
merged 25 commits into from
Jun 27, 2024
Merged

Conversation

EvgeniiMekhanik
Copy link
Contributor

No description provided.

@EvgeniiMekhanik EvgeniiMekhanik force-pushed the MekhanikEvgenii/1196-itog-2 branch 8 times, most recently from d121627 to 2c52de8 Compare September 12, 2023 09:47
Copy link
Contributor

@krizhanovsky krizhanovsky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reviewed the scheduling algorithm only and it seems it has implementation bugs as well as design weakness - I think we can do better.

I'll review the modern research (e.g. https://www.usenix.org/system/files/nsdi22-paper-gao_peixuan.pdf considers several old approaches and introduces a new one), but so far at least the idea to link stream objects into the list and traverse them looks silly: spatial locality is just poor having that the stream objects are relatively large and not so many of them can fit in the same page (even using the slab allocator).

The first idea is that instead of dealing with real linked list nodes (especially double linked!) with 8 bytes for each pointer, we can move the scheduler metadata out from TfwStream and introduce a packed data structure, which can be placed on the same page with the whole scheduler data and all the TfwStreamSchedEntry nodes. The new data structure can be linked with only 1-2 byte offsets within the memory area. I.e. in the most trivial case we can leave the algorithm, but use a packed data structure with better spacial locality.

Maybe, having good spacial locality (need to estimate how many entities can fit single cache line) we can traverse the anchors list to find the right place for the current deficit.

It makes sense to skew the existing unfairness to that streams already being scheduled (some frames have been sent on them) should get slightly more priority than streams having higher weight to get lower average load time.

fw/http_stream_sched.h Outdated Show resolved Hide resolved
fw/http_stream_sched.h Outdated Show resolved Hide resolved
fw/http_stream_sched.h Outdated Show resolved Hide resolved
fw/http_stream_sched.c Outdated Show resolved Hide resolved
fw/http_stream_sched.c Outdated Show resolved Hide resolved
tfw_h2_sched_stream_enqueue(stream, *parent);
entry = &stream->sched;
} else {
break;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we break here? It seems stream is blocked and have not children, so why don't we try the next stream from the linked list? This should be either fixed or commented.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can use BUG() here, add appropriate commit for new algorithm

fw/http_stream_sched.c Outdated Show resolved Hide resolved
fw/http_stream.h Show resolved Hide resolved
fw/http_stream_sched.c Outdated Show resolved Hide resolved
fw/http_stream_sched.c Outdated Show resolved Hide resolved
@EvgeniiMekhanik EvgeniiMekhanik force-pushed the MekhanikEvgenii/1196-itog-2 branch 2 times, most recently from eb9f218 to a97b4fa Compare September 14, 2023 14:06
Copy link
Contributor

@krizhanovsky krizhanovsky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please address all the comments and discussions (verbal meetings as well) about the algorithm in the comments for the file.

A couple of more comments.

It'd be good to benchmark the current implementation (as H2O reference) with an optimized one.

fw/http_stream_sched.c Outdated Show resolved Hide resolved
*dep = &sched->root;
}

void
Copy link
Contributor

@krizhanovsky krizhanovsky Sep 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://calendar.perfplanet.com/2018/http2-prioritization/ :

Chrome is the only browser to use the exclusive bit and it uses it on EVERY resource.

I hope it's an outdated story, but it'd be good to test Tempesta FW with the algorithm against Chrome. https://tempesta-tech.com/knowledge-base/HTTP2-streams-prioritization/ clearly shows that this isn't an outdated story.

*/
tfw_h2_stream_sched_remove(stream);
tfw_h2_change_stream_weight(stream, new_weight);
tfw_h2_add_stream_dep(stream, new_parent, excl);
Copy link
Contributor

@krizhanovsky krizhanovsky Sep 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a common pattern for the scheduler: pull a stream from the priority queue, recalculate it's deficit and reinsert the stream back according to the new deficit. If streams A and B have the same weight, then we just intermix them, i.e. both of the resources will be downloaded with the highest delay.

It seems only progressive JPEGs benefit from the behavior while all other resources, including WEBP or PNGs images, aren't. This should be a browser's job to assign different weight to different resources and the same weights to images only, but I'm not sure that browsers are able to do this. I found only https://calendar.perfplanet.com/2018/http2-prioritization/, but it's old and it seems for example than Firefox assigns the same weights to CSS.

Please test how Firefox and Chrome assign weights with the current implementation (also good to check resources like Instagram with many images or Gmail), document it and probably we'll need to make the algorithm aware about progressive JPEGs and intermix only their streams (it seems this is what Cloudflare does or did).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code is called during stream reprioritization, because of stream priority frame

@EvgeniiMekhanik
Copy link
Contributor Author

This PR also fixes #1884

fw/sock_clnt.c Outdated
Comment on lines 228 to 233
cwnd_avail = (tp->snd_cwnd > tp->packets_out ?
tp->snd_cwnd - tp->packets_out : 0);
cwnd_avail_in_bytes = cwnd_avail * mss_now;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a doubts about calculating cwnd. As I understand this place must use tcp_cwnd_test() for calculation of cwnd, our version is very rough. Further we make a frames relying on cwnd and this is also a question for me, because cwnd it's not an exact size of data that can be pushed into write_queue, exact value is tcp window. For more accurate calculations we must use tcp_mss_split_point() that relies on cwnd. Maybe such rough calculations chosen intentionally, if so please describe it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rework this place, but you also not right here. tcp_cwnd_test returns min(halfcwnd, cwnd - in_flight) on each iteration of the loop for calculation count of bytes that can be sent per ONE loop iteration. Since we call this function on each loop iteration it returns 0 only when cwnd - in_flight became equal to zero! So we can use cwnd - in_flight for our purpose, but also we should take into account len of socket write queue, because we can push a lot of skbs on previous iteration if headers are big (we ignore cwnd when we make headers) or if there are some special frames were sent. We also cant use tcp_mss_split_point because it deal with skb. We can use it only if we make frame for each skb (use skb->len as a limit of frame length), but it i think, that it is bad, since we should use tls encryption for each skb in this case. So i think that rough calculation is better.

fw/http_frame.c Outdated Show resolved Hide resolved
fw/http_stream.h Outdated Show resolved Hide resolved
fw/http_frame.c Outdated
#endif

static int
tfw_h2_make_frames_for_stream(TfwH2Ctx *ctx, TfwStream *stream,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer to rename this function. Now we also encode headers here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

fw/http_stream.h Outdated
* @processed - count of bytes, processed during prepare xmit
* callback;
* @nskbs - count of skbs processed during prepare xmit callback;
* @mark - mark of the resp skb_head;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like we can loose mark in tcp_write_xmit() -> tso_fragment(). Do we have tests for mark forwarded to backend?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Create new issue #1987 because this problem is not resolved in this patch

fw/http_stream.h Outdated Show resolved Hide resolved
fw/http_frame.c Outdated
@@ -2548,6 +2561,8 @@ do { \
stream->xmit.is_blocked = !stream->rem_wnd;
T_FSM_EXIT();
}

ADJUST_TMP_AVAILABLE_CWND(*cwnd_awail, tmp_cwnd_awail);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can suggest do this on transition to next state from HTTP2_MAKE_HEADERS_FRAMES and HTTP2_MAKE_CONTINUATION_FRAMES, because check it every DATA frame which will be much more than HEADERS, not looks right.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

fw/sock_clnt.c Outdated Show resolved Hide resolved
fw/ss_skb.h Outdated Show resolved Hide resolved
fw/sock.c Outdated Show resolved Hide resolved
fw/http_frame.c Outdated

stream->xmit.state = HTTP2_RELEASE_RESPONSE;

T_FSM_NEXT();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe we can use T_FSM_JMP here and other appropriate places that do not switch on every transition.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@EvgeniiMekhanik EvgeniiMekhanik marked this pull request as draft September 22, 2023 09:07
@EvgeniiMekhanik EvgeniiMekhanik force-pushed the MekhanikEvgenii/1196-itog-2 branch 7 times, most recently from d549c9c to bb97f30 Compare September 29, 2023 13:32
@EvgeniiMekhanik EvgeniiMekhanik force-pushed the MekhanikEvgenii/1196-itog-2 branch from bb97f30 to 8334f23 Compare September 29, 2023 15:02
@EvgeniiMekhanik EvgeniiMekhanik force-pushed the MekhanikEvgenii/1196-itog-2 branch 3 times, most recently from 0912d77 to 8d636f1 Compare October 16, 2023 12:16
@EvgeniiMekhanik EvgeniiMekhanik marked this pull request as ready for review October 16, 2023 12:17
There are a lot of changes in this commit:
- All skbs which contain HEADERS or DATA are placed in the
  queue of the corresponding stream, instead of being added
  to the socket write queue.
- In tcp_push_pending_frames() we call special socket callback,
  which is used to move skbs from stream with highest priority
  to the socket write queue. Count of moved skbs is determined
  by current TCP CWND. Delayed queuing of skb to the socket
  write queue allows higher priority data to be placed ahead
  of lower priority one.
- We encode headers during the previos operation, because
  otherwise there can be a situation then we send header
  which is encoded using hpack dynamic table, before the
  same header which was not encoded and client can't
  decode it.
- We make frames during the previous operation, when skb is not
  located in the socket write queue, so we can not care about
  sequence numbers and so on.
- Rewrite HTTP2 framing to state machine.
- We don't need skb flags anymore.
- Move applying of new settings to ss_do_send, because new
  settings should be applyed just before ack to this settings
  is pushed to the socket write queue, otherwise new settings
  can be applyed and some frames with this new settings will
  be sent before ack to this settings.

Closes #1946
Previously when we send RST stream to the client
we immediately move stream to the closed queue.
If client open a new stream we shrink this queue
and delete streams from this queue, so there is
a chance that we have no time to send RST stream
to the client. Now we move such stream to the
closed queue only after sending RST stream.
Also make little fix in stream processing.
We should move stream to HTTP2_STREAM_REM_CLOSED
state when client don't close stream from it's
side.
We should process stream in HTTP2_STREAM_CLOSED
state since we don't move client to closed queue
immediately.
Since all hpack encoding was moved to the
'ss_do_send' function we don't need any
additional locks, since 'ss_do_send'
is always called under the socket lock.
HTTP2_PRIORITY frames are allowed for the idle streams. Moreover
some clients use this ability to set weight and dependency for
idle stream, so we should handle HTTP2_PRIORITY for such streams.
According to our investigation ebtree is the fastest
data structure for stream prioritization, so we should
port it to our project.
Implement stream scheduler to have ability to get
the most priority stream. HTTP2 connection context
contains stream scheduler and each stream also
contains scheduler for its children, so we have a
tree of schedulers. Each scheduler contains ebtree
of children streams. This tree is used to find the
most priority stream.
By default we use delayed queuing of skbs to
socket write queue, that allows higher priority
data to be placed ahead of lower priority data.
Setting "latency_optimized_write" option to
"false" disable this behaviour.
During working on broken tests, the following
problems were identified and fixed:
- we should zeroed skb->cb before pass skb
  to the kernel, because we use it for our
  purpose.
- we should call `ss_skb_init_for_xmit` before
  setting mark and tls type to prevent it possible
  zeroed in this function.
- HTTP2_STREAM_REM_CLOSED state is a state when we
  send RST_STREAM or END_STREAM flag to remote peer
  from any state other then open or local-half closed.
  (This state need to handle a situation when stream
  had been already closed on server side, but the
  client is not aware about that yet). Rework stream
  processing according this description.
- STREAM_RECV_PROCESS set ctx state to
  HTTP2_IGNORE_FRAME_DATA in case when stream processing
  fails. But we don't take it into account in our state
  machine. Fix this problem (see changes for state
  machine HTTP2_RECV_DATA, HTTP2_RECV_HEADER,
  HTTP2_RECV_CONT).
- Codestyle fixed
- Remove `tfw_h2_destroy_stream_send_queue` because we
  already iterate over all streams, when we close the
  connection and free all associated resources. So we
  don't need special function and should destroy stream
  send queue when we free all other resourses.
- Change algorithm of calculation of available send
  window. Now we calculate sender and receiver window
  and select the smallest of them.
- Do not change Conn_Closing value.
- Do not setup sk_fill_write_queue for not HTTP2
  connections. We use this function to make frames
  according current available window. For HTTP1 we
  don't need to make frames and don't need to call
  this function.
- Calculate stream deficit using precomputed values
  from the table not using devision, because devision
  is very slow operation.
- Remove unnecessary macros XMIT_FSM_JMP
- Move comment about `tfw_http_msg_linear_transform`
  before function definition
- Use memset instead of bzero_fast to zeroed skb->sb
- Add comments
According review we decide to remove this
option, because this option is true by default
and there is no sense to disable implement
ability to disable it.
Use BITMAP instead of array of bool values
to save settings which we should apply.
Previously we send error response and close
connection after sending all pending data.
But we should not send responses in case of
connection level error (when we close connection
after error response). This patch fix this problem.
Also decrease size and remove holes from some
structures.
- Ajust counf of skbs, which were previously (during making
  frames) pushed to socket write queue. In `tcp_write_xmit`
  main loop `cong_win` is calculated on each loop iteration
  and if we calculate `cong_win` for making frames without
  taking into account previously pushed skbs we push more
  data into socket write queue then we can send.
- Other frames (from any stream) MUST NOT occur between
  the HEADERS frame and any CONTINUATION frames that might
  follow. So add such frames to postponed list and send
  them later.
In case of TCP segmentation we can catch ENOMEM
when we try to encrypt several skb. The reason
is that we don't limit count of skb for encryption
only the total len. But if the len of each skb is
equal to 1, count of nents will be equal to
TLS_MAX_PAYLOAD_SIZE (16384) and sgt list allocation
fails. Probably a lot of memory allocation errors
in TCP segmentation tests causes because of this
problem.
Pass ss_action as a new argument to `sk_fill_write_queue`.
Send TCP shutdown if ss_action is SS_SHUTDOWN (function
is called from `ss_shutdown`) and there is no pending data
in our scheduler and do not send it if it is called from
`ss_do_close`.
Purge stream send queue and do not terminate connection
in case of STREAM_FSM_RES_TERM_STREAM error.
Also free stream->xmit.postponed skbs when connection
is closed.
- Codestyle fixes.
- Do not add stream to closed queue
  until all pending data will be sent.
- Do not send END_STREAM in CONTINUATION frames
  because it is prohibited by RFC 9113.
- Do not split skb if it contains both HEADERS and
  DATA frames. This was the main problem: for small
  response which is located in one skb this split
  increase count of sending skbs twice!
- Do not calculate snd_wnd for each frame.
- Take into account current length of socket write
  queue during snd_wnd calculation.
Now Tempesta FW has ability to accept both HTTP1 and
HTTP2 connections on the same port. When we create
new connection we don't know real connection type,
so we should set `sk_fill_write_queue` callback
for all tls connections and just return 0 in this
callback if it is not HTTP2 later.
We should adjust count of skb fragments even if
`tail_frags` is equal to zero.
Also remove unnecessary linear transformation
during making frames.
@EvgeniiMekhanik EvgeniiMekhanik force-pushed the MekhanikEvgenii/1196-itog-2 branch from 40209a9 to f355a6f Compare June 27, 2024 08:15
@EvgeniiMekhanik EvgeniiMekhanik merged commit 96fb92d into master Jun 27, 2024
1 check passed
@EvgeniiMekhanik EvgeniiMekhanik deleted the MekhanikEvgenii/1196-itog-2 branch June 27, 2024 14:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants