
NETOBSERV-1112: Add TCP RTT calculation to ebpf-agent and userspace. #117

Merged
13 commits merged into netobserv:main on Jul 3, 2023

Conversation

dushyantbehl
Contributor

@dushyantbehl dushyantbehl commented Apr 28, 2023

Fixes #102

The feature is disabled by default and can be enabled with an environment variable flag.
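For illustration only, gating the feature on an environment variable could look like the sketch below; the variable name ENABLE_RTT is an assumption made for this sketch, not necessarily the exact flag added by this PR.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// rttEnabled reports whether TCP RTT tracking should be switched on.
// NOTE: "ENABLE_RTT" is an assumed name used only in this sketch; check the
// agent's configuration for the actual environment variable.
func rttEnabled() bool {
	v, err := strconv.ParseBool(os.Getenv("ENABLE_RTT"))
	return err == nil && v
}

func main() {
	fmt.Println("RTT tracking enabled:", rttEnabled())
}
```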

Here are the CPU and memory numbers before and after the PR:

[Screenshots: CPU and memory numbers before and after the PR]

RTT is calculated from the TCP handshake SYN and ACK packets.

This introduces a new eBPF map for tracking incoming and outgoing packets for RTT. The approach can be extended to other forms of TCP sequence tracking, such as the TCP timestamp option (TSval/TSecr), to continuous TCP latency tracking, or to other protocols such as ICMP.
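To make the mechanism concrete, here is a minimal userspace sketch (in Go) of the handshake-matching idea: remember when a SYN with a given sequence number left, and compute the RTT when the (SYN-)ACK that acknowledges seq+1 arrives on the reversed tuple. This is only an illustration under those assumptions, not the actual bpf/flows.c code; all names and types are made up for the sketch.

```go
package main

import (
	"fmt"
	"time"
)

// rttKey identifies a pending handshake: the flow tuple plus the SYN sequence
// number. It mirrors the idea of the new eBPF map, but in plain Go.
type rttKey struct {
	srcIP, dstIP     string
	srcPort, dstPort uint16
	seq              uint32
}

type rttTracker struct {
	pending map[rttKey]time.Time // time the SYN was seen, waiting for its ACK
}

func newRTTTracker() *rttTracker {
	return &rttTracker{pending: make(map[rttKey]time.Time)}
}

// onSYN records when an outgoing SYN was sent.
func (t *rttTracker) onSYN(k rttKey, ts time.Time) {
	t.pending[k] = ts
}

// onACK handles the handshake ACK: it travels in the reverse direction and
// acknowledges seq+1, so the lookup uses the reversed tuple and ackNum-1.
func (t *rttTracker) onACK(srcIP, dstIP string, srcPort, dstPort uint16, ackNum uint32, ts time.Time) (time.Duration, bool) {
	k := rttKey{srcIP: dstIP, dstIP: srcIP, srcPort: dstPort, dstPort: srcPort, seq: ackNum - 1}
	start, ok := t.pending[k]
	if !ok {
		return 0, false // SYN not seen (e.g. missed by sampling): no RTT for this flow
	}
	delete(t.pending, k)
	return ts.Sub(start), true
}

func main() {
	t := newRTTTracker()
	start := time.Now()
	t.onSYN(rttKey{"172.26.89.33", "90.130.74.149", 35262, 80, 1000}, start)
	if rtt, ok := t.onACK("90.130.74.149", "172.26.89.33", 80, 35262, 1001, start.Add(92*time.Millisecond)); ok {
		fmt.Printf("handshake rtt: %v\n", rtt) // ~92ms, in line with the dump below
	}
}
```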

Small changes to bpf rtt calculations.

Signed-off-by: Dushyant Behl <dushyantbehl@in.ibm.com>
@dushyantbehl
Contributor Author

Tested on my end; it works end to end for calculating the RTT of the TCP handshake.

@eranra
Collaborator

eranra commented Apr 30, 2023

@dushyantbehl, thanks for working on this PR (and also thanks, @praveingk). @dushyantbehl, can you add some text that explains the logic of the calculation? I suggest adding an .md file under docs. This will help both with review and, in the future, with explaining what is happening.

@eranra
Collaborator

eranra commented Apr 30, 2023

@jotak @jpinsonneau
the agent reports RTT just at the beginning of flows --- so all future/other flows in the conversation (with the same L4 hash) do not get the info. I wonder what we want to show in the flows table? maybe we need to 'clone' the RTT info in all flows from the same conversation? (but this is not really measured information)

@eranra
Collaborator

eranra commented Apr 30, 2023

@dushyantbehl I ran some high-level tests, and the results are OK, but not exactly what I expected:

(1) In shell #1 I ran: while true; do wget http://hgd-speedtest-1.tele2.net/1MB.zip; sleep 10; done
This downloads a file from a known test server.

(2) In shell #2 I ran: sudo FLOWS_TARGET_HOST=127.0.0.1 FLOWS_TARGET_PORT=9999 ./bin/netobserv-ebpf-agent
This starts the agent.

(3) In shell #3 I ran: ./bin/flowlogs-dump-collector -listen_port=9999 2>&1 >/dev/null | grep 90.130.74.149
This dumps the collected flows to the screen.

The output I get is:

$ ./bin/flowlogs-dump-collector -listen_port=9999 2>&1 >/dev/null | grep 90.130.74.149
ipv4: 11:15:34.760990 eth0 IP 172.26.89.33:35262 > 90.130.74.149:80: protocol:tcp type: 0 code: 0 dir:1 bytes:20945 packets:315 flags:530 ends: 11:15:35.550359 rtt(us): 0
ipv4: 11:15:34.853181 eth0 IP 90.130.74.149:80 > 172.26.89.33:35262: protocol:tcp type: 0 code: 0 dir:0 bytes:1069618 packets:315 flags:272 ends: 11:15:35.549528 rtt(us): 92190
ipv4: 11:15:35.639099 eth0 IP 90.130.74.149:80 > 172.26.89.33:35262: protocol:tcp type: 0 code: 0 dir:0 bytes:66 packets:1 flags:512 ends: 11:15:35.639099 rtt(us): 0
ipv4: 11:15:35.639127 eth0 IP 172.26.89.33:35262 > 90.130.74.149:80: protocol:tcp type: 0 code: 0 dir:1 bytes:66 packets:1 flags:16 ends: 11:15:35.639127 rtt(us): 0
ipv4: 11:15:45.560058 eth0 IP 172.26.89.33:38364 > 90.130.74.149:80: protocol:tcp type: 0 code: 0 dir:1 bytes:74 packets:1 flags:2 ends: 11:15:45.560058 rtt(us): 0
ipv4: 11:15:45.654908 eth0 IP 90.130.74.149:80 > 172.26.89.33:38364: protocol:tcp type: 0 code: 0 dir:0 bytes:1069354 packets:311 flags:784 ends: 11:15:46.489739 rtt(us): 94850
ipv4: 11:15:45.654979 eth0 IP 172.26.89.33:38364 > 90.130.74.149:80: protocol:tcp type: 0 code: 0 dir:1 bytes:20739 packets:312 flags:528 ends: 11:15:46.489763 rtt(us): 0
ipv4: 11:15:56.404436 eth0 IP 172.26.89.33:47890 > 90.130.74.149:80: protocol:tcp type: 0 code: 0 dir:1 bytes:21077 packets:317 flags:530 ends: 11:15:57.340466 rtt(us): 0
ipv4: 11:15:56.499668 eth0 IP 90.130.74.149:80 > 172.26.89.33:47890: protocol:tcp type: 0 code: 0 dir:0 bytes:1069684 packets:316 flags:784 ends: 11:15:57.340431 rtt(us): 95232
ipv4: 11:16:07.255185 eth0 IP 172.26.89.33:55564 > 90.130.74.149:80: protocol:tcp type: 0 code: 0 dir:1 bytes:20417 packets:307 flags:530 ends: 11:16:08.124477 rtt(us): 0
ipv4: 11:16:07.344960 eth0 IP 90.130.74.149:80 > 172.26.89.33:55564: protocol:tcp type: 0 code: 0 dir:0 bytes:1068958 packets:305 flags:784 ends: 11:16:08.124456 rtt(us): 89774
ipv4: 11:16:18.044159 eth0 IP 172.26.89.33:47140 > 90.130.74.149:80: protocol:tcp type: 0 code: 0 dir:1 bytes:18797 packets:271 flags:530 ends: 11:16:18.967787 rtt(us): 0
ipv4: 11:16:18.133470 eth0 IP 90.130.74.149:80 > 172.26.89.33:47140: protocol:tcp type: 0 code: 0 dir:0 bytes:1067836 packets:288 flags:784 ends: 11:16:18.967766 rtt(us): 89310
ipv4: 11:16:28.982189 eth0 IP 90.130.74.149:80 > 172.26.89.33:53638: protocol:tcp type: 0 code: 0 dir:0 bytes:1069156 packets:308 flags:784 ends: 11:16:29.758358 rtt(us): 87864
ipv4: 11:16:28.894324 eth0 IP 172.26.89.33:53638 > 90.130.74.149:80: protocol:tcp type: 0 code: 0 dir:1 bytes:20615 packets:310 flags:530 ends: 11:16:29.758379 rtt(us): 0

When I run ping hgd-speedtest-1.tele2.net, I get

PING hgd-speedtest-1.tele2.net (90.130.74.149) 56(84) bytes of data.
64 bytes from hgd-speedtest-1.tele2.net (90.130.74.149): icmp_seq=1 ttl=49 time=88.0 ms
64 bytes from hgd-speedtest-1.tele2.net (90.130.74.149): icmp_seq=2 ttl=49 time=89.6 ms
64 bytes from hgd-speedtest-1.tele2.net (90.130.74.149): icmp_seq=3 ttl=49 time=89.0 ms
64 bytes from hgd-speedtest-1.tele2.net (90.130.74.149): icmp_seq=4 ttl=49 time=88.7 ms
64 bytes from hgd-speedtest-1.tele2.net (90.130.74.149): icmp_seq=5 ttl=49 time=89.7 ms
64 bytes from hgd-speedtest-1.tele2.net (90.130.74.149): icmp_seq=6 ttl=49 time=88.8 ms

Now, the RTT of 88ms (from ping) is similar to the RTT reported by the eBPF agent :-) This looks good. The thing that is a bit challenging is the direction of the reports and the zeros. I see that we get RTT just on the return path (which makes sense), but it is a bit strange. We need to fix that in the UI or maybe in the user-space agent somehow. Can you also add the flags to the dump so we will know in which flows we get "syn-acks" and in which reports we do not?

eranra previously approved these changes Apr 30, 2023

@eranra eranra left a comment

It works :-)

@jpinsonneau
Contributor

@jotak @jpinsonneau the agent reports RTT just at the beginning of flows --- so all future/other flows in the conversation (with the same L4 hash) do not get the info. I wonder what we want to show in the flows table? maybe we need to 'clone' the RTT info in all flows from the same conversation? (but this is not really measured information)

From the conversation ID we can link back to the first flow and show any missing info. I would be more in favor of that than duplicating data, even if it would require an extra query in some cases.

FYI there is a doc for conversation tracking followup. Feel free to add any thoughts 😸

@codecov

codecov bot commented May 3, 2023

Codecov Report

Merging #117 (bbee31c) into main (27baf29) will decrease coverage by 0.47%.
The diff coverage is 9.09%.

@@            Coverage Diff             @@
##             main     #117      +/-   ##
==========================================
- Coverage   40.11%   39.64%   -0.47%     
==========================================
  Files          31       31              
  Lines        2124     2154      +30     
==========================================
+ Hits          852      854       +2     
- Misses       1227     1254      +27     
- Partials       45       46       +1     
Flag Coverage Δ
unittests 39.64% <9.09%> (-0.47%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
pkg/agent/agent.go 37.84% <0.00%> (-1.37%) ⬇️
pkg/ebpf/tracer.go 0.00% <0.00%> (ø)
pkg/flow/record.go 64.10% <28.57%> (-7.33%) ⬇️
pkg/exporter/proto.go 82.97% <100.00%> (+0.24%) ⬆️

@jotak
Member

jotak commented May 3, 2023

Thanks @dushyantbehl !

@jotak @jpinsonneau the agent reports RTT just at the beginning of flows --- so all future/other flows in the conversation (with the same L4 hash) do not get the info. I wonder what we want to show in the flows table? maybe we need to 'clone' the RTT info in all flows from the same conversation? (but this is not really measured information)

Help me understand: the RTT is the time elapsed between the SYN emission and the ACK reception. So when you say future flows in the conversation do not get the info, do you mean there could be more SYN/ACKs for the same n-tuples? Or do you see other cases of potential loss of information?

Also, I have a concern: is it safe with sampling? (@dushyantbehl @eranra @praveingk)
E.g. if the ingress/ACK of the handshake wasn't seen due to sampling, but subsequent ACKs are received, is it going to result in screwed-up measurements?

@dushyantbehl
Contributor Author

@jotak @jpinsonneau the agent reports RTT just at the beginning of flows --- so all future/other flows in the conversation (with the same L4 hash) do not get the info. I wonder what we want to show in the flows table? maybe we need to 'clone' the RTT info in all flows from the same conversation? (but this is not really measured information)

From the conversation ID we can link back to the first flow and show any missing info. I would be more in favor of that than duplicating data; even if it would require an extra query in some cases.

FYI there is a doc for conversation tracking followup. Feel free to add any thoughts 😸

@jpinsonneau I am not able to access the doc; could you please share access with my Red Hat account? The username is dbehl.

@dushyantbehl
Contributor Author

Thanks @dushyantbehl !

@jotak @jpinsonneau the agent reports RTT just at the beginning of flows --- so all future/other flows in the conversation (with the same L4 hash) do not get the info. I wonder what we want to show in the flows table? maybe we need to 'clone' the RTT info in all flows from the same conversation? (but this is not really measured information)

Help me understand.. the RTT is the time elapsed between the SYN emission and the ACK reception - so when when you say future flows in the conversation do not get the info, do you mean there could be more SYN/ACK for the same n-tuples? Or do you see other cases of potential loss of information?

Correct, the RTT here is for the initial handshake right now: the difference between the time the SYN packet is sent and the time the corresponding ACK packet is received.
I think I should fix my comment, because what I meant is that right now this patch calculates RTT only for the handshake packets, not for all packets in the flow, so we won't see an RTT measurement on the many flow records evicted from the ebpf-agent that correspond to packets other than the initial handshake.

Ideally we should do the calculation for all packets, because network latency can vary at any time; that is something we can add on top of this patch.

Also .. I have a concern, is it safe with sampling? ( @dushyantbehl @eranra @praveingk ) E.g. if the ingress/ACK of the handshake wasn't seen due to sampling, but subsequent ACK are received, is it going to result in screwed up measurements?

It's safe to the extent that this won't cause any wrong RTT values to be shown, but if a SYN or its corresponding ACK packet is missed due to sampling, we won't see an RTT measurement for that flow's handshake, because any ACK that comes later will carry a different sequence number, not that of the first one.

The only problem I foresee is that if the first ACK is retransmitted for some reason, the RTT calculation might not reflect the real-world round-trip latency. In that case I'm not sure how best to mitigate the issue, other than performing the RTT measurement for all packets in the flow and then taking a running average or something similar on top.
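As a rough sketch of the running-average idea mentioned above (not code from this PR), per-packet RTT samples could be smoothed with an exponentially weighted moving average, similar in spirit to how TCP maintains its SRTT (RFC 6298 uses alpha = 1/8). All names below are illustrative.

```go
package main

import (
	"fmt"
	"time"
)

// ewmaRTT keeps a smoothed RTT per flow.
type ewmaRTT struct {
	srtt  time.Duration
	alpha float64
	init  bool
}

// add folds a new RTT sample into the smoothed value.
func (e *ewmaRTT) add(sample time.Duration) {
	if !e.init {
		e.srtt, e.init = sample, true
		return
	}
	e.srtt = time.Duration((1-e.alpha)*float64(e.srtt) + e.alpha*float64(sample))
}

func main() {
	e := &ewmaRTT{alpha: 0.125}
	samples := []time.Duration{
		92 * time.Millisecond,
		95 * time.Millisecond,
		300 * time.Millisecond, // retransmission outlier
		90 * time.Millisecond,
	}
	for _, s := range samples {
		e.add(s)
	}
	fmt.Println("smoothed rtt:", e.srtt) // the outlier is damped instead of reported as-is
}
```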

@dushyantbehl
Contributor Author

Now, the rtt 88ms ( from ping) is similar to the reported RTT we get from the eBPF agent :-) This looks good. The thing that is a bit challenging is the direction of the reports and the zeros. I see that we get RTT just in the return path (which makes sense), but it is a bit strange. We need to fix that in the UI or maybe in the user-space agent somehow. Can you add to the dump also the flags so we will know in which flows we get "syn-acks" and ion which reports we do not?

Sure thing. I'm assuming that by the dump you mean flowlogs-dump-collector.go; it already has the flags, they're just not printed as SYN/ACK/FIN etc. but dumped in raw binary form. I can change that if that's what you meant.
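For reference, printing the standard TCP flag bits by name could look like the sketch below. Note that the agent's flags field may also carry custom combination bits beyond the standard eight; this helper simply ignores anything it does not recognize.

```go
package main

import (
	"fmt"
	"strings"
)

// tcpFlagNames decodes the eight standard TCP header flag bits
// (FIN=0x01 ... CWR=0x80) into a human-readable string.
func tcpFlagNames(flags uint16) string {
	names := []string{"FIN", "SYN", "RST", "PSH", "ACK", "URG", "ECE", "CWR"}
	var out []string
	for i, n := range names {
		if flags&(1<<uint(i)) != 0 {
			out = append(out, n)
		}
	}
	if len(out) == 0 {
		return "none"
	}
	return strings.Join(out, "|")
}

func main() {
	fmt.Println(tcpFlagNames(2))  // SYN
	fmt.Println(tcpFlagNames(16)) // ACK
	fmt.Println(tcpFlagNames(18)) // SYN|ACK
}
```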

@openshift-ci openshift-ci bot added the ok-to-test label Jun 28, 2023
@github-actions

New image: quay.io/netobserv/netobserv-ebpf-agent:6ae2ab2. It will expire after two weeks.

@github-actions github-actions bot removed the ok-to-test label Jun 28, 2023
@dushyantbehl
Contributor Author

/ok-to-test

@openshift-ci openshift-ci bot added the ok-to-test label Jun 28, 2023
@github-actions

New image: quay.io/netobserv/netobserv-ebpf-agent:b4e571b. It will expire after two weeks.

@openshift-ci-robot
Collaborator

openshift-ci-robot commented Jun 28, 2023

@dushyantbehl: This pull request references NETOBSERV-1112 which is a valid jira issue.

In response to this:

Fixes #102

Feature is disabled by default and can be enable with an environment variable flag.

Here are the CPU and Memory Numbers before and after the PR

Related PR - netobserv/flowlogs-pipeline#446
[Screenshots: memory and CPU numbers before and after the PR]


@github-actions github-actions bot removed the ok-to-test label Jun 29, 2023
@msherif1234
Contributor

Thanks @dushyantbehl !!
/lgtm

@dushyantbehl
Contributor Author

/ok-to-test

@openshift-ci openshift-ci bot added the ok-to-test label Jun 30, 2023
@github-actions

New image: quay.io/netobserv/netobserv-ebpf-agent:c1d9ac0. It will expire after two weeks.

@openshift-ci-robot
Collaborator

openshift-ci-robot commented Jun 30, 2023

@dushyantbehl: This pull request references NETOBSERV-1112 which is a valid jira issue.

In response to this:

Fixes #102

Feature is disabled by default and can be enable with an environment variable flag.

Here are the CPU and Memory Numbers before and after the PR

[Screenshots: CPU and memory numbers before and after the PR]


dushyantbehl added a commit to dushyantbehl/flowlogs-pipeline that referenced this pull request Jun 30, 2023
Related PR - netobserv/netobserv-ebpf-agent#117

Signed-off-by: Dushyant Behl <dushyantbehl@users.noreply.github.com>
dushyantbehl added a commit to dushyantbehl/flowlogs-pipeline that referenced this pull request Jun 30, 2023
Related PR - netobserv/netobserv-ebpf-agent#117

Signed-off-by: Dushyant Behl <dushyantbehl@users.noreply.github.com>
@jotak
Member

jotak commented Jul 3, 2023

/lgtm
/approve

@dushyantbehl thanks a lot, and sorry for the long wait!

@openshift-ci

openshift-ci bot commented Jul 3, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jotak


@openshift-ci openshift-ci bot added the approved label Jul 3, 2023
@openshift-merge-robot openshift-merge-robot merged commit 1b85351 into netobserv:main Jul 3, 2023
9 checks passed
openshift-merge-robot pushed a commit to netobserv/flowlogs-pipeline that referenced this pull request Jul 6, 2023
* Add support for RoundTripTime in FLP.
Related PR - netobserv/netobserv-ebpf-agent#117

Signed-off-by: Dushyant Behl <dushyantbehl@users.noreply.github.com>

* Update the ebpf-agent go dependency.

Signed-off-by: Dushyant Behl <dushyantbehl@users.noreply.github.com>

---------

Signed-off-by: Dushyant Behl <dushyantbehl@users.noreply.github.com>
Labels: approved, enhancement, jira/valid-reference, lgtm, ok-to-test
Linked issue: NETOBSERV-1112: Support adding of connection setup timestamp
7 participants