NETOBSERV-1112: This patch fixes a bug where RTT was not visible for flow logs at times. #159

Merged
merged 2 commits into from
Aug 4, 2023

Conversation

dushyantbehl
Contributor

@dushyantbehl dushyantbehl commented Jul 20, 2023

The RTT calculation in the eBPF agent was initially implemented like this:

image

We stored the timestamps of the outgoing SYN and the incoming ACK, and calculated the RTT as Incoming ACK - Outgoing SYN.

This works when the calculation is done on client and server endpoints, because each endpoint sends one SYN and receives one ACK in response to it when a TCP connection starts.

The eBPF agent, however, does not attach to only a single interface in an OpenShift environment. If we look at the same communication from inside a container, it looks like this:

Screenshot 2023-08-01 at 7 03 14 PM

In this case the container has a veth pair, and the eBPF agent attaches to the interface in the root namespace. Let's call that interface veth and its peer in the container namespace vpeer.
Any packet leaving vpeer, i.e. placed on the egress queue of the container, arrives on the ingress queue of the veth interface. Hence, if we calculate RTT with the previous logic of outgoing SYN and incoming ACK, we end up measuring the time between SYN(Y) and ACK(Y), which is just the response time of the container and not a correct RTT value.

Also note that the eBPF agent attaches to many interfaces in OpenShift, including interfaces like eth0 that always act as client or server endpoints; for such endpoints the calculation Incoming ACK - Outgoing SYN still works.

So, to account for both cases, we changed the RTT calculation in a simple way: we measure the SYN/SYN+ACK/ACK pairs separately, i.e. we calculate the RTT between SYN(X) and ACK(X) and between SYN(Y) and ACK(Y), and take the maximum of the two values.
This is needed because the eBPF agent attaches to both types of interfaces: root veth interfaces, which act like middleboxes, and eth0 interfaces, which act like final connection endpoints.

Note that the flows are in the reverse direction, so we need to calculate RTT for reverse flows in the eBPF agent, but that is pretty easy to do.
Also, this calculation is interface-specific, so we need to add the interface ID to the sequence identifier.
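
The "maximum of both intervals" idea described above can be sketched in plain user-space C. This is only an illustration under stated assumptions, not the agent's eBPF code: the `handshake_ts` struct and both function names are hypothetical, timestamps are assumed to be in nanoseconds, and all four timestamps are assumed to have been recorded on the same interface.

```c
#include <stdint.h>

/* Hypothetical record of handshake timestamps (ns) observed on ONE interface. */
struct handshake_ts {
    uint64_t syn_x; /* SYN(X): first packet of the handshake */
    uint64_t ack_x; /* ACK(X): carried by the SYN+ACK */
    uint64_t syn_y; /* SYN(Y): carried by the same SYN+ACK */
    uint64_t ack_y; /* ACK(Y): final ACK of the handshake */
};

/* Time between a SYN and its matching ACK; 0 if the pair is missing or inverted. */
static uint64_t pair_rtt(uint64_t syn_ts, uint64_t ack_ts) {
    if (syn_ts == 0 || ack_ts < syn_ts)
        return 0;
    return ack_ts - syn_ts;
}

/* One interval is the real round trip and the other is only local response
 * time, depending on where the interface sits; keep the larger of the two. */
static uint64_t handshake_rtt(const struct handshake_ts *h) {
    uint64_t a = pair_rtt(h->syn_x, h->ack_x);
    uint64_t b = pair_rtt(h->syn_y, h->ack_y);
    return a > b ? a : b;
}
```

With this shape the same function gives the same answer whether the long interval is the first pair (endpoint-like interface) or the second pair (middlebox-like interface), which is the property the description relies on.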

@dushyantbehl
Contributor Author

/ok-to-test

@openshift-ci openshift-ci bot added the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Jul 20, 2023
@github-actions

New image:
quay.io/netobserv/netobserv-ebpf-agent:3f822c7

It will expire after two weeks.

To deploy this build, run from the operator repo, assuming the operator is running:

USER=netobserv VERSION=3f822c7 make set-agent-image

@codecov

codecov bot commented Jul 20, 2023

Codecov Report

Merging #159 (e4d2dff) into main (338d2b2) will decrease coverage by 0.23%.
Report is 2 commits behind head on main.
The diff coverage is 0.00%.

@@            Coverage Diff             @@
##             main     #159      +/-   ##
==========================================
- Coverage   38.82%   38.60%   -0.23%     
==========================================
  Files          31       31              
  Lines        2246     2259      +13     
==========================================
  Hits          872      872              
- Misses       1325     1338      +13     
  Partials       49       49              
Flag Coverage Δ
unittests 38.60% <0.00%> (-0.23%) ⬇️


Files Changed Coverage Δ
pkg/ebpf/tracer.go 0.00% <0.00%> (ø)

bpf/rtt_tracker.h Outdated (resolved)
@github-actions github-actions bot removed the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Jul 26, 2023
@dushyantbehl
Contributor Author

/ok-to-test

@openshift-ci openshift-ci bot added the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Jul 27, 2023
@github-actions

New image:
quay.io/netobserv/netobserv-ebpf-agent:3f822c7

It will expire after two weeks.

To deploy this build, run from the operator repo, assuming the operator is running:

USER=netobserv VERSION=3f822c7 make set-agent-image

@github-actions github-actions bot removed the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Jul 28, 2023
@dushyantbehl dushyantbehl added bug Something isn't working ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. and removed do-not-merge/hold labels Jul 28, 2023
@github-actions

New image:
quay.io/netobserv/netobserv-ebpf-agent:0ffaeef

It will expire after two weeks.

To deploy this build, run from the operator repo, assuming the operator is running:

USER=netobserv VERSION=0ffaeef make set-agent-image

@dushyantbehl dushyantbehl changed the title from "RTT latency is visible in one direction, This PR adds RTT measurement to both direction flows." to "NETOBSERV-1112: This patch fixes a bug where RTT was not visible for flow logs at times." Jul 28, 2023
@openshift-ci-robot
Collaborator

openshift-ci-robot commented Jul 28, 2023

@dushyantbehl: This pull request references NETOBSERV-1112 which is a valid jira issue.

In response to this:

In a client to server communication, client sends SYN, server sends ACK. The RTT is calculated when ACK arrives and is reported on the ingress flow from Server->Client as the ACK packet corresponds to that flow.
This patch reverses the flow direction and finds the Client -> Server flow entry and adds RTT to the aggregated metrics.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
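
The earlier description quoted above talks about reversing the flow direction to find the Client -> Server entry. A minimal user-space sketch of building such a reversed lookup key follows. The `flow_key` struct is a simplified, hypothetical stand-in for the agent's real flow identifier, and `IP_LEN` is an assumed value; only the idea of swapping endpoints while keeping the interface comes from the thread.

```c
#include <stdint.h>
#include <string.h>

#define IP_LEN 16 /* assumed size, standing in for IP_MAX_LEN */

/* Hypothetical, simplified flow key used only for this illustration. */
struct flow_key {
    uint8_t  src_ip[IP_LEN];
    uint8_t  dst_ip[IP_LEN];
    uint16_t src_port;
    uint16_t dst_port;
    uint32_t if_index;
};

/* Build the key of the opposite-direction flow seen on the same interface. */
static void reverse_flow_key(const struct flow_key *src, struct flow_key *dst) {
    memset(dst, 0, sizeof(*dst));
    /* Fields that are swapped between the two directions. */
    memcpy(dst->src_ip, src->dst_ip, IP_LEN);
    memcpy(dst->dst_ip, src->src_ip, IP_LEN);
    dst->src_port = src->dst_port;
    dst->dst_port = src->src_port;
    /* Fields that stay the same. */
    dst->if_index = src->if_index;
}
```

Reversing a key twice yields the original key, which is a cheap sanity check for this kind of helper.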

bpf/types.h Outdated
// No need to emit this struct. It's used only in kernel space
typedef struct flow_seq_id_t {
u16 src_port;
u16 dst_port;
u8 src_ip[IP_MAX_LEN];
u8 dst_ip[IP_MAX_LEN];
u32 seq_id;
u8 transport_protocol;
u32 if_index; // OS interface index
u8 __padding;
Contributor

No need to add manual padding; the attribute will take care of it. Also, seq_id is becoming more like flow_id. I feel we can use the global hash map to calculate RTT directly; non-TCP fields, like the ICMP ones, can be set to 0.

Contributor Author

I added transport_protocol to make it future-proof, to address your comment about seq id collisions, and to make it similar to the DNS seq_id so that the two can be combined in the future if needed.
The padding is something I can remove.
if_index is needed because the flow agent attaches to many interfaces on the same path, which can affect the timestamps at which SYN and ACK were recorded to a large extent if the same flow passes through eth0 -> bridge -> container veth -> container namespace ethernet, etc.

Contributor

Yeah, it's just becoming more and more similar to flow_id.

Contributor Author

There won't be more additions to it in my opinion. This map also stores seq_id which is the primary identifier here which does not have to be part of flow_id.
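
On the padding point in this thread: with natural alignment the compiler inserts padding after a lone `u8` that precedes a `u32`, and hash-map keys are generally compared byte by byte, so a common precaution is to zero the whole key before filling it. The sketch below is user-space C under assumptions: `IP_MAX_LEN` is given an assumed value of 16, the struct mirrors the fields in the diff without the manual `__padding` member, and `init_seq_key` is a hypothetical helper.

```c
#include <stdint.h>
#include <string.h>

#define IP_MAX_LEN 16 /* assumed value for this sketch */

/* Mirrors the fields shown in the diff, minus the manual __padding member. */
typedef struct flow_seq_id {
    uint16_t src_port;
    uint16_t dst_port;
    uint8_t  src_ip[IP_MAX_LEN];
    uint8_t  dst_ip[IP_MAX_LEN];
    uint32_t seq_id;
    uint8_t  transport_protocol;
    uint32_t if_index; /* OS interface index */
} flow_seq_id;

/* Hypothetical helper: zero every byte first so that compiler-inserted
 * padding is deterministic, then set the fields. Ports and IPs are left
 * at zero here to keep the sketch short. */
static void init_seq_key(flow_seq_id *k, uint32_t seq, uint8_t proto, uint32_t if_index) {
    memset(k, 0, sizeof(*k));
    k->seq_id = seq;
    k->transport_protocol = proto;
    k->if_index = if_index;
}
```

Two keys built this way compare equal with `memcmp` even if the underlying memory previously held different garbage, which is what a byte-wise key lookup needs.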

@@ -101,6 +101,8 @@ func NewFlowFetcher(cfg *FlowFetcherConfig) (*FlowFetcher, error) {
if enableRtt == 0 {
// Cannot set the size of map to be 0 so set it to 1.
spec.Maps[flowSequencesMap].MaxEntries = uint32(1)
} else {
log.Infof("RTT calculations are enabled")
Contributor

Is this for debugging? We can tell what is enabled by looking at the config env vars, correct?

Contributor Author

It was just meant as info to see this in the logs, yeah... It can be removed safely.

Contributor Author

The only problem I see is that if both directions (Ingress and Egress) are not enabled, RTT calculations will be disabled, and the message will be present only in the logs.

Contributor

Is it even possible to disable tc on one side but not the other?

Contributor Author

Well, do you mean tc queues? I am not sure, but I don't think it's possible.
What I meant is the case where the eBPF agent is somehow set to attach at only one of Ingress or Egress, via the DIRECTION configuration of the agent. I believe this option is not currently exposed to the operator, so we shouldn't worry too much.

@github-actions github-actions bot removed the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Aug 1, 2023
@openshift-ci-robot
Collaborator

openshift-ci-robot commented Aug 1, 2023

@dushyantbehl: This pull request references NETOBSERV-1112 which is a valid jira issue.


@msherif1234
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm label Aug 1, 2023
@jotak
Member

jotak commented Aug 2, 2023

@dushyantbehl it might be a naive question, but the problem that it is solving, couldn't it be "just" solved by matching the interface on ACK? (ie. making sure the ACK we get is on the same interface as the initial SYN) ? I guess by adding the interface as a seq map values?

Member

@jotak jotak left a comment

@dushyantbehl @msherif1234 Was it tested end to end in ocp?
If so, I'm fine to merge if you're both ok with it (I just had a comment, wondering if there was a simpler approach possible - as this is a quite different algorithm now)

dst->if_index = src->if_index;

// Fields which should be reversed
dst->direction = (src->direction == INGRESS) ? EGRESS : INGRESS;
Member

not a big deal but:

Suggested change
- dst->direction = (src->direction == INGRESS) ? EGRESS : INGRESS;
+ dst->direction = 1 - src->direction;

Contributor Author

I did end-to-end testing on my side, and @jpinsonneau and @ronensc (who reported the issue on their end) tested it too.

On your suggestion: I fear that if we write it like that and in the future change INGRESS and EGRESS to something other than 0 and 1, the trick might fail silently, so it may be better to write it explicitly.
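
The concern about the arithmetic form failing silently can be made concrete with a small sketch. The numeric values here are assumptions: the thread implies INGRESS and EGRESS are currently 0 and 1, but which is which is a guess, and the `_V2` enum is a purely hypothetical renumbering used to show the failure mode.

```c
/* Assumed current values (the assignment of 0 and 1 is a guess). */
enum direction { INGRESS = 0, EGRESS = 1 };

/* Hypothetical future renumbering, used only to illustrate the risk. */
enum direction_v2 { INGRESS_V2 = 1, EGRESS_V2 = 2 };

/* Explicit form: correct for any pair of distinct values. */
static int reverse_explicit(int d) { return (d == INGRESS) ? EGRESS : INGRESS; }
static int reverse_explicit_v2(int d) { return (d == INGRESS_V2) ? EGRESS_V2 : INGRESS_V2; }

/* Arithmetic form: correct only while the two values are exactly 0 and 1. */
static int reverse_arith(int d) { return 1 - d; }
```

With values 0 and 1 both forms agree. With the hypothetical renumbering, `reverse_arith` returns 0 or -1, neither of which is a valid direction, and nothing flags the mistake at compile time.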

@dushyantbehl
Contributor Author

> @dushyantbehl it might be a naive question, but the problem that it is solving, couldn't it be "just" solved by matching the interface on ACK? (ie. making sure the ACK we get is on the same interface as the initial SYN) ? I guess by adding the interface as a seq map values?

@jotak yes, what you suggest is also part of the solution: the interface is matched when matching SYN and ACK, and this is done by adding the interface to the seq_id map structure.

u32 if_index; // OS interface index

@jotak
Member

jotak commented Aug 3, 2023

I have the feeling that this new algorithm will still cause troubles especially when there is sampling. When you say:

> we will calculate RTT of SYN(X) and ACK(X) and SYN(Y) and ACK(Y) and simply take the maximum of both values.

That works when we capture all the flows, because there is either "SYN(X) / ACK(X)" or "SYN(Y) / ACK(Y)" that corresponds to the actual RTT, whereas the other one is just process time and must be ignored. So taking the biggest should, in general, match the RTT and that's fine... But that doesn't work with sampling if you don't capture the flows that are relevant for RTT, then you would store the wrong value, which could screw up all the metrics computed from that.

So I have the feeling we need to find something else.

It sounds like the root cause of these problems is that we're never sure what INGRESS or EGRESS really mean (and we solved that problem in FLP with something called "reinterpret_direction", but we can't have the same solution here, we need to find something else). I'm not sure how we could do that. Maybe by reversing the meaning of INGRESS and EGRESS depending on which interface we're on?

Alternatively maybe we should restrict on which interfaces we are computing RTT, e.g. in your graph above we would use only eth0 and ignore veth. The downside is that it decreases the chances to get the RTT when sampling is on.

wdyt?

@dushyantbehl
Contributor Author

> I have the feeling that this new algorithm will still cause troubles especially when there is sampling. When you say:
>
> > we will calculate RTT of SYN(X) and ACK(X) and SYN(Y) and ACK(Y) and simply take the maximum of both values.
>
> That works when we capture all the flows, because there is either "SYN(X) / ACK(X)" or "SYN(Y) / ACK(Y)" that corresponds to the actual RTT, whereas the other one is just process time and must be ignored. So taking the biggest should, in general, match the RTT and that's fine... But that doesn't work with sampling if you don't capture the flows that are relevant for RTT, then you would store the wrong value, which could screw up all the metrics computed from that.

This flow consists of 3 packets: 1) SYN(X), 2) SYN(Y) ACK(X), 3) ACK(Y).
If, because of SAMPLING, we miss only one of packets 1 and 2, the calculation could be wrong and yield a very small value. This can be fixed by adding a check at some point in the pipeline so that values lower than a few microseconds are not shown.

SAMPLING, the way it is done right now, interacts with all packets, so we have to implement a workaround or change SAMPLING for this.

> So I have the feeling we need to find something else.
>
> It sounds like the root cause of these problems is that we're never sure what INGRESS or EGRESS really mean (and we solved that problem in FLP with something called "reinterpret_direction", but we can't have the same solution here, we need to find something else). I'm not sure how we could do that. Maybe by reversing the meaning of INGRESS and EGRESS depending on which interface we're on?

We cannot directly determine from eBPF which type of interface we are attached to; a veth interface could be either vpeer or veth from the diagram above. This problem is unique here because the eBPF agent attaches to all possible interfaces in the system, hence we need to look at both sides of the flows.

> Alternatively maybe we should restrict on which interfaces we are computing RTT, e.g. in your graph above we would use only eth0 and ignore veth. The downside is that it decreases the chances to get the RTT when sampling is on.

In my opinion there is a high chance of missing some flows in the system if we take this approach. Traffic from a pod on one host to a pod on a different host is encapsulated in the host IP, so if we only calculate at interfaces like eth0 we will only get host IP addresses as part of the flow, and not the actual pod IPs that we might be interested in.

> wdyt?

@msherif1234
Contributor

At some point I was working on a new hook that is triggered only when the TCP state changes, independent of the tc hook: https://github.com/netobserv/netobserv-ebpf-agent/pull/106/files#diff-f256390f160e8c3dfdbea43540fc179790fed2b28286efdf60bea2a4711f26c1R53. I feel that would get us out of the issue with sampling and missing state updates?

I am planning to bring that PR back to life at some point, as it's quite out of date, but instead of generating events we will just create flows in the hashmap.

@jotak
Member

jotak commented Aug 3, 2023

@dushyantbehl

> if by any chance of SAMPLING we miss only one packet out of 1 and 2 then the calculation could be wrong and provide a very small value. This can be fixed by having a check at some point in the pipeline to not show values lower than some micro seconds.

So in that case I think it would be on the agent to not show values of microsecond magnitude, rather than later in the pipeline: that's due to implementation details of the agent, and the other components should be able to trust what the agent sends.

> We cannot directly determine from ebpf which type of interface we are connected on, a veth interface could either be vpeer or veth from the diagram above. This problem is unique here because ebpf agent connects to all possible interfaces in the system hence we need to look at both side of flows.

Maybe we should try asking around... OVN and OVS folks might have an idea that we're missing about how we could solve that? We get the if_index, right? Maybe we can do something based on that?

@msherif1234 yeah that would be an interesting thing to try, I agree!

@dushyantbehl dushyantbehl reopened this Aug 3, 2023
@openshift-ci-robot
Collaborator

openshift-ci-robot commented Aug 3, 2023

@dushyantbehl: This pull request references NETOBSERV-1112 which is a valid jira issue.


@dushyantbehl
Contributor Author

> @dushyantbehl
>
> > if by any chance of SAMPLING we miss only one packet out of 1 and 2 then the calculation could be wrong and provide a very small value. This can be fixed by having a check at some point in the pipeline to not show values lower than some micro seconds.
>
> So in that case I think this would be on the agent to not show values of microseconds magnitude, rather than later in the pipeline: because that's due to implementation details of the agent, and the other components should be able to trust what the agent sends.

Actually, I feel that doing this filtering in the agent would be much more hardcoded behavior. A very simple fix is to mark in SAMPLING that SYN/ACK packets must not be thrown away; they are needed for many other behaviors as well.

> > We cannot directly determine from ebpf which type of interface we are connected on, a veth interface could either be vpeer or veth from the diagram above. This problem is unique here because ebpf agent connects to all possible interfaces in the system hence we need to look at both side of flows.
>
> Maybe we should try asking around ... ovn, ovs folks might have an idea that we're missing, about how we could solve that? We get the if_index right, maybe we can do something based on that?

This is not doable based on if_index or anything else; it just gives the name and ID of the interface, while the direction of capture depends on where the traffic is coming from. With the current patch only 2 packets are looked at and RTT is calculated from them, which is the standard approach.
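
The suggestion in this comment, letting sampling spare the handshake packets, could look roughly like the following. This is a hypothetical sketch and not the agent's sampling code: it assumes 1-in-N sampling where `sampling == 0` means "keep everything", takes the random value as a parameter, and uses an `ack_pending` flag (an invented name) to stand for "a SYN timestamp is waiting for this ACK", because the final ACK of a handshake cannot be told apart from other ACKs by its flags alone.

```c
#include <stdbool.h>
#include <stdint.h>

#define TCP_FLAG_SYN 0x02 /* SYN bit in the TCP flags byte */
#define TCP_FLAG_ACK 0x10 /* ACK bit in the TCP flags byte */

/* Decide whether a packet should be processed under 1-in-N sampling,
 * while never dropping packets needed for the handshake RTT. */
static bool should_process(uint32_t sampling, uint32_t rnd,
                           uint8_t tcp_flags, bool ack_pending) {
    /* SYN and SYN+ACK always pass. */
    if (tcp_flags & TCP_FLAG_SYN)
        return true;
    /* An ACK passes only if a recorded SYN is waiting for it. */
    if ((tcp_flags & TCP_FLAG_ACK) && ack_pending)
        return true;
    /* Assumed convention: 0 disables sampling. */
    if (sampling == 0)
        return true;
    return (rnd % sampling) == 0;
}
```

The trade-off is an extra lookup for ACK packets to compute `ack_pending`, in exchange for RTT values that no longer depend on which handshake packets happened to be sampled.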

bpf/configs.h Outdated
@@ -6,5 +6,6 @@
volatile const u32 sampling = 0;
volatile const u8 trace_messages = 0;
volatile const u8 enable_rtt = 0;
volatile const u64 min_rtt = 50000; //50 micro seconds
Member

@jotak jotak Aug 3, 2023

Thanks @dushyantbehl ! this will avoid having inconsistent values. We can iterate later to find a better solution. I don't despair of finding a good solution closer to your first implementation, which I found simpler but would need a twist, as you put it in this PR description :-)

If that's good for @msherif1234 , that's good for me too

Contributor Author

Thanks @jotak

bpf/configs.h Outdated
@@ -6,5 +6,6 @@
volatile const u32 sampling = 0;
volatile const u8 trace_messages = 0;
volatile const u8 enable_rtt = 0;
volatile const u64 min_rtt = 50000; //50 micro seconds
Contributor

Are you planning to make this configurable? I don't see the userspace pieces to do that. If not, you can define it in flows.c, where it's actually used, and there is no need to make it volatile in that case.

u64 *prev_ts = (u64 *)bpf_map_lookup_elem(&flow_sequences, seq_id);
if (prev_ts != NULL) {
u64 rtt = pkt->current_ts - *prev_ts;
// Because of SAMPLING the way it is done if we miss one of SYN/SYN+ACK/ACK
Contributor

I would also add a FIXME here and indicate that this is a temporary workaround for this case until we have a better solution.

Contributor Author

Makes sense to me. And since this is a FIXME, I would make the above variable non-volatile.
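
The minimum-RTT workaround discussed in these threads can be sketched in user-space C as follows. The 50000 value comes from the diff, whose comment calls it 50 microseconds, so timestamps are taken to be in nanoseconds. The function name and the choice to return 0 for "no RTT recorded" are assumptions made for this sketch; what the merged code does with a rejected sample is not shown in the thread.

```c
#include <stdint.h>

/* 50 microseconds expressed in nanoseconds, matching the value in the diff. */
static const uint64_t MIN_RTT_NS = 50000;

/* FIXME-style workaround: with sampling, a missed SYN/SYN+ACK/ACK can make
 * the measured interval implausibly small, so such samples are discarded.
 * Returns the RTT in ns, or 0 when the sample is rejected (assumed convention). */
static uint64_t filtered_rtt(uint64_t current_ts, uint64_t prev_ts) {
    if (current_ts <= prev_ts)
        return 0; /* guards against unsigned underflow */
    uint64_t rtt = current_ts - prev_ts;
    return (rtt < MIN_RTT_NS) ? 0 : rtt;
}
```

A fixed threshold like this is coarse: it also hides genuinely fast round trips below 50 microseconds, which is part of why the reviewers treat it as temporary.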

Signed-off-by: Dushyant Behl <dushyantbehl@hotmail.com>
Signed-off-by: Dushyant Behl <dushyantbehl@users.noreply.github.com>
@msherif1234
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm label Aug 4, 2023
@dushyantbehl
Contributor Author

/approve

@openshift-ci

openshift-ci bot commented Aug 4, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dushyantbehl

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved label Aug 4, 2023
@openshift-merge-robot openshift-merge-robot merged commit 6d9d2e7 into netobserv:main Aug 4, 2023
9 checks passed

5 participants