Skip to content

Commit

Permalink
Merge pull request #63 from msherif1234/rtt-update
Browse files Browse the repository at this point in the history
NETOBSERV-1287: rtt blog updates to reflect last changes in ebpf agent
  • Loading branch information
msherif1234 authored Jan 23, 2024
2 parents ba83949 + 5146e5b commit 29dec26
Show file tree
Hide file tree
Showing 4 changed files with 161 additions and 140 deletions.
1 change: 1 addition & 0 deletions blogs/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -7,3 +7,4 @@
- [Real-Time Per Flow Packets Drop](./packet_drops/index.md)
- [Secondary interface support](./secondary_interface/index.md)
- [What's New in Network Observability 1.4](./whats_new_1.4/index.md)
- [TCP Round Trip Time tracking](./rtt/index.md)
Binary file modified blogs/rtt/images/rtt_graphs.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
160 changes: 160 additions & 0 deletions blogs/rtt/index.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,160 @@
# Network Observability TCP Handshake Round Trip Time

By: Dushyant Behl, Julien Pinsonneau and Mohamed S. Mahmoud

In OpenShift Container Platform (OCP), ensuring efficient packet delivery is
paramount for maintaining seamless communication between applications.
However, challenges like network congestion, misconfigured systems,
or hardware limitations can lead to slow connections, impacting overall
performance. Round Trip Time (RTT), typically measured in milliseconds,
plays a crucial role in monitoring network health and diagnosing issues.

## Implementing Smooth Round-Trip Time (SRTT) with eBPF

The RTT is the time it takes for a packet to travel from the sender to the receiver
and back. In a network, RTT can vary due to factors like network congestion,
varying route lengths, and other dynamic conditions.
SRTT is introduced to provide a more consistent and less jittery representation
of the RTT.

In TCP, RTT is a crucial metric.

Our implementation leverages eBPF to register to `fentry` eBPF hook
for `tcp_rcv_established()`.
We extract the SRTT (smooth round-trip time) value from TCP sockets, correlating it
to existing flows and enriching them with RTT values in nanoseconds.

When a new Netobserv flow is created, and the RTT (Round-Trip Time) feature is enabled,
an initial RTT of `10usec` is assigned.
This initial value for RTT may be considered quite low.

Upon triggering the eBPF (Extended Berkeley Packet Filter) socket, the flow RTT
value is updated to reflect the maximum RTT value for that specific flow.

For more detailed explanation of smoothed RTT estimation, refer to [Karn's algorithm paper](http://ccr.sigcomm.org/archive/1995/jan95/ccr-9501-partridge87.pdf)

![TCP based RTT calculations](./images/tcp_rtt_calculations.png)

### Why using `fentry` eBPF hook

The eBPF `fentry` programs have lower overhead as they trigger
the hook before calling the kernel function of interest.

In our implementation:

1. Register and link `fentry` hook for kernel's `tcp_rcv_established()`

```cgo
SEC("fentry/tcp_rcv_established")
int BPF_PROG(tcp_rcv_fentry, struct sock *sk, struct sk_buff *skb) {
if (sk == NULL || skb == NULL) {
return 0;
}
return calculate_flow_rtt_tcp(sk, skb);
}
```

2. Reconstruct the Netobserv flow key, including incoming interface Layer2, Layer3, and Layer4 info.

3. Match existing flows in the PerCPU hashmap flow table and enrich them with srtt info from TCP sockets. If
multiple SRTT values exist for the same flow, we take the maximum value.

Currently, our approach calculates RTT only for the TCP packets so flows which are non-TCP do not show RTT information.

## Potential Use Cases

Flow RTT capture from eBPF `flow_monitor` hookpoint can serve various purposes:

- Network Monitoring: Gain insights into TCP handshakes, helping
network administrators identify unusual patterns, potential bottlenecks, or
performance issues.

- Troubleshooting: Debug TCP-related issues by tracking latency and identifying
misconfigurations.

## How to Enable RTT

To enable this feature we need to create a FlowCollector object with the following
fields enabled in eBPF config section as below:

```yaml
apiVersion: flows.netobserv.io/v1beta2
kind: FlowCollector
metadata:
name: cluster
spec:
agent:
type: eBPF
ebpf:
features:
- FlowRTT
```
## A Quick Tour in the UI
Once the `FlowRTT` feature is enabled, the OCP console plugin automatically adapts
to provide additional filter and show information across Netflow Traffic page views.

Open your OCP Console and move to
`Administrator view` -> `Observe` -> `Network Traffic` page as usual.

A new filter, `Flow RTT` is available in the common section:

![flow rtt filter](./images/flow_rtt_filter.png)

The `FlowRTT` filter will allow you to capture any flow that has an RTT more than a specific time in nanoseconds.

For production users, filtering on the `TCP` protocol, `Ingress` direction,
and looking for `FlowRTT` values greater than 10,000,000 nanoseconds (10ms)
can help identify TCP flows with high latency.
This filtering approach allows users to focus on specific network flows that may
be experiencing significant delays.
By setting a threshold of `10ms`, you can efficiently isolate and address potential
latency issues in your TCP traffic.

### Overview

New graphs are introduced in the `Advanced options` -> `Manage panels` popup:

![advanced options](./images/advanced_options.png)

- Top X average TCP handshake Round Trip Time with overall (donut or lines)
- Bottom X minimum TCP handshake Round Trip Time with overall (donut or lines)
- Top X maximum TCP handshake Round Trip Time with overall (donut or lines)
- Top X 90th percentile TCP handshake Round Trip Time with overall (donut or lines)
- Top X 99th percentile TCP handshake Round Trip Time with overall (donut or lines)
-
![rtt graphs](./images/rtt_graphs.png)

These two graphs can help you to identify the slowest TCP flows and their trends
over time. Use the filters to drill down into specific pods, namespaces or nodes.

### Traffic flows

The table view shows the `Flow RTT` in both column and side panel.

![rtt table](./images/rtt_table.png)

### Topology

Last but not least, the topology view displays min / max / avg / p90 / p99 `RTT`
latency on edges.
Clicking on a node or an edge will allow you to see per direction metrics and
the related graph.

![rtt topology](./images/rtt_topology.png)

### Future improvments

Here is a non exhaustive list of future improvements coming for a full featured
Round Trip Time analysis:

- Latest RTT in topology view
- Prometheus metrics and alerting

## Feedback

We hope you liked this article !

Netobserv is an OpenSource project [available on github](https://github.com/netobserv).
Feel free to share your [ideas](https://github.com/netobserv/network-observability-operator/discussions/categories/ideas), [use cases](https://github.com/netobserv/network-observability-operator/discussions/categories/show-and-tell) or [ask the community for help](https://github.com/netobserv/network-observability-operator/discussions/categories/q-a).
140 changes: 0 additions & 140 deletions blogs/rtt/round_trip_time.md

This file was deleted.

0 comments on commit 29dec26

Please sign in to comment.