
Lokiless netobserv #72

Open · wants to merge 19 commits into base: main

Conversation

@memodi (Contributor) commented Jul 2, 2024

NETOBSERV-1686 - Lokiless Network Observability blog

```yaml
loki:
  enable: false
```
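For context, a minimal sketch of the full resource with Loki disabled could look like the following; this assumes the FlowCollector v1beta2 API and the conventional resource name `cluster`, so adjust for your installation:

```yaml
apiVersion: flows.netobserv.io/v1beta2
kind: FlowCollector
metadata:
  name: cluster             # NetObserv watches a single FlowCollector, conventionally named "cluster"
spec:
  deploymentModel: Direct   # no Kafka in this sketch
  loki:
    enable: false           # skip Loki; NetObserv metrics are still exposed to Prometheus
```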
@jotak (Member) commented Jul 3, 2024

You could also add some context, such as noting that, without Loki, metrics are still sent to Prometheus - and since Prometheus is part of the core OpenShift payload, you don't need any further backend installation or configuration. It might be obvious for us but not for everyone.

@jotak (Member) left a comment:

Great blog! thanks @memodi !

@jotak (Member) commented Jul 3, 2024

@memodi I did some query tests this morning - but this is using monolithic Loki, which is not the best setup for queries:

| Query | Prom | Loki |
| :-- | :--: | :--: |
| Whole topology cluster-wide / owners / 5min | 397ms | 1s |
| Whole overview cluster-wide / owners / 5min | 300ms | 2s |
| Whole topology cluster-wide / owners / 30min | 369ms | 6s |
| Whole overview cluster-wide / owners / 30min | 343ms | 11s |
| Whole topology cluster-wide / owners / 3h | 416ms | 3s |
| Whole overview cluster-wide / owners / 3h | 368ms | 6s |

memodi and others added 3 commits July 3, 2024 10:24
@memodi (Contributor Author) commented Jul 3, 2024

> @memodi I did some query tests this morning - but this is using monolithic Loki, not the best for queries

Thanks @jotak, I was hoping we could leverage the netobserv_prom_calls_duration_bucket/netobserv_loki_calls_duration_bucket metrics to do these measurements, averaging over say 1000 queries (I could write a quick UI test for that), and compare with queries like:

```
histogram_quantile(0.9, sum(rate(netobserv_loki_calls_duration_bucket{code="200"}[5m])) by (le, code))
histogram_quantile(0.9, sum(rate(netobserv_prom_calls_duration_bucket{code="200"}[5m])) by (le, code))
```

what do you think?

## Performance and Resource utilization gains

### Query performance:
Prometheus queries are blazing fast compared to Loki queries, but don't take my word for it; let's look at the data from the query performance tests:
Contributor:

That's probably a spot to introduce the difference between the two solutions (flows vs metrics) and go into details in ## Trade-offs.

Comment on lines 32 to 40
The table below shows 90th percentile query times for each time range:

| Time Range | Loki | Prometheus |
| :--------: | :-------: | :----------: |
| Last 5m | 2287 ms | 95.5 ms |
| Last 1h | 4581 ms | 236 ms |
| Last 6h | > 10 s | 394 ms |

As the time range to fetch network flows gets wider, Loki queries tend to get slower or time out, while Prometheus queries are able to render the data within a fraction of a second.
Contributor:

How did you get these numbers? Is it from your browser networking tab?

Contributor Author:

@jpinsonneau I wrote a small script that runs 50 queries each against Prometheus and Loki: https://github.com/memodi/openshift-tests-private/commit/4ea4683753c1e6e945ecde0dcc0ee871270b6920 and, after the test is done, gathered the Prometheus metrics for the Loki and Prometheus call durations with a query like:

```
histogram_quantile(0.9, sum(rate(netobserv_loki_calls_duration_bucket{code="200"}[5m])) by (le, code))
```

Contributor:

that's cool! You should link the test in the blog 😸

Comment on lines 61 to 63
1. Without storage of network flows, it no longer provides the Traffic flows table. <TODO: insert a picture Traffic table greyed out>

2. Per-pod level resource granularity is not available, since it would cause Prometheus metrics to have high cardinality. <TODO: insert a picture where diff between with-Loki and without-Loki Scope>
Contributor:

It would be interesting to also compare the output storage size with and without Loki.

Contributor Author:

what do you mean by output storage size? I have mentioned here that users don't need to provision additional storage:
https://github.com/netobserv/documents/pull/72/files#diff-77e07c919145d98ff5ecc61c81fe91d345f160619e8ddbf9e39d4dab956d7921R43

Contributor:

Prometheus storage increases when enabling netobserv metrics.
There are even capabilities to export to remote storage: https://prometheus.io/docs/prometheus/latest/storage/#remote-storage-integrations

The goal here would be to be able to say how much the netobserv metrics could take (depending on what you enabled, of course) compared to the Loki storage when storing every flow.
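As a sketch of what such a remote_write export could look like in plain Prometheus configuration (the endpoint URL and the relabel rule below are illustrative assumptions; on OpenShift the equivalent would go through the cluster monitoring ConfigMap rather than a raw prometheus.yml):

```yaml
remote_write:
  - url: https://remote-storage.example.com/api/v1/write   # hypothetical remote endpoint
    write_relabel_configs:
      - source_labels: [__name__]
        regex: netobserv_.*          # e.g. only ship NetObserv metrics to keep volume down
        action: keep
```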

@memodi (Contributor Author) commented Jul 12, 2024

> Prometheus storage increases when enabling netobserv metrics.

Do you think it would cause a significant increase in Prometheus storage by having these metrics published? cc @jotak @stleerh - wdyt?

Contributor:

It could, depending on the cardinality of the selected metrics. But it's also important to highlight low-impact configurations 😸

@memodi (Contributor Author) commented Jul 19, 2024

@jotak @jpinsonneau I did some testing to figure out the Prometheus impact by running the NDH workload on 9 worker nodes:

When netobserv is deployed:
Head series/chunks: jumps from ~470K to ~1M

Storage: `prometheus_tsdb_wal_storage_size_bytes` shows a high rate of increase [1] (not sure, though, if this metric is an indicator of Prometheus' storage)
Prom. CPU: doubles from 0.15 to 0.3 [2]
Prom. RSS: jumps from 2.5G to 4.7G [3]

[1] prom_wal_rate (graph screenshot)
[2] prom_cpu_usage (graph screenshot)
[3] prom_memory_usage (graph screenshot)

This is with the default metrics published when Loki is disabled.
I am not sure, though, that this is the absolute difference between with and without Loki, since many of these metrics were already published even when Loki was enabled.
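As a sketch, these three signals could be watched with queries like the following (the `prometheus-k8s` job selector is an assumption about OpenShift's in-cluster Prometheus and may need adjusting):

```
# WAL growth rate, in bytes per second
rate(prometheus_tsdb_wal_storage_size_bytes{job="prometheus-k8s"}[5m])

# Prometheus CPU usage, in cores
rate(process_cpu_seconds_total{job="prometheus-k8s"}[5m])

# Prometheus resident memory (RSS), in bytes
process_resident_memory_bytes{job="prometheus-k8s"}
```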

Contributor:

I think to see the total storage size you should use `prometheus_tsdb_storage_blocks_bytes`: the number of bytes that are currently used for local storage by all blocks.

You can also have a look at `prometheus_tsdb_head_chunks_storage_size_bytes`: the size of the chunks_head directory, in parallel.

Anyway, these results are not bad, but we should inform customers about them in the doc.
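For example, assuming the same `prometheus-k8s` job selector as above, a quick check could look like:

```
# Bytes currently used for local storage by all persisted TSDB blocks
prometheus_tsdb_storage_blocks_bytes{job="prometheus-k8s"}

# Size of the chunks_head directory
prometheus_tsdb_head_chunks_storage_size_bytes{job="prometheus-k8s"}
```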

@jotak (Member) commented Jul 22, 2024

@memodi when you say:

> Prom. CPU: doubles from 0.15 to 0.3 [2]
> Prom. RSS: jumps from 2.5G to 4.7G [3]

Is it before, after, or during the NDH workloads being generated?

As you say, we need to compare with Loki enabled to check the diff.

Contributor Author:

@jotak it's during the NDH runs. CPU returns to its previous level after the workload completes; however, the RSS jump is stickier, probably because those active series are still held in Prometheus memory.

@jpinsonneau for some reason `prometheus_tsdb_storage_blocks_bytes` is always 0 in the default config.

Member:

@memodi so maybe it's also worth checking what the memory overhead is if we run NDH without netobserv at all, because creating pods etc. does have an impact on metrics even without netobserv.

@memodi memodi changed the title [WIP] Lokiless netobserv Lokiless netobserv Aug 14, 2024
@memodi memodi requested a review from skrthomas August 14, 2024 18:24
@skrthomas (Contributor) left a comment:

Awesome work. This blog is really good and will be super helpful. Let me know if you have any questions about my comments. Mostly copy edits and a few questions/suggestions.

When configured as above, Network Observability's Prometheus metrics will continue to be scraped by OpenShift's cluster Prometheus without any additional configuration, and the Network Traffic console will use Prometheus as a source for fetching the data.

## Performance and Resource utilization gains

Contributor:

I wonder if you can add a generic "gains" sentence or two here to introduce the two subheadings?

Contributor Author:

added, can you PTAL?

Contributor:

Lgtm, thanks Mehul :)

memodi and others added 4 commits August 15, 2024 10:39
@stleerh (Contributor) left a comment:

For the user, this is about whether you use Prometheus for metrics only or whether you also have Loki. It's not likely that someone will turn off Prometheus. So while it's beneficial to show that Prometheus is better than Loki and saves a lot of resources, it's not about picking Prometheus or Loki.

Should the article be focused on "Do you really need Loki?" Then the section on alerts doesn't really fit in because it talks about the features of Prometheus which you will have.

# Network Observability without Loki
Contributor:

Need a catchy title like "Light-weight Network Observability"

@skrthomas (Contributor) commented Aug 15, 2024

Just my opinion and an observation of how we use the same kinds of words to describe different aspects of the NetObserv toolset, but I think the "without Loki" part is still important here, and also "Operator". Maybe "Light-weight Network Observability Operator without Loki". In the docs, we also describe the CLI as light-weight Network Observability, but it's not the Operator.

By: Mehul Modi, Steven Lee

Recently, the Network Observability Operator released version 1.6, which added a major enhancement to provide network insights for your OpenShift cluster without Loki. This enhancement was also featured in the [What's new in Network Observability 1.6](../whats_new_1.6) blog, providing a quick overview of the feature. In this blog, let's look at some of the advantages and trade-offs users have when deploying the Network Observability Operator with Loki disabled. As more metrics are enabled by default with this feature, we'll also demonstrate a use case showing how those metrics can benefit users in real-world scenarios.

Contributor:

Should we assume the audience knows what Loki is or how it's being used in Network Observability?

Contributor Author:

Suggested change:

> Recently, the Network Observability Operator released version 1.6, which added a major enhancement to provide network insights for your OpenShift cluster without Loki. This enhancement was also featured in the [What's new in Network Observability 1.6](https://developers.redhat.com/articles/2024/08/12/whats-new-network-observability-16) blog, providing a quick overview of the feature. Until this release, Loki was required to be deployed alongside Network Observability to store the network flow data. In this blog, let's look at some of the advantages and trade-offs users have when deploying the Network Observability Operator with Loki disabled. As more metrics are enabled by default with this feature, we'll also demonstrate a use case showing how those metrics can benefit users in real-world scenarios.

How about this text here to explain how Loki was used?
cc @skrthomas

@skrthomas (Contributor) commented Aug 15, 2024

That's a good point. I wonder if we can link to Joel's other Loki-less blog somewhere here; I think there's a nice historical deep dive there. Maybe something like this:

> Recently, the Network Observability Operator released version 1.6, which added a major enhancement to provide network insights for your OpenShift cluster without Loki. This work builds on an effort to reduce the dependency on Loki, which began with the 1.4 release of Network Observability. Previously, Loki was a requirement for deploying the Network Observability Operator.

Contributor:

Looks like our comments crossed in the ether, but your suggestion looks good too, Mehul. My only question is whether or not we want to mention that this reduced dependency on Loki has been a work in progress for a couple of releases now? I know 1.6 is the most robust version of this, so maybe we don't want to mention anything about it.

Contributor Author:

I think it's okay to leave out the history from this.


* **Test**: We conducted 50 identical queries for 3 separate time ranges to render a topology view for both Loki and Prometheus. Such a query requests all K8s Owners for the workloads running in an OpenShift cluster that had network flows associated with them. Since we did not have any applications running, only Infrastructure workloads generated network traffic. In Network Observability, such an unfiltered view renders the topology as follows:

![unfiltered topology view](images/owner_screenshot.png)
Contributor:

What's the reason for showing Figure 1?  It kind of makes Network Observability look bad with this messy, complicated topology.

Contributor Author:

I was kind of dubious about it for the same reason; I removed the image and also updated the text of this paragraph.


1. Test bed 1: node-density-heavy workload run against a 25-node cluster.
2. Test bed 2: ingress-perf workload run against a 65-node cluster.
3. Test bed 3: cluster-density-v2 workload run against a 120-node cluster.
Contributor:

Need more explanation of these three test beds.  The audience might not understand what "node-density-heavy workload" means.

Contributor Author:

The following graphs show total vCPU, memory and storage usage for a recommended Network Observability stack - flowlogs-pipeline, eBPF-agent, Kafka, Prometheus and optionally Loki for production clusters.

![Compare total vCPUs utilized with and without Loki](<blogs/lokiless_netobserv/images/vCPUs consumed by NetObserv stack.png/Total vCPUs consumed.png>)
![Compare total RSS utilized with and without Loki](<blogs/lokiless_netobserv/images/Memory consumed by NetObserv stack.png>)
Contributor:

These image links are broken in the document.

Contributor Author:

updated the links


As seen across the test beds, we find a storage savings of 90% when Network Observability is configured without Loki.

<sup>*</sup> actual resource utilization may depend on various factors such as network traffic, FlowCollector sampling size, and the number of workloads and nodes in an OpenShift Container Platform cluster
Contributor:

There's not much said about CPU and memory usage compared to storage.  I think CPU and memory are the more important resources since they are the ones that drive up the cost.

Contributor Author:

I have mentioned the CPU and memory savings in the intro paragraph: https://github.com/netobserv/documents/pull/72/files#diff-77e07c919145d98ff5ecc61c81fe91d345f160619e8ddbf9e39d4dab956d7921R42 - can you suggest what else we could say here?
