
Lokiless netobserv #72

Open · wants to merge 19 commits into base: main

Conversation

@memodi (Contributor) commented Jul 2, 2024

NETOBSERV-1686 - Lokiless Network Observability blog

```yaml
loki:
  enable: false
```
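For context, a minimal sketch of the full resource with Loki disabled could look like the following; this assumes the FlowCollector v1beta2 API and the conventional resource name `cluster`, so adjust for your installation:

```yaml
apiVersion: flows.netobserv.io/v1beta2
kind: FlowCollector
metadata:
  name: cluster             # NetObserv watches a single FlowCollector, conventionally named "cluster"
spec:
  deploymentModel: Direct   # no Kafka in this sketch
  loki:
    enable: false           # skip Loki; NetObserv metrics are still exposed to Prometheus
```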
@jotak (Member) commented Jul 3, 2024

You could also add some context, such as noting that, without Loki, metrics are still sent to Prometheus - and since Prometheus is part of the core OpenShift payload, you don't need any further backend installation or configuration. It might be obvious for us but not for everyone.

@jotak (Member) left a comment:

Great blog! thanks @memodi !

@jotak (Member) commented Jul 3, 2024

@memodi I did some query tests this morning - but this is using monolithic Loki, which is not the best setup for queries:

| Query | Prom | Loki |
| :-- | :--: | :--: |
| Whole topology cluster-wide / owners / 5min | 397ms | 1s |
| Whole overview cluster-wide / owners / 5min | 300ms | 2s |
| Whole topology cluster-wide / owners / 30min | 369ms | 6s |
| Whole overview cluster-wide / owners / 30min | 343ms | 11s |
| Whole topology cluster-wide / owners / 3h | 416ms | 3s |
| Whole overview cluster-wide / owners / 3h | 368ms | 6s |

memodi and others added 3 commits July 3, 2024 10:24
@memodi (Contributor Author) commented Jul 3, 2024

> @memodi I did some query tests this morning - but this is using monolithic Loki, not the best for queries

Thanks @jotak, I was hoping we could leverage the netobserv_prom_calls_duration_bucket/netobserv_loki_calls_duration_bucket metrics to do these measurements, averaging over say 1000 queries (I could write a quick UI test for that), and compare with queries like:

```
histogram_quantile(0.9, sum(rate(netobserv_loki_calls_duration_bucket{code="200"}[5m])) by (le, code))
histogram_quantile(0.9, sum(rate(netobserv_prom_calls_duration_bucket{code="200"}[5m])) by (le, code))
```

what do you think?

## Performance and Resource utilization gains

### Query performance:
Prometheus queries are blazing fast compared to Loki queries, but don't take my word for it; let's look at the data from the query performance tests:
Contributor:

That's probably a spot to introduce the difference between the two solutions (flows vs metrics) and go into details in ## Trade-offs.

Comment on lines 32 to 40
The table below shows 90th percentile query times for each time range:

| Time Range | Loki | Prometheus |
| :--------: | :-------: | :----------: |
| Last 5m | 2287 ms | 95.5 ms |
| Last 1h | 4581 ms | 236 ms |
| Last 6h | > 10 s | 394 ms |

As the time range to fetch network flows gets wider, Loki queries tend to get slower or time out, while Prometheus queries are able to render the data within a fraction of a second.
Contributor:

How did you get these numbers? Is it from your browser networking tab?

Contributor Author:

@jpinsonneau I wrote a small script that runs 50 queries each against Prometheus and Loki: https://github.com/memodi/openshift-tests-private/commit/4ea4683753c1e6e945ecde0dcc0ee871270b6920 and, after the test is done, gathered the Prometheus metrics for the Loki and Prometheus call durations with a query like:

```
histogram_quantile(0.9, sum(rate(netobserv_loki_calls_duration_bucket{code="200"}[5m])) by (le, code))
```

Contributor:

that's cool! You should link the test in the blog 😸

Comment on lines 61 to 63
1. Without storage of network flows, it no longer provides the Traffic flows table. <TODO: insert a picture Traffic table greyed out>

2. Per-pod level resource granularity is not available, since it would cause Prometheus metrics to have high cardinality. <TODO: insert a picture where diff between with-Loki and without-Loki Scope>
Contributor:

It would be interesting to also compare the output storage size with and without Loki.

Contributor Author:

what do you mean by output storage size? I have mentioned here that users don't need to provision additional storage:
https://github.com/netobserv/documents/pull/72/files#diff-77e07c919145d98ff5ecc61c81fe91d345f160619e8ddbf9e39d4dab956d7921R43

Contributor:

Prometheus storage increases when enabling netobserv metrics.
There are even capabilities to export to remote storage: https://prometheus.io/docs/prometheus/latest/storage/#remote-storage-integrations

The goal here would be to be able to say how much the netobserv metrics could take (depending on what you enabled, of course) compared to the Loki storage when storing every flow.
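As a sketch of what such a remote_write export could look like in plain Prometheus configuration (the endpoint URL and the relabel rule below are illustrative assumptions; on OpenShift the equivalent would go through the cluster monitoring ConfigMap rather than a raw prometheus.yml):

```yaml
remote_write:
  - url: https://remote-storage.example.com/api/v1/write   # hypothetical remote endpoint
    write_relabel_configs:
      - source_labels: [__name__]
        regex: netobserv_.*          # e.g. only ship NetObserv metrics to keep volume down
        action: keep
```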

@memodi (Contributor Author) commented Jul 12, 2024

> Prometheus storage increases when enabling netobserv metrics.

Do you think it would cause a significant increase in Prometheus storage by having these metrics published? cc @jotak @stleerh - wdyt?

Contributor:

It could, depending on the cardinality of the selected metrics. But it's also important to highlight low-impact configurations 😸

@memodi (Contributor Author) commented Jul 19, 2024

@jotak @jpinsonneau I did some testing to figure out the Prometheus impact by running the NDH workload on 9 worker nodes:

When netobserv is deployed:
Head series/chunks: jumps from ~470K to ~1M

Storage: `prometheus_tsdb_wal_storage_size_bytes` shows a high rate of increase [1] (not sure, though, if this metric is an indicator of Prometheus' storage)
Prom. CPU: doubles from 0.15 to 0.3 [2]
Prom. RSS: jumps from 2.5G to 4.7G [3]

[1] prom_wal_rate (graph screenshot)
[2] prom_cpu_usage (graph screenshot)
[3] prom_memory_usage (graph screenshot)

This is with the default metrics published when Loki is disabled.
I am not sure, though, that this is the absolute difference between with and without Loki, since many of these metrics were already published even when Loki was enabled.
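As a sketch, these three signals could be watched with queries like the following (the `prometheus-k8s` job selector is an assumption about OpenShift's in-cluster Prometheus and may need adjusting):

```
# WAL growth rate, in bytes per second
rate(prometheus_tsdb_wal_storage_size_bytes{job="prometheus-k8s"}[5m])

# Prometheus CPU usage, in cores
rate(process_cpu_seconds_total{job="prometheus-k8s"}[5m])

# Prometheus resident memory (RSS), in bytes
process_resident_memory_bytes{job="prometheus-k8s"}
```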

Contributor:

I think to see the total storage size you should use `prometheus_tsdb_storage_blocks_bytes`: the number of bytes that are currently used for local storage by all blocks.

You can also have a look at `prometheus_tsdb_head_chunks_storage_size_bytes`: the size of the chunks_head directory, in parallel.

Anyway, these results are not bad, but we should inform customers about them in the doc.
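For example, assuming the same `prometheus-k8s` job selector as above, a quick check could look like:

```
# Bytes currently used for local storage by all persisted TSDB blocks
prometheus_tsdb_storage_blocks_bytes{job="prometheus-k8s"}

# Size of the chunks_head directory
prometheus_tsdb_head_chunks_storage_size_bytes{job="prometheus-k8s"}
```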

@jotak (Member) commented Jul 22, 2024

@memodi when you say:

> Prom. CPU: doubles from 0.15 to 0.3 [2]
> Prom. RSS: jumps from 2.5G to 4.7G [3]

Is it before, after, or during the NDH workloads being generated?

As you say, we need to compare with Loki enabled to check the diff.

Contributor Author:

@jotak it's during the NDH runs. CPU returns to its previous level after the workload completes; however, the RSS jump is stickier, probably because those active series are still held in Prometheus memory.

@jpinsonneau for some reason `prometheus_tsdb_storage_blocks_bytes` is always 0 in the default config.

Member:

@memodi so maybe it's also worth checking what the memory overhead is if we run NDH without netobserv at all, because creating pods etc. does have an impact on metrics even without netobserv.

@memodi memodi changed the title [WIP] Lokiless netobserv Lokiless netobserv Aug 14, 2024
@memodi memodi requested a review from skrthomas August 14, 2024 18:24
@skrthomas (Contributor) left a comment:

Awesome work. This blog is really good and will be super helpful. Let me know if you have any questions about my comments. Mostly copy edits and a few questions/suggestions.

When configured as above, Network Observability's Prometheus metrics will continue to be scraped by OpenShift's cluster Prometheus without any additional configuration, and the Network Traffic console will use Prometheus as a source for fetching the data.

## Performance and Resource utilization gains

Contributor:

I wonder if you can add a generic "gains" sentence or two here to introduce the two subheadings?

Contributor Author:

added, can you PTAL?

Contributor:

Lgtm, thanks Mehul :)

memodi and others added 4 commits August 15, 2024 10:39
@stleerh (Contributor) left a comment:

For the user, this is about whether you use Prometheus for metrics only or whether you also have Loki. It's not likely that someone will turn off Prometheus. So while it's beneficial to show that Prometheus is better than Loki and saves a lot of resources, it's not about picking Prometheus or Loki.

Should the article be focused on "Do you really need Loki?" Then the section on alerts doesn't really fit in because it talks about the features of Prometheus which you will have.

# Network Observability without Loki
Contributor:

Need a catchy title like "Light-weight Network Observability"

@skrthomas (Contributor) commented Aug 15, 2024

Just my opinion and an observation of how we use the same kinds of words to describe different aspects of the NetObserv toolset, but I think the "without Loki" part is still important here, and also "Operator". Maybe "Light-weight Network Observability Operator without Loki". In the docs, we also describe the CLI as light-weight Network Observability, but it's not the Operator.

By: Mehul Modi, Steven Lee

Recently, the Network Observability Operator released version 1.6, which added a major enhancement to provide network insights for your OpenShift cluster without Loki. This enhancement was also featured in the [What's new in Network Observability 1.6](../whats_new_1.6) blog, providing a quick overview of the feature. In this blog, let's look at some of the advantages and trade-offs users have when deploying the Network Observability Operator with Loki disabled. As more metrics are enabled by default with this feature, we'll also demonstrate a use case showing how those metrics can benefit users in real-world scenarios.

Contributor:

Should we assume the audience knows what Loki is or how it's being used in Network Observability?

Contributor Author:

Suggested change:

> Recently, the Network Observability Operator released version 1.6, which added a major enhancement to provide network insights for your OpenShift cluster without Loki. This enhancement was also featured in the [What's new in Network Observability 1.6](https://developers.redhat.com/articles/2024/08/12/whats-new-network-observability-16) blog, providing a quick overview of the feature. Until this release, Loki was required to be deployed alongside Network Observability to store the network flow data. In this blog, let's look at some of the advantages and trade-offs users have when deploying the Network Observability Operator with Loki disabled. As more metrics are enabled by default with this feature, we'll also demonstrate a use case showing how those metrics can benefit users in real-world scenarios.

How about this text here to explain how Loki was used?
cc @skrthomas

@skrthomas (Contributor) commented Aug 15, 2024

That's a good point. I wonder if we can link to Joel's other Loki-less blog somewhere here; I think there's a nice historical deep dive there. Maybe something like this:

> Recently, the Network Observability Operator released version 1.6, which added a major enhancement to provide network insights for your OpenShift cluster without Loki. This work builds on an effort to reduce the dependency on Loki, which began with the 1.4 release of Network Observability. Previously, Loki was a requirement for deploying the Network Observability Operator.

Contributor:

Looks like our comments crossed in the ether, but your suggestion looks good too, Mehul. My only question is whether or not we want to mention that this reduced dependency on Loki has been a work in progress for a couple of releases now? I know 1.6 is the most robust version of this, so maybe we don't want to mention anything about it.

Contributor Author:

I think it's okay to leave out the history from this.


* **Test**: We conducted 50 identical queries for 3 separate time ranges to render a topology view for both Loki and Prometheus. Such a query requests all K8s Owners for the workloads running in an OpenShift cluster that had network flows associated with them. Since we did not have any applications running, only Infrastructure workloads generated network traffic. In Network Observability, such an unfiltered view renders the topology as follows:

![unfiltered topology view](images/owner_screenshot.png)
Contributor:

What's the reason for showing Figure 1?  It kind of makes Network Observability look bad with this messy, complicated topology.

Contributor Author:

I was kind of dubious about it for the same reason; I removed the image and also updated the text of this paragraph.


1. Test bed 1: node-density-heavy workload run against a 25-node cluster.
2. Test bed 2: ingress-perf workload run against a 65-node cluster.
3. Test bed 3: cluster-density-v2 workload run against a 120-node cluster.
Contributor:

Need more explanation of these three test beds.  The audience might not understand what "node-density-heavy workload" means.

Contributor Author:

The following graphs show total vCPU, memory and storage usage for a recommended Network Observability stack - flowlogs-pipeline, eBPF-agent, Kafka, Prometheus and optionally Loki for production clusters.

![Compare total vCPUs utilized with and without Loki](<blogs/lokiless_netobserv/images/vCPUs consumed by NetObserv stack.png/Total vCPUs consumed.png>)
![Compare total RSS utilized with and without Loki](<blogs/lokiless_netobserv/images/Memory consumed by NetObserv stack.png>)
Contributor:

These image links are broken in the document.

Contributor Author:

updated the links


As seen across the test beds, we find a storage savings of 90% when Network Observability is configured without Loki.

<sup>*</sup> actual resource utilization may depend on various factors such as network traffic, FlowCollector sampling size, and the number of workloads and nodes in an OpenShift Container Platform cluster
Contributor:

There's not much said about CPU and memory usage compared to storage.  I think CPU and memory are the more important resources since they are the ones that drive up the cost.

Contributor Author:

I have mentioned the CPU and memory savings in the intro paragraph: https://github.com/netobserv/documents/pull/72/files#diff-77e07c919145d98ff5ecc61c81fe91d345f160619e8ddbf9e39d4dab956d7921R42 - can you suggest what else we could say here?
