NETOBSERV-719 Documentation update #200

jotak · 2022-11-22T17:24:46Z

Update configuration section (quick-filters, exporters)
New section "performance fine-tuning"
New section "securing data and communications"

openshift-ci · 2022-11-22T17:24:51Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

jotak · 2022-11-22T17:26:38Z

TODO:

Screenshots updates/new
Alert link
Quick filters reference (NETOBSERV-682)
More info on loki installation via the operator
CSV up/downstream handling

memodi · 2022-11-22T17:51:46Z

PR netobserv/documents#35 adds alert.

jotak · 2022-11-22T17:56:28Z

README.md

+
+In addition to sampling, other settings can help you get an optimal setup without compromising on the observability.
+
+Here is what you should pay attention to:


@stleerh text here / below overlaps a bit with your recommendation doc, so we should align them ; I'm waiting for your comments and suggestions! :-)

Specifically on the Loki side, these are the only parameters that can be tweaked if you use Loki Operator.

ingestionBurstSize

ingestionRate

maxGlobalStreamsPerTenant

memodi · 2022-11-22T18:02:04Z

README.md

+
+- `spec.loki.batchWait` and `spec.loki.batchSize` control the batching mechanism, ie. how often data is flushed out to Loki. Like in the eBPF agent batching, higher values will generate fewer traffic and consume less CPU, however it will increase a bit the memory consumption of `flowlogs-pipeline`, and may increase a bit collection latency.
+
+- `spec.loki.minBackoff`, `spec.loki.maxBackoff` and `spec.loki.maxRetries` control the retry mechanism. Retries may happen when Loki is unreachable or when it returns errors. Often, it is due to the rate limits configured on Loki server. When such situation occurs, it might not always be the best solution to increase rate limits (on server configuration side) or to increase retries. Increasing rate limits will put more pressure on Loki, so expect more memory and CPU usage, and also more traffic. Increasing retries will put more pressure on `flowlogs-pipeline`, as it will retain data for longer and accumulate more flows to send. When all the retry attempts fail, flows are simply dropped. Flow drops are counted in the metric `netobserv_loki_dropped_entries_total`.


instead of will put pressure - suggest to say cause high memory utilization

it's not just memory, increasing flows processing also increases CPU, network usage etc., right?

memodi · 2022-11-22T18:08:14Z

README.md

+
+NetObserv is currently meant to be used by cluster / network admins. There is not yet multi-tenancy implemented, which means a user having access to the data will be able to see *all* the data. It is up to the cluster admins to restrict the access to the NetObserv and Loki namespaces and services according to their security policy.
+
+To make authenticated queries to Loki, set `spec.loki.authToken` to `FORWARD`: the console plugin will then forward the logged-in Console user token to Loki.


should we link the RBAC configs here that we'd need depending on whether it's FORWARD or HOST?

jotak · 2022-11-25T08:32:23Z

This PR should be merged first: netobserv/documents#37
once it is, I will update this one with new links

eranra · 2022-11-27T13:55:14Z

README.md

@@ -69,21 +69,31 @@ _Pre-requisite: OpenShift 4.10 or above_



@jotak @stleerh

I suggest adding explanations on the flag, and in particular the required ones ::

$ go run ./main.go --help Usage of /tmp/go-build1418114459/b001/exe/main: -console-plugin-image string The image of the Console Plugin -ebpf-agent-image string The image of the eBPF agent -flowlogs-pipeline-image string The image of Flowlogs Pipeline -health-probe-bind-address string The address the probe endpoint binds to. (default ":8081") -kubeconfig string Paths to a kubeconfig. Only required if out-of-cluster. -leader-elect Enable leader election for controller manager. Enabling this will ensure there is only one active controller manager. -metrics-bind-address string The address the metric endpoint binds to. (default ":8080") -v print version -zap-devel Development Mode defaults(encoder=consoleEncoder,logLevel=Debug,stackTraceLevel=Warn). Production Mode defaults(encoder=jsonEncoder,logLevel=Info,stackTraceLevel=Error) (default true) -zap-encoder value Zap log encoding (one of 'json' or 'console') -zap-log-level value Zap Level to configure the verbosity of logging. Can be one of 'debug', 'info', 'error', or any integer value > 0 which corresponds to custom debug levels of increasing verbosity -zap-stacktrace-level value Zap Level at and above which stacktraces are captured (one of 'info', 'error', 'panic'). -zap-time-encoding value Zap time encoding (one of 'epoch', 'millis', 'nano', 'iso8601', 'rfc3339' or 'rfc3339nano'). Defaults to 'epoch'.

I think it's not really for this README.md which is focused on running the operator from the end user pespective, but maybe more for DEVELOPMENT.md.

ok ... agree

eranra · 2022-11-27T13:57:12Z

README.md


-![Flow table](./docs/assets/network-traffic-main.png)
+This dashboard shows overall, aggregated statistics on the cluster traffic. The stats can be refined with comprehensive filtering and display options. Different levels of aggregations are available: per node, per namespace, per owner or per pod/service). It allows to identify biggest talkers in different contexts: for instance, top X inter-namespace flows, or top X pod-to-pod flows within a namespace, etc.


s/traffic/network

s/statistics/metrics

README.md

eranra · 2022-11-27T14:00:43Z

README.md

+
+![Overview](./docs/assets/overview-dashboard.png)
+
+#### Topology



I think that both here and in the above overview it makes sense to describe the ability to update the view every amount of time to get an almost live view of the information

README.md

eranra · 2022-11-27T14:03:39Z

README.md

+
+These views are accessible directly from the main menu, and also as contextual tabs for any Pod, Deployment, Service (etc.) in their details page, with preset filters to focus on that resource.
+
+![Contextual topology](./docs/assets/topology-pod.png)


I suggest to show something that in not connected --- showing Loki as the pod might confuse

I think it's okay, but if you want an alternative, then you can filter on "ingress" to get traffic going to/from the router.

eranra · 2022-11-27T14:05:12Z

README.md


-_Coming soon_

 ### Grafana


Now that the metrics are available in the console, should we move the grafana explanation to hack folder or something developer/advanced and out of this README?

Grafana might still be the main argument especially for non openshift users, even if we haven't contributed much to that part yet. So I'd prefer to keep it visible

ok --- although the console also has a none-OCP incarnation that we can connect in none-OCP scenarios. My main reason to talk about that is that the README becomes a bit longer.

eranra

@jotak I think that the most complicated prereq for netobserv is Loki --- and we will get a lot of confusion and complaints on documentation. The sentence here https://github.com/netobserv/network-observability-operator/pull/200/files#diff-b335630551682c19a781afebcf4d07bf978fb1f8ac04c6bf87428ed5106870f5R17 is very-very challenging for someone that tries to deploy Loki --- can you maybe add a link here to installation instructions?

- Update configuration section (quick-filters, exporters) - New section "performance fine-tuning" - New section "securing data and communications"

OlivierCazade

Thanks!

jotak · 2022-11-28T12:11:44Z

@eranra sorry your link is broken, can you tell me hat is the very-very challenging sentence about Loki?
Note, I am still planning to add more info on Loki installation, that's still wip

eranra · 2022-11-28T12:18:06Z

@jotak sorry about the link --- the sentence I was referring to is "Please read the operator description. You will need to install Loki; some instructions are provided there."

- Add doc/links on Loki installation - Remove description from CSV, replaced with external description in markdown files, injected at bundle-build time. It offers several advantages: 1/ easier to read (e.g. in IDE markdown viewer) 2/ easier to compare upstream/downstream 3/ ability to update doc in a single PR rather than having to split work in upstream + midstream - Update Makefile bundle build to inject markdown description - Fix broken regex in makefile for links update

jotak · 2022-11-28T14:42:21Z

Last commit introduces mainly two things:

Modified CSV description, in particular with more info about Loki installation
New mechanism to inject markdown description into CSV, with upstream and downstream descriptions available here

jotak · 2022-11-28T14:42:31Z

@eranra ah, but I didn't plan to change that .. That's to avoid duplication, you know? 😄
Though with my last commit I've externalized the CSV description in another markdown file, so I could add a link to the relevant section now

stleerh · 2022-11-30T16:26:24Z

Makefile

-.PHONY: bundle-for-test
-bundle-for-test: bundle-prepare
-	$(KUSTOMIZE) build config/manifests | $(SED) -e 's~:container-image:~$(IMG)~' | $(SED) -e 's~:created-at:~$(DATE)~' | $(OPSDK) generate bundle -q --overwrite --version 0.0.0-$(VERSION) $(BUNDLE_METADATA_OPTS)
+	$(SED) -e 's/^/    /' config/manifests/bases/$(CSV_DESC) > tmp-desc


Add trap "rm -f tmp-desc" EXIT to clean up temp files. This also cleans up if someone presses ctrl-c to abort.

Better to use tmp-desc.$$.

Actually, it doesn't work with make (it needs workarounds). I'd go ahead with my ugly way but feel free to clean this out if you find a better way! :)

stleerh · 2022-11-30T16:32:31Z

Makefile

+		| $(SED) -e "/':full-description:'/r tmp-desc" \
+		| $(SED) -e "s/':full-description:'/|\-/" \
+		| $(OPSDK) generate bundle -q --overwrite --version $(BUNDLE_VERSION) $(BUNDLE_METADATA_OPTS)
+	rm tmp-desc


Remove line 258 if you have 'trap' above.

stleerh · 2022-11-30T16:53:03Z

README.md

@@ -69,21 +69,33 @@ _Pre-requisite: OpenShift 4.10 or above_

 If the OpenShift Console is detected in the cluster, a console plugin is deployed when a `FlowCollector` is installed. It adds new pages and tabs to the console:

- A flow table, with powerful filtering and display options
+#### Overview dashboard


I thought we were not going to use "dashboard" in Network Traffic to avoid the confusion with the "Dashboards" panel. My suggestion was to call it "Graphs" or possibly "Metrics".

So I'd rather go with "metrics", ok

(or charts)

stleerh · 2022-11-30T16:56:49Z

README.md

+
+These views are accessible directly from the main menu, and also as contextual tabs for any Pod, Deployment, Service (etc.) in their details page, with preset filters to focus on that resource.
+
+![Contextual topology](./docs/assets/topology-pod.png)


I think it's okay, but if you want an alternative, then you can filter on "ingress" to get traffic going to/from the router.

stleerh · 2022-11-30T17:05:19Z

README.md

+
+- Kafka (`spec.deploymentModel: KAFKA` and `spec.kafka`): when enabled, integrates the flow collection pipeline with Kafka, by splitting ingestion from transformation (kube enrichment, derived metrics, ...). Kafka can provide better scalability, resiliency and high availability ([view more details](https://www.redhat.com/en/topics/integration/what-is-apache-kafka)). Assumes Kafka is already deployed and a topic is created. For convenience, we provide a quick deployment using [strimzi](https://strimzi.io/): run `make deploy-kafka` from the repository.
+
+- Exporters (`spec.exporters`, _experimental_) an optional list of exporters to which to send enriched flows. Currently only KAFKA is supported. This allows you to define any custom storage or processing that can read from Kafka. This feature is flagged as _experimental_ as it has not been thoroughly or stress tested yet, so use at your own risks.


Change: "risk" (singular).

stleerh · 2022-11-30T17:07:58Z

README.md


 - Loki (`spec.loki`): configure here how to reach Loki. The default values match the Loki quick install paths mentioned in the _Getting Started_ section, but you may have to configure differently if you used another installation method.

- Kafka (`spec.kafka`): when enabled, integrate the flow collection pipeline with Kafka, by splitting ingestion from transformation (kube enrichment, derived metrics, ...). Kafka can provide better scalability, resiliency and high availability ([view more details](https://www.redhat.com/en/topics/integration/what-is-apache-kafka)). Assumes Kafka is already deployed and a topic is created. For convenience, we provide a quick deployment using [strimzi](https://strimzi.io/): run `make deploy-kafka` from the repository.
+- Quick filters (`spec.consolePlugin.quickFilters`): configure preset filters to be displayed in the Console plugin. They offer a way to quickly switch from filters to others, such as showing / hiding pods network, or infrastructure network, or application network, etc. They can be tuned to reflect the different workloads running on your cluster. For a list of available filters, [check this page](./docs/QuickFilters.md).


In general, it should be more clear on where these settings are in the Flow Collector UI. For example, it's not clear that spec.loki refers to the field called "querierUrl". (I know the example wasn't technically changed in this commit.)

well that's not where each setting is accurately described, it's just a summary. The full comprehensive setting link is given above.
Nevertheless I'm adding links to out loki install guides here as well.

stleerh · 2022-11-30T17:34:36Z

README.md

+
+#### With Kafka
+
+More performance fine-tuning is possible when using Kafka, ie. with `spec.deploymentModel` set to `KAFKA`:


Suggest to use Kafka if you have bursty traffic.

done - but not here: rather in the "settings to pay attention to" section above where we already give some hints on why to use kafka

openshift-ci · 2022-12-02T09:49:27Z

New changes are detected. LGTM label has been removed.

jotak · 2022-12-02T09:52:30Z

/approve

openshift-ci · 2022-12-02T09:52:36Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jotak

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [jotak]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

…ng v5) Fixes netobserv#200

jotak requested review from mariomac, OlivierCazade, stleerh, memodi and jpinsonneau November 22, 2022 17:24

openshift-ci bot added the do-not-merge/work-in-progress label Nov 22, 2022

jotak commented Nov 22, 2022

View reviewed changes

memodi requested changes Nov 22, 2022

View reviewed changes

openshift-ci bot assigned memodi Nov 22, 2022

jotak force-pushed the refresh-doc branch from c03f381 to c98cef2 Compare November 23, 2022 13:09

mariomac approved these changes Nov 24, 2022

View reviewed changes

openshift-ci bot assigned mariomac Nov 24, 2022

openshift-ci bot added the lgtm label Nov 24, 2022

jotak force-pushed the refresh-doc branch from e1fa556 to b9c1df2 Compare November 24, 2022 17:46

openshift-ci bot removed the lgtm label Nov 24, 2022

eranra reviewed Nov 27, 2022

View reviewed changes

README.md Outdated Show resolved Hide resolved

eranra reviewed Nov 27, 2022

View reviewed changes

README.md Outdated Show resolved Hide resolved

eranra reviewed Nov 27, 2022

View reviewed changes

README.md Outdated Show resolved Hide resolved

eranra reviewed Nov 27, 2022

View reviewed changes

jotak added 2 commits November 28, 2022 10:50

NETOBSERV-719 Documentation update

470bc44

- Update configuration section (quick-filters, exporters) - New section "performance fine-tuning" - New section "securing data and communications"

Quick filters doc

c50b20f

OlivierCazade approved these changes Nov 28, 2022

View reviewed changes

openshift-ci bot assigned OlivierCazade Nov 28, 2022

openshift-ci bot added the lgtm label Nov 28, 2022

readme feedback

97945d2

jotak force-pushed the refresh-doc branch from b9c1df2 to 97945d2 Compare November 28, 2022 12:08

openshift-ci bot removed the lgtm label Nov 28, 2022

jotak marked this pull request as ready for review November 28, 2022 14:37

openshift-ci bot removed the do-not-merge/work-in-progress label Nov 28, 2022

Link to description

c54d0e8

jotak mentioned this pull request Nov 30, 2022

NETOBSERV-727 Update bundle process #205

Merged

eranra approved these changes Nov 30, 2022

View reviewed changes

openshift-ci bot assigned eranra Nov 30, 2022

openshift-ci bot added the lgtm label Nov 30, 2022

stleerh reviewed Nov 30, 2022

View reviewed changes

jpinsonneau approved these changes Dec 1, 2022

View reviewed changes

openshift-ci bot assigned jpinsonneau Dec 1, 2022

openshift-ci bot removed the lgtm label Dec 2, 2022

Steven's feedback

bfa6016

jotak force-pushed the refresh-doc branch from a2dab3c to bfa6016 Compare December 2, 2022 09:50

openshift-ci bot added the approved label Dec 2, 2022

jotak merged commit 5dc099b into netobserv:main Dec 2, 2022

KalmanMeth pushed a commit to KalmanMeth/network-observability-operator that referenced this pull request Feb 13, 2023

NETOBSERV-341 configure netflow v5 separately (breaking change if usi…

2051cf5

…ng v5) Fixes netobserv#200


		In addition to sampling, other settings can help you get an optimal setup without compromising on the observability.

		Here is what you should pay attention to:


		- `spec.loki.batchWait` and `spec.loki.batchSize` control the batching mechanism, ie. how often data is flushed out to Loki. Like in the eBPF agent batching, higher values will generate fewer traffic and consume less CPU, however it will increase a bit the memory consumption of `flowlogs-pipeline`, and may increase a bit collection latency.

		- `spec.loki.minBackoff`, `spec.loki.maxBackoff` and `spec.loki.maxRetries` control the retry mechanism. Retries may happen when Loki is unreachable or when it returns errors. Often, it is due to the rate limits configured on Loki server. When such situation occurs, it might not always be the best solution to increase rate limits (on server configuration side) or to increase retries. Increasing rate limits will put more pressure on Loki, so expect more memory and CPU usage, and also more traffic. Increasing retries will put more pressure on `flowlogs-pipeline`, as it will retain data for longer and accumulate more flows to send. When all the retry attempts fail, flows are simply dropped. Flow drops are counted in the metric `netobserv_loki_dropped_entries_total`.


		NetObserv is currently meant to be used by cluster / network admins. There is not yet multi-tenancy implemented, which means a user having access to the data will be able to see all the data. It is up to the cluster admins to restrict the access to the NetObserv and Loki namespaces and services according to their security policy.

		To make authenticated queries to Loki, set `spec.loki.authToken` to `FORWARD`: the console plugin will then forward the logged-in Console user token to Loki.

		@@ -69,21 +69,31 @@ _Pre-requisite: OpenShift 4.10 or above_


		![Flow table](./docs/assets/network-traffic-main.png)
		This dashboard shows overall, aggregated statistics on the cluster traffic. The stats can be refined with comprehensive filtering and display options. Different levels of aggregations are available: per node, per namespace, per owner or per pod/service). It allows to identify biggest talkers in different contexts: for instance, top X inter-namespace flows, or top X pod-to-pod flows within a namespace, etc.


		![Overview](./docs/assets/overview-dashboard.png)

		#### Topology


		These views are accessible directly from the main menu, and also as contextual tabs for any Pod, Deployment, Service (etc.) in their details page, with preset filters to focus on that resource.

		![Contextual topology](./docs/assets/topology-pod.png)


		- Kafka (`spec.deploymentModel: KAFKA` and `spec.kafka`): when enabled, integrates the flow collection pipeline with Kafka, by splitting ingestion from transformation (kube enrichment, derived metrics, ...). Kafka can provide better scalability, resiliency and high availability ([view more details](https://www.redhat.com/en/topics/integration/what-is-apache-kafka)). Assumes Kafka is already deployed and a topic is created. For convenience, we provide a quick deployment using [strimzi](https://strimzi.io/): run `make deploy-kafka` from the repository.

		- Exporters (`spec.exporters`, _experimental_) an optional list of exporters to which to send enriched flows. Currently only KAFKA is supported. This allows you to define any custom storage or processing that can read from Kafka. This feature is flagged as _experimental_ as it has not been thoroughly or stress tested yet, so use at your own risks.


		#### With Kafka

		More performance fine-tuning is possible when using Kafka, ie. with `spec.deploymentModel` set to `KAFKA`:

NETOBSERV-719 Documentation update #200

NETOBSERV-719 Documentation update #200

Conversation

jotak commented Nov 22, 2022

openshift-ci bot commented Nov 22, 2022

jotak commented Nov 22, 2022 • edited Loading

memodi commented Nov 22, 2022

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jotak commented Nov 25, 2022

eranra Nov 27, 2022 • edited Loading

Choose a reason for hiding this comment

jotak Nov 28, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

eranra left a comment

Choose a reason for hiding this comment

OlivierCazade left a comment

Choose a reason for hiding this comment

jotak commented Nov 28, 2022

eranra commented Nov 28, 2022

jotak commented Nov 28, 2022

jotak commented Nov 28, 2022

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jotak Dec 2, 2022 • edited Loading

Choose a reason for hiding this comment

openshift-ci bot commented Dec 2, 2022

jotak commented Dec 2, 2022

openshift-ci bot commented Dec 2, 2022

jotak commented Nov 22, 2022 •

edited

Loading

eranra Nov 27, 2022 •

edited

Loading

jotak Nov 28, 2022 •

edited

Loading

jotak Dec 2, 2022 •

edited

Loading