NETOBSERV-719 Documentation update #200
Conversation
jotak commented on Nov 22, 2022
- Update configuration section (quick-filters, exporters)
- New section "performance fine-tuning"
- New section "securing data and communications"
Skipping CI for Draft Pull Request.
TODO:
PR netobserv/documents#35 adds alert.
In addition to sampling, other settings can help you get an optimal setup without compromising observability.

Here is what you should pay attention to:
@stleerh text here / below overlaps a bit with your recommendation doc, so we should align them; I'm waiting for your comments and suggestions! :-)
Specifically on the Loki side, these are the only parameters that can be tweaked if you use the Loki Operator:
- `ingestionBurstSize`
- `ingestionRate`
- `maxGlobalStreamsPerTenant`
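For illustration, here is a LokiStack excerpt showing where those limits live (a sketch: the `apiVersion` and exact field paths should be checked against your Loki Operator version, and the values are arbitrary examples, not recommendations):

```yaml
apiVersion: loki.grafana.com/v1
kind: LokiStack
metadata:
  name: lokistack
spec:
  limits:
    global:
      ingestion:
        ingestionBurstSize: 40            # in MB
        ingestionRate: 20                 # in MB per second
        maxGlobalStreamsPerTenant: 25000  # active streams allowed per tenant
```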
- `spec.loki.batchWait` and `spec.loki.batchSize` control the batching mechanism, i.e. how often data is flushed out to Loki. As with the eBPF agent batching, higher values generate less traffic and consume less CPU, but they slightly increase the memory consumption of `flowlogs-pipeline` and may add a little collection latency.

- `spec.loki.minBackoff`, `spec.loki.maxBackoff` and `spec.loki.maxRetries` control the retry mechanism. Retries may happen when Loki is unreachable or returns errors, often due to the rate limits configured on the Loki server. When such a situation occurs, increasing rate limits (on the server configuration side) or increasing retries is not always the best solution. Increasing rate limits puts more pressure on Loki, so expect more memory and CPU usage, as well as more traffic. Increasing retries puts more pressure on `flowlogs-pipeline`, as it retains data for longer and accumulates more flows to send. When all the retry attempts fail, flows are simply dropped. Flow drops are counted in the metric `netobserv_loki_dropped_entries_total`.
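As a sketch, the settings discussed in these two bullets sit together under `spec.loki` in the `FlowCollector` resource (the values below are illustrative, not recommendations):

```yaml
apiVersion: flows.netobserv.io/v1alpha1
kind: FlowCollector
metadata:
  name: cluster
spec:
  loki:
    # Batching: how often data is flushed out to Loki
    batchWait: 1s
    batchSize: 102400   # in bytes
    # Retries: behavior when Loki is unreachable or rate-limits ingestion
    minBackoff: 1s
    maxBackoff: 5s
    maxRetries: 2
```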
Instead of "will put pressure", I suggest saying "cause high memory utilization".
It's not just memory: increasing flow processing also increases CPU, network usage, etc., right?
README.md
Outdated
NetObserv is currently meant to be used by cluster / network admins. Multi-tenancy is not yet implemented, which means a user having access to the data will be able to see *all* the data. It is up to the cluster admins to restrict access to the NetObserv and Loki namespaces and services according to their security policy.

To make authenticated queries to Loki, set `spec.loki.authToken` to `FORWARD`: the console plugin will then forward the logged-in Console user token to Loki.
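A minimal `FlowCollector` excerpt for that setting (a sketch of just the relevant field):

```yaml
spec:
  loki:
    authToken: FORWARD   # forward the logged-in Console user token to Loki
```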
Should we link the RBAC configs here that we'd need, depending on whether it's FORWARD or HOST?
done
This PR should be merged first: netobserv/documents#37

@@ -69,21 +69,31 @@ _Pre-requisite: OpenShift 4.10 or above_
I suggest adding explanations on the flags, in particular the required ones:

```
$ go run ./main.go --help
Usage of /tmp/go-build1418114459/b001/exe/main:
  -console-plugin-image string
        The image of the Console Plugin
  -ebpf-agent-image string
        The image of the eBPF agent
  -flowlogs-pipeline-image string
        The image of Flowlogs Pipeline
  -health-probe-bind-address string
        The address the probe endpoint binds to. (default ":8081")
  -kubeconfig string
        Paths to a kubeconfig. Only required if out-of-cluster.
  -leader-elect
        Enable leader election for controller manager. Enabling this will ensure there is only one active controller manager.
  -metrics-bind-address string
        The address the metric endpoint binds to. (default ":8080")
  -v    print version
  -zap-devel
        Development Mode defaults(encoder=consoleEncoder,logLevel=Debug,stackTraceLevel=Warn). Production Mode defaults(encoder=jsonEncoder,logLevel=Info,stackTraceLevel=Error) (default true)
  -zap-encoder value
        Zap log encoding (one of 'json' or 'console')
  -zap-log-level value
        Zap Level to configure the verbosity of logging. Can be one of 'debug', 'info', 'error', or any integer value > 0 which corresponds to custom debug levels of increasing verbosity
  -zap-stacktrace-level value
        Zap Level at and above which stacktraces are captured (one of 'info', 'error', 'panic').
  -zap-time-encoding value
        Zap time encoding (one of 'epoch', 'millis', 'nano', 'iso8601', 'rfc3339' or 'rfc3339nano'). Defaults to 'epoch'.
```
I think it's not really for this README.md, which is focused on running the operator from the end-user perspective, but maybe more for DEVELOPMENT.md.
ok ... agree
README.md
Outdated
![Flow table](./docs/assets/network-traffic-main.png)

This dashboard shows overall, aggregated statistics on the cluster traffic. The stats can be refined with comprehensive filtering and display options. Different levels of aggregation are available: per node, per namespace, per owner, or per pod/service. It helps identify the biggest talkers in different contexts: for instance, the top X inter-namespace flows, or the top X pod-to-pod flows within a namespace.
s/traffic/network
s/statistics/metrics
![Overview](./docs/assets/overview-dashboard.png)

#### Topology
I think that both here and in the overview above, it makes sense to describe the ability to refresh the view at a regular interval to get a near-live view of the information.
done
These views are accessible directly from the main menu, and also as contextual tabs for any Pod, Deployment, Service (etc.) in their details page, with preset filters to focus on that resource.

![Contextual topology](./docs/assets/topology-pod.png)
I suggest showing something that is not connected; showing Loki as the pod might confuse.
I think it's okay, but if you want an alternative, then you can filter on "ingress" to get traffic going to/from the router.
_Coming soon_

### Grafana
Now that the metrics are available in the console, should we move the Grafana explanation to the hack folder or somewhere developer/advanced, and out of this README?
Grafana might still be the main argument, especially for non-OpenShift users, even if we haven't contributed much to that part yet. So I'd prefer to keep it visible.
OK, although the console also has a non-OCP incarnation that we can connect in non-OCP scenarios. My main reason to talk about that is that the README is becoming a bit long.
@jotak I think that the most complicated prerequisite for NetObserv is Loki, and we will get a lot of confusion and complaints on documentation. The sentence here https://github.com/netobserv/network-observability-operator/pull/200/files#diff-b335630551682c19a781afebcf4d07bf978fb1f8ac04c6bf87428ed5106870f5R17 is very, very challenging for someone who tries to deploy Loki. Can you maybe add a link here to installation instructions?
- Update configuration section (quick-filters, exporters)
- New section "performance fine-tuning"
- New section "securing data and communications"
Thanks!
@eranra sorry your link is broken, can you tell me what is the very-very challenging sentence about Loki?
@jotak sorry about the link; the sentence I was referring to is "Please read the operator description. You will need to install Loki; some instructions are provided there."
- Add doc/links on Loki installation
- Remove description from CSV, replaced with external description in markdown files, injected at bundle-build time. It offers several advantages: 1/ easier to read (e.g. in an IDE markdown viewer), 2/ easier to compare upstream/downstream, 3/ ability to update doc in a single PR rather than having to split work in upstream + midstream
- Update Makefile bundle build to inject markdown description
- Fix broken regex in makefile for links update
Last commit introduces mainly two things:
@eranra ah, but I didn't plan to change that... That's to avoid duplication, you know? 😄
.PHONY: bundle-for-test
bundle-for-test: bundle-prepare
	$(KUSTOMIZE) build config/manifests | $(SED) -e 's~:container-image:~$(IMG)~' | $(SED) -e 's~:created-at:~$(DATE)~' | $(OPSDK) generate bundle -q --overwrite --version 0.0.0-$(VERSION) $(BUNDLE_METADATA_OPTS)
	$(SED) -e 's/^/ /' config/manifests/bases/$(CSV_DESC) > tmp-desc
Add `trap "rm -f tmp-desc" EXIT` to clean up temp files. This also cleans up if someone presses Ctrl-C to abort. Better to use `tmp-desc.$$`.
TIL :)
Actually, it doesn't work with `make` (it needs workarounds). I'd go ahead with my ugly way, but feel free to clean this out if you find a better way! :)
	| $(SED) -e "/':full-description:'/r tmp-desc" \
	| $(SED) -e "s/':full-description:'/|\-/" \
	| $(OPSDK) generate bundle -q --overwrite --version $(BUNDLE_VERSION) $(BUNDLE_METADATA_OPTS)
	rm tmp-desc
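For illustration, here is the shape of that indentation step in isolation (a sketch: `desc.md` is a hypothetical stand-in for `config/manifests/bases/$(CSV_DESC)`, and the sed rule in the Makefile above appears to have lost its original spacing in this page; the real number of spaces must match the YAML nesting of the CSV description field):

```shell
# Stand-in for the markdown description file injected into the CSV.
printf 'NetObserv operator\n\nSome **markdown** description.\n' > desc.md
# Same transform as the Makefile rule: prefix every line with spaces so the
# markdown nests correctly under the CSV's YAML description key.
sed -e 's/^/        /' desc.md > tmp-desc
cat tmp-desc
```

The indented file is then spliced into the manifest with sed's `r` command, as the pipeline above does.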
Remove line 258 if you have 'trap' above.
README.md
Outdated
@@ -69,21 +69,33 @@ _Pre-requisite: OpenShift 4.10 or above_

If the OpenShift Console is detected in the cluster, a console plugin is deployed when a `FlowCollector` is installed. It adds new pages and tabs to the console:

- A flow table, with powerful filtering and display options

#### Overview dashboard
I thought we were not going to use "dashboard" in Network Traffic to avoid the confusion with the "Dashboards" panel. My suggestion was to call it "Graphs" or possibly "Metrics".
So I'd rather go with "metrics", ok
(or charts)
README.md
Outdated
- Kafka (`spec.deploymentModel: KAFKA` and `spec.kafka`): when enabled, integrates the flow collection pipeline with Kafka, by splitting ingestion from transformation (kube enrichment, derived metrics, ...). Kafka can provide better scalability, resiliency and high availability ([view more details](https://www.redhat.com/en/topics/integration/what-is-apache-kafka)). This assumes Kafka is already deployed and a topic is created. For convenience, we provide a quick deployment using [strimzi](https://strimzi.io/): run `make deploy-kafka` from the repository.

- Exporters (`spec.exporters`, _experimental_): an optional list of exporters to which to send enriched flows. Currently only KAFKA is supported. This allows you to define any custom storage or processing that can read from Kafka. This feature is flagged as _experimental_ as it has not been thoroughly or stress tested yet, so use it at your own risk.
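For illustration, an exporter definition could look like this (a sketch: the address and topic values are hypothetical placeholders, not defaults):

```yaml
spec:
  exporters:
  - type: KAFKA
    kafka:
      address: "kafka-cluster-kafka-bootstrap.netobserv"   # placeholder bootstrap address
      topic: "netobserv-flows-export"                      # placeholder topic name
```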
Change: "risk" (singular).
- Loki (`spec.loki`): configure here how to reach Loki. The default values match the Loki quick install paths mentioned in the _Getting Started_ section, but you may have to configure it differently if you used another installation method.

- Kafka (`spec.kafka`): when enabled, integrates the flow collection pipeline with Kafka, by splitting ingestion from transformation (kube enrichment, derived metrics, ...). Kafka can provide better scalability, resiliency and high availability ([view more details](https://www.redhat.com/en/topics/integration/what-is-apache-kafka)). This assumes Kafka is already deployed and a topic is created. For convenience, we provide a quick deployment using [strimzi](https://strimzi.io/): run `make deploy-kafka` from the repository.
- Quick filters (`spec.consolePlugin.quickFilters`): configure preset filters to be displayed in the Console plugin. They offer a way to quickly switch from one set of filters to another, such as showing or hiding the pods network, infrastructure network, or applications network. They can be tuned to reflect the different workloads running on your cluster. For a list of available filters, [check this page](./docs/QuickFilters.md).
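As an illustration, a quick filter definition might look like this (a sketch based on the shape described in docs/QuickFilters.md; the filter keys and values are examples to adapt to your cluster):

```yaml
spec:
  consolePlugin:
    quickFilters:
    - name: Applications
      default: true           # applied on load
      filter:
        # "!" negates the match: exclude infra namespaces (example values)
        src_namespace!: 'openshift-,netobserv'
        dst_namespace!: 'openshift-,netobserv'
```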
In general, it should be clearer where these settings are in the Flow Collector UI. For example, it's not clear that spec.loki refers to the field called "querierUrl". (I know the example wasn't technically changed in this commit.)
Well, that's not where each setting is accurately described; it's just a summary. The full comprehensive settings link is given above. Nevertheless, I'm adding links to our Loki install guides here as well.
#### With Kafka

More performance fine-tuning is possible when using Kafka, i.e. with `spec.deploymentModel` set to `KAFKA`:
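For instance, a sketch of a Kafka setup with scaled-out consumers (field names such as `kafkaConsumerReplicas` are assumed from the FlowCollector API of that era and should be verified against your CRD version; values are illustrative):

```yaml
spec:
  deploymentModel: KAFKA
  kafka:
    address: "kafka-cluster-kafka-bootstrap.netobserv"   # placeholder bootstrap address
    topic: "network-flows"                               # placeholder topic name
  processor:
    kafkaConsumerReplicas: 3   # scale flowlogs-pipeline consumers independently of the agents
```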
Suggest to use Kafka if you have bursty traffic.
Done, but not here: rather in the "settings to pay attention to" section above, where we already give some hints on why to use Kafka.
New changes are detected. LGTM label has been removed.
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: jotak. The full list of commands accepted by this bot can be found here. The pull request process is described here.