
NETOBSERV-719 Documentation update #200

Merged: 8 commits, Dec 2, 2022
Conversation

jotak
Member

@jotak jotak commented Nov 22, 2022

  • Update configuration section (quick-filters, exporters)
  • New section "performance fine-tuning"
  • New section "securing data and communications"

@openshift-ci

openshift-ci bot commented Nov 22, 2022

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@jotak
Member Author

jotak commented Nov 22, 2022

TODO:

  • Screenshots updates/new
  • Alert link
  • Quick filters reference (NETOBSERV-682)
  • More info on loki installation via the operator
  • CSV up/downstream handling

@memodi
Contributor

memodi commented Nov 22, 2022

PR netobserv/documents#35 adds alert.


In addition to sampling, other settings can help you get an optimal setup without compromising on observability.

Here is what you should pay attention to:
Member Author

@stleerh text here / below overlaps a bit with your recommendation doc, so we should align them ; I'm waiting for your comments and suggestions! :-)

Contributor

Specifically on the Loki side, these are the only parameters that can be tweaked if you use Loki Operator.

  • ingestionBurstSize
  • ingestionRate
  • maxGlobalStreamsPerTenant
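
For reference, a sketch of where these limits sit in a `LokiStack` resource (field paths and units are assumptions based on the Loki Operator CRD; verify against your operator version):

```yaml
apiVersion: loki.grafana.com/v1
kind: LokiStack
metadata:
  name: loki
  namespace: netobserv
spec:
  size: 1x.small
  limits:
    global:
      ingestion:
        ingestionBurstSize: 40           # assumed unit: MB
        ingestionRate: 20                # assumed unit: MB/s
        maxGlobalStreamsPerTenant: 25000
```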


- `spec.loki.batchWait` and `spec.loki.batchSize` control the batching mechanism, i.e. how often data is flushed to Loki. As with the eBPF agent batching, higher values generate less traffic and consume less CPU, but they slightly increase the memory consumption of `flowlogs-pipeline` and may add a little collection latency.

- `spec.loki.minBackoff`, `spec.loki.maxBackoff` and `spec.loki.maxRetries` control the retry mechanism. Retries may happen when Loki is unreachable or returns errors, which is often due to the rate limits configured on the Loki server. When such a situation occurs, increasing rate limits (on the server configuration side) or increasing retries is not always the best solution. Increasing rate limits puts more pressure on Loki, so expect more memory and CPU usage, as well as more traffic. Increasing retries puts more pressure on `flowlogs-pipeline`, as it retains data for longer and accumulates more flows to send. When all retry attempts fail, flows are simply dropped. Dropped flows are counted in the `netobserv_loki_dropped_entries_total` metric.
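
As a minimal sketch, these settings live under `spec.loki` in the `FlowCollector` resource (values below are illustrative only, and the API version may differ between releases):

```yaml
apiVersion: flows.netobserv.io/v1alpha1
kind: FlowCollector
metadata:
  name: cluster
spec:
  loki:
    # Batching: flush to Loki after 1s, or once ~100 KB is accumulated
    batchWait: 1s
    batchSize: 102400
    # Retries: backoff bounded between 1s and 5s, at most 2 retries
    minBackoff: 1s
    maxBackoff: 5s
    maxRetries: 2
```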
Contributor

Instead of "will put pressure", I suggest saying "cause high memory utilization".

Member Author

it's not just memory: increasing flow processing also increases CPU, network usage, etc., right?

README.md Outdated

NetObserv is currently meant to be used by cluster / network admins. Multi-tenancy is not implemented yet, which means a user with access to the data is able to see *all* the data. It is up to cluster admins to restrict access to the NetObserv and Loki namespaces and services according to their security policy.

To make authenticated queries to Loki, set `spec.loki.authToken` to `FORWARD`: the console plugin will then forward the logged-in Console user token to Loki.
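
For instance (a hypothetical sketch; the Loki URL depends on your installation):

```yaml
spec:
  loki:
    # Example gateway URL; adjust to your Loki deployment
    url: https://loki-gateway-http.netobserv.svc:8080/api/logs/v1/network
    # FORWARD: the console plugin forwards the logged-in user's token to Loki
    authToken: FORWARD
```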
Contributor

should we link the RBAC configs here that we'd need depending on whether it's FORWARD or HOST?

Member Author

done

@jotak
Member Author

jotak commented Nov 25, 2022

This PR should be merged first: netobserv/documents#37
once it is, I will update this one with new links

@@ -69,21 +69,31 @@ _Pre-requisite: OpenShift 4.10 or above_

Contributor

@eranra eranra Nov 27, 2022


@jotak @stleerh

I suggest adding explanations on the flag, and in particular the required ones ::

$ go run ./main.go --help
Usage of /tmp/go-build1418114459/b001/exe/main:
  -console-plugin-image string
        The image of the Console Plugin
  -ebpf-agent-image string
        The image of the eBPF agent
  -flowlogs-pipeline-image string
        The image of Flowlogs Pipeline
  -health-probe-bind-address string
        The address the probe endpoint binds to. (default ":8081")
  -kubeconfig string
        Paths to a kubeconfig. Only required if out-of-cluster.
  -leader-elect
        Enable leader election for controller manager. Enabling this will ensure there is only one active controller manager.
  -metrics-bind-address string
        The address the metric endpoint binds to. (default ":8080")
  -v    print version
  -zap-devel
        Development Mode defaults(encoder=consoleEncoder,logLevel=Debug,stackTraceLevel=Warn). Production Mode defaults(encoder=jsonEncoder,logLevel=Info,stackTraceLevel=Error) (default true)
  -zap-encoder value
        Zap log encoding (one of 'json' or 'console')
  -zap-log-level value
        Zap Level to configure the verbosity of logging. Can be one of 'debug', 'info', 'error', or any integer value > 0 which corresponds to custom debug levels of increasing verbosity
  -zap-stacktrace-level value
        Zap Level at and above which stacktraces are captured (one of 'info', 'error', 'panic').
  -zap-time-encoding value
        Zap time encoding (one of 'epoch', 'millis', 'nano', 'iso8601', 'rfc3339' or 'rfc3339nano'). Defaults to 'epoch'.

Member Author

@jotak jotak Nov 28, 2022


I think it's not really for this README.md, which is focused on running the operator from the end-user perspective, but maybe more for DEVELOPMENT.md.

Contributor

ok ... agree

README.md Outdated

![Flow table](./docs/assets/network-traffic-main.png)
This dashboard shows overall, aggregated statistics on the cluster traffic. The stats can be refined with comprehensive filtering and display options. Different levels of aggregation are available: per node, per namespace, per owner, or per pod/service. This makes it possible to identify the biggest talkers in different contexts: for instance, the top X inter-namespace flows, or the top X pod-to-pod flows within a namespace, etc.
Contributor

s/traffic/network

Contributor

s/statistics/metrics


![Overview](./docs/assets/overview-dashboard.png)

#### Topology

Contributor

I think that both here and in the overview above it makes sense to describe the ability to refresh the view at regular intervals, to get a near-live view of the information.

Member Author

done


These views are accessible directly from the main menu, and also as contextual tabs for any Pod, Deployment, Service (etc.) in their details page, with preset filters to focus on that resource.

![Contextual topology](./docs/assets/topology-pod.png)
Contributor

I suggest showing something that is not connected --- showing Loki as the pod might be confusing.

Contributor

I think it's okay, but if you want an alternative, then you can filter on "ingress" to get traffic going to/from the router.


_Coming soon_

### Grafana
Contributor

Now that the metrics are available in the console, should we move the Grafana explanation to the hack folder, or somewhere developer/advanced, and out of this README?

Member Author

Grafana might still be the main argument, especially for non-OpenShift users, even if we haven't contributed much to that part yet. So I'd prefer to keep it visible.

Contributor

ok --- although the console also has a non-OCP incarnation that we can connect in non-OCP scenarios. My main reason for raising this is that the README is becoming a bit long.

Contributor

@eranra eranra left a comment


@jotak I think that the most complicated prereq for netobserv is Loki --- and we will get a lot of confusion and complaints on documentation. The sentence here https://github.com/netobserv/network-observability-operator/pull/200/files#diff-b335630551682c19a781afebcf4d07bf978fb1f8ac04c6bf87428ed5106870f5R17 is very challenging for someone trying to deploy Loki --- can you maybe add a link here to installation instructions?

- Update configuration section (quick-filters, exporters)
- New section "performance fine-tuning"
- New section "securing data and communications"
Contributor

@OlivierCazade OlivierCazade left a comment


Thanks!

@jotak
Member Author

jotak commented Nov 28, 2022

@eranra sorry, your link is broken; can you tell me what the very challenging sentence about Loki is?
Note: I am still planning to add more info on Loki installation; that's still WIP.

@eranra
Contributor

eranra commented Nov 28, 2022

@jotak sorry about the link --- the sentence I was referring to is "Please read the operator description. You will need to install Loki; some instructions are provided there."

- Add doc/links on Loki installation
- Remove description from CSV, replaced with an external description in
  markdown files, injected at bundle-build time. This offers several
  advantages: 1/ easier to read (e.g. in an IDE markdown viewer), 2/ easier
  to compare upstream/downstream, 3/ ability to update docs in a single PR
  rather than having to split work into upstream + midstream
- Update Makefile bundle build to inject markdown description
- Fix broken regex in makefile for links update
@jotak jotak marked this pull request as ready for review November 28, 2022 14:37
@jotak
Member Author

jotak commented Nov 28, 2022

Last commit introduces mainly two things:

  • Modified CSV description, in particular with more info about Loki installation
  • New mechanism to inject markdown description into CSV, with upstream and downstream descriptions available here

@jotak
Member Author

jotak commented Nov 28, 2022

@eranra ah, but I didn't plan to change that... That's to avoid duplication, you know? 😄
Though with my last commit I've externalized the CSV description into another markdown file, so I could add a link to the relevant section now

.PHONY: bundle-for-test
bundle-for-test: bundle-prepare
$(KUSTOMIZE) build config/manifests | $(SED) -e 's~:container-image:~$(IMG)~' | $(SED) -e 's~:created-at:~$(DATE)~' | $(OPSDK) generate bundle -q --overwrite --version 0.0.0-$(VERSION) $(BUNDLE_METADATA_OPTS)
$(SED) -e 's/^/ /' config/manifests/bases/$(CSV_DESC) > tmp-desc
Contributor

Add `trap "rm -f tmp-desc" EXIT` to clean up temp files. This also cleans up if someone presses Ctrl-C to abort.

Better to use `tmp-desc.$$`.

Member Author

TIL :)

Member Author

Actually, it doesn't work with make (it needs workarounds). I'll go ahead with my ugly way, but feel free to clean this up if you find a better way! :)

| $(SED) -e "/':full-description:'/r tmp-desc" \
| $(SED) -e "s/':full-description:'/|\-/" \
| $(OPSDK) generate bundle -q --overwrite --version $(BUNDLE_VERSION) $(BUNDLE_METADATA_OPTS)
rm tmp-desc
Contributor

Remove line 258 if you have 'trap' above.

README.md Outdated
@@ -69,21 +69,33 @@ _Pre-requisite: OpenShift 4.10 or above_

If the OpenShift Console is detected in the cluster, a console plugin is deployed when a `FlowCollector` is installed. It adds new pages and tabs to the console:

- A flow table, with powerful filtering and display options
#### Overview dashboard
Contributor

I thought we were not going to use "dashboard" in Network Traffic to avoid the confusion with the "Dashboards" panel.  My suggestion was to call it "Graphs" or possibly "Metrics".

Member Author

So I'd rather go with "metrics", ok

Member Author

(or charts)



README.md Outdated

- Kafka (`spec.deploymentModel: KAFKA` and `spec.kafka`): when enabled, integrates the flow collection pipeline with Kafka, by splitting ingestion from transformation (kube enrichment, derived metrics, ...). Kafka can provide better scalability, resiliency and high availability ([view more details](https://www.redhat.com/en/topics/integration/what-is-apache-kafka)). Assumes Kafka is already deployed and a topic is created. For convenience, we provide a quick deployment using [strimzi](https://strimzi.io/): run `make deploy-kafka` from the repository.

- Exporters (`spec.exporters`, _experimental_) an optional list of exporters to which to send enriched flows. Currently only KAFKA is supported. This allows you to define any custom storage or processing that can read from Kafka. This feature is flagged as _experimental_ as it has not been thoroughly or stress tested yet, so use at your own risks.
Contributor

Change: "risk" (singular).


- Loki (`spec.loki`): configure here how to reach Loki. The default values match the Loki quick install paths mentioned in the _Getting Started_ section, but you may have to configure differently if you used another installation method.

- Kafka (`spec.kafka`): when enabled, integrate the flow collection pipeline with Kafka, by splitting ingestion from transformation (kube enrichment, derived metrics, ...). Kafka can provide better scalability, resiliency and high availability ([view more details](https://www.redhat.com/en/topics/integration/what-is-apache-kafka)). Assumes Kafka is already deployed and a topic is created. For convenience, we provide a quick deployment using [strimzi](https://strimzi.io/): run `make deploy-kafka` from the repository.
- Quick filters (`spec.consolePlugin.quickFilters`): configure preset filters to be displayed in the Console plugin. They offer a way to quickly switch from one set of filters to another, such as showing or hiding the pod network, infrastructure network, or application network. They can be tuned to reflect the different workloads running on your cluster. For a list of available filters, [check this page](./docs/QuickFilters.md).
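
As an illustration, a quick filter definition might look like this (a sketch; the filter keys and exact schema should be checked against the quick filters reference mentioned above):

```yaml
spec:
  consolePlugin:
    quickFilters:
    - name: Applications
      # Hide infrastructure namespaces; comma acts as OR, "!" suffix negates
      filter:
        src_namespace!: 'openshift-,netobserv'
        dst_namespace!: 'openshift-,netobserv'
      default: true
    - name: Infrastructure
      filter:
        src_namespace: 'openshift-,netobserv'
        dst_namespace: 'openshift-,netobserv'
```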
Contributor

In general, it should be clearer where these settings are in the Flow Collector UI. For example, it's not clear that spec.loki refers to the field called "querierUrl". (I know the example wasn't technically changed in this commit.)

Member Author

Well, that's not where each setting is accurately described; it's just a summary. The link to the full, comprehensive settings is given above.
Nevertheless, I'm adding links to our Loki install guides here as well.


#### With Kafka

More performance fine-tuning is possible when using Kafka, i.e. with `spec.deploymentModel` set to `KAFKA`:
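
A minimal sketch of the Kafka deployment model (the address and topic below are assumptions matching a strimzi-based setup such as the `make deploy-kafka` convenience target; adjust for your own Kafka deployment):

```yaml
spec:
  # Split ingestion from transformation via Kafka
  deploymentModel: KAFKA
  kafka:
    address: kafka-cluster-kafka-bootstrap.netobserv
    topic: network-flows
```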
Contributor

Suggest using Kafka if you have bursty traffic.

Member Author

@jotak jotak Dec 2, 2022


done - but not here: rather in the "settings to pay attention to" section above where we already give some hints on why to use kafka

@openshift-ci

openshift-ci bot commented Dec 2, 2022

New changes are detected. LGTM label has been removed.

@jotak
Member Author

jotak commented Dec 2, 2022

/approve

@openshift-ci

openshift-ci bot commented Dec 2, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jotak

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved label Dec 2, 2022
@jotak jotak merged commit 5dc099b into netobserv:main Dec 2, 2022
KalmanMeth pushed a commit to KalmanMeth/network-observability-operator that referenced this pull request Feb 13, 2023