Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NETOBSERV-1075 Dashboard feedback: split infra / applications #388

Merged
merged 12 commits into from
Jul 31, 2023

Conversation

jpinsonneau
Copy link
Contributor

@jpinsonneau jpinsonneau commented Jul 3, 2023

  • moved to embedded json approach instead of FLP for more flexibility
  • splitted application / infrastructures namespace & workload graphs
  • grouped by sections

image
image
image

@jpinsonneau jpinsonneau marked this pull request as ready for review July 4, 2023 11:07
@jotak jotak added the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Jul 4, 2023
@github-actions
Copy link

github-actions bot commented Jul 4, 2023

New images:

  • quay.io/netobserv/network-observability-operator:b1a5d4e
  • quay.io/netobserv/network-observability-operator-bundle:v0.0.0-b1a5d4e
  • quay.io/netobserv/network-observability-operator-catalog:v0.0.0-b1a5d4e

They will expire after two weeks.

Catalog source:

apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: netobserv-dev
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.io/netobserv/network-observability-operator-catalog:v0.0.0-b1a5d4e
  displayName: NetObserv development catalog
  publisher: Me
  updateStrategy:
    registryPoll:
      interval: 1m

@@ -326,14 +323,6 @@ func flowCollectorControllerSpecs() {
Expect(ds.Spec.Template.Spec.Tolerations).
To(ContainElement(v1.Toleration{Operator: v1.TolerationOpExists}))
})

By("Expecting the monitoring dashboards configmap to be deleted")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't realize that, but I guess that's the drawback of not using FLP: we loose the ability to tie dashboards with metrics presence. Hence this test not passing, I guess. Can we think about a simple alternative for that? Maybe it requires some discussion?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done in 71da325 with other enhancements described below

@github-actions github-actions bot removed the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Jul 4, 2023
@jpinsonneau
Copy link
Contributor Author

Updated this PRs to address all feedback gathered:

  • tie metrics to dashboards using tags. Check FilterDashboardRows helper
  • added missing metrics definitions
  • added / updated missing graphs in dashboards
    • moved namespace flows to health dashboard
  • added a new health flag (default) to ignore extra flows graphs per node / namespace / workload

Here are examples of dashboards with no filters:

image
image

@jotak jotak added the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Jul 13, 2023
// Available tags are: `egress`, `ingress`, `flows`, `bytes`, `packets`, `namespaces`, `nodes`, `workloads`.
//+kubebuilder:default:={"egress","packets"}
// Available tags are: `egress`, `ingress`, `flows`, `bytes`, `packets`, `namespaces`, `nodes`, `workloads`, `health`.
//+kubebuilder:default:={"egress","packets", "health"}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this would ignore health by default, this isn't what we want?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe a better naming can be found ?

The metrics flagged with health are only flows metrics definitions per nodes / namespaces / workloads

The health dashboard will look like as before by default

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(or health is not what I thought it is )

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like, at least for now, "health" fully overlaps with "flows" so we could keep just "flows"

@github-actions
Copy link

New images:

  • quay.io/netobserv/network-observability-operator:3e2639c
  • quay.io/netobserv/network-observability-operator-bundle:v0.0.0-3e2639c
  • quay.io/netobserv/network-observability-operator-catalog:v0.0.0-3e2639c

They will expire after two weeks.

To deploy this build:

# Direct deployment, from operator repo
IMAGE=quay.io/netobserv/network-observability-operator:3e2639c make deploy

# Or using operator-sdk
operator-sdk run bundle quay.io/netobserv/network-observability-operator-bundle:v0.0.0-3e2639c

Or as a Catalog Source:

apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: netobserv-dev
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.io/netobserv/network-observability-operator-catalog:v0.0.0-3e2639c
  displayName: NetObserv development catalog
  publisher: Me
  updateStrategy:
    registryPoll:
      interval: 1m

- name: namespace_egress_bytes_total
type: counter
valuekey: Bytes
filter:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After rebasing, you'll get the existing metrics changed like:

        filters:
        - key: FlowDirection
          value: "1"
        - key: Duplicate
          value: "false"

You'll need to make the same change on newly added metrics as well (for all metrics counting bytes or packets)


import "encoding/json"

func FilterDashboardRows(dashboard string, ignoreFlags []string) (string, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, if I'm correct, tags are duplicated: in every dashboard row (for dashboards filtering) and in metrics definitions (for FLP confgen). I guess it's ok for now but I think it could be error prone, especially as the dashboard jsons are very noisy and hard to review. So maybe we can think about how would could "secure" this, either for right now or for a follow-up. I can imagine:

  • modifying FLP confgen so that it gives us the list of excluded and included metrics; then based on that, we would do the dashboards filtering based on a grep (Contains) on the promql expression
  • or do the same thing without confgen, but it means the operator parsing the metrics definition files
  • or, since those dashboards are statically defined, we could also just write a unit test that validates that the dashboard tags correctly match the metric defs tags

wdyt?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw speaking of unit test I think it would be good to have one here that ensures the dashboard files are correctly parsed without errors

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes the first option would be the best. We can dig in that, maybe in a followup ?

The parsing is already ensured by this test
https://github.com/jpinsonneau/network-observability-operator/blob/71da325c421290cf515e8ef6aed34b6b60db0577/controllers/flowcollector_controller_test.go#L336-L348

Comment on lines 8 to 11
tags:
- flows
- namespaces
- health
Copy link
Member

@jotak jotak Jul 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm struggling with the tag system to be able to get exactly what we want; I think we would need this kind of thing, to allow more fine-grained selection:

Suggested change
tags:
- flows
- namespaces
- health
tags:
- flows
- namespaces
- namespaces-flows

wdyt?
(btw currently I removed health because it fully overlaps with flows .. I don't know if it may become useful again later)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so we will end up with:

  • nodes-flows
  • namespaces-flows
  • workloads-flows

that could be disabled separately for health dashboard. That makes sense to me

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@github-actions github-actions bot removed the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Jul 13, 2023
@codecov
Copy link

codecov bot commented Jul 13, 2023

Codecov Report

Merging #388 (69b1105) into main (0d6ebe8) will increase coverage by 1.66%.
Report is 1 commits behind head on main.
The diff coverage is 85.42%.

@@            Coverage Diff             @@
##             main     #388      +/-   ##
==========================================
+ Coverage   53.67%   55.33%   +1.66%     
==========================================
  Files          44       45       +1     
  Lines        5559     5824     +265     
==========================================
+ Hits         2984     3223     +239     
- Misses       2359     2381      +22     
- Partials      216      220       +4     
Flag Coverage Δ
unittests 55.33% <85.42%> (+1.66%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Changed Coverage Δ
api/v1alpha1/zz_generated.conversion.go 0.25% <0.00%> (-0.01%) ⬇️
api/v1beta1/flowcollector_types.go 100.00% <ø> (ø)
controllers/flowlogspipeline/flp_reconciler.go 67.85% <ø> (-5.32%) ⬇️
controllers/flowcollector_controller.go 53.71% <20.00%> (-1.41%) ⬇️
controllers/flowlogspipeline/flp_common_objects.go 81.21% <60.00%> (-0.56%) ⬇️
...ntrollers/flowlogspipeline/flp_monolith_objects.go 88.46% <75.00%> (ø)
...ontrollers/flowlogspipeline/flp_transfo_objects.go 93.68% <75.00%> (ø)
controllers/flowcollector_objects.go 46.91% <77.77%> (+9.62%) ⬆️
pkg/helper/dashboard.go 96.49% <96.49%> (ø)
api/v1beta1/zz_generated.deepcopy.go 55.36% <100.00%> (+1.58%) ⬆️
... and 5 more

... and 1 file with indirect coverage changes

@jotak jotak added the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Jul 17, 2023
@github-actions
Copy link

New images:

  • quay.io/netobserv/network-observability-operator:e2f53eb
  • quay.io/netobserv/network-observability-operator-bundle:v0.0.0-e2f53eb
  • quay.io/netobserv/network-observability-operator-catalog:v0.0.0-e2f53eb

They will expire after two weeks.

To deploy this build:

# Direct deployment, from operator repo
IMAGE=quay.io/netobserv/network-observability-operator:e2f53eb make deploy

# Or using operator-sdk
operator-sdk run bundle quay.io/netobserv/network-observability-operator-bundle:v0.0.0-e2f53eb

Or as a Catalog Source:

apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: netobserv-dev
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.io/netobserv/network-observability-operator-catalog:v0.0.0-e2f53eb
  displayName: NetObserv development catalog
  publisher: Me
  updateStrategy:
    registryPoll:
      interval: 1m

@github-actions github-actions bot removed the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Jul 17, 2023
@jotak jotak added the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Jul 17, 2023
@github-actions
Copy link

New images:

  • quay.io/netobserv/network-observability-operator:eb2201e
  • quay.io/netobserv/network-observability-operator-bundle:v0.0.0-eb2201e
  • quay.io/netobserv/network-observability-operator-catalog:v0.0.0-eb2201e

They will expire after two weeks.

To deploy this build:

# Direct deployment, from operator repo
IMAGE=quay.io/netobserv/network-observability-operator:eb2201e make deploy

# Or using operator-sdk
operator-sdk run bundle quay.io/netobserv/network-observability-operator-bundle:v0.0.0-eb2201e

Or as a Catalog Source:

apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: netobserv-dev
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.io/netobserv/network-observability-operator-catalog:v0.0.0-eb2201e
  displayName: NetObserv development catalog
  publisher: Me
  updateStrategy:
    registryPoll:
      interval: 1m

@github-actions github-actions bot removed the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Jul 17, 2023
@github-actions
Copy link

New images:

  • quay.io/netobserv/network-observability-operator:5493d75
  • quay.io/netobserv/network-observability-operator-bundle:v0.0.0-5493d75
  • quay.io/netobserv/network-observability-operator-catalog:v0.0.0-5493d75

They will expire after two weeks.

To deploy this build:

# Direct deployment, from operator repo
IMAGE=quay.io/netobserv/network-observability-operator:5493d75 make deploy

# Or using operator-sdk
operator-sdk run bundle quay.io/netobserv/network-observability-operator-bundle:v0.0.0-5493d75

Or as a Catalog Source:

apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: netobserv-dev
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.io/netobserv/network-observability-operator-catalog:v0.0.0-5493d75
  displayName: NetObserv development catalog
  publisher: Me
  updateStrategy:
    registryPoll:
      interval: 1m

@memodi
Copy link
Contributor

memodi commented Jul 21, 2023

@jpinsonneau @jotak - I am not sure what's the intention of "Application" panel here, based on looking at panels, I find difficult to spot differences between Application and Infrastructure.

image

I was thinking Application would be something that's NOT netobserv|openshift.* and to have view something like this:

image

wdyt?

@jotak
Copy link
Member

jotak commented Jul 24, 2023

Yes, we should exclude netobserv from applications, I agree

jpinsonneau and others added 9 commits July 24, 2023 18:22
Co-authored-by: Joel Takvorian <joel.takvorian@qaraywa.net>
Namespace-based metrics are included in workload's ones, so this would
result in unnecesary metrics creation when both of them are enabled

So namespaces tags is now disabled by default; to still show
namespace-based metrics in dashboard, we need to use the workload
metrics aggregated by namespace. So We need to duplicate the related
rows in the dashboard.
@openshift-ci openshift-ci bot removed the lgtm label Jul 24, 2023
@github-actions github-actions bot removed the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Jul 24, 2023
@jotak
Copy link
Member

jotak commented Jul 24, 2023

(FYI I just rebased the branch)

jotak and others added 3 commits July 24, 2023 18:24
Changed for a programmatic approach, to handle more easily json
manipulation and avoid the very large (and very redundant) dashboard json

Also change from using topk5 to topk10 as it renders better charts (less
chunks in timeseries)
Inject namespace for infra filters
@jpinsonneau
Copy link
Contributor Author

Tested back with @jotak changes and works fine:
image
image

@memodi note that @jotak excluded netobserv from Applications panels
image
I only see it appearing for kube api queries (default -> netobserv)

@jpinsonneau jpinsonneau added the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Jul 26, 2023
@github-actions
Copy link

New images:

  • quay.io/netobserv/network-observability-operator:9f7f8f3
  • quay.io/netobserv/network-observability-operator-bundle:v0.0.0-9f7f8f3
  • quay.io/netobserv/network-observability-operator-catalog:v0.0.0-9f7f8f3

They will expire after two weeks.

To deploy this build:

# Direct deployment, from operator repo
IMAGE=quay.io/netobserv/network-observability-operator:9f7f8f3 make deploy

# Or using operator-sdk
operator-sdk run bundle quay.io/netobserv/network-observability-operator-bundle:v0.0.0-9f7f8f3

Or as a Catalog Source:

apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: netobserv-dev
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.io/netobserv/network-observability-operator-catalog:v0.0.0-9f7f8f3
  displayName: NetObserv development catalog
  publisher: Me
  updateStrategy:
    registryPoll:
      interval: 1m

@jotak
Copy link
Member

jotak commented Jul 26, 2023

@memodi also another change that I did is to run a top10 instead of top5, because I found top5 has a tendency to often show different metrics for different timestamps, which results in "chunking" more the timeseries. With top10 I feel like the visualization is better / more consistent; sounds good?

(of course it entirely depends on the number and the nature of the workloads that people will be running on their clusters)

@memodi
Copy link
Contributor

memodi commented Jul 31, 2023

/ok-to-test

@jotak
Copy link
Member

jotak commented Jul 31, 2023

/approve
/lgtm

@openshift-ci openshift-ci bot added the lgtm label Jul 31, 2023
@openshift-ci
Copy link

openshift-ci bot commented Jul 31, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jotak

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit fe0ace8 into netobserv:main Jul 31, 2023
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved lgtm ok-to-test To set manually when a PR is safe to test. Triggers image build on PR.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants