NETOBSERV-1075 Dashboard feedback: split infra / applications #388
Conversation
jpinsonneau commented Jul 3, 2023 (edited)
- moved to an embedded JSON approach instead of FLP, for more flexibility
- split application / infrastructure namespace & workload graphs
- grouped by sections
New images:

They will expire after two weeks. Catalog source:

```yaml
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: netobserv-dev
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.io/netobserv/network-observability-operator-catalog:v0.0.0-b1a5d4e
  displayName: NetObserv development catalog
  publisher: Me
  updateStrategy:
    registryPoll:
      interval: 1m
```
```diff
@@ -326,14 +323,6 @@ func flowCollectorControllerSpecs() {
 		Expect(ds.Spec.Template.Spec.Tolerations).
 			To(ContainElement(v1.Toleration{Operator: v1.TolerationOpExists}))
 	})

 	By("Expecting the monitoring dashboards configmap to be deleted")
```
I didn't realize that, but I guess that's the drawback of not using FLP: we lose the ability to tie dashboards to metrics presence. Hence this test not passing, I guess. Can we think about a simple alternative for that? Maybe it requires some discussion?
done in 71da325 with other enhancements described below

Updated this PR to address all the feedback gathered.

Here are examples of dashboards with no filters:
api/v1beta1/flowcollector_types.go (outdated)

```diff
-// Available tags are: `egress`, `ingress`, `flows`, `bytes`, `packets`, `namespaces`, `nodes`, `workloads`.
-//+kubebuilder:default:={"egress","packets"}
+// Available tags are: `egress`, `ingress`, `flows`, `bytes`, `packets`, `namespaces`, `nodes`, `workloads`, `health`.
+//+kubebuilder:default:={"egress","packets", "health"}
```
this would ignore `health` by default; that isn't what we want, is it?
Maybe a better naming can be found? The metrics flagged with `health` are only flow metrics definitions per nodes / namespaces / workloads. The health dashboard will look the same as before by default.
(or `health` is not what I thought it is)
It seems like, at least for now, "health" fully overlaps with "flows", so we could keep just "flows".
New images:

They will expire after two weeks. To deploy this build:

```shell
# Direct deployment, from operator repo
IMAGE=quay.io/netobserv/network-observability-operator:3e2639c make deploy

# Or using operator-sdk
operator-sdk run bundle quay.io/netobserv/network-observability-operator-bundle:v0.0.0-3e2639c
```

Or as a Catalog Source:

```yaml
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: netobserv-dev
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.io/netobserv/network-observability-operator-catalog:v0.0.0-3e2639c
  displayName: NetObserv development catalog
  publisher: Me
  updateStrategy:
    registryPoll:
      interval: 1m
```
```yaml
- name: namespace_egress_bytes_total
  type: counter
  valuekey: Bytes
  filter:
```
After rebasing, you'll get the existing metrics changed like:

```yaml
filters:
  - key: FlowDirection
    value: "1"
  - key: Duplicate
    value: "false"
```

You'll need to make the same change on the newly added metrics as well (for all metrics counting bytes or packets).
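For example, a bytes-counting metric definition would then look like this (a sketch combining the snippet above with the `filters` change; field names follow the existing definitions):

```yaml
- name: namespace_egress_bytes_total
  type: counter
  valuekey: Bytes
  filters:
    - key: FlowDirection
      value: "1"
    - key: Duplicate
      value: "false"
```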
```go
import "encoding/json"

func FilterDashboardRows(dashboard string, ignoreFlags []string) (string, error) {
```
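As an illustration of what a function with this signature might do, here is a minimal sketch that drops dashboard rows whose tags match the ignore flags. It assumes a Grafana-style dashboard JSON where each row carries a `tags` array; the real NetObserv schema may differ.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// filterDashboardRows removes rows whose tags intersect ignoreFlags.
// Illustrative sketch only: the actual dashboard layout may differ.
func filterDashboardRows(dashboard string, ignoreFlags []string) (string, error) {
	var d map[string]any
	if err := json.Unmarshal([]byte(dashboard), &d); err != nil {
		return "", err
	}
	ignored := map[string]bool{}
	for _, f := range ignoreFlags {
		ignored[f] = true
	}
	rows, _ := d["rows"].([]any)
	kept := []any{}
	for _, r := range rows {
		row, _ := r.(map[string]any)
		tags, _ := row["tags"].([]any)
		skip := false
		for _, t := range tags {
			if s, ok := t.(string); ok && ignored[s] {
				skip = true
				break
			}
		}
		if !skip {
			kept = append(kept, r)
		}
	}
	d["rows"] = kept
	out, err := json.Marshal(d)
	return string(out), err
}

func main() {
	in := `{"rows":[{"title":"Flows","tags":["flows"]},{"title":"Health","tags":["health"]}]}`
	out, _ := filterDashboardRows(in, []string{"health"})
	fmt.Println(out) // the "health" row is dropped
}
```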
So, if I'm correct, tags are duplicated: in every dashboard row (for dashboard filtering) and in the metrics definitions (for FLP confgen). I guess it's OK for now, but I think it could be error-prone, especially as the dashboard JSONs are very noisy and hard to review. So maybe we can think about how we could "secure" this, either right now or in a follow-up. I can imagine:
- modifying FLP confgen so that it gives us the list of excluded and included metrics; then, based on that, we would do the dashboard filtering with a grep (`Contains`) on the PromQL expression
- or doing the same thing without confgen, but that means the operator parsing the metrics definition files
- or, since those dashboards are statically defined, we could also just write a unit test that validates that the dashboard tags correctly match the metric definition tags

wdyt?
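The third option (a unit test cross-checking tags) could be sketched as below; the helper name and the tag values are hypothetical, not the actual NetObserv definitions.

```go
package main

import "fmt"

// undeclaredTags returns dashboard tags that have no matching metric
// definition tag; a unit test could assert this list is empty.
func undeclaredTags(dashboardTags, metricTags []string) []string {
	known := map[string]bool{}
	for _, t := range metricTags {
		known[t] = true
	}
	var missing []string
	for _, t := range dashboardTags {
		if !known[t] {
			missing = append(missing, t)
		}
	}
	return missing
}

func main() {
	metricTags := []string{"flows", "namespaces", "workloads"}
	dashboardTags := []string{"flows", "namespaces", "health"}
	fmt.Println(undeclaredTags(dashboardTags, metricTags)) // [health]
}
```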
btw, speaking of unit tests, I think it would be good to have one here that ensures the dashboard files are correctly parsed without errors
Yes, the first option would be the best. We can dig into that, maybe in a follow-up?

The parsing is already ensured by this test:
https://github.com/jpinsonneau/network-observability-operator/blob/71da325c421290cf515e8ef6aed34b6b60db0577/controllers/flowcollector_controller_test.go#L336-L348
```yaml
tags:
  - flows
  - namespaces
  - health
```
I'm struggling with the tag system to be able to get exactly what we want; I think we would need this kind of thing, to allow more fine-grained selection:

```diff
 tags:
   - flows
   - namespaces
-  - health
+  - namespaces-flows
```

wdyt?
(btw I currently removed `health` because it fully overlaps with `flows`... I don't know if it may become useful again later)
So we will end up with:
- nodes-flows
- namespaces-flows
- workloads-flows

that could be disabled separately for the health dashboard. That makes sense to me.
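With that scheme, a namespace-level flow metric definition might be tagged like this (an illustrative sketch, not the final definitions):

```yaml
tags:
  - flows
  - namespaces
  - namespaces-flows
```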
Codecov Report

```
@@            Coverage Diff             @@
##             main     #388      +/-   ##
==========================================
+ Coverage   53.67%   55.33%   +1.66%
==========================================
  Files          44       45       +1
  Lines        5559     5824     +265
==========================================
+ Hits         2984     3223     +239
- Misses       2359     2381      +22
- Partials      216      220       +4
```

Flags with carried forward coverage won't be shown.
New images:

They will expire after two weeks. To deploy this build:

```shell
# Direct deployment, from operator repo
IMAGE=quay.io/netobserv/network-observability-operator:e2f53eb make deploy

# Or using operator-sdk
operator-sdk run bundle quay.io/netobserv/network-observability-operator-bundle:v0.0.0-e2f53eb
```

Or as a Catalog Source:

```yaml
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: netobserv-dev
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.io/netobserv/network-observability-operator-catalog:v0.0.0-e2f53eb
  displayName: NetObserv development catalog
  publisher: Me
  updateStrategy:
    registryPoll:
      interval: 1m
```
New images:

They will expire after two weeks. To deploy this build:

```shell
# Direct deployment, from operator repo
IMAGE=quay.io/netobserv/network-observability-operator:eb2201e make deploy

# Or using operator-sdk
operator-sdk run bundle quay.io/netobserv/network-observability-operator-bundle:v0.0.0-eb2201e
```

Or as a Catalog Source:

```yaml
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: netobserv-dev
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.io/netobserv/network-observability-operator-catalog:v0.0.0-eb2201e
  displayName: NetObserv development catalog
  publisher: Me
  updateStrategy:
    registryPoll:
      interval: 1m
```
New images:

They will expire after two weeks. To deploy this build:

```shell
# Direct deployment, from operator repo
IMAGE=quay.io/netobserv/network-observability-operator:5493d75 make deploy

# Or using operator-sdk
operator-sdk run bundle quay.io/netobserv/network-observability-operator-bundle:v0.0.0-5493d75
```

Or as a Catalog Source:

```yaml
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: netobserv-dev
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.io/netobserv/network-observability-operator-catalog:v0.0.0-5493d75
  displayName: NetObserv development catalog
  publisher: Me
  updateStrategy:
    registryPoll:
      interval: 1m
```
@jpinsonneau @jotak - I am not sure what the intention of the "Application" panel is here; looking at the panels, I find it difficult to spot differences between Application and Infrastructure. I was thinking Application would be something that's NOT ... wdyt?
Yes, we should exclude netobserv from applications, I agree.
Co-authored-by: Joel Takvorian <joel.takvorian@qaraywa.net>
Namespace-based metrics are included in the workload ones, so this would result in unnecessary metrics creation when both of them are enabled. The namespaces tag is therefore now disabled by default; to still show namespace-based metrics in the dashboard, we need to use the workload metrics aggregated by namespace, so we need to duplicate the related rows in the dashboard.
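For instance, aggregating a workload-level metric per namespace could look like this PromQL sketch (the metric and label names here are assumptions for illustration, not the actual NetObserv names):

```promql
# Sum workload egress bytes per destination namespace (illustrative names)
sum(rate(netobserv_workload_egress_bytes_total[2m])) by (DstK8S_Namespace)
```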
(FYI I just rebased the branch)
Changed to a programmatic approach, to handle JSON manipulation more easily and avoid the very large (and very redundant) dashboard JSON. Also changed from topk 5 to topk 10, as it renders better charts (fewer chunks in the timeseries).
Inject namespace for infra filters
New images:

They will expire after two weeks. To deploy this build:

```shell
# Direct deployment, from operator repo
IMAGE=quay.io/netobserv/network-observability-operator:9f7f8f3 make deploy

# Or using operator-sdk
operator-sdk run bundle quay.io/netobserv/network-observability-operator-bundle:v0.0.0-9f7f8f3
```

Or as a Catalog Source:

```yaml
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: netobserv-dev
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.io/netobserv/network-observability-operator-catalog:v0.0.0-9f7f8f3
  displayName: NetObserv development catalog
  publisher: Me
  updateStrategy:
    registryPoll:
      interval: 1m
```
@memodi also, another change I made is to run a top 10 instead of a top 5, because I found top 5 has a tendency to show different metrics at different timestamps, which results in "chunking" the timeseries more. With top 10, I feel the visualization is better / more consistent; sounds good? (Of course, it entirely depends on the number and the nature of the workloads that people run on their clusters.)
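As a sketch, the change amounts to widening the top-k selection in the panel queries (the metric and label names here are illustrative):

```promql
# Before: topk(5, sum(rate(netobserv_workload_egress_bytes_total[2m])) by (DstK8S_Namespace))
topk(10, sum(rate(netobserv_workload_egress_bytes_total[2m])) by (DstK8S_Namespace))
```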
/ok-to-test
/approve
[APPROVALNOTIFIER] This PR is APPROVED

This pull request has been approved by: jotak

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.