Grafana dashboards #260

Merged: 3 commits merged into netobserv:main on Mar 10, 2023
Conversation

@KalmanMeth (Contributor) commented Feb 2, 2023

Take the Grafana dashboards created by the FLP confgenerator (https://github.com/netobserv/flowlogs-pipeline/pull/369) and insert them into the OpenShift console.
FLP PR 369 must be merged first, then go.mod updated and go mod vendor run, and only then can this PR be merged.
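For context, a minimal sketch (not this PR's actual code) of the mechanism involved: the OpenShift console discovers dashboards from ConfigMaps placed in the openshift-config-managed namespace and labeled console.openshift.io/dashboard=true. The ConfigMap name below matches the one referenced later in this thread; the helper name and the data key are illustrative assumptions.

package dashboards

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// buildDashboardConfigMap wraps the dashboard JSON produced by the FLP
// confgenerator into a ConfigMap that the OpenShift console can discover.
func buildDashboardConfigMap(dashboardJSON string) *corev1.ConfigMap {
	return &corev1.ConfigMap{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "flowlogs-pipeline-metrics-dashboard",
			Namespace: "openshift-config-managed",
			Labels: map[string]string{
				// This label is what makes the console pick up the dashboard.
				"console.openshift.io/dashboard": "true",
			},
		},
		Data: map[string]string{
			// The key name is illustrative; it holds the generated dashboard JSON.
			"netobserv-metrics.json": dashboardJSON,
		},
	}
}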

@jpinsonneau (Contributor) left a comment:

Looks good, thanks @KalmanMeth! Just a few small suggestions on my side.

controllers/flowlogspipeline/flp_common_objects.go (outdated, resolved)
controllers/flowlogspipeline/flp_monolith_reconciler.go (outdated, resolved)

go.mod (outdated), comment on lines 97 to 98:

replace github.com/netobserv/flowlogs-pipeline v0.1.7-0.20221221173558-e6e77158b956 => github.com/KalmanMeth/flowlogs2metrics v0.0.0-20230129124539-a447dc625fd4
Contributor, suggested change:
replace github.com/netobserv/flowlogs-pipeline v0.1.7-0.20221221173558-e6e77158b956 => github.com/KalmanMeth/flowlogs2metrics v0.0.0-20230129124539-a447dc625fd4

Contributor Author:

No longer relevant. I included the latest release of flp confgenerator, which now includes the function to create the dashboards.

@jpinsonneau (Contributor) commented Feb 13, 2023

@KalmanMeth I'm trying that on cluster bot using OCP 4.12.1; the dashboard is correctly created but I don't get data in the graphs.

[screenshot]

FLP is using the quay.io/netobserv/flowlogs-pipeline:main image with the following config:

{
  "health": {
    "port": 8080
  },
  "log-level": "info",
  "metrics-settings": {
    "prefix": "netobserv_",
    "noPanic": true
  },
  "parameters": [
    {
      "name": "grpc",
      "ingest": {
        "type": "grpc",
        "grpc": {
          "port": 2055
        }
      }
    },
    {
      "name": "enrich",
      "transform": {
        "type": "network",
        "network": {
          "rules": [
            {
              "input": "SrcAddr",
              "output": "SrcK8S",
              "type": "add_kubernetes"
            },
            {
              "input": "DstAddr",
              "output": "DstK8S",
              "type": "add_kubernetes"
            },
            {
              "type": "reinterpret_direction"
            }
          ],
          "directionInfo": {
            "reporterIPField": "AgentIP",
            "srcHostField": "SrcK8S_HostIP",
            "dstHostField": "DstK8S_HostIP",
            "flowDirectionField": "FlowDirection",
            "ifDirectionField": "IfDirection"
          }
        }
      }
    },
    {
      "name": "loki",
      "write": {
        "type": "loki",
        "loki": {
          "url": "http://loki.netobserv.svc:3100/",
          "tenantID": "netobserv",
          "batchWait": "1s",
          "batchSize": 10485760,
          "timeout": "10s",
          "minBackoff": "1s",
          "maxBackoff": "5s",
          "maxRetries": 2,
          "labels": [
            "SrcK8S_Namespace",
            "SrcK8S_OwnerName",
            "DstK8S_Namespace",
            "DstK8S_OwnerName",
            "FlowDirection"
          ],
          "staticLabels": {
            "app": "netobserv-flowcollector"
          },
          "clientConfig": {
            "proxy_url": null,
            "tls_config": {
              "insecure_skip_verify": false
            },
            "follow_redirects": false
          },
          "timestampLabel": "TimeFlowEndMs",
          "timestampScale": "1ms"
        }
      }
    },
    {
      "name": "prometheus",
      "encode": {
        "type": "prom",
        "prom": {
          "metrics": [
            {
              "name": "namespace_flows_total",
              "type": "counter",
              "filter": {
                "key": "",
                "value": ""
              },
              "valueKey": "",
              "labels": [
                "SrcK8S_Namespace",
                "DstK8S_Namespace"
              ],
              "buckets": null
            },
            {
              "name": "node_ingress_bytes_total",
              "type": "counter",
              "filter": {
                "key": "FlowDirection",
                "value": "0"
              },
              "valueKey": "Bytes",
              "labels": [
                "SrcK8S_HostName",
                "DstK8S_HostName"
              ],
              "buckets": null
            },
            {
              "name": "workload_ingress_bytes_total",
              "type": "counter",
              "filter": {
                "key": "FlowDirection",
                "value": "0"
              },
              "valueKey": "Bytes",
              "labels": [
                "SrcK8S_Namespace",
                "DstK8S_Namespace",
                "SrcK8S_OwnerName",
                "DstK8S_OwnerName",
                "SrcK8S_OwnerType",
                "DstK8S_OwnerType"
              ],
              "buckets": null
            }
          ],
          "port": 9102,
          "prefix": "netobserv_"
        }
      }
    }
  ],
  "pipeline": [
    {
      "name": "grpc"
    },
    {
      "name": "enrich",
      "follows": "grpc"
    },
    {
      "name": "loki",
      "follows": "enrich"
    },
    {
      "name": "prometheus",
      "follows": "enrich"
    }
  ],
  "profile": {
    "port": 6060
  }
}

And here is the dashboard config:

{
   "__inputs": [ ],
   "__requires": [ ],
   "annotations": {
      "list": [ ]
   },
   "editable": false,
   "gnetId": null,
   "graphTooltip": 0,
   "hideControls": false,
   "id": null,
   "links": [ ],
   "panels": [
      {
         "aliasColors": { },
         "bars": false,
         "dashLength": 10,
         "dashes": false,
         "datasource": "prometheus",
         "fill": 1,
         "fillGradient": 0,
         "gridPos": {
            "h": 20,
            "w": 25,
            "x": 0,
            "y": 0
         },
         "id": 2,
         "legend": {
            "alignAsTable": false,
            "avg": false,
            "current": false,
            "max": false,
            "min": false,
            "rightSide": false,
            "show": true,
            "sideWidth": null,
            "total": false,
            "values": false
         },
         "lines": true,
         "linewidth": 1,
         "links": [ ],
         "nullPointMode": "null",
         "percentage": false,
         "pointradius": 5,
         "points": false,
         "renderer": "flot",
         "repeat": null,
         "seriesOverrides": [ ],
         "spaceLength": 10,
         "stack": false,
         "steppedLine": false,
         "targets": [
            {
               "expr": "topk(5,rate(netobserv_namespace_flows_total[1m]))",
               "format": "time_series",
               "intervalFactor": 2,
               "legendFormat": "",
               "refId": "A"
            }
         ],
         "thresholds": [ ],
         "timeFrom": null,
         "timeShift": null,
         "title": "Flows rate per Namespace",
         "tooltip": {
            "shared": true,
            "sort": 0,
            "value_type": "individual"
         },
         "type": "graph",
         "xaxis": {
            "buckets": null,
            "mode": "time",
            "name": null,
            "show": true,
            "values": [ ]
         },
         "yaxes": [
            {
               "format": "short",
               "label": null,
               "logBase": 1,
               "max": null,
               "min": null,
               "show": true
            },
            {
               "format": "short",
               "label": null,
               "logBase": 1,
               "max": null,
               "min": null,
               "show": true
            }
         ]
      },
      {
         "aliasColors": { },
         "bars": false,
         "dashLength": 10,
         "dashes": false,
         "datasource": "prometheus",
         "fill": 1,
         "fillGradient": 0,
         "gridPos": {
            "h": 20,
            "w": 25,
            "x": 0,
            "y": 0
         },
         "id": 3,
         "legend": {
            "alignAsTable": false,
            "avg": false,
            "current": false,
            "max": false,
            "min": false,
            "rightSide": false,
            "show": true,
            "sideWidth": null,
            "total": false,
            "values": false
         },
         "lines": true,
         "linewidth": 1,
         "links": [ ],
         "nullPointMode": "null",
         "percentage": false,
         "pointradius": 5,
         "points": false,
         "renderer": "flot",
         "repeat": null,
         "seriesOverrides": [ ],
         "spaceLength": 10,
         "stack": false,
         "steppedLine": false,
         "targets": [
            {
               "expr": "topk(5,rate(netobserv_node_ingress_bytes_total[1m]))",
               "format": "time_series",
               "intervalFactor": 2,
               "legendFormat": "",
               "refId": "A"
            }
         ],
         "thresholds": [ ],
         "timeFrom": null,
         "timeShift": null,
         "title": "Ingress Bandwidth",
         "tooltip": {
            "shared": true,
            "sort": 0,
            "value_type": "individual"
         },
         "type": "graph",
         "xaxis": {
            "buckets": null,
            "mode": "time",
            "name": null,
            "show": true,
            "values": [ ]
         },
         "yaxes": [
            {
               "format": "short",
               "label": null,
               "logBase": 1,
               "max": null,
               "min": null,
               "show": true
            },
            {
               "format": "short",
               "label": null,
               "logBase": 1,
               "max": null,
               "min": null,
               "show": true
            }
         ]
      },
      {
         "aliasColors": { },
         "bars": false,
         "dashLength": 10,
         "dashes": false,
         "datasource": "prometheus",
         "fill": 1,
         "fillGradient": 0,
         "gridPos": {
            "h": 20,
            "w": 25,
            "x": 0,
            "y": 0
         },
         "id": 4,
         "legend": {
            "alignAsTable": false,
            "avg": false,
            "current": false,
            "max": false,
            "min": false,
            "rightSide": false,
            "show": true,
            "sideWidth": null,
            "total": false,
            "values": false
         },
         "lines": true,
         "linewidth": 1,
         "links": [ ],
         "nullPointMode": "null",
         "percentage": false,
         "pointradius": 5,
         "points": false,
         "renderer": "flot",
         "repeat": null,
         "seriesOverrides": [ ],
         "spaceLength": 10,
         "stack": false,
         "steppedLine": false,
         "targets": [
            {
               "expr": "topk(5,rate(netobserv_workload_ingress_bytes_total[1m]))",
               "format": "time_series",
               "intervalFactor": 2,
               "legendFormat": "",
               "refId": "A"
            }
         ],
         "thresholds": [ ],
         "timeFrom": null,
         "timeShift": null,
         "title": "Ingress Bandwidth by source and destination",
         "tooltip": {
            "shared": true,
            "sort": 0,
            "value_type": "individual"
         },
         "type": "graph",
         "xaxis": {
            "buckets": null,
            "mode": "time",
            "name": null,
            "show": true,
            "values": [ ]
         },
         "yaxes": [
            {
               "format": "short",
               "label": null,
               "logBase": 1,
               "max": null,
               "min": null,
               "show": true
            },
            {
               "format": "short",
               "label": null,
               "logBase": 1,
               "max": null,
               "min": null,
               "show": true
            }
         ]
      }
   ],
   "refresh": "",
   "rows": [ ],
   "schemaVersion": 16,
   "style": "dark",
   "tags": [
      "netobserv",
      "grafana",
      "dashboard",
      "flp"
   ],
   "templating": {
      "list": [ ]
   },
   "time": {
      "from": "now",
      "to": "now"
   },
   "timepicker": {
      "refresh_intervals": [
         "5s",
         "10s",
         "30s",
         "1m",
         "5m",
         "15m",
         "30m",
         "1h",
         "2h",
         "1d"
      ],
      "time_options": [
         "5m",
         "15m",
         "1h",
         "6h",
         "12h",
         "24h",
         "2d",
         "7d",
         "30d"
      ]
   },
   "timezone": "browser",
   "title": "Netobserv Metrics",
   "version": 0
}

Am I missing something?

@KalmanMeth (Contributor Author):

@jpinsonneau Did you run the hack/enable_metrics.sh script? Do you have data being generated, such as from the sample workload?

@jpinsonneau (Contributor):

> @jpinsonneau Did you run the hack/enable_metrics.sh script? Do you have data being generated, such as from the sample workload?

Thanks! The content is now loading.
I thought that part was also automated. Is there any reason to avoid creating this automatically?

Also I have issues with labels appearing empty:
[screenshot]

@KalmanMeth (Contributor Author), replying to @jpinsonneau:

> Also I have issues with labels appearing empty:

Yes, I also do not see proper labels. I might need some help to understand what is needed here. I am continuing to investigate.

@jpinsonneau (Contributor):

> Yes, I also do not see proper labels. I might need some help to understand what is needed here. I am continuing to investigate.

You should add a legendFormat in the targets, such as:

 "targets": [
    {
       "expr": "topk(5,rate(netobserv_namespace_flows_total[1m]))",
       ...
       "legendFormat": "{{pod}}:{{Value}}",
    }
 ],

@@ -21,7 +21,7 @@ visualization:
 type: grafana
 grafana:
 - expr: 'topk(5,rate(netobserv_namespace_flows_total[1m]))'
-  legendFormat: '{{SrcK8S_Namespace}} : {{DstK8S_Namespace}}'
+  legendFormat: '{{pod}} : {{DstK8S_Namespace}}'
Contributor:

For all legendFormat values, what about using the same format as the console plugin?
When a namespace is available, use {{namespace}}.{{name}} (resource / owner); otherwise only {{name}} (namespace / host / IP), for every peer.
https://github.com/netobserv/network-observability-console-plugin/blob/68239b88737f676152a9cf51e3b58bc6b781bd75/web/src/utils/metrics.ts#L77

Then we use -> between source and destination as:
{{source}} -> {{destination}}
https://github.com/netobserv/network-observability-console-plugin/blob/fe28d186fcd9acc1078c54db7690e86d6c83a293/web/src/components/metrics/metrics-helper.tsx#L147

Here is the result:
[screenshot]

Contributor Author:

I added -> between source and destination. Instead of a . separating namespace and owner I used a , because the name of an owner or namespace can itself contain a . character.

I included the pod in the legend because we get separate statistics from each running flowlogs-pipeline instance. Since the different FLPs report separately and carry different labels, the same metrics may be spread across multiple FLPs and there is no cumulative summed total. We have to think about how we want to handle this.

We can further change the legends in a future PR by simply changing the metrics_definitions yaml files, with no additional changes in the code.
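As an illustration of the adopted convention (a sketch with hypothetical helper names, not code from this PR): combining namespace and owner with a comma, and joining source and destination with ->, a legend template for the workload metrics could be assembled like this:

package dashboards

import (
	"fmt"
	"strings"
)

// legendPart renders one peer: "{{namespace}},{{name}}" when a namespace label
// is available, otherwise just "{{name}}".
func legendPart(namespaceLabel, nameLabel string) string {
	if namespaceLabel == "" {
		return fmt.Sprintf("{{%s}}", nameLabel)
	}
	return fmt.Sprintf("{{%s}},{{%s}}", namespaceLabel, nameLabel)
}

// workloadLegend joins source and destination peers with " -> ", yielding
// "{{SrcK8S_Namespace}},{{SrcK8S_OwnerName}} -> {{DstK8S_Namespace}},{{DstK8S_OwnerName}}".
func workloadLegend() string {
	return strings.Join([]string{
		legendPart("SrcK8S_Namespace", "SrcK8S_OwnerName"),
		legendPart("DstK8S_Namespace", "DstK8S_OwnerName"),
	}, " -> ")
}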

Contributor:

Yes, it's good enough to merge as is. +1 for a follow-up on totals.

Thanks @KalmanMeth !
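On the totals follow-up: one possible direction (an assumption, not something decided in this PR) is to sum the per-pod series over the flow labels so that multiple FLP instances collapse into a single series, for example:

package dashboards

import (
	"fmt"
	"strings"
)

// summedRateExpr builds a PromQL expression that aggregates per-FLP-pod counter
// rates by the given flow labels, e.g.
// sum by (SrcK8S_Namespace,DstK8S_Namespace) (rate(netobserv_namespace_flows_total[1m]))
func summedRateExpr(metric string, labels ...string) string {
	return fmt.Sprintf("sum by (%s) (rate(%s[1m]))", strings.Join(labels, ","), metric)
}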

config/manager/kustomization.yaml (outdated, resolved)
jpinsonneau previously approved these changes Feb 21, 2023
@jpinsonneau (Contributor) left a comment:

LGTM, tested on OCP 4.12.4 and works fine 👍
Thanks @KalmanMeth !

[screenshot]

kubectl delete servicemonitor netobserv-plugin || true
kubectl delete configmap -n openshift-config-managed flowlogs-pipeline-metrics-dashboard || true
kubectl delete configmap -n netobserv flowlogs-pipeline-config || true
kubectl delete configmap -n netobserv console-plugin-config || true
Member:

Running make undeploy-sample-cr should be sufficient to remove everything that was created, without having to maintain a custom list here. Are there downsides to using this instead?

Member:

(you just need to make sure that the operator is still running while doing make undeploy-sample-cr, else it won't work)

@KalmanMeth (Contributor Author), Feb 23, 2023:

make undeploy-operator is very convenient to clean up all the allocated items in case we are debugging the operator locally and something goes wrong.

@@ -40,6 +40,7 @@ func newMonolithReconciler(info *reconcilersCommonInfo) *flpMonolithReconciler {
 promService: &corev1.Service{},
 serviceAccount: &corev1.ServiceAccount{},
 configMap: &corev1.ConfigMap{},
+dbConfigMap: &corev1.ConfigMap{},
Member:

This new config map should also be added with AddManagedObject, same as the existing configmap. This is necessary to handle namespace switching: if netobserv is moved to a different namespace, it ensures the previous configmap is correctly deleted.

Member:

(same applies to flp_transfo_reconciler file)

@KalmanMeth (Contributor Author), Feb 23, 2023:

Yes, that is what I thought at first. But then I ran into the issue that this new config map is not really in the netobserv namespace. It is technically in the openshift-config-managed namespace, but we have to set it up when we create all the objects in the netobserv namespace.

Member:

Oh, I didn't notice that, thanks for the explanation.

} else if !equality.Semantic.DeepDerivative(newCM.Data, r.owned.configMap.Data) {
if err := r.UpdateOwned(ctx, r.owned.configMap, newCM); err != nil {
return err
}
if r.reconcilersCommonInfo.availableAPIs.HasConsoleConfig() {
if err := r.UpdateOwned(ctx, r.owned.dbConfigMap, dbConfigMap); err != nil {
Member:

Not sure about that... basically it is saying: "if the main configMap is changed, then update the DB config map"?
I would rather copy-paste the whole if/else block and adapt it for dbConfigMap.

Member:

(same applies to flp_transfo_reconciler file)

Contributor Author:

If the main configMap is changed, it is possible that the labels in the dashboard change because of a new prefix for the prometheus variables. Otherwise, there is no reason to track changes to the new dashboards configMap, and as noted above, the dashboard configMap does not technically belong to the netobserv namespace.

@jotak (Member), Feb 23, 2023:

Yeah, but tying the dbConfigMap lifecycle to the presence of configMap seems risky to me.

For instance, imagine that the process runs through the top if !r.nobjMngr.Exists(r.owned.configMap) at first creation, and configMap is created; so far so good. Then imagine there's an issue (e.g. a temporary connectivity problem) and the subsequent call to r.CreateOwned(ctx, dbConfigMap) fails: in that case, configMap exists and dbConfigMap doesn't. In such a scenario, the operator will never be able to recover from that state, as the reconcile doesn't create dbConfigMap when configMap exists.

@KalmanMeth (Contributor Author), Feb 23, 2023:

The worst that happens is that there is no dashboard, but everything else runs properly.

To manage it independently from the original configMap, we have to track it. We cannot put it in the netobserv nobjMngr structure because it is not in the netobserv namespace and it would cause other confusion. So we would need another place to save the status of this particular dbConfigMap. Where would you suggest?

Note: we would have to perform a separate Fetch for this object, since it would not come back in the FetchAll operation.

Member:

Yes, I agree. Something like this should work I guess:

func (r *reconcilersCommonInfo) reconcileDbConfig(ctx context.Context, dbConfigMap *corev1.ConfigMap) error {
	curr := &corev1.ConfigMap{}
	if err := r.Get(ctx, types.NamespacedName{
		Name:      dbConfigMap.Name,
		Namespace: dbConfigMap.Namespace,
	}, curr); err != nil {
		if errors.IsNotFound(err) {
			return r.CreateOwned(ctx, dbConfigMap)
		}
		return err
	}
	if !equality.Semantic.DeepDerivative(dbConfigMap.Data, curr.Data) {
		return r.UpdateOwned(ctx, curr, dbConfigMap)
	}
	return nil
}

(The same function can be shared between the monolith & transfo reconcilers.)
It also makes the dbConfigMap: &corev1.ConfigMap{} in the owned struct useless.

Contributor Author:

done

@jotak (Member) commented Feb 22, 2023

Thanks for the PR.
Some remarks on the reconcilers & objects lifecycle, but other than that it looks good!

jotak previously approved these changes Feb 27, 2023
@jotak (Member) left a comment:

Thanks, lgtm!

@KalmanMeth (Contributor Author):

rebased

@KalmanMeth (Contributor Author):

reintegrated with latest code base

codecov bot commented Mar 5, 2023

Codecov Report

❗ No coverage uploaded for pull request base (main@176f7d9).
The diff coverage is n/a.

@@           Coverage Diff           @@
##             main     #260   +/-   ##
=======================================
  Coverage        ?   48.32%           
=======================================
  Files           ?       43           
  Lines           ?     4919           
  Branches        ?        0           
=======================================
  Hits            ?     2377           
  Misses          ?     2335           
  Partials        ?      207           
Flag Coverage Δ
unittests 48.32% <0.00%> (?)

Flags with carried forward coverage won't be shown.


@jotak (Member) commented Mar 9, 2023

I tested the PR and it looks good. There are still a few things to do, but they can be addressed in a follow-up:

  1. When all metrics are disabled, the dashboard is still present (and empty). It should be removed in that case (see the sketch below).
  2. We need to tune the defaults so that untested/undocumented dashboards (wrt QE/doc tasks) are removed for the moment.
  3. The name & flags look a bit weird compared to others. I think we must be missing some guidelines here (to check with the monitoring team).

[screenshot]
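A minimal sketch for follow-up item 1, building on the reconcileDbConfig helper suggested earlier in this thread (the method name, the enabledMetrics parameter and the availability of r.Delete are assumptions, not the operator's actual API): when no metric is enabled, the dashboard ConfigMap would be deleted instead of left empty.

func (r *reconcilersCommonInfo) reconcileDashboard(ctx context.Context, enabledMetrics []string, dbConfigMap *corev1.ConfigMap) error {
	if len(enabledMetrics) == 0 {
		// No metrics enabled: remove the dashboard rather than showing empty panels.
		// r.Delete is assumed to be available through the embedded client.
		if err := r.Delete(ctx, dbConfigMap); err != nil && !errors.IsNotFound(err) {
			return err
		}
		return nil
	}
	// Otherwise create or update it, reusing the shared helper.
	return r.reconcileDbConfig(ctx, dbConfigMap)
}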

@jotak (Member) commented Mar 10, 2023

/approve
/lgtm
thanks @KalmanMeth !

openshift-ci bot added the lgtm label Mar 10, 2023

openshift-ci bot commented Mar 10, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jotak

@openshift-merge-robot merged commit 8af3676 into netobserv:main Mar 10, 2023