Quick Links

Switching to new Observability Cluster - Design, Current Status & Changes

Monitoring

MultiClusterObservability component

  • The ACM MultiClusterObservability component lets us configure the storage class, the storage sizes (store, rule, receive, compact, and alert manager), the metric object storage bucket, the metrics collection interval, and downsampling. It also lets us configure the replicas and node selectors for each of the observability components (store, receive, grafana, query, alert manager, store memcached, RBAC query proxy, observatorium API, query frontend, rule, and query frontend memcached). See the Multi Cluster Observability component here.
  • You can find the observability components in the open-cluster-management-observability namespace here.
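As a rough illustration of the settings listed above, the following Python sketch builds a MultiClusterObservability manifest as a plain dictionary and prints it as JSON. The `v1beta2` API version comes from the error message later on this page; the field names under `spec` are recalled from the ACM documentation and should be checked against the CRD installed on the cluster. The storage class, secret name, and sizes are hypothetical placeholders, not the values NERC uses.

```python
import json


def build_mco_manifest(storage_class, bucket_secret, interval_seconds=300):
    """Return a MultiClusterObservability manifest as a plain dict.

    Field names are assumptions based on the ACM documentation;
    verify them against the CRD on your cluster before applying.
    """
    return {
        "apiVersion": "observability.open-cluster-management.io/v1beta2",
        "kind": "MultiClusterObservability",
        "metadata": {"name": "observability"},
        "spec": {
            # Downsampling of stored metrics.
            "enableDownsampling": True,
            # How often managed clusters push metrics, in seconds.
            "observabilityAddonSpec": {
                "enableMetrics": True,
                "interval": interval_seconds,
            },
            "storageConfig": {
                "storageClass": storage_class,
                # Secret that holds the object bucket configuration.
                "metricObjectStorage": {"name": bucket_secret, "key": "thanos.yaml"},
                # Hypothetical sizes, for illustration only.
                "storeStorageSize": "10Gi",
                "ruleStorageSize": "1Gi",
                "receiveStorageSize": "100Gi",
                "compactStorageSize": "100Gi",
                "alertmanagerStorageSize": "1Gi",
            },
        },
    }


if __name__ == "__main__":
    manifest = build_mco_manifest("example-storage-class", "example-bucket-secret")
    print(json.dumps(manifest, indent=2))
```

The JSON output could be saved to a file and passed to `oc apply -f` once the values have been reviewed.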

Monitoring Troubleshooting

Drop and Recreate multicluster-engine Operator

If you encounter errors like the following during ACM upgrades with Observability, the multicluster-engine may be out of sync with old ACM data:

E1212 15:28:42.757827 1 helmreleasemgr.go:99] failed to download chart from helm repo. - url: http://multiclusterhub-repo.open-cluster-management.svc.cluster.local:3000/charts/policyreport-2.5.3.tgz error: return code: 404 unable to retrieve chart - Failed to download the chart

error validating existing CRs against new CRD's schema for "multiclusterobservabilities.observability.open-cluster-management.io": error listing resources in GroupVersionResource schema.GroupVersionResource{Group:"observability.open-cluster-management.io", Version:"v1beta1", Resource:"multiclusterobservabilities"}: conversion webhook for observability.open-cluster-management.io/v1beta2, Kind=MultiClusterObservability failed: Post "https://multicluster-observability-webhook-service.open-cluster-management.svc:443/convert?timeout=30s": no endpoints available for service "multicluster-observability-webhook-service"

The fix requires deleting the multicluster-engine Subscription and CSV, deleting the openshift-monitoring pods, deleting the ACM Subscription, CSV, and MultiClusterObservability CRD, and then recreating everything that was deleted.
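The deletion half of that procedure can be sketched as an ordered list of `oc` commands. This is a minimal sketch, not a tested runbook: the Subscription and CSV names are hypothetical parameters you must look up first, the `multicluster-engine` namespace is an assumption, and the `open-cluster-management` namespace and the CRD name are taken from the error messages above. The function only builds and prints the commands; it does not run them.

```python
import shlex

# CRD name taken from the error message above.
MCO_CRD = "multiclusterobservabilities.observability.open-cluster-management.io"


def drop_commands(mce_sub, mce_csv, acm_sub, acm_csv,
                  mce_ns="multicluster-engine",      # assumed namespace
                  acm_ns="open-cluster-management"):
    """Return the delete commands in the order described above."""
    olm_sub = "subscriptions.operators.coreos.com"  # OLM Subscription resource
    return [
        ["oc", "delete", olm_sub, mce_sub, "-n", mce_ns],
        ["oc", "delete", "csv", mce_csv, "-n", mce_ns],
        ["oc", "delete", "pods", "--all", "-n", "openshift-monitoring"],
        ["oc", "delete", olm_sub, acm_sub, "-n", acm_ns],
        ["oc", "delete", "csv", acm_csv, "-n", acm_ns],
        ["oc", "delete", "crd", MCO_CRD],
    ]


if __name__ == "__main__":
    # All four names below are hypothetical placeholders.
    for cmd in drop_commands("example-mce-sub", "example-mce-csv",
                             "example-acm-sub", "example-acm-csv"):
        print(shlex.join(cmd))
```

Printing the commands first makes it easy to review them before running anything. Recreating the resources afterwards depends on how they are managed in your environment, so it is not shown here.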

Using the Monitoring tools

Monitoring and logging for the infrastructure hardware and software that is not part of OpenShift (for example Grafana).

As a NERC administrator, I should be able to monitor the status of any infrastructure software or hardware that supports operations for the NERC OpenShift environment, even if it is not itself part of OpenShift.

Steps for reporting

You can access many metrics for pods of applications in a namespace. See some of the available logs and metrics:

Using the Reporting tools

Track/report usage of the cluster

As an administrator of the cluster, I should be able to view daily, weekly, and monthly reports of the cluster infrastructure utilization.

Steps

  • Administrator logs into the associated XDMoD instance and views reports.
  • Click here to view the ACM Observability Grafana dashboards. These dashboards provide insights into Control Plane Health, Optimization, Capacity, Utilization, and more. You can change the timespan in the top right to show results in terms of minutes, hours, days, months or years.

Track/report usage of the project

As a user and the owner of a project, I should be able to view daily, weekly, and monthly reports of the infrastructure utilization by the projects I own.

Steps to track usage

Ceph Storage Space Monitoring

Log archiving and rollover could cause Ceph storage to run out of space. The metrics needed to calculate space on the Ceph cluster are not yet sent to Observability, so they are available in OpenShift Monitoring instead. Check log storage space consumed vs. available using these OpenShift metrics:

  1. OpenShift Data Foundation Ceph Storage: Total Storage

    Ceph Total Storage

  2. OpenShift Data Foundation Ceph Storage: Storage Used

    Ceph Storage Used

  3. OpenShift Data Foundation Ceph Storage: Percent Used

    Ceph Percent Used
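The third figure is derived from the first two. As a small worked example, the sketch below computes percent used from a total and a used byte count. The metric names in the comments (`ceph_cluster_total_bytes` and `ceph_cluster_total_used_bytes`) are the names the Ceph Prometheus exporter commonly uses; they are an assumption here, since this page does not name the underlying queries, and the byte values are made up.

```python
def percent_used(total_bytes, used_bytes):
    """Return used storage as a percentage of total storage."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * used_bytes / total_bytes


if __name__ == "__main__":
    # Hypothetical sample values, e.g. read from
    # ceph_cluster_total_bytes and ceph_cluster_total_used_bytes.
    total = 200 * 1024**4  # 200 TiB
    used = 50 * 1024**4    # 50 TiB
    print(f"{percent_used(total, used):.1f}% used")
```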

MultiClusterObservability documentation

Here are some useful links to the MultiClusterObservability documentation:

Logging

Logging Operators

  • Logging in the cluster is provided by Red Hat OpenShift Logging here.
  • We combine the OpenShift Logging Operator with the Loki Operator here, so that the Logging Operator sends the infrastructure, audit, and application logs to the Loki Operator where they are stored in an Object Bucket.
  • The OpenShift Logging Operator has a dependency on the Elasticsearch Operator here. Whether you store logs in Elasticsearch or in Loki, you still need the Elasticsearch Operator installed for the CustomResourceDefinitions it provides.

Loki operator

Tracking events in the Logging System

As an administrator of the cluster, I should be able to track all the events in the cluster using the logging system in OpenShift.

Steps

  • Click here to visit the Logs.
  • You can easily filter by recent date, or date range in the past.
  • You can easily filter by content, namespaces, pods, and containers.
  • You can also filter by log levels: critical, error, warning, info, debug, trace, unknown.
  • Click "Show Query" to add more advanced filters like cluster ID:
    • Here are the logs for the infra cluster. You can also append the following to the end of your log query to filter on infra cluster logs: | openshift_cluster_id="b3c6e302-f119-4adb-bc48-e04c6aa2eaa5"
    • Here are the logs for the prod cluster. You can also append the following to the end of your log query to filter on prod cluster logs: | openshift_cluster_id="fcb727d6-3e61-4d23-913d-756cf41c7982"
  • NERC Admins have access to application logs.
  • Infrastructure and audit logs have always been reserved for cluster admins in OpenShift Logging (even on the old stack with Elasticsearch). LokiStack is best configured for admin access via a group (currently three dedicated group names are supported: cluster-admin, dedicated-admin, and the standard group for kubeadmin). These groups require a ClusterRoleBinding to the cluster-admin ClusterRole.
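The cluster ID filter described above is just a suffix appended to an existing log query. A minimal Python sketch of that string handling follows; the base query `{ log_type="application" }` is a hypothetical example, and the helper name is made up for illustration.

```python
def with_cluster_filter(query, cluster_id):
    """Append an openshift_cluster_id filter to the end of a log query."""
    return f'{query.rstrip()} | openshift_cluster_id="{cluster_id}"'


if __name__ == "__main__":
    base = '{ log_type="application" }'  # hypothetical base query
    infra_id = "b3c6e302-f119-4adb-bc48-e04c6aa2eaa5"  # infra cluster ID from this page
    print(with_cluster_filter(base, infra_id))
```

The resulting string can be pasted into the query box that appears after clicking "Show Query".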

Cluster Logging documentation

Here are some useful links to the Cluster Logging documentation: