Quick Links

ACM Observability Monitoring Grafana Dashboard
Multi Cluster Logging with Loki Operator
OpenShift Data Foundations Ceph Storage Percent Used in OpenShift Monitoring

Switching to new Observability Cluster - Design, Current Status & Changes

Design: Logging System
Design: Observability Architecture

Monitoring

Monitoring in the cluster is provided by the Red Hat Advanced Cluster Management Operator here.

MultiClusterObservability component

The ACM MultiClusterObservability component allows us to configure the storage class, storage size, rule storage size, receive storage size, compact storage size, alert manager storage size, metric object storage bucket, interval, and downsampling for the observability service. It also allows us to configure the replicas and node selectors for each of the observability components (store, receive, grafana, query, alert manager, store memcached, RBAC query proxy, observatorium API, query frontend, rule, and query frontend memcached). See the Multi Cluster Observability component here.
You can find the observability components in the open-cluster-management-observability namespace here.
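As a rough illustration of the settings listed above, the sketch below builds a MultiClusterObservability manifest as a Python dict and prints it as JSON. The field names follow the v1beta2 API as best recalled and should be checked against the installed CRD. The function name, sizes, replica counts, secret name, and key are all placeholder assumptions, not the values used on the NERC clusters.

```python
import json


def build_mco_manifest(storage_class, secret_name, secret_key, interval_seconds=300):
    """Return a MultiClusterObservability manifest as a plain dict.

    All sizes and replica counts below are illustrative placeholders.
    """
    return {
        "apiVersion": "observability.open-cluster-management.io/v1beta2",
        "kind": "MultiClusterObservability",
        "metadata": {"name": "observability"},
        "spec": {
            # Downsampling of stored metrics.
            "enableDownsampling": True,
            # How often managed clusters push metrics (seconds).
            "observabilityAddonSpec": {
                "enableMetrics": True,
                "interval": interval_seconds,
            },
            "storageConfig": {
                "storageClass": storage_class,
                # Secret holding the object storage bucket configuration.
                "metricObjectStorage": {"name": secret_name, "key": secret_key},
                "storeStorageSize": "10Gi",
                "ruleStorageSize": "1Gi",
                "receiveStorageSize": "100Gi",
                "compactStorageSize": "100Gi",
                "alertmanagerStorageSize": "1Gi",
            },
            # Per-component overrides such as replica counts.
            "advanced": {
                "receive": {"replicas": 3},
                "grafana": {"replicas": 2},
            },
        },
    }


# Hypothetical inputs for illustration only.
manifest = build_mco_manifest(
    storage_class="ocs-external-storagecluster-ceph-rbd",
    secret_name="thanos-object-storage",
    secret_key="thanos.yaml",
)
print(json.dumps(manifest, indent=2))
```

The JSON output can be reviewed or diffed against the live resource before anything is applied to the cluster.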

Monitoring Troubleshooting

Drop and Recreate multicluster-engine Operator

In case you encounter errors with ACM upgrades and Observability, it may be due to the multicluster-engine being out of sync with old ACM data:

E1212 15:28:42.757827 1 helmreleasemgr.go:99] failed to download chart from helm repo. - url: http://multiclusterhub-repo.open-cluster-management.svc.cluster.local:3000/charts/policyreport-2.5.3.tgz error: return code: 404 unable to retrieve chart - Failed to download the chart

error validating existing CRs against new CRD's schema for "multiclusterobservabilities.observability.open-cluster-management.io": error listing resources in GroupVersionResource schema.GroupVersionResource{Group:"observability.open-cluster-management.io", Version:"v1beta1", Resource:"multiclusterobservabilities"}: conversion webhook for observability.open-cluster-management.io/v1beta2, Kind=MultiClusterObservability failed: Post "https://multicluster-observability-webhook-service.open-cluster-management.svc:443/convert?timeout=30s": no endpoints available for service "multicluster-observability-webhook-service"

Recovering requires deleting the multicluster-engine Subscription and CSV, deleting the openshift-monitoring pods, deleting the ACM Subscription, CSV, and MultiClusterObservability CRD, and then reinstalling everything that was removed.
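The order of the recovery steps can be sketched as a dry-run planner that only builds the `oc delete` command strings and never executes them. The namespaces, Subscription names, and CSV names passed in below are placeholders that must be replaced with the real values from the cluster. Only the CRD name is taken from the error message above.

```python
# CRD name as it appears in the conversion webhook error.
MCO_CRD = "multiclusterobservabilities.observability.open-cluster-management.io"


def recovery_plan(mce_ns, mce_sub, mce_csv, acm_ns, acm_sub, acm_csv):
    """Return the delete commands in the order described above (dry run only)."""
    return [
        # 1. multicluster-engine Subscription and CSV
        f"oc delete subscriptions.operators.coreos.com {mce_sub} -n {mce_ns}",
        f"oc delete clusterserviceversion {mce_csv} -n {mce_ns}",
        # 2. openshift-monitoring pods (their controllers recreate them)
        "oc delete pods --all -n openshift-monitoring",
        # 3. ACM Subscription, CSV, and the MultiClusterObservability CRD
        f"oc delete subscriptions.operators.coreos.com {acm_sub} -n {acm_ns}",
        f"oc delete clusterserviceversion {acm_csv} -n {acm_ns}",
        f"oc delete crd {MCO_CRD}",
        # 4. Reinstalling is not shown: re-apply the manifests that were removed.
    ]


# Placeholder names; look up the real ones before running anything.
plan = recovery_plan(
    mce_ns="<mce-namespace>",
    mce_sub="<mce-subscription>",
    mce_csv="<mce-csv>",
    acm_ns="<acm-namespace>",
    acm_sub="<acm-subscription>",
    acm_csv="<acm-csv>",
)
for command in plan:
    print(command)
```

Printing the plan first makes it easy to review the exact resources that would be deleted before running any command by hand.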

Using the Monitoring tools

Monitoring and logging for the infrastructure hardware and software that is not OpenShift (for example Grafana).

As a NERC administrator, I should be able to monitor the status of any infrastructure software or hardware that supports operations for the NERC OpenShift environment, even if it is not itself part of OpenShift.

Steps for reporting

You can access many metrics for pods of applications in a namespace. See some of the available logs and metrics:

Click here to visit the CPU usage logs for dex.
Click here to visit the CPU usage logs for gitops.
Click here to visit the CPU usage logs for grafana.
Click here to visit the CPU usage logs for logging.
Click here to visit the CPU usage logs for loki.
Click here to visit the CPU usage logs for vault.
Click here to visit the CPU usage logs for xdmod.

Using the Reporting tools

Track/report usage of the cluster

As an administrator of the cluster, I should be able to view daily, weekly, and monthly reports of the cluster infrastructure utilization.

Steps

Administrator logs into the associated XDMoD instance and views reports.
Click here to view the ACM Observability Grafana dashboards. These dashboards provide insights into Control Plane Health, Optimization, Capacity, Utilization, and more. You can change the timespan in the top right to show results in terms of minutes, hours, days, months or years.

Track/report usage of the project

As a user and the owner of a project, I should be able to view daily, weekly, and monthly reports of the infrastructure utilization by the projects I own.

Steps to track usage

User logs into the associated XDMoD instance and views reports for projects they own.
Users should not be able to view reports for projects they do not own. We still need to look into how to restrict the view to only the projects that they own.
Click here to view the memory usage of projects over time.
Click here to view the CPU usage of the projects over time.
Click here to show the top 5 projects by CPU usage at each point in time.

Ceph Storage Space Monitoring

Log archiving and rollover could run the Ceph storage out of space. The metrics needed to calculate space on the Ceph cluster are not yet sent to Observability, so they are only available in OpenShift Monitoring. Check the log storage space consumed vs. available using these OpenShift metrics:

OpenShift Data Foundations Ceph Storage Total Storage
OpenShift Data Foundations Ceph Storage Storage Used
OpenShift Data Foundations Ceph Storage Percent Used
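The percent-used figure is the used capacity divided by the total capacity. The sketch below shows that arithmetic with a simple threshold check. The metric names in the comments (`ceph_cluster_total_bytes`, `ceph_cluster_total_used_bytes`) are an assumption about what the linked queries use, and the 80% threshold is an arbitrary example value.

```python
def percent_used(total_bytes, used_bytes):
    """Return used capacity as a percentage of total capacity.

    Assumed inputs: total_bytes from ceph_cluster_total_bytes and
    used_bytes from ceph_cluster_total_used_bytes.
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * used_bytes / total_bytes


def needs_attention(total_bytes, used_bytes, threshold_percent=80.0):
    """Return True when usage has reached the (hypothetical) alert threshold."""
    return percent_used(total_bytes, used_bytes) >= threshold_percent


# Made-up sample values: 1000 bytes total, 250 bytes used.
print(percent_used(1000, 250))
print(needs_attention(1000, 850))
```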

MultiClusterObservability documentation

Here are some useful links to the MultiClusterObservability documentation:

APIs Red Hat Advanced Cluster Management for Kubernetes 2.5
Observing environments introduction Red Hat Advanced Cluster Management for Kubernetes 2.5
Managing applications Red Hat Advanced Cluster Management for Kubernetes 2.0

Logging

Logging Operators

Logging in the cluster is provided by the Red Hat OpenShift Logging Operator here.
We combine the OpenShift Logging Operator with the Loki Operator here, so that the Logging Operator sends the infrastructure, audit, and application logs to the Loki Operator where they are stored in an Object Bucket.
The OpenShift Logging Operator has a dependency on the Elasticsearch Operator here. Whether you store logs in Elasticsearch or in Loki, you still need the Elasticsearch Operator installed for the CustomResourceDefinitions that the Logging Operator depends on.

Loki operator

The Loki Operator allows you to set up LokiStacks, AlertingRules, RecordingRules, and RulerConfigs based on your cluster logs for infrastructure, audit, and applications. See the Loki Operator here.
Setting up a LokiStack allows you to configure the size of a cluster logging system that you desire in terms of storage and replicas. LokiStack here
Setting up a LokiStack involves configuring persistent storage by storageClassName for Persistent Volume Claims. ocs-external-storagecluster-ceph-rbd storage class here
Setting up a LokiStack involves configuring object storage by a secret named "thanos-object-storage" in the "openshift-logging" namespace containing the access_key_id, access_key_secret, bucketnames, and endpoint of the object storage.
The object storage for Loki is provided by OpenShift Data Foundations. See the openshift-logging-objectbucketclaim Object Bucket Claim here
The infra and prod cluster logs are available on the infra cluster here.
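The object storage secret described above can be sketched as a Kubernetes Secret manifest built in Python. The secret name, namespace, and the four keys come from the text. The credential values, bucket name, endpoint, and the function name are placeholders invented for the example.

```python
import base64
import json


def build_loki_storage_secret(access_key_id, access_key_secret, bucketnames, endpoint):
    """Return a Secret manifest whose data values are base64-encoded strings."""
    fields = {
        "access_key_id": access_key_id,
        "access_key_secret": access_key_secret,
        "bucketnames": bucketnames,
        "endpoint": endpoint,
    }
    data = {
        key: base64.b64encode(value.encode("utf-8")).decode("ascii")
        for key, value in fields.items()
    }
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {
            "name": "thanos-object-storage",
            "namespace": "openshift-logging",
        },
        "type": "Opaque",
        "data": data,
    }


# Placeholder values only; real credentials come from the Object Bucket Claim.
secret = build_loki_storage_secret(
    access_key_id="EXAMPLEKEYID",
    access_key_secret="example-secret",
    bucketnames="example-bucket",
    endpoint="https://s3.example.invalid",
)
print(json.dumps(secret, indent=2))
```

Real credentials should not be committed to the repository. This only shows the shape of the secret that the LokiStack expects.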

Tracking events in the Logging System

As an administrator of the cluster, I should be able to track all the events in the cluster using the logging system in OpenShift.

Steps

Click here to visit the Logs.
You can easily filter by recent date, or date range in the past.
You can easily filter by content, namespaces, pods, and containers.
You can also filter by log levels: critical, error, warning, info, debug, trace, unknown.
Click "Show Query" to add more advanced filters like cluster ID:
- Here are the logs for the infra cluster. You can also add the following filter to the end of your log query to show only infra cluster logs: | openshift_cluster_id="b3c6e302-f119-4adb-bc48-e04c6aa2eaa5"
- Here are the logs for the prod cluster. You can also add the following filter to the end of your log query to show only prod cluster logs: | openshift_cluster_id="fcb727d6-3e61-4d23-913d-756cf41c7982"
NERC Admins have access to application logs.
Infrastructure and audit logs have always been reserved for cluster admins in OpenShift Logging (even on the old stack with Elasticsearch). LokiStack is best configured for admin access via a group (currently we support three dedicated names: cluster-admin, dedicated-admin, and the standard group for kubeadmin). These groups require a ClusterRoleBinding to the cluster-admin ClusterRole.
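The cluster ID filters from the steps above can be appended to a query with a small helper. The two cluster IDs are the ones listed above. The base query `{ log_type="application" } | json` is an assumed example of what "Show Query" displays, and the helper name is hypothetical.

```python
# Cluster IDs as listed in the steps above.
CLUSTER_IDS = {
    "infra": "b3c6e302-f119-4adb-bc48-e04c6aa2eaa5",
    "prod": "fcb727d6-3e61-4d23-913d-756cf41c7982",
}


def filter_by_cluster(query, cluster):
    """Append an openshift_cluster_id filter to the end of a log query."""
    cluster_id = CLUSTER_IDS[cluster]  # raises KeyError for unknown clusters
    return f'{query.rstrip()} | openshift_cluster_id="{cluster_id}"'


# Assumed base query for application logs.
base_query = '{ log_type="application" } | json'
print(filter_by_cluster(base_query, "prod"))
```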

Cluster Logging documentation

Here are some useful links to the Cluster Logging documentation:

Chapter 7. Forwarding logs to external third-party logging systems OpenShift Container Platform 4.10
Logging OpenShift Container Platform 4.10
Exported fields | Logging | OpenShift Container Platform 4.10
Deploying Cluster Logging
Multi-tenancy | Grafana Loki documentation
Grafana Configuration
HTTP API | Grafana Loki documentation
Forwarding Logs to LokiStack - Loki Operator
API - Loki Operator
Configure generic OAuth authentication | Grafana documentation
