
observability: Add multi-cluster-observability-addon proposal #1524

Closed

Conversation

@periklis (Contributor) commented Dec 1, 2023

Refs:

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress label on Dec 1, 2023

openshift-ci bot commented Dec 1, 2023

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

## Summary

Multi-Cluster Observability has been an integrated concept in Red Hat Advanced Cluster Management (RHACM) since its inception but only incorporates one of the core signals, namely metrics, to manage fleets of OpenShift Container Platform (OCP) based clusters (see [RHACM Multi-Cluster-Observability-Operator (MCO)](rhacm-multi-cluster-observability)). The underlying architecture of RHACM observability consists of a set of observability components that collect a dedicated set of OCP metrics, visualize them, and alert on fleet-relevant events. It is an optional but closed-circuit system applied to RHACM-managed fleets without any points of extensibility.

Perhaps a little nit, but for the sake of accuracy we need to mention that the current MCO is not a closed-circuit system applied to RHACM-managed fleets without any points of extensibility. In fact, it uses the same addon framework that you propose to use below, and the current MCO could, at least technically, be extended to incorporate both logging and tracing.

@periklis (Contributor, Author) replied on Dec 14, 2023:

MCO is using the addon framework? AFAIK it is an operator, or am I misled by looking at this too narrowly through this repo: https://github.com/stolostron/multicluster-observability-operator/

At least this operator is what I had in mind when talking about a closed-circuit system. It makes many decisions about how things run, and beyond the current state of this code base it is hard to extend.


In a typical managed cluster with ACM Observability enabled, you will see 9 separate addons (screenshot omitted).
MCO is a hub-side component that polls and watches for new clusters being added.
As a new managed cluster is found, the endpoint metrics collector addon (itself an operator) is rolled out.


This enhancement proposal seeks to bring a unified approach to collecting and forwarding logs and traces from a fleet of OCP clusters based on the RHACM addon facility (see Open Cluster Management (OCM) [addon framework](ocm-addon-framework)) by enabling these signals to land on third-party managed and centralized storage solutions (e.g. AWS CloudWatch, Google Cloud Logging). The multi-cluster observability addon is an optional RHACM addon. It is a day-two companion for MCO and does not necessarily share any resources/configuration with the latter. It provides a unified approach to installing the required dependencies (e.g. operator subscriptions) and resources (custom resources, certificates, CA bundles, configuration) on the managed clusters to collect and forward logs and traces. The addon's name is Multi Cluster Observability Addon (MCOA).
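
For a sense of how the OCM addon facility mentioned above is typically wired up, here is a minimal sketch of a `ClusterManagementAddOn` registration; the addon name, display text, and default config references are assumptions for illustration, not values defined by this proposal.

```yaml
apiVersion: addon.open-cluster-management.io/v1alpha1
kind: ClusterManagementAddOn
metadata:
  # Hypothetical name; the proposal only states the addon is called MCOA.
  name: multicluster-observability-addon
spec:
  addOnMeta:
    displayName: Multi Cluster Observability Addon
    description: Configures log and trace collection/forwarding on managed clusters.
  supportedConfigs:
    # Assumed default, fleet-wide configuration resources read by the addon manager.
    - group: addon.open-cluster-management.io
      resource: addondeploymentconfigs
      defaultConfig:
        name: multicluster-observability-addon
        namespace: open-cluster-management
    - group: logging.openshift.io
      resource: clusterlogforwarders
      defaultConfig:
        name: instance
        namespace: open-cluster-management
```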

@bjoydeep commented Dec 12, 2023:

Reflecting a bit on the naming convention. In RHACM today, we have:

  • an observability addon and its corresponding operator on the hub called MCO
  • a GRC addon
  • an app lifecycle addon
  • etc.

So calling this the Multicluster Observability Addon could be very confusing. I think I understand the logic behind the proposed naming convention: it is adding logging and tracing functions onto the original MCO, and that makes sense. But to RHACM customers used to a certain convention, this will be very confusing IMO.

@periklis (Contributor, Author) replied:

Any proposals for a good name? mco-addon?


## Motivation

The main driver for the following work is to simplify and unify the installation of log and trace collection and forwarding on an RHACM managed fleet of OCP clusters. The core utility function of the addon is to install required operators (i.e. [Red Hat OpenShift Logging](ocp-cluster-logging-operator) and [Red Hat OpenShift distributed tracing data collection](opentelemetry-operator)), configure required custom

A Member commented:

Red Hat OpenShift distributed tracing data collection

It should be renamed to Red Hat build of OpenTelemetry

@periklis (Contributor, Author) replied:

Yes, the rename is indeed needed after GA'ing both of your products.

### User Stories

* As a fleet administrator I want to install homogeneous log collection and forwarding on any set of RHACM-managed OCP clusters.
* As a fleet administrator I want to install homogeneous trace collection and forwarding on any set of RHACM-managed OCP clusters.

A Member commented:

We can keep this, but I would like to add:

  • the addon deploys an OTel Collector (OTELcol) that can be used to collect and forward OTLP traces, metrics and logs.

@periklis (Contributor, Author) replied:

Yes, this is a very welcome addition/amendment to this proposal's goals. Nothing is set in stone at this stage and, as said elsewhere, we need to make the package OTEL-friendly/strict for a uniform signal experience.

Signed-off-by: Israel Blancas <iblancasa@gmail.com>
Signed-off-by: Israel Blancas <iblancasa@gmail.com>
[ocp-clusterlogforwarder-outputsecretspec]:https://github.com/openshift/cluster-logging-operator/blob/627b0c7f8c993f89250756d9601d1a632b024c94/apis/logging/v1/cluster_log_forwarder_types.go#L226-L265
[ocp-clusterlogforward-outputtypespec]:https://github.com/openshift/cluster-logging-operator/blob/627b0c7f8c993f89250756d9601d1a632b024c94/apis/logging/v1/output_types.go#L21-L40
[opentelemetry-collector-auth]:https://opentelemetry.io/docs/collector/configuration/#authentication
[opentelemetry-operator]:https://console-openshift-console.apps.ptsirakiaws2311285.devcluster.openshift.com/github.com/open-telemetry/opentelemetry-operator

A Contributor commented:

wrong link?


### Implementation Details/Notes/Constraints [optional]

The MCOA implementation sources three different set of manifests acompanying the addon registration and deployment on a RHACM hub cluster:

A Contributor suggested:

Suggested change
The MCOA implementation sources three different set of manifests acompanying the addon registration and deployment on a RHACM hub cluster:
The MCOA implementation sources three different set of manifests accompanying the addon registration and deployment on a RHACM hub cluster:


#### Multi Cluster Log Collection and Forwarding

For all managed clusters the fleet administrator is required to provide a single `ClusterLogForwarder` resource stanza that describes the log forwarding configuration for the entire fleet in the default namespace `open-cluster-management`.
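
For illustration, a minimal sketch of the kind of fleet-wide `ClusterLogForwarder` stanza described here, assuming the `logging.openshift.io/v1` API; the output type, endpoint URL, and secret name are placeholders, not values taken from the proposal.

```yaml
apiVersion: logging.openshift.io/v1
kind: ClusterLogForwarder
metadata:
  name: instance
  namespace: open-cluster-management
spec:
  outputs:
    # Placeholder output; endpoint and secret are illustrative only.
    - name: app-logs
      type: loki
      url: https://logs.example.com
      secret:
        name: spoke-application-logs
  pipelines:
    - name: application-logs
      inputRefs:
        - application
      outputRefs:
        - app-logs
```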

A Contributor commented:

I'm not familiar with the ACM addon capabilities, but does it require installing the logging CRDs on the hub cluster too?

name: spoke-application-logs
namespace: openshift-logging
data:
  'tls.crt': "Base64 encoded TLS client certificate"

A Contributor commented:

Again, I'm not familiar with add-ons, but does this mean that any referenced secret ends up verbatim in the ManifestWork object?
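
For context on this question, a rough sketch (an assumption about the mechanics, not something stated in the proposal) of what it would look like if a referenced secret were copied verbatim into the `ManifestWork` payload for a managed cluster:

```yaml
apiVersion: work.open-cluster-management.io/v1
kind: ManifestWork
metadata:
  name: addon-mcoa-deploy            # hypothetical name
  namespace: spoke-1                 # the managed cluster's namespace on the hub (illustrative)
spec:
  workload:
    manifests:
      # If secrets are embedded verbatim, the work payload carries the data as-is.
      - apiVersion: v1
        kind: Secret
        metadata:
          name: spoke-application-logs
          namespace: openshift-logging
        data:
          tls.crt: "Base64 encoded TLS client certificate"
```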

# - TLS client certificates for mTLS communication with a log output / trace exporter.
# - Client credentials for password based authentication with a log output / trace exporter.
- resource: secrets

A Contributor commented:

Don't you need a defaultConfig? Or is it omitted to keep the manifest readable?

@@ -26,6 +26,8 @@ tracking-link:
- https://issues.redhat.com/browse/OBSDA-356
- https://issues.redhat.com/browse/OBSDA-393
- https://issues.redhat.com/browse/LOG-4539
- https://issues.redhat.com/browse/TRACING-3540
- https://issues.redhat.com/browse/OBSDA-489

A Member commented:

This ticket is enough for tracing. The tracing Jira above is a child of the OBSDA ticket.

@@ -347,8 +349,177 @@ spec:
```

#### Multi Cluster Trace Collection and Forwarding

A Member commented:

Can we change this to Multi Cluster OTLP collection and forwarding?

@@ -84,7 +86,7 @@ The workflow implemented in this proposal enables fleet-wide log/tracing collect
1. The fleet administrator registers MCOA on RHACM using a dedicated `ClusterManagementAddOn` resource on the hub cluster.
2. The fleet administrator deploys MCOA on the hub cluster using a Red Hat provided Helm chart.
2. The fleet administrator creates a default `ClusterLogForwarder` stanza in the `open-cluster-management` namespace that describes the list of log forwarding outputs. This stanza will then be used as a template by MCOA when generating the `ClusterLogForwarder` instance per managed cluster.
3. The fleet administrator creates a default `OpenTelemetryCollector` resource in the `open-cluster-management` namespace that describes the list of trace exporters. This stanza will then be used as a template by MCOA when generating the `OpenTelemetryCollector` instance per managed cluster.
3. The fleet administrator creates a default `OpenTelemetryCollector` resource in the `open-cluster-management` namespace that describes the list of trace receivers, processors, connectors and exporters. This stanza will then be used as a template by MCOA when generating the `OpenTelemetryCollector` instance per managed cluster.

A Member commented:

Can we make this more generic and remove trace?

@@ -351,8 +351,35 @@ spec:
#### Multi Cluster Trace Collection and Forwarding
For all managed clusters the fleet administrator is required to provide a single `OpenTelemetryCollector` resource stanza that describes the trace forwarding configuration for the entire fleet in the default namespace `open-cluster-management`.

The following example resource describes a configuration for forwarding application traces from one OpenTelemetry Collector (deployed in the spoke cluster) to another one in a different cluster exposing the OTLP endpoint via an OpenShift Route:
One `OpenTelemetryCollector` instance is deployed per spoke cluster. It reports its traces to a Hub OTEL Cluster (note that this cluster can be different from the RHACM Hub cluster). The Hub OTEL Cluster exports the received telemetry to a traces storage (like Grafana Tempo or a third-party service).
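
As an illustration of that topology, a minimal sketch of a spoke-side `OpenTelemetryCollector` that receives OTLP and forwards it to a hub collector exposed via an OpenShift Route; the endpoint and resource names are placeholders, not values from the proposal.

```yaml
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: spoke-otelcol
  namespace: open-cluster-management
spec:
  mode: deployment
  config: |
    receivers:
      otlp:
        protocols:
          grpc: {}
          http: {}
    exporters:
      otlphttp:
        # Placeholder endpoint for the hub OTEL cluster's Route.
        endpoint: https://otlp-hub.example.com
    service:
      pipelines:
        traces:
          receivers: [otlp]
          exporters: [otlphttp]
```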

A Member commented:

The same as before, I would change this to remove trace and add OTLP.

@simonpasquier (Contributor) commented:

cc @jotak for awareness

Signed-off-by: Israel Blancas <iblancasa@gmail.com>

### User Stories

* As a fleet administrator I want to install homogeneous log collection and forwarding on any set of RHACM-managed OCP clusters.

@bjoydeep commented Dec 22, 2023:

Is there any thought on how we can provide, say:

  • audit logging only for clusters with label env=prod
  • infra logging for all clusters

We do not have this for MCO at the moment, but we introduced a mechanism while adding User Workload data ingestion into ACM which could be exploited. We have not been asked to do this yet in the metrics world; however, I wonder if this is something people are used to seeing for logging/tracing.

Is the per-cluster ConfigMap shown below capable of doing that?
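
Not something the proposal defines, but for reference, label-based targeting like "env=prod" is typically expressed in OCM with a `Placement`; the sketch below is purely an assumption about how such per-cluster selection could be wired up.

```yaml
apiVersion: cluster.open-cluster-management.io/v1beta1
kind: Placement
metadata:
  name: audit-logs-prod-only          # hypothetical name
  namespace: open-cluster-management
spec:
  predicates:
    - requiredClusterSelector:
        labelSelector:
          # Only clusters labelled env=prod would be selected.
          matchLabels:
            env: prod
```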

@openshift-bot

Inactive enhancement proposals go stale after 28d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle stale.
Stale proposals rot after an additional 7d of inactivity and eventually close.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale label on Jan 20, 2024
@periklis (Contributor, Author):
/remove-lifecycle stale

@openshift-ci openshift-ci bot removed the lifecycle/stale label on Jan 20, 2024
@periklis periklis changed the title from "cluster-logging: Add multi-cluster-observability-addon proposal" to "observability: Add multi-cluster-observability-addon proposal" on Feb 5, 2024
@dhellmann (Contributor) commented:
#1555 is changing the enhancement template in a way that will cause the header check in the linter job to fail for existing PRs. If this PR is merged within the development period for 4.16 you may override the linter if the only failures are caused by issues with the headers (please make sure the markdown formatting is correct). If this PR is not merged before 4.16 development closes, please update the enhancement to conform to the new template.

@openshift-bot:

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale label on Mar 13, 2024
@periklis (Contributor, Author):
/remove-lifecycle stale

@openshift-ci openshift-ci bot removed the lifecycle/stale label on Mar 13, 2024
@openshift-bot:

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale label on Apr 10, 2024
@periklis (Contributor, Author):
/remove-lifecycle stale

@openshift-ci openshift-ci bot removed the lifecycle/stale label on Apr 10, 2024
1. The fleet administrator registers MCOA on RHACM using a dedicated `ClusterManagementAddOn` resource on the hub cluster.
2. The fleet administrator deploys MCOA on the hub cluster using a Red Hat provided Helm chart.
2. The fleet administrator creates a default `ClusterLogForwarder` stanza in the `open-cluster-management` namespace that describes the list of log forwarding outputs. This stanza will then be used as a template by MCOA when generating the `ClusterLogForwarder` instance per managed cluster.
3. The fleet administrator creates a default `OpenTelemetryCollector` resource in the `open-cluster-management` namespace that describes the list of trace receivers, processors, connectors and exporters. This stanza will then be used as a template by MCOA when generating the `OpenTelemetryCollector` instance per managed cluster.

A Member suggested:

Suggested change
3. The fleet administrator creates a default `OpenTelemetryCollector` resource in the `open-cluster-management` namespace that describes the list of trace receivers, processors, connectors and exporters. This stanza will then be used as a template by MCOA when generating the `OpenTelemetryCollector` instance per managed cluster.
3. The fleet administrator creates a default `OpenTelemetryCollector` stanza in the `open-cluster-management` namespace that describes the list of trace receivers, processors, connectors and exporters. This stanza will then be used as a template by MCOA when generating the `OpenTelemetryCollector` instance per managed cluster.

2. The fleet administrator deploys MCOA on the hub cluster using a Red Hat provided Helm chart.
2. The fleet administrator creates a default `ClusterLogForwarder` stanza in the `open-cluster-management` namespace that describes the list of log forwarding outputs. This stanza will then be used as a template by MCOA when generating the `ClusterLogForwarder` instance per managed cluster.
3. The fleet administrator creates a default `OpenTelemetryCollector` resource in the `open-cluster-management` namespace that describes the list of trace receivers, processors, connectors and exporters. This stanza will then be used as a template by MCOA when generating the `OpenTelemetryCollector` instance per managed cluster.
4. The fleet administrator creates a default `AddOnDeploymentConfig` resource in the `open-cluster-management` namespace that describes general addon parameters, i.e. operator subscription channel names that should be used on all managed clusters.

A Member suggested:

Suggested change
4. The fleet administrator creates a default `AddOnDeploymentConfig` resource in the `open-cluster-management` namespace that describes general addon parameters, i.e. operator subscription channel names that should be used on all managed clusters.
4. The fleet administrator creates a default `AddOnDeploymentConfig` stanza in the `open-cluster-management` namespace that describes general addon parameters, i.e. operator subscription channel names that should be used on all managed clusters.
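
To make step 4 concrete, a minimal sketch of such an `AddOnDeploymentConfig`; the variable names are assumptions for illustration, since the proposal only states that subscription channels are configured here.

```yaml
apiVersion: addon.open-cluster-management.io/v1alpha1
kind: AddOnDeploymentConfig
metadata:
  name: multicluster-observability-addon
  namespace: open-cluster-management
spec:
  customizedVariables:
    # Hypothetical variable names; only the intent (fleet-wide subscription
    # channels) comes from the proposal.
    - name: loggingSubscriptionChannel
      value: stable
    - name: openTelemetrySubscriptionChannel
      value: stable
```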

@periklis periklis marked this pull request as ready for review April 23, 2024 13:17
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress label on Apr 23, 2024

openshift-ci bot commented Apr 23, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from periklis. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot requested review from jcantrill and jotak April 23, 2024 13:18

openshift-ci bot commented Apr 23, 2024

@periklis: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/markdownlint | f2f7220 | link | true | /test markdownlint |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@periklis periklis closed this Apr 29, 2024