Add MCO data exportation proposal #1639
---
title: multicluster-logs-traces-forwarding
authors:
  - "@pavolloffay"
reviewers:
  - "@moadz"
  - "@periklis"
  - "@alanconway"
  - "@jcantrill"
  - "@berenss"
  - "@bjoydeep"
approvers:
  - "@moadz"
  - "@periklis"
  - "@alanconway"
  - "@jcantrill"
api-approvers:
  - "@moadz"
  - "@periklis"
  - "@alanconway"
  - "@jcantrill"
creation-date: 2024-06-08
last-updated: 2024-06-08
tracking-link:
  -
see-also:
  - None
replaces:
  - None
superseded-by:
  - None
---

# Multi-Cluster telemetry data exportation

## Release Signoff Checklist

- [x] Enhancement is `implementable`
- [ ] Design details are appropriately documented from clear requirements
- [ ] Test plan is defined
- [ ] Graduation criteria for dev preview, tech preview, GA
- [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/)

## Summary

The objective of multi-cluster observability is to offer users a capability to collect metrics, logs, and traces from spoke clusters. Currently, the collection uses three different technology stacks and protocols for exporting data (Prometheus remote-write for metrics, Loki push for logs, and OTLP for traces). Loki push and Prometheus remote-write are not universally supported as ingest protocols by observability vendors; for example, Datadog and Dynatrace do not natively ingest Prometheus remote-write, whereas OTLP support is widespread.

> **Review discussion (PR comments):**
>
> - Reviewer: "This isn't strictly an accurate statement. 'Most' suggests a vast majority don't accept it, which is not correct. This list is not even comprehensive wrt 'Managed Prometheus' offerings, all of which natively support it. A more accurate statement would be: DataDog and Dynatrace do not support native Prometheus remote_write."
> - Reviewer: "Can attest to this. Prometheus remote write is a hugely popular protocol that the community has adopted, and there are now efforts for a 2.0 of this protocol to include more info. Most common large-scale vendors support it, but some rely on other signals as their main source of data. I would say Loki conventions are also quite popular and people adhere to them, even if the underlying project is different."
> - Author: "I did some digging into various vendors. Some of the largest vendors don't support RW, but also some of the new/smaller vendors don't support RW either. On the other hand, most of them support OTLP. I will rephrase the sentence. Datadog: no native RW, no RW ingestion via their agent (https://docs.datadoghq.com/containers/kubernetes/prometheus/?tab=kubernetesadv2). Vector has […]"

This enhancement proposal seeks to strengthen the interoperability of MCOA by unifying and simplifying the export of all MCOA telemetry data (metrics, logs, traces), exposing a unified export API and consolidating export protocols. This capability enables users to send data from MCOA to any observability vendor and to apply fine-grained filtering and routing of exported data to configurable sinks.

## Motivation

At the moment, exporting all telemetry data from OpenShift is fragmented across three technology stacks (Prometheus, ClusterLogForwarder, OpenTelemetry Collector). Every tool uses a different configuration API and export protocol, and provides (or does not provide) its own filtering/PII capabilities.

> **Review discussion (PR comments):**
>
> - Reviewer: "Would this really be a negative aspect? All these stacks are tailor-made to handle their representative signals, and all have their own user bases. I'm not clear as to why it is a bad thing if I don't want to configure my metrics the same way I configure my logs or my traces. Each serves a different utility, and as a user, I'd like to choose how to manage them separately."
> - Reviewer: "@saswatamcode This may be true, but I don't read that as the intent of this proposal; that seems orthogonal. I understand the intent as providing a way to unify the signals to a single protocol, which is OTLP using OTel semantics."

### Prior art and user requests

* Red Hat OpenShift as OpenTelemetry (OTLP) native platform: https://www.redhat.com/en/blog/red-hat-openshift-opentelemetry-otlp-native-platform
* Export in-cluster metrics to a 3rd party vendor: https://github.com/openshift/cluster-monitoring-operator/issues/2000
* Exporting metrics to Dynatrace: https://issues.redhat.com/browse/OBSDA-433
* Export all metrics to Dynatrace: https://issues.redhat.com/browse/OBSDA-450
* Customer asking to export metrics to Splunk: https://redhat-internal.slack.com/archives/C04TFRRKUA2/p1687853284985279

### User Stories

* As a fleet administrator, I want to export all telemetry signals collected by MCOA to OTLP-compatible endpoint(s).
* As a fleet administrator, I want to filter sensitive data before it is exported to the MCO telemetry store or a 3rd party OTLP endpoint.
* As a fleet administrator, I want to decide which data is exported to the MCO telemetry store or a 3rd party OTLP endpoint.

### Goals

* Use the OTLP protocol to export all telemetry data to a 3rd party system.
* Provide a single configuration API on the MCO CRD for exporting all telemetry data.
* Provide unified filtering and routing capabilities for all exported telemetry data.

### Non-Goals

* Data visualization and querying.

## Proposal

The following section describes how data exportation, routing, and filtering are configured in MCO and MCOA.

![Architecture](./multicluster-observability-addon-interoperability-arch.jpg)

### Workflow Description

1. Configure the OTLP endpoint in the MCO (`MultiClusterObservability`) CR.
2. MCOA configures an additional OTLP exporter in the OpenTelemetry collector. The exporter is in the pipeline that receives all supported telemetry signals.
3. (optional) Filtering (e.g. for PII) can be configured in the `OpenTelemetryCollector` CR managed by MCOA via the [transformprocessor](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/processor/transformprocessor/README.md).
4. (optional) Routing can be configured in the `OpenTelemetryCollector` CR managed by MCOA via the [routingprocessor](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/routingprocessor). A configuration sketch illustrating steps 2-4 follows the review discussion below.

> **Review discussion (PR comments):**
>
> - Reviewer: "Is the expectation that users will configure their own PII filtering? With platform metrics it's mostly pod names and IPs, which I guess could be anonymised, but that would render them useless for general platform troubleshooting, would it not? I'm still not clear on how this is supposed to be held by users. Furthermore, if they are writing metrics onto clusters that they own, PII becomes an authorization and deletion concern, not an ingestion concern. I would say this feature is mostly relevant when offloading your observability data to a vendor or third party (e.g. RHOBS or DataDog)."
> - Reviewer: "I'm not clear as to how this would be easier/unified. Such filtering can be configured at the Prometheus scrape level or in the ClusterLogForwarder as well, and having them split actually ensures a user can be intentional about what exactly they want to filter. Such PII-filtering configuration for metrics can easily be set on scrape configs if one really wants to filter out certain specific labels. But @moadz raises a great point here, which is: when will I, as a user, need to censor my own metrics?"
> - Author: "Yes, and most likely mostly for user workload metrics. They could filter out platform data as well, but it will be their responsibility if they break the console. Logs and traces are more important for PII than metrics. With MCO we are intentional about making it easy to provision and manage the entire stack and ultimately provide a good integrated product experience. As a user I would prefer to configure processing/filtering capability in a single API rather than on three different APIs/stacks (they could even have different processing/filtering capabilities)."

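As a rough illustration of steps 2-4, the sketch below shows what an `OpenTelemetryCollector` resource rendered by MCOA could look like. All concrete values (the resource name, endpoints, the `tenant` attribute, and the redaction rule) are illustrative assumptions, not part of this proposal; the processor configuration follows the transformprocessor and routingprocessor documentation linked above.

```yaml
# Illustrative sketch only: names, endpoints, and rules below are assumptions.
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: mcoa-forwarder           # hypothetical name; MCOA would render and own this CR
spec:
  mode: deployment
  config: |
    receivers:
      otlp:                      # data forwarded by the per-signal stacks (step 2)
        protocols:
          grpc: {}
    processors:
      transform:                 # optional PII filtering (step 3)
        log_statements:
          - context: log
            statements:
              - replace_pattern(body, "\\d{3}-\\d{2}-\\d{4}", "REDACTED")
      routing:                   # optional routing to per-sink exporters (step 4)
        attribute_source: resource
        from_attribute: tenant   # hypothetical resource attribute used for routing
        default_exporters: [otlp/thirdparty]
        table:
          - value: team-a
            exporters: [otlp/team-a]
    exporters:
      otlp/thirdparty:           # 3rd party endpoint configured via the MCO CR (step 1)
        endpoint: otlp.vendor.example.com:4317
      otlp/team-a:
        endpoint: otlp.team-a.example.com:4317
    service:
      pipelines:
        logs:
          receivers: [otlp]
          processors: [transform, routing]
          exporters: [otlp/thirdparty, otlp/team-a]
```
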

### API Extensions

None - no new APIs or CRDs are introduced.

### Implementation Details/Notes/Constraints [optional]

#### General configuration and fleet-wide stanzas

To support the above workflow, MCOA deploys an additional collector which forwards collected data to the 3rd party OTLP endpoint:

- An `OpenTelemetryCollector` resource that enables receivers for the supported telemetry signals. The individual telemetry stacks forward data to these endpoints, and the collector enables an OTLP exporter for forwarding to the 3rd party vendor (see the sketch below).

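A minimal sketch of such a forwarder collector, focused on the per-signal pipelines. Which receiver each in-cluster stack would use to hand data off (for example, whether logs arrive via OTLP or the Loki push API) is not defined by this proposal; the receiver choices, names, and endpoint below are assumptions for illustration.

```yaml
# Sketch only: receiver selection per signal and the vendor endpoint are assumptions.
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: mcoa-export              # hypothetical name
spec:
  mode: deployment
  config: |
    receivers:
      otlp:                      # any stack able to emit OTLP (traces, and optionally logs/metrics)
        protocols:
          grpc: {}
          http: {}
      loki:                      # logs pushed via the Loki push API (assumption)
        protocols:
          http: {}
    exporters:
      otlp:
        endpoint: otlp.vendor.example.com:4317   # 3rd party endpoint taken from the MCO CR
    service:
      pipelines:
        traces:
          receivers: [otlp]
          exporters: [otlp]
        metrics:
          receivers: [otlp]
          exporters: [otlp]
        logs:
          receivers: [otlp, loki]
          exporters: [otlp]
```
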

#### Hypershift [optional]

N/A

### Drawbacks

- MCOA configuration through the MultiClusterObservability CR: the MCO CR already has an extensive set of configuration fields; when designing the MCOA configuration we will need to take extra care not to make this CR more complex and harder to navigate.
- MCOA manifest sync: with MCOA being deployed by MCO, we will need to set up a procedure to keep the MCOA manifests that live in the MCO repo up to date.
- CRD conflicts: MCOA will leverage the CRDs of other operators, so we will have to ensure that we do not run into situations where two operators are managing the same CRD.

> **Review discussion (PR comments):**
>
> - Reviewer: "These are definitely valid drawbacks, but are broad and generic to MCOA/Open Cluster Management and Kubernetes as a whole, not specific to the approach echoed here."
> - Reviewer: "I think in this proposal there is a collector running in the hub that translates back to remote_write to avoid the drawbacks you mentioned."
> - Author: "No, this proposal does not imply running a collector on the hub that translates to protocols supported by the stores."
> - Reviewer: "The translator most likely incurs some CPU/mem cost. Are there some benchmarks that show that, or even a change in the ingestion throughput?"
> - Reviewer: "Between remote write vs native OTLP? I don't think there is anything publicly available as a benchmark, maybe some prombench scenario. The way this translator works is by translating OTLP to Prometheus remote write requests (https://github.com/prometheus/prometheus/blob/main/storage/remote/otlptranslator/prometheusremotewrite/metrics_to_prw.go#L41), so it is doing additional work on top of regular remote write ingestion. To get around this, vendors like Grafana have products like https://grafana.com/docs/grafana-cloud/send-data/alloy/. Hopefully over time we see it become more native in Prometheus, and equally performant."
> - Author: "That is just a custom build like the OTel collector. We have the Red Hat build of OpenTelemetry (https://docs.openshift.com/container-platform/4.15/observability/otel/otel-configuration-of-otel-collector.html). In the next version we will add a Prometheus Remote Write exporter."

## Design Details

### Open Questions [optional]

TBD

### Test Plan

TBD

### Graduation Criteria

TBD

#### Dev Preview

TBD

#### Dev Preview -> Tech Preview

TBD

#### Tech Preview -> GA

TBD

#### Removing a deprecated feature

None

### Upgrade / Downgrade Strategy

None

### Version Skew Strategy

None

### Operational Aspects of API Extensions

TBD

#### Failure Modes

TBD

#### Support Procedures

TBD

## Implementation History

TBD

## Alternatives

### Multiple OTLP exporter/sinks

An OTLP exporter/sink could be implemented in all telemetry collectors (`Prometheus`, `ClusterLogForwarder`, `OpenTelemetryCollector`); however, providing common filtering and routing capabilities across them would be problematic, if not impossible.

In addition to exporting in OTLP, a single collector will enable MCO to easily support exporting data to other systems with custom protocols (e.g. AWS CloudWatch, Google Cloud Monitoring/Logging, Azure Monitor). A sketch of such a multi-sink configuration follows the review discussion below.

> **Review discussion (PR comments):**
>
> - Reviewer: "Google Cloud Monitoring and Azure Monitor support remote write already. AWS CloudWatch does not, but AWS Managed Prometheus plugs into CloudWatch and accepts remote_write. Likewise for GCM and Google's managed Prometheus."
> - Author: "I am not an expert here, but it seems like Azure Monitor requires a sidecar for ingesting remote-write. This applies to metrics; for logs and traces a solution might be different. The intention is to simplify data exporting by providing a well-supported solution across the telemetry signals supported by MCO. A unified approach will eliminate silos and overlapping product features."

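To make the "single collector, many sinks" point concrete, the fragment below sketches how vendor-specific exporters from opentelemetry-collector-contrib could sit next to the OTLP exporter in the same collector configuration. The exporter selection and all values are assumptions for illustration only; this proposal does not commit to any particular set of sinks.

```yaml
# Sketch only: exporter names come from opentelemetry-collector-contrib; which sinks
# MCO would actually enable, and all values below, are illustrative assumptions.
receivers:
  otlp:
    protocols:
      grpc: {}
exporters:
  otlp:                              # generic 3rd party OTLP endpoint
    endpoint: otlp.vendor.example.com:4317
  awscloudwatchlogs:                 # logs to AWS CloudWatch
    log_group_name: "mcoa-logs"      # hypothetical log group
    log_stream_name: "mcoa-stream"   # hypothetical log stream
    region: us-east-1
  googlecloud: {}                    # Google Cloud Operations, using default credentials
  azuremonitor:                      # Azure Monitor / Application Insights
    connection_string: "${env:AZURE_MONITOR_CONNECTION_STRING}"
service:
  pipelines:
    logs:
      receivers: [otlp]
      exporters: [otlp, awscloudwatchlogs, googlecloud, azuremonitor]
```
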
### Integrate directly into present Multi-Cluster-Observability-Operator

TBD

## Infrastructure Needed [optional]

None

[ocm-addon-framework]: https://github.com/open-cluster-management-io/addon-framework
[opentelemetry-operator]: https://github.com/open-telemetry/opentelemetry-operator
[rhacm-multi-cluster-observability]: https://github.com/stolostron/multicluster-observability-operator

## RANDOM IDEAS

-

> **Review discussion (PR comments):**
>
> - Reviewer: "`cluster observability addon` or `cluster observability operator`?"
> - Author: "MCO - multi cluster observability, which MCOA is part of."
> - Reviewer: "Yes, what I meant is that the sentence sounds incomplete, because it is not clear if you refer to one or the other."