
Add MCO data exportation proposal #1639

Open
pavolloffay wants to merge 9 commits into master
Conversation

pavolloffay
Member

Signed-off-by: Pavol Loffay <p.loffay@gmail.com>
Signed-off-by: Pavol Loffay <p.loffay@gmail.com>
@openshift-ci openshift-ci bot requested review from jan--f and jcantrill June 10, 2024 11:00
Contributor

openshift-ci bot commented Jun 10, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from pavolloffay. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


## Summary

The objective of multi cluster observability is to offer users a capability to collect
Member

cluster observability addon or cluster observability operator?

Member Author

MCO = multi cluster observability, which MCOA is part of.

Member

Yes, what I meant is that the sentence sounds incomplete because it is not clear whether you refer to one or the other.

Signed-off-by: Pavol Loffay <p.loffay@gmail.com>
Signed-off-by: Pavol Loffay <p.loffay@gmail.com>
Signed-off-by: Pavol Loffay <p.loffay@gmail.com>

### Goals

* Use a single protocol (OTLP) for exporting all data to MCO and/or 3rd party system.
Contributor

I don't see how "Use a single protocol (OTLP) for exporting all data to MCO" is related to any of the user stories. I'd consider the protocol being used between the spoke clusters and the hub to be an internal implementation detail.

Member

Not all customers want to send data from spoke clusters to a central location in their infrastructure. There are cases where they want to send directly from a spoke to a third party service. At that point, it is important that the spoke clusters can "talk" to external services, and the protocol used is not just an internal detail.

Member Author
@pavolloffay pavolloffay Jun 11, 2024

User story:

  • As a fleet administrator, I want to export all telemetry signals collected by MCOA to an OTLP compatible endpoint(s).

is related to this goal. The OTLP protocol is crucial here, as it is the most widely supported protocol across observability vendors and OSS tools.

@moadz moadz left a comment

Thanks for the proposal Pavol! I think it's a really interesting idea, and it's quite sexy to hold, but it paints a very complex problem with a reasonably broad brush. I'll also caveat this by saying I'm responding from a purely metrics perspective, as I have little context on logs. I'll try to structure this in the form of a few outrageously leading and contrived questions:

Should we be making it easier to export telemetry via MCO?

Answer: A resounding YES

I agree with the core thesis of this proposal, which is that a user should be able to reason about exports purely in open, unified, vendor-agnostic terms.

This means if I want to export metrics, logs and traces from a cluster, OTLP provides the most bang-for-your-buck in terms of vendor compatibility. This is something that is already achievable by running the OTEL Collector either alongside the existing stack, or completely on its own.

We should:

  • Make it trivial to export metrics and alerts in the OTLP format from the in-cluster stack if deployed. This should be easy and one-CRD/configuration step.
  • Make it so that the in-cluster components speak nicely to OTEL Collector in this case and treat OTLP as close to a first-class citizen in OpenShift as possible (think CMO forwarding via OTEL Collector etc.)
  • Or simply run the collector on its own, as we do with MicroShift and RHEL hosts via flightctl

So far... we 100% agree.
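
For reference, a minimal collector configuration along these lines is sketched below. This is a sketch only, assuming the upstream `prometheus` receiver and `otlp` exporter are included in the collector build; the federation target and vendor endpoint are placeholders, and authentication/TLS are omitted.

```yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: federate                  # scrape the in-cluster Prometheus federation endpoint
          honor_labels: true
          metrics_path: /federate
          params:
            'match[]': ['{__name__=~".+"}']   # forwards everything; narrow this in practice
          static_configs:
            - targets: ['prometheus.example.svc:9090']   # placeholder target

exporters:
  otlp:
    endpoint: otlp.vendor.example.com:4317    # placeholder third-party OTLP/gRPC endpoint

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      exporters: [otlp]
```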

If we're already using it for third parties, should we use a single protocol (OTLP via OTEL Collector) for exporting all metrics data to MCO?

Answer: No

This is where this proposal confuses me slightly, because you're making an assumption that all MCO and OCP customers would like to export metrics via OTLP, and thus we should also adopt the collector for this purpose.

If the aim is to make it easy to export to a third party, OTLP/OTEL collector makes sense because the user is not always authoritative over compatibility with the third party.

MCO is not a third party; it is a first party. The MCO components run on customer premises, and the customers own the data that they are producing. They can go in and delete it, manage access to it, and audit its contents without additional cost. The benefits of OTLP for a third party don't apply to MCO as a first party, because we have end-to-end authority over ingestion, storage and query. This is the core of why we should be investing in MCO as a product segment: it's unique to us as a platform provider. Our equivalents in CloudWatch/AWS, Azure Monitor/Azure and Google Cloud Monitoring/GCP do not allow the customer to actually retain their data on their own infrastructure and manipulate it at will.

Given we control how the sausage is made, and how it is subsequently consumed, there is actually very little value in adopting OTLP on the critical path here. Given users will be paying for the compute, and would like it to run reliably and cheaply on their infrastructure, that should be our focus.

The factual basis for this is that we have received zero (0) RFEs from ACM monitoring customers asking us to support OTLP as a line format. So if none of our existing customers are asking for it, why build it?

This approach seems to suggest that the overhead of producing and/or storing metrics in Prometheus, translating them to OTLP and then translating them back to OpenMetrics format is worth it in the default case (collecting and storing platform metrics on MCO), which currently encompasses 100% of our users.

Does this mean that there is no future for the OTEL collector as a general purpose Observability data sink and forwarder on the critical path for metrics ingest into MCO storage?

Answer: Categorically NO.

I personally would love it if this were the case. It would slim down our stack, reduce running costs and simplify operation and maintenance. As such, I have set out the minimum criteria we would need from the OTEL Collector so that we do not regress on the existing functionality we provide.

I'm all for simplification and unification as long as it abides by one core and sacrosanct principle: it must be for the functional (features) or non-functional (performance and reliability) benefit of the end user.

Functional benefits

Currently, the most pressing functional benefits we would need from the OTEL Collector to use it on the critical path are:

  • [MUST] Aggregation of metrics (recording rules); infrastructure metrics cardinality is huge, even to third parties, so we would need to be able to aggregate metrics to reduce the storage and query cost in the hub.
  • [MUST] Downsampling-on-the-wire for metrics; bandwidth is a sacred resource, and with high-cardinality metrics, writing more often can cost you 10x the egress with little benefit to show for it. Metrics-collector is currently the only component that does this: it reduces cluster egress costs at the cost of semantic accuracy by scraping only every 5 minutes. If the OTEL Collector could do this we would use it asap.
  • [SHOULD] Dynamic catch-all observability signal collection based on boundary conditions; e.g. if an alert starts firing, collect all container level metrics, traces and logs in the namespace related to the alert. We do this with metrics-collector currently, but the implementation leaves much to be desired.
  • [COULD] Correlation and troubleshooting benefits that are materialised in the ACM UI; if we're not doing this, then there's no benefit in supporting unified collection in the spokes.

Non-functional benefits

  • [MUST] Prometheus/OTEL Collector via OTLP is at least as performant as Prometheus/metrics-collector via remote-write; OTEL Collector metrics performance remains an unknown quantity. The published benchmarks leave a lot to be desired, as they compare OTLP performance to defunct standards like OpenCensus and SignalFx on very small samples (10k DPS); we need broader and more comprehensive tests of how this performs. What we need from the OTEL Collector before we use OTLP as our default spoke-to-hub exposition format is something equivalent to the remote-write 2.0 spec benchmarks.

So CPU/memory profiles, load tests and flamegraphs. The whole shebang!

  • [MUST] Be fault tolerant during rollouts and configuration changes; our customers rely on MCO/hub alert forwarding for troubleshooting infrastructure issues and declaring incidents. The collector should be able to provide this facility, either by running in HA for rollout fault tolerance, or by proxying critical metrics and alerts through a redundant stack. This is actually one of metrics-collector's weak points, so it would be great if we could address this through the OTEL Collector.

This is all assuming that native OTLP write support does not materialise in the short term, but that would likely be a replication format, and would not address the features the collector could unlock with respect to on-the-wire processing.

metrics, logs and traces from spoke clusters. Currently, the collection technology
uses three different technology stacks and protocols for exporting data (Prometheus
remote-write for metrics, Loki push for logs and OTLP for traces).
Loki push and Prometheus remote-write are not commonly supported as ingest protocols by

This isn't strictly an accurate statement. 'Most' suggests a vast majority don't accept it, which is not correct. This list is not even comprehensive with respect to 'Managed Prometheus' offerings, all of which natively support remote-write.

A more accurate statement would be DataDog and Dynatrace do not support native Prometheus remote_write.

Member

Can attest to this. Prometheus remote write is a hugely popular protocol that the community has adopted, and there are now efforts for even 2.0 of this protocol to include more info. Most common large-scale vendors support it, but some rely on other signals as their main source of data.

I would say the Loki conventions are also quite popular and people adhere to them, even if the underlying project is different.

Member Author

Most common large-scale vendors support it, but some rely on other signals as their main source of data.

I did some digging into various vendors. Some of the largest vendors don't support RW but also some of the new/smaller vendors don't support RW either. On the other hand, most of them support OTLP. I will rephrase the sentence.

  • Datadog: no native RW, no RW ingestion via their agent (https://docs.datadoghq.com/containers/kubernetes/prometheus/?tab=kubernetesadv2). Vector has beta support for RW but not sure if DD supports it. They don't support OTLP natively either, only via their agent.
  • Dynatrace: no native RW, no RW ingestion via their agent
  • Honeycomb: no native RW
  • Instana: no native RW, ingestion possible via their agent
  • LogicMonitor: no native RW (https://www.logicmonitor.com/support/monitoring/applications-databases/openmetrics-monitoring)
  • Lumigo: no RW/Prometheus support
  • Lightstep: no RW support (https://docs.lightstep.com/docs/ingest-prometheus)

  • Splunk: native RW
  • New Relic: native RW
  • Elastic: native RW


### Goals

* Use a single protocol (OTLP) for exporting all data to MCO and/or 3rd party system.

This is contradictory to the original stated aim, which is:

This enhancement proposal seeks to strengthen interoperability of MCOA by unifying and
simplifying exporting of all MCOA telemetry data (metrics, logs, traces)

and the User Story provided:

As a fleet administrator, I want to export all telemetry signals collected by MCOA to an OTLP compatible endpoint(s).

What protocol the spokes use to speak to the central store is immaterial to the user, given that isn't something they need to be compatible with. It's compatible by default.

Member Author

The main objective of the proposal is to enable users to export data to 3rd party observability vendors with day two functional requirements (filtering, routing).

From the summary

This capability
enables users to send data from MCOA to any observability vendor and apply
fine-grained filtering and routing on exported data to configurable sinks.

1. Configure OTLP endpoint in MCO (`MultiClusterObservability`) CR.
2. The MCOA configures an additional OTLP exporter in the OpenTelemetry collector. The
exporter is in the pipeline that receives all data.
3. (optional) Filtering (e.g. for PII) can be configured in

Is the expectation that users will configure their own PII filtering? With platform metrics it's mostly pod names and IPs, which I guess could be anonymised, but that would render them useless for general platform troubleshooting, would it not? I'm still not clear on how this is supposed to be handled by users.

Furthermore, if they are writing metrics onto clusters that they own, PII becomes an authorization and deletion concern, not an ingestion concern. I would say this feature is mostly relevant when offloading your observability data to a vendor or third party (e.g. RHOBS or DataDog).

Member

I'm not clear as to how this would be easier/unified. Such filtering can be configured at the Prometheus scrape level or for the ClusterLoggingForwarder as well, and actually having them split ensures a user can be intentional about what exactly they want to filter. Such PII-filtering configuration for metrics can easily be set on scrape configs as needed if one really wants to filter out certain specific labels.

But @moadz raises a great point here, which is: when will I, as a user, need to censor my own metrics?

Member Author

Is the expectation that users will configure their own PII filtering?

Yes, and most likely mostly for user workload metrics. They could filter out platform data as well, but it will be their responsibility if they break the console.

Logs and traces are more important for PII than metrics.

I'm not clear as to how this would be easier/unified, Such filtering can be configured on prometheus scrape-level or for ClusterLoggingForwarder as well and actually having them split ensures a user can be intentional about what exactly they want to filter.

With MCO we are intentional about making it easy to provision and manage the entire stack and ultimately provide a good, integrated product experience. As a user, I would prefer to configure processing/filtering capabilities in a single API rather than on three different APIs/stacks (which could even have different processing/filtering capabilities).
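
To make the single-API filtering idea concrete, here is a minimal collector-side sketch. The attribute keys and the endpoint below are made up for the example, and it assumes the upstream `otlp` receiver, `attributes` processor and `otlp` exporter are available in the collector build.

```yaml
receivers:
  otlp:
    protocols:
      grpc: {}

processors:
  attributes/pii:
    actions:
      - key: user.email        # hypothetical PII attribute
        action: delete
      - key: client.address    # hypothetical PII attribute
        action: delete

exporters:
  otlp/thirdparty:
    endpoint: otlp.vendor.example.com:4317   # placeholder vendor endpoint

service:
  pipelines:
    logs:
      receivers: [otlp]
      processors: [attributes/pii]
      exporters: [otlp/thirdparty]
    traces:
      receivers: [otlp]
      processors: [attributes/pii]
      exporters: [otlp/thirdparty]
```

The same kind of processor list could be attached to the metrics pipeline as well, which is what "one API for all signals" would buy compared to configuring Prometheus relabeling and the ClusterLoggingForwarder separately.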

Comment on lines 120 to 125
To support the above workflow, MCOA deploys an additional collector which forwards all data to
the MCO telemetry store and/or a 3rd party OTLP endpoint.

- An `OpenTelemetryCollector` resource that enables receivers for all telemetry
signals and an OTLP exporter for forwarding to the MCO store and 3rd party
vendor.
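
A rough sketch of what such an `OpenTelemetryCollector` resource could look like is shown below. The resource name and endpoints are placeholders, not taken from the proposal, and the v1alpha1 API with an inline config string is assumed.

```yaml
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: mcoa-forwarder                             # placeholder name
spec:
  mode: deployment
  config: |
    receivers:
      otlp:
        protocols:
          grpc: {}
          http: {}
    exporters:
      otlp/hub:
        endpoint: mco-gateway.example.com:4317     # placeholder MCO store endpoint
      otlp/vendor:
        endpoint: otlp.vendor.example.com:4317     # placeholder 3rd party endpoint
    service:
      pipelines:
        metrics:
          receivers: [otlp]
          exporters: [otlp/hub, otlp/vendor]
        logs:
          receivers: [otlp]
          exporters: [otlp/hub, otlp/vendor]
        traces:
          receivers: [otlp]
          exporters: [otlp/hub, otlp/vendor]
```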

How is this solving for exporting to third parties? MCO runs on customer prem, on APIs that are versioned (on both sides) alongside the release of ACM that they are running on; it isn't a third party. So does the OTEL Collector have to encompass this ingress to MCO as well? What is the material benefit to the user if that is the case?

This also makes the assumption that most users would like to ingest their metrics in OTLP, which is again not the case. There are zero (0) ACM Observability RFEs that include facilitating OTLP between the spokes and the hub.

Member

I would also be curious as to whether this would be an efficient way of exporting data to a vendor.

There are usually limits to how many requests a vendor will ingest concurrently, and there might be rate limits as well.

For customers with a large number of spoke clusters, wouldn't first ingesting centrally and then exporting be a much more efficient way of doing this?

Member Author

There are usually limits to how many requests a vendor will ingest concurrently, and there might be rate limits as well.

@saswatamcode do you have any pointers for this? It seems counterproductive for a vendor to rate-limit their customers, who are usually billed per ingested data volume.

Member
@saswatamcode saswatamcode Jun 11, 2024

Yes, it's usually more profitable for a vendor to ingest as much as possible, but vendors would also need to protect their own infra or have user-set billing limits. I know Datadog had something like https://docs.datadoghq.com/api/latest/rate-limits/. Not sure about others.

Comment on lines 133 to 135
- MCOA configuration through the MultiClusterObservability: the MCO CR already has an extensive set of configuration fields; when designing the MCOA configuration, we will need to take extra care not to make this CR more complex and hard to navigate;
- MCOA manifest sync: with MCOA being deployed by MCO, we will need to set up a procedure to keep the MCOA manifests that live in the MCO repo up to date.
- CRD conflicts: MCOA will leverage the CRDs from other operators, so we will have to ensure that we do not run into situations where two operators are managing the same CRD

These are definitely valid drawbacks, but are broad and generic to MCOA/Open Cluster Management and Kubernetes as a whole, not specifically to the approach echoed here.

The main drawbacks as I see them are:

  • Highly centralised component responsible for all observability signals. By extension, if the OTEL Collector is down, you would get nothing exported out of your cluster, not even the mission-critical alerting that might tell you your collector is down, for example. If you're purely using it for vendor-driven alerting, that's probably fine. But a lack of in-cluster alerting, if you opt in for that topology, means you would never know if you stopped producing telemetry.
  • Lack of native HA redundancy in the OTEL Collector as a technology: how is its availability impacted by rollouts, for example? How fault tolerant is it going to be?

Member

To add to this,

  • Native OTLP ingestion is not stable for upstream projects like Prometheus and Thanos, and even where it works, it is not yet performant enough to replace remote write, and there are quite a few decisions to be taken on how to handle differences between the protocols.
  • The OTLP remote write exporter can be used, but this is also a slower path compared to directly remote-writing metrics, and such conversions from one protocol to another can result in broken semantics at times.


I think in this proposal @saswatamcode, there is a collector running in the hub that translates back to remote_write to avoid the drawbacks you mentioned.

Member

Yup either way we go, there are drawbacks.

Member Author

I think in this proposal @saswatamcode, there is a collector running in the hub that translates back to remote_write to avoid the drawbacks you mentioned.

No, this proposal does not imply running a collector on the hub that translates to protocols supported by the stores.

Member Author

Thanks for the link. The translator most likely incurs some CPU/mem cost. Are there some benchmarks that show that or even a change in the ingestion throughput?

Member

Between remote write vs native otlp? I don't think there is anything publicly available as a benchmark, maybe some prombench scenario. But the way this translator works is by translating otlp to prometheus remote write requests https://github.com/prometheus/prometheus/blob/main/storage/remote/otlptranslator/prometheusremotewrite/metrics_to_prw.go#L41 so it is doing additional work on top of regular remote write ingestion.

To get around this, vendors like Grafana have products like https://grafana.com/docs/grafana-cloud/send-data/alloy/

Hopefully over some time we see it become more native in Prometheus, and equally performant 🙂

Member Author

To get around this vendors like grafana have products like https://grafana.com/docs/grafana-cloud/send-data/alloy/

That is just a custom build of the OTEL Collector. We have the Red Hat build of OpenTelemetry: https://docs.openshift.com/container-platform/4.15/observability/otel/otel-configuration-of-otel-collector.html. In the next version we will add the Prometheus Remote Write exporter.
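
For illustration, a minimal collector pipeline that translates OTLP back to remote-write using the contrib `prometheusremotewrite` exporter might look like the sketch below; the endpoint is a placeholder, and exporter availability depends on the collector distribution in use.

```yaml
receivers:
  otlp:
    protocols:
      grpc: {}

exporters:
  prometheusremotewrite:
    endpoint: https://metrics-store.example.com/api/v1/receive   # placeholder remote-write endpoint

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheusremotewrite]   # OTLP-to-remote-write translation happens in this exporter
```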

Contributor

Here are some benchmark results, comparing

  1. Prometheus -> RW -> RW endpoint
  2. OTelCol -> OTLP -> OTLP endpoint
  3. OtelCol -> RW -> RW endpoint

https://github.com/danielm0hr/edge-metrics-measurements/blob/main/talks/DanielMohr_PromAgentVsOtelCol.pdf

I guess in scenario 3 this translation is happening, right?

Member

Yes, the 3rd one is translating on the export side.
The translation layer I mentioned above is a new one on ingest, i.e. OTelCol -> OTLP -> Prometheus/Thanos (as OTLP endpoint).

Comment on lines +200 to +201
exporting data to other systems with custom protocols (e.g. AWS CloudWatch,
Google Cloud Monitoring/Logging, Azure Monitor).

Google Cloud Monitoring and Azure Monitor support remote write already.

AWS CloudWatch does not, but AWS Managed Prometheus plugs into CloudWatch and accepts remote_write. Likewise for GCM and Google's managed Prometheus.

Member Author

I am not an expert here, but it seems the Azure Monitor requires a sidecar for ingesting remote-write. This applies to metrics; for logs and traces the solution might be different.

The intention is to simplify data exporting by providing a well-supported solution across telemetry signals supported by MCO. A unified approach will eliminate silos and overlapping product features.




Comment on lines +63 to +64
technology stacks (Prometheus, ClusterLoggingForwarder, OpenTelemetry collector).
Every tool uses a different configuration API, export protocol and provides (or does
Member

Would this really be a negative aspect? All these stacks are tailor-made to handle their representative signals, and all have their own user bases. I'm not clear as to why it is a bad thing if I don't want to configure my metrics the same way I configure my logs or my traces. Each serves a different utility, and as a user, I'd like to choose how to manage them separately.

Contributor

@saswatamcode This may be true, but I don't read that as the intent of this proposal; that seems orthogonal. I understand the intent as providing a way to unify the signals on a single protocol, which is OTLP, using OTEL semantics.


Signed-off-by: Pavol Loffay <p.loffay@gmail.com>
Signed-off-by: Pavol Loffay <p.loffay@gmail.com>
@pavolloffay
Member Author

@moadz regarding

If we're already using it for third parties, should we use a single protocol (OTLP via OTEL Collector) for exporting all metrics data to MCO?

I agree that it is an implementation detail how data is sent to the MCO telemetry store. We will need to support remote-write for existing deployments anyway. I see the value of unifying on OTLP for the hub store as well, if we can guarantee the non-functional requirements.

I have altered the proposal to use OTLP only for 3rd party stores.

Signed-off-by: Pavol Loffay <p.loffay@gmail.com>
@jan--f
Contributor

jan--f commented Jun 19, 2024

Moad makes a lot of good points from the technical perspective; I agree with all of them, especially the argument against moving the ACM-internal data streams to OTLP. I would like to raise another, perhaps slightly less technical point.
The title of this had me quite excited.
I think discussing APIs, especially when it comes to data export, instead of technology choices is a great idea. However, this proposal would be more aptly named Add MCO OTLP export. I thought this was possible already too.

Arguing about a technology-agnostic API would improve the scope of the discussion. Furthermore, we could hopefully avoid discussing this again in a while when the next hot new tech comes around. With an API in place we can then add, switch and deprecate technologies as needed.
I'm sure in practice it won't be as simple as I make it out to be here, but I do think separating what API would solve a problem from "we should use technology X" would scope arguments more effectively.

@pavolloffay
Member Author

Being explicit about the protocol is necessary here; our choice should take into account which systems we want to integrate with. Every system supports an explicit list of specific protocols. An important aspect is that providing support for another protocol depends on our internal architecture (e.g. it's easier to implement on a single component than on 3 different stacks) and on how we structure the high-level MCO CRD.

@jan--f
Contributor

jan--f commented Jun 26, 2024

Sure, I agree this proposal should include a protocol that is getting implemented. What I'm arguing for is to abstract the API layer such that other implementations are also possible. Alternatively, let's at least rename this to Add MCO OTLP export or similar.

I would strongly prefer a data export API that is technology agnostic as much as it can be.
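
To make the shape of such a technology-agnostic export API concrete, a hypothetical sketch is shown below. The `export` block and its fields are invented for illustration and are not part of the existing `MultiClusterObservability` API; the point is only that the protocol is named per sink rather than baked into the API.

```yaml
apiVersion: observability.open-cluster-management.io/v1beta2
kind: MultiClusterObservability
metadata:
  name: observability
spec:
  # Hypothetical export section -- NOT part of the current CRD.
  export:
    - name: third-party-vendor
      protocol: otlp                          # first implementation; other protocols could be added later
      endpoint: otlp.vendor.example.com:4317  # placeholder
      signals: [metrics, logs, traces]
```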

@openshift-bot

Inactive enhancement proposals go stale after 28d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle stale.
Stale proposals rot after an additional 7d of inactivity and eventually close.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 31, 2024
@iblancasa
Member

/remove-lifecycle stale

@openshift-ci openshift-ci bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 31, 2024
Signed-off-by: Pavol Loffay <p.loffay@gmail.com>
Contributor

openshift-ci bot commented Aug 21, 2024

@pavolloffay: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/markdownlint | 9af53af | link | true | /test markdownlint |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
