Update example config for handling tail-sampling and span metric generation when horizontally scaling collectors #6260

Open · wants to merge 2 commits into main from scaling-stateful-collectors-detailed-example
Conversation

@swar8080 (Contributor) commented on Feb 9, 2025

Hello, we used this page's documentation when setting up tail sampling and span metric generation. The guide was helpful, but there were a couple of things it could have mentioned that would have made our implementation smoother:

  1. The guide currently suggests two separate collector deployments: one just for load-balancing spans and another for processing the load-balanced spans. We started with two deployments but found it easier to maintain a single deployment of collectors responsible for both tasks. However, it wasn't obvious (to us at least) that this was possible, and I'm guessing others would prefer this approach too.
  2. There wasn't any example configuration showing how the load-balancing exporter, span metrics connector, and tail sampling processor all fit together, so it took some jumping between READMEs, and some unintentional violations of the single-writer assumption (which we weren't aware of), to get it right.

So this documentation change shows how all of the components work together in a single collector deployment, roughly along the lines sketched below.
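The sketch below is illustrative only: the ports, the DNS resolver hostname, the sampling policy, and the backend exporter are placeholders, not values taken from the PR itself. Each collector instance both load-balances incoming spans and processes the load-balanced streams.

```yaml
receivers:
  otlp:                        # spans arriving from applications
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
  otlp/for_tail_sampling:      # spans load-balanced to this instance by trace ID
    protocols:
      grpc:
        endpoint: 0.0.0.0:14317
  otlp/for_span_metrics:       # spans load-balanced to this instance by service name
    protocols:
      grpc:
        endpoint: 0.0.0.0:24317

exporters:
  loadbalancing/tail_sampling:
    routing_key: traceID       # all spans of one trace land on the same instance
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      dns:
        hostname: collector-headless.observability.svc.cluster.local  # placeholder
        port: 14317
  loadbalancing/span_metrics:
    routing_key: service       # all spans of one service land on the same instance,
                               # preserving the single-writer assumption for metrics
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      dns:
        hostname: collector-headless.observability.svc.cluster.local  # placeholder
        port: 24317
  otlp/backend:
    endpoint: backend.example.com:4317  # placeholder

processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: keep-errors      # placeholder policy
        type: status_code
        status_code:
          status_codes: [ERROR]

connectors:
  spanmetrics: {}

service:
  pipelines:
    traces:                    # fan incoming spans out to both load-balanced streams
      receivers: [otlp]
      exporters: [loadbalancing/tail_sampling, loadbalancing/span_metrics]
    traces/tail_sampling:      # consumes the trace-ID-routed stream
      receivers: [otlp/for_tail_sampling]
      processors: [tail_sampling]
      exporters: [otlp/backend]
    traces/span_metrics:       # consumes the service-routed stream
      receivers: [otlp/for_span_metrics]
      exporters: [spanmetrics]
    metrics/span_metrics:
      receivers: [spanmetrics]
      exporters: [otlp/backend]
```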

@swar8080 swar8080 requested a review from a team as a code owner February 9, 2025 19:10
@swar8080 swar8080 force-pushed the scaling-stateful-collectors-detailed-example branch from dd7bf67 to eb5c3ab on February 9, 2025 19:10
@opentelemetrybot opentelemetrybot requested review from a team and bogdandrutu and removed request for a team February 9, 2025 19:11
…ing exporter, tail-sampling processor, and span metrics connector together when scaled to multiple collector instances.

Also removes language and configuration suggesting that load balancing should be a separate deployment of collectors from the collectors doing the tail sampling and span metric generation. It's easier to maintain a single deployment responsible for both load balancing and processing of the load-balanced data, but the pattern for doing this may not be obvious at first.
@swar8080 swar8080 force-pushed the scaling-stateful-collectors-detailed-example branch from eb5c3ab to b184128 on February 9, 2025 19:14
@opentelemetrybot opentelemetrybot requested a review from a team February 9, 2025 19:14
@svrnm (Member) commented on Feb 10, 2025

thanks @swar8080! @open-telemetry/collector-approvers PTAL

@jpkrohling (Member) left a comment


I like the change, but I'd really prefer to have three collector instances (three config files) instead of one with three pipelines. The reason is that each pipeline (load-balancer, tail-sampler, span metrics) has a different load profile and would scale differently.

You absolutely CAN do it as depicted here, but since not everyone will understand the nuances, I'd prefer the official documentation to have each pipeline be its own deployment.
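To sketch what that split could look like, pipeline sections only (the file names and wiring below are illustrative, not part of this PR):

```yaml
# collector-gateway.yaml: only receives from applications and load-balances
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [loadbalancing/tail_sampling, loadbalancing/span_metrics]

# collector-tail-sampling.yaml: consumes the trace-ID-routed stream
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling]
      exporters: [otlp/backend]

# collector-span-metrics.yaml: consumes the service-routed stream
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [spanmetrics]
    metrics:
      receivers: [spanmetrics]
      exporters: [otlp/backend]
```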


    traces/span_metrics:
      receivers:
        - otlp/for_tail_sampling

Suggested change:
-        - otlp/for_tail_sampling
+        - otlp/for_span_metrics

@swar8080 (Contributor, Author) commented

Hi @jpkrohling, thanks for reviewing

Separating span metrics and tail sampling into different collectors is a good callout. A spike in span volume or a span metric cardinality explosion would likely only cause problems for one deployment and not the other. I'll go ahead and change this into three collector configurations, or two if the below persuades you otherwise :).

Maybe we missed the benefits of load balancing being its own deployment. Both setups have to receive incoming spans, and load balancing to the same deployment has twice as many receives and exports. But it seems like the extra memory for load balancing is dwarfed by the memory needed for tail sampling and span metrics, which also grows with span volume. For CPU, the load-balancing exporter used a lot before this optimization, but now our pprof shows it as a small percentage of total CPU time. So for us it didn't seem worth the effort of maintaining another deployment, which is why we ended up consolidating it. That saved us another deployment to monitor, and one fewer file to jump between when working on our collector config, since we did some filtering/edits before load balancing. So maybe two separate deployments, each load balancing to itself, would be a good setup for a lot of users? Roughly like the sketch below.
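The tail-sampling deployment is shown; the span-metrics one would have the same shape, with `routing_key: service` and the spanmetrics connector instead. Pipeline and receiver names are just placeholders for illustration:

```yaml
# Sketch of one of the two deployments (tail sampling). It receives spans
# from applications and load-balances them by trace ID back to a second
# OTLP receiver on the same deployment, so no separate gateway deployment
# is needed.
service:
  pipelines:
    traces/incoming:
      receivers: [otlp]                # spans from applications
      exporters: [loadbalancing]       # routing_key: traceID, resolves to this deployment
    traces/tail_sampling:
      receivers: [otlp/load_balanced]  # target of the loadbalancing exporter
      processors: [tail_sampling]
      exporters: [otlp/backend]
```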
