
Tail-Based Sampling - Scalability Issues #4758

Closed
vikrambe opened this issue Oct 21, 2019 · 45 comments
Labels
comp:jaeger Jaeger related issues

Comments

@vikrambe
Contributor

I like the idea of tail-based sampling. I would like to enhance tail-based sampling in two ways.

Consider a case where we are tail-sampling based on span->http.status_code=500.

  1. I want to sample 100% of all http.status_code "500"s and 10% of http.status_code "200"s. I would like to create a new sampler that combines the string_tag_filter and probabilistic samplers (a config sketch follows below).
  2. What happens when spans for a trace land in different OTel Collector instances? One instance could sample the trace when it finds http.status_code 500 while another decides not to sample. I was thinking of using groupcache to solve this problem.

Please share your thoughts on the above.
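For readers landing here later: with the tail_sampling processor that eventually shipped in opentelemetry-collector-contrib, this combination can be sketched roughly as follows. Policy names are illustrative, and this assumes http.status_code is stored as a string attribute (the numeric_attribute policy would be the equivalent for numeric values):

```yaml
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      # Keep 100% of traces that contain a span with http.status_code = "500"
      - name: errors-keep-all
        type: string_attribute
        string_attribute:
          key: http.status_code
          values: ["500"]
      # Keep roughly 10% of traces whose spans carry http.status_code = "200"
      - name: ok-keep-10-percent
        type: and
        and:
          and_sub_policy:
            - name: only-200s
              type: string_attribute
              string_attribute:
                key: http.status_code
                values: ["200"]
            - name: probabilistic-10
              type: probabilistic
              probabilistic:
                sampling_percentage: 10
```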
@vikrambe
Contributor Author

@songy23 @inigohu @tigrannajaryan Can you please share a sample config for the below?

  • Scale-out by putting one or more collectors behind a load balancer or k8s service, but the load balancer must support traceID-based routing (i.e. all spans for a given traceID need to be received by the same collector instance)

Most of the routing happens on headers; for this we would have to iterate over span batches and route spans based on their traceIDs. How does this work when we use Jaeger gRPC?

Refer: https://github.com/open-telemetry/opentelemetry-collector/blob/master/docs/performance.md

vikrambe referenced this issue in vikrambe/opentelemetry-collector Oct 27, 2019
@vikrambe
Contributor Author

Please review the fix for open-telemetry/opentelemetry-collector#1 above, i.e.:

I would want to sample 100% of all http.status_code "500"s and 10% of http.status_code "200"s. I would like to create a new sampler that combines string_tag_filter and probabilistic sampler.

vikrambe referenced this issue in vikrambe/opentelemetry-collector Oct 28, 2019
@joe-elliott
Contributor

joe-elliott commented Nov 12, 2019

An option with no external dependencies might look like:

  • Consistent hash ring communicated via a gossip based protocol. Hashicorp has a go implementation: https://github.com/hashicorp/memberlist
  • Upon receiving a span a collector hashes the trace id and forwards it to the appropriate collector

I believe this is very roughly what Cassandra does.

During scale up/down traces will be scattered across multiple collectors as the hash ring settles. I suppose these traces would then not be eligible for tail based sampling and the orphaned spans would automatically be flushed to the backend?

Edit:
After thinking about it further, a consistent hash ring is probably overkill and mod-N hashing is most likely fine. A scaling event will cause more orphaned spans, but this may be acceptable as long as it's documented. I continue to think a gossip protocol is a good no-external-dependency way of adding collector discovery.

I'm new to the codebase and still learning. Can this be implemented as a "rebalancing" pipeline processor that either forwards a span to a different collector or lets it continue through the current pipeline, based on the hash of the trace ID? I'm reviewing the code now with this in mind.

@annanay25
Contributor

I'd like to share the design document for scaling tail based sampling for review - https://docs.google.com/document/d/1HEI8E9VFcv_QoA8k4YLoO8Md7XWiE5vqE3pIhKJozTQ/edit?usp=sharing

/cc @open-telemetry/collector-maintainers @joe-elliott

@vikrambe
Contributor Author

@annanay25 I thought along the same lines, but making otel implement service discovery seems like overkill, and this design is not scalable at high load. Every otel collector instance starts distributing spans to the respective collector instances. It becomes more challenging when we have to unpack gRPC batches and redistribute spans. Failover scenarios and the tail sampling decision window add more edge cases and complexity. We need a consistent and reliable solution that will work at scale here.

@vikrambe
Contributor Author

@annanay25 @joe-elliott Maybe an in-memory distributed cache would help here. We could run the otel collector in two modes: one mode would only fill the cache, and the other would actually make tail sampling decisions based on the filter criteria. What do you think?

@vikrambe
Contributor Author

I'd like to share the design document for scaling tail based sampling for review - https://docs.google.com/document/d/1HEI8E9VFcv_QoA8k4YLoO8Md7XWiE5vqE3pIhKJozTQ/edit?usp=sharing

/cc @open-telemetry/collector-maintainers @joe-elliott

@annanay25 Load balancers support ring-hash routing based on a header value. The OTel Collector, when configured with the tail sampler, could unpack gRPC batches, group spans by a criterion such as TraceID or nodeId, and hit the ALBs with those batches, shipping spans to the respective otel collector instance handling the TraceID being processed.

@joe-elliott
Contributor

@vikrambe One of our goals with the proposed design was to avoid taking on any external dependencies. We recognize that at a certain scale, tools like load balancers or in-memory caches would definitely make things easier.

Perhaps a broader discussion is needed here about this particular feature and what kinds of external dependencies the otel-collector would be willing to take on?

@vikrambe
Contributor Author

vikrambe commented Dec 16, 2019

@joe-elliott I don't think we would need any external dependencies in the otel collector. To keep it simple and still solve this problem, one option I can think of is to enhance the otel collector's agent mode to batch spans and add headers that would help route the batches to the appropriate otel collector instance. Routing batches based on headers can be done by virtually any load balancer.

@jpkrohling
Member

One of our goals with the proposed design was to avoid taking on any external dependencies.

I can totally understand that, but if there's a high-quality embedded distributed caching library out there, it might be worth using it instead of implementing a new one from scratch. @vikrambe's suggestion of golang/groupcache might be worth investigating.

One advantage I can see in using groupcache instead of the proposed mod-N hash is that cache misses would not break tail-based sampling, as the data would be retrieved from a remote ring member.

@william-tran

If you can use kafka, I think we're pretty close: jaegertracing/jaeger#425 (comment)
Use two tiers of otel-collector with kafka in the middle.
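As a rough sketch of that two-tier layout, assuming the kafka exporter/receiver from contrib and its trace-ID partitioning option (the option and defaults shown here come from current contrib components and may not have existed when this was written):

```yaml
# Tier 1: stateless collectors write spans to Kafka, partitioned by trace ID,
# so all spans of a trace land on the same partition (and thus the same consumer).
receivers:
  otlp:
    protocols:
      grpc:
exporters:
  kafka:
    brokers: ["kafka:9092"]
    topic: otlp_spans
    protocol_version: 2.0.0
    partition_traces_by_id: true   # assumption: supported by the contrib kafka exporter
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [kafka]
---
# Tier 2: collectors consume from Kafka and apply tail-based sampling,
# each instance seeing complete traces.
receivers:
  kafka:
    brokers: ["kafka:9092"]
    topic: otlp_spans
    protocol_version: 2.0.0
processors:
  tail_sampling:
    decision_wait: 30s
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
exporters:
  otlp:
    endpoint: backend:4317
    tls:
      insecure: true
service:
  pipelines:
    traces:
      receivers: [kafka]
      processors: [tail_sampling]
      exporters: [otlp]
```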

@ehmke

ehmke commented May 28, 2020

I am finding this discussion very interesting. How does this relate to the comment in the last paragraph here: https://github.com/open-telemetry/opentelemetry-collector/blob/master/docs/performance.md

It says that we would need an LB supporting traceID-based routing to scale horizontally. Is this a practical option when using a certain combination of protocols? For example, I am sending batches of traces via thrift_http. I assume it would be enough if the batches carried only traces with the same trace ID and if every HTTP request had the trace-ID header of the traces it is transporting.

@secat

secat commented Jun 22, 2020

@annanay25 I have read your post on How Grafana Labs enables horizontally scalable tail sampling in the OpenTelemetry Collector.

Are you planning to contribute your new processor called the aggregate processor to the open-telemetry/opentelemetry-collector-contrib project?

@morigs
Contributor

morigs commented Aug 7, 2020

I think scalable tail-based sampling is a must-have feature and should live in the main repository.
Also, IMHO this feature is required for GA.
So should we wait for a PR from @annanay25 or implement something similar?

@secat

secat commented Oct 2, 2020

Is there any update on how to standardize horizontal scaling for tail-based sampling?

@jpkrohling
Member

jpkrohling commented Oct 2, 2020

Not yet, we started a discussion as part of open-telemetry/opentelemetry-collector#1724. I do plan on working on a PoC based on my last comments there. The idea would be that the load balancer would direct all spans for specific traces to the same collector, and that collector would be configured to do tail-based sampling.

@annanay25
Contributor

Apologies for being unresponsive here. I somehow managed to miss these notifications.

I would be more than happy to contribute the aggregate processor to the contrib repo. However, some concerns were raised in a previous PR where I attempted the same, so I would like to address them as well this time, and work towards something that the community has a consensus on. @jpkrohling has already done some amazing new work in this area, so it would be best to build upon that.

@jpkrohling
Member

Is this still relevant? Given that we have the load balancer in place, it should now be easy to build a scalable tail-based sampling solution.

@morigs
Contributor

morigs commented Jan 20, 2021

@jpkrohling Could you please describe in more detail the solution with a load balancer?

@jpkrohling
Member

Sure: you'll need two layers of collectors. The first layer is the load balancer(s), which would be a simple collector with the load-balancing exporter configured: https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/master/exporter/loadbalancingexporter

The second would be a cluster of collectors with the tail-sampling processor, or the groupbytrace processor + your own sampling processor, in case you have your own distribution.

If that's still too abstract, I can try to come up with a concrete example, but it might take a while.
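In the meantime, here is a minimal sketch of those two layers using the loadbalancing exporter's DNS resolver; hostnames, endpoints, and the sample policy are placeholders:

```yaml
# Layer 1: "load balancer" collectors that route spans by trace ID.
receivers:
  otlp:
    protocols:
      grpc:
exporters:
  loadbalancing:
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      dns:
        hostname: otelcol-sampling.observability.svc.cluster.local
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [loadbalancing]
---
# Layer 2: sampling collectors; each one receives all spans for the traces
# routed to it and can make a tail-based decision.
receivers:
  otlp:
    protocols:
      grpc:
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
exporters:
  otlp:
    endpoint: backend:4317
    tls:
      insecure: true
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling]
      exporters: [otlp]
```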

@vikrambe
Contributor Author

@jpkrohling I have a couple of queries.

  1. How will it handle async calls/long-lived transactions?
  2. Load balancing is one part of this issue. We also need some kind of rule engine to evaluate filter criteria, with AND/OR capabilities, to decide whether to sample the traces. Maybe we should create a separate issue to track it?

@vikrambe
Contributor Author

@jpkrohling Thanks for the loadbalancingexporter. This is helpful...

@jpkrohling
Member

For 1, not sure I understand the problem. Both the tail sampling processor and the groupbytrace will wait for a specific amount of time, under the assumption that all spans for the trace will have arrived by then.

The filter criteria are the second part of the tail sampling processor, known as the "policy" there. We started thinking about splitting this into its own processor, but it wouldn't bring anything that doesn't exist already, at least not in the first release.

If none of the existing processors/policies work for you, you can build your own and package it as part of your custom build of the otelcol.
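For completeness, such a custom build can be assembled with the OpenTelemetry Collector builder; a minimal manifest might look like the following, where the custom processor's module path and the version numbers are purely illustrative:

```yaml
# builder-config.yaml for the OpenTelemetry Collector builder (ocb)
dist:
  name: otelcol-custom
  output_path: ./dist
processors:
  # hypothetical module containing your own sampling/policy processor
  - gomod: github.com/example/customsamplingprocessor v0.1.0
  # plus anything you still want from contrib, e.g. the tail sampling processor
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/processor/tailsamplingprocessor v0.48.0
# receivers/exporters entries omitted for brevity
```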

@morigs
Contributor

morigs commented Jan 21, 2021

Wow, this is really interesting; I didn't know that. I hope this pattern will be well documented in the future.
My only concern is that the loadbalancingexporter is in the contrib repository, which doesn't feel quite official.
Maybe it should be moved to the main collector repo then.

@jpkrohling
Member

Where would you expect to see this information? I can create some documentation about it. This belongs in contrib because it's still considered a "niche" feature. The more people use this feature and provide feedback, the higher the chances it will be moved to core ;-)

@morigs
Contributor

morigs commented Jan 21, 2021

I don't think this is possible right now, but I would expect to see the balancing setup documentation somewhere here.
I think load balancing is a must, because scaling is impossible without it.
It would be cool to involve the core team members in this discussion

@jpkrohling
Member

When not doing tail-based sampling, you can do load balancing using the regular techniques. This specialized load balancer is needed only if you need to ensure that the same trace ID is consistently sent to the same backend.

@morigs
Contributor

morigs commented Jan 22, 2021

I agree. But IMHO tail-based sampling is a quite common scenario.

@heruv1m

heruv1m commented May 26, 2021

@jpkrohling is there a solution for the question from the issue topic?

I would want to sample 100% of all http.status_code "500"s and 10% of http.status_code "200"s

It seems that tail sampling is unable to set a percentage of samples.

@jpkrohling
Member

You are right -- policies can't be chained at the moment, though this is something we discussed in the past.

@heruv1m

heruv1m commented May 26, 2021

Can I do that sampling in another way?

@jpkrohling
Member

Not with the current processors. Building one isn't that complicated, though. Here's what the policy parts of tail-based sampling look like when split from that processor: jpkrohling@fbac498

@vikrambe
Contributor Author

@heruv1m Here is the implementation. We can use the loadbalancingexporter to co-locate spans belonging to the same traceID on a single OpenTelemetry Collector instance. Once all the spans for a given traceID are co-located in a single collector instance, you can use the implementation below. It's not available in the master repo, as the implementation was turned down, but you can build your own OpenTelemetry Collector binary with the changes below.

TestCase: vikrambe/opentelemetry-collector@b3372f2
Implementation: vikrambe/opentelemetry-collector@1ac1887

@heruv1m @jpkrohling I have an implementation of a composite policy here: open-telemetry/opentelemetry-collector@1a3aed2. Let me know if it helps; I can open a PR on the contrib repo.

Design document for the composite policy: https://docs.google.com/document/d/10wpIv3TtXgOik05smHm3nYeBX48Bj76TCMxPy8e1NZw/edit#heading=h.ecy5l2puwtp4
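For reference, the composite policy as it later landed in the contrib tail sampling processor lets you order sub-policies and split a spans-per-second budget between them; a rough sketch along the lines of the original request (names and limits are illustrative):

```yaml
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: composite-status-policy
        type: composite
        composite:
          max_total_spans_per_second: 1000
          policy_order: [errors-500, ok-sampled]
          composite_sub_policy:
            # Always keep traces containing http.status_code = "500"
            - name: errors-500
              type: string_attribute
              string_attribute:
                key: http.status_code
                values: ["500"]
            # Keep roughly 10% of everything else
            - name: ok-sampled
              type: probabilistic
              probabilistic:
                sampling_percentage: 10
          rate_allocation:
            - policy: errors-500
              percent: 50
            - policy: ok-sampled
              percent: 50
```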

@heruv1m

heruv1m commented May 26, 2021

@vikrambe I think your processor is what I need.
A PR in the contrib repo would be nice.

@karthikinamanamelluri

karthikinamanamelluri commented Jul 5, 2021

@vikrambe
Could you please refer to the contrib PR here or to the fork if created out of contrib.

bogdandrutu transferred this issue from open-telemetry/opentelemetry-collector Aug 20, 2021
@vikrambe
Contributor Author

@vikrambe
Could you please refer to the contrib PR here or to the fork if created out of contrib.

#4396 -> Planning to work on the review comments over the weekend.

alolita added the comp:jaeger Jaeger related issues label Sep 8, 2021
@XiaoWeiKIN

it's just what I need

@vikrambe
Contributor Author

#4396

#4958 -> This is already merged. Use the composite tail sampling policy.

@jpkrohling
Member

@vikrambe, do you think there's still something to do here?

@vikrambe
Contributor Author

vikrambe commented Feb 8, 2022

@vikrambe, do you think there's still something to do here?

@jpkrohling I think we have all the pieces here to solve this problem at scale now.

  1. groupByTraceProcessor
  2. loadbalancingexporter
  3. tailSampler with the composite policy; with the latest changes we can also add an AND clause to the policy

We can close this issue. Should we put all these pieces together in a document/blog post (a minimal sketch of how they compose follows below)? Please suggest.
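A compact sketch of how the sampling-tier pipeline composes those pieces (the first tier runs the loadbalancing exporter shown earlier in this thread; processor names as in contrib):

```yaml
service:
  pipelines:
    traces:
      receivers: [otlp]
      # groupbytrace buffers and groups spans per trace before the tail sampler runs
      processors: [groupbytrace, tail_sampling]
      exporters: [otlp]
```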

@jpkrohling
Member

Documentation would be great, perhaps starting as a blog post and migrating later to the official docs. Are you up to writing something?

@vikrambe
Contributor Author

vikrambe commented Feb 8, 2022

Documentation would be great, perhaps starting as a blog post and migrating later to the official docs. Are you up to writing something?

Sure, @jpkrohling, I would be happy to write a blog post on this topic, but I would definitely need your review and suggestions. I will start working on a draft and share it soon.

@codeboten
Contributor

codeboten commented Jun 27, 2022

We can close this issue.

Closing the issue as per the comment above

@AkselAllas

I have a question though.

Is there a way to make the loadbalancingexporter collector layer highly available, to ensure no traces get lost due to downtime? 🤔

@jpkrohling
Member

You can place a regular load balancer in front of the cluster of trace-ID load balancers. Just make sure they have a consistent view of the ring; otherwise, each node will forward data to a different backing collector. The current DNS resolver can still be used in this scenario, but make sure to set the refresh periodicity to a low value, so that the nodes are out of sync only for brief moments.
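In configuration terms, that refresh periodicity is the DNS resolver's interval on the loadbalancing exporter; something along these lines, where the hostname and values are placeholders:

```yaml
exporters:
  loadbalancing:
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      dns:
        # headless service resolving to every sampling-tier collector instance
        hostname: otelcol-sampling-headless.observability.svc.cluster.local
        interval: 5s   # low refresh so all load-balancer nodes converge on the same ring quickly
        timeout: 1s
```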
