Can auto-generated shard messages be omitted from a trace transaction with manual span management? #68
Ok, I'll look forward to meeting with you on Friday. My first thought here is that these should possibly be filtered out via OpenTelemetry tracing; I'll see if I can prototype something in that regard today. I just wrote an implementation last week that uses OTel metrics views to filter out unwanted metrics: https://phobos.petabridge.com/articles/trace-filtering.html#filtering-metrics
My worry is that dropping 1-2 spans in the middle of a transaction might break the entire trace, but I'll have to test it.
I'll put together a prototype prior to our meeting and report my findings on this thread.
This is my fear too, because the 4 spans above form a parent-child hierarchy: 1 to 2 to 3 to 4. Simply filtering out the two intermediate spans can break the whole chain. If all spans belonged to the same node, I could simply pass the 1st span's reusable context to the 4th span and the transaction would continue. But spans 2-3 are on the edge of the cluster node. What I would like is to get hold of the UsableContext that is passed to the auto-generated span for the shard message, without auto-generating those spans. That would keep only explicitly created spans in the trace transaction, shaping it perfectly.
With Akka.Cluster.Sharding there are no guarantees that will be the case.
So it looks like using an OpenTelemetry filter processor works: You'll get some of these errors inside Jaeger, but the trace still renders:
My source code for the processor:

    using System.Collections.Generic;
    using System.Diagnostics;
    using System.Linq;
    using OpenTelemetry;

    /// <summary>
    /// Excludes all spans created by the built-in sharding actors
    /// </summary>
    public class ExcludeShardingProcessor : CompositeProcessor<Activity>
    {
        public ExcludeShardingProcessor(IEnumerable<BaseProcessor<Activity>> processors) : base(processors)
        {
        }

        public override void OnEnd(Activity data)
        {
            // Spans tagged with a sharding actor type never reach the wrapped processors.
            if (data.Tags.Any(c => c.Key.Equals("akka.actor.type")
                                   && c.Value != null
                                   && c.Value.Contains("Akka.Cluster.Sharding")))
                return; // filter out

            base.OnEnd(data);
        }
    }
And then integrating the processor into my APM pipeline:

    var resource = ResourceBuilder.CreateDefault()
        .AddService(Assembly.GetEntryAssembly().GetName().Name, serviceInstanceId: $"{Dns.GetHostName()}");

    // enables OpenTelemetry for ASP.NET / .NET Core
    services.AddOpenTelemetryTracing(builder =>
    {
        builder
            .SetResourceBuilder(resource)
            .AddPhobosInstrumentation()
            .AddHttpClientInstrumentation()
            .AddAspNetCoreInstrumentation()
            .AddProcessor(new ExcludeShardingProcessor(new[]
                { new SimpleActivityExportProcessor(new JaegerExporter(new JaegerExporterOptions())) }));
    });

You have to wrap your OTEL exporter inside the filter per the OTEL .NET APIs, which for some reason is causing my
Reported issue in the OTEL repo: open-telemetry/opentelemetry-dotnet#3603
Also, here's my PR where all of the work was performed: petabridge/akkadotnet-code-samples#104
Thank you so much @Aaronontheweb for such a thorough investigation. I am going to give it a try.
@Aaronontheweb I tried using a custom processor inherited from CompositeProcessor. It filtered out the specified spans, but unfortunately it broke the trace transaction into smaller pieces. Example: I let Phobos manage my traces by setting create-trace-upon-receive to "on". It generated the following trace transaction chain (showing operation names):
Note there are many messages of type Microsoft.FSharp.Core.FSharpResult. So I added a filter to my processor to exclude spans matching the condition data.OperationName.Contains("Microsoft.FSharp.Core.FSharpResult"). Then the following traces were generated:
So all Microsoft.FSharp.Core.FSharpResult spans are gone, but the trace transaction is broken in their places. Unfortunately this doesn't solve the issue.
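Why dropping a span at export time splits the trace can be shown with a toy model. The sketch below is only an illustration: `Span` is a hypothetical record, not an OpenTelemetry or Phobos type, and the span names are made up apart from the "akka.msg.recv" prefix quoted in this issue. Each span stores only its parent's ID, so once a parent is filtered out, the surviving child still points at an ID the tracing backend never receives.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, simplified span model (not an OpenTelemetry or Phobos type).
public record Span(string Id, string? ParentId, string Name);

public static class OrphanDemo
{
    public static void Main()
    {
        // The 1 -> 2 -> 3 -> 4 chain described in this issue; names are illustrative.
        var spans = new List<Span>
        {
            new("1", null, "rabbitmq.receive"),
            new("2", "1", "akka.msg.recv (shard proxy)"),
            new("3", "2", "akka.msg.recv (shard region)"),
            new("4", "3", "handler.process"),
        };

        // Drop the auto-generated spans, as a filtering processor would.
        var exported = spans.Where(s => !s.Name.StartsWith("akka.msg.recv")).ToList();
        var exportedIds = exported.Select(s => s.Id).ToHashSet();

        foreach (var span in exported)
        {
            // A span is orphaned when its parent ID refers to a span that was never exported.
            var orphaned = span.ParentId != null && !exportedIds.Contains(span.ParentId);
            var parent = span.ParentId ?? "none";
            Console.WriteLine($"{span.Name}: parent={parent}, orphaned={orphaned}");
        }
    }
}
```

In this model span 4 survives the filter but still names span 3 as its parent, which matches the broken chains described above.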
Hmm, reading the release notes for the latest OpenTelemetry: "CompositeProcessor will now ensure ParentProvider is set on its children (#3368)". I am on an earlier version. Will need to try the latest one.
Nope, upgrading to OpenTelemetry version 1.4.0-alpha.2 didn't help. Using CompositeProcessor to filter some messages breaks the trace transaction in those places.
You can probably use a
@Aaronontheweb names are not a problem. I chose FSharpResult as an example of message types to try to filter out, and it broke the trace transaction. If I instead filter out tags with "Akka.Cluster.Sharding", the transaction will also be broken. As long as it's impossible to retain the span hierarchy when excluding certain spans, filtering is of little use for composing distributed traces.
The more I look into it, the more I doubt anything can be altered using CompositeProcessor. It works with activities that are already set up, and not much about them can be altered. For example, although Activity has a SetParentId method, the documentation states that "This method should only be used before starting the Activity object. This method has no effect if you call it after the Activity object has started".
Living with some of the noise from Phobos might be the best option in this case then, unfortunately.
I'm looking into two approaches here, revisiting this issue after some months away from it:
The problem we have with the "hiding" approach in Sharding is the buffering inside the
The "compression" approach would work after the fact - I'd have to buffer groups of unprocessed spans that are all gathered locally on a single node and flatten them into a single operation. I haven't looked at how feasible that is exactly, but I think it might actually be simpler to implement and will probably reduce the performance impact on the sharding system.
I believe our solution to #69 addresses this too - but if that solution doesn't quite work we have some more ideas for hardening it.
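As a rough illustration of what flattening buffered spans could look like, here is a minimal sketch. It is not how Phobos implements anything: `Span`, `Compressor`, and `Compress` are hypothetical names, and the sketch assumes the dropped spans and their children are all present in the same local buffer. It works on a simplified span model rather than on started Activity objects, whose parent cannot be changed. Dropped spans are removed and each surviving span is re-parented to its nearest retained ancestor.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, simplified span model (not an OpenTelemetry or Phobos type).
public record Span(string Id, string? ParentId, string Name);

public static class Compressor
{
    // Removes spans matching `drop` and re-parents the survivors
    // to their nearest ancestor that was not dropped.
    public static List<Span> Compress(IReadOnlyList<Span> buffered, Func<Span, bool> drop)
    {
        var byId = buffered.ToDictionary(s => s.Id);

        string? NearestKeptAncestor(string? parentId)
        {
            // Walk up while the parent is in the local buffer and is being dropped.
            // A parent that is not in the buffer (e.g. on another node) is left untouched.
            while (parentId != null
                   && byId.TryGetValue(parentId, out var parent)
                   && drop(parent))
            {
                parentId = parent.ParentId;
            }
            return parentId;
        }

        return buffered
            .Where(s => !drop(s))
            .Select(s => s with { ParentId = NearestKeptAncestor(s.ParentId) })
            .ToList();
    }

    public static void Main()
    {
        // Illustrative 1 -> 2 -> 3 -> 4 chain; span names are made up.
        var buffered = new List<Span>
        {
            new("1", null, "rabbitmq.receive"),
            new("2", "1", "akka.msg.recv (shard proxy)"),
            new("3", "2", "akka.msg.recv (shard region)"),
            new("4", "3", "handler.process"),
        };

        var compressed = Compress(buffered, s => s.Name.StartsWith("akka.msg.recv"));

        foreach (var span in compressed)
        {
            var parent = span.ParentId ?? "none";
            Console.WriteLine($"{span.Name}: parent={parent}");
        }
    }
}
```

With the whole chain in one buffer, span 4 ends up parented directly to span 1. When a dropped span's parent lives on another node, the walk stops at the buffer boundary, which is the cross-node case raised earlier in this thread.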
I believe this is a rather specific question related to manual span creation, so I booked an hour with @Aaronontheweb to demonstrate and discuss this issue. I will briefly describe it here.
Setup:
We have an Akka cluster where incoming requests (via RabbitMQ) spawn transactions consisting of a series of messages sent between several cluster nodes using cluster sharding. We took over trace span creation by setting create-trace-upon-receive to "off". We are using the latest Phobos release (2.1.0 beta1).
Issue:
Every shard message adds 2 additional short spans (duration less than 1ms) to the trace transaction.
Example:
1 span. A queue message is fetched from a RabbitMQ queue. We start a new activity. The message is sent to a handler via cluster sharding.
2 span. A new span is added with actorType "Akka.Cluster.Sharding.ShardRegion" and actor path "/system/sharding/mediasetProxy" (this is a shard proxy).
3 span. A new span is added with actorType "Akka.Cluster.Sharding.ShardRegion" and actor path "/system/sharding/mediaset" (this is the actual shard, not the proxy).
4 span. The message is received by the handler of the shard message, which creates its own custom trace span.
A full trace transaction typically contains 40-50% auto-generated spans from shard messages (2 and 3 above). I checked one such transaction: it had 89 trace spans, and 40 of them were auto-generated shard messages named "akka.msg.recv XXX", where XXX is the name of a shard message type.
Prior to the release of Phobos 2.1 no such messages were present in a trace, but there was a bug in the implementation and sending shard messages broke the trace transaction. The new version fixed this problem, and we now manage to compose long trace transactions from the spans we create ourselves. However, the ability to suppress auto-generated "akka.msg.recv XXX" messages would make the trace more compact and easier to understand.