[Feature Request] Improve detection of parent-child relationships #537
Comments
Thanks for sharing, Daniel. The team will get back to you here after discussions.
There is some legacy code that uses tags to generate the service graph; we hope to get rid of it very soon. However, there is an alternate and better logic working in parallel that computes the service graph using three IDs, namely TraceId, SpanId, and ParentSpanId, plus a serviceName.

How does it work? If you notice, B's span is a child of A's, and this parent-child relationship is used by Haystack to generate an edge between the two service nodes. Note that when B calls its downstream service (say C), B's client span will be a child of B's server span. But since these two spans come from the same service (B in this case), we don't create a loopback edge.

Regarding your ask to show a third-party system that doesn't support tracing as a graph node, we can think of using the peer.service tag instead of extending the IDL.

If you look at the current implementation, for every span we wait a few seconds (configurable) for its parent or child span to arrive. If we don't receive a pair, we ignore that span altogether. We can add a fallback to use the 'peer.service' tag, but that might conflict with a case where the span pair came late and was therefore missed. In such a scenario, we would show two edges, one for peer.service and another for the actual service, if they have different names: A calls peer.service 'foo', but foo calls itself 'bar'. I need to evaluate whether the fallback strategy is OK; maybe turn this on behind a feature flag?
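The pairing-and-edge logic described above can be sketched roughly as follows. This is a minimal illustration of the idea (match spans on TraceId/SpanId/ParentSpanId, emit an edge from the parent's service to the child's, skip same-service loopbacks), not Haystack's actual code; all class and field names here are made up.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical span model: just the four fields the graph logic needs.
class Span {
    final String traceId, spanId, parentSpanId, serviceName;
    Span(String traceId, String spanId, String parentSpanId, String serviceName) {
        this.traceId = traceId; this.spanId = spanId;
        this.parentSpanId = parentSpanId; this.serviceName = serviceName;
    }
}

class GraphEdgeBuilder {
    // For every span, look up its parent by (traceId, parentSpanId).
    // If found and the services differ, emit "parent -> child";
    // same-service pairs (B's client span under B's server span) are
    // skipped so no loopback edge is created.
    static Set<String> edges(List<Span> spans) {
        Map<String, Span> byId = new HashMap<>();
        for (Span s : spans) byId.put(s.traceId + "/" + s.spanId, s);

        Set<String> edges = new LinkedHashSet<>();
        for (Span child : spans) {
            Span parent = byId.get(child.traceId + "/" + child.parentSpanId);
            if (parent == null) continue;                                 // no pair in the window
            if (parent.serviceName.equals(child.serviceName)) continue;   // loopback, skip
            edges.add(parent.serviceName + " -> " + child.serviceName);
        }
        return edges;
    }
}
```

For A calling B calling C, B's server span links to A's span (edge A -> B), B's client span under B's server span is skipped, and C's span links to B's client span (edge B -> C).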
Thanks for your time @ashishagg.
This field is too important to be relegated to a tag, and even if it weren't, the OpenTracing peer convention is neither widely nor consistently used. You should consider a field. Long history and context below.

Bear in mind that the original concept of peer.service came literally from Zipkin-based conversions. For example, it was conventionally "sa" (server address) or "ca" (client address) in Zipkin. When Yuri decided to make the tag named "peer.service", Uber was still a Zipkin site and probably wanted something easier than the old binary annotations we had. They knew the limitations of looking up based on different names ("sa" vs "ca"), so one name ("peer.service") was a bit of an improvement back then. Note that Yuri and Ben started and decided this in 2 days while a lot of folks were on holiday (27-29 Dec 2015). Notably, "service" was left out until several years later!

Meanwhile, Zipkin knew where the bodies were buried... we all suffered them but wanted to always remain backwards compatible from a data point of view. Fixing the root problem requires buy-in from more than two people, though; at least that's been the case in Zipkin. We also need tests to pass. For example, we can't just define something and not use it in practice (such as in a service graph).

In early 2016, we started an effort to get rid of the confusion around the remote side and fix the model. This was called v2. Through that process, we noted some other things. Firstly, "peer" is an awkward and bad name; it likely wouldn't have been used if others had been involved in OpenTracing's PR besides just Ben and Yuri. Anyway, "remote" was a far more accessible term. Next, indexing nested collections has performance problems and merging complexity, and the concept of remote is quite a primary concern. So a top-level remote service field is both more efficient and easier to map to APIs. PS: yes, APIs! Because this is too important to be relegated to a tag.
In early 2017 we had the v2 format including this, and Brave 4, which had an API for it as well. This made things much simpler on the user and data sides. Anyway, instead of using a tag naming convention, which would only be present on a subset of instrumentation, I'd just add a remoteServiceName field to your data instead. You already have a serviceName field, and this makes the most sense intuitively, versus having the local service as a field and relegating the remote one to a tag naming convention! Plus, you can always map "peer.service" in the rare case it is present, just like you map into your span service field today. Meanwhile, many Zipkin-based instrumentations have top-level APIs for service, including remote service, so you'd get a stronger mapping when that's the case as well.

So, would you consider just doing this with a model field? I'd be more likely to help with the work involved if you agree.
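The mapping suggested above (a first-class remoteServiceName field, with "peer.service" consumed only as a legacy fallback when present) might look something like this sketch. The method and parameter names are illustrative, not from either codebase.

```java
import java.util.Map;
import java.util.Optional;

class RemoteServiceResolver {
    // Prefer the top-level remoteServiceName field; fall back to the
    // legacy OpenTracing "peer.service" tag in the rare case it is set.
    static Optional<String> remoteService(String remoteServiceName, Map<String, String> tags) {
        if (remoteServiceName != null && !remoteServiceName.isEmpty()) {
            return Optional.of(remoteServiceName);
        }
        return Optional.ofNullable(tags.get("peer.service"));
    }
}
```

This keeps the field authoritative while still absorbing data from the subset of instrumentation that only emits the tag.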
PS: if it helps, here's our dependency linker code, which covers a bunch of edge cases as well, for example people making circular links: https://github.com/openzipkin/zipkin/blob/master/zipkin/src/main/java/zipkin2/internal/DependencyLinker.java

Since the Haystack data model is somewhat like Zipkin v1.5 ( :) ), the following mapper might also be of interest: https://github.com/openzipkin/zipkin/blob/master/zipkin-storage/mysql-v1/src/main/java/zipkin2/storage/mysql/v1/DependencyLinkV2SpanIterator.java
Good to read the history behind these 'tags' 👍 My concern with putting remoteServiceName in the data model is about ownership and uniformity. I will illustrate this with an example of two services, A calling B. The choice of remoteServiceName is entirely up to A, and B may not call itself 'B' in its own span (serviceName). The two services can set whatever they desire, which is correct. But on the service-graph side, if it only depends on serviceName, semantics remain intact. However, using both attributes (serviceName and remoteServiceName) with different naming may complicate things? remoteServiceName makes absolute sense if there is uniformity of service names across the service mesh. Your thoughts?
> My concern in putting remoteServiceName in the data model is about ownership
> and uniformity. I will illustrate this with an example of two services, A
> calling B. The choice of remoteServiceName is entirely up to A, and B may
> not call itself 'B' in its own span (serviceName).
We solve this simply by not allowing a caller to override the callee's name. In other words, you are authoritatively the decider of who "you" are. In this respect, remoteServiceName only helps with uninstrumented incoming traffic (e.g. from Kinesis) or uninstrumented outbound traffic (e.g. to MySQL). Make sense?
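The rule above (the callee is authoritative for its own name; the caller's remoteServiceName is only consulted when no callee span arrived, i.e. the destination is uninstrumented) could be sketched like this. All names here are hypothetical, for illustration only.

```java
class CalleeResolver {
    // calleeServiceName is the serviceName the callee reported on its own
    // span, or null if the destination is uninstrumented and no span came.
    static String calleeName(String callerRemoteServiceName, String calleeServiceName) {
        // An instrumented callee decides who "it" is; that always wins.
        if (calleeServiceName != null) return calleeServiceName;
        // Otherwise fall back to what the caller thinks it called
        // (e.g. "mysql", "kinesis") so the destination still shows up.
        return callerRemoteServiceName;
    }
}
```

So in the "A calls peer.service 'foo' but foo calls itself 'bar'" scenario, the graph shows 'bar' whenever foo's own span arrives, and 'foo' only when it doesn't.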
> The two services can set whatever they desire, which is correct. But from
> the service-graph side, if it only depends on serviceName, semantics remain
> intact. However, using both attributes (serviceName and remoteServiceName)
> with different naming may complicate things?
They serve different purposes. If I call MySQL, that doesn't mean I am MySQL.
> remoteServiceName makes absolute sense if there is uniformity across
> the service mesh for service names.
I don't see the relationship here. remoteServiceName is optional, and when present it offers a service name for a destination that *might not* be instrumented. If that destination is instrumented, then your linker ignores it.

Another way is to propagate your service name downstream in a header; that can be an authoritative remote service name. I know Netflix spoke about this, as it helps with streaming aggregation (though some deduping is still needed). Still, when calling downstream you'd want to indicate something, or else you'd not see any virtual destinations.
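The header-propagation alternative mentioned above could look like this sketch: the caller injects its own service name into a request header, so the callee or an aggregator has an authoritative caller name. The header name here is made up; any agreed-upon convention would do.

```java
import java.util.Map;

class ServiceNamePropagation {
    // Hypothetical header name; not a standard.
    static final String CALLER_HEADER = "x-caller-service";

    // Caller side: stamp our own (authoritative) service name outbound.
    static void inject(Map<String, String> headers, String localServiceName) {
        headers.put(CALLER_HEADER, localServiceName);
    }

    // Callee/aggregator side: read the caller's name, falling back when
    // the upstream is uninstrumented and never sent the header.
    static String extract(Map<String, String> headers, String fallback) {
        return headers.getOrDefault(CALLER_HEADER, fallback);
    }
}
```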
Makes sense. But as you mentioned, there is a history behind its introduction in the spec. Anyhow, give us a couple of days to think it through and we will create a work item for it.
Yep: the key here will likely be a decision between span.serviceName plus span.tags[] where key = "peer.service", versus span.serviceName plus span.remoteServiceName :)
@ashishagg @ayansen I was looking at this again as I ran into a scenario to record a span for an AWS call. Re-reading the thread, I think we should add
@abhinavtripathi7 and @ashishagg to look into and prioritize this issue.
Any idea when we would be able to get this request implemented? We are seeing some holes in our service graphs as a result of this.
Did you get a chance to try our ServiceInsights (beta) feature? You need to install the newer Haystack UI version.
A service is only displayed in the service graph if a parent-child relationship is found.
This seems to be done in SpanAccumulator.scala, where it looks for 'cs', 'cr' and 'sr', 'ss' tags (to identify a span as client or server) and then tries to find span pairs based on whether they share the same IDs.
If a server service does not report tracing data (e.g. an external third party, or a system that doesn't support distributed tracing) and we only have spans coming from the client service, this parent-child relationship is not established and the server service is not represented in the service graph.
Apart from its name, there is enough information to represent this service in the service graph, as well as its stats (rate of calls, % of failures, etc.).
I'm not really sure what the implementation would look like. One option would be to extend the IDL. Currently it only supports the name of the local service. Adding the name of the remote service as an optional field would allow us to display the correct name on the service graph (as well as in the traces view).
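The client/server detection described above can be illustrated with a small sketch. This mirrors the description of what SpanAccumulator.scala does (classify a span by its 'cs'/'cr' or 'sr'/'ss' annotations before attempting to pair it), not its actual code; the class name is made up.

```java
import java.util.Set;

class SpanKindDetector {
    enum Kind { CLIENT, SERVER, UNKNOWN }

    // A span carrying "cs" (client send) or "cr" (client receive) is a
    // client span; "sr" (server receive) or "ss" (server send) marks a
    // server span. Anything else cannot be paired.
    static Kind kind(Set<String> annotations) {
        if (annotations.contains("cs") || annotations.contains("cr")) return Kind.CLIENT;
        if (annotations.contains("sr") || annotations.contains("ss")) return Kind.SERVER;
        return Kind.UNKNOWN;
    }
}
```

A client span with no matching server span (the uninstrumented-callee case this issue describes) would then be exactly the place where an optional remote service name could fill the gap.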