Skip to content
This repository has been archived by the owner on May 23, 2023. It is now read-only.

Add tag for service #119

Merged
merged 5 commits into from
Jul 12, 2018
Merged

Add tag for service #119

merged 5 commits into from
Jul 12, 2018

Conversation

tylerbenson
Copy link
Contributor

Per discussion in #77, service name is deemed a widely enough used concept to warrant adoption by the community.

(Closes #77)

Per discussion in #77, service name is deemed a widely enough used concept to warrant adoption by the community.

(Closes #77)
@tylerbenson tylerbenson requested review from tedsuo and yurishkuro May 30, 2018 03:58
| `sampling.priority` | integer | If greater than 0, a hint to the Tracer to do its best to capture the trace. If 0, a hint to the trace to not-capture the trace. If absent, the Tracer should use its default sampling mechanism. |
| `service` | string | The name of the current service. E.g., `"elasticsearch"`, `"a_custom_microservice"`, `"memcache"`. Should correspond with values set in `peer.service`. |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer to have a description of the use case, such as a service mesh creating spans on behalf of different services.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there is a single use-case that would work well in this description. For example, with Datadog, (in addition to the use case you described) someone could set the service name via this tag that would override the default service name for a specific tag with the goal of separating out or correcting the name. Do you have a better description in mind?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your use case is less clear to me. Why would a service override its name? In our internal wrappers around oss jaeger libs we do the opposite, ie prevent services from even configuring their name in the tracer, instead we read it from env vars set by infra, to avoid names that do not correspond to discovery names.

I think it would be good to go through an RFC for this feature and describe the use cases it is meant for and why it's needed. Otherwise people will just unintentionally abuse it, I'm afraid.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here are two use cases for wanting the ability to override from the globally configured name:

First: you can think of like the service mesh use case (creating a span on behalf of another service). For example when making a database call, we rename that span to correspond to the database instead of the current service.

Second: same issue applies when dealing with a java application server with multiple war files deployed onto it. Many people think of those war files as logically separate things and want them to be named differently. If the ENV configured name takes priority, there isn't a nice way to make that distinction.

What about the name and use case is unclear? Why the need for an RFC? @tedsuo thoughts?

I understand your suggestion for configuring separate tracer instances to cover some of these purposes, but that doesn't work well in auto-instrumentation land where things need to be much more dynamic.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason for RFC is that it serves as a formal description of the use cases that can be considered canonical. The summary table modified in this PR can refer to the RFC for more details. Otherwise defining this new tag without explaining how to use it leads to confusion of which I think your database use case is a perfect example: it's misleading to attribute the client-side span inside the application to the database server, and unnecessary since the database can and should be referenced in the peer.service tag instead, per already existing semantic guidelines.

The multi-war file example could be a use case, but none of the existing instrumentation in opentracing/contrib behaves like that today, i.e. if I use java-grpc or java-spring-cloud or probably any other java-xxx modules from contrib, they won't know how to set the service tag because they expect the tracer to do it. So for that multi-war example to be practical not only do we need to introduce the service tag, but we also need to change all instrumentations in contrib to be able to take service name as an argument so that they can set this tag. I think that's much larger implications than just adding an optional tag. Personally I don't think we should be recommending this as a valid use case, since it can be much easier solved with tracer per application, even if they are deployed in a single app server (any decent app server would offer a separate class loader for each app).

So that leaves us with service mesh or similar use cases where an application is constructing spans on behalf of other applications, which is indeed a marginal use case well served by an "optional" tag.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We definitely want this in order for service meshes and collection systems to trace "on behalf" of another process. There is a need to differentiate between their own tracing and the tracing they are doing on behalf of another service. I would recommend that be listed as a primary use case.

I don't think you would ever have RPC instrumentation setting this. In the case of running multiple processes/JARs/etc, it would be an application-level span that would receive this tag.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest focusing just on the service mesh /external reporting. The guidance should be, IMHO:

  • when describing a span which is associated with an adjacent service the current service is calling, use the peer.service tag.
  • when recording a span out-of-process, on behalf of another service, use the service tag.

Separately, there is a larger issue around tracing stateful "things", be they called services, processes, or objects. Right now tracing is only about transactions. We consistently see code that persists and interacts with multiple traces/transactions, and it is not possible to model them effectively right now. For that issue, I think we need an RFC. And WARs, to me, would fall under that scenario.

I would recommend we separate that larger case out from the smaller (yet also necessary) case of "external tracing" such as service meshes, syslog-to-trace, etc. The service tag is sufficient for that narrower use case.

Copy link
Contributor Author

@tylerbenson tylerbenson Jun 14, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yurishkuro @tedsuo if you want to give me the text you want in the description column that you'd like to see, I can put it in there. I'm not clear on how to word the description so it would be useful for the service mesh use case. The description ted gave above seems too long for that compared to the other descriptions already there.

@CodingFabian
Copy link

We received this requirement from our users as well: they want to override on a per span level the calling/called service name.
While I agree that there are other ways to do it, I can see that configuring all metadata on a span level makes sense.
Nitpicking on the description: It should be made clear that peer.service of the parent should correspont to service of the child. Currently it reads as if peer.service and service of the same span should match, which would make one redundant.

@tylerbenson
Copy link
Contributor Author

@tedsuo @yurishkuro @CodingFabian I've made another pass at the description to try to incorporate the feedback. If you don't like it, please propose a better description.

@@ -31,8 +31,9 @@ Span tags apply to **the entire Span**; as such, they apply to the entire timera
| `peer.ipv4` | string | Remote IPv4 address as a `.`-separated tuple. E.g., `"127.0.0.1"` |
| `peer.ipv6` | string | Remote IPv6 address as a string of colon-separated 4-char hex tuples. E.g., `"2001:0db8:85a3:0000:0000:8a2e:0370:7334"` |
| `peer.port` | integer | Remote port. E.g., `80` |
| `peer.service` | string | Remote service name (for some unspecified definition of `"service"`). E.g., `"elasticsearch"`, `"a_custom_microservice"`, `"memcache"` |
| `peer.service` | string | Remote service name (for some unspecified definition of `"service"`) being called by a client. E.g., `"elasticsearch"`, `"a_custom_microservice"`, `"memcache"`. Meaning should correspond with values set in `service`. |
Copy link
Member

@yurishkuro yurishkuro Jun 25, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"being called by a client" is incorrect: peer.service tag is symmetric, used by both client and server spans to identify the remote service.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fair in theory, but I don't understand how that would be useful in practice. Do people write instrumentation where the server creates a span on behalf of a client like is common for the reverse?

Should I remove the being called by a client part then?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the server creates a span with span.kind=server and peer.service={name} if it knows the upstream caller by {name} (it depends on the RPC protocol, or conventions within the organization, e.g. a special http header - example).

yes, "being called by a client" should be removed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yurishkuro done.

@yurishkuro
Copy link
Member

new description lgtm.

| `sampling.priority` | integer | If greater than 0, a hint to the Tracer to do its best to capture the trace. If 0, a hint to the trace to not-capture the trace. If absent, the Tracer should use its default sampling mechanism. |
| `service` | string | Service name for a span which should override any default defined in tracer config for current and child spans. E.g., `"account-service"`, `"a_custom_microservice"`. Meaning should correspond with values set in `peer.service`. Note: intended use is for things like a service mesh which might have many services it is creating traces for. |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"for current and child spans" - this implies that if a root span of a trace sets the service tag, then all spans in that trace will be associated with that service. Need to find some wording that restricts the scope of this to spans created in the same process?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@objectiser I understand the concern. What do you suggest?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"current and child spans (reasonably considered to be within the scope of that service)" ?

Not sure - was wondering whether something based on the tracer, or within the current process - but couldn't find suitable phrase. So the above is vague enough so could mean anything but at the same time make it clear it does not mean all child spans.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@objectiser any other suggestions?

| `sampling.priority` | integer | If greater than 0, a hint to the Tracer to do its best to capture the trace. If 0, a hint to the trace to not-capture the trace. If absent, the Tracer should use its default sampling mechanism. |
| `service` | string | Service name for a span which should override any default defined in tracer config for current and child spans (reasonably considered to be within the scope of that service). E.g., `"account-service"`, `"a_custom_microservice"`. Meaning should correspond with values set in `peer.service`. Note: intended use is for things like a service mesh which might have many services it is creating traces for. |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reasonably considered to be within the scope of that service

Is it really necessary? If say a service mesh process is creating spans on behalf of a service, why can't it set the service tag on all spans it creates?

Another example is a normal service that creates most of the spans for itself, but the very first span is created on behalf of the caller (for whatever reason, the caller couldn't do it). With the above wording that first span will cause all spans in that service to be attributed to the caller.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it is changed so the tracer does not take responsibility for setting service on child spans, that would also be fine - although I'm guessing that was added to overcome the issue where all instrumentations would need to know the service and added it explicitly.

@tedsuo
Copy link
Member

tedsuo commented Jul 11, 2018

Current services definition

Service name for a span which should override any default defined in tracer config for current and child spans (reasonably considered to be within the scope of that service). E.g., "account-service", "a_custom_microservice". Meaning should correspond with values set in peer.service. Note: intended use is for things like a service mesh which might have many services it is creating traces for.

Comments

Very happy to see agreement that we should have this, and I'd like to get it merged soon. I only have one real piece of feedback: I don't think we should specify that child spans automatically have this tag applied. In practice, I have found that external reporters will be applying this tag to every span they are reporting. Requiring tracers to track parentage internally would require implementation changes, which would be expensive and unnecessary for the stated use case. It also leads to odd edge cases: if you try to define reasonably considered as stated above, you'll find that it may be tricky. It would be better to simply say that the tagged span will be treated as belonging to the service in the tag.

Suggested rewording

The service name for a span, which overrides any default "service name" property defined in a tracer's config. The meaning of service should correspond to the value set in peer.service, except it is applied to the current span. This tag is meant to only be used when a tracer is reporting spans on behalf of another service (for example, a service mesh reporting on behalf of the services it is proxying, or an out-of-band reporter which reads in log files). This tag does not need to be used when reporting spans for the service the tracer is running in.

@yurishkuro
Copy link
Member

@tedsuo +1, this is the same concern as mine in #119 (comment)

@tylerbenson
Copy link
Contributor Author

@tedsuo @yurishkuro in the interest of getting this in soon, I've adopted the changes suggested by Ted. I expect Datadog will continue supporting the cascading behavior, but I don't think that is incompatible with the spec or the way I expect users to utilize it.

Copy link
Member

@tedsuo tedsuo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tylerbenson thanks! I think, in practice, there's nothing "incompatible" or wrong with having that cascading behavior, it just shouldn't be a guarantee made by the spec at this time.

@tedsuo
Copy link
Member

tedsuo commented Jul 12, 2018

I believe we've reached consensus on this initial set of usecases and requirements, so I'm going to merge.

In the future, if we would like to extend the set of use-cases to include something that requires cascading behavior - such as "multi-service processes" - we can revisit this tag in a new issue.

Thanks, y'all!

@tedsuo tedsuo merged commit f7ca62c into master Jul 12, 2018
@tylerbenson tylerbenson deleted the tyler/service-tag branch July 12, 2018 21:04
@tylerbenson
Copy link
Contributor Author

@tedsuo is the next step to add this tag to the java repo?

tylerbenson added a commit that referenced this pull request Dec 6, 2018
tylerbenson added a commit that referenced this pull request Dec 6, 2018
tylerbenson added a commit that referenced this pull request Jan 3, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants