Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Define what behavior is expected when X-B3-TraceId is present, but not X-B3-SpanId #3

Open
codefromthecrypt opened this issue Sep 14, 2016 · 19 comments

Comments

@codefromthecrypt
Copy link
Member

question here spring-cloud/spring-cloud-sleuth#400 (comment) from @shalako

@codefromthecrypt codefromthecrypt changed the title Define what behavior is expected when B3-TraceId is present, but not X-B3-SpanId Define what behavior is expected when X-B3-TraceId is present, but not X-B3-SpanId Sep 14, 2016
@codefromthecrypt
Copy link
Member Author

X-B3-TraceId must be present with X-B3-SpanId to affect the trace identifiers. If they aren't specified together, whichever is present will be ignored and a new trace and span id will be provisioned.

https://github.com/twitter/finagle/blob/989edc58084fa01b6e6557ff5b6adda594ba469f/finagle-http/src/main/scala/com/twitter/finagle/http/Codec.scala#L371

@codefromthecrypt
Copy link
Member Author

PS I know many folks aren't watching this repository, so I'll do a one-time spam for B3-compliant tracer authors. The aim is to capture B3 here and also ideally an integration test for it (maybe crossdock). I hope some of you will choose to watch this repository (so as to make sure things make sense)

@nicmunroe @felixbarny @yurishkuro @jcarres-mdsol @abesto @eirslett @kristofa @michaelsembwever @kevinoliver @mosesn @marcingrzejszczak @prat0318 @devinsba @basvanbeek @schlosna @ewhauser @klingerf

@codefromthecrypt
Copy link
Member Author

related issue: #4

@nicmunroe
Copy link

Apologies in advance if I'm missing something and the following is dumb.

In the case of receiving a span ID but no trace ID I can understand throwing stuff away and creating new IDs as I don't see any benefit to honoring a span ID without knowing what trace it's attached to. But in the case of receiving a trace ID without a span ID that's still potentially useful info; it's true that you wouldn't know which span ID to use as the parent, but just knowing that a span is part of a given request even if it's "detached" is potentially useful information in my mind.

This would only really come up in two cases:

  1. Broken caller impl. In this case my concern above would apply. Yes, they are broken and should not send bad requests, but since we have a graceful fallback (honor trace ID and create new span ID) that provides useful info in the end (knowing that a span was part of a request even if you can't attach it to a parent) shouldn't we do the graceful thing? Not to mention that if you drop the trace ID you're losing out on log tagging using MDC from that point forward.
  2. The caller wants you to use a specific trace ID for the request for debugging / call association purposes. We've run into cases where it's been incredibly useful for some clients (e.g. restrictive mobile apps) to be able to specify the trace ID when they make the initial request in order to force that trace ID for that request.

I'm probably missing some important points about "why". Let me know what you think.

@shalako
Copy link

shalako commented Sep 16, 2016

Honoring the trace when present makes sense to me, too, whether or not a span is present.

@codefromthecrypt
Copy link
Member Author

Prologue:

A reverse-documentation effort smokes out concerns in the original implementation. So, let me first mention that by documenting what's been the case for years doesn't mean I agree with it, or decided it. In things like this in Zipkin, I usually take the stance of first document what exists. Then, if important enough help push the many parties to change. The latter is very expensive even when people agree.

On requiring both, and not accepting degraded propagation:

The coupling was required since its original commit in 2011 twitter/finagle@8939f31 Perhaps @johanoskarsson remembers why, I wouldn't. My 2p it that it reasonable to not accept broken instrumentation, especially if you have no defined way to log "broken". We do have a way now, via the "error" annotation.

The way I see it is that people can choose to start a new span given broken propagation, but if we want to make this helpful, we should probably report that it is broken. One thing that has come up (for example reported by @kristofa) is that misleading traces are worse than no trace. One way we could do this is add either the existing "error" annotation or define a broken instrumentation annotation and use that.

@nicmunroe on part 2: If the user can send trace id, they can also send a span id, right? Let me know if #4 (comment) doesn't cover that.

@codefromthecrypt
Copy link
Member Author

@yurishkuro I think you have done some work in analyzing tracers.. do you have means to tag bad or out-of-date instrumentation? ex in finagle, the version is logged, so if there's an issue with one version, you could analyze on that.

@yurishkuro
Copy link

we do log the version of the tracing client library, but we do not bother processing malformed traces from the wire, we just start a new one. It's a bit different in our case because aside from baggage all of the span context is encoded as a single string in a single header, so the use case of "have trace id, don't have span id" is very unlikely to happen (unless the header itself got partially chopped off by some http proxy, but then we just won't parse it at all).

@nicmunroe
Copy link

@nicmunroe on part 2: If the user can send trace id, they can also send a span id, right? Let me know if #4 (comment) doesn't cover that.

@adriancole it does cover responsible callers who do their research before integrating, but it's happened more than once where a time-crunched team assumes they know what's required and only end up sending trace ID. The problem with simply dropping that on the floor is that the problem is usually only discovered during an important error debugging session, at which point it's too late.

Since we use MDC tagged logging to put trace IDs into every log message and use a log aggregator to be able to search for all log messages across all microservices related to a given trace ID when investigating that request, switching trace IDs halfway through the request when we technically don't have to would be a big problem. I know that's a secondary use case for tracing and not supposed to be the primary consideration, but it's proven itself to be so incredibly useful that I don't think we can throw broken implementors under the bus. After all, the broken implementors are the ones most likely to need the extra debugging capabilities, yes?

I don't think our use case should trump everything else and cause the spec to change - just trying to provide our perspective.

@codefromthecrypt
Copy link
Member Author

@nicmunroe @jcarres-mdsol @shalako

Ok, well we can't rewrite history, but we can affect this moving forward. How about this?


X-B3-TraceId should be present with X-B3-SpanId to ensure that a span is placed at the correct place in the tree. Historical instrumentation start new traces when X-B3-SpanId is absent.

Instrumentation who tolerate absent X-B3-SpanId have the following behavior:

  • X-B3-SpanId is derived as the right-most 16 characters (64bits) of the X-B3-TraceId
  • The implicit span is assumed to be the root of the trace, so this works fine once per trace.

When multiple calls propagate the same X-B3-TraceId, but not X-B3-SpanId, multiple "sr" annotations will end up in the root span, corrupting it. If this happens, correct instrumentation to pass a unique X-B3-SpanId.

@nicmunroe
Copy link

@adriancole ok, a few things are clicking for me now and I think I see where some of the disconnect on my part was:

  1. Wingtips generates a new random span ID for everything, including root span (it doesn't share trace ID for the root). So when a call comes in with trace ID specified but no span ID we end up creating a new unique span ID. There's never multiple root spans in our case - we'd just end up with a "disconnected" span that has no parent but is otherwise unique (which is unfortunate but not as bad as two spans sharing the same span ID).
  2. The wingtips idiom is to have all spans be started and completed in the same process - we don't have cases where a span is started in one process and finished in another. So I'm guessing we suffer fewer negative consequences from continuing a broken trace (?).

Again, this is not to say that what wingtips is doing is better; I mainly want to make sure what we're doing isn't wrong or going to cause other nasty problems. Assuming what we're doing is ok then I'm happy with the documentation stating what happens historically along with what other options are acceptable.

@codefromthecrypt
Copy link
Member Author

This sounds similar to grpc. In grpc, they propagate the parent (client) id
only. Then, when it is received, the server uses that parent, generates a
new span id for the server span, and moves on.

In B3 (today anyway), the caller propagates the span id for the RPC.

For single-host spans, the client span id is propagated, but used
differently.

Let's say the client has parentId 1 and spanId 2. The client propagates its
trace parent and span id. Doesn't matter if this is wingtips or not.

If the server wants to use shared spans, it uses exactly the same ids for
the "sr" event.

If the server wants to do separate spans for client and server, it uses the
propagated span id as its parent and generates as new id.

If there is no span id, with the change we mentioned, it just provisions a
new span id (it doesn't have to use lower 64 bits of the trace even though
that's conventional).


The catch is when a caller thinks the server is going to use the span id it
propagated (because it assumes shared spans but the server doesn't do
that). That id still exists (as it is recorded as the parent id of the new
server span). However, it may not have any annotations in it at all (if the
caller didn't report that id). What's the impact?

Well, there's no impact in zipkin. Zipkin has no search by span id anyways.
We also have tests to cover headless trace (when the root span id is never
reported, we treat the root-most as root)

There could be impact to log tools that assume all span ids end up in logs.
This isn't something we have scope to impact at the moment, and in some
cases it will still work. Ex in finagle the string logged includes all
trace identifiers (trace parent and span id)

So... does this help clarify things about impact to wingtips (which is same
as grpc and maybe jaeger I think)

On 19 Sep 2016 01:29, "Nic Munroe" notifications@github.com wrote:

@adriancole ok, a few things are clicking for me now and I think I see
where some of the disconnect on my part was:

Wingtips generates a new random span ID for everything, including root
span (it doesn't share trace ID for the root). So when a call comes in with
trace ID specified but no span ID we end up creating a new unique span ID.
There's never multiple root spans in our case - we'd just end up with a
"disconnected" span that has no parent but is otherwise unique (which is
unfortunate but not as bad as two spans sharing the same span ID).
The wingtips idiom is to have all spans be started and completed in the
same process - we don't have cases where a span is started in one process
and finished in another. So I'm guessing we suffer fewer negative
consequences from continuing a broken trace (?).

Again, this is not to say that what wingtips is doing is better; I mainly
want to make sure what we're doing isn't wrong or going to cause other
nasty problems. Assuming what we're doing is ok then I'm happy with the
documentation stating what happens historically along with what other
options are acceptable.


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the thread.

@nicmunroe
Copy link

Yeah in wingtips the server always separates the spans by taking the client's span ID as the server's parent span and creating a new span ID for the server span. So it sounds like that's a reasonable thing to do and not unheard of. Thanks for taking the time for all these explanations!

@yurishkuro
Copy link

X-B3-SpanId is derived as the right-most 16 characters (64bits) of the X-B3-TraceId

that doesn't sound like a good practice. If the client app that only sends trace id makes a bunch of calls, each call will register as the same span id.

@codefromthecrypt
Copy link
Member Author

X-B3-SpanId is derived as the right-most 16 characters (64bits) of the
X-B3-TraceId

that doesn't sound like a good practice. If the client app that only sends
trace id makes a bunch of calls, each call will register as the same span
id.

Well, we were defining what bad practice would look like, as B3 expects a
span id to be sent! :) I did document that this would register as the same
span id, as if we think this through, what's the better option? When
there's no parent, we have a choice of

  • make a mostly valid root span (by using a consistent span id function)
    • If they reuse that multiple times, they corrupt that span, but it will
      still be renderable.
  • make a mostly valid root span (by using an inconsistent span id function)
    • If they reuse it multiple times, it will appear as though there are
      multiple root spans (multiple spans with no parent). I'm not even sure if
      zipkin will return that.

I lean towards asking people to actually implement the specification, and
if they don't, make it possible to see the error, rather than define an
undefined shape (a trace with multiple roots). Am I missing something?

@yurishkuro
Copy link

@adriancole While I am intimately familiar with the pain of bad instrumentation, I agree that it's better to fix that instrumentation than to document weird scenarios of what should happen if half the info is missing. @nicmunroe 's argument that "you get burnt by bad instrumentation when you need it the most" is somewhat flawed, because if bad instrumentation is already in production, then the tracing system is already processing the bad data and can actually identify bad players proactively, without waiting for an outage to happen.

@jcarres-mdsol
Copy link

I'd say just error. Both headers are necessary.

@nicmunroe
Copy link

@yurishkuro I agree with you in principle that bad instrumentation is possible to detect proactively, but not every organization is mature enough or has built the proper tools and alerting to stay on top of that. In the wild many teams don't have a solid grasp of distributed tracing and will do the wrong thing due to not reading the spec, or a bug, or what have you (and it will always be worse for orgs as they get started).

I think it's fine for the spec to be strict and simply state it's an error to send trace ID without span ID, and not document weird scenarios. But in the spirit of "be conservative in what you send, be liberal in what you accept" robustness, wingtips will likely continue to handle that case by continuing the trace with a new random span ID as the disconnected span still allows for bad-instrumentation-detection, and having all parts of the request tagged with the same trace ID still provides tangible benefit for callers as they work through their instrumentation issues.

@codefromthecrypt
Copy link
Member Author

ps I thought about this recently, and one thing about this thread is that it seems there's multiple valid policies, of which context decides what's best.

One way out is to document a few policies in practice, what's more used (if there's a more used) .. in something like an appendix.

Ex.

  • ExtractFailurePolicy.RESTART
    • default and what exists on most tracers
  • ExtractFailurePolicy.ERROR
    • throw error knowing it might break apps, like having assertions or linting on
  • ExtractFailurePolicy.TAG_AND_RESTART
    • add a tag to the span including the cause of the extraction failure, but restart normally

I'd concede that most of these choices are library and possibly span format specific, but anyway food for thought.

this recently broke out from separate discussions with @basvanbeek and @pavolloffay on how to handle propagation nuance, as these sorts of things can be insidious in practice

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants