Skip to content
This repository has been archived by the owner on May 23, 2023. It is now read-only.

Should there be a SpanContext.DebugCorrelationID or similar? #24

Closed
bhs opened this issue Dec 5, 2016 · 49 comments
Closed

Should there be a SpanContext.DebugCorrelationID or similar? #24

bhs opened this issue Dec 5, 2016 · 49 comments

Comments

@bhs
Copy link
Contributor

bhs commented Dec 5, 2016

Per opentracing/opentracing-cpp#3 as well as numerous misc discussions on Gitter.

Basically, most OT implementations have some sort of trace_id under the hood. There is practical value in exposing that, but there are also immediate problems:

  1. There is nothing in the OT spec about any particular sort of "id" (trace, span, or otherwise); and even if there were, there is no way for OT to know the bitwidth of those ids or even whether they're strings/ints/whatever.
  2. Since the OT spec allows for multiple parent spans, the "trace id" (sic) is not actually a single id / is not well-defined across the trace.

One option would be a SpanContext.DebugCorrelationID that would return a string and be a sort of "best-effort trace_id" for lack of a better word. I have mixed feelings about this but wanted to raise it for discussion and tracking.

@bhs
Copy link
Contributor Author

bhs commented Dec 5, 2016

(cc @lookfwd)

@lookfwd
Copy link

lookfwd commented Dec 5, 2016

One important detail on this, is that I think beyond any IDs that represent the trace, an "implementation identifier" or some other form of namespace should also be provided. This way one avoids interpreting IDs from tracer of type A as ones from type B while parsing logs or whatever. An id in the form ot::openzipkin/<traceid>/<spanid> - no spaces - URL specs'-valid - sounds ideal to me.

@lookfwd
Copy link

lookfwd commented Dec 5, 2016

Since the OT spec allows for multiple parent spans, the "trace id" (sic) is not actually a single id / is not well-defined across the trace.

I can see an implementation of that in Go. That's nice.

I see the point. If you inject all the IDs on a log, you will be able to do a quick "grep" (or whatever) and be sure you've got all the info out. Otherwise implementations will have to hit an OT API to get all related IDs for a trace and then do a "grep" on any of the ids.

@lookfwd
Copy link

lookfwd commented Dec 5, 2016

One argument I can see is this:

If a tracer wants to support injecting all the trace-ids always (thus print them all and avoid the OT-API to get related IDs), it must limit itself to only support multiple parents at Span construction time. Otherwise, between construction time, and the time the Span gets associated with any other Trace-ID's, the task might crash thus the connection between this Span and the Trace might be lost. So I guess such tracers are idiomatic and least of their worries is to provide a composite ID e.g. os:idiomatic_tracer/traceId1_traceId2_traceId3/spanid or whatever.

pseudocode:

extracted_context = tracer.extract(...)
extracted_context2 = tracer.extract(...)
span = tracer.start_span(operation_name=operation, child_of=[extracted_context, extracted_context2])
span.print()  # Here it prints ot:mytracer/extracted_context_trace_id:extracted_context_trace_id2/spanid

Otherwise, mainstream tracers that allow late join's to extra trace-ids can only support the "print a/any single tracedid/spanid" model and then there will be an API with "get relevant IDs" and a bulk search.

pseudocode:

extracted_context = tracer.extract(...)
span = tracer.start_span(operation_name=operation, child_of=extracted_context)
span.print()  # Here it prints ot:mytracer/extracted_context_trace_id/spanid
# <--- Crash might happen here and you would still like to know that 
# this span was associated with `extracted_context2` below anyway. So
# some form of "give me any related span IDs" post-processing is necessary.
extracted_context2 = tracer.extract(...)
span.isAlsoChildOf(extracted_context2)
span.print()  # Here it prints either ot:mytracer/extracted_context_trace_id/spanid or ot:mytracer/extracted_context_trace_id2/spanid
              # in any case a complex operation has to be performed for retrieval.

I can't see any solid hybrid-model between those two.

@codefromthecrypt
Copy link

here's some related commentary which might be of interest openzipkin/openzipkin.github.io#48 (comment)

@lookfwd
Copy link

lookfwd commented Dec 6, 2016

What about traces not being sampled and potentially not propagating downwards? I see Zipkin tracing as a best effort concern. It should not be responsible for making sure you can correlate a request with log files for instance.

Good to keep in mind

@yurishkuro
Copy link
Member

What about traces not being sampled and potentially not propagating downwards?

Trace ID/context propagation generally happens regardless of the sampling. At least in our setup trace context is always propagated in-band, in part because it also carries baggage which some functionality relies on always being there.

I see Zipkin tracing as a best effort concern.

Only when it comes to profiling and reporting spans out of process. But as a distributed context propagation it is not optional. If it were optional, then any other form of "unique ID" propagation would be in the same boat, making the whole point of correlating logs moot.

@codefromthecrypt
Copy link

codefromthecrypt commented Dec 7, 2016 via email

@lookfwd
Copy link

lookfwd commented Dec 7, 2016

If this helps, you can have people return e.g. ot:openzipkin/? for not available for any reason and ot:openzipkin/- for not sampled. Ugly - but just to keep the discussion going.

@codefromthecrypt
Copy link

@lookfwd fwiw in the next version of brave (zipkin java tracer), I'll be propagating the context regardless of sampling decision. will also try and see if it catches with other languages. old instrumentation won't do this, though.

on what to print, incidentally without noticing this I was already doing traceid/spanid in my context toString

@lookfwd
Copy link

lookfwd commented Dec 15, 2016

I'll be propagating the context regardless of sampling decision

nice, makes sense!

I was already doing traceid/spanid in my context toString

Ok.

Maybe, if we all see the point/usefulness of this one, but we are aware of the practicalities of existing implementations, we could say that we:

We recommend there's print() related API that prints trace/span-id or whatever related span identification mechanism an implementation uses. This is useful for correlating traces with other logging mechanisms. This should preferably print a string prefixed with an identifier of the tracing implementation e.g. openzipkin:8a43f9e23. Trace/SpanIDs might not apply on specific spans. In this case an implementation should print e.g. openzipkin:N/A. If this is because the trace wasn't sampled and you don't propagate trace IDs unless you sample, it should print e.g. openzipkin:N/S. If more than one span/trace-ids are related to the current Span, the decision on what exactly to print is implementation-specific. In any case, the meaning of the printed text is implementation specific and not expected to be interoperable. The implementation-name prefix helps you become aware and identify cases where you switched tracer.

@codefromthecrypt
Copy link

codefromthecrypt commented Dec 16, 2016 via email

@bhs
Copy link
Contributor Author

bhs commented Dec 16, 2016

@lookfwd @adriancole this brings up an interesting question: what if SpanContext had required (and trivial) methods called something like SpanContext.formatName() and SpanContext.formatVersion() which would be defined to return a strings which together uniquely identify the, well, propagation format. And, further, if there was a SpanContext.debugId() that, in conjunction with the format name+version, uniquely identified the SpanContext. This is different than necessarily uniquely identifying the trace, as I still think that's pretty ill-defined in a DAG model.

@codefromthecrypt
Copy link

codefromthecrypt commented Dec 16, 2016 via email

@bhs
Copy link
Contributor Author

bhs commented Dec 16, 2016

should we not just treat this as a form of injection?

That's what I've sometimes said in the past... people usually respond with something about baggage :)

@codefromthecrypt
Copy link

codefromthecrypt commented Dec 16, 2016 via email

@bhs
Copy link
Contributor Author

bhs commented Dec 16, 2016

that makes sense, @adriancole...

@lookfwd
Copy link

lookfwd commented Dec 16, 2016

If I get it right, by saying it's injection one should use the text map Carrier to get a set of key-values and then in an application specific way, include these in logs. I can't see any reason this wouldn't work and it sounds brilliant.

@SaintDubious
Copy link

I apologize for coming late to the party with concerns, but this seems a bit over-engineered to me. What's needed is an implementation specific way to represent a span as a string. It's hard for me to imagine anything that can't be represented as a string. The API doesn't need to know how the implementer chooses to represent it.

Further the injection idea seems to sidestep the problem. If I can represent a text map carrier as a string, and the proposition is to inject the ids into a text map carrier, then I can represent the ids as a string.

I fear this is probably a case of me not understanding the problem, or focusing too much on my specific use case. More info would be appreciated.

@ojkelly
Copy link

ojkelly commented Mar 11, 2017

I agree with @SaintDubious this seems a bit over-engineered, or possibly missing what the value could be.

In thinking about two main categories of OpenTracing users I see new projects with no existing logging, and legacy systems with established logging infrastructure. For the latter being able to expose a single string which can be correlated back to the trace is all we need.

Whatever the name of the function (though I think SpanContext.debugCorrelationId() is clearest), the actual implementation should be left to the Tracer. As it's use is to link a trace to data in another logging system.

Which is to say the actual string output of the function doesn't matter, as long as it can link data found in logs back to a trace, and vice versa. How that's achieved is an implementation detail of the Tracer, I don't think it matters to the spec.

@bhs
Copy link
Contributor Author

bhs commented Mar 13, 2017

These are the requirements I'm working with for this issue:

  1. it's clear that the identifier is unique to the trace/tree, not to the span,
  2. the semantics suggest imprecision in some way (e.g., the "debug" method prefix), and
  3. there's no expectation that different Tracer impls return values with the same formatting conventions

... as such, I can get behind SpanContext.debugCorrelationId() or similar.

I am less sympathetic about users who want a debugSpanId() (or whatever it's called). Anybody want to argue for it?

@ojkelly
Copy link

ojkelly commented Mar 13, 2017

@bhs I want to briefly argue against debugSpanId().

If the output debugCorrelationId() is entirely decided by the implementation, with the intention to allow correlation of the trace back to a seperate logging system, then the implementation could output any of the following traceId, implementationName:traceId, implementationName:traceId:spanId.

Therefore the specificity of the output is entirely up to the tracer, and if they need span level correlation, they can output it from SpanContext.debugCorrelationId().

@bhs
Copy link
Contributor Author

bhs commented Mar 13, 2017

Thanks @ojkelly. I would probably argue against "implementationName:traceId:spanId"... if I designed the spec/comment for debugCorrelationId, it would be something like

// debugCorrelationId() returns a string that can be used as a correlation hint across a
// distributed system. In practice, this is often directly descended from a Dapper- or
// Zipkin-style trace_id, though that is not a requirement.
//
// If a Span refers to only one parent SpanContext, debugCorrelationId should be the
// same in that parent Span and the child Span.

My two cents, anyway.

@yurishkuro
Copy link
Member

yurishkuro commented Mar 13, 2017

agreed. The intention of using the word "correlation" was that all spans in a (simple) trace have the same debugCorrelationId.

@wu-sheng
Copy link
Member

Agreed. We have the traceSegmentId, which is mapping the same concept as debugCorrelationId .

@lookfwd
Copy link

lookfwd commented Mar 14, 2017

As a minor comment - I don't like the debug that much since it represents to me something like "debug mode" or it might or might not have a value or something like that where, what it really tries to express, is just imprecision in the sense that this value could be quite ambiguous... but not really a debug thing. Some (even worse) names might be aCorrelationId() or someCorrelationId() or correlationHint() etc. I guess native speakers can find something even more precise. It might well be just me... Overall I like the spec above very much.

@wu-sheng
Copy link
Member

@lookfwd Got your point, reasonable concern. SpanContext.CorrelationID may be better?

@lookfwd
Copy link

lookfwd commented Mar 15, 2017

CorrelationID seems quite clear to me... and ambiguous enough to indicate that it isn't TraceId or SpanId. It makes you think "ah - I have to read more to understand this".

INFO << "Find out why this high-end security system is slow by dapper'ing " << span.debugCorrelationId();`

VS

INFO << "Find out why this high-end security system is slow by dapper'ing " << span.correlationId();`

@lookfwd
Copy link

lookfwd commented Mar 15, 2017

On the same spirit, I would add to @bhs 's definition that:

// A tool that targets on a specific Tracing implementation is expected to have
// a straightforward way to find the (one or more) traces related to a correlationId, 

@bhs
Copy link
Contributor Author

bhs commented Mar 16, 2017

@lookfwd I think I'm convinced that "CorrelationID" is different enough from "TraceID" that people won't get confused (i.e., that we can drop the "debug" prefix).

@lookfwd
Copy link

lookfwd commented Mar 24, 2017

I think we should also think of a default value for the Noop implementation

@wu-sheng
Copy link
Member

@lookfwd It is a little default for unknown the tracer implementations will do. With intuition, I think maybe -1 or N/A?

@lookfwd
Copy link

lookfwd commented Mar 24, 2017

Yes... something along those lines... The only drive I could see is it being compatible with (any implementation's) trace-id expected character-set... so I think I would slightly favor an empty string (!!) in contrast to something that has special characters like - or /. Implementations that don't propagate unless sampled, might also report empty strings - I guess - so it becomes a bit ambiguous... but it's better those implementations to have something more clever if they want to denote not sampled or accept the ambiguity.

@andyday
Copy link

andyday commented May 8, 2017

@lookfwd I think I'm convinced that "CorrelationID" is different enough from "TraceID" that people won't get confused (i.e., that we can drop the "debug" prefix).

I'm not clear why a different ID is needed. It would be better to expose TraceID to the application from the sdks. TraceID is already correlating...

@devinsba
Copy link

devinsba commented May 8, 2017

@andyday what is meant by that is that a trace ID is not a required part of an implementation. The specification does not prescribe that a Tracer even have a trace ID. Also it is a Tracer concern what the correlation ID might be or look at.

@andyday
Copy link

andyday commented May 8, 2017

@devinsba I would think most implementations would just make correlationId an alias for traceId, but I see your point. TraceId is required but in a vague way... I'm fine with span.correlationId() though. Currently having to bypass the opentracing interface to get such as thing in https://github.com/Nordstrom/ctrace ...

@mabn
Copy link

mabn commented May 8, 2017

@bhs regarding arguing for debugSpanId - I won't, but I'd rather argue against including this:

// If a Span refers to only one parent SpanContext, debugCorrelationId should be the
// same in that parent Span and the child Span.

There is value in being able to correlate log messages with particular span. Few cases come to mind:

  • service called by many others - it possible that single service is called many times during single trace. Being able to distinguish one request from another in the logs is valuable.
  • cyclic dependencies between services - like above
  • long running processes - there are processes which split execution into multiple iterations (e.g. "repeat until finished") where all iterations belong to the same trace.

On the other hand in cases where above don't happen logging shorter debugCorrelationId is beneficial - less storage taken, less space on the screen, easier to pass around. I could see a tracer implementation having a configuration option which controls this behavior.

One more case (my case) - in our API error responses return "logref" or "X-Request-Id" header which uniquely identify the request. In practice this is spanId (we use random 128bit span ids). So there is some need for "debugSpanId", but I'm not sure if it has to exist in the OT API. And we could as well return traceId+spanId as the X-Request-Id, probably even to just the traceId (for the client using the API it will be always different - good enough)

@bhs
Copy link
Contributor Author

bhs commented May 28, 2017

By the way, despite my radio silence I have been thinking a lot about this.

In order to facilitate the integration of OpenTracing into things like linkerd, Envoy, Istio, and so forth, we really may need to take a stance on a "default" / generic id scheme for OpenTracing. These thoughts have in turn made me reticent to pursue the CorrelationId() change here since it would most likely be partially overlapping.

I plan to bring this up at the next OTSC meeting (scheduled for friday of next week).

@tedsuo
Copy link
Member

tedsuo commented Jul 14, 2017

I am wondering if CorrelationID is needed. The lack of a SpanID complicates the Observer API, and if something like CorrelationID is workable at all, does that not point to exposing SpanID() and TraceID() as being more viable than previously thought?

Basically, can we call this TraceID rather than introduce a new CorrelationID concept, and have TraceID be defined simply as a string or bytes of arbitrary length. Is that effectively what this is?

@rbtcollins
Copy link

There are I think four primary use cases, and much of the discussion here has been about the logging one;

  1. feedback to the caller - "what trace should you lookup to help me with my problem": this is a key into the tracing system to identify the work we did in response to their request.
  2. labelling a trace explicitly - "please be able to lookup this work you're doing for me later under ID XYZ": this is a request to make a specific key be attached to a trace. Similar to e.g. jaeger-debug-id but without the semantic of forcing sampling and preventing downsampling. This is symmetrical to 1 - and it can be argued that only one of 1|2 is needed: callers can log the id from 1 to allow them to correlate, or callers can feed-forward their correlation id to permit correlation via 2. This can be done today by some consistent use of tags
  3. feed-forward from a call recipient - "what trace should I lookup to understand why I called you badly/too frequently/whatever": this is a key into the tracing system to identify what we were doing when we made a calloutside the boundary of the tracer. This is the flip side of 2: supplying an internal id with the expectation that the other party will record it somehow.
  4. internal correlation - "I have things I'm generating I need to correlate outside of the traceable system - e.g. audit messages going to a secured audit system, system logs with different visibility than the tracing system, and so on.

While OpenTracing may be hyper-generic, I think these use cases are real, and right now we're having to use hacks like the Observer API from opentracing-contrib/java-api-extensions#3 which lose important state - just to get at /any/ id.

I don't have a deep view on whether span/trace id is appropriate to use as a correlation Id, but I think the minimal API we need is to be able to query a span:

  • to get the correlation ID for it - whether thats short or long, composite or unique, I dont' care [we can file bugs on given backend implementations to tweak that]

And thats the entirety of it.

@yurishkuro
Copy link
Member

Basically, can we call this TraceID rather than introduce a new CorrelationID concept, and have TraceID be defined simply as a string or bytes of arbitrary length. Is that effectively what this is?

@tedsuo I tend to agree with that, and given the way TraceContext-spec is moving there doesn't seem to be anyone arguing against the notion of a single trace ID. I know @bhs had arguments in the past that there may be situations with multiple parents etc., but is there a tracing system in existence that actually does assign multiple trace IDs to a span and would have a problem with this API? This feature has been asked for so many times, I'd rather be practical than purist.

string or bytes of arbitrary length

I say string - tracer should be able to represent ID as string for HTTP headers anyway, so I don't see a lot of value in returning []byte and forcing the caller to figure out how to serialize it e.g. for logging.

@rbtcollins I would put case (2) aside, since we can't even agree to expose the trace ID, not to mention on forcing the tracer to accept an external one. The recommended way of tagging a trace with an external ID is by storing external ID as a tag. E.g. when you use jaeger-debug-id header, its value is stored as a tag and allows later lookup without relying on the trace ID API.

@bhs
Copy link
Contributor Author

bhs commented Oct 31, 2017

I can live with a SpanContext.TraceID() concept, and I agree with @yurishkuro about making it a string...

I still slightly prefer to call it a CorrelationID(), esp with the string type, but I don't need to win this battle. :) They are nearly but not precisely the same thing... if we call it TraceID(), the comments need to be extremely clear about what this is and what this isn't.

(I also think there's some sort of vague slippery-slope argument about introducing semantically-meaningful getters into the OT APIs, but I also grudgingly acknowledge how often this has come up)

@tedsuo
Copy link
Member

tedsuo commented Oct 31, 2017

@bhs I agree that Span getters are bad, but SpanContext getters are required almost - you can make the case that the point of propagating trace context is to access it and correlate things with it.

I know some Baggage methods have leaked into the Span API, but if you consider Baggage to be just on the SpanContext, then you could essentially have something like this:

// SpanContext represents the propagation context for the current trace. 
// In the case where they are not supported, TraceID and SpanID return empty string.
// Additional trace context can be accessed via Baggage.
type SpanContext interface {
  TraceID() string
  SpanID() string
  Baggage(string) string
}

This feels very natural to me, and fits in well with TraceContext and current tracing architectures. TraceID and SpanID seem like they could be provided by nearly every system. For additional, less-standard context info, such as trace-joining experiments, sampling bits, etc we have Baggage as an interface for accessing this additional info. No reason tracers can't provide their own baggage alongside the User's baggage.

Is there any tracer involved with OT right now that would have trouble with the above interface?

@wu-sheng
Copy link
Member

wu-sheng commented Nov 1, 2017

type SpanContext interface {
  TraceID() string
  SpanID() string
  Baggage(string) string
}

@tedsuo For a Producer/Consumer module in Skywalking, we have more than one traceId, if the consumer processes messages in batch mode. So TraceID for skywalking is too explicit, and I can't return the first traceId or even segmentId instead.

So I agree with @bhs about using CorrelationID than trace word.

@bhs
Copy link
Contributor Author

bhs commented Nov 1, 2017

There are plenty of good (great?) arguments about moving all baggage methods (esp copy-on-write setters) to SpanContext. @bogdandrutu has suggested as much IRL, and if I could do it all again I'd have designed things that way in the first place. (Of course this can be done in a backwards-compatible way with deprecation, etc)

cc @tedsuo

@tedsuo
Copy link
Member

tedsuo commented Nov 7, 2017

I agree it's conceptually cleaner to have the baggage setters on SpanContext. The issue is immutability: once you set a new baggage item, presumably you get back a new SpanContext. But that would decouple the span context from the span. I wonder if it's better to let the tracer control how it manages that relationship, and leave the setters on the span.

I definitely think SpanContext should have key lookups for baggage, not just an iterator. That would be very useful and does not come with the design cost of adding the setters.

@painhardcore
Copy link

It would be great to finally expose TraceID or CorrelationID in SpanContext interface .

@uudashr
Copy link

uudashr commented Apr 23, 2019

I don't understand on how this discussion can become official spec? is there any progress on this?

@yurishkuro
Copy link
Member

I believe this has been resolved by https://github.com/opentracing/specification/blob/master/rfc/trace_identifiers.md

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests