Add the HTTP header format proposal for TraceContext propagation. #1
Conversation
some things noticed, but overall nice job
HTTP_HEADER_FORMAT.md (Outdated)
@@ -0,0 +1,88 @@
# Trace Context HTTP Header Format

Date: 31/03/2017
I see why this is here, though it will cause maintenance churn.. wonder if there's a badge to auto-include the git timestamp?
Removed.
HTTP_HEADER_FORMAT.md (Outdated)

The value will be US-ASCII encoded (which is UTF-8 compliant).

### Version
Version or variant? maybe mention in docs this could be used by variants of the spec? or just wait and see..
(As a random, interested passerby:)
The problem with having an explicit version field is that you risk breaking all receivers as soon as you increment the field, which means a user can't do a staged rollout of a system that uses a new version of the protocol.
A better strategy would be to reserve some of the options bits and then require current implementations to set them to zero and ignore any trailing data. This way, you can evolve the protocol gradually.
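To make the strategy concrete, here is a sketch in Python. The field layout (16-byte trace-id, 8-byte span-id, 1-byte options, no version field) is illustrative only, not part of any proposal: a parser that tolerates trailing data, paired with a writer that zeroes the reserved bits.

```python
# Sketch of the "reserved option bits + ignore trailing data" strategy.
# The layout (16-byte trace-id, 8-byte span-id, 1-byte options) is
# illustrative, not normative.

def parse_context(header_hex: str):
    """Read the known prefix; tolerate trailing bytes from future revisions."""
    raw = bytes.fromhex(header_hex)
    if len(raw) < 25:  # 16 + 8 + 1 bytes minimum
        raise ValueError("header too short")
    trace_id = raw[0:16].hex()
    span_id = raw[16:24].hex()
    options = raw[24]
    # Anything past byte 24 is future data: ignored, never rejected.
    return trace_id, span_id, options

def build_context(trace_id_hex: str, span_id_hex: str, sampled: bool) -> str:
    # Current writers must leave the reserved option bits (1-7) as zero.
    options = 0x01 if sampled else 0x00
    return trace_id_hex + span_id_hex + format(options, "02x")
```

Under this scheme a newer sender can append fields without breaking older receivers, which is the staged-rollout property described above.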
HTTP_HEADER_FORMAT.md (Outdated)

#### Trace-id

Is the ID of the whole trace forest. It is represented as a 16-bytes array,
e.g., `0x4bf92f3577b34da6a3ce929d0e0e4736`. All bytes 0 is considered invalid.
is 0x prefix intended here?
Removed.
HTTP_HEADER_FORMAT.md (Outdated)

#### Trace-options

Controls tracing options such as sampling, trace level etc. It is a 4-bytes
representing a 32-bit unsigned integer in little-endian order. The least
java's default is big-endian.. there are a lot of java programmers
ps to elaborate on this, there are a couple of things in play. On one hand, if you have only one flag, endianness isn't an issue. Most users have sampled as a single bit, and that's the only flag routinely used in zipkin.
On the other hand, the platform default encoding of Java being big-endian means many Java programmers are not aware of endianness in general. Mistakes coding anything binary have been difficult in zipkin, resulting in json actually. While endianness is "little" elsewhere, anecdotally I've noticed those who have little-endian platforms already know how to do big endian. The main thing is that they get annoyed when the endianness isn't documented. Ex: when folks were doing thrift encoding I remember someone rather annoyed that the thrift TBinary format was big-endian without saying so (because it was implicitly that in java).
food for thought
java's default is big-endian.. there are a lot of java programmers
not only that, but big-endian is also "network byte order". Wikipedia says:
Big-endian is the most common format in data networking; fields in the protocols of the Internet protocol suite, such as IPv4, IPv6, TCP, and UDP, are transmitted in big-endian order. For this reason, big-endian byte order is also referred to as network byte order.
If I can write the 32-bit flags number as 0x00000001, it seems more natural to have 00000001 in the header.
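To illustrate the ambiguity being debated (a standalone Python sketch using the `struct` module, not part of the proposal): the same single-bit flags value serializes to different byte sequences depending on endianness, and a reader that assumes the wrong order silently loses the sampled bit.

```python
# Standalone illustration: one flags value, two byte orders.
import struct

flags = 0x00000001  # only the "sampled" bit set

little = struct.pack("<I", flags).hex()  # "01000000"
big = struct.pack(">I", flags).hex()     # "00000001" -- network byte order

# A reader that assumes big-endian but receives little-endian bytes
# decodes 16777216 (0x01000000) instead of 1: the sampled bit is lost.
misread = struct.unpack(">I", struct.pack("<I", flags))[0]
```

This is exactly the failure mode that disappears if the options field is a single byte, or if the byte order is spelled out in the spec.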
I know that network byte order is big-endian. But probably 99% of the CPUs in use are little-endian.
Let's vote here #3 for this.
I think @bhs's proposal to reduce this to 1 byte is an alternative option. Updated the issue with this possibility.
Done. 1-byte used for options.
HTTP_HEADER_FORMAT.md (Outdated)

```
Value = 004bf92f3577b34da6a3ce929d0e0e473600f067aa0ba902b701000000
Version = 0 (0x00)
TraceId = 0x4bf92f3577b34da6a3ce929d0e0e4736
```
0x prefix will likely lead to encoding bugs as people will almost certainly copy/paste if we give opportunity to
+1
HTTP_HEADER_FORMAT.md (Outdated)

```
Version = 0 (0x00)
TraceId = 0x4bf92f3577b34da6a3ce929d0e0e4736
SpanId = 0x00f067aa0ba902b7
TraceOptions = 0x0 // not-sampled
```
make an example that breaks with endian, especially if we use little endian
@adriancole This is a terrific initiative. The proposal is looking quite reasonable. I think the only debatable piece is the presence of Version. I would rather use your suggestion, like Variant, or Encoding, or Scheme. To give an example, I would like to see something like this: `Z<traceid><spanid><options>`, where `Z` would tell me that this is the Zipkin encoding scheme. For HTrace it could be `H<spanId><options>` (there is no separate Trace Id; it is encoded in the Span Id, at least in 4.1). It is the same 8 bits, but using characters instead of 0. Not sure if it makes sense to you guys. What do you think?
No more endianness.
I like the idea of maybe having multiple supported systems, but one of the main purposes is to make all these systems work together, so having separate encodings for each system will not be that much of a win.
@adriancole Will this become the TraceContext propagation spec of zipkin? Or do you have a wider purpose?
@wu-sheng this is a spec for dapper-like systems. It will take some time for something like this to work its way through implementations and interop, but it is indeed something I'd like to see possible in zipkin tracers. For example, there's some TODO: work left with brave to show an alternative context header, and this format is what I'll use to test that.
@adriancole I will keep watching this. Your trace-related projects are very interesting to me.
Thanks for doing this, @bogdandrutu. I made some comments.
@bogdandrutu what's the vision/scope for the TraceContext GitHub org?
HTTP_HEADER_FORMAT.md (Outdated)

Date: 31/03/2017

A trace context header is used to pass trace context information across systems
for a HTTP request. Our goal is to share this with the community so that various
I hear a lot of pain around tracing things like messages buses / kafka / etc. Might be worth intentionally decoupling from HTTP and making this about any messaging tech that has room for metadata/headers.
HTTP_HEADER_FORMAT.md (Outdated)

#### Trace-id

Is the ID of the whole trace forest. It is represented as a 16-bytes array,
The room in 🇩🇪 seemed to favor a variable-length field here... I thought that was for good reason, too: 128 bits of precision is completely unnecessary for the purposes of most tracing systems, yet insufficient for others we heard about that needed 192. Why does this spec need to try and get this right? (We can set a hard max at 32 bytes or something if you are concerned about unbounded allocations)
fwiw As a unit, zipkin usually implements things a lot later than the first request, after significant cross-project interest exists and there are hands to do it. Ex: 128-bit trace ids were implemented probably 2 years after the first sites started forking zipkin to accomplish this. This happened after a couple of more recent glitches occurred, including users of AWS lambda not being able to reuse the uuid there, as well as zipkin users who have hybrid sites that employ stackdriver (which is 128-bit). This topic has been discussed at length on other forums, so not trying to replicate it here. Suffice to say that 128-bit is definitely in use in zipkin sites.
To make things less conjectural, it would be nice to have a matrix of tracing systems and their trace context requirements someplace, maybe in a google doc or otherwise.
In practice, Zipkin has never been asked directly to support a bit length higher than 128. However, we have been asked for more "squishy" things like opaque strings. Also, some tracers like sleuth were asked to support lambda, which has an interesting timestamp+identifier trace id format. Having a matrix around can make things more objective, as people can see whether all of the systems they care about can work with a spec or not. It won't be about what most systems need, but about what a given site is running. For example, 9/10 is meaningless if that last one is the only one they use.
If we accept a variable-size context we should definitely have a max size for both trace_id/span_id; if 32 bytes fits all the systems then we should probably change to that. Do you know of any system that is willing to change to a common format and uses more than 128 bits? As @adriancole mentioned, for the moment the systems that are interested in using this format are Zipkin and Google, and both are fine with 128 bits.
One option to move forward is to make v0 have fixed sizes; we can define v1 later, when we have a clear system that wants to use this and requires more than 128 bits.
Also filed an issue to keep track of who wants/is willing to use this format.
Please add your project here #4. Add any extra requirements.
Tracelytics uses 160-bit trace IDs currently, and we are interested in a common format. FWIW AWS X-Ray uses a 128-bit trace ID format consisting of a 64-bit timestamp concatenated (using a hyphen) to a 96-bit unique ID, both encoded in hex.
HTTP_HEADER_FORMAT.md (Outdated)

*Valid sampled Trace-Context:*

```
Value = 004bf92f3577b34da6a3ce929d0e0e473600f067aa0ba902b701000000
```
IMO this is over-optimized / pre-optimized for performance. In the context of an RPC, parsing these things is going to be a joke from an overhead standpoint... I'd rather have an id format that's human-readable, or at least human-parseable (without counting characters by hand, that is).
E.g., this:
Value = 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01000000
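A sketch of how the delimited form above could be parsed (the field names and the version-traceid-spanid-options order are inferred from the example value; they are illustrative, not part of the spec):

```python
# Parsing the delimited header form shown above. Field names and order
# are inferred from the example and are illustrative only.

def parse_delimited(value: str) -> dict:
    version, trace_id, span_id, options = value.split("-")
    if len(trace_id) != 32 or len(span_id) != 16:
        raise ValueError("unexpected field length")
    return {
        "version": int(version, 16),
        "trace_id": trace_id,
        "span_id": span_id,
        "options": options,
    }

ctx = parse_delimited("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01000000")
```

A single `split("-")` replaces the character counting a human (or a hand-written parser) would otherwise need against the undelimited value.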
made comment elsewhere but I also agree a delimited format is near required from an operator POV. Especially as people cross-reference trace ids in other systems. Like splunk search shows something where one system logs the trace id as a logging key. Copy/paste this into zipkin or another system. There's also the "curl"ability which is "can a user work only having curl?" Straightforwardness is important for adoption, or rejection prevention.
I think it is not unreasonable to add an extra character, especially for the HTTP/string version, to make this more readable.
If there is agreement I think we should do it this way. I will start a few issues where we can vote on certain things to get all these concerns/issues/questions resolved and CC interested people.
Please vote/comment here: #2
Done. Used a delimiter between all fields.
Implementation may decide to completely ignore the trace-context if the span-id
is invalid.

#### Trace-options
it's odd to me that we support a versioning bit, yet have this 4-byte thing we only have 1 bit specified for... we could alternatively just have a single byte for version 0 and skip all of the endianness discussion.
I was expecting the 4-byte options to fill up with things like what you proposed (sampling probability). Uber also suggested an extra bit for a deferred sampling decision.
For the moment I am trying to get to the minimum requirement.
I am trying to collect data about how many bytes we need. So far the list contains 3 (based on Jaeger requirements).
If the list is less than 8 we can definitely go with 1 byte for the options in v0.
Should we have the sampling probability that @bhs proposed as a separate field?
Done. 1-byte used for options.
@@ -0,0 +1,88 @@
# Trace Context HTTP Header Format
Which systems are committed to supporting this? Which systems would you like to support this? It might be helpful to @-mention people from the latter so we can have any debates before this is merged.
putting in 2p even though the question was for @bogdandrutu :)
TL;DR: I'd expect tracing systems which maintain all of their tracers to make a more top-down decision, but yeah, that's not currently the case in zipkin, as this trace context is compatible with zipkin's wire format.
Currently, some zipkin-compatible tracers support vendor-specific or not widely used formats. Those types of tracers will have an easier time with this since zipkin's trace context isn't inherently incompatible with this specification (at the moment). I'd expect the cross-section of google+zipkin users to be first to ask, as it is likely google will land some variant of this first (grpc, cloud services and stackdriver instrumentation).
Similar to other things that happen, when that demand occurs it is usually in a repo or two. For example, our first requests for StackDriver and X-Ray trace support came from the sleuth issues list. Tracers run independently and can move to support something sooner or later. I often ping people across tracers on things like this so that they can weigh in before organic demand hits.
Regardless, whether to support this lies in the scope of each tracer to decide, until there are server implications, such as the trace context being incompatible or too wide to store in zipkin.
With the current format (which is compatible with Zipkin and Google) we try to make at least these systems work with this format. Having a common place for the specs (maybe some simple implementations in multiple languages) is one of the goals.
Anyone who is interested in using this format is welcome to join the effort and send patches/PRs etc.
@saintstack @reta any feedback on this from the HTrace POV?

## Header name

`Trace-Context`
'Via' is a standard header with pretty similar semantics. It is also in hpack.
`Trace-Context` can be mistaken for something carrying the baggage or context properties. This is an id. We decided to use `Request-Id` for the correlation protocol in .NET. If we are able to converge on a format, it may be a good one to use. `Request-Id` is quite descriptive for the header's purpose, I think.
PTAL adriancole@
HTTP_HEADER_FORMAT.md (Outdated)
@@ -0,0 +1,88 @@
# Trace Context HTTP Header Format

Date: 31/03/2017
Removed.
HTTP_HEADER_FORMAT.md (Outdated)

#### Trace-id

Is the ID of the whole trace forest. It is represented as a 16-bytes array,
e.g., `0x4bf92f3577b34da6a3ce929d0e0e4736`. All bytes 0 is considered invalid.
Removed.
Implementation may decide to completely ignore the trace-context if the span-id
is invalid.

#### Trace-options
Done. 1-byte used for options.
HTTP_HEADER_FORMAT.md (Outdated)

#### Trace-options

Controls tracing options such as sampling, trace level etc. It is a 4-bytes
representing a 32-bit unsigned integer in little-endian order. The least
Done. 1-byte used for options.
HTTP_HEADER_FORMAT.md (Outdated)

*Valid sampled Trace-Context:*

```
Value = 004bf92f3577b34da6a3ce929d0e0e473600f067aa0ba902b701000000
```
Done. Used a delimiter between all fields.
HTTP_HEADER_FORMAT.md (Outdated)

```
Version = 0 (0x00)
TraceId = 0x4bf92f3577b34da6a3ce929d0e0e4736
SpanId = 0x00f067aa0ba902b7
TraceOptions = 0x0 // not-sampled
```
No more endianness.
I like the idea of maybe having multiple supported systems, but one of the main purposes is to make all these systems work together, so having separate encodings for each system will not be that much of a win.
1. Trust and abuse.
2. Bug in caller
3. Different load between caller service and callee service might force callee
to down sample.
Assuming sampling logic may differ a lot between vendors, why place the sampling flag in the trace context? Why not pass it as a separate header that is subject to vendor-specific logic? In some cases, to resolve the issues you describe, extension libraries will need more than just a sampling flag: most probably some additional information from `baggage` or other headers.
You may also have multi-tier sampling. Take the example of the local agent mode that @bogdandrutu demo-ed for census. You may want to sample data that you send to the backend with sampling rate 0.01 and locally with 0.1. So you'd need more than one bit to carry both sampling decisions.
Another consideration: the properties of `Trace-Context` belong to the new span you create. So a developer or extension library can access those fields by taking the "running" span details, and you can use those identifiers for non-tracing needs. The sampling flag will not be one of the span's properties (will it?), so you have no access to the original value of that flag. And you may not need any options, as you may only care about the identifiers. So having options in Trace-Context looks inconsistent.
@SergeyKanzhelev consistent sampling across the whole trace is a very important characteristic. If the sampling decision is not propagated and the trace spans multiple implementations t1,...,tN, then each implementation will have to make a sampling decision over and over, and the probability that you will capture the whole trace becomes very small, p1 * ... * pN.
@yurishkuro if you are making a sampling decision based on traceid, you will have min(p1, p2, ..., pN) full traces, not the multiplication. If all probabilities match, all the traces will be collected fully.
When you do sampling you may need to estimate the count of spans exhibiting certain properties based on sampled data. In the case of statistical sampling you may just multiply the raw count of spans by the inverse of the sampling percentage to get a statistically accurate number.
If the sampling decision is forced from above, without information on the originator's sampling percentage, you cannot do this type of estimation.
statistical sampling on every layer is just one example of the different types of sampling you may want to implement, and you will need more than a bit of information
if you are making a sampling decision based on traceid...
When you do sampling you may need to estimate the count of spans
Having the sampling bit does not prevent you from passing extra data or making more elaborate decisions, but that's not part of the proposed standard. NOT having the sampling bit pretty much guarantees that the trace will be broken, unless every implementation makes the decision based on the exact same formula, like traceId < p * 2^128, which is again not part of the proposed standard.
The sampling bit is a recommendation. If a service can respect it and handle the volume, great. If it cannot respect it 100%, maybe it can respect it for "more important" spans like RPCs, and shed load by dropping in-process spans & metrics. Essentially it does not put any restrictions on how you implement sampling, but it still provides a way to achieve consistent sampling across the stack, if the volume allows it.
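The `traceId < p * 2^128` formula mentioned above can be sketched as follows (hypothetical helper, not part of the proposal). Because the decision is a pure function of the trace-id, every tracer that applies it with the same probability reaches the same verdict, and tracers with different probabilities yield min(p1, ..., pN) complete traces rather than the product.

```python
def consistent_sample(trace_id_hex: str, p: float) -> bool:
    """Deterministic sampling: sample iff traceId < p * 2**128."""
    return int(trace_id_hex, 16) < p * (1 << 128)

# Independent tracers using the same p agree on every trace; a tracer with
# a larger p samples a strict superset of a tracer with a smaller p.
tid = "4bf92f3577b34da6a3ce929d0e0e4736"  # roughly 0.297 of the id space
sampled_at_50pct = consistent_sample(tid, 0.5)  # True
sampled_at_10pct = consistent_sample(tid, 0.1)  # False
```

This only works if trace-ids are uniformly random, which is one reason the propagated bit is a simpler interop contract than a shared formula.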
That is precisely my question. Do you think it will be typical for libraries to trust this flag? Or will every vendor and library implement its own logic, so this flag will never be used? Should this standard define the flag, or let customers decide on a sampling algorithm across services in their org as a decision separate from the data correlation protocol?
The beauty of the sampling flag is the ability to implement solutions like forced data collection for a period of time or for a specific span name. It also removes the need to synchronize the sampling decision algorithm.
On the negative side, services lose control of the data volume and the statistical accuracy of collected data.
The protocol is optimized for a single team owning many components with relatively similar load. That is not always the case. Every component may be owned by a team which wants to play nice and contribute to the overall correlation story, but has a bigger priority to fully control the telemetry volume and distribution collected from this component.
On the negative side, services lose control of the data volume and the statistical accuracy of collected data.
I don't agree with this assessment: the flag is a recommendation; an implementation does not have to respect it if it thinks it can do a better job.
Do you think it will be typical for libraries to trust this flag?
Yes, I think it's a very common implementation to simply trust the flag, in the absence of other knowledge about the system.
- I'm not against a flag that will force all layers to record a specific trace. It is very useful for troubleshooting scenarios.
- I also agree that collecting the entire trace as often as possible is very useful.
- It is unavoidable that downstream services will collect more telemetry than was sampled in by the front door. So there will be incomplete traces.
What I want to avoid is a situation where you heavily rely on implementation details of the upstream component's sampling algo. Also, I want to make sure there is a mechanism, easy to replicate in any language, to control the flood of telemetry in case of non-matching load patterns. Third, for many Application Insights scenarios we need to keep the sampling percentage that was used when sampling telemetry out, so we can run statistical algorithms on the telemetry and recognize patterns. So a single bit will not generally work for us as a universal sampling mechanism.
I'd propose to have a `debug` flag with the semantics of occasional forceful tracing across the system. The sampling flag can then be vendor-specific.
### Version

Is a 1-byte representing a 8-bit unsigned integer. Version 255 reserved.
The rationale here is that we expect the key to remain the same even if the value changes. This way, we can change the format in the worst case where such a thing is needed, and that can be done without a fan-out of administrative activity such as new filter patterns to afford a new trace header key. We should update this to make very clear that version changes are not expected and highly discouraged, especially breaking ones.
@adriancole I really like explicit versioning. I think it will help a lot long term.
@bogdandrutu was the idea of the version an incremental thing, or the format of the header? Let's say `00` is what is explained here, `01` is a binary format, `02` is an alternative length of request ID or another implementation? We've been thinking of suggesting its use to indicate an alternative way of generating identifiers.
@SergeyKanzhelev cool. I am in favor of versioning at this point, notably to help folks know when to not process a header (ex check magic type of thing likely used similarly in X-Ray's format).
Wrt version also being a format flag, not sure it applies here. For example, in grpc, binary headers are actually encoded as base64, and their header names have -bin appended to them. So in this case, they don't need a different format bit inside their encoded trace data as it is already implicit in the scheme.
What if a service is called from a device via http with the size concern, and from another service with the regular header format? Does this service need to read two headers? Which one will win?
When you have a single header and allow for a binary format, things get easier, I believe.
If we take out the sampling flag, what we are left with in general is a way
to specify ids in a tree. Users of this format can do logging or other
sorts of correlation using them. What we couldn't do is notify in any
portable way that the trace/span ID pair was reported to the tracing system.
I actually see this flag as different than a sampling algo data. It is
really a signal that the ID you just received is either on the way to the
tracing system or accidentally dropped. Sampling algo data surely can be a
side-car to enrich this information.
What would be the harm in sending this, if it were defined as a signal of
reporting status vs sampling algo? Well, maybe someone who does
coordinated, after-the-fact pull collection would get nervous? Maybe they
are nervous because the best they could do is send 0 as they don't know if
they will report that span or not. If this is true, worst case is as bad as
not defining the bit... such a system doesn't know and leaves the bit unset.
On the other hand, allowing the bit to be set preserves a widely used
feature for systems who have trace-id consistent export policy. If there
was a compelling reason to prevent these systems from being able to interop
with a single header, we'd add another header to do it, and use it
consistently. Vendor-specific would just undo interop between systems who
surely can today (bear in mind multiple systems use zipkin's B3 today!)
This is important to reason through because we don't want to undermine the
common good if we don't have to.
---
The TL;DR would be: maybe we can restate the sample flag from a
command to an FYI.
Like instead of Sampled=1 meaning "you better sample this", it means
"FYI, some of this trace was reported to the tracing system".
A lot of current tracers will see the FYI and decide to also do that,
especially for internal networks.
The command of "force trace" could be another flag as sergey mentions.
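A minimal sketch of the "FYI vs. command" reading discussed above; the bit positions and names are hypothetical, not from the spec:

```python
SAMPLED = 0x1  # FYI: upstream reported (some of) this trace
FORCED = 0x2   # hypothetical "force trace" command flag Sergey mentions

def should_record(options, local_decision):
    if options & FORCED:
        return True          # an explicit command wins
    if options & SAMPLED:
        return True          # follow upstream's reporting, common on internal networks
    return local_decision    # no signal: fall back to the local sampler
```

Under this reading, a system doing after-the-fact pull collection simply never sets `SAMPLED`, which is no worse than the bit not being defined at all.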
---
I'm trying to summarize the main differences between this format and other formats:

- This format vs. B3:
- This format vs. AWS. The best link I found:
- This format vs. .NET proposed:

I'm trying to get a sense of why to use the new format rather than one of the existing ones. I still think it would make sense to detach the sampling/debug flag from the identity information. Support of less strict identifiers can also help with integrating with other monitoring systems. I'll post another comment on how the analog of this format can be merged with
---
Another typical case is Heroku with
---
@SergeyKanzhelev thank you for the brief survey above. The more I think about this issue the more I like the idea of leaving the precise bit-width and even the encoding of the trace and span ids out of this spec... why can't they just be strings? (I.e., from a passing-from-process-to-process standpoint, the encoding doesn't matter as long as the (opaque) trace/span ID can be extracted) |
@bhs I think many existing tracing systems won't be able to support trace ID as arbitrary string, but I think this is actually an important question about this spec: is it meant to enforce a single trace ID across different tracing backends, or just provide a correlation between different sub-graphs? At the last Zipkin workshop it was mentioned that blindly respecting a foreign trace ID is dangerous, since a bad player could send the same ID over and over, messing up all of the internal traces for another sub-system. On the other hand, treating it simply as a correlation ID (foreign reference) is safe and does not require any conventions about the bit-length and encoding. |
I absolutely agree – my idea here is less that all tracing systems should support an arbitrary string, and more that the tracing system should have control over what a trace ID looks like. I'd be fine restricting it to hex or base36 or whatever if that makes people feel better about it. I just don't see the ROI in trying to enforce a particular bit width, etc. Interop between tracing systems within a single application deployment has always been a non-goal of mine. I'm more interested in allowing the routers and service meshes (and other L7 tech of the world) to pass tracing information along properly. |
I think ben has a different goal as mentioned here. This is an interop spec
for those who can, not just a "firewall rule" for intermediaries. Those who
can pass a string can also pass an encoded one. By specifying a type many
can agree to, many will interop even better than they can now. Some will
not use this spec anyway. If we remove the encoding this becomes worse than
B3 not better in terms of interop.
Regardless those propagating other things like tags will need another
"firewall rule" expression to forward them. The scope of a generic template
for forwarding rules is a different one than this.
---
For the sake of argument, let's say we didn't have encoding specified. Then
we'd basically consider whatever the character set is with regards to how
big the ID is. So, the storage/indexing aspect is much wider, and there
would be no means to reliably narrow if an ID is passed bigger than some
can store: some folks have inexpensive setups that are limited to 64bit
trace ids.
If the encoding was instead specified (ideally hex as it is now), storage
can at least reliably narrow 128bit to 64bit at risk of collision. If
encoding is not specified (as suggested as opaque), you really couldn't do
anything like this reliably without interpreting the data, or adding
functionality which isn't universal like linked ids.
Any time we have to map identifiers, there's correlation loss; for example,
the ID ends up in a mildly different encoding when plopped into logs vs. tracing vs.
other things. It would be a sad step back if the majority of us decided
not to interop even to the point of specifying ID encoding.
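The narrowing argument can be sketched as follows, assuming lowercase hex encoding; `narrow_trace_id` is a hypothetical helper, not part of any spec:

```python
def narrow_trace_id(trace_id_hex):
    # With a specified hex encoding, a store limited to 64-bit trace ids can
    # deterministically keep the low 64 bits (rightmost 16 hex chars) of a
    # 128-bit id, at some risk of collision. With opaque ids, no such
    # reliable rule exists without interpreting the data.
    if len(trace_id_hex) == 32:
        return trace_id_hex[-16:]
    return trace_id_hex

print(narrow_trace_id("4bf92f3577b34da6a3ce929d0e0e4736"))
```

Every participant applying the same deterministic rule is what keeps a narrowed id correlatable across systems.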
---
@adriancole I understand the frustration. I still have questions about this format. We are also thinking about how it can be converged with the .NET HTTP correlation protocol and are working on a proposal. That's why all the questions. My understanding was that the idea of this proposal was to set a standard for all things tracing, not to limit it to the Census and Zipkin use case. @mtwo, @bogdandrutu it would be great if you could summarize the open questions from this thread and the plans to merge and evolve the format with Census (I'll also try to reach you directly tomorrow).
---
> @adriancole I understand frustration. I still have questions to this format. We also thinking how it can be converged with .NET http correlation protocol and working on proposal. That's why all the questions.
It is mainly an issue of github pull requests not being as good at
exploratory questions as google doc (which we started with last year and
preceded this). I'm trying to be conscious of the near impossibility of
anyone jumping in and understanding the threads here.
> My understanding was that the idea of this proposal was to set a standard for all things tracing, not limit it to Census and Zipkin use case.
nothing can standardize all things tracing. that's an unrealistic goal, as
there are dramatically different things and folks who are not participating
on this (such as htrace, aws, x-trace, pinpoint, app dynamics, etc). we
cannot expect to standardize something for every use case but we could
standardize something for dapper-like systems.
---
I likely don't have time to finish this argument, but I do want to re-raise the fact that the benefits of specifying a bit width for ids are in my mind outweighed by the benefits of not-specifying a bit width. The path-id thing is a perfect example... I actually implemented something like that for a few recent demos and expect to see something similar in production before too long; I would love to use a standard inter-process propagation format, but that would be a non-starter with the fixed-width id scheme (at least without some arbitrary baggage map to stick things in instead; though I can understand leaving something that generic out of the spec for now). I think the current proposal is utterly sane but too narrow. What we'll end up with is yet-another-standard for the inter-process propagation (and maybe even a decent one), but a standard that's too narrow to absorb the semantics of other popular propagation formats and thus one that's limited. I.e., I like it, but I'd like it more if it only placed the restrictions that are literally necessary: since the ids aren't parsed, who cares how wide they are? |
@bhs I was hoping for the same. I like some ideas in this format though, like the format number prefix and the clean dash-delimited string. In most distributed systems I see one header for an always-changing unique string and one for something you need to keep along the entire trace. Most of the time the first is optimized to have a prefix of the
I've seen older systems that do not do
This way, as I mentioned in #4, you can force all "dumb http services" to simply trace the content of a header with the specific name alongside the logs, or pass it forward if it is a proxy of sorts.
---
So ben/sergey would you use this if it didn't specify id constraints? I
haven't seen lightstep on the issue tracking stakeholders and their
requirements.
If there was no format, how would you interop with another cloud, or is that
just not important to you? How would systems that desire interop do so if
nothing is specified, practically speaking? Would they scan the characters
and guess? What's the point?
---
I agree with Adrian. Two of the goals of this effort benefit from a common format encoding:
---
@adriancole @mtwo the important thing is semantics. One header should be unique for every request and another should be constant for a trace. If you have a string whose semantic is
Furthermore - if you recognize the format of the header, you can apply some additional logic to it. So having similar instrumentation and format will allow for extra features, like ordering or such. In Application Insights we are using strings to store the ID for this specific reason. We have no control over existing instrumentation and allow the use of pre-existing correlation practices alongside the Application Insights SDK.
---
I tend to agree with @SergeyKanzhelev from a philosophical point of view, even though it would be a breaking change in Jaeger, which currently, similar to Zipkin or StackDriver, only works with fixed-width IDs. There is no strong reason for those IDs to be fixed-width, aside from minor optimization for sampling where the ID can be treated as a random integer and one can do |
@yurishkuro do you use
In Application Insights we are currently using an EJB hash to calculate the sampling score (.NET, JavaScript). If we know that it is most probably just a 128-bit number, it can make life easier for score generation - this formula can be used
---
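A sketch of score generation under both assumptions; crc32 is a stand-in hash for the opaque-string case, not the EJB hash Application Insights actually uses, and the 1-in-10,000 score space is arbitrary:

```python
import zlib

def sampling_score(trace_id):
    # A fixed-width hex id can be used directly as a number; an opaque
    # string must be hashed to a number first.
    try:
        return int(trace_id, 16) % 10_000
    except ValueError:
        return zlib.crc32(trace_id.encode("utf-8")) % 10_000

def sampled(trace_id, rate=0.01):
    return sampling_score(trace_id) < rate * 10_000
```

Consistent hashing matters here: every tier must compute the same score from the same id, or sampling decisions diverge mid-trace.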
@mtwo I want to be clear that I'm not arguing against a common trace format. It has obvious advantages, and you've listed two great ones. My concern is that this standard will not take off in practice because it is not semantically general enough to represent the trace context for certain tracing systems. I've spoken at length about this with the Envoy team (cc @mattklein123) and with a variety of organizations (large and small) that depend on things like service meshes, L7 routers, etc. The requirements I've been hearing are basically (1) a truly standard header name (or "names" per @SergeyKanzhelev's points), (2) some way to separate the trace_id from the span_id, and perhaps a way to make sense of standard (e.g., sampling bits) and user-specific propagated state (aka baggage). For the purposes of these use cases, actually specifying the encoding of the ids is not important. I should mention that getting the service mesh stuff to work right end-to-end also requires some standardization of how to serialize what I typically refer to as "out-of-band" trace data, much like Zipkin issue #1499, though ideally not tied to zipkin (just as this tracecontext-spec discussion is not tied to zipkin). Stepping back, my hope is to pair a minimalist, generic in-band-trace-context format and a minimalist, generic out-of-band-trace-data format and allow things like service meshes (etc) work with as many tracing systems as possible without code changes in those service meshes (etc). cc += @opentracing/otsc @opentracing/otiab as there has been some discussion of wading into these waters given this topic's increasing importance. Getting back to @mtwo's vision of trace handoff between apps and PaaS/IaaS providers, I think it's appealing but also problematic in practice: my personal conversations with big-3 providers suggest there's little chance that app developers are going to be given access to traces of the internals of these managed services, many of which are proprietary. |
In general, things without an insane amount of marketing dont take off
until there is value proven. For example spdy led to http/2. Starting
practice here is important even if some are not here yet or willing. For
example B3 already is used across different systems and this can help
practice further.
I also spoke with Matt about envoy/istio. In the near term it is completely
unrealistic that a single header will work without tying the mesh to only
greenfield projects. Today, major meshes specify a pattern, and thinking you
can have one pattern knowing brownfield systems exist requires one to change
approach. This isn't one big company. You can't just tell people to change
their stuff. It doesn't work like that, and thinking it does is unrealistic.
Efforts needed by a mesh are larger than this format and should not tie this
format to its aims. There are other propagated data like flags etc. If I
were doing this I wouldn't try to increase the scope of something which already
has practical value (if nothing else, better than B3). I would start an
effort to make a propagation thing *decoupled from tracing* which doesn't
pretend everything will big-bang and use the same header names. I would try
to help forward a tool-agnostic propagation system which has buy-in from
others like those doing metering and flags. It would be more effective and
not be so brittle as to expect everything and every vendor to share a
header prefix.
Back to what we are doing here. Back in the day, google started spdy, then
more got involved, then http/2. Folks who are uninterested need not apply.
This is practical work with goals you may not be interested in or not
have interest in being solved. That might make it less compelling for you,
and that's fair. There are definitely folks who want interop, as otherwise B3
wouldn't be used today in systems that share no code. People run hybrid
clouds and are increasingly using lambda architectures which may call back
out of platform-run services. Iterating towards something more adoptable
than B3 moves the ball forward in practical and testable ways. We don't
pretend to solve something impractical or unsolvable.
If we prove value iterating, it is far better than pretending to solve
something and not, and far better than sitting on this doc afraid to
experiment.
---
@adriancole but what if the standard addressed both goals? Putting flags and baggage aside for a moment, what if it simply standardized on two headers (or one, with well defined value separator) to represent trace and span IDs as opaque strings, with a provision that when those IDs are fixed-width byte arrays they should be hex-encoded?
---
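The interop cases in Yuri's proposal can be sketched as a receiver-side decision, assuming 128-bit lowercase-hex ids for the fixed-width case; `accept_trace_id` is a hypothetical helper:

```python
import re

HEX_128 = re.compile(r"[0-9a-f]{32}")

def accept_trace_id(incoming):
    # A fixed-width receiver adopts compatible ids and falls back to
    # correlation-only logging for opaque ones (the third case above).
    if HEX_128.fullmatch(incoming):
        return ("adopt", int(incoming, 16))
    return ("correlate-only", incoming)
```

The "correlate-only" branch is the claimed improvement over no standard at all: even an incompatible id can still be logged and looked up later.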
> @adriancole but what if the standard addressed both goals? Putting flags and baggage aside for a moment, what if it simply standardized on two headers (or one, with a well defined value separator) to represent trace and span IDs as opaque strings, with a provision that when those IDs are fixed-width byte arrays they should be hex-encoded?
>
> - if both sending & receiving tracing systems use compatible fixed-width IDs (like Jaeger and Zipkin), then interop is achieved
> - if the receiver works with opaque strings, then interop is also achieved
> - if the sender uses strings but the receiver uses fixed-width IDs, interop is not possible, which is no worse than if this standard did not exist. It's actually better than nothing because at least the receiver can still log the incoming ID and support some sort of correlation lookup (according to Bogdan, the "big providers" are actually unlikely to trust external IDs anyway).
hey yuri. My main point is that this pull request is trying to solve all
problems in one go, and that's an unrealistic expectation and why there has
been nothing merged since last December.
For example, in a normal project we would merge some functionality, then
decorate things like a second header as an iteration on that, driven by use
cases. I would like to elaborate on some of your points, but this is a very
ineffective way to progress the several topics mentioned. For example,
elaborating on your third point would be yet another tangent better
contained to a topic (separate issue) about how to address external IDs.
---
I think we're trying to solve only one problem - interop. In the current form the proposal does not solve the problem of interop any more than already can be achieved with B3 headers, which are already a standard on their own, as you point out. I, too, prefer to close PRs quickly, but clearly there are many tracing systems that won't be able to use fixed-width byte arrays, so instead of unifying we'll be fragmenting the space even more. Do you think what I proposed above creates a serious issue for Dapper-like systems? To me it seems very doable to support that format. |
@yurishkuro I do not think creating a draft and proving it works with fixed length before widening scope is fragmenting. It is simple iteration, and separates problems. Just like someone wanting Elasticsearch in jaeger can follow the issue until that's resolved or otherwise, someone interested in variable length ids can follow that issue as well. It is simply a feature on one hand, but requires work on the other. Meanwhile, jaeger has interop tests for your trace context format. It is unfair to limit good testability to just your project. |
@yurishkuro @SergeyKanzhelev I opened an issue on opaque ids since it seems that's the only way to get one opened. #6 Please make an effort to raise separate issues for separate features. Hijacking this issue for an unbounded scope isn't an effective way to progress work and results in circular discussion. We all have limited amounts of time and doing real work with that time is better than cluttering this PR. |
@yurishkuro for trusting, because I saw multiple mentions of that: I said that the system at the edge may not trust the trace_id, but what we will do is still record them as Links for example (see https://github.com/census-instrumentation/opencensus-java/blob/master/api/src/main/java/io/opencensus/trace/Link.java), so in other words you don't directly use the same trace_id but you record it to keep the relationship/correlation. Also this is just a possible implementation, and for sure an implementation detail.
@yurishkuro we should also keep in mind that we would like to use the same format even when we send the trace context between processes in the same organization where we want to keep the same trace-id, so understanding the format is critical there.
@adriancole I think the idea to move forward with this proposal and start having working examples between different tracing systems/libraries will prove (or not) the current work.
@SergeyKanzhelev + @bhs your main concern is that we do not specify a length for the span-id/trace-id? Are there any other major concerns? If we add that, is it enough for you to use it?
@bhs you did not specify the format that you use in LightStep, is that available publicly?
---
Kinda? LightStep is quite adaptable and can downgrade functionality to accommodate many other trace formats, including weird in-house ones that don't have a public presence. That said, our product roadmap involves protocol changes which are strictly incompatible with the tracecontext proposal. I'm not suggesting that you need to make changes to accommodate LightStep, but since you ask, there's your answer. @adriancole, I understand your desire to merge something. IMO the discussion is sprawling here because the file we're actually reviewing starts off with this:
If this is indeed designed for "various tracing and diagnostics products", it only seems natural for those tracing and diagnostic products to explain where there are problems. Bogdan brought an earlier version of this proposal up at the workshop in Berlin earlier this year, and IIRC there were a lot of semantic mismatches when we got down to brass tacks. I want to be ultra-clear that I am actually quite enthusiastic about this work, but I also want to be ultra-clear that I think it would be more successful and adopt-able if it was one notch less opinionated about things that don't actually matter right now in practice (again, like bit widths).
Variable-length ids is one thing (btw, there's no need to specify a length IMO... it'd be fine in my view to just delimit fields and restrict ids to alphanumeric strings). Other problems are:
If you're asking me (you're not, I know :), I would be excited about a format that has the version-ish stuff exactly as you've proposed it, alphanumeric ids for trace and span ids separated by |
I think it would be more successful and adopt-able if it was one notch
less opinionated about things that don't actually matter right now in
practice (again, like bit widths).
I think many of us are a victim of your success, Ben. The detail of trace
ID contents matters precisely right now, as it simplifies the bootstrapping
of this project. Most dapper based systems do not store IDs as alphanumeric
strings, rather 64 or 128bits. These very systems exist today, are often
internal and do trust ids passed to them. If we open this up to arbitrary
text on day one, it guarantees the initial work will be less effective or
more time consuming or both.
With regards to success, I'm not worried about it as it is defined now, as
long as we can actually start.
---
Yes, agreed, but the storing of the IDs and the transmission of the IDs are separate concerns. We are already talking about sending ids as hexadecimal ASCII, which is (obviously) not the in-memory representation in most tracing systems. I feel like we're arguing in circles about this... as @bogdandrutu mentioned, cloud-managed services would most likely just record the inbound ids and not even use them "natively" for tracing, so the length restriction matters even less than previously argued AFAICT.
---
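The storage-vs-transmission distinction in code, assuming 128-bit ids hex-encoded on the wire; `new_trace_id` is a hypothetical helper:

```python
import os

def new_trace_id():
    raw = os.urandom(16)   # in-memory representation: 16 bytes / 128 bits
    wire = raw.hex()       # wire representation: 32 lowercase hex chars
    return raw, wire
```

The hex wire form is lossless, so a tracer can round-trip between its native bytes and the transmitted string without any correlation loss.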
> ## Field value
>
> `base16(<version>)-<version_format>)`
Is this meant to be `base16(<version>)-base16(<version_format>)`, or `base16(<version>)-<version_format>`?
I am going to merge this pull request and address open issues in separate PRs.
---
Add the first spec for the trace-context http header format.