Add the HTTP header format proposal for TraceContext propagation. #1
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
# IntelliJ IDEA | ||
.idea | ||
*.iml |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,86 @@ | ||
# Trace Context HTTP Header Format | ||
|
||
A trace context header is used to pass trace context information across systems | ||
for an HTTP request. Our goal is to share this with the community so that various | ||
tracing and diagnostics products can operate together, and so that services can | ||
pass context through them, even if they're not being traced (useful for load | ||
balancers, etc.). | ||
|
||
# Format | ||
|
||
## Header name | ||
|
||
`Trace-Context` | ||
'Via' is a standard header with pretty similar semantics. It is also in hpack.
|
||
|
||
## Field value | ||
|
||
`base16(<version>)-<version_format>` | ||
Is this meant to be |
||
|
||
The value will be US-ASCII encoded (which is UTF-8 compliant). Character `-` is | ||
used as a delimiter between fields. | ||
|
||
### Version | ||
|
||
Is 1 byte representing an 8-bit unsigned integer. Version 255 is reserved. | ||
rationale here is that we expect the key to remain the same even if the value changes. This way, we can change format in worst case where such a thing is needed, and that can be done without a fan out of administrative activity such as new filter patterns to afford a new trace header key. We should update this to make very clear that version changes are not expected and highly discouraged, especially breaking ones.

@adriancole I really like an explicit versioning. I think it will help a lot long term
@bogdandrutu was the idea of the version an incremental thing or the format of the header? Let's say

@SergeyKanzhelev cool. I am in favor of versioning at this point, notably to help folks know when to not process a header (ex check magic type of thing likely used similarly in X-Ray's format). Wrt version also being a format flag, not sure it applies here. For example, in grpc, binary headers are actually encoded as base64, and their header names have -bin appended to them. So in this case, they don't need a different format bit inside their encoded trace data as it is already implicit in the scheme.

what if a service is called from a device via http with the size concern and from another service with the regular header format? Does this service need to read two headers? Which one will win?
When you have a single header and allow for binary format - things get easier I believe |
||
|
||
### Version = 0 | ||
|
||
#### Format | ||
|
||
`base16(<trace-id>)-base16(<span-id>)-base16(<trace-options>)` | ||
|
||
All fields are required. Character `-` is used as a delimiter between fields. | ||
|
||
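A minimal sketch of splitting a version 0 value into its fields, assuming Python. The names `parse_trace_context_v0` and `TraceContextV0` are hypothetical and not part of this proposal; the field widths are the ones given in the Trace-id, Span-id and Trace-options sections. Accepting both upper- and lower-case hex digits is an assumption, since the text does not specify case.

```python
import re
from typing import NamedTuple, Optional


class TraceContextV0(NamedTuple):
    # Hypothetical container; the names are illustrative only.
    trace_id: bytes      # 16 bytes
    span_id: bytes       # 8 bytes
    trace_options: int   # 8-bit unsigned integer


# version (1 byte), trace-id (16 bytes), span-id (8 bytes), options (1 byte),
# each base16 encoded and separated by '-'.
_V0_PATTERN = re.compile(
    r"([0-9a-fA-F]{2})-([0-9a-fA-F]{32})-([0-9a-fA-F]{16})-([0-9a-fA-F]{2})"
)


def parse_trace_context_v0(value: str) -> Optional[TraceContextV0]:
    """Return the parsed fields, or None if the value is not a well-formed v0 value."""
    match = _V0_PATTERN.fullmatch(value)
    if match is None:
        return None  # wrong field count, wrong length, or non-hex characters
    version, trace_id, span_id, options = match.groups()
    if version != "00":
        return None  # some other version; this sketch only handles version 0
    return TraceContextV0(
        trace_id=bytes.fromhex(trace_id),
        span_id=bytes.fromhex(span_id),
        trace_options=int(options, 16),
    )
```

The parser only checks the shape of the value; whether an all-zero ID should cause the context to be ignored is left to the caller.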
#### Trace-id | ||
|
||
Is the ID of the whole trace forest. It is represented as a 16-byte array, | ||
e.g., `4bf92f3577b34da6a3ce929d0e0e4736`. A value with all bytes 0 is considered invalid. | ||
|
||
An implementation may decide to completely ignore the trace-context if the trace-id | ||
is invalid. | ||
|
||
#### Span-id | ||
|
||
Is the ID of the caller span (parent). It is represented as an 8-byte array, | ||
e.g., `00f067aa0ba902b7`. A value with all bytes 0 is considered invalid. | ||
|
||
An implementation may decide to completely ignore the trace-context if the span-id | ||
is invalid. | ||
|
||
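The validity rule for both IDs can be sketched as follows, assuming Python and already-decoded byte arrays. The helper names `is_valid_id` and `should_ignore_context` are hypothetical. The text only says an implementation *may* ignore the context; this sketch picks the stricter option and always ignores it.

```python
def is_valid_id(raw: bytes, expected_len: int) -> bool:
    """True if `raw` has the expected length and at least one non-zero byte."""
    return len(raw) == expected_len and any(raw)


def should_ignore_context(trace_id: bytes, span_id: bytes) -> bool:
    # Assumption: ignore the whole trace-context when either ID is invalid.
    # An implementation is free to choose a more lenient policy.
    return not (is_valid_id(trace_id, 16) and is_valid_id(span_id, 8))
```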
#### Trace-options | ||
it's odd to me that we support a versioning bit, yet have this 4-byte thing we only have 1 bit specified for... we could alternatively just have a single byte for version 0 and skip all of the endianness discussion.

I was expecting this to fill up (the 4-byte options) with things like what you proposed (sampling probability). Uber also suggested an extra bit for deferred sampling decision. I am trying to get for the moment the minimum requirement.

I am trying to collect data about how many bytes we need. So far the list contains 3 (based on Jaeger requirements). If the list is less than 8 we can definitely go with 1 byte for the options in v0. Should we have the sampling probability that @bhs proposed as a separate field?

Done. 1-byte used for options. |
||
|
||
Controls tracing options such as sampling, trace level etc. It is 1 byte | ||
representing an 8-bit unsigned integer. The least significant bit provides a | ||
recommendation whether the request should be traced or not (1 recommends the | ||
request should be traced, 0 means the caller does not make a decision to trace | ||
and the decision might be deferred). The flags are recommendations given by the | ||
caller rather than strict rules to follow for 3 reasons: | ||
|
||
1. Trust and abuse. | ||
2. Bug in the caller. | ||
3. Different load between caller service and callee service might force callee | ||
to downsample. | ||
Assuming sampling logic may differ much between vendors - why place sampling flag to the trace context? Why not pass it as a separate header that is subject for vendor-specific logic?
In some cases to resolve the issues you describe - extension libraries will need more than just a sampling flag. Most probably some additional information from
You may also have multi-tier sampling. Take an example of local agent mode that @bogdandrutu demo-ed for census. You may want to sample data that you send to backend with sampling rate
Another consideration - properties of

@SergeyKanzhelev consistent sampling across the whole trace is a very important characteristic. If the sampling decision is not propagated and the trace spans multiple implementations

@yurishkuro if you are making a sampling decision based on
When you do sampling you may need to estimate the count of spans that exhibit certain properties based on sampled data. In case of statistical sampling you may just multiply the raw count of spans to sampling percentage to get statistically accurate number. If sampling decision is forced from above without the information on sampling percentage of originator - you cannot do this type of estimation.
statistical sampling on every layer is just one example of different type of sampling you may want to implement and you will need more than a bit of information

Having the sampling bit does not prevent you from passing extra data or making more elaborate decisions, but it's not part of the proposed standard. NOT having the sampling bit pretty much guarantees that the trace will be broken, unless every implementation makes the decision based on the exact same formula, like
Sampling bit is a recommendation. If a service can respect it and handle the volume - great. If it cannot respect it 100%, maybe it can respect it for "more important" spans like RPCs, and shed the load by dropping in-process spans & metrics. Essentially it does not put any restrictions on how you want to implement sampling, but still provides a way to achieve consistent sampling across the stack, if the volume allows it.

That is precisely my question. Do you think it will be typical for libraries to trust this flag? Or every vendor and library will implement own logic so this flag will never be used? Should this standard define the flag or let customers decide on sampling algorithm across services in their org as a decision separate from the data correlation protocol?
Beauty of sampling flag is ability to implement solutions like forced data collection for period of time or specific span name. Also it removes the need to synchronize the sampling decision algorithm.
On negative side - services lose control of the data volume and statistical accuracy of collected data. Protocol is optimized for a single team owning many components with relatively similar load. It is not always the case. Every component may be owned by a team which wants to play nice and contribute to the overall correlation story, but has a bigger priority to fully control telemetry volume and distribution collected from this component.

I don't agree with this assessment - the flag is a recommendation, an implementation does not have to respect it if it thinks it can do a better job.

Yes, I think it's a very common implementation to simply trust the flag, in the absence of other knowledge about the system.

What I want to avoid is the situation when you heavily rely on implementation detail of upstream component sampling algo. Also I want to make sure there is an easy to replicate on any language mechanism to control the flood of telemetry in case of non-matching load patterns. Third, for many Application Insights scenarios we need to keep the sampling percentage that was used to sample telemetry out so we can run statistical algorithms on telemetry and recognize patterns. So single bit will not generally work for us as a universal sampling mechanism. I'd propose to have a |
||
|
||
The behavior of other bits is currently undefined. | ||
|
||
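Reading the sampling recommendation from the trace-options byte can be sketched as below, assuming Python. The names `SAMPLED_BIT`, `caller_recommends_sampling` and `callee_decides` are hypothetical, and the callee policy shown is only one possible reading of "recommendation": it treats a 0 bit as "do not sample", although the text allows the decision to be deferred in that case.

```python
SAMPLED_BIT = 0x01  # least significant bit of trace-options


def caller_recommends_sampling(trace_options: int) -> bool:
    """True if the caller recommends tracing this request."""
    # Only the least significant bit is inspected; the behavior of the
    # other bits is undefined, so they are ignored here.
    return bool(trace_options & SAMPLED_BIT)


def callee_decides(trace_options: int, callee_can_afford: bool) -> bool:
    """Hypothetical callee policy: follow the recommendation only if load allows."""
    return caller_recommends_sampling(trace_options) and callee_can_afford
```

A callee under heavy load can pass `callee_can_afford=False` to down sample regardless of what the caller asked for, which matches the third reason listed above.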
#### Examples of HTTP headers | ||
|
||
*Valid sampled Trace-Context:* | ||
|
||
``` | ||
Value = 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01 | ||
base16(<Version>) = 00 | ||
base16(<TraceId>) = 4bf92f3577b34da6a3ce929d0e0e4736 | ||
base16(<SpanId>) = 00f067aa0ba902b7 | ||
base16(<TraceOptions>) = 01 // sampled | ||
``` | ||
|
||
*Valid not-sampled Trace-Context:* | ||
|
||
``` | ||
Value = 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-00 | ||
base16(<Version>) = 00 | ||
base16(<TraceId>) = 4bf92f3577b34da6a3ce929d0e0e4736 | ||
base16(<SpanId>) = 00f067aa0ba902b7 | ||
base16(<TraceOptions>) = 00 // not-sampled | ||
``` |
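The example values above can be reproduced with a small encoder, sketched here in Python. The function name `format_trace_context_v0` is hypothetical; emitting lower-case hex is an assumption taken from the examples rather than from a stated rule.

```python
def format_trace_context_v0(trace_id: bytes, span_id: bytes, trace_options: int) -> str:
    """Build a version 0 header value from raw field values."""
    if len(trace_id) != 16:
        raise ValueError("trace-id must be 16 bytes")
    if len(span_id) != 8:
        raise ValueError("span-id must be 8 bytes")
    if not 0 <= trace_options <= 0xFF:
        raise ValueError("trace-options must fit in one byte")
    # Fields are base16 encoded and joined with '-'; "00" is the version.
    return "-".join(
        ["00", trace_id.hex(), span_id.hex(), format(trace_options, "02x")]
    )
```

For instance, passing the trace-id and span-id from the examples with options `1` yields the sampled value, and with options `0` the not-sampled one.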
Which systems are committed to support this? Which systems would you like to support this? Might be helpful to @-mention people from the latter so we can have any debates before this is merged.

putting in 2p even though the question was for @bogdandrutu :)
TL;DR; I'd expect tracing systems which maintain all of their tracers to make a more top-down decision, but yeah not currently the case in zipkin as this trace context is compatible with zipkin's wire format.
Currently, some zipkin-compatible tracers support vendor-specific or not widely used formats. Those types of tracers will have an easier time with this since zipkin's trace context isn't inherently incompatible with this specification (at the moment). I'd expect the cross-section of google+zipkin users to be first to ask, as it is likely google will land some variant of this first (grpc, cloud services and stackdriver instrumentation).
Similar to other things that happen, when that demand occurs it is usually in a repo or two. For example, our first requests for StackDriver and X-Ray trace support came from sleuth issues list. Tracers run independently and can move to support something sooner or later. I often ping people across tracers on things like this so that they can weigh in before organic demand hits.
Regardless, support or not support lies in the scope of each tracer to decide until there are server implications like the trace context is incompatible or too wide to store in zipkin.

With the current format (that is compatible with Zipkin and Google) we are trying to make at least these systems work with this format. Having a common place for the specs (maybe some simple implementation in multiple languages) is one of the goals.
Anyone who is interested in using this format is welcome to join the effort and send patches/PRs etc.