Add the HTTP header format proposal for TraceContext propagation. #1
Conversation
some things noticed, but overall nice job
HTTP_HEADER_FORMAT.md (Outdated)
@@ -0,0 +1,88 @@
# Trace Context HTTP Header Format

Date: 31/03/2017
I see why this is here, though it will cause maintenance churn.. wonder if there's a badge to auto-include the git timestamp?
Removed.
HTTP_HEADER_FORMAT.md (Outdated)

The value will be US-ASCII encoded (which is UTF-8 compliant).

### Version
Version or variant? maybe mention in docs this could be used by variants of the spec? or just wait and see..
(As a random, interested passerby:)
The problem with having an explicit version field is that you risk breaking all receivers as soon as you increment the field, which means a user can't do a staged rollout of a system that uses a new version of the protocol.
A better strategy would be to reserve some of the options bits and then require current implementations to set them to zero and ignore any trailing data. This way, you can evolve the protocol gradually.
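To make the strategy concrete, here is a sketch in Python. The field layout (16-byte trace-id, 8-byte span-id, 1-byte options, no version field) is illustrative only, not part of any proposal: a parser that tolerates trailing data, paired with a writer that zeroes the reserved bits.

```python
# Sketch of the "reserved option bits + ignore trailing data" strategy.
# The layout (16-byte trace-id, 8-byte span-id, 1-byte options) is
# illustrative, not normative.

def parse_context(header_hex: str):
    """Read the known prefix; tolerate trailing bytes from future revisions."""
    raw = bytes.fromhex(header_hex)
    if len(raw) < 25:  # 16 + 8 + 1 bytes minimum
        raise ValueError("header too short")
    trace_id = raw[0:16].hex()
    span_id = raw[16:24].hex()
    options = raw[24]
    # Anything past byte 24 is future data: ignored, never rejected.
    return trace_id, span_id, options

def build_context(trace_id_hex: str, span_id_hex: str, sampled: bool) -> str:
    # Current writers must leave the reserved option bits (1-7) as zero.
    options = 0x01 if sampled else 0x00
    return trace_id_hex + span_id_hex + format(options, "02x")
```

Under this scheme a newer sender can append fields without breaking older receivers, which is the staged-rollout property described above.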
HTTP_HEADER_FORMAT.md (Outdated)

#### Trace-id

Is the ID of the whole trace forest. It is represented as a 16-bytes array,
e.g., `0x4bf92f3577b34da6a3ce929d0e0e4736`. All bytes 0 is considered invalid.
is 0x prefix intended here?
Removed.
HTTP_HEADER_FORMAT.md (Outdated)

#### Trace-options

Controls tracing options such as sampling, trace level etc. It is a 4-bytes
representing a 32-bit unsigned integer in little-endian order. The least
java's default is big-endian.. there are a lot of java programmers
ps to elaborate on this, there are a couple of things in play. On one hand, if you have only one flag, endianness isn't an issue. Most users have sampled as a single bit, and that's the only flag routinely used in zipkin.
On the other hand, the platform default encoding of Java being big-endian means many Java programmers are not aware of endianness in general. Mistakes coding anything binary have been difficult in zipkin, resulting in json actually. While endianness is "little" elsewhere, anecdotally I've noticed those who have little-endian platforms already know how to do big endian. The main thing is that they get annoyed when the endianness isn't documented. Ex: when folks were doing thrift encoding I remember someone rather annoyed that the thrift TBinary format was big-endian without saying so (because it was implicitly that in java).
food for thought
java's default is big-endian.. there are a lot of java programmers
not only that, but big-endian is also "network byte order". Wikipedia says:
Big-endian is the most common format in data networking; fields in the protocols of the Internet protocol suite, such as IPv4, IPv6, TCP, and UDP, are transmitted in big-endian order. For this reason, big-endian byte order is also referred to as network byte order.
If I can write the 32-bit flags number as 0x00000001, it seems more natural to have 00000001 in the header.
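To illustrate the ambiguity being debated (a standalone Python sketch using the `struct` module, not part of the proposal): the same single-bit flags value serializes to different byte sequences depending on endianness, and a reader that assumes the wrong order silently loses the sampled bit.

```python
# Standalone illustration: one flags value, two byte orders.
import struct

flags = 0x00000001  # only the "sampled" bit set

little = struct.pack("<I", flags).hex()  # "01000000"
big = struct.pack(">I", flags).hex()     # "00000001" -- network byte order

# A reader that assumes big-endian but receives little-endian bytes
# decodes 16777216 (0x01000000) instead of 1: the sampled bit is lost.
misread = struct.unpack(">I", struct.pack("<I", flags))[0]
```

This is exactly the failure mode that disappears if the options field is a single byte, or if the byte order is spelled out in the spec.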
I know that network byte order is big-endian. But probably 99% of the CPUs in use are little-endian.
Let's vote here #3 for this.
I think @bhs's proposal to reduce this to 1 byte is an alternative option. Updated the issue with this possibility.
Done. 1-byte used for options.
HTTP_HEADER_FORMAT.md (Outdated)

```
Value = 004bf92f3577b34da6a3ce929d0e0e473600f067aa0ba902b701000000
Version = 0 (0x00)
TraceId = 0x4bf92f3577b34da6a3ce929d0e0e4736
```
0x prefix will likely lead to encoding bugs as people will almost certainly copy/paste if we give opportunity to
+1
HTTP_HEADER_FORMAT.md (Outdated)

```
Version = 0 (0x00)
TraceId = 0x4bf92f3577b34da6a3ce929d0e0e4736
SpanId = 0x00f067aa0ba902b7
TraceOptions = 0x0 // not-sampled
```
make an example that breaks with endian, especially if we use little endian
@adriancole This is a terrific initiative. The proposal is looking quite reasonable. I think the only debatable piece is the presence of Version. I would rather use your suggestion, like Variant, or Encoding, or Scheme. To give an example, I would like to see something like this: `Z<traceid><spanid><options>`, where `Z` would tell me that this is the Zipkin encoding scheme. For HTrace it could be `H<spanId><options>` (there is no separate Trace Id; it is encoded in the Span Id, at least in 4.1). It is the same 8 bits, but using characters instead of 0. Not sure if it makes sense to you guys. What do you think?
No more endianness.
I like the idea of maybe having multiple supported systems, but one of the main purposes is to make all these systems work together, so having separate encodings for each system will not be that much of a win.
@adriancole Will this become the TraceContext propagation spec of zipkin? Or do you have a wider purpose?
@wu-sheng this is a spec for dapper-like systems. It will take some time for something like this to work its way through implementations and interop, but it is indeed something I'd like to see possible in zipkin tracers. For example, there's some TODO: work left with brave to show an alternative context header, and this format is what I'll use to test that.
@adriancole I will keep watching this. Your trace-related projects are very interesting to me.
Thanks for doing this, @bogdandrutu. I made some comments.
@bogdandrutu what's the vision/scope for the TraceContext GitHub org?
HTTP_HEADER_FORMAT.md (Outdated)

Date: 31/03/2017

A trace context header is used to pass trace context information across systems
for a HTTP request. Our goal is to share this with the community so that various
I hear a lot of pain around tracing things like messages buses / kafka / etc. Might be worth intentionally decoupling from HTTP and making this about any messaging tech that has room for metadata/headers.
HTTP_HEADER_FORMAT.md (Outdated)

#### Trace-id

Is the ID of the whole trace forest. It is represented as a 16-bytes array,
The room in 🇩🇪 seemed to favor a variable-length field here... I thought that was for good reason, too: 128 bits of precision is completely unnecessary for the purposes of most tracing systems, yet insufficient for others we heard about that needed 192. Why does this spec need to try and get this right? (We can set a hard max at 32 bytes or something if you are concerned about unbounded allocations)
fwiw As a unit, zipkin usually implements things a lot later than the first request, after significant cross-project interest exists and there are hands to do it. Ex: 128-bit trace ids were implemented probably 2 years after the first sites started forking zipkin to accomplish this. This happened after a couple of more recent glitches occurred, including users of AWS lambda not being able to reuse the uuid there, as well as zipkin users who have hybrid sites that employ stackdriver (which is 128-bit). This topic has been discussed at length on other forums, so not trying to replicate it here. Suffice to say that 128-bit is definitely in use in zipkin sites.
To make things less conjectural, it would be nice to have a matrix of tracing systems and their trace context requirements someplace, maybe in a google doc or otherwise.
In practice, Zipkin has never been asked directly to support a bit length higher than 128. However, we have been asked for more "squishy" things like opaque strings. Also, some tracers like sleuth were asked to support lambda, which has an interesting timestamp+identifier trace id format. Having a matrix around can make things more objective, as people can see whether all of the systems they care about can work with a spec or not. It won't be about what most systems need, but about what a given site is running. For example, 9/10 is meaningless if that last one is the only one they use.
If we accept a variable-size context we should definitely have a max size for both trace_id/span_id; if 32 bytes fits all the systems then we should probably change to that. Do you know of any system that is willing to change to a common format and uses more than 128 bits? As @adriancole mentioned, for the moment the systems that are interested in using this format are Zipkin and Google, and both are fine with 128 bits.
One option to move forward is to make v0 have fixed sizes; we can define v1 later, when we have a clear system that wants to use this and requires more than 128 bits.
Also filed an issue to keep track of who wants/is willing to use this format.
Please add your project here #4. Add any extra requirements.
Tracelytics uses 160-bit trace IDs currently, and we are interested in a common format. FWIW AWS X-Ray uses a 128-bit trace ID format consisting of a 64-bit timestamp concatenated (using a hyphen) to a 96-bit unique ID, both encoded in hex.
HTTP_HEADER_FORMAT.md (Outdated)

*Valid sampled Trace-Context:*

```
Value = 004bf92f3577b34da6a3ce929d0e0e473600f067aa0ba902b701000000
```
IMO this is over-optimized / pre-optimized for performance. In the context of an RPC, parsing these things is going to be a joke from an overhead standpoint... I'd rather have an id format that's human-readable, or at least human-parseable (without counting characters by hand, that is).
E.g., this:
Value = 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01000000
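A sketch of how the delimited form above could be parsed (the field names and the version-traceid-spanid-options order are inferred from the example value; they are illustrative, not part of the spec):

```python
# Parsing the delimited header form shown above. Field names and order
# are inferred from the example and are illustrative only.

def parse_delimited(value: str) -> dict:
    version, trace_id, span_id, options = value.split("-")
    if len(trace_id) != 32 or len(span_id) != 16:
        raise ValueError("unexpected field length")
    return {
        "version": int(version, 16),
        "trace_id": trace_id,
        "span_id": span_id,
        "options": options,
    }

ctx = parse_delimited("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01000000")
```

A single `split("-")` replaces the character counting a human (or a hand-written parser) would otherwise need against the undelimited value.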
made comment elsewhere but I also agree a delimited format is near required from an operator POV. Especially as people cross-reference trace ids in other systems. Like splunk search shows something where one system logs the trace id as a logging key. Copy/paste this into zipkin or another system. There's also the "curl"ability which is "can a user work only having curl?" Straightforwardness is important for adoption, or rejection prevention.
I think it is not unreasonable to add an extra character, especially for the HTTP/string version, to make this more readable.
If there is agreement I think we should do it this way. I will start a few issues where we can vote on certain things to get all these concerns/issues/questions resolved and CC interested people.
Please vote/comment here: #2
Done. Used a delimiter between all fields.
Implementation may decide to completely ignore the trace-context if the span-id
is invalid.

#### Trace-options
it's odd to me that we support a versioning bit, yet have this 4-byte thing we only have 1 bit specified for... we could alternatively just have a single byte for version 0 and skip all of the endianness discussion.
I was expecting the 4-byte options to fill up with things like what you proposed (sampling probability). Uber also suggested an extra bit for a deferred sampling decision.
For the moment I am trying to get to the minimum requirement.
I am trying to collect data about how many bytes we need. So far the list contains 3 (based on Jaeger requirements).
If the list is less than 8 we can definitely go with 1 byte for the options in v0.
Should we have the sampling probability that @bhs proposed as a separate field?
Done. 1-byte used for options.
@@ -0,0 +1,88 @@
# Trace Context HTTP Header Format
Which systems are committed to supporting this? Which systems would you like to support this? It might be helpful to @-mention people from the latter so we can have any debates before this is merged.
putting in 2p even though the question was for @bogdandrutu :)
TL;DR: I'd expect tracing systems which maintain all of their tracers to make a more top-down decision, but yeah, that's not currently the case in zipkin, as this trace context is compatible with zipkin's wire format.
Currently, some zipkin-compatible tracers support vendor-specific or not widely used formats. Those types of tracers will have an easier time with this since zipkin's trace context isn't inherently incompatible with this specification (at the moment). I'd expect the cross-section of google+zipkin users to be first to ask, as it is likely google will land some variant of this first (grpc, cloud services and stackdriver instrumentation).
Similar to other things that happen, when that demand occurs it is usually in a repo or two. For example, our first requests for StackDriver and X-Ray trace support came from the sleuth issues list. Tracers run independently and can move to support something sooner or later. I often ping people across tracers on things like this so that they can weigh in before organic demand hits.
Regardless, whether to support this lies in the scope of each tracer to decide, until there are server implications, such as the trace context being incompatible or too wide to store in zipkin.
With the current format (which is compatible with Zipkin and Google) we try to make at least these systems work with this format. Having a common place for the specs (maybe some simple implementations in multiple languages) is one of the goals.
Anyone who is interested in using this format is welcome to join the effort and send patches/PRs etc.
@saintstack @reta any feedback on this from the HTrace POV?

## Header name

`Trace-Context`
'Via' is a standard header with pretty similar semantics. It is also in hpack.
`Trace-Context` can be mistaken for something carrying the baggage or context properties. This is an id. We decided to use `Request-Id` for the correlation protocol in .NET. If we are able to converge on a format, it may be a good one to use. `Request-Id` is quite descriptive for the header's purpose, I think.
PTAL adriancole@
HTTP_HEADER_FORMAT.md (Outdated)
@@ -0,0 +1,88 @@
# Trace Context HTTP Header Format

Date: 31/03/2017
Removed.
HTTP_HEADER_FORMAT.md (Outdated)

#### Trace-id

Is the ID of the whole trace forest. It is represented as a 16-bytes array,
e.g., `0x4bf92f3577b34da6a3ce929d0e0e4736`. All bytes 0 is considered invalid.
Removed.
Implementation may decide to completely ignore the trace-context if the span-id
is invalid.

#### Trace-options
Done. 1-byte used for options.
HTTP_HEADER_FORMAT.md (Outdated)

#### Trace-options

Controls tracing options such as sampling, trace level etc. It is a 4-bytes
representing a 32-bit unsigned integer in little-endian order. The least
Done. 1-byte used for options.
HTTP_HEADER_FORMAT.md (Outdated)

*Valid sampled Trace-Context:*

```
Value = 004bf92f3577b34da6a3ce929d0e0e473600f067aa0ba902b701000000
```
Done. Used a delimiter between all fields.
HTTP_HEADER_FORMAT.md (Outdated)

```
Version = 0 (0x00)
TraceId = 0x4bf92f3577b34da6a3ce929d0e0e4736
SpanId = 0x00f067aa0ba902b7
TraceOptions = 0x0 // not-sampled
```
No more endianness.
I like the idea of maybe having multiple supported systems, but one of the main purposes is to make all these systems work together, so having separate encodings for each system will not be that much of a win.
1. Trust and abuse.
2. Bug in caller
3. Different load between caller service and callee service might force callee
to down sample.
Assuming sampling logic may differ a lot between vendors, why place the sampling flag in the trace context? Why not pass it as a separate header that is subject to vendor-specific logic? In some cases, to resolve the issues you describe, extension libraries will need more than just a sampling flag: most probably some additional information from `baggage` or other headers.
You may also have multi-tier sampling. Take the example of the local agent mode that @bogdandrutu demo-ed for census. You may want to sample data that you send to the backend with sampling rate 0.01 and locally with 0.1. So you'd need more than one bit to carry both sampling decisions.
Another consideration: the properties of `Trace-Context` belong to the new span you create. So a developer or extension library can access those fields by taking the "running" span details, and you can use those identifiers for non-tracing needs. The sampling flag will not be one of the span's properties (will it?), so you have no access to the original value of that flag. And you may not need any options, as you may only care about the identifiers. So having options in Trace-Context looks inconsistent.
@SergeyKanzhelev consistent sampling across the whole trace is a very important characteristic. If the sampling decision is not propagated and the trace spans multiple implementations t1,...,tN, then each implementation will have to make a sampling decision over and over, and the probability that you will capture the whole trace becomes very small, p1 * ... * pN.
@yurishkuro if you are making a sampling decision based on traceid, you will have min(p1, p2, ..., pN) full traces, not the multiplication. If all probabilities match, all the traces will be collected fully.
When you do sampling you may need to estimate the count of spans exhibiting certain properties based on sampled data. In the case of statistical sampling you may just multiply the raw count of spans by the inverse of the sampling percentage to get a statistically accurate number.
If the sampling decision is forced from above, without information on the originator's sampling percentage, you cannot do this type of estimation.
statistical sampling on every layer is just one example of the different types of sampling you may want to implement, and you will need more than a bit of information
if you are making a sampling decision based on traceid...
When you do sampling you may need to estimate the count of spans
Having the sampling bit does not prevent you from passing extra data or making more elaborate decisions, but that's not part of the proposed standard. NOT having the sampling bit pretty much guarantees that the trace will be broken, unless every implementation makes the decision based on the exact same formula, like traceId < p * 2^128, which is again not part of the proposed standard.
The sampling bit is a recommendation. If a service can respect it and handle the volume, great. If it cannot respect it 100%, maybe it can respect it for "more important" spans like RPCs, and shed load by dropping in-process spans & metrics. Essentially it does not put any restrictions on how you implement sampling, but it still provides a way to achieve consistent sampling across the stack, if the volume allows it.
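The `traceId < p * 2^128` formula mentioned above can be sketched as follows (hypothetical helper, not part of the proposal). Because the decision is a pure function of the trace-id, every tracer that applies it with the same probability reaches the same verdict, and tracers with different probabilities yield min(p1, ..., pN) complete traces rather than the product.

```python
def consistent_sample(trace_id_hex: str, p: float) -> bool:
    """Deterministic sampling: sample iff traceId < p * 2**128."""
    return int(trace_id_hex, 16) < p * (1 << 128)

# Independent tracers using the same p agree on every trace; a tracer with
# a larger p samples a strict superset of a tracer with a smaller p.
tid = "4bf92f3577b34da6a3ce929d0e0e4736"  # roughly 0.297 of the id space
sampled_at_50pct = consistent_sample(tid, 0.5)  # True
sampled_at_10pct = consistent_sample(tid, 0.1)  # False
```

This only works if trace-ids are uniformly random, which is one reason the propagated bit is a simpler interop contract than a shared formula.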
That is precisely my question. Do you think it will be typical for libraries to trust this flag? Or will every vendor and library implement its own logic, so this flag will never be used? Should this standard define the flag, or let customers decide on a sampling algorithm across services in their org as a decision separate from the data correlation protocol?
The beauty of the sampling flag is the ability to implement solutions like forced data collection for a period of time or for a specific span name. It also removes the need to synchronize the sampling decision algorithm.
On the negative side, services lose control of the data volume and the statistical accuracy of collected data.
The protocol is optimized for a single team owning many components with relatively similar load. That is not always the case. Every component may be owned by a team which wants to play nice and contribute to the overall correlation story, but has a bigger priority to fully control the telemetry volume and distribution collected from this component.
On the negative side, services lose control of the data volume and the statistical accuracy of collected data.
I don't agree with this assessment: the flag is a recommendation; an implementation does not have to respect it if it thinks it can do a better job.
Do you think it will be typical for libraries to trust this flag?
Yes, I think it's a very common implementation to simply trust the flag, in the absence of other knowledge about the system.
- I'm not against a flag that will force all layers to record a specific trace. It is very useful for troubleshooting scenarios.
- I also agree that collecting the entire trace as often as possible is very useful.
- It is unavoidable that downstream services will collect more telemetry than was sampled in by the front door. So there will be incomplete traces.
What I want to avoid is a situation where you heavily rely on implementation details of the upstream component's sampling algo. Also, I want to make sure there is a mechanism, easy to replicate in any language, to control the flood of telemetry in case of non-matching load patterns. Third, for many Application Insights scenarios we need to keep the sampling percentage that was used when sampling telemetry out, so we can run statistical algorithms on the telemetry and recognize patterns. So a single bit will not generally work for us as a universal sampling mechanism.
I'd propose to have a `debug` flag with the semantics of occasional forceful tracing across the system. The sampling flag can then be vendor-specific.
### Version

Is a 1-byte representing a 8-bit unsigned integer. Version 255 reserved.
The rationale here is that we expect the key to remain the same even if the value changes. This way, we can change the format in the worst case where such a thing is needed, and that can be done without a fan-out of administrative activity such as new filter patterns to afford a new trace header key. We should update this to make very clear that version changes are not expected and highly discouraged, especially breaking ones.
@adriancole I really like explicit versioning. I think it will help a lot long term.
@bogdandrutu was the idea of the version an incremental thing, or the format of the header? Let's say `00` is what is explained here, `01` is a binary format, `02` is an alternative length of request ID or another implementation? We've been thinking of suggesting its use to indicate an alternative way of generating identifiers.
@SergeyKanzhelev cool. I am in favor of versioning at this point, notably to help folks know when to not process a header (ex check magic type of thing likely used similarly in X-Ray's format).
Wrt version also being a format flag, not sure it applies here. For example, in grpc, binary headers are actually encoded as base64, and their header names have -bin appended to them. So in this case, they don't need a different format bit inside their encoded trace data as it is already implicit in the scheme.
What if a service is called from a device via http with the size concern, and from another service with the regular header format? Does this service need to read two headers? Which one will win?
When you have a single header and allow for a binary format, things get easier, I believe.
If we take out the sampling flag, what we are left with in general is a way
to specify ids in a tree. Users of this format can do logging or other
sorts of correlation using them. What we couldn't do is notify in any
portable way that the trace/span ID pair was reported to the tracing system.
I actually see this flag as different than a sampling algo data. It is
really a signal that the ID you just received is either on the way to the
tracing system or accidentally dropped. Sampling algo data surely can be a
side-car to enrich this information.
What would be the harm in sending this, if it were defined as a signal of
reporting status vs sampling algo? Well, maybe someone who does
coordinated, after-the-fact pull collection would get nervous? Maybe they
are nervous because the best they could do is send 0 as they don't know if
they will report that span or not. If this is true, worst case is as bad as
not defining the bit... such a system doesn't know and leaves the bit unset.
On the other hand, allowing the bit to be set preserves a widely used
feature for systems who have trace-id consistent export policy. If there
was a compelling reason to prevent these systems from being able to interop
with a single header, we'd add another header to do it, and use it
consistently. Vendor-specific would just undo interop between systems who
surely can today (bear in mind multiple systems use zipkin's B3 today!)
This is important to reason through because we don't want to undermine the
common good if we don't have to.
---
The TL;DR would be: maybe we can restate the sample flag from a
command to an FYI.
Like instead of Sampled=1 meaning "you better sample this", it means
"FYI, some of this trace was reported to the tracing system".
A lot of current tracers will see the FYI and decide to also do that,
especially for internal networks.
The command of "force trace" could be another flag as sergey mentions.
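A minimal sketch of the "FYI vs. command" reading discussed above; the bit positions and names are hypothetical, not from the spec:

```python
SAMPLED = 0x1  # FYI: upstream reported (some of) this trace
FORCED = 0x2   # hypothetical "force trace" command flag Sergey mentions

def should_record(options, local_decision):
    if options & FORCED:
        return True          # an explicit command wins
    if options & SAMPLED:
        return True          # follow upstream's reporting, common on internal networks
    return local_decision    # no signal: fall back to the local sampler
```

Under this reading, a system doing after-the-fact pull collection simply never sets `SAMPLED`, which is no worse than the bit not being defined at all.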
---
I'm trying to summarize the main differences between this format and other formats:

- This format vs. B3:
- This format vs. AWS. The best link I found:
- This format vs. .NET proposed:

I'm trying to get a sense of why to use the new format rather than one of the existing ones. I still think it would make sense to detach the sampling/debug flag from the identity information. Support of less strict identifiers can also help with integrating with other monitoring systems. I'll post another comment on how the analog of this format can be merged with
---
Another typical case is Heroku with
---
@SergeyKanzhelev thank you for the brief survey above. The more I think about this issue the more I like the idea of leaving the precise bit-width and even the encoding of the trace and span ids out of this spec... why can't they just be strings? (I.e., from a passing-from-process-to-process standpoint, the encoding doesn't matter as long as the (opaque) trace/span ID can be extracted) |
@bhs I think many existing tracing systems won't be able to support trace ID as arbitrary string, but I think this is actually an important question about this spec: is it meant to enforce a single trace ID across different tracing backends, or just provide a correlation between different sub-graphs? At the last Zipkin workshop it was mentioned that blindly respecting a foreign trace ID is dangerous, since a bad player could send the same ID over and over, messing up all of the internal traces for another sub-system. On the other hand, treating it simply as a correlation ID (foreign reference) is safe and does not require any conventions about the bit-length and encoding. |
I absolutely agree – my idea here is less that all tracing systems should support an arbitrary string, and more that the tracing system should have control over what a trace ID looks like. I'd be fine restricting it to hex or base36 or whatever if that makes people feel better about it. I just don't see the ROI in trying to enforce a particular bit width, etc. Interop between tracing systems within a single application deployment has always been a non-goal of mine. I'm more interested in allowing the routers and service meshes (and other L7 tech of the world) to pass tracing information along properly. |
I think ben has a different goal as mentioned here. This is an interop spec
for those who can, not just a "firewall rule" for intermediaries. Those who
can pass a string can also pass an encoded one. By specifying a type many
can agree to, many will interop even better than they can now. Some will
not use this spec anyway. If we remove the encoding this becomes worse than
B3 not better in terms of interop.
Regardless those propagating other things like tags will need another
"firewall rule" expression to forward them. The scope of a generic template
for forwarding rules is a different one than this.
---
For the sake of argument, let's say we didn't have encoding specified. Then
we'd basically consider whatever the character set is with regards to how
big the ID is. So, the storage/indexing aspect is much wider, and there
would be no means to reliably narrow if an ID is passed bigger than some
can store: some folks have inexpensive setups that are limited to 64bit
trace ids.
If the encoding was instead specified (ideally hex as it is now), storage
can at least reliably narrow 128bit to 64bit at risk of collision. If
encoding is not specified (as suggested as opaque), you really couldn't do
anything like this reliably without interpreting the data, or adding
functionality which isn't universal like linked ids.
Any time we have to map identifiers, there's correlation loss; for example,
the ID ends up in a mildly different encoding when plopped into logs vs. tracing vs.
other things. It would be a sad step back if the majority of us decided
not to interop even to the point of specifying ID encoding.
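The narrowing argument can be sketched as follows, assuming lowercase hex encoding; `narrow_trace_id` is a hypothetical helper, not part of any spec:

```python
def narrow_trace_id(trace_id_hex):
    # With a specified hex encoding, a store limited to 64-bit trace ids can
    # deterministically keep the low 64 bits (rightmost 16 hex chars) of a
    # 128-bit id, at some risk of collision. With opaque ids, no such
    # reliable rule exists without interpreting the data.
    if len(trace_id_hex) == 32:
        return trace_id_hex[-16:]
    return trace_id_hex

print(narrow_trace_id("4bf92f3577b34da6a3ce929d0e0e4736"))
```

Every participant applying the same deterministic rule is what keeps a narrowed id correlatable across systems.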
---
@adriancole I understand the frustration. I still have questions about this format. We are also thinking about how it can be converged with the .NET HTTP correlation protocol and are working on a proposal. That's why all the questions. My understanding was that the idea of this proposal was to set a standard for all things tracing, not to limit it to the Census and Zipkin use case. @mtwo, @bogdandrutu it would be great if you could summarize the open questions from this thread and the plans to merge and evolve the format with Census (I'll also try to reach you directly tomorrow).
---
> @adriancole I understand frustration. I still have questions to this format. We also thinking how it can be converged with .NET http correlation protocol and working on proposal. That's why all the questions.
It is mainly an issue of github pull requests not being as good at
exploratory questions as google doc (which we started with last year and
preceded this). I'm trying to be conscious of the near impossibility of
anyone jumping in and understanding the threads here.
> My understanding was that the idea of this proposal was to set a standard for all things tracing, not limit it to Census and Zipkin use case.
nothing can standardize all things tracing. that's an unrealistic goal, as
there are dramatically different things and folks who are not participating
on this (such as htrace, aws, x-trace, pinpoint, app dynamics, etc). we
cannot expect to standardize something for every use case but we could
standardize something for dapper-like systems.
---
I likely don't have time to finish this argument, but I do want to re-raise the fact that the benefits of specifying a bit width for ids are in my mind outweighed by the benefits of not-specifying a bit width. The path-id thing is a perfect example... I actually implemented something like that for a few recent demos and expect to see something similar in production before too long; I would love to use a standard inter-process propagation format, but that would be a non-starter with the fixed-width id scheme (at least without some arbitrary baggage map to stick things in instead; though I can understand leaving something that generic out of the spec for now). I think the current proposal is utterly sane but too narrow. What we'll end up with is yet-another-standard for the inter-process propagation (and maybe even a decent one), but a standard that's too narrow to absorb the semantics of other popular propagation formats and thus one that's limited. I.e., I like it, but I'd like it more if it only placed the restrictions that are literally necessary: since the ids aren't parsed, who cares how wide they are? |
@bhs I was hoping for the same. I like some ideas in this format though, like the format number prefix and the clean dash-delimited string. In most distributed systems I see one header for an always-changing unique string and one for something you need to keep along the entire trace. Most of the time the first is optimized to have a prefix of the
I've seen older systems that do not do
This way, as I mentioned in #4, you can force all "dumb http services" to simply trace the content of a header with the specific name alongside the logs, or pass it forward if it is a proxy of sorts.
---
So ben/sergey would you use this if it didn't specify id constraints? I
haven't seen lightstep on the issue tracking stakeholders and their
requirements.
If there was no format, how would you interop with another cloud, or is that
just not important to you? How would systems that desire interop do so if
nothing is specified, practically speaking? Would they scan the characters
and guess? What's the point?
---
I agree with Adrian. Two of the goals of this effort benefit from a common format encoding:
---
@adriancole @mtwo the important thing is semantics. One header should be unique for every request and another should be constant for a trace. If you have a string whose semantic is
Furthermore - if you recognize the format of the header, you can apply some additional logic to it. So having similar instrumentation and format will allow for extra features, like ordering or such. In Application Insights we are using strings to store the ID for this specific reason. We have no control over existing instrumentation and allow the use of pre-existing correlation practices alongside the Application Insights SDK.
---
I tend to agree with @SergeyKanzhelev from a philosophical point of view, even though it would be a breaking change in Jaeger, which currently, similar to Zipkin or StackDriver, only works with fixed-width IDs. There is no strong reason for those IDs to be fixed-width, aside from minor optimization for sampling where the ID can be treated as a random integer and one can do |
@yurishkuro do you use
In Application Insights we are currently using an EJB hash to calculate the sampling score (.NET, JavaScript). If we know that it is most probably just a 128-bit number, it can make life easier for score generation - this formula can be used
---
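A sketch of score generation under both assumptions; crc32 is a stand-in hash for the opaque-string case, not the EJB hash Application Insights actually uses, and the 1-in-10,000 score space is arbitrary:

```python
import zlib

def sampling_score(trace_id):
    # A fixed-width hex id can be used directly as a number; an opaque
    # string must be hashed to a number first.
    try:
        return int(trace_id, 16) % 10_000
    except ValueError:
        return zlib.crc32(trace_id.encode("utf-8")) % 10_000

def sampled(trace_id, rate=0.01):
    return sampling_score(trace_id) < rate * 10_000
```

Consistent hashing matters here: every tier must compute the same score from the same id, or sampling decisions diverge mid-trace.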
@mtwo I want to be clear that I'm not arguing against a common trace format. It has obvious advantages, and you've listed two great ones. My concern is that this standard will not take off in practice because it is not semantically general enough to represent the trace context for certain tracing systems. I've spoken at length about this with the Envoy team (cc @mattklein123) and with a variety of organizations (large and small) that depend on things like service meshes, L7 routers, etc. The requirements I've been hearing are basically (1) a truly standard header name (or "names" per @SergeyKanzhelev's points), (2) some way to separate the trace_id from the span_id, and perhaps a way to make sense of standard (e.g., sampling bits) and user-specific propagated state (aka baggage). For the purposes of these use cases, actually specifying the encoding of the ids is not important. I should mention that getting the service mesh stuff to work right end-to-end also requires some standardization of how to serialize what I typically refer to as "out-of-band" trace data, much like Zipkin issue #1499, though ideally not tied to zipkin (just as this tracecontext-spec discussion is not tied to zipkin). Stepping back, my hope is to pair a minimalist, generic in-band-trace-context format and a minimalist, generic out-of-band-trace-data format and allow things like service meshes (etc) work with as many tracing systems as possible without code changes in those service meshes (etc). cc += @opentracing/otsc @opentracing/otiab as there has been some discussion of wading into these waters given this topic's increasing importance. Getting back to @mtwo's vision of trace handoff between apps and PaaS/IaaS providers, I think it's appealing but also problematic in practice: my personal conversations with big-3 providers suggest there's little chance that app developers are going to be given access to traces of the internals of these managed services, many of which are proprietary. |
In general, things without an insane amount of marketing dont take off
until there is value proven. For example spdy led to http/2. Starting
practice here is important even if some are not here yet or willing. For
example B3 already is used across different systems and this can help
practice further.
I also spoke with Matt about envoy/istio. In the near term it is completely
unrealistic that a single header will work without tying the mesh to only
greenfield projects. Today, major meshes specify a pattern, and thinking you
can have one pattern knowing brownfield systems exist requires one to change
approach. This isn't one big company. You can't just tell people to change
their stuff. It doesn't work like that, and thinking it does is unrealistic.
Efforts needed by a mesh are larger than this format and should not tie this
format to its aims. There are other propagated data like flags etc. If I
were doing this I wouldn't try to increase the scope of something which already
has practical value (if nothing else, better than B3). I would start an
effort to make a propagation thing *decoupled from tracing* which doesn't
pretend everything will big-bang and use the same header names. I would try
to help forward a tool-agnostic propagation system which has buy-in from
others like those doing metering and flags. It would be more effective and
not be so brittle as to expect everything and every vendor to share a
header prefix.
Back to what we are doing here. Back in the day, google started spdy, then
more got involved, then http/2. Folks who are uninterested need not apply.
This is practical work with goals you may not be interested in or not
have interest in being solved. That might make it less compelling for you,
and that's fair. There are definitely folks who want interop, as otherwise B3
wouldn't be used today in systems that share no code. People run hybrid
clouds and are increasingly using lambda architectures which may call back
out of platform-run services. Iterating towards something more adoptable
than B3 moves the ball forward in practical and testable ways. We don't
pretend to solve something impractical or unsolvable.
If we prove value iterating, it is far better than pretending to solve
something and not, and far better than sitting on this doc afraid to
experiment.
---
@adriancole but what if the standard addressed both goals? Putting flags and baggage aside for a moment, what if it simply standardized on two headers (or one, with well defined value separator) to represent trace and span IDs as opaque strings, with a provision that when those IDs are fixed-width byte arrays they should be hex-encoded?
---
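The interop cases in Yuri's proposal can be sketched as a receiver-side decision, assuming 128-bit lowercase-hex ids for the fixed-width case; `accept_trace_id` is a hypothetical helper:

```python
import re

HEX_128 = re.compile(r"[0-9a-f]{32}")

def accept_trace_id(incoming):
    # A fixed-width receiver adopts compatible ids and falls back to
    # correlation-only logging for opaque ones (the third case above).
    if HEX_128.fullmatch(incoming):
        return ("adopt", int(incoming, 16))
    return ("correlate-only", incoming)
```

The "correlate-only" branch is the claimed improvement over no standard at all: even an incompatible id can still be logged and looked up later.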
> @adriancole but what if the standard addressed both goals? Putting flags and baggage aside for a moment, what if it simply standardized on two headers (or one, with a well defined value separator) to represent trace and span IDs as opaque strings, with a provision that when those IDs are fixed-width byte arrays they should be hex-encoded?
>
> - if both sending & receiving tracing systems use compatible fixed-width IDs (like Jaeger and Zipkin), then interop is achieved
> - if the receiver works with opaque strings, then interop is also achieved
> - if the sender uses strings but the receiver uses fixed-width IDs, interop is not possible, which is no worse than if this standard did not exist. It's actually better than nothing because at least the receiver can still log the incoming ID and support some sort of correlation lookup (according to Bogdan, the "big providers" are actually unlikely to trust external IDs anyway).
hey yuri. My main point is that this pull request is trying to solve all
problems in one go, and that's an unrealistic expectation and why there has
been nothing merged since last December.
For example, in a normal project we would merge some functionality, then
decorate things like a second header as an iteration on that, driven by use
cases. I would like to elaborate on some of your points, but this is a very
ineffective way to progress the several topics mentioned. For example,
elaborating on your third point would be yet another tangent better
contained to a topic (separate issue) about how to address external IDs.
---
I think we're trying to solve only one problem - interop. In the current form the proposal does not solve the problem of interop any more than already can be achieved with B3 headers, which are already a standard on their own, as you point out. I, too, prefer to close PRs quickly, but clearly there are many tracing systems that won't be able to use fixed-width byte arrays, so instead of unifying we'll be fragmenting the space even more. Do you think what I proposed above creates a serious issue for Dapper-like systems? To me it seems very doable to support that format. |
@yurishkuro I do not think creating a draft and proving it works with fixed length before widening scope is fragmenting. It is simple iteration, and separates problems. Just like someone wanting Elasticsearch in jaeger can follow the issue until that's resolved or otherwise, someone interested in variable length ids can follow that issue as well. It is simply a feature on one hand, but requires work on the other. Meanwhile, jaeger has interop tests for your trace context format. It is unfair to limit good testability to just your project. |
@yurishkuro @SergeyKanzhelev I opened an issue on opaque ids since it seems that's the only way to get one opened. #6 Please make an effort to raise separate issues for separate features. Hijacking this issue for an unbounded scope isn't an effective way to progress work and results in circular discussion. We all have limited amounts of time and doing real work with that time is better than cluttering this PR. |
@yurishkuro for trusting, because I saw multiple mentions of that: I said that the system at the edge may not trust the trace_id, but what we will do is still record them as Links for example (see https://github.com/census-instrumentation/opencensus-java/blob/master/api/src/main/java/io/opencensus/trace/Link.java), so in other words you don't directly use the same trace_id but you record it to keep the relationship/correlation. Also this is just a possible implementation, and for sure an implementation detail.
@yurishkuro we should also keep in mind that we would like to use the same format even when we send the trace context between processes in the same organization where we want to keep the same trace-id, so understanding the format is critical there.
@adriancole I think the idea to move forward with this proposal and start having working examples between different tracing systems/libraries will prove (or not) the current work.
@SergeyKanzhelev + @bhs your main concern is that we do not specify a length for the span-id/trace-id? Are there any other major concerns? If we add that, is it enough for you to use it?
@bhs you did not specify the format that you use in LightStep, is that available publicly?
---
Kinda? LightStep is quite adaptable and can downgrade functionality to accommodate many other trace formats, including weird in-house ones that don't have a public presence. That said, our product roadmap involves protocol changes which are strictly incompatible with the tracecontext proposal. I'm not suggesting that you need to make changes to accommodate LightStep, but since you ask, there's your answer. @adriancole, I understand your desire to merge something. IMO the discussion is sprawling here because the file we're actually reviewing starts off with this:
If this is indeed designed for "various tracing and diagnostics products", it only seems natural for those tracing and diagnostic products to explain where there are problems. Bogdan brought an earlier version of this proposal up at the workshop in Berlin earlier this year, and IIRC there were a lot of semantic mismatches when we got down to brass tacks. I want to be ultra-clear that I am actually quite enthusiastic about this work, but I also want to be ultra-clear that I think it would be more successful and adopt-able if it was one notch less opinionated about things that don't actually matter right now in practice (again, like bit widths).
Variable-length ids is one thing (btw, there's no need to specify a length IMO... it'd be fine in my view to just delimit fields and restrict ids to alphanumeric strings). Other problems are:
If you're asking me (you're not, I know :), I would be excited about a format that has the version-ish stuff exactly as you've proposed it, alphanumeric ids for trace and span ids separated by |
I think it would be more successful and adopt-able if it was one notch
less opinionated about things that don't actually matter right now in
practice (again, like bit widths).
I think many of us are a victim of your success, Ben. The detail of trace
ID contents matters precisely right now, as it simplifies the bootstrapping
of this project. Most dapper based systems do not store IDs as alphanumeric
strings, rather 64 or 128bits. These very systems exist today, are often
internal and do trust ids passed to them. If we open this up to arbitrary
text on day one, it guarantees the initial work will be less effective or
more time consuming or both.
With regards to success, I'm not worried about it as it is defined now, as
long as we can actually start.
---
Yes, agreed, but the storing of the IDs and the transmission of the IDs are separate concerns. We are already talking about sending ids as hexadecimal ASCII, which is (obviously) not the in-memory representation in most tracing systems. I feel like we're arguing in circles about this... as @bogdandrutu mentioned, cloud-managed services would most likely just record the inbound ids and not even use them "natively" for tracing, so the length restriction matters even less than previously argued AFAICT.
---
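The storage-vs-transmission distinction in code, assuming 128-bit ids hex-encoded on the wire; `new_trace_id` is a hypothetical helper:

```python
import os

def new_trace_id():
    raw = os.urandom(16)   # in-memory representation: 16 bytes / 128 bits
    wire = raw.hex()       # wire representation: 32 lowercase hex chars
    return raw, wire
```

The hex wire form is lossless, so a tracer can round-trip between its native bytes and the transmitted string without any correlation loss.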
> ## Field value
>
> `base16(<version>)-<version_format>)`
Is this meant to be `base16(<version>)-base16(<version_format>)`, or `base16(<version>)-<version_format>`?
I am going to merge this pull request and address open issues in separate PRs.
---
Add the first spec for the trace-context http header format.