
Address #337 #344

Closed
wants to merge 3 commits into from

Conversation

SergeyKanzhelev
Member

@SergeyKanzhelev SergeyKanzhelev commented Oct 22, 2019

Fixes #337

The intent of the paragraph was to describe the ID generation. It fails to mention that the whole trace-id MUST be propagated.

@bogdandrutu
Contributor

  1. Can we please clarify what use cases we are trying to solve?
  2. Do we want to support a 64-bit trace-id as a valid implementation, or is this just a transition?
  3. If this is a transition problem, then having a deterministic way to convert to/from may help with having independent traces.

My 2 cents:

  1. If the trace-id is generated by a 64-bit system, then we should fill with 0 when propagating;
  2. If the trace-id is received by a 64-bit system, then instead of dropping the last 64 bits (which will lead to different traces), the 64-bit system may be better off generating a new 64-bit trace-id and using some kind of linkage if it cannot pass a 128-bit trace-id. I would actually strongly encourage saying that, for a system to be W3C compliant, it is required to carry a 128-bit trace-id.

@codefromthecrypt

@bogdandrutu You know that mixed instrumentation exists, and you have a dog in the race of instrumentation. Most people on this spec are too biased, because literally the same people here are developing the new instrumentation they want to be used: open-telemetry.

Step away from this bias, and realize that even if you wanted the world to change to your tools, you need to provide a path to that.

There is instrumentation today, and there has been for at least 3 years, that is in different places; some may never change, but some may. Allowing them to upgrade is better than a "big bang" where you pretend entire environments can change immediately. The downgrade works with backfilling zeros.

For example, a proxy can rewrite old B3 into W3C trace context mechanically by backfilling zeros, the same way we used to upgrade people into the Google Stackdriver format back in 2016. It is really dangerous to have zero desire to work with existing systems when designing a W3C spec. Please be reasonable and don't break what already works.
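The mechanical rewrite described above can be sketched in a few lines. This is an illustrative sketch only; the function name and the sampled-flag handling are my own assumptions, not part of the spec:

```python
def b3_to_traceparent(b3_trace_id: str, b3_span_id: str, sampled: bool) -> str:
    """Hypothetical proxy helper: rewrite 64-bit B3 IDs into a W3C
    traceparent header by backfilling zeros on the LEFT of the trace-id."""
    trace_id = b3_trace_id.lower().zfill(32)    # 64-bit id -> 128-bit, left zero padded
    parent_id = b3_span_id.lower().zfill(16)
    flags = "01" if sampled else "00"           # only the sampled flag is carried over
    return f"00-{trace_id}-{parent_id}-{flags}"
```

For a 64-bit B3 trace-id `23ce929d0e0e4736`, this yields a traceparent whose trace-id field is `000000000000000023ce929d0e0e4736`.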

@tylerbenson

I would like to voice support for zero backfill as opposed to random backfill, as I believe it makes it more intuitive to understand what is going on. It is also far more compatible with 64-bit systems. I don't really understand the downside.

@SergeyKanzhelev
Member Author

Random backfill was added to make sure that systems don't take a dependency on the zeroes and make a best effort to propagate the entire thing even if they only use half of the trace-id for data recording. The spec doesn't forbid zero backfill, but it asks implementations to propagate the entire trace-id.

@codefromthecrypt

I think what's misunderstood here is what the "system" is.

Random backfill is intentionally breaking, which would be surprising to have as a goal in a spec whose purpose is interop.

The propagation system is a collection of instrumentation with different capacities. 64-bit is too prevalent to either ignore or to intentionally break.

The indexing, re-fitting, etc. capacity of the data pipeline is out of scope for this spec, as that is about data not defined here. It is relevant to note that there are systems that are able to conditionally handle things in a mixed mode by looking at the 64-bit part. You should know that this doesn't mean they ignore the high bits. What literally happens is that traces are grouped by 64 bits, then re-qualified by the upper bits. When the upper bits are zero (unknown), the traces are kept in a generic grouping. The "backfill random" aspect of this spec wrecks this qualification and correction heuristic. If it was added to intentionally break systems in use today, that seems reckless.

@SergeyKanzhelev
Member Author

I understand the problem of changing the backend. The spec suggests that you may look at only a subset of the bytes of a trace-id, but you must propagate the remaining bytes through the process; there is no need to save them. It is written under the assumption that tracestate must also be propagated, so there is already a mechanism to carry a significant number of bytes from incoming to outgoing calls. Adding a few bytes of trace-id shouldn't be a problem then.

If a system assumes that it will never be called by anything that sets the full trace-id with random bytes, and will always be called from zero-backfill systems, then zero backfill can be used; it is not forbidden by the spec.

Let me know if I'm missing something.

@codefromthecrypt

What led someone to tell me about this issue was bogdan's comment here:

If the trace-id is received by a 64-bit system, then instead of dropping the last 64 bits (which will lead to different traces), the 64-bit system may be better off generating a new 64-bit trace-id and using some kind of linkage if it cannot pass a 128-bit trace-id. I would actually strongly encourage saying that, for a system to be W3C compliant, it is required to carry a 128-bit trace-id.

The idea of saying that nothing can use tracecontext unless every individual part of that system is 128-bit is a big problem. There's a continued misunderstanding that the "propagation system" is homogeneous; it certainly is not, nor is it likely to be for many sites. Implying that you have to change all instrumentation to use tracecontext is severe, and it does more to help force people into open telemetry than to solve the goal of the propagation header.

The other thing, and more to the point here, is that the text is very confused and misguided. For example, it says a 64-bit system should add random data to make believe it is not 64-bit. This hints at a control that doesn't exist. Also, the example is backwards: 23ce929d0e0e47360000000000000000 should be 000000000000000023ce929d0e0e4736. It is simpler to discuss what happens when you receive data, and that there are cases when you convert into or out of 64-bit trace-ID carriers. No need to dance around it; just mention it, and please do not suggest anything strange. Just backfill left zeroes when converting, and know that this may happen when parsing.
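To make the conversion directions concrete, here is a minimal sketch (function names are mine, not from the spec) of converting into and out of a 64-bit trace-ID carrier with left zero backfill:

```python
def upgrade_trace_id(short_id: str) -> str:
    """64-bit -> 128-bit: backfill zeros on the LEFT, so 23ce929d0e0e4736
    becomes 000000000000000023ce929d0e0e4736 (not the other way around)."""
    return short_id.zfill(32)

def downgrade_trace_id(full_id: str) -> str:
    """128-bit -> 64-bit carrier: keep the right-most 16 hex characters."""
    return full_id[-16:]
```

Round-tripping a zero-padded id is lossless; downgrading a fully random 128-bit id loses the high bits, which is exactly the case a parser should expect.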

@jcchavezs

As a user, I think backfilling the trace ID with random values is counterintuitive and confusing for people debugging issues in production.

More so, backfilling from the right side instead of the left side is not only counterintuitive but unnatural, given that we are talking about numbers (in a hex representation): when a number needs to be padded to a certain length, we add zeros on the left because they do not impact the actual value. I believe (again, as a user) that standards should be neither counterintuitive nor divorced from reality.

@yurishkuro
Member

I think the requirement to "always propagate" leaves a lot of legacy systems in the dust, even though they could have been partially compatible with left 0-padding. I could even wrap a legacy app that uses the 64-bit Zipkin or Jaeger header in a sidecar and automatically rewrite those headers into a W3C traceparent with an empty tracestate.

Left 0-padding gives a clear signal that the transaction passed through a legacy layer; the randomization requirement makes that impossible. I think the randomization requirement does more harm than good (the good part isn't even clear to me).
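The "clear signal" that zero padding provides is trivially checkable; a sketch (the helper name is hypothetical):

```python
def passed_through_legacy_layer(trace_id: str) -> bool:
    """With zero padding, a 128-bit trace-id whose high 64 bits are all
    zero signals that the transaction passed through a 64-bit legacy layer.
    Random backfill destroys this signal."""
    return trace_id[:16] == "0" * 16
```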

@yurishkuro
Member

I would even add that if a modern 128-bit-capable system generates a new trace id, it MUST ensure that the first 64 bits are not zeros.
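That MUST could be implemented as a simple retry loop in the generator; a sketch, assuming Python's secrets module:

```python
import secrets

def generate_trace_id() -> str:
    """Generate a 128-bit trace-id whose first 64 bits are guaranteed
    non-zero, so it can never be confused with a zero-padded 64-bit id."""
    while True:
        high = secrets.randbits(64)
        if high != 0:  # astronomically rare, but retry to keep the guarantee
            return f"{high:016x}{secrets.randbits(64):016x}"
```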

@SergeyKanzhelev
Member Author

We are balancing many requirements and striving to encourage interoperability. I've added two more sections to the rationale doc and expanded the paragraph. Please comment; it may be easier to leave comments on the document than in the PR description.

@SergeyKanzhelev
Member Author

This paragraph came from a discussion with New Relic. @victorNewRelic can you please comment on your experience of a transition?

@codefromthecrypt

codefromthecrypt commented Oct 24, 2019 via email

When addressing interoperability with these systems requirements the following
is taken into consideration:

1. The main objective of the specification is to promote interoperability of


If these points are true, then shouldn't the recommendation be to backfill with zeros, not another random 64 bits? Wouldn't that promote better interoperability with brown field ecosystems that are currently on 64 bit, with the way forward being that those systems would organically move to 128 bit as they can? Currently, by recommending backfilling with random 64 bits, the spec is encouraging breakage (and it's just surprising/confusing/unexpected to see backfilling with randomness - I would expect something deterministic, and the brown field has chosen the deterministic path of backfilling with zeros).

Member Author


Further in the doc there is a note that backfilling with random numbers encourages you to test that those numbers will be propagated. And a 64-bit system will not break a 128-bit system passing a trace through it. The spec is encouraging interoperability of different systems, which don't know about each other.


encourage you to test

This doesn't seem like a good enough reason to break compatibility with some 64-bit systems. The spec can encourage you to test by stating so in the spec. I'm not sure I can see many spec implementors who would only test because the random numbers encouraged testing, and would fail to test if it was simply stated in the spec.

`trace-id` generation it SHOULD fill-in the extra bytes with random values
rather than zeroes. Let's say the system works with an 8-byte `trace-id` like
`23ce929d0e0e4736`. Instead of setting `trace-id` value to
`23ce929d0e0e47360000000000000000` (note, that random part is kept on the left


As mentioned by others, this example is backwards and doesn't match the comment in parenthesis.

Member Author


The random part is kept to the left. Why is it not matching?


@nicmunroe nicmunroe Oct 28, 2019


I think the big problem here is the words "random part" can be interpreted multiple ways. The text says that the "random part is kept on the left" but I think it means to say "the original 8-byte trace ID is kept on the left". Yes/no? This section is discussing backfilling with random values vs. zeros, so when you say "random part is kept on the left" I'm reading that as the random backfilling on the left. But you're using the original 8-byte trace ID on the left, not random-backfill values, and the right side is all zeros, which nobody here is arguing should be done.

And looking at it from the zeros point of view: why is the 0000000000000000 on the right here? The text says "Instead of setting trace-id value to
23ce929d0e0e47360000000000000000". Who would ever set it to 23ce929d0e0e47360000000000000000? If it was zero-backfilled wouldn't it be 000000000000000023ce929d0e0e4736? i.e. the backfilled zeros would be on the left.


Specification uses "SHOULD language" when asking to fill up extra bytes with
random numbers instead of zeros. Typically, tracing systems that will not
propagate extra bytes of incoming `trace-id` will not follow this ask and will


Maybe, but an upstream system that follows the SHOULD advice to fill up the left side with random 64 bits will break a downstream that is using brown field 64 bit tracers. Wouldn't it be better to have the SHOULD advice recommend filling with zeros to maximize default interoperability?

Member Author


Systems that fail to propagate extra bytes of trace-id are already breaking a couple of MUSTs of the spec =). This comment is simply saying that IF a system uses 64 bits, has no way to propagate the extra bits unchanged, but still wants to follow the spec, while being pretty confident that all components are controlled by this system, then zero backfill is what this system will implement.


When tracing system has this capability, specification suggests that extra bytes
are fill out with random numbers instead of zeros. This requirement helps to
validate that tracing systems implements propagation from incoming request to


What use is this validation? Isn't it better to have de-facto interoperability by default? Why is this validation more beneficial and outweigh default interoperability with existing brown field systems?

Member Author


I think the key here is that the spec is addressing interoperability of different tracing systems. So if a 64-bit system receives a 128-bit trace-id, it must make a best effort to propagate the entire 128 bits. The best way to ensure that this is implemented is to ask to fill up the extra bytes with random numbers.


No, actual users are telling you that the best way to ensure that is to backfill with zeros.


Specification explains the "left padding" requirement for the random `trace-id`
generation. Tracing systems will implement various algorithms that use
`trace-id` as a base to make a sampling decision. Typically, it will be some


Ahhhh ... is this the main reason for recommending random 64 bits for backfill? I can understand how this would be beneficial for such sampling algorithms, but I would suggest that interoperability-by-default with existing brown field systems is a much better tradeoff.

(Also, how is "typically" determined here? How common is this really? How many major large-userbase tracers have this functionality? How many users actually use it? I'd be surprised if the userbase using this kind of sampling algorithm is (1) really big enough to warrant breaking default interoperability with 64 bit systems, and (2) would mostly use the leftmost bytes such that they would actually be negatively affected by backfilling with zeros.)

Member


It seems it's the only reason, and it doesn't hold water for me. Someone mentioned elsewhere that some systems may use UUID algorithms for generating trace IDs, and the left-most bytes could be composed of a timestamp - not random at all.

If systems want to avoid RNG by reusing randomness in the trace ID, an easier way to do that is XOR the left and right 8 bytes of the full 16-bytes ID and stop caring about which side of it contains more randomness.
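The XOR folding suggested above could look like this (a sketch; the rate-based comparison is one common way to turn an id into a sampling decision, not something the spec mandates):

```python
def sampling_word(trace_id: str) -> int:
    """Fold the 128-bit id by XOR-ing its left and right 8-byte halves,
    so the decision no longer depends on which half carries the randomness."""
    return int(trace_id[:16], 16) ^ int(trace_id[16:], 16)

def should_sample(trace_id: str, rate: float) -> bool:
    """Sample when the folded value falls below rate * 2^64."""
    return sampling_word(trace_id) < rate * 2**64
```

A zero-padded id folds to its 64-bit half, so legacy ids still sample consistently.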

Member Author


Ahhhh ... is this the main reason for recommending

we need to keep this rationale file up to date. It really helps I think.

Member Author


As for the recommendation: we had this issue in Microsoft with the "bad ordering" of randomness, and I saw a few implementations which only look at 4 bytes to calculate sampling priority. However, I personally don't feel strongly about this clause, exactly because of these reasons: it is hard to enforce randomness.

Specification explains the "left padding" requirement for the random `trace-id`
generation. Tracing systems will implement various algorithms that use
`trace-id` as a base to make a sampling decision. Typically, it will be some
variation of hash calculation. Those algorithms may be giving different "weight"


Breaking compatibility by default for a "may be" seems wrong. How common is this really?

`trace-id` as a base to make a sampling decision. Typically, it will be some
variation of hash calculation. Those algorithms may be giving different "weight"
to different bytes of a `trace-id`. So the requirement to keep randomness to the
left helps interoperability between tracing systems by suggesting which bytes


It appears that this is prioritizing a suggestion of interoperability with a much smaller number of systems over the definite breakage with a large volume of brown field systems.

(I'm assuming numbers on hash-based-left-bytes sampling algorithm are small compared to brown field 64 bit systems. I'm open to being proven wrong if anyone has solid numbers - I could just be out of the loop.)

Member Author


Do you mean that many systems use a trace-ID where the right-most part is random, not the left-most? Or are we still talking about zero backfill? If backfill, why not backfill the other bytes?


Zero-backfill.

Because of all the reasons multiple users have stated in this thread. They are telling you that backfilling with zeros will provide better compatibility and interoperability than backfilling with randomness.

left helps interoperability between tracing systems by suggesting which bytes
carry a bigger weight in hash calculation.

Practically, there will be tracing systems filling first bytes with zeros (see


I would much rather see the specification reverse expectations here. IMO systems SHOULD backfill with zeros, and there could be a note that since it's only a SHOULD and not a requirement, then systems could choose to do the random 64 bit backfill if they feel they need to due to a hash-based-left-bytes sampling algorithm.

In other words, default to maximum interoperability, and point out where there might be valid use cases for doing something else. As it stands, this feels backwards.

The reality is that implementors will see SHOULD, and the language around 64 bit trace IDs being "violations", and they will blindly backfill with random 64 bits. As a result we'll see breakage by default instead of interoperability by default.

Member Author


I still don't follow what system will break, and why the randomness requirement will break it. It feels almost like you are suggesting all systems only fill up 64 bits.

In the doc there are three types of systems described. First, one which simply ignores the extra bytes. These systems will definitely be better off filling up with zeros, and they can; they are already breaking a couple of MUSTs of this spec. Skipping the second. Third, one that only records 64 bits but respects the others and propagates all 128. For these tracing systems, why not ask to fill up the extra bytes with random numbers? What difference does it make? It will still do it when some external 128-bit system calls into it and the extra bytes need to be propagated.


why not ask to fill up extra bytes with random numbers?

Try looking at this from the other direction - why not fill up with zeros? Multiple people are telling you that backfilling with zeros will allow for greater compatibility with 64 bit systems.

@nicmunroe

I added review comments on the diffs as requested. I agree with the arguments of others in this PR that backfilling with zeros is both more compatible by default with existing brown field systems and more intuitive.

I've seen other arguments here that essentially say backfilling with randomness isn't a big deal, because it's only a suggestion and not a spec requirement. That's not cover for unnecessary interoperability breakage by default. One of my review comments talks directly to this:

The reality is that implementors will see SHOULD, and the language around 64 bit trace IDs being "violations", and they will blindly backfill with random 64 bits. As a result we'll see breakage by default instead of interoperability by default.

If this spec is truly about maximizing interoperability, I don't see how we could pick random-backfilling instead of zeros-backfilling.

@tedsuo

tedsuo commented Oct 25, 2019

Sorry for the late addition. We are actively implementing this, and I would prefer zeros as well.

The reason has to do with split traces. If Service A calls two services, B and C, with a 64-bit trace, and both B and C each backfill with different random bits, then the trace is now split, as I now have two separate trace IDs. So the trace has become split.

Now, consider that the analysis system consuming this data accepts 128-bit IDs. When the analysis system receives 64-bit IDs directly from A, it must likewise pad the IDs with zeros. I do not see another choice, when some services only report 64 bits, and others report 128.

This is a serious issue for us. However, if there is a clear definition in the spec for how to backfill with zeros, this solves these split-trace problems very cleanly. Unfortunately, I see no solution where every service pads with random bits; currently I am watching it create a host of issues as I try to implement it.
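The split-trace scenario above can be reproduced in a few lines; a sketch (function names are mine) contrasting random vs. zero backfill at each egress:

```python
import secrets

def egress_id_random(short_id: str) -> str:
    """Each hop invents DIFFERENT random high bits: B and C now emit
    two unrelated 128-bit trace-ids, i.e. the trace is split."""
    return f"{secrets.randbits(64):016x}{short_id}"

def egress_id_zero(short_id: str) -> str:
    """Zero backfill is deterministic: every hop derives the SAME
    128-bit id, so the trace stays in one piece."""
    return short_id.zfill(32)

incoming = "23ce929d0e0e4736"          # 64-bit id received from service A
b, c = egress_id_random(incoming), egress_id_random(incoming)
# b != c with overwhelming probability: two trace IDs for one transaction,
# whereas zero backfill always agrees:
assert egress_id_zero(incoming) == egress_id_zero(incoming)
```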

@reyang
Contributor

reyang commented Oct 25, 2019

The reason has to do with split traces. If Service A calls two services, B and C, with a 64-bit trace, and both B and C each backfill with different random bits, then the trace is now split, as I now have two separate trace IDs. So the trace has become split.

If B and C could both update tracestate by clarifying that the original 64-bit trace has been backfilled (so the backend can use this additional hint to avoid interpreting this as a split), would this solve the problem?

@tedsuo

tedsuo commented Oct 25, 2019

@reyang unfortunately no, as information placed in tracestate is not reported by any system we interoperate with. Also, there are scenarios where we must switch headers - to propagate to a service configured with B3, for example - and thus there is no tracestate.

@SergeyKanzhelev
Member Author

@nicmunroe hi! Glad to see you. Thank you for review!

@SergeyKanzhelev
Member Author

@tedsuo I think you are missing a point. First, behavior when a system receives a trace is not specified; random backfill is only for a new trace-id. Second, it is recommended that B and C propagate the extra bytes. If you are talking about any of the opentracing implementations, there is probably some sort of baggage mechanism, so propagating an extra 64 bits shouldn't be a major problem.

Please let me know what you think

@SergeyKanzhelev
Member Author

@adriancole

new relic's tracing offering is one data point, but as far as I know it is
a relatively new product and closed source. I still don't understand why
years of existing OSS practice should be ignored over this.

Absolutely! And I'm glad to see this discussion here. I only want them to share their experience of migrating to 128 bits as this entire paragraph was added after discussion of NewRelic's implementation of trace context.

@danielkhan
Contributor

danielkhan commented Oct 26, 2019

Summary

  1. Most people here would prefer zero backfilling instead of random backfilling.
  2. The main reason it is in the spec is that it makes it more explicit that the id should be propagated as 128 bits, avoiding a shortcut where systems just start sticking to 64 bits and ignoring the rest.
  3. There might be more use cases that would justify random backfill, but they don't seem to outweigh the reasons to backfill with zeros.

Is this correct so far?
I think this would be a normative change @SergeyKanzhelev?

@nicmunroe

nicmunroe commented Oct 28, 2019

First, behavior when a system receives a trace is not specified. Random backfill is only for a new trace-id.

@SergeyKanzhelev After responding to a few other comments, this really caught my eye, and may be one of the main reasons people in this PR are having trouble hearing each other. I'm not sure you can specify one thing (how to generate the trace ID) without considering the other (how to receive a 128 bit trace when you're a 64 bit system, and then how to propagate it again later).

Why? If a 64 bit system receives a 128 bit trace ID and downgrades it to 64 bit (assuming that's the best it can do), then what is it supposed to do when it goes to propagate it on outbound calls? The spec is very clear about all trace IDs being 128 bit, and this section talks about backfilling with random values. So they'll probably backfill with random values when they propagate downstream. This is really bad, as @tedsuo 's example showed.

It doesn't matter that there's a small side note buried in the text saying "it's only SHOULD, you can backfill with zeros if you want" - implementors who are trying to be good citizens will backfill with random, because it's what the spec recommends. And it doesn't matter that the random backfill stuff is only supposed to be for ID generation - the 64 bit system has to do something to conform to the w3c 128 bit trace ID requirement when propagating, so it's logical that they would follow the section talking about what 64 bit systems are supposed to do with trace IDs.

If, on the other hand, the recommendation was to backfill with zeros both for generation and 64-bit downgrade+propagation, then you'd get some level of interoperability and compatibility by default. You'd get this base level of interoperability even if a trace bounced back and forth through multiple 64 bit and multiple 128 bit systems on a single request.

@SergeyKanzhelev
Member Author

@nicmunroe thank you for the note. I'm also trying to understand how to make sure we are not talking past each other.

Yes, the spec is only talking about new ID generation. In the case of propagation, either fill up with zeroes or try to pass the extra bytes through unchanged if you can. It is definitely not clear from the text, which is why it raises so much confusion. That is how this PR actually started.

As for the interoperability and left padding, the prior-art reference goes to how, for example, getTrace works in Zipkin:

When strict trace ID is disabled, spans with the same right-most 16 characters are returned even if the characters to the left are not.

So asking for left padding of randomness breaks this scenario. You'd want some randomness in the right part of a trace-id so that interoperability with 64-bit systems would work.
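The Zipkin behavior quoted above amounts to grouping by the right-most 16 hex characters; a sketch of that non-strict comparison (the helper name is hypothetical):

```python
def same_trace_lenient(id_a: str, id_b: str) -> bool:
    """Zipkin-style non-strict matching: treat two trace-ids as the same
    trace when their right-most 16 hex characters agree, regardless of
    what (zeros or randomness) occupies the high bytes."""
    return id_a[-16:].lower() == id_b[-16:].lower()
```

This is why padding applied on the left, with the original 64 bits kept on the right, stays compatible with such lookups.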

@SergeyKanzhelev
Member Author

@danielkhan as I see it - in an order of importance there are three questions:

  1. Backfilling on left vs. right
  2. Left padding of randomness.
  3. Zero vs. random backfill

The 1st is not specified in the spec, so we can add it without a normative change to the spec.

We have a strange SHOULD-MUST combination for randomness generation. Reading it, it is unclear what it actually asks for. One option is to nuke it completely from the spec. Again, I'd argue this is not a normative change, as the entire paragraph is confusing.

For item 3, "zero vs. random backfill", on new trace-id generation, I'd suggest keeping randomness. With the first two items addressed, interoperability in terms of the downgrade experience will be addressed, and randomness will improve interoperability in terms of the upgrade experience.

If we all agree with this - there will be (from my perspective) no normative changes and everybody's interests will be addressed.

@yurishkuro
Member

I recommend filing separate issues and holding a discussion / vote there before proposing a PR to change.

For item 3, "zero vs. random backfill", on new trace-id generation, I'd suggest keeping randomness. With the first two items addressed, interoperability in terms of the downgrade experience will be addressed.

I think most people on this thread disagree with this. If a request goes through several 64bit systems, then zero-padding on egress from those systems allows almost the whole request to be associated with a single trace ID. Random padding will break it into many traces, with no way of correlating.

@tedsuo

tedsuo commented Oct 28, 2019

I agree with @yurishkuro. I hardly felt heard in this discussion, and I do not see my concerns being clearly addressed. I would like to see this hashed out more fully.

@SergeyKanzhelev
Member Author

@tedsuo so you are mentioning two issues. First, the split trace. In this example you assume that B and C will fill up the trace-id with zeroes:

The reason has to do with split traces. If Service A calls two services, B and C, with a 64-bit trace, and both B and C each backfill with different random bits, then the trace is now split, as I now have two separate trace IDs. So the trace has become split.

When the spec says:

  1. On new trace-id generation:

When a system operates with a trace-id that is shorter than 16 bytes, on new trace-id generation it SHOULD fill-in the extra bytes with random values rather than zeroes.

  2. On propagation:

In situations, when it is absolutely impossible to propagate the entire trace-id to the downstream components, but only a subset of the original trace-id will be propagated, system SHOULD fill up extra bytes with zeroes.

The second concern is about the analytical system. The rationale doc addresses this concern, saying:

Specification uses "SHOULD language" when asking to fill up extra bytes with
random numbers instead of zeros. Typically, tracing systems that will not
propagate extra bytes of incoming trace-id will not follow this ask and will
fill up extra bytes with zeros.

So this is not really a concern: if a system breaks this spec requirement anyway, it can break one more SHOULD. In fact, the spec is not addressing these types of services at all; it uses MUST language saying that the entire trace-id must be propagated even if you are only recording part of it.

Furthermore, if you look at the same example and assume that A, which started the trace, is a 128-bit system, you still need to build an analytics system that operates on the lower bytes only. So random generation of a new trace-id will help.

What am I missing?
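The two quoted rules, generation vs. propagation, could be sketched side by side like this (the helper names and the can_carry_full_id flag are my own illustration, not spec text):

```python
import secrets

def new_trace_id() -> str:
    """Generation rule: even a system that records only 64 bits SHOULD
    fill the extra bytes of a NEW trace-id with random values."""
    return f"{secrets.randbits(64):016x}{secrets.randbits(64):016x}"

def propagate_trace_id(incoming: str, can_carry_full_id: bool) -> str:
    """Propagation rule: pass the full incoming trace-id through unchanged
    whenever possible; only when that is absolutely impossible, keep the
    recorded 64 bits and backfill the rest with zeroes."""
    if can_carry_full_id:
        return incoming
    return incoming[-16:].zfill(32)
```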

@SergeyKanzhelev
Member Author

@yurishkuro

I recommend filing separate issues and holding a discussion / vote there before proposing a PR to change

Absolutely! This PR simply removes the uncertainty in the current spec w.r.t. how it was written - mostly around the fact that random backfill is only requested for a new trace-id, not for the case when you propagate somebody else's trace-id.

So the way I'd suggest we approach it is by merging this PR and filing these three issues:

  1. Backfilling on left vs. right
  2. Left padding of randomness.
  3. Zero vs. random backfill

My comment was written in response to @danielkhan's summary trying to break down the problem into smaller issues.

@yurishkuro
Member

@SergeyKanzhelev "random backfill is only requested for new trace-id" kind-of doesn't make sense to me as a use case. If a system is truly 64bit, and not upgradeable, it most likely won't have any place to store the other random part of the trace. The backfill only applies on egress from a 64bit system, which is where many are arguing a zero-padding is better than random padding. Note that the padding may be done not by the 64bit system itself, but by a sidecar that rewrites the headers.

@SergeyKanzhelev
Member Author

SergeyKanzhelev commented Oct 28, 2019

@yurishkuro these systems

it most likely won't have any place to store the other random part of the trace.

break a lot of things (like not propagating tracestate) and can backfill with zeros on trace-id generation.

The spec uses MUST language on propagation, with a small note on what to do when it's absolutely impossible.

However, once a system is capable of propagating all 128 bits (even though 64 of these bits may not be recorded), the ask is to do random generation. To downstream 64bit-only systems, these "slightly upgraded" 64bit systems may look like truly 128bit systems that still need to be supported.

Are you suggesting to add another issue for discussion:

  1. The spec must address 64bit systems that don't have a propagation mechanism and remove the MUST language on whole trace-id propagation for these systems.

@SergeyKanzhelev
Member Author

Also, I realized we may need names to differentiate the systems:

  • 128bit system - a system operating with a 128bit trace-id (fully compliant).
  • 64bit system - a system operating with a 64bit trace-id; not capable of propagating tracestate and the 64 extra bits of trace-id.
  • 128on64bit system - a system that records a 64bit trace-id, but is capable of propagating the extra 64 bits of trace-id from the incoming request to the outgoing request.
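For illustration, the difference between a 64bit and a 128on64bit system can be sketched like this (a hypothetical handler; the function name and return shape are assumptions):

```python
import secrets
from typing import Optional, Tuple

def handle_request(incoming_trace_id: Optional[str]) -> Tuple[str, str]:
    """Hypothetical request handling in a "128on64bit" system.

    It records only the lower 64 bits, but propagates the full 128-bit
    trace-id unchanged (the spec's MUST) - unlike a plain 64bit system,
    which would drop the extra bytes on the way through.
    """
    if incoming_trace_id is None:
        # New trace: backfill the unrecorded 64 bits with random values (SHOULD).
        incoming_trace_id = secrets.token_hex(16)
    recorded = incoming_trace_id[16:]   # lower 64 bits - all this system stores
    propagated = incoming_trace_id      # all 128 bits go downstream unchanged
    return recorded, propagated
```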

@SergeyKanzhelev
Member Author

So the spec is mostly talking about 128on64bit systems, and has a small note about 64bit systems:

when it is absolutely impossible to propagate the
entire trace-id to the downstream components, but only a subset of the
original trace-id will be propagated, system SHOULD fill up extra bytes with
zeroes.

@yurishkuro
Member

Spec must address 64 bit systems that don't have propagation mechanism and remove MUST language on the whole trace-id propagation for these systems.

I think so. Most of the feedback on this thread was about "legacy" systems. If a system can be upgraded to support tracestate propagation, it might as well be upgraded to support full 128bit trace-id, I don't see much difference. The concern is that upgrading is not feasible in many cases.

@SergeyKanzhelev
Member Author

I think so. Most of the feedback on this thread was about "legacy" systems. If a system can be upgraded to support tracestate propagation, it might as well be upgraded to support full 128bit trace-id, I don't see much difference. The concern is that upgrading is not feasible in many cases.

Good. We are on the same page here.

@tedsuo

tedsuo commented Oct 28, 2019

The spec must address 64bit systems that don't have a propagation mechanism and remove the MUST language on whole trace-id propagation for these systems.

@SergeyKanzhelev to be clear, this is the scenario I am raising. I believe the confusion relates to whether existing systems had somewhere to put the extra 64 bits. The 64bit, 128bit, 128on64bit distinction helps. Thank you @yurishkuro for making the case for handling legacy 64bit systems.

Furthermore if you look at the same example and assume that A that started trace is 128bit system, you still need to build an analytics system that operated on lower bytes only. So random generation on new trace-id will help.

I admit I don't see the logic here. As long as we are accepting data from 64bit services, our analytics system will have to operate with that amount of randomness in the ID. How would generating new random IDs on the client side be helpful? Not saying that it is unhelpful, just that it is not clear to me.

So the way I'd suggest we approach it is by merging this PR and filing these three issues:

Backfilling on left vs. right
Left padding of randomness.
Zero vs. random backfill

Thank you. Having looked at it from several directions now, I don't believe there is a complete solution for backwards compatibility, except to stick with only 64 bits in the trace id until every 64bit service in the trace is upgraded to, at minimum, 128on64bit mode. Since this kind of compatibility is turning out to be critical for a number of users, I believe it would be helpful for the spec to provide guidance on this issue.

@SergeyKanzhelev
Member Author

I'm closing this PR in favor of this issue: #349 Thank you, everybody, for the valuable feedback! Let's make sure the spec is fixed to address the compatibility scenarios identified in this discussion.

@SergeyKanzhelev SergeyKanzhelev deleted the sergkanz/337 branch August 5, 2021 05:51
Successfully merging this pull request may close these issues.

Revert advice against backfilling of zeros