-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
stop,log: tracing improvements #70370
Conversation
Going to leave the review to @abarganier but feel free to loop me into specific parts of the review where I can be helpful. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The tracing changes overall lgtm, just a couple small comments, but I'd also like to hear from someone better versed in KV about the changes to the Stopper usage.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @abarganier and @andreimatei)
pkg/util/stop/stopper.go, line 161 at r3 (raw file):
tracer *tracing.Tracer // tracer used to create spans for tasks
Does this need to be protected by a mutex if we're going to expose methods like SetTracer()
?
pkg/util/tracing/tracer.go, line 293 at r3 (raw file):
Quoted 12 lines of code…
type registry struct { mu struct { syncutil.Mutex spansPerTrace map[uint64]traceInfo } } type traceInfo struct { rootName string remoteRoot bool spans int }
nit: can we add some comments about the purpose of this registry?
a744d12
to
5e32c73
Compare
First two commits are #70850. I've reworked much of this; decided to go a different way. I've reduced the control one has at the point where it creates a child span, in favor of affording a new option at the point where a parent span is created. I've introduced the concept of a span that cannot have any children, to model a long-running task. Long-running operations getting children leads to too many problems. |
5e32c73
to
7985c25
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @aayushshah15, @abarganier, and @andreimatei)
pkg/kv/kvserver/replica_range_lease.go, line 359 at r15 (raw file):
// acquisition's trace in all of them, but let's at least include it in // the request that triggered it. SpanOpt: stop.ChildSpan,
This isn't necessarily related to this patch, but could you explain why this isn't a FollowsFromSpan
relationship instead?
Your comment above the FollowsFromSpan
SpanOption
says the following:
// Use this when the caller will not wait for the task to finish, but a
// relationship between the caller and the task might still be useful to
// visualize in a trace collector. Generally avoid using this for tasks
pkg/util/tracing/span.go, line 270 at r14 (raw file):
// Note that this field is unlike all the other in that it doesn't make it // across the wire through a carrier. As can be seen in // Tracer.InjectMetaInto(carrier), is sterile is set, then we don't propagate
nit: s/is sterile/if sterile
7985c25
to
134c616
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've verified with BenchmarkTracing
that this patch does not regress performance. In fact, it seems to marginally help it.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @aayushshah15 and @abarganier)
pkg/kv/kvserver/replica_range_lease.go, line 359 at r15 (raw file):
Previously, aayushshah15 (Aayush Shah) wrote…
This isn't necessarily related to this patch, but could you explain why this isn't a
FollowsFromSpan
relationship instead?Your comment above the
FollowsFromSpan
SpanOption
says the following:// Use this when the caller will not wait for the task to finish, but a // relationship between the caller and the task might still be useful to // visualize in a trace collector. Generally avoid using this for tasks
FollowsFromSpan
is not appropriate here because, even though the parent doesn't necessarily wait for the child, it does wait for the child in the normal path. And if you're looking at a trace for the parent, you want to see the lease acquisition in it. That's what the comment above is trying to say.
pkg/util/stop/stopper.go, line 161 at r3 (raw file):
Previously, abarganier (Alex Barganier) wrote…
tracer *tracing.Tracer // tracer used to create spans for tasks
Does this need to be protected by a mutex if we're going to expose methods like
SetTracer()
?
Well SetTracer
is documented as not allowing calls concurrent with starting tasks.
The tracer is supposed to be passed before the stopper starts being generally used. I have tried making WithTracer
the only way to supply it but, alas, I've failed because of the intricate way in which our Server
is built up; the Stopper
is instantiated by the cli
layer, etc. I've tried to untangle the spaghetti, but failed.
pkg/util/tracing/span.go, line 270 at r14 (raw file):
Previously, aayushshah15 (Aayush Shah) wrote…
nit:
s/is sterile/if sterile
done
pkg/util/tracing/tracer.go, line 293 at r3 (raw file):
Previously, abarganier (Alex Barganier) wrote…
type registry struct { mu struct { syncutil.Mutex spansPerTrace map[uint64]traceInfo } } type traceInfo struct { rootName string remoteRoot bool spans int }
nit: can we add some comments about the purpose of this registry?
This was unused debugging code. Deleted, thanks.
This patch introduces a new options for tracing spans: a sterile span does not admit child spans. Trying to create a child of a sterile span results in a root span. In other words, in: parent := t.StartSpan(WithSterile()) child1 := t.StartSpan(WithParentAndAutoCollection(parent)) child2 := t.StartSpan(WithParentAndManualCollection(parent.Meta())) both child1 and child2 are root spans. This is useful for modelling long-running operations that don't want a relationship with all the sub-operations they spawn (think long-running stopper tasks or pgwire connections). In particular, these long operations don't want to accumulate the recordings of their children, and they more generally don't want to be associated with their children in a trace collector because that leads to infinitely long traces which don't display well. For example, Zipkin doesn't make a distinction between our ChildOf and FollowsFrom relationships, displaying both types of relationships the same way. This makes long-running tasks with children quite awkward. Given the tools at our disposal, maintaining the relationship between random operations and their long-running parents is more trouble than it's worth. The modelling of different kinds of relationships between spans is an active topic of debate in the OpenTelemetry project, so in the future I expect that we'll take note of what has been decided there and apply it to our tracing. Without this new option, it is difficult to decouple a span from what would otherwise be its parent. I've explored different options that push the responsibility of detaching to the point where the would-be child is created, but the practicality sucks. Sometimes the code creating a span doesn't know what context it's running in, and so it can't really make a decision about whether it wants its span to be a child or not (i.e. sometimes it wants a child, sometimes it doesn't). Instead, this patch let's the parent brutally say "I never want any children", and then the sub-operations can use functions like EnsureChildSpan() to get a child span or a root span depending on the decision the parent made for them. The next commits start using this new option for long-running stopper tasks and such. Release note: None
Before this patch, RunAsyncTask would not create a span for the task is the caller didn't have a span. If the caller did have a span, the method could create either a child or a follows-from span. This patch makes RunAsyncTask create a root span by default when the caller doesn't have a span. This seems like the right thing to do; async tasks should generally have spans. Then, the patch adds a new options about the kind of span to be created: a sterile span (as introduced in the previous commit). The sterile option results in the task having a sterile root span. The sterile option is used for some long-running tasks started under the server's startup span. Having a relationship between these tasks and the server's statup span didn't serve any purpose, and made the startup span look very confusing in Zipkin (Zipkin displays the follows-from spans as regular children) Similarly, having a relationship between the children of these long-running tasks and the long-running task was also a mostly bad thing. Release note: None
This patch makes the purgatories' spans be sterile, just like the regular queue task already was. This means that the span representing the processing of individual ranges in the purgatory are no longer children of the long-running purgatory operation. Also, this patch clean up the creation of spans for the processing of a replica. Before the span creation was awkward because it was using the queue's AmbientCtx after a lower-level AmbientCtx (the replica's) had been used to annotate the ctx. Release note: None
134c616
to
e0e83ca
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bors r+
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @aayushshah15 and @abarganier)
Build succeeded: |
Improvements around making the traces of long-running background processes look better. See individual commits.