Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

stop,log: tracing improvements #70370

Merged
merged 3 commits into from
Oct 8, 2021
Merged

Conversation

andreimatei
Copy link
Contributor

Improvements around making the traces of long-running background processes look better. See individual commits.

@andreimatei andreimatei requested a review from a team as a code owner September 17, 2021 17:56
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@andreimatei
Copy link
Contributor Author

cc @aayushshah15

@tbg
Copy link
Member

tbg commented Sep 20, 2021

Going to leave the review to @abarganier but feel free to loop me into specific parts of the review where I can be helpful.

@tbg tbg removed their request for review September 20, 2021 09:24
@abarganier abarganier requested a review from a team September 23, 2021 20:02
Copy link
Contributor

@abarganier abarganier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tracing changes overall lgtm, just a couple small comments, but I'd also like to hear from someone better versed in KV about the changes to the Stopper usage.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @abarganier and @andreimatei)


pkg/util/stop/stopper.go, line 161 at r3 (raw file):

	tracer   *tracing.Tracer   // tracer used to create spans for tasks

Does this need to be protected by a mutex if we're going to expose methods like SetTracer()?


pkg/util/tracing/tracer.go, line 293 at r3 (raw file):

Quoted 12 lines of code…
type registry struct {
	mu struct {
		syncutil.Mutex
		spansPerTrace map[uint64]traceInfo
	}
}

type traceInfo struct {
	rootName   string
	remoteRoot bool
	spans      int
}

nit: can we add some comments about the purpose of this registry?

@andreimatei
Copy link
Contributor Author

First two commits are #70850.

I've reworked much of this; decided to go a different way. I've reduced the control one has at the point where it creates a child span, in favor of affording a new option at the point where a parent span is created. I've introduced the concept of a span that cannot have any children, to model a long-running task. Long-running operations getting children leads to too many problems.
This time, I'd like to kindly ask @aayushshah15 for a review.

Copy link
Contributor

@aayushshah15 aayushshah15 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aayushshah15, @abarganier, and @andreimatei)


pkg/kv/kvserver/replica_range_lease.go, line 359 at r15 (raw file):

			// acquisition's trace in all of them, but let's at least include it in
			// the request that triggered it.
			SpanOpt: stop.ChildSpan,

This isn't necessarily related to this patch, but could you explain why this isn't a FollowsFromSpan relationship instead?

Your comment above the FollowsFromSpan SpanOption says the following:

	// Use this when the caller will not wait for the task to finish, but a
	// relationship between the caller and the task might still be useful to
	// visualize in a trace collector. Generally avoid using this for tasks

pkg/util/tracing/span.go, line 270 at r14 (raw file):

	// Note that this field is unlike all the other in that it doesn't make it
	// across the wire through a carrier. As can be seen in
	// Tracer.InjectMetaInto(carrier), is sterile is set, then we don't propagate

nit: s/is sterile/if sterile

Copy link
Contributor Author

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've verified with BenchmarkTracing that this patch does not regress performance. In fact, it seems to marginally help it.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @aayushshah15 and @abarganier)


pkg/kv/kvserver/replica_range_lease.go, line 359 at r15 (raw file):

Previously, aayushshah15 (Aayush Shah) wrote…

This isn't necessarily related to this patch, but could you explain why this isn't a FollowsFromSpan relationship instead?

Your comment above the FollowsFromSpan SpanOption says the following:

	// Use this when the caller will not wait for the task to finish, but a
	// relationship between the caller and the task might still be useful to
	// visualize in a trace collector. Generally avoid using this for tasks

FollowsFromSpan is not appropriate here because, even though the parent doesn't necessarily wait for the child, it does wait for the child in the normal path. And if you're looking at a trace for the parent, you want to see the lease acquisition in it. That's what the comment above is trying to say.


pkg/util/stop/stopper.go, line 161 at r3 (raw file):

Previously, abarganier (Alex Barganier) wrote…
	tracer   *tracing.Tracer   // tracer used to create spans for tasks

Does this need to be protected by a mutex if we're going to expose methods like SetTracer()?

Well SetTracer is documented as not allowing calls concurrent with starting tasks.
The tracer is supposed to be passed before the stopper starts being generally used. I have tried making WithTracer the only way to supply it but, alas, I've failed because of the intricate way in which our Server is built up; the Stopper is instantiated by the cli layer, etc. I've tried to untangle the spaghetti, but failed.


pkg/util/tracing/span.go, line 270 at r14 (raw file):

Previously, aayushshah15 (Aayush Shah) wrote…

nit: s/is sterile/if sterile

done


pkg/util/tracing/tracer.go, line 293 at r3 (raw file):

Previously, abarganier (Alex Barganier) wrote…
type registry struct {
	mu struct {
		syncutil.Mutex
		spansPerTrace map[uint64]traceInfo
	}
}

type traceInfo struct {
	rootName   string
	remoteRoot bool
	spans      int
}

nit: can we add some comments about the purpose of this registry?

This was unused debugging code. Deleted, thanks.

This patch introduces a new options for tracing spans: a sterile span
does not admit child spans. Trying to create a child of a sterile span
results in a root span. In other words, in:

parent := t.StartSpan(WithSterile())
child1 := t.StartSpan(WithParentAndAutoCollection(parent))
child2 := t.StartSpan(WithParentAndManualCollection(parent.Meta()))

both child1 and child2 are root spans.

This is useful for modelling long-running operations that don't want a
relationship with all the sub-operations they spawn (think long-running
stopper tasks or pgwire connections). In particular, these long
operations don't want to accumulate the recordings of their children,
and they more generally don't want to be associated with their children
in a trace collector because that leads to infinitely long traces which
don't display well. For example, Zipkin doesn't make a distinction
between our ChildOf and FollowsFrom relationships, displaying both types
of relationships the same way. This makes long-running tasks with
children quite awkward.  Given the tools at our disposal, maintaining
the relationship between random operations and their long-running
parents is more trouble than it's worth. The modelling of different
kinds of relationships between spans is an active topic of debate in the
OpenTelemetry project, so in the future I expect that we'll take note of
what has been decided there and apply it to our tracing.

Without this new option, it is difficult to decouple a span from what
would otherwise be its parent. I've explored different options that push
the responsibility of detaching to the point where the would-be child is
created, but the practicality sucks. Sometimes the code creating a span
doesn't know what context it's running in, and so it can't really make a
decision about whether it wants its span to be a child or not (i.e.
sometimes it wants a child, sometimes it doesn't). Instead, this patch
let's the parent brutally say "I never want any children", and then the
sub-operations can use functions like EnsureChildSpan() to get a child
span or a root span depending on the decision the parent made for them.

The next commits start using this new option for long-running stopper
tasks and such.

Release note: None
Before this patch, RunAsyncTask would not create a span for the task is
the caller didn't have a span. If the caller did have a span, the method
could create either a child or a follows-from span.

This patch makes RunAsyncTask create a root span by default when the
caller doesn't have a span. This seems like the right thing to do; async
tasks should generally have spans. Then, the patch adds a new options
about the kind of span to be created: a sterile span (as introduced in
the previous commit). The sterile option results in the task having a
sterile root span. The sterile option is used for some long-running
tasks started under the server's startup span. Having a relationship
between these tasks and the server's statup span didn't serve any
purpose, and made the startup span look very confusing in Zipkin (Zipkin
displays the follows-from spans as regular children) Similarly, having a
relationship between the children of these long-running tasks and the
long-running task was also a mostly bad thing.

Release note: None
This patch makes the purgatories' spans be sterile, just like the
regular queue task already was. This means that the span representing
the processing of individual ranges in the purgatory are no longer
children of the long-running purgatory operation.

Also, this patch clean up the creation of spans for the processing of a
replica. Before the span creation was awkward because it was using the
queue's AmbientCtx after a lower-level AmbientCtx (the replica's) had
been used to annotate the ctx.

Release note: None
Copy link
Contributor Author

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bors r+

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @aayushshah15 and @abarganier)

@craig
Copy link
Contributor

craig bot commented Oct 8, 2021

Build succeeded:

@craig craig bot merged commit a2eedb4 into cockroachdb:master Oct 8, 2021
@andreimatei andreimatei deleted the stop.async-roots branch January 21, 2022 17:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants