tracing: propagate non-recording spans across rpc boundaries #59992
Conversation
Force-pushed from c782a70 to 2f8a54a.
Force-pushed from 2f8a54a to 49ff09c.
Stray "iasd" in the first commit message, typo with "Mult" in second commit.
Reviewed 6 of 6 files at r1, 1 of 1 files at r2, 2 of 2 files at r3, 1 of 1 files at r4.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @angelapwen, @knz, and @tbg)
pkg/sql/logictest/testdata/logic_test/distsql_agg, line 1 at r3 (raw file):
# LogicTest: 5node-default-configs
Why did this change?
pkg/util/tracing/span_test.go, line 187 at r1 (raw file):
    func TestSpan_LogStructured(t *testing.T) {
        tr := NewTracer()
        tr._mode = int32(modeBackground)
Why did this change?
pkg/util/tracing/tracer.go, line 297 at r1 (raw file):
    // a real span they need to get one. In all other cases, a noop span will do.
    if !t.AlwaysTrace() && opts.parentTraceID() == 0 &&
I like this simplification. The semantics around recording being switched on/off and what that means for log messages/structured events still feel a bit confusing to me (I was trying something elsewhere), and this makes it a bit easier to reason about.
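As a hedged illustration of one way to read those semantics (purely hypothetical types, not the actual pkg/util/tracing API): unstructured log messages only survive on verbose spans, while structured events survive on any real recording.

```go
package main

import "fmt"

type recordingMode int

const (
	recordingOff        recordingMode = iota
	recordingBackground               // structured events only (assumed)
	recordingVerbose                  // structured events + log messages (assumed)
)

type span struct {
	mode       recordingMode
	logs       []string
	structured []string
}

// Record keeps an unstructured log message only on verbose spans (assumed rule).
func (s *span) Record(msg string) {
	if s.mode == recordingVerbose {
		s.logs = append(s.logs, msg)
	}
}

// RecordStructured keeps a structured event on any recording span (assumed rule).
func (s *span) RecordStructured(ev string) {
	if s.mode != recordingOff {
		s.structured = append(s.structured, ev)
	}
}

func main() {
	s := &span{mode: recordingBackground}
	s.Record("dropped: span is not verbose")
	s.RecordStructured("kept: structured event")
	fmt.Println(len(s.logs), len(s.structured)) // 0 1
}
```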
TFTR!
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @angelapwen, @irfansharif, and @knz)
pkg/sql/logictest/testdata/logic_test/distsql_agg, line 1 at r3 (raw file):
Previously, irfansharif (irfan sharif) wrote…
Why did this change?
Thanks for catching that, this is accidental, fixed.
pkg/util/tracing/span_test.go, line 187 at r1 (raw file):
Previously, irfansharif (irfan sharif) wrote…
Why did this change?
To validate the changes in GetRecording. Without the fixes to it, this test would have started failing, as we short-circuited GetRecording() to nil.
Force-pushed from 49ff09c to 2e713e2.
Oh wonderful, I love the smell of mixed version problems in the afternoon. This will take a bit longer to get in, I suppose; it is crashing the old version, not the new.
Wow, I'm surprised we haven't seen more crashes here (maybe the crash here isn't even this PR; maybe it's only crashing now because I just added a step to turn tracing on in this test). The code on 20.2.4 is pretty sensitive to crashes because it has these assertions that make no sense, in the methods called from this area: cockroach/pkg/kv/kvclient/kvcoord/transport.go, lines 160 to 179 in 9c7ef62.
Concretely, here's what we see:
I assume that the way this particular crash happens is this:
But there could also have been a different scenario (pre this PR), though not one that you could hit in practice, since we don't use spans in that way at that part of the stack:
Now in 20.2.4, if you don't mess with the global tracing (which you shouldn't anyway, as it might well OOM you), you don't see any of this because nobody ever sends any recordings back. But with this PR, since SQL on the new nodes will always propagate a real trace, oof, we're exercising this code path all of the time.

At this point, and seeing how we've made more changes to the tracer than ever before, including moving off the OT interfaces, I think we'll play it safe - I'll add a cluster version and make sure that new nodes don't create any spans, period (unless the verbose cluster setting is set, in which case, good luck). I'll also give the code around the interfaces a do-over to make sure they are not exceedingly... assertive. Something that should happen in that context is to gracefully handle importing a recording for trace B into trace A. This should silently ignore the mismatched data. Of course we don't expect that to happen (after this PR), but resilience here will pay off and is cheap.
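To make the "silently ignore the mismatched data" idea concrete, here is a minimal sketch with hypothetical stand-in types (not the real Span/tracingpb.RecordedSpan plumbing):

```go
package main

import "fmt"

// recordedSpan stands in for tracingpb.RecordedSpan; only TraceID matters here.
type recordedSpan struct {
	TraceID uint64
	Op      string
}

type span struct {
	traceID uint64
	remote  []recordedSpan
}

// importRemoteSpans keeps spans belonging to this trace and silently drops the
// rest, instead of asserting (which is what made the 20.2.4 code path fragile).
func (s *span) importRemoteSpans(spans []recordedSpan) {
	for _, rs := range spans {
		if rs.TraceID != s.traceID {
			continue // foreign trace: ignore rather than crash
		}
		s.remote = append(s.remote, rs)
	}
}

func main() {
	s := &span{traceID: 1}
	s.importRemoteSpans([]recordedSpan{
		{TraceID: 1, Op: "kept"},
		{TraceID: 2, Op: "dropped"},
	})
	fmt.Println(len(s.remote)) // 1
}
```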
With only this commit,
Force-pushed from f9f2103 to 849643a.
Regarding the first commit: I am not convinced by the unit test -- if we want to check that traces get collected across RPC boundaries, I'd expect a test that sets up an RPC server, sends an actual RPC over a network connection, and asserts that the expected trace events are communicated.
In the 6th commit I dislike the approach you've taken, with more explanations about my view below.
(In contrast the 7th commit is super elegant and it was a relief to see that the way forward wasn't too painful)
The rest LGTM
Reviewed 4 of 10 files at r5, 2 of 2 files at r6, 1 of 1 files at r7, 1 of 1 files at r8, 1 of 1 files at r9, 2 of 2 files at r10, 10 of 10 files at r11, 8 of 8 files at r12, 1 of 1 files at r13.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @angelapwen, @irfansharif, and @tbg)
pkg/sql/crdb_internal.go, line 1181 at r11 (raw file):
    var numStructured int
    rec.Structured(func(any *ptypes.Any) {
Ouch. I thought the entire point of this vtable, which leaves out the details of traces, was to avoid iterating over all events while populating the table, so that the SQL client can apply a selection before we pay the cost of looking into a recording.
Can you not have a new built-in function which takes a trace/span ID as an argument and returns a JSON blob with its properties, including whether it contains any structured event as a boolean (and make the rec.Structured logic stop the iteration as soon as 1 is encountered)?
For context, I think you are under-estimating the run-time and memory costs of crdb_internal tables generally and how much (ab)use they take by end-users and our support staff. This stuff needs to be programmed defensively, with adequate explanatory comments about the costs in the table's comment.
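A hedged sketch of the early-exit idea (hypothetical helper and types, not the actual Recording API): instead of counting every structured payload while populating the vtable, stop at the first one and expose a boolean.

```go
package main

import "fmt"

// event stands in for *ptypes.Any; the payload contents don't matter here.
type event struct{ name string }

type recording struct{ events []event }

// structured mimics an iteration callback over structured events; returning
// false from the visitor stops the walk (assumed contract).
func (r recording) structured(visit func(event) bool) {
	for _, ev := range r.events {
		if !visit(ev) {
			return
		}
	}
}

// hasStructured stops as soon as one structured event is seen, so populating a
// row costs O(1) in the common case instead of walking the whole recording.
func hasStructured(r recording) bool {
	found := false
	r.structured(func(event) bool {
		found = true
		return false // stop iterating
	})
	return found
}

func main() {
	fmt.Println(hasStructured(recording{events: []event{{"contention"}}})) // true
	fmt.Println(hasStructured(recording{}))                                // false
}
```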
pkg/sql/txn_state.go, line 162 at r11 (raw file):
    var traceOpts []tracing.SpanOption
    if !tranCtx.execTestingKnobs.Pretend59315IsFixed {
        traceOpts = append(traceOpts, tracing.WithBypassRegistry())
Please add a comment to indicate that this conditional and the traceOpts can be removed entirely once the linked issue has been addressed.
pkg/sql/sem/builtins/builtins.go, line 3590 at r11 (raw file):
    sp := tracing.SpanFromContext(ctx.Context)
    if sp == nil {
        return nil, errors.New("no active trace span")
The proper way in SQL to indicate there's no value is to return NULL. I don't see a good reason to trash the SQL transaction in this case.
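For illustration, one plausible shape of that fix, reusing the names from the quoted snippet (a sketch, not necessarily the exact diff that landed):

```go
sp := tracing.SpanFromContext(ctx.Context)
if sp == nil {
	// No active span: return SQL NULL instead of erroring out and
	// poisoning the transaction.
	return tree.DNull, nil
}
```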
pkg/util/tracing/span.go, line 159 at r9 (raw file):
    // recorded traces from other nodes.
    func (s *Span) ImportRemoteSpans(remoteSpans []tracingpb.RecordedSpan) error {
        if s.tracer.mode() == modeLegacy && s.crdb.recordingType() == RecordingOff {
Isn't this change needed as soon as the first commit? Especially if, as I suggested above, the first commit were extended to assert that trace events are in fact collected across remote RPCs?
pkg/util/tracing/tracer.go, line 299 at r6 (raw file):
    // also create a real child. Additionally, if the caller explicitly asked for
    // a real span they need to get one. In all other cases, a noop span will do.
    if !t.AlwaysTrace() &&
As a reviewer, I think that `!(t.AlwaysTrace() || opts.parentTraceID() != 0 || opts.ForceRealSpan)` gives me a more intuitive idea of what this condition is about (i.e. "if I don't have any reason to trace").
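For reference, the suggested early-out would look roughly like this (a sketch reusing names from the quoted snippet; the body of the branch is elided):

```go
if !(t.AlwaysTrace() || opts.parentTraceID() != 0 || opts.ForceRealSpan) {
	// No reason to trace: hand back the tracer's shared noop span.
}
```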
As was rightly pointed out by @knz during review in cockroachdb#59992, the tracing interceptors were only tested indirectly. In this commit, we add explicit tests that set up a gRPC server and verify that for all combinations of unary/streaming the correct trace spans get created. Release note: None
TFTR!
Regarding the first commit: I am not convinced by the unit test -- if we want to check that traces get collected across RPC boundaries, I'd expect a test that sets up an RPC server, sends an actual RPC over a network connection, and asserts that the expected trace events are communicated.
Good point - I added a comprehensive test for this in an additional commit.
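Roughly, the shape of such a test, sketched here with gRPC's bufconn and the standard health service standing in for a real CockroachDB RPC; the interceptor wiring and the tracing assertions are placeholders, since they depend on the actual pkg/util/tracing API:

```go
package tracingtest

import (
	"context"
	"net"
	"testing"

	"google.golang.org/grpc"
	"google.golang.org/grpc/health"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
	"google.golang.org/grpc/test/bufconn"
)

func TestInterceptorsPropagateSpans(t *testing.T) {
	lis := bufconn.Listen(1 << 20)

	// Hypothetical tracing interceptors would be installed here, e.g. via
	// grpc.UnaryInterceptor / grpc.StreamInterceptor on the server and
	// grpc.WithUnaryInterceptor / grpc.WithStreamInterceptor on the client.
	srv := grpc.NewServer()
	healthpb.RegisterHealthServer(srv, health.NewServer())
	go func() { _ = srv.Serve(lis) }()
	defer srv.Stop()

	ctx := context.Background()
	conn, err := grpc.DialContext(ctx, "bufnet",
		grpc.WithContextDialer(func(ctx context.Context, _ string) (net.Conn, error) {
			return lis.DialContext(ctx)
		}),
		grpc.WithInsecure(),
	)
	if err != nil {
		t.Fatal(err)
	}
	defer conn.Close()

	// Send an actual RPC over the (in-memory) connection.
	if _, err := healthpb.NewHealthClient(conn).Check(ctx, &healthpb.HealthCheckRequest{}); err != nil {
		t.Fatal(err)
	}

	// A real test would now assert that the client span's recording contains
	// the events logged on the server side (elided; depends on the tracing API).
}
```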
In the 6th commit I dislike the approach you've taken, with more explanations about my view below.
Fair point, I assume this is more a failure to communicate, see the inline discussion.
I added a few more commits - haven't done any squashing or rearranging to keep the review easier, PTAL.
Dismissed @irfansharif from 3 discussions.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @angelapwen, @irfansharif, and @knz)
pkg/sql/crdb_internal.go, line 1181 at r11 (raw file):
Previously, knz (kena) wrote…
Ouch. I thought the entire point of this vtable, which leaves out the details of traces, was to avoid iterating over all events while populating the table, so that the SQL client can apply a selection before we pay the cost of looking into a recording.
Can you not have a new built-in function which takes a trace/span ID as an argument and returns a JSON blob with its properties, including whether it contains any structured event as a boolean (and make the rec.Structured logic stop the iteration as soon as 1 is encountered)?
For context, I think you are under-estimating the run-time and memory costs of crdb_internal tables generally and how much (ab)use they take by end-users and our support staff. This stuff needs to be programmed defensively, with adequate explanatory comments about the costs in the table's comment.
I agree with everything you're saying, and should have clarified how this code came to be (my bad). Essentially @angelapwen had trouble getting the structured data to show up in her attempt to add a built-in to render the payloads from a given span, and while investigating I found the random bits of breakage that caused this. I want this PR to give her a solid foundation to work from, and that meant adding at least a rudimentary test confirming that the payloads do indeed flow from where they originate to where they need to be. Long story short - I would like Angela (with your help) to tackle the shortcomings you're describing. I don't think num_structured should remain in the table once we have a way to ask for the payloads of a specified trace. This was just the easiest thing I could do.
pkg/sql/txn_state.go, line 162 at r11 (raw file):
Previously, knz (kena) wrote…
Please add a comment to indicate that this conditional and the traceOpts can be removed entirely once the linked issue has been addressed.
Done.
pkg/sql/sem/builtins/builtins.go, line 3590 at r11 (raw file):
Previously, knz (kena) wrote…
The proper way in SQL to indicate there's no value is to return NULL. I don't see a good reason to trash the SQL transaction in this case.
Done.
pkg/util/tracing/span.go, line 159 at r9 (raw file):
Previously, knz (kena) wrote…
Isn't this change needed at soon as the first commit? Especially if, as I suggested above, the first commit was extended to assert that trace events are in fact collected across remote RPCs?
Yes, I will rearrange before the merge (when I squash the fixup commits).
pkg/util/tracing/tracer.go, line 299 at r6 (raw file):
Previously, knz (kena) wrote…
As a reviewer, I think that
!(t.AlwaysTrace() || opts.parentTraceID() != 0 || opts.ForceRealSpan)
gives me a more intuitive idea of what this condition is about (i.e. "if I don't have any reason to trace")
Done.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif, @knz, and @tbg)
pkg/sql/crdb_internal.go, line 1181 at r11 (raw file):
Previously, tbg (Tobias Grieger) wrote…
I agree with everything you're saying, and should have clarified how this code came to be (my bad). Essentially @angelapwen had trouble getting the structured data to show up in her attempt to add a built-in to render the payloads from a given span, and while investigating I found the random bits of breakage that caused this. I want this PR to give her a solid foundation to work from, and that meant adding at least a rudimentary test confirming that the payloads do indeed flow from where they originate to where they need to be. Long story short - I would like Angela (with your help) to tackle the shortcomings you're describing. I don't think num_structured should remain in the table once we have a way to ask for the payloads of a specified trace. This was just the easiest thing I could do.
Acknowledged — this can be removed now that #60616 successfully finds a contention event from a trace via payload.
pkg/sql/crdb_internal.go, line 1187 at r17 (raw file):
    if err := addRow(
        // TODO(angelapwen): we're casting uint64s to int64 here,
        // is that ok?
Hm, I see what you mean - there could be some casting problems if the unsigned int was large enough. I think the DInt needs to be an int64, but we could potentially return an error if the ints are too large?
pkg/sql/sem/builtins/builtins.go, line 3595 at r17 (raw file):
Nit: did you mean "possibly this is not"? Not sure what this means.
Reviewed 1 of 1 files at r14, 1 of 1 files at r15, 1 of 1 files at r16, 7 of 7 files at r17.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @tbg)
pkg/sql/crdb_internal.go, line 1187 at r17 (raw file):
Previously, angelapwen (Angela P Wen) wrote…
Hm, I see what you mean - there could be some casting problems if the unsigned int was large enough. I think the DInt needs to be an int64, but we could potentially return an error if the ints are too large?
I think the conversion between uint and int in Go is lossless, i.e. it does not lose information. A large uint value will convert to a negative signed value, but the conversion remains bijective.
Since these are identifiers (and not quantities), we only care about the "identifying" nature of the values.
However, I do agree it may be confusing for UX. @tbg was there any reason to make the trace/span IDs unsigned?
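A quick self-contained illustration of the "lossless but possibly negative" point:

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	// A trace ID with the top bit set converts to a negative int64, but the
	// round trip back to uint64 recovers the original value exactly.
	id := uint64(math.MaxUint64 - 41)
	asInt := int64(id)  // negative, but no information lost
	back := uint64(asInt) // identical to id
	fmt.Println(asInt < 0, back == id) // true true
}
```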
pkg/sql/sem/builtins/builtins.go, line 3595 at r17 (raw file):
Previously, angelapwen (Angela P Wen) wrote…
Nit: did you mean "possibly this is not"? Not sure what this means.
I think the comment is correct, although Tobias could have clarified "stable (i.e. non-volatile)".
(Angela: The volatility attribute indicates whether the result of a built-in function is going to change throughout the execution of the statement. Since our trace IDs are fixed for an entire SQL txn, or at least for an entire SQL stmt, I believe like Tobias that they are stable and thus this built-in can be marked as non-volatile, but I also agree it's a hassle to test / validate and so to be conservative we can keep this as is.)
@tbg I think also it's worth filing an issue for the optimizer team to look at. The queries we'd like to advertise to consume trace data will require SQL joins. If this builtin is marked volatile, the query plans will be more complex. We need some research to investigate the penalty and set the volatility properly if the penalty is high.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @angelapwen)
pkg/sql/crdb_internal.go, line 1187 at r17 (raw file):
Previously, knz (kena) wrote…
I think the conversion between uint and int in Go is lossless, i.e. it does not lose information. A large uint value will convert to a negative signed value, but the conversion remains bijective.
Since these are identifiers (and not quantities), we only care about the "identifying" nature of the values.
However, I do agree it may be confusing for UX. @tbg was there any reason to make the trace/span IDs unsigned?
They have historically been unsigned, and I don't think there's a good reason. However, we can't just change it due to the necessary migration. Concretely, I think where the uintness matters is here:
cockroach/pkg/util/tracing/tracer.go, lines 550 to 551 in a1c3912:

    carrier.Set(fieldNameTraceID, strconv.FormatUint(sm.traceID, 16))
    carrier.Set(fieldNameSpanID, strconv.FormatUint(sm.spanID, 16))
and here:
cockroach/pkg/util/tracing/tracer.go, lines 604 to 615 in a1c3912:

    case fieldNameTraceID:
        var err error
        traceID, err = strconv.ParseUint(v, 16, 64)
        if err != nil {
            return opentracing.ErrSpanContextCorrupted
        }
    case fieldNameSpanID:
        var err error
        spanID, err = strconv.ParseUint(v, 16, 64)
        if err != nil {
            return opentracing.ErrSpanContextCorrupted
        }
So for 20.2 support, we have to keep it as a uint on the wire. Which should be fine! But honestly I'm not sure there is enough of a reason here to make the change. uint64 is a fine choice and nobody likes IDs that have signs. Can we make the ID columns BIGINTs instead? I do agree that we want the representation of IDs to match between internal and SQL, or we're going to be very confused once we walk further down the path in which trace messages are "stored" via log messages.
Filed #60668
pkg/sql/sem/builtins/builtins.go, line 3595 at r17 (raw file):
Previously, knz (kena) wrote…
I think the comment is correct, although Tobias could have clarified "stable (i.e. non-volatile)".
(Angela: The volatility attribute indicates whether the result of a built-in function is going to change throughout the execution of the statement. Since our trace IDs are fixed for an entire SQL txn, or at least for an entire SQL stmt, I believe like Tobias that they are stable and thus this built-in can be marked as non-volatile, but I also agree it's a hassle to test / validate and so to be conservative we can keep this as is.)
@tbg I think also it's worth filing an issue for the optimizer team to look at. The queries we'd like to advertise to consume trace data will require SQL joins. If this builtin is marked volatile, the query plans will be more complex. We need some research to investigate the penalty and set the volatility properly if the penalty is high.
I think we should reach out when we have some actual queries, otherwise we're asking them in a vacuum.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @angelapwen, @irfansharif, and @knz)
pkg/sql/logictest/logic.go, line 567 at r23 (raw file):
Previously, RaduBerinde wrote…
[nit] `5node-pretend-59315-is-fixed`
I didn't do this, was too scared to eat another cycle
pkg/sql/logictest/logic.go, line 567 at r23 (raw file): Previously, tbg (Tobias Grieger) wrote…
Heh, no problem. Hopefully it will be fixed soon so we won't have to pretend anymore :)
59588: streamingest: add job level test for stream ingestion r=pbardea a=adityamaru

This change adds a test that exercises all the components of the stream ingestion flow. It fixes some missing pieces that were discovered while writing the test. Informs: #59175

Release note: None

60424: sql: include sampled stats in TestSampledStatsCollection r=yuzefovich a=asubiotto

Depends on #59992, which is required for this new regression test to pass. TestSampledStatsCollection would previously only check that stats that are collected regardless of the sample rate are returned. These types of stats (rows/bytes read) are propagated using metadata, rather than the trace. This resulted in us silently failing to collect any stats when sampling was enabled once the tracing mode was reverted back to legacy. To avoid this kind of thing happening again, this commit adds a check that max memory usage is reported to be non-zero.

Release note: None (this is a new feature that has no user impact yet)

60626: kvserver: initialize propBuf LAI tracking r=andreimatei a=andreimatei

The initialization of the LAI tracking in the proposal buffer seems pretty lacking (see #60625). This patch adds initialization of propBuf.liBase at propBuf.Init() time, which is irrelevant for production, but will help future tests, which will surely want a propBuf's first assigned LAIs to have some relationship to the replica state.

Release note: None

Co-authored-by: Aditya Maru <adityamaru@gmail.com>
Co-authored-by: Alfonso Subiotto Marques <alfonso@cockroachlabs.com>
Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
Fixes cockroachdb#60677, removing a stop-gap introduced in cockroachdb#59992.

We were previously leaking "async consensus" spans, which was possible when a given proposal was never flushed out of the replica's proposal buffer. On server shut down, this buffered proposal was never finished, and thus the embedded span never closed. We now add a closer to clean up after ourselves.

Release justification: bug fixes and low-risk updates to new functionality.
Release note: None
We no longer return errors from ImportRemoteSpans, and we saw in the [past] that it is not a good idea to do so - it's easy to mess up cross-version behavior.

[past]: #59992 (comment)

Release justification: low-risk code clarification
Release note: None
61279: kvserver: plug a tracing span leak r=irfansharif a=irfansharif

Fixes #60677, removing a stop-gap introduced in #59992. We were previously leaking "async consensus" spans, which was possible when a given proposal was never flushed out of the replica's proposal buffer. On server shut down, this buffered proposal was never finished, and thus the embedded span never closed. We now add a closer to clean up after ourselves.

Release justification: bug fixes and low-risk updates to new functionality.
Release note: None

---

+cc @angelapwen / @knz / @erikgrinaker for pod-visibility.

61344: sql: add estimated row count to all operators in EXPLAIN r=RaduBerinde a=RaduBerinde

This change adds the estimated row count to all operators in the vanilla EXPLAIN, as long as the tables involved had statistics. Fixes #61310.

Release justification: bug fixes and low-risk updates to new functionality
Release note (sql change): EXPLAIN now shows estimated row counts for all operators even without VERBOSE (except when we don't have statistics for the tables).

Co-authored-by: irfan sharif <irfanmahmoudsharif@gmail.com>
Co-authored-by: Radu Berinde <radu@cockroachlabs.com>
61327: tracing: don't return an error from ImportRemoteSpans r=erikgrinaker,angelapwen a=tbg

We no longer return errors from ImportRemoteSpans, and we saw in the [past] that it is not a good idea to do so - it's easy to mess up cross-version behavior.

[past]: #59992 (comment)

Release justification: low-risk code clarification
Release note: None

Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
We arrived at the previous default rate of 10% back in cockroachdb#59379. This was back when we were creating real tracing spans for all statements, and for sampled statements, we were propagating additional stats payloads. Consequently, what cockroachdb#59379 ended up measuring (and finding the overhead acceptable for) was the performance hit we would incur for propagating stats payloads for statements already using real tracing spans.

Since then, the landscape has changed. Notably we introduced cockroachdb#61777, which made it so that we were only using real tracing spans for sampled statements. This was done after performance analysis in cockroachdb#59424 showed that the use of real tracing spans in all statements resulted in tremendous overhead, for no real benefit.

What this now leaves us with is a sampling rate that was tuned by only considering the stats payload overhead. What we want now is to also consider the overhead of using real tracing spans for sampled statements, vs. not. Doing this analysis gives us a very different picture for what the default rate should be.

---

To find out what the overhead for sampled statements currently is, we experimented with kv95/enc=false/nodes=1/cpu=32. It's a simple benchmark that does little more than one-off statements, so it should give us a concise picture of the sampling overhead. We ran six experiments in total (each corresponding to a pair of read+write rows), done in groups of three (each group corresponding to a table below). Each run in turn is comprised of 10 iterations of kv95, and what's varied between each run is the default sampling rate. We pin a sampling rate of 0.0 as the baseline that effectively switches off sampling entirely (and tracing), and measure the throughput degradation as we vary the sampling rate.

                        ops/sec            ops/sec
    --------------------|------------------|------------------
     rate / op    / grp | median    diff   | mean      diff
    --------------------|------------------|------------------
     0.00 / read  / #1  | 69817.90         | 69406.37
     0.01 / read  / #1  | 69300.35  -0.74% | 68717.23  -0.99%
     0.10 / read  / #1  | 67743.35  -2.97% | 67601.81  -2.60%
     0.00 / write / #1  |  3672.55         |  3653.63
     0.01 / write / #1  |  3647.65  -0.68% |  3615.90  -1.03%
     0.10 / write / #1  |  3567.20  -2.87% |  3558.90  -2.59%

                        ops/sec            ops/sec
    --------------------|------------------|------------------
     rate / op    / grp | median    diff   | mean      diff
    --------------------|------------------|------------------
     0.00 / read  / #2  | 69440.80         | 68893.24
     0.01 / read  / #2  | 69481.55  +0.06% | 69463.13  +0.82% (probably in the noise margin)
     0.10 / read  / #2  | 67841.80  -2.30% | 66992.55  -2.76%
     0.00 / write / #2  |  3652.45         |  3625.24
     0.01 / write / #2  |  3657.55  -0.14% |  3654.34  +0.80%
     0.10 / write / #2  |  3570.75  -2.24% |  3526.04  -2.74%

The results above suggest that the current default rate of 10% is too high, and a 1% rate is much more acceptable.

---

The fact that the cost of sampling is largely dominated by tracing is extremely unfortunate. We have ideas for how that can be improved (prototyped in cockroachdb#62227), but they're much too invasive to backport to 21.1.

It's unfortunate that we only discovered the overhead this late in the development cycle. It was due to two major reasons:

- cockroachdb#59992 landed late in the cycle, and enabled tracing for realsies (by propagating real tracing spans across rpc boundaries). We had done sanity checking for the tracing overhead before this point, but failed to realize that cockroachdb#59992 would merit re-analysis.
- The test that alerted us to the degradation (tpccbench) had been persistently failing for a myriad of other reasons, so we didn't learn until too late that tracing was the latest offender. tpccbench also doesn't deal with VM overload well (something cockroachdb#62361 hopes to address), and after tracing was enabled for realsies, this was the dominant failure mode. This resulted in perf data not making its way to roachperf, which further hid possible indicators that we had a major regression on our hands. We also didn't have a healthy process looking at roachperf on a continual basis, something we're looking to rectify going forward. We would've picked up on this regression had we been closely monitoring the kv95 charts.

Release note: None
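For reference, the essence of how a per-statement sampling rate like this gates the tracing cost (a hedged sketch; shouldSample is an illustrative helper, not the actual cluster-setting plumbing):

```go
package main

import (
	"fmt"
	"math/rand"
)

// shouldSample decides, per statement, whether to pay for a real (non-noop)
// tracing span. With rate=0.01, only ~1% of statements incur the tracing cost
// measured above.
func shouldSample(rate float64, rng *rand.Rand) bool {
	return rate > 0 && rng.Float64() < rate
}

func main() {
	rng := rand.New(rand.NewSource(1))
	sampled := 0
	const n = 100000
	for i := 0; i < n; i++ {
		if shouldSample(0.01, rng) {
			sampled++
		}
	}
	fmt.Printf("sampled %d of %d statements (~%.2f%%)\n", sampled, n, 100*float64(sampled)/n)
}
```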
We were previously propagating RPCs across boundaries only when they
were verbose. We now propagate any span (except the noop span)
regardless of current verbosity.
This ensures that SQL (which always creates real, non-verbose spans by
default) can trace the entirety of its operations.
This enables the background collection of contention metadata in 21.1
and more generally, the idea of always-on tracing. Note that we had
previously considered always-on tracing to mean avoiding the noop span
altogether; due to memory retention issues [1] encountered, we are
taking a step back from that idea for the 21.1 release.
Release note: None
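Conceptually, the propagation change boils down to something like the following sketch (hypothetical helper and header names; the real logic lives in the tracing carrier and gRPC interceptor code):

```go
package main

import "fmt"

type span struct {
	noop    bool
	traceID uint64
	spanID  uint64
	verbose bool
}

// injectSpanMeta decides whether to attach trace metadata to outgoing RPC
// headers. Before this PR, only verbose spans crossed the wire; after it, any
// real (non-noop) span does, regardless of verbosity.
func injectSpanMeta(sp *span, headers map[string]string) {
	if sp == nil || sp.noop {
		return // nothing to propagate
	}
	headers["crdb-tracing-traceid"] = fmt.Sprintf("%x", sp.traceID) // hypothetical header names
	headers["crdb-tracing-spanid"] = fmt.Sprintf("%x", sp.spanID)
}

func main() {
	h := map[string]string{}
	injectSpanMeta(&span{traceID: 7, spanID: 9, verbose: false}, h)
	fmt.Println(h) // propagated even though the span isn't verbose
}
```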