Skip to content

Commit

Permalink
Browse files Browse the repository at this point in the history
81107: sql: trace meta2 scans made by crdb_internal.probe_ranges r=joshimhoff a=joshimhoff

**sql: trace meta2 scans made by crdb_internal.probe_ranges**

With this commit, CRDB traces the meta2 scan made by
crdb_internal.probe_ranges. If the scan fails, the trace is attached to the
error returned to the SQL client. This way, outages involving a down meta2
range are observable.

According to my testing, meta2 is the only system range that can make
crdb_internal.probe_ranges unavailable. As a result, with this commit, we
should get a trace of a query to the unavailable range, regardless of which
range is unavailable, either in the SQL table output of
crdb_internal.probe_ranges or in an error string.

Release note. None.

------------------------

This is based on some thinking & testing I did at cockroachdb#79546 (comment).

With this commit, we get:

```
root@:26257/defaultdb> select range_id, error from crdb_internal.probe_ranges(INTERVAL '1s', 'write');
ERROR:      0.000ms      0.000ms    === operation:meta2scan _verbose:1 client:127.0.0.1:49588 node:1 user:root
     0.010ms      0.010ms        === operation:txn coordinator send _verbose:1 client:127.0.0.1:49588 node:1 txnID:be5903f5-1817-481b-8dcf-2e2bc8b0f052 user:root
     0.020ms      0.010ms            === operation:dist sender send _verbose:1 client:127.0.0.1:49588 node:1 txn:be5903f5 user:root
     0.063ms      0.043ms            event:kv/kvclient/kvcoord/range_iter.go:175 [n1,client=127.0.0.1:49588,user=root,txn=be5903f5] querying next range at /Meta2/"\x00"
     0.122ms      0.059ms            event:kv/kvclient/kvcoord/dist_sender.go:2032 [n1,client=127.0.0.1:49588,user=root,txn=be5903f5] r1: sending batch 1 Scan to (n1,s1):1
     0.132ms      0.010ms            event:rpc/nodedialer/nodedialer.go:161 [n1,client=127.0.0.1:49588,user=root,txn=be5903f5] sending request to local client
     0.136ms      0.004ms                === operation:/cockroach.roachpb.Internal/Batch _verbose:1 node:1 span.kind:server
     0.144ms      0.008ms                event:server/node.go:996 [n1] node received request: 1 Scan
     0.174ms      0.030ms                event:kv/kvserver/store_send.go:157 [n1,s1] executing Scan [/Meta2/"\x00",/Meta2/Max), [txn: be5903f5], [can-forward-ts]
     0.335ms      0.161ms                event:server/node.go:1009 [n1] error from stores.Send: boom: "sql txn" meta={id=be5903f5 key=/Min pri=0.02292157 epo=0 ts=1651868092.726275000,0 min=1651868092.726275000,0 seq=0} lock=false stat=PENDING rts=1651868092.726275000,0 wto=false gul=1651868093.226275000,0
     0.384ms      0.252ms            event:kv/kvclient/kvcoord/dist_sender.go:2144 [n1,client=127.0.0.1:49588,user=root,txn=be5903f5] boom: "sql txn" meta={id=be5903f5 key=/Min pri=0.02292157 epo=0 ts=1651868092.726275000,0 min=1651868092.726275000,0 seq=0} lock=false stat=PENDING rts=1651868092.726275000,0 wto=false gul=1651868093.226275000,0
     0.442ms      0.058ms            event:kv/kvclient/kvcoord/dist_sender.go:1674 [n1,client=127.0.0.1:49588,user=root,txn=be5903f5] reply error Scan [/Meta2/"\x00",/Meta2/Max), [txn: be5903f5], [can-forward-ts]: boom: "sql txn" meta={id=be5903f5 key=/Min pri=0.02292157 epo=0 ts=1651868092.726275000,0 min=1651868092.726275000,0 seq=0} lock=false stat=PENDING rts=1651868092.726275000,0 wto=false gul=1651868093.226275000,0
: boom
```

If run a test like this:

```
~/g/s/g/c/cockroach [tweak_crdb_internal_probe] $ git diff
diff --git a/pkg/kv/kvserver/replica_send.go b/pkg/kv/kvserver/replica_send.go
index 3295886e91..ff7a1ed4df 100644
--- a/pkg/kv/kvserver/replica_send.go
+++ b/pkg/kv/kvserver/replica_send.go
`@@` -12,7 +12,10 `@@` package kvserver
+var sigCh = func() chan os.Signal {
+       ch := make(chan os.Signal, 1)
+       signal.Notify(ch, syscall.SIGHUP)
+       return ch
+}()
+
+var weDead atomic.Int32
+
 // sendWithoutRangeID used to be called sendWithRangeID, accepted a `_forStacks
 // roachpb.RangeID` argument, and had the description below. Ever since Go
 // switched to the register-based calling convention though, this stopped
`@@` -126,6 +138,16 `@@` func (r *Replica) sendWithoutRangeID(
 ) (_ *roachpb.BatchResponse, rErr *roachpb.Error) {
        var br *roachpb.BatchResponse
 
+       select {
+       case <-sigCh:
+               log.Warningf(ctx, "time to drop incoming raft messages for range")
+               weDead.Store(1)
+       default:
+       }
+       if weDead.Load() == 1 && ba.RangeID >= 1 {
+               return nil, roachpb.NewError(errors.New("boom"))
+       }
+
        if r.leaseholderStats != nil && ba.Header.GatewayNodeID != 0 {
                r.leaseholderStats.RecordCount(r.getBatchRequestQPS(ctx, ba), ba.Header.GatewayNodeID)
        }
```

Without this commit, we just get:

```
root@:26257/?> select range_id, error from crdb_internal.probe_ranges(INTERVAL '1s', 'write');
ERROR: boom
```

Co-authored-by: Josh Imhoff <josh@cockroachlabs.com>
  • Loading branch information
craig[bot] and joshimhoff committed May 12, 2022
2 parents d05bbbe + fec45b8 commit 9a179ea
Showing 1 changed file with 23 additions and 8 deletions.
31 changes: 23 additions & 8 deletions pkg/sql/sem/builtins/generator_probe_ranges.go
Original file line number Diff line number Diff line change
Expand Up @@ -134,16 +134,31 @@ func makeProbeRangeGenerator(ctx *eval.Context, args tree.Datums) (eval.ValueGen
"only users with the admin role are allowed to use crdb_internal.probe_range",
)
}
// Handle args passed in.
// Trace the query to meta2. Return it as part of the error string if the query fails.
// This improves observability into a meta2 outage. We expect crdb_internal.probe_range
// to be available, unless meta2 is down.
var ranges []kv.KeyValue
{
txn := ctx.Txn
ctx, sp := tracing.EnsureChildSpan(
ctx.Context, ctx.Tracer, "meta2scan",
tracing.WithForceRealSpan(),
)
sp.SetRecordingType(tracing.RecordingVerbose)
defer func() {
sp.Finish()
}()
// Handle args passed in.
ranges, err = kvclient.ScanMetaKVs(ctx, txn, roachpb.Span{
Key: keys.MinKey,
EndKey: keys.MaxKey,
})
if err != nil {
return nil, errors.Wrapf(err, "%s", sp.FinishAndGetConfiguredRecording().String())
}
}
timeout := time.Duration(tree.MustBeDInterval(args[0]).Duration.Nanos())
isWrite := args[1].(*tree.DEnum).LogicalRep
ranges, err := kvclient.ScanMetaKVs(ctx.Context, ctx.Txn, roachpb.Span{
Key: keys.MinKey,
EndKey: keys.MaxKey,
})
if err != nil {
return nil, err
}
return &probeRangeGenerator{
rangeProber: ctx.RangeProber,
timeout: timeout,
Expand Down

0 comments on commit 9a179ea

Please sign in to comment.