Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
81107: sql: trace meta2 scans made by crdb_internal.probe_ranges r=joshimhoff a=joshimhoff **sql: trace meta2 scans made by crdb_internal.probe_ranges** With this commit, CRDB traces the meta2 scan made by crdb_internal.probe_ranges. If the scan fails, the trace is attached to the error returned to the SQL client. This way, outages involving a down meta2 range are observable. According to my testing, meta2 is the only system range that can make crdb_internal.probe_ranges unavailable. As a result, with this commit, we should get a trace of a query to the unavailable range, regardless of which range is unavailable, either in the SQL table output of crdb_internal.probe_ranges or in an error string. Release note. None. ------------------------ This is based on some thinking & testing I did at cockroachdb#79546 (comment). With this commit, we get: ``` root@:26257/defaultdb> select range_id, error from crdb_internal.probe_ranges(INTERVAL '1s', 'write'); ERROR: 0.000ms 0.000ms === operation:meta2scan _verbose:1 client:127.0.0.1:49588 node:1 user:root 0.010ms 0.010ms === operation:txn coordinator send _verbose:1 client:127.0.0.1:49588 node:1 txnID:be5903f5-1817-481b-8dcf-2e2bc8b0f052 user:root 0.020ms 0.010ms === operation:dist sender send _verbose:1 client:127.0.0.1:49588 node:1 txn:be5903f5 user:root 0.063ms 0.043ms event:kv/kvclient/kvcoord/range_iter.go:175 [n1,client=127.0.0.1:49588,user=root,txn=be5903f5] querying next range at /Meta2/"\x00" 0.122ms 0.059ms event:kv/kvclient/kvcoord/dist_sender.go:2032 [n1,client=127.0.0.1:49588,user=root,txn=be5903f5] r1: sending batch 1 Scan to (n1,s1):1 0.132ms 0.010ms event:rpc/nodedialer/nodedialer.go:161 [n1,client=127.0.0.1:49588,user=root,txn=be5903f5] sending request to local client 0.136ms 0.004ms === operation:/cockroach.roachpb.Internal/Batch _verbose:1 node:1 span.kind:server 0.144ms 0.008ms event:server/node.go:996 [n1] node received request: 1 Scan 0.174ms 0.030ms event:kv/kvserver/store_send.go:157 [n1,s1] executing Scan [/Meta2/"\x00",/Meta2/Max), [txn: be5903f5], [can-forward-ts] 0.335ms 0.161ms event:server/node.go:1009 [n1] error from stores.Send: boom: "sql txn" meta={id=be5903f5 key=/Min pri=0.02292157 epo=0 ts=1651868092.726275000,0 min=1651868092.726275000,0 seq=0} lock=false stat=PENDING rts=1651868092.726275000,0 wto=false gul=1651868093.226275000,0 0.384ms 0.252ms event:kv/kvclient/kvcoord/dist_sender.go:2144 [n1,client=127.0.0.1:49588,user=root,txn=be5903f5] boom: "sql txn" meta={id=be5903f5 key=/Min pri=0.02292157 epo=0 ts=1651868092.726275000,0 min=1651868092.726275000,0 seq=0} lock=false stat=PENDING rts=1651868092.726275000,0 wto=false gul=1651868093.226275000,0 0.442ms 0.058ms event:kv/kvclient/kvcoord/dist_sender.go:1674 [n1,client=127.0.0.1:49588,user=root,txn=be5903f5] reply error Scan [/Meta2/"\x00",/Meta2/Max), [txn: be5903f5], [can-forward-ts]: boom: "sql txn" meta={id=be5903f5 key=/Min pri=0.02292157 epo=0 ts=1651868092.726275000,0 min=1651868092.726275000,0 seq=0} lock=false stat=PENDING rts=1651868092.726275000,0 wto=false gul=1651868093.226275000,0 : boom ``` If run a test like this: ``` ~/g/s/g/c/cockroach [tweak_crdb_internal_probe] $ git diff diff --git a/pkg/kv/kvserver/replica_send.go b/pkg/kv/kvserver/replica_send.go index 3295886e91..ff7a1ed4df 100644 --- a/pkg/kv/kvserver/replica_send.go +++ b/pkg/kv/kvserver/replica_send.go `@@` -12,7 +12,10 `@@` package kvserver +var sigCh = func() chan os.Signal { + ch := make(chan os.Signal, 1) + signal.Notify(ch, syscall.SIGHUP) + return ch +}() + +var weDead atomic.Int32 + // sendWithoutRangeID used to be called sendWithRangeID, accepted a `_forStacks // roachpb.RangeID` argument, and had the description below. Ever since Go // switched to the register-based calling convention though, this stopped `@@` -126,6 +138,16 `@@` func (r *Replica) sendWithoutRangeID( ) (_ *roachpb.BatchResponse, rErr *roachpb.Error) { var br *roachpb.BatchResponse + select { + case <-sigCh: + log.Warningf(ctx, "time to drop incoming raft messages for range") + weDead.Store(1) + default: + } + if weDead.Load() == 1 && ba.RangeID >= 1 { + return nil, roachpb.NewError(errors.New("boom")) + } + if r.leaseholderStats != nil && ba.Header.GatewayNodeID != 0 { r.leaseholderStats.RecordCount(r.getBatchRequestQPS(ctx, ba), ba.Header.GatewayNodeID) } ``` Without this commit, we just get: ``` root@:26257/?> select range_id, error from crdb_internal.probe_ranges(INTERVAL '1s', 'write'); ERROR: boom ``` Co-authored-by: Josh Imhoff <josh@cockroachlabs.com>
- Loading branch information