kvcoord: MuxRangeFeed client uses 1 go routine per node #97957

miretskiy · 2023-03-02T23:57:00Z

Rewrite MuxRangeFeed client to use 1 Go routine per node,
instead of 1 Go routine per range.

Prior to this change, MuxRangeFeed client was structured
so that it was entirely compatible with the execution
model of the regular range feed. As a result,
1 Go routine was used per range. This rewrite replaces
old implementation with an almost clean slate implementation
which uses 1 Go routine per node.

Where possible, relatively small and targetted modifications
to the rangefeed library were made to extract common methods
(such as range splitting).

The reduction in the number of Go routines created by rangefeed
has direct impact on the cluster performance, and most importantly
SQL latency. This is mostly due to the fact that with this PR,
the number of Go routines started by MuxRangeFeed is down to
2 per range (on the rangefeed server side) vs 5 for the regular
rangefeed. When running changefeeds against tables with
10s-100s of thousands of ranges, this significant difference
in the Go routine count has direct impact on Go scheduler latency,
the number of runnable Go routines, and ultimately, on the SQL
latency.

Epic: CRDB-25044

Release note (enterprise change) : MuxRangeFeed client (enabled via
changefeed.mux_rangefeed.enabled setting) is more efficient
when running against large scale workloads.

cockroach-teamcity · 2023-03-02T23:57:13Z

This change is

miretskiy · 2023-03-03T00:07:14Z

Setup is: 5 node cluster; 75k splits w/ kv workload. KV workload running, while
changefeed is setup as:

CREATE CHANGEFEED FOR kv INTO 'null://' WITH no_initial_scan

This one shows small, but noticeable drop in the 99.99 SQL (~50ms)

Similarly, we see a drop in Go scheduler latency:

And runnable go routine count:

Total number of go routines is obviously different (375k vs 150k):

erikgrinaker

I need to review the mux client more thoroughly, but I'm all reviewed out for today. Flushing the comments I have so far.

erikgrinaker · 2023-03-10T15:20:48Z

pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go

@@ -173,6 +180,8 @@ func (ds *DistSender) RangeFeedSpans(
 	for _, opt := range opts {
 		opt.set(&cfg)
 	}
+	// TODO(yevgeniy): Drop withDiff argument in favor of explicit RangeFeedOption.
+	cfg.withDiff = withDiff


nit: Can we just do this now?

erikgrinaker · 2023-03-10T17:01:46Z

pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go

 		}
 	}
 	return ctx.Err()
 }

+type rangefeedErrorInfo struct {
+	transient bool // true if error is transient and should be retried.


We never actually use this field for anything. Is the idea that errors must have an explicit action? If so, let's assert that either this or restart is set. Otherwise, let's drop it.

erikgrinaker · 2023-03-10T17:09:23Z

pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go

 		}
 	}
 	return ctx.Err()
 }

+type rangefeedErrorInfo struct {
+	transient bool // true if error is transient and should be retried.
+	restart   bool // true if the rangefeed for this span needs to be restarted.


restart feels like a misnomer, since we'll restart the rangefeed even when this isn't set too. I think the crucial property here is that we refresh the range descriptors because the range structure has changed, so consider e.g. refreshDescriptors or refreshSpan or something that indicates this.

erikgrinaker · 2023-03-10T17:19:21Z

pkg/kv/kvclient/kvcoord/dist_sender_mux_rangefeed.go

+
+	{
+		ptr, exists := m.muxClients.Load(int64(nodeID))
+		if !exists {


Why don't we call LoadOrStore immediately here? Is the future construction expensive enough to matter?

erikgrinaker · 2023-03-10T17:31:44Z

pkg/kv/kvclient/kvcoord/dist_sender_mux_rangefeed.go

+			recvErr = nil
+		}
+
+		if _, err := handleRangefeedError(ctx, recvErr); err != nil {


Shouldn't we properly handle cache eviction here, to avoid unnecessary cache invalidation on unrelated errors?

erikgrinaker · 2023-03-10T17:40:40Z

pkg/kv/kvclient/kvcoord/dist_sender_mux_rangefeed.go

 	client rpc.RestrictedInternalClient,
 	nodeID roachpb.NodeID,
-) error {
+	conn *future.Future[connOrError],


I'm finding it really confusing that we're passing in both client and conn here. Consider lifting all of the stream setup stuff above here, and make it clear that conn is not an input per sé but rather a result that will be populated from client. Maybe also consider using stream rather than conn for each individual multiplexed stream, since conn reads more like the underlying gRPC transport connection rather than the logical multiplexed stream (which logically wouldn't be possible since client depends on the transport connection).

erikgrinaker

No major issues at an initial glance, but let's resolve the comments first.

I'd encourage @aliher1911 to do another review pass over the mux rangefeed machinery as part of the upcoming rangefeed work, since he'll presumably be more intimately familiar with this area by then.

erikgrinaker · 2023-03-13T10:20:48Z

pkg/kv/kvclient/kvcoord/dist_sender_mux_rangefeed.go

+	// must be protected by this mutex (streams itself is thread safe,
+	// the only reason why this mu is required to be held is to ensure
+	// correct synchronization between start of a new rangefeed feed, and
+	// mux node connection tear down).


Should we move streams under mu then?

erikgrinaker · 2023-03-13T10:23:28Z

pkg/kv/kvclient/kvcoord/dist_sender_mux_rangefeed.go

-	err  error
-}
+	active.release()
+	active.token.Evict(ctx)


Same comment as above, we should only evict this when we need to since it affects foreground tail latencies.

erikgrinaker · 2023-03-13T10:26:05Z

pkg/kv/kvclient/kvcoord/dist_sender_mux_rangefeed.go

+	if err != nil {
+		return err
+	}
+	return divideSpanOnRangeBoundaries(ctx, m.ds, rs, active.startAfter, m.startSingleRangeFeed)


We can avoid the meta iteration by checking rangefeedErrorInfo.restart.

erikgrinaker · 2023-03-13T10:30:28Z

pkg/kv/kvclient/kvcoord/dist_sender_mux_rangefeed.go

+			if err := m.restartActiveRangeFeed(ctx, active, t.Error.GoError()); err != nil {
+				return err
+			}
+			continue


Can you write up the overall lifecycle somewhere? It isn't immediately clear to me why it's safe to reuse conn and conn.receiver here.

erikgrinaker · 2023-03-13T10:32:46Z

pkg/kv/kvclient/kvcoord/dist_sender_mux_rangefeed.go

 			}
+		case *kvpb.RangeFeedSSTable:


nit: stray case? Should we explicitly handle all event types here and error out on unknown events?

erikgrinaker · 2023-03-13T10:36:28Z

pkg/kv/kvclient/kvcoord/dist_sender_mux_rangefeed.go

+		if timeutil.Now().Before(nextStuckCheck) {
+			if threshold := stuckThreshold(); threshold > 0 {
+				if err := conn.eachStream(func(id int64, a *activeMuxRangeFeed) error {
+					if !a.startAfter.IsEmpty() && timeutil.Since(a.startAfter.GoTime()) > stuckThreshold() {


Won't this misfire if the eventCh send above is slow? Consider moving it higher up so that it primarily reacts to server events. Although I guess it doesn't matter all that much since we'll have head-of-line blocking between streams anyway which can cause false positives.

miretskiy

Ack.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)

pkg/kv/kvclient/kvcoord/dist_sender_mux_rangefeed.go line 106 at r4 (raw file):