
changefeedccl: webhook sink rewrite #99086

Merged
merged 2 commits into from
Mar 31, 2023

Conversation

samiskin
Contributor

@samiskin samiskin commented Mar 21, 2023

Resolves #84676
Epic: https://cockroachlabs.atlassian.net/browse/CRDB-11356

This PR reimplements the webhook sink on top of a more general batchingSink framework that makes adding new sinks easier, and makes the webhook sink far more performant than it was previously.

A follow-up PR will move the pubsub client, which suffers from similar performance issues, onto the batchingSink as well.


Sink-specific code is encapsulated in a SinkClient interface

type SinkClient interface {
        MakeResolvedPayload(body []byte, topic string) (SinkPayload, error)
        MakeBatchWriter() BatchWriter
        Flush(context.Context, SinkPayload) error
        Close() error
}

type BatchWriter interface {
        AppendKV(key []byte, value []byte, topic string)
        ShouldFlush() bool
        Close() (SinkPayload, error)
}

type SinkPayload interface{}

Once the batch is ready to be flushed, the writer can be Close()'d to do any final formatting of the buffered data (ex: wrapping it in a JSON object with extra metadata) and produce a final SinkPayload that is ready to be passed to SinkClient.Flush.

The SinkClient has a separate MakeResolvedPayload since the sink may require resolved events to be formatted differently from a batch of KVs.

Flush(ctx, payload) encapsulates sending a blocking IO request to the sink endpoint, and may be called multiple times with the same payload due to retries. Any formatting work should therefore be done in the writer's Close and stored in the SinkPayload, so that retried Flush calls don't repeat that work.
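To make the division of labor concrete, here's a rough sketch of a hypothetical SinkClient/BatchWriter pair against the interfaces above (the final code renames BatchWriter/AppendKV to BatchBuffer/Append). Everything outside the interface methods (the newlineClient/newlineBuffer names, the newline-joined payload, the batch-of-100 threshold) is a made-up placeholder, not the actual webhook client:

```go
// Assumes imports of "bytes", "context", and "fmt". Placeholder implementation
// for illustration only; the real client adds HTTP config, retries, and metrics.
type newlineClient struct{}

type newlineBuffer struct {
	rows [][]byte
}

func (newlineClient) MakeBatchWriter() BatchWriter { return &newlineBuffer{} }

func (newlineClient) MakeResolvedPayload(body []byte, _ string) (SinkPayload, error) {
	return body, nil // resolved timestamps are emitted as-is in this sketch
}

func (newlineClient) Flush(ctx context.Context, p SinkPayload) error {
	// Flush may be retried with the same payload, so no formatting happens here;
	// a real client would POST the payload to the sink endpoint.
	_, err := fmt.Printf("flushing %d bytes\n", len(p.([]byte)))
	return err
}

func (newlineClient) Close() error { return nil }

func (b *newlineBuffer) AppendKV(key, value []byte, _ string) { b.rows = append(b.rows, value) }
func (b *newlineBuffer) ShouldFlush() bool                    { return len(b.rows) >= 100 }

// Close does the one-time formatting so retried Flush calls don't repeat it.
func (b *newlineBuffer) Close() (SinkPayload, error) {
	return bytes.Join(b.rows, []byte("\n")), nil
}
```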


The batchingSink handles all the logic to take a SinkClient and form a full Sink implementation.

type batchingSink struct {
        client             SinkClient
        ioWorkers          int
        minFlushFrequency  time.Duration
        retryOpts          retry.Options
        eventPool          sync.Pool
        batchPool          sync.Pool
        eventCh            chan interface{}
        pacer              *admission.Pacer
        ...
}

var _ Sink = (*batchingSink)(nil)

It involves a single goroutine which handles:

  • Creating, building up, and finalizing BatchWriters to eventually form a SinkPayload to emit
  • Flushing batches when they have persisted longer than a configured minFlushFrequency
  • Flushing deliberately and being able to block until the Flush has completed
  • Logging all the various sink metrics

EmitRow calls are thread-safe, so the safeSink wrapper is not required for users of this sink.
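For illustration, a loose sketch of what that single worker loop looks like; a ticker stands in for the real minFlushFrequency bookkeeping, and kvEvent/flushReq along with the newBatchBuffer/dispatch helpers are assumed names rather than the exact implementation:

```go
// Rough sketch only; not the actual batchingSink code.
func (s *batchingSink) runBatchingWorker(ctx context.Context) error {
	batch := s.newBatchBuffer() // assumed helper that claims a buffer from the pool
	ticker := time.NewTicker(s.minFlushFrequency)
	defer ticker.Stop()

	flush := func() {
		if batch.numMessages > 0 {
			s.dispatch(ctx, batch) // assumed: finalize the payload and hand it to the IO layer
			batch = s.newBatchBuffer()
		}
	}

	for {
		select {
		case e := <-s.eventCh:
			switch r := e.(type) {
			case *kvEvent:
				batch.Append(r) // the batch takes ownership of the event's data and alloc
				if batch.writer.ShouldFlush() {
					flush()
				}
			case flushReq:
				flush()
				close(r.waiter) // assumed field; unblocks the caller waiting on the flush
			}
		case <-ticker.C:
			flush() // the batch has been sitting longer than minFlushFrequency
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}
```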

Events sent through the goroutines would normally need to exist on the heap, but to avoid excessive garbage collection of hundreds of thousands of tiny structs, both the kvEvent{<data from EmitRow>} events (sent from the EmitRow caller to the batching worker) and the sinkBatchBuffer{<data about the batch>} events (sent from the batching worker to the IO routine described in the next section) are allocated from object pools.
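The pooling itself is the standard sync.Pool claim/free pattern; a minimal sketch, assuming hypothetical kvEvent fields:

```go
// Sketch of the claim/free cycle; the real kvEvent fields differ.
var eventPool = sync.Pool{
	New: func() interface{} { return new(kvEvent) },
}

func newKVEvent(key, val []byte, topic string) *kvEvent {
	e := eventPool.Get().(*kvEvent) // reuse a previously freed event when possible
	e.key, e.val, e.topic = key, val, topic
	return e
}

func freeKVEvent(e *kvEvent) {
	*e = kvEvent{} // zero it so stale byte slices don't keep memory alive
	eventPool.Put(e)
}
```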


For a sink like Cloudstorage, where batches are large, doing the above and flushing the batch payloads one at a time on a separate routine is good enough. Unfortunately, the webhook sink can be used with no batching at all by users who want the lowest latency while still having good throughput, which means we need to be able to have multiple requests in flight. The difficulty is that if a batch with keys [a1,b1] is in flight, a batch with keys [b2,c1] needs to block until [a1,b1] completes, as b2 cannot be sent and risk arriving at the destination prior to b1.

Flushing out payloads in a way that maintains key-ordering guarantees while still running in parallel is handled by a separate parallelIO struct.

type parallelIO struct {
	retryOpts retry.Options
	ioHandler IOHandler
	requestCh chan IORequest
	resultCh  chan IORequest
  ...
}

type IOHandler func(context.Context, IORequest) error

type IORequest interface {
	Keys() intsets.Fast
	SetError(error)
}

It involves one goroutine to manage the key ordering guarantees and a configurable number of IO Worker goroutines that simply call ioHandler on an IORequest.

IORequests declare the keys they shouldn't conflict on by providing an intsets.Fast struct, which supports the efficient Union/Intersects/Difference operations that parallelIO needs to maintain its ordering guarantees.

Requests are received as IORequests and responses are also returned as IORequests. This way the parallelIO struct does not have to do any heap allocations to communicate; its user can manage creating and freeing these objects in pools. The only heap allocations that occur are part of the intset operations, as they use a linked list internally.
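The core admission check boils down to a couple of intset operations; a simplified sketch (tryStartIO and the pending slice are names made up for this example, but Intersects/UnionWith/DifferenceWith are the actual intsets.Fast operations used):

```go
// Simplified sketch of the ordering logic, not the real parallelIO code.
// A request may start IO only if none of its keys are already in flight.
func tryStartIO(inflight *intsets.Fast, pending *[]IORequest, req IORequest) bool {
	if inflight.Intersects(req.Keys()) {
		// One of these keys is already being flushed; the request must wait so it
		// cannot overtake the earlier batch at the destination.
		*pending = append(*pending, req)
		return false
	}
	inflight.UnionWith(req.Keys()) // reserve the keys, then hand off to an IO worker
	return true
}

// When a request completes, its keys are released and the pending requests are
// re-checked in order:
//
//	inflight.DifferenceWith(done.Keys())
//	// ...retry tryStartIO on each still-pending request...
```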


The webhook sink is therefore formed by:

  1. EmitRow is called, creating kvEvents that are sent to a batching worker
  2. The batching worker takes events and appends them to a batch
  3. Once the batch is full, it's encoded into an HTTP request
  4. The request object is then sharded across a set of IO workers to be fully sent out in parallel with other non-key-conflicting requests.

With this setup, at high throughputs most of the batchingSink/parallelIO work barely shows up in the CPU flamegraph; the cost was largely in step 3, where taking a list of messages and calling json.Marshal on it took almost 10% of the time, specifically in a call to json.Compress.

Since this isn't needed, and all we're doing is putting a list of already-formatted JSON messages into a surrounding JSON array and a small wrapper object, I also swapped json.Marshal for manually stitching the bytes together into a buffer.
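Roughly, the encoding step becomes byte-level concatenation of the already-encoded rows; a sketch of the idea (the envelope shape shown here is illustrative, not necessarily the exact wire format):

```go
// Assumes imports of "bytes" and "strconv". Stitches pre-encoded JSON rows into
// a wrapper object without re-marshaling them.
func encodeBatch(rows [][]byte) []byte {
	var buf bytes.Buffer
	buf.WriteString(`{"payload":[`)
	for i, row := range rows {
		if i > 0 {
			buf.WriteByte(',')
		}
		buf.Write(row) // each row is already a valid JSON object from the encoder
	}
	buf.WriteString(`],"length":`)
	buf.WriteString(strconv.Itoa(len(rows)))
	buf.WriteByte('}')
	return buf.Bytes()
}
```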


In the following flamegraph of a node at around 35% CPU usage, only 5.56% of the total CPU time in the graph (three small chunks between the parallelEventConsumer chunk and the kvevent chunk) is taken up by the parallelIO and batchingSink workers. This is with a batch size of 100.

Screenshot 2023-03-20 at 10 48 33 PM

The max CPU usage here was around 37% with a max throughput of 135k for a single node (the other nodes had run out of data at this point). Since the majority of the flamegraph shows time spent in the event-processing code, I'm going to assume this will be handled by the pacer and won't be much of an issue.

In the above flamegraph, runtime.gcDrain does show up using 10.55% CPU, but the cloudstorage sink showed around the same value when I tried it, so I'm guessing there isn't an extra GC thrashing issue. I believe the only non-pool allocations that occur are the intsets.

The following graph demonstrates the webhook sink first with batches of 100 messages, followed by no batching, on TPCC with 500 warehouses, on a 3-node, 16-CPU roachtest cluster. At peak the batched throughput is 350k messages per second, and at peak the unbatched throughput is 61k.

Screenshot 2023-05-12 at 2 56 23 PM

This is a similar graph for the old webhook sink: 18k and 3.75k messages per second for batches of 100 and no batching respectively.
Screenshot 2023-05-12 at 3 40 28 PM


Since Matt's talked about a new significance being placed on feature-flagging new work to avoid the need for technical advisories, I placed this new implementation behind the changefeed.new_webhook_sink_enabled setting, which defaults to disabled.

Right now it's sink_webhook_v2 just to keep sink_webhook.go unchanged so that this review is easier to do. I may move sink_webhook to deprecated_sink_webhook and rename sink_webhook_v2 to sink_webhook prior to merging.
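For reference, the gate is a boolean cluster setting with a metamorphic test default; a sketch assembled from the snippets quoted in the review below (the exact placement and wording in the final code may differ):

```go
// Sketch based on the review snippets; the metamorphic default lets tests
// randomly exercise the new sink while production clusters default to the old one.
var newWebhookSinkEnabled = settings.RegisterBoolSetting(
	settings.TenantWritable,
	"changefeed.new_webhook_sink_enabled",
	"if enabled, this setting enables a new implementation of the webhook sink"+
		" that allows for a much higher throughput",
	util.ConstantWithMetamorphicTestBool("changefeed.new_webhook_sink_enabled", false),
)
```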


Release note (performance improvement): the webhook sink is now able to handle a drastically higher maximum throughput by enabling the "changefeed.new_webhook_sink_enabled" cluster setting.

@cockroach-teamcity
Member

This change is Reviewable

@samiskin samiskin force-pushed the sink-refactor-v2 branch 3 times, most recently from 15ddb52 to 7e7c5ff on March 21, 2023 03:42
Contributor

@miretskiy miretskiy left a comment


Reviewed 1 of 12 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @samiskin)


pkg/ccl/changefeedccl/changefeedbase/settings.go line 284 at r1 (raw file):

	false,
)

nit: not sure if this bi-furcation of the code is needed; but okay.
Please do move this low level setting right next to where we create the sink and keep it un-exported.
I don't think temporary setting raises up to the changefeed wide level.


pkg/ccl/changefeedccl/changefeedbase/settings.go line 292 at r1 (raw file):

	util.ConstantWithMetamorphicTestBool("changefeed.new_webhook_sink_enabled", false),
)

comments on exported settings.

Contributor

@miretskiy miretskiy left a comment


Reviewable is having a hard time w/ so many commits; can you squash them?

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @samiskin)

@samiskin samiskin force-pushed the sink-refactor-v2 branch 2 times, most recently from 11004aa to f3c1faf on March 21, 2023 16:21
Contributor

@miretskiy miretskiy left a comment


Reviewed 6 of 12 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @samiskin)


pkg/ccl/changefeedccl/batching_sink.go line 39 at r1 (raw file):

// to the sink.
type BatchWriter interface {
	AppendKV(key []byte, value []byte, topic string)

nit: perhaps just append? Cause it's not KV we are appending?
Also, a short comment on this would be nice. I kinda like the previous name (buffer) -- where append has a good
meaning. With batchWriter things are less clear.


pkg/ccl/changefeedccl/batching_sink.go line 41 at r1 (raw file):

	AppendKV(key []byte, value []byte, topic string)
	ShouldFlush() bool
	Close() (SinkPayload, error)

This is a strange close signature -- but i'll keep reading.


pkg/ccl/changefeedccl/batching_sink.go line 69 at r1 (raw file):

	// claimed and freed from object pools.
	eventPool sync.Pool
	batchPool sync.Pool

you want pools to be global.
make it so, plus add helpers to get objects of appropriate type.


pkg/ccl/changefeedccl/batching_sink.go line 90 at r1 (raw file):

}

type kvEvent struct {

nit: i know it's "key value" ... but I just have hard time not reading "kvEvent" as something KV layer specific.
can we have a better name? encodedEvent? simply event? emitEvent? anything other than kv, really.


pkg/ccl/changefeedccl/batching_sink.go line 191 at r1 (raw file):

// sinkBatchBuffer stores an in-progress/complete batch of messages, along with
// metadata related to the batch.

nit: (feel free to ignore): we are already inside sink specific code. I think batchBuffer might be sufficient as a name.


pkg/ccl/changefeedccl/batching_sink.go line 194 at r1 (raw file):

type sinkBatchBuffer struct {
	writer  BatchWriter
	payload SinkPayload // payload is nil until FinalizePayload has been called

curious why we need to store it then, instead of returning it from FinalizePayload? Not sure, but, I'll keep reading.


pkg/ccl/changefeedccl/batching_sink.go line 198 at r1 (raw file):

	numMessages int
	numKVBytes  int // the total amount of uncompressed kv data in the batch
	keys        intsets.Fast

keys deserves a comment.


pkg/ccl/changefeedccl/batching_sink.go line 239 at r1 (raw file):

}

// Append adds the contents of a kvEvent to the batch, merging its alloc pool

nit: missing period.


pkg/ccl/changefeedccl/batching_sink.go line 264 at r1 (raw file):

}

func (bs *batchingSink) newBatchBuffer() *sinkBatchBuffer {

nit: (or a joke) have hard time reading code that starts with "bs" -- if you know what I mean.


pkg/ccl/changefeedccl/batching_sink.go line 294 at r1 (raw file):

		defer func() {
			batchBuffer = bs.newBatchBuffer()
		}()

perhaps we should be a bit more explicit (and 1 line shorter):

toFlush := batchBuffer
batchBuffer = bs.newBatchBuffer()

pkg/ccl/changefeedccl/batching_sink.go line 305 at r1 (raw file):

		for {
			select {
			case <-ctx.Done():

I find it hard to reason about functions that seem to swallow error (ctx.Err() in this case).
Yes, I understand that this checking happens somewhere else, but -- that's what I mean by "hard to reason" -- I have
to look elsewhere to convince myself that this is correct. Wouldn't it be better to just have this function return an error?


pkg/ccl/changefeedccl/parallel_io.go line 19 at r1 (raw file):

)

// parallelIO allows submitting requests to do blocking "IOHandler" calls on

nit: perhaps "parallelIO allows to perform blocking "IOHandler" calls in parallel?


pkg/ccl/changefeedccl/parallel_io.go line 26 at r1 (raw file):

// until [a,b] completes, then [c,d] will block until [b,c] completes. If [c,d]
// errored, [b,c] would never be sent, and SetError would be called on [c,d]
// prior to it being returned on resultCh.

what about SetError on batches that were already in flight?


pkg/ccl/changefeedccl/parallel_io.go line 77 at r1 (raw file):

// Close stops all workers immediately and returns once they shut down. Inflight
// requests sent to requestCh may never result in being sent to resultCh.
func (pe *parallelIO) Close() {

what does e stand for in pe?


pkg/ccl/changefeedccl/parallel_io.go line 100 at r1 (raw file):

	for i := 0; i < numEmitWorkers; i++ {
		pe.wg.GoCtx(func(ctx context.Context) error {

I wonder... one of the reasons why e.g. event processing used fixed size parallelism is because it had to do it to ensure correct ordering...
Do we need to do it here? We keep track of inflight set of keys. Is that sufficient to ensure ordering? Do we still need to have
fixed size worker pool? Could we use something else (perhaps variable pool?)


pkg/ccl/changefeedccl/parallel_io.go line 104 at r1 (raw file):

				err := emitWithRetries(ctx, req)
				if err != nil {
					req.SetError(err)

It's not clear to me (yet) why we need SetError at all? We have tried to do IO, we have tried to do
it with retries. What else can we do for this request? Wouldn't it be correct to just bail out here?
Return this error, have the whole thing torn down?


pkg/ccl/changefeedccl/parallel_io.go line 120 at r1 (raw file):

					case emitSuccessCh <- req:
					}
				}

Would be nice to avoid almost duplication:

var resultCh chan<- IORequest
if err == nil {
   resultCh = emitSuccessCh
} else {
   req.SetError(err)
   resultCh = pe.resultCh
}
select {
  ...
  case resultCh <- req:
}

pkg/ccl/changefeedccl/parallel_io.go line 126 at r1 (raw file):

	}

	var handleSuccess func(IORequest)

no need to declare this here?


pkg/ccl/changefeedccl/parallel_io.go line 129 at r1 (raw file):

	var pendingResults []IORequest

	sendToWorker := func(ctx context.Context, req IORequest) {

nit: would something like submitIO be a better name? maybe startIO?


pkg/ccl/changefeedccl/parallel_io.go line 159 at r1 (raw file):

		var stillPending = pending[:0] // Reuse underlying space
		for _, pendingReq := range pending {
			// If no intersection, nothing changed for this request's validity

nit: . at the end of a comment? (below as well?)


pkg/ccl/changefeedccl/parallel_io.go line 160 at r1 (raw file):

		for _, pendingReq := range pending {
			// If no intersection, nothing changed for this request's validity
			if !req.Keys().Intersects(pendingReq.Keys()) {

Do we need this check? We've cleared req.Keys from inflight; wouldn't the inflight check below be sufficient?


pkg/ccl/changefeedccl/parallel_io.go line 166 at r1 (raw file):

			// If it is now free to send, send it
			if !inflight.Intersects(pendingReq.Keys()) {

nit: perhaps swap conditions to make it positive?


pkg/ccl/changefeedccl/parallel_io.go line 193 at r1 (raw file):

		pendingResults = nil
		for _, res := range unhandled {
			handleSuccess(res)

same comment as sendWorker below: I think this function should return error.


pkg/ccl/changefeedccl/parallel_io.go line 198 at r1 (raw file):

		select {
		case req := <-pe.requestCh:
			if !inflight.Intersects(req.Keys()) {

nit: do you think flipping if conditions (if inflightIntersects {} else {}) would improve readability?
I personally find it easier to read if I don't have to negate things.


pkg/ccl/changefeedccl/parallel_io.go line 200 at r1 (raw file):

			if !inflight.Intersects(req.Keys()) {
				inflight.UnionWith(req.Keys())
				sendToWorker(ctx, req)

looking at sendToWorker function -- would it be better to have it return error when ctx.Done()?
I know it's a bit more typing, but relying on this loop to terminate might not be ideal.
We will be processing previous unhandled results, right? And those submit more stuff to other channels?
Wouldn't it be better to just return an error right away?


pkg/ccl/changefeedccl/changefeedbase/settings.go line 290 at r1 (raw file):

	"if enabled, this setting enables a new implementation of the webhook sink"+
		" that allows for a much higher throughput",
	util.ConstantWithMetamorphicTestBool("changefeed.new_webhook_sink_enabled", false),

should this be an env variable instead? Every setting we add, will have to be retired as well...


pkg/ccl/changefeedccl/changefeedbase/settings.go line 295 at r1 (raw file):

var SinkParallelism = settings.RegisterIntSetting(
	settings.TenantWritable,
	"changefeed.sink_parallelism",

would changefeed.sink.io_workers be a better name?


pkg/cmd/roachtest/tests/cdc.go line 1228 at r1 (raw file):

			ct.runTPCCWorkload(tpccArgs{warehouses: 100, duration: "30m"})

			if _, err := ct.DB().Exec("SET CLUSTER SETTING changefeed.new_webhook_sink_enabled = true;"); err != nil {

Why not keep it metamorphic?


pkg/ccl/changefeedccl/helpers_test.go line 990 at r1 (raw file):

	sinkType := randomSinkTypeWithOptions(options)
	if sinkType == "" {
		return

that means what? sinkless?


pkg/ccl/changefeedccl/sink_webhook_test.go line 601 at r1 (raw file):

		appendCount := 0
		batchingSink.knobs.OnAppend = func(event *kvEvent) {
			appendCount += 1

is this thread safe?

Contributor

@miretskiy miretskiy left a comment


Sending first set of comments... There is a lot of code here, so I'll definitely need to spend more time on this.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @samiskin)

@samiskin samiskin force-pushed the sink-refactor-v2 branch 5 times, most recently from 288c6d0 to 82fbf7b on March 22, 2023 18:12
Contributor Author

@samiskin samiskin left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy)


pkg/ccl/changefeedccl/batching_sink.go line 69 at r1 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

you want pools to be global.
make it so, plus add helpers to get objects of appropriate type.

Done.


pkg/ccl/changefeedccl/batching_sink.go line 194 at r1 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

curious why we need to store it then, instead of returning it from FinalizePayload? Not sure, but, I'll keep reading.

Done.


pkg/ccl/changefeedccl/batching_sink.go line 198 at r1 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

keys deserves a comment.

Done.


pkg/ccl/changefeedccl/batching_sink.go line 294 at r1 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

perhaps we should be a bit more explicit (and 1 line shorter):

toFlush := batchBuffer
batchBuffer = bs.newBatchBuffer()

Done.


pkg/ccl/changefeedccl/batching_sink.go line 305 at r1 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

I find it hard to reason about functions that seem to swallow error (ctx.Err() in this case).
Yes, I understand that this checking happens somewhere else, but -- that's what I mean by "hard to reason" -- I have
to look elsewhere to convince myself that this is correct. Wouldn't it be better to just have this function return an error?

Done.


pkg/ccl/changefeedccl/parallel_io.go line 26 at r1 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

what about SetError on batches that were already in flight?

Moved back to just ioResult


pkg/ccl/changefeedccl/parallel_io.go line 77 at r1 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

what does e stand for in pe?

leftover from an old name, moved to p


pkg/ccl/changefeedccl/parallel_io.go line 100 at r1 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

I wonder... one of the reasons why e.g. event processing used fixed size parallelism is because it had to do it to ensure correct ordering...
Do we need to do it here? We keep track of inflight set of keys. Is that sufficient to ensure ordering? Do we still need to have
fixed size worker pool? Could we use something else (perhaps variable pool?)

We could definitely have a variable pool here but for now I'm going to leave it as a fixed size.


pkg/ccl/changefeedccl/parallel_io.go line 104 at r1 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

It's not clear to me (yet) why we need SetError at all? We have tried to do IO, we have tried to do
it with retries. What else can we do for this request? Wouldn't it be correct to just bail out here?
Return this error, have the whole thing torn down?

Moved to tearing it all down. I was concerned about new emits blocking on a torn-down IO, but just realized that can't happen since emitters are expected to handle results at the same time.


pkg/ccl/changefeedccl/parallel_io.go line 120 at r1 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

Would be nice to avoid almost duplication:

var resultCh chan<- IORequest
if err == nil {
   resultCh = emitSuccessCh
} else {
   req.SetError(err)
   resultCh = pe.resultCh
}
select {
  ...
  case resultCh <- req:
}

Became unnecessary after moving back to ioResult


pkg/ccl/changefeedccl/parallel_io.go line 126 at r1 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

no need to declare this here?

whoops, used to be required but no longer


pkg/ccl/changefeedccl/parallel_io.go line 129 at r1 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

nit: would something like submitIO be a better name? maybe startIO?

moved to submitIO


pkg/ccl/changefeedccl/parallel_io.go line 193 at r1 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

same comment as sendWorker below: I think this function should return error.

Done.


pkg/ccl/changefeedccl/parallel_io.go line 200 at r1 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

looking at sendToWorker function -- would it be better to have it return error when ctx.Done()?
I know it's a bit more typing, but relying on this loop to terminate might not be ideal.
We will be processing previous unhandled results, right? And those submit more stuff to other channels?
Wouldn't it be better to just return an error right away?

Done.


pkg/ccl/changefeedccl/changefeedbase/settings.go line 290 at r1 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

should this be an env variable instead? Every setting we add, will have to be retired as well...

I think its good to at least have the setting be a little discoverable for customers that do want a faster webhook sink and have it be easy for them to try out.


pkg/ccl/changefeedccl/changefeedbase/settings.go line 292 at r1 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

comments on exported settings.

Done.


pkg/ccl/changefeedccl/changefeedbase/settings.go line 295 at r1 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

would changefeed.sink.io_workers be a better name?

Done.


pkg/cmd/roachtest/tests/cdc.go line 1228 at r1 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

Why not keep it metamorphic?

The old sink ends up being too slow to run in time for 100 warehouses


pkg/ccl/changefeedccl/helpers_test.go line 990 at r1 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

that means what? sinkless?

It just means "skip this test because there are no sinks that match what's requested". It doesn't occur at all normally, but when I want to run every test on a single sink and, for example, set the webhook weight to 1 and the rest to 0, then for tests that explicitly disable webhook all sinks would have a weight of 0, so any sink could be selected.


pkg/ccl/changefeedccl/sink_webhook_test.go line 601 at r1 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

is this thread safe?

Done.

@samiskin samiskin marked this pull request as ready for review March 22, 2023 18:12
@samiskin samiskin requested review from a team as code owners March 22, 2023 18:12
@samiskin samiskin requested review from smg260, renatolabs, jayshrivastava, miretskiy and a team and removed request for a team March 22, 2023 18:12
@samiskin samiskin force-pushed the sink-refactor-v2 branch 2 times, most recently from f1610df to 80e8a47 on March 23, 2023 17:02
Contributor

@miretskiy miretskiy left a comment


Reviewed 1 of 4 files at r2, 9 of 14 files at r4, 1 of 3 files at r5, 1 of 5 files at r6.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jayshrivastava, @renatolabs, @samiskin, and @smg260)


-- commits line 35 at r5:
Thanks for renaming this to buffer in the code. I think your commit example needs to be updated to reflect that.


pkg/ccl/changefeedccl/batching_sink.go line 37 at r5 (raw file):

// BatchBuffer is an interface to aggregate KVs into a payload that can be sent
// to the sink.

nit: let's expand the comment just a bit -- just like you have in your PR description.
In particular, I'm looking for explanation why we need close() method. Perhaps few examples on how close method might be usable.


pkg/ccl/changefeedccl/batching_sink.go line 97 at r5 (raw file):

	flushWaiter := make(chan struct{})
	select {
	case <-ctx.Done():

why not return ctx.Err()?
Again, I see that we are doing it below; but that's a) more work for the reader, and b) you're checking things twice.
It's a lot more idiomatic to just see return Err.


pkg/ccl/changefeedccl/batching_sink.go line 117 at r5 (raw file):

// therefore escape to the heap) can both be incredibly frequent (every event
// may be its own batch) and temporary, so to avoid GC thrashing they are both
// claimed and freed from object pools.

❤️ this comment.


pkg/ccl/changefeedccl/batching_sink.go line 199 at r5 (raw file):

	close(s.doneCh)
	_ = s.wg.Wait()
	if s.pacer != nil {

safe to call close on nil pacer.


pkg/ccl/changefeedccl/batching_sink.go line 322 at r5 (raw file):

		for {
			select {
			case <-ctx.Done():

please return ctx.Err here.


pkg/ccl/changefeedccl/batching_sink.go line 365 at r5 (raw file):

	for {
		if s.pacer != nil {

Pretty sure it's safe to call Pace on nil pacer.


pkg/ccl/changefeedccl/batching_sink.go line 382 at r5 (raw file):

					tryFlushBatch()
				}
			} else if event, isKV := req.(*rowEvent); isKV {

do you think this if/else is better with:

switch r := req.(type) {
case flushReq:
   ...
case *rowEvent:
   ...
default:
   return errors.AssertionFailedf("unexpected request type %T", r)
}

pkg/ccl/changefeedccl/parallel_io.go line 106 at r5 (raw file):

}

func (p *parallelIO) runWorkers(ctx context.Context, numEmitWorkers int) error {

super tiny nit: I don't think this function should be named run workers.
It does a lot more than that -- and running numEmitWorkers is only 1 part of what it does.
Perhaps processIO or handleIO might be better name.

Regardless of the chosen name, I think this function needs a comment. I'm looking at
a high level outline -- perhaps some ascii art type diagram as to what this function does.
emit chan, N workers, read from request channel, handling of the key collisions.
Etc.


pkg/ccl/changefeedccl/parallel_io.go line 140 at r5 (raw file):

	var pendingResults []*ioResult

I think the name is descriptive; but a small comment would be nice.


pkg/ccl/changefeedccl/parallel_io.go line 163 at r5 (raw file):

	// in a Queue to be sent to IO workers once the conflicting requests complete.
	var inflight intsets.Fast
	var pending []IORequest

Let's make things a bit more observable. I think we should add a timestamp when an IORequest was added to this queue. When we finally pop the queue, we should record a metric, or at the very least log something periodically if some batch was queued for too long (or both).

We should also have a metric Gauge on the number of inflight keys.


pkg/ccl/changefeedccl/parallel_io.go line 199 at r5 (raw file):

	// A set of keys can be sent immediately if no yet-to-be-handled request
	// observed so far shares any of those keys.

sounds like... this function is exactly the opposite of its name?
true means you cannot send ?


pkg/ccl/changefeedccl/sink.go line 726 at r5 (raw file):

	Max     jsonMaxRetries `json:",omitempty"`
	Backoff jsonDuration   `json:",omitempty"`
}

No good deed goes unpunished.... Those structs and in particular their parsing are completely under-tested (if at all). I would appreciate a small unit test parsing few valid/invalid configs.
(jsonMaxRetries as well)


pkg/ccl/changefeedccl/sink.go line 811 at r5 (raw file):

	}
	if idealNumber > 32 {
		return 32

have you tried with 16? is the performance worse?


pkg/ccl/changefeedccl/sink.go line 816 at r5 (raw file):

}

func sinkPacer(ctx context.Context, cfg *execinfra.ServerConfig) *admission.Pacer {

nit: better name? newPacer, newCPUPacer?


pkg/ccl/changefeedccl/sink.go line 818 at r5 (raw file):

func sinkPacer(ctx context.Context, cfg *execinfra.ServerConfig) *admission.Pacer {
	pacerRequestUnit := changefeedbase.SinkPacerRequestSize.Get(&cfg.Settings.SV)
	enablePacer := changefeedbase.PerEventElasticCPUControlEnabled.Get(&cfg.Settings.SV)

do we want to react to settings changes?
I think we do, and I think it might be pretty easy to do so.
Instead of passing pacer to webhook, pass admission pacer factory.
You already have a flushChannel in the batch worker; reset pacer whenever settings change.


pkg/cmd/roachtest/tests/cdc.go line 1228 at r1 (raw file):

Previously, samiskin (Shiranka Miskin) wrote…

The old sink ends up being too slow to run in time for 100 warehouses

Ack.


pkg/cmd/roachtest/tests/cdc.go line 192 at r6 (raw file):

		params := sinkDestHost.Query()
		params.Set("insecure_tls_skip_verify", "true")

Why are we no longer using certs?

Resolves cockroachdb#84676
Epic: https://cockroachlabs.atlassian.net/browse/CRDB-11356

This PR implements the Webhook sink as part of a more general
`batchingSink` framework that can be used to make adding new sinks an
easier process, making it far more performant than it was previously.

A followup PR will be made to use the `batchingSink` for the pubsub
client which also suffers performance issues.

---

Sink-specific code is encapsulated in a SinkClient interface

```go
type SinkClient interface {
        MakeResolvedPayload(body []byte, topic string) (SinkPayload, error)
        MakeBatchBuffer() BatchBuffer
        Flush(context.Context, SinkPayload) error
        Close() error
}

type BatchBuffer interface {
        Append(key []byte, value []byte, topic string)
        ShouldFlush() bool
        Close() (SinkPayload, error)
}

type SinkPayload interface{}
```

Once the Batch is ready to be Flushed, the buffer can be `Close()`'d to
do any final formatting (ex: wrap in a json object with extra metadata)
of the buffer-able data and obtain a final `SinkPayload` that is ready
to be passed to `SinkClient.Flush`.

The `SinkClient` has a separate `MakeResolvedPayload` since the sink may
require resolved events be formatted differently to a batch of kvs.

`Flush(ctx, payload)` encapsulates sending a blocking IO request to the
sink endpoint, and may be called multiple times with the same payload
due to retries.  Any kind of formatting work should be served to run in
the buffer's `Close` and stored as a `SinkPayload` to avoid multiple
calls to `Flush` repeating work upon retries.

---

The `batchingSink` handles all the logic to take a SinkClient and form a
full Sink implementation.

```go
type batchingSink struct {
        client             SinkClient
        ioWorkers          int
        minFlushFrequency  time.Duration
        retryOpts          retry.Options
        eventCh            chan interface{}
        pacer              *admission.Pacer
        ...
}

var _ Sink = (*batchingSink)(nil)
```

It involves a single goroutine which handles:
- Creating, building up, and finalizing `BatchBuffer`s to eventually
form a `SinkPayload` to emit
- Flushing batches when they have persisted longer than a configured
`minFlushFrequency`
- Flushing deliberately and being able to block until the Flush has completed
- Logging all the various sink metrics

`EmitRow` calls are thread-safe therefore the use of the `safeSink` wrapper is
not required for users of this sink.

Events sent through the goroutines would normally need to exist on the
heap, but to avoid excessive garbage collection of hundreds of thousands
of tiny structs, both the `kvEvents{<data from EmitRow>}` events (sent from the
EmitRow caller to the batching worker) and the `sinkBatchBuffer{<data
about the batch>}` events (sent from the batching worker to the IO
routine in the next section) are allocated on object pools.

---

For a sink like Cloudstorage where there are large batches, doing the
above and just one-by-one flushing the batch payloads on a separate
routine is plenty good enough.  Unfortunately the Webhook sink can be
used with no batching at all with users wanting the lowest latency while
still having good throughput.  This means we need to be able to have
multiple requests in flight.  The difficulty here is if a batch with
keys [a1,b1] is in flight, a batch with keys [b2,c1] needs to block
until [a1,b1] completes as b2 cannot be sent and risk arriving at the
destination prior to b1.

Flushing out Payloads in a way that is both able to maintain
key-ordering guarantees but is able to run in parallel is done by a
separate `parallel_io` struct.

```go
type parallelIO struct {
	retryOpts retry.Options
	ioHandler IOHandler
	requestCh chan IORequest
	resultCh  chan *ioResult
  ...
}

type IOHandler func(context.Context, IORequest) error

type IORequest interface {
	Keys() intsets.Fast
}

type ioResult struct {
	request IORequest
	err     error
}
```

It involves one goroutine to manage the key ordering guarantees and a
configurable number of IO Worker goroutines that simply call `ioHandler`
on an `IORequest`.

IORequests represent the keys they shouldn't conflict on by providing a
`intsets.Fast` struct, which allows for efficient
Union/Intersects/Difference operations on them that `parallelIO` needs
to maintain ordering guarantees.

The request and its error (if one occurred despite the retries) are returned on
resultCh.

---

The webhook sink is therefore formed by:
1. EmitRow is called, creating kvEvents that are sent to a Batching worker
2. The batching worker takes events and appends them to a batch
3. Once the batch is full, it's encoded into an HTTP request
4. The request object is then sharded across a set of IO workers to be
   fully sent out in parallel with other non-key-conflicting requests.

With this setup, looking at the CPU flamegraph, at high throughputs most
of the `batchingSink`/`parallelIO` work didn't really show up much, the
work was largely just step 3, where taking a list of messages and
calling `json.Marshal` on it took almost 10% of the time, specifically a
call to `json.Compress`.

Since this isn't needed, and all we're doing is simply putting a list of
already-formatted JSON messages into a surrounding JSON array and small
object, I also swapped `json.Marshal` to just stitch together characters
manually into a buffer.

---

Since Matt's talked about a new significance being placed on Feature
flagging new work to avoid need for technical advisories, I placed this
new implementation under the changefeed.new_webhook_sink_enabled setting
and defaulted it to be disabled.

---

Release note (performance improvement): the webhook sink is now able to
handle a drastically higher maximum throughput by enabling the
"changefeed.new_webhook_sink_enabled" cluster setting.
@samiskin
Contributor Author

bors r+

@craig
Contributor

craig bot commented Mar 31, 2023

Build succeeded:

@craig craig bot merged commit 7b0fffc into cockroachdb:master Mar 31, 2023
@blathers-crl

blathers-crl bot commented Mar 31, 2023

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from 8cb7444 to blathers/backport-release-23.1-99086: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 23.1.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

samiskin added a commit to samiskin/cockroach that referenced this pull request Apr 4, 2023
Epic: https://cockroachlabs.atlassian.net/browse/CRDB-13237

This change is a followup to
cockroachdb#99086 which moves the Pubsub sink
to the batching sink framework.

The changes involve:
1. Moves the Pubsub code to match the `SinkClient` interface, moving to using
the lower level v1 pubsub API that lets us publish batches manually
2. Removing the extra call to json.Marshal
3. Moving to using the `pstest` package for validating results in unit tests
4. Adding topic handling to the batching sink, where batches are created
per-topic
5. Added a pubsub_sink_config since it can now handle Retry and Flush config
settings

Release note (performance improvement): pubsub sink changefeeds can now support
higher throughputs by enabling the changefeed.new_pubsub_sink_enabled cluster
setting.
@samiskin
Contributor Author

samiskin commented Apr 4, 2023

blathers backport 23.1

samiskin added a commit to samiskin/cockroach that referenced this pull request Apr 5, 2023
samiskin added a commit to samiskin/cockroach that referenced this pull request Apr 6, 2023
craig bot pushed a commit that referenced this pull request Apr 7, 2023
99663: sql: update connExecutor logic for pausable portals r=ZhouXing19 a=ZhouXing19

This PR replaces #96358 and is part of the initial implementation of multiple active portals.

----

This PR is to add limited support for multiple active portals. Now portals satisfying all following restrictions can be paused and resumed (i.e., with other queries interleaving it):

1. Not an internal query;
2. Read-only query;
3. No sub-queries or post-queries.

And such a portal will only have the statement executed with a _non-distributed_ plan. 

This feature is gated by a session variable `multiple_active_portals_enabled`. When it's set `true`, all portals that satisfy the restrictions above will automatically become "pausable" when being created via the pgwire `Bind` stmt. 

The core idea of this implementation is 
1. Add a `switchToAnotherPortal` status to the result-consumption state machine. When we receive an `ExecPortal` message for a different portal, we simply return the control to the connExecutor. (#99052)
2. Persist `flow` `queryID` `span` and `instrumentationHelper` for the portal, and reuse it when we re-execute a portal. This is to ensure we _continue_ the fetching rather than starting all over. (#99173)
3. To enable 2, we need to delay the clean-up of resources till we close the portal. For this we introduced the stacks of cleanup functions. (This PR)

Note that we kept the implementation of the original "un-pausable" portal, as we'd like to limit this new functionality only to a small set of statements. Eventually some of them should be replaced (e.g. the limitedCommandResult's lifecycle) with the new code. 

Also, we don't support distributed plan yet, as it involves much more complicated changes. See `Start with an entirely local plan` section in the [design doc](https://docs.google.com/document/d/1SpKTrTqc4AlGWBqBNgmyXfTweUUsrlqIaSkmaXpznA8/edit). Support for this will come as a follow-up.

Epic: CRDB-17622

Release note (sql change): initial support for multiple active portals. Now with session variable `multiple_active_portals_enabled` set to true,  portals satisfying all following restrictions can be executed in an interleaving manner:  1. Not an internal query; 2. Read-only query; 3. No sub-queries or post-queries. And such a portal will only have the statement executed with an entirely local plan. 





99947: ui: small fixes to DB Console charts shown for secondary tenants r=dhartunian a=abarganier

#97995 updated the
DB Console to filter out KV-specific charts from the metrics page
when viewing DB Console as a secondary application tenant.

The PR missed a couple small details. This patch cleans those
up with the following:

- Removes KV latency charts for app tenants
- Adds a single storage graph for app tenants showing livebytes
- Removes the "Capacity" chart on the Overview dashboard for app
  tenants

Release note: none

Epic: https://cockroachlabs.atlassian.net/browse/CRDB-12100

NB: Please only review the final commit. 1st commit is being reviewed separately @ #99860

100188: changefeedccl: pubsub sink refactor to batching sink r=rickystewart a=samiskin

Epic: https://cockroachlabs.atlassian.net/browse/CRDB-13237

This change is a followup to #99086 which moves the Pubsub sink to the batching sink framework.

The changes involve:
1. Moves the Pubsub code to match the `SinkClient` interface, moving to using the lower level v1 pubsub API that lets us publish batches manually
2. Removing the extra call to json.Marshal
3. Moving to using the `pstest` package for validating results in unit tests
4. Adding topic handling to the batching sink, where batches are created per-topic
5. Added a pubsub_sink_config since it can now handle Retry and Flush config settings
6. Added metrics even to the old pubsub for the purpose of comparing the two versions

At default settings, this resulted in a peak of 90k messages per second on a single node with throughput at 27.6% cpu usage, putting it at a similar level to kafka.

Running pubsub v2 across all of TPCC (nodes ran out of ranges at different speeds):
<img width="637" alt="Screenshot 2023-03-30 at 3 38 25 PM" src="https://user-images.githubusercontent.com/6236424/229863386-edaee27d-9762-4806-bab6-e18b8a6169d6.png">

Running pubsub v1 (barely visible, 2k messages per second) followed by v2 on tpcc.order_line (in v2 only 2 nodes ended up having ranges assigned to them):
<img width="642" alt="Screenshot 2023-04-04 at 12 53 45 PM" src="https://user-images.githubusercontent.com/6236424/229863507-1883ea45-d8ce-437b-9b9c-550afec68752.png">

In the following graphs from the cloud console, where v1 was ran followed by v2, you can see how the main reason v1 was slow was that it wasn't able to batch different keys together.
<img width="574" alt="Screenshot 2023-04-04 at 12 59 51 PM" src="https://user-images.githubusercontent.com/6236424/229864083-758c0814-d53c-447e-84c3-471cf5d56c44.png">

Publish requests remained the same despite way more messages in v2
<img width="1150" alt="Screenshot 2023-04-04 at 1 46 51 PM" src="https://user-images.githubusercontent.com/6236424/229875314-6e07177e-62c4-4c15-b13f-f75e8143e011.png">



Release note (performance improvement): pubsub sink changefeeds can now support higher throughputs by enabling the changefeed.new_pubsub_sink_enabled cluster setting.

100620: pkg/server: move DataDistribution to systemAdminServer r=dhartunian a=abarganier

The DataDistribution endpoint reports replica counts by database and table. When it was built, it operated off the assumption that a range would only ever contain a single table's data within.

Now that we have coalesced ranges, a single range can span multiple tables. Unfortunately, the DataDistribution endpoint does not take this fact into account, meaning it reports garbled and inaccurate data, unless the `spanconfig.storage_coalesce_adjacent.enabled` setting is set to false (see #98820).

For secondary tenants, ranges are *always* coalesced, so this endpoint in its current state could never report meaningful data for a tenant.

Given all of this, we have decided to make this endpoint only available for the system tenant. This patch
accomplishes this by moving the endpoint away from the adminServer and into the systemAdminServer, making it effectively unimplemented for secondary tenants.

Release note: none

Informs: #97942

Co-authored-by: Jane Xing <zhouxing@uchicago.edu>
Co-authored-by: Alex Barganier <abarganier@cockroachlabs.com>
Co-authored-by: Shiranka Miskin <shiranka.miskin@gmail.com>
blathers-crl bot pushed a commit that referenced this pull request Apr 7, 2023
@shermanCRL
Contributor

@samiskin do we have benchmarks for before & after? I’d be interested in both throughput and CPU.

Labels
backport-23.1.x Flags PRs that need to be backported to 23.1
Successfully merging this pull request may close these issues.

changefeedccl: Webhook sink is slow