
Add batching for traces #1554

Merged · 1 commit · Aug 29, 2022

Conversation

@niksajakovljevic (Contributor) commented Aug 10, 2022

Batching helps achieve better ingest performance, especially when traces are sent one by one.

A batch is flushed when one of two conditions is met: the batch reaches its maximum size or the batch timeout elapses (sketched below, after the flag list).

This PR also adds async-ack support for traces, meaning the client doesn't need to wait for the DB write. This increases ingest performance at a small risk of data loss.

Added 3 new CLI flags:

  • Two flags to control batching: tracing.max-batch-size and tracing.batch-timeout.
  • One flag for async writes: tracing.async-acks.
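
For illustration, a minimal sketch of the size-or-timeout flush logic described above; `batchLoop`, `flush`, and the string items standing in for spans are placeholders, not the PR's actual identifiers:

```go
package main

import (
	"fmt"
	"time"
)

// batchLoop flushes a batch when it reaches maxBatchSize or when
// batchTimeout elapses, whichever comes first.
func batchLoop(in <-chan string, maxBatchSize int, batchTimeout time.Duration, flush func([]string)) {
	batch := make([]string, 0, maxBatchSize)
	ticker := time.NewTicker(batchTimeout)
	defer ticker.Stop()
	for {
		select {
		case item, ok := <-in:
			if !ok { // input closed: flush the remainder and stop
				if len(batch) > 0 {
					flush(batch)
				}
				return
			}
			batch = append(batch, item)
			if len(batch) == maxBatchSize { // condition 1: batch is full
				flush(batch)
				batch = make([]string, 0, maxBatchSize)
				ticker.Reset(batchTimeout) // timeout now counts from this flush
			}
		case <-ticker.C: // condition 2: timeout reached
			if len(batch) > 0 {
				flush(batch)
				batch = make([]string, 0, maxBatchSize)
			}
		}
	}
}

func main() {
	in := make(chan string, 8)
	for i := 0; i < 5; i++ {
		in <- fmt.Sprintf("span-%d", i)
	}
	close(in)
	batchLoop(in, 2, 100*time.Millisecond, func(b []string) { fmt.Println("flushed", b) })
}
```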

@niksajakovljevic added the epic/jaeger-grpc-write (Jaeger gRPC based write integration) label on Aug 10, 2022
@niksajakovljevic requested a review from a team on August 10, 2022 11:44
@niksajakovljevic self-assigned this on Aug 10, 2022
@niksajakovljevic removed the request for review from a team on August 10, 2022 11:45
@niksajakovljevic force-pushed the niksa/batch-traces branch 5 times, most recently from ca3872f to 6853fd1 on August 10, 2022 13:45
@niksajakovljevic marked this pull request as ready for review on August 10, 2022 13:46
@niksajakovljevic requested review from a team as code owners on August 10, 2022 13:46
@niksajakovljevic force-pushed the niksa/batch-traces branch 2 times, most recently from a7546c1 to 149872d on August 10, 2022 16:53
@niksajakovljevic (Contributor, Author) commented:

The local benchmarks I've added indicate that batching should provide better performance; however, I still need to run a full-blown benchmark with more realistic data loads.

	defaultMaxBufferedBatches = 200 // arbitrarily picked; we want to avoid waiting on batches
)

var defaultBatchWorkers = runtime.NumCPU() / 2 // we only take half so the other half can be used for writers
Member:

Since this is primarily IO driven, is it possible to size this based on the number of DB connections instead of the number of CPUs?

@niksajakovljevic (Contributor, Author):

If you look closer, batching and writing batches are separated, so creating batches is mostly CPU driven: batches are only written to a channel and picked up by batch writers.
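
For illustration, a minimal sketch of that separation, with invented names (`span`, `bufferedBatches`, the writer pool): the batcher only hands finished batches to a buffered channel, and IO-bound writer goroutines drain it:

```go
package main

import (
	"fmt"
	"sync"
)

type span struct{ traceID string }

func main() {
	// The channel decouples CPU-bound batch assembly from IO-bound writes;
	// the buffer lets batchers keep going while writers are busy.
	bufferedBatches := make(chan []span, 4)

	var wg sync.WaitGroup
	for w := 0; w < 2; w++ { // writer pool: blocks on the database
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for batch := range bufferedBatches {
				fmt.Printf("writer %d wrote batch of %d spans\n", id, len(batch)) // stand-in for the DB write
			}
		}(w)
	}

	// Batcher side: assembling a batch is pure CPU work; the only wait
	// is the channel send, and only when all writers are behind.
	for i := 0; i < 3; i++ {
		bufferedBatches <- []span{{traceID: fmt.Sprint(i)}}
	}
	close(bufferedBatches)
	wg.Wait()
}
```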

@Harkishen-Singh (Member) left a comment:

Would be good to see the perf difference with and without batching in the case of Jaeger ingestion.

Comment on lines 154 to 175:

case <-ticker.C:
	batcherSpan.AddEvent("Batch timeout reached")
	if !batch.isEmpty() {
		batch = flushBatch(batch)
	}
@Harkishen-Singh (Member):

This won't work as expected. It will tick even if, the very previous moment, we took the item := <-b.in case and flushed the batch. This is because the ticker is created just once, so it doesn't take the most recent event into account.

We should use <-time.After(config.BatchTimeout) in this case.

@niksajakovljevic (Contributor, Author):

Not sure that's true. If an item is received and the batch is full, we flush it and reset the ticker. If the batch is not full, we can still reach the timeout and flush it, which is what we want. Does it make sense now?
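
For illustration, a tiny standard-library demo of the point being made: resetting the ticker after a size-based flush restarts the timeout, so a stale tick cannot fire immediately after a flush (timings here are arbitrary):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	const timeout = 50 * time.Millisecond
	ticker := time.NewTicker(timeout)
	defer ticker.Stop()

	time.Sleep(40 * time.Millisecond) // pretend a size-based flush happens here
	ticker.Reset(timeout)             // without this, the ticker would fire ~10ms later

	select {
	case <-ticker.C:
		fmt.Println("premature tick") // happens if the Reset above is removed
	case <-time.After(45 * time.Millisecond):
		fmt.Println("no premature tick after reset") // this branch wins
	}
}
```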

@niksajakovljevic force-pushed the niksa/batch-traces branch 3 times, most recently from 2500299 to 07dbd59 on August 16, 2022 15:07
@niksajakovljevic (Contributor, Author) commented:

> Would be good to see the perf difference with and without batching in the case of Jaeger ingestion.

Yes, I will soon publish benchmark numbers in this PR.

@Harkishen-Singh (Member) left a comment:

Overall looks good. However, some changes are needed on the metrics side.

b.bufferedBatches <- batchCp
batcherSpan.AddEvent("Batch sent to buffer")
batcherSpan.AddEvent("New Batch")
return NewBatch(b.config.MaxBatchSize)
Member:

No strong suggestion. My aim was more towards readability. But let's ignore this.

@cevian (Contributor) left a comment:

Overall looks good but needs some changes.

)

const (
	defaultReqBufferSize = 100000 // buffer for incoming requests, especially important for async acks
@cevian (Contributor):

Can't these defaults be derived from other defaults?

@niksajakovljevic (Contributor, Author):

Hmm... not sure about this one. My idea was to have enough buffer for spikes. 100K might be too much; in the recent benchmark runs I did, I never saw this go above 1K. Maybe we can set it to 5K for now and tune as we go. Frankly, fine-tuning these requires time and a lot of benchmark runs... Please let me know if you have a better idea.

@niksajakovljevic (Contributor, Author):

I actually tweaked this to be MaxBatchSize * 3. I also split incoming requests into multiple queues (we have a separate queue for each batcher). I added a separate commit for easier review.
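
For illustration, a sketch of the sizing just described, with placeholder names and a stand-in request type instead of the real trace payload:

```go
package main

import "fmt"

type traceReq struct{} // stand-in for an incoming trace request

func main() {
	const (
		numBatchers  = 4
		maxBatchSize = 5000
	)
	// One input queue per batcher; capacity derived from the batch size
	// (3x, per the comment above) rather than one large fixed shared buffer.
	queues := make([]chan traceReq, numBatchers)
	for i := range queues {
		queues[i] = make(chan traceReq, 3*maxBatchSize)
	}
	fmt.Printf("%d queues, %d buffered requests each\n", len(queues), cap(queues[0]))
}
```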

@Harkishen-Singh (Member) left a comment:

One nit remaining, otherwise LGTM 👍🏻

// If it's only one span we shard by its TraceID so spans with the same TraceID end up in the same batcher.
// Otherwise we round-robin between batchers.
func (td *Dispatcher) getBatcherIdx(ctx context.Context, traces ptrace.Traces) (int, error) {
	numberOfBatchers := td.batcher.config.Batchers
@cevian (Contributor):

I don't love that the dispatcher needs to care about the number of batchers, but this may be unavoidable...
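
For illustration, a sketch of the dispatch rule in the snippet above, with invented names and TraceIDs modeled as raw 16-byte arrays (the real code works on `ptrace.Traces`):

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sync/atomic"
)

var rr uint64 // round-robin cursor for multi-span requests

// batcherIdx shards a request carrying a single span by its TraceID so all
// spans of that trace reach the same batcher; everything else is spread
// round-robin.
func batcherIdx(spanTraceIDs [][16]byte, numBatchers int) int {
	if len(spanTraceIDs) == 1 {
		h := fnv.New32a()
		h.Write(spanTraceIDs[0][:])
		return int(h.Sum32() % uint32(numBatchers))
	}
	return int(atomic.AddUint64(&rr, 1) % uint64(numBatchers))
}

func main() {
	id := [16]byte{0xab}
	fmt.Println(batcherIdx([][16]byte{id}, 4))      // deterministic per TraceID
	fmt.Println(batcherIdx(make([][16]byte, 2), 4)) // round-robin
	fmt.Println(batcherIdx(make([][16]byte, 2), 4)) // advances
}
```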

The merged commit message:

Batching helps achieve better ingest performance, especially when
traces are sent one by one (which is the case for the Jaeger collector).

A batch is flushed either on timeout or when full.

Adds async support for traces, meaning the client doesn't need to wait
for the DB write. This increases ingest performance at a small risk of
data loss. New CLI flag `tracing.async-acks` added.

Flags to control batching: `tracing.max-batch-size` and `tracing.batch-timeout`.
Flag to control batch workers: `tracing.batch-workers`.
Labels: epic/jaeger-grpc-write (Jaeger gRPC based write integration)
5 participants