The local benchmarks that I've added indicate that batching should provide better performance; however, I still need to run a full-blown benchmark with more realistic data loads.
	defaultMaxBufferedBatches = 200 // arbitrarily picked; we want to avoid waiting on batches
)

var defaultBatchWorkers = runtime.NumCPU() / 2 // we only take half so the other half can be used for writers
Since this is primarily IO driven, is it possible to size this based on the number of DB connections instead of the number of CPUs?
If you look closer, batching and writing batches are separated, so creating batches is mostly CPU driven: batches are only written to a channel and picked up by batch writers. See the sketch below.
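To make that separation concrete, here is a minimal sketch of the pattern; all type and function names are hypothetical, not the PR's actual code. Only the channel hand-off and the worker counts mirror the diff above:

package main

import "runtime"

type item struct{}
type batch []item

// batcherLoop is the CPU-bound part: it only assembles batches and writes
// them to a buffered channel, never touching the database.
func batcherLoop(in <-chan item, buffered chan<- batch, maxBatchSize int) {
	b := make(batch, 0, maxBatchSize)
	for it := range in {
		b = append(b, it)
		if len(b) == maxBatchSize {
			buffered <- b // hand off to a writer and keep batching
			b = make(batch, 0, maxBatchSize)
		}
	}
}

// writerLoop is the IO-bound part: it drains the channel and performs DB writes.
func writerLoop(buffered <-chan batch, writeToDB func(batch) error) {
	for b := range buffered {
		_ = writeToDB(b) // error handling elided in this sketch
	}
}

func main() {
	in := make(chan item)
	buffered := make(chan batch, 200)         // cf. defaultMaxBufferedBatches
	for i := 0; i < runtime.NumCPU()/2; i++ { // cf. defaultBatchWorkers
		go batcherLoop(in, buffered, 5000)
		go writerLoop(buffered, func(b batch) error { return nil })
	}
	close(in) // a real caller would feed `in` with spans before closing
}

The buffered channel is what decouples the CPU-bound batchers from the IO-bound writers, which is why only half the CPUs are given to batching.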
Would be good to see the perf difference with and without batching in the case of Jaeger ingestion.
		case <-ticker.C:
			batcherSpan.AddEvent("Batch timeout reached")
			if !batch.isEmpty() {
				batch = flushBatch(batch)
			}
This won't work as expected. It will tick even if we hit the `item := <-b.in` case and flushed the batch just a moment earlier. This is because the ticker is created only once, so it does not account for the most recent event. We should use `<-time.After(config.BatchTimeout)` in this case.
Not sure that's true. If an item is received and the batch is full, we flush it and reset the ticker. If the batch is not full, we can still hit the timeout and flush it, which is what we want. Does it make sense now?
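For reference, a minimal sketch of the loop being debated: flush on size or timeout, with the ticker reset after a size-based flush so the timeout is measured from the most recent flush. Types, sizes, and helper names here are illustrative assumptions, not the PR's actual code:

package main

import "time"

type span struct{}

type spanBatch struct{ spans []span }

func (b *spanBatch) isEmpty() bool { return len(b.spans) == 0 }

func batchLoop(in <-chan span, flush func(spanBatch), maxSize int, timeout time.Duration) {
	batch := spanBatch{}
	ticker := time.NewTicker(timeout)
	defer ticker.Stop()
	for {
		select {
		case s, ok := <-in:
			if !ok {
				if !batch.isEmpty() {
					flush(batch) // flush whatever is left on shutdown
				}
				return
			}
			batch.spans = append(batch.spans, s)
			if len(batch.spans) >= maxSize {
				flush(batch)
				batch = spanBatch{}
				ticker.Reset(timeout) // restart the countdown after a size-based flush
			}
		case <-ticker.C:
			if !batch.isEmpty() { // timeout reached with a partial batch
				flush(batch)
				batch = spanBatch{}
			}
		}
	}
}

func main() {
	in := make(chan span, 1)
	in <- span{}
	close(in)
	batchLoop(in, func(spanBatch) {}, 2, 100*time.Millisecond)
}

Using `<-time.After(config.BatchTimeout)` instead would create a fresh timer on every select iteration, which also measures the timeout from the last event, at the cost of an allocation per loop.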
Yes, I will be publishing benchmark numbers in this PR soon.
Overall looks good. However, some changes are needed on the metrics side.
	b.bufferedBatches <- batchCp
	batcherSpan.AddEvent("Batch sent to buffer")
	batcherSpan.AddEvent("New Batch")
	return NewBatch(b.config.MaxBatchSize)
No strong suggestion. My aim was more towards readability. But, let's ignore this.
Overall looks good but needs some changes.
)

const (
	defaultReqBufferSize = 100000 // buffer for incoming requests, especially important for async acks
Can't these defaults be derived from other defaults?
Hmm... not sure about this one. My idea was to have enough buffer for spikes. 100K might be too much. In the recent benchmark runs I did, I've never seen this go above 1K. Maybe we can set it to 5K for now and tune as we go. Frankly, fine-tuning these requires time and a lot of benchmark runs... Please let me know if you have a better idea.
I actually tweaked this to be MaxBatchSize * 3. I also split the incoming requests into multiple queues (we have a separate queue for each batcher). I added a separate commit for easier review. A sketch of the split follows.
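A minimal sketch of that queue split, assuming hypothetical config field names; only the per-batcher queues and the MaxBatchSize * 3 buffer size come from the comment above:

package main

type span struct{}

type config struct {
	Batchers     int
	MaxBatchSize int
}

func main() {
	cfg := config{Batchers: 4, MaxBatchSize: 5000}
	// One queue per batcher, each buffered at MaxBatchSize * 3 as described.
	queues := make([]chan span, cfg.Batchers)
	for i := range queues {
		queues[i] = make(chan span, cfg.MaxBatchSize*3)
		go func(q <-chan span) {
			for range q { // each batcher drains only its own queue
			}
		}(queues[i])
	}
	// The dispatcher picks a queue index per request; see getBatcherIdx below.
	for i := range queues {
		close(queues[i])
	}
}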
One nit remaining, otherwise LGTM 👍🏻
// If it's only one span we shard by its TraceID so spans with the same TraceID end up in the same batcher.
// Otherwise we round-robin between batchers.
func (td *Dispatcher) getBatcherIdx(ctx context.Context, traces ptrace.Traces) (int, error) {
	numberOfBatchers := td.batcher.config.Batchers
I don't love that the dispatcher needs to care about the number of batchers, but this may be unavoidable...
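For illustration, a standalone sketch of the sharding idea: hash the TraceID onto a batcher for single-span payloads and round-robin otherwise. The real method works on ptrace.Traces; here a raw byte slice stands in for the TraceID, and the function and variable names are assumptions:

package main

import (
	"hash/fnv"
	"sync/atomic"
)

var rr atomic.Uint64 // round-robin counter shared across dispatches

func batcherIdx(traceID []byte, singleSpan bool, numBatchers int) int {
	if singleSpan {
		h := fnv.New32a()
		h.Write(traceID) // the same TraceID always maps to the same batcher
		return int(h.Sum32() % uint32(numBatchers))
	}
	return int(rr.Add(1) % uint64(numBatchers)) // spread multi-span payloads evenly
}

func main() {
	_ = batcherIdx([]byte("example-trace-id"), true, 4)
}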
Batching will help achieve better ingest performance, especially if traces are sent one by one (which is the case for the Jaeger collector). A batch is flushed either on timeout or when full. This also adds async support for traces, meaning that the client doesn't need to wait for the DB write; this increases ingest performance with a small risk of data loss. A new CLI flag `tracing.async-acks` was added. Flags to control batch size: `tracing.max-batch-size` and `tracing.batch-timeout`. Flag to control batch workers: `tracing.batch-workers`.