Data race on BatchSpanProcessor #4249

Closed
peolivei2 opened this issue Jun 23, 2023 · 4 comments · Fixed by #4353
Labels
area:trace (Part of OpenTelemetry tracing), bug (Something isn't working), help wanted (Extra attention is needed), pkg:SDK (Related to an SDK package)

Comments

@peolivei2

We're seeing the data race below for SDK 1.14.0:

[screenshot: race detector report]

peolivei2 added the bug (Something isn't working) label on Jun 23, 2023
@dmathieu
Member

Could you provide a reproduction example?

@peolivei2
Author

@dmathieu the problem here is the approach the SDK takes to handling the close/send of the queue channel. It doesn't guarantee that nothing is sent to the channel after it has been closed; instead it relies on recovering from the resulting panic. That's a valid data race, as documented here:

It doesn't repro consistently because of another race condition in the SDK:
[screenshot]
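To illustrate the pattern being described, here is a hypothetical standalone program (not the SDK code) that checks a stop channel, then sends, and relies only on a recover if the queue has already been closed. Depending on timing, the send may panic with "send on closed channel" and the race detector can flag the close/send pair; on other runs nothing happens, which is why it doesn't reproduce consistently:

package main

import "sync"

func main() {
	queue := make(chan int, 8)
	stopCh := make(chan struct{})

	var wg sync.WaitGroup
	wg.Add(1)
	go func() { // producer: check-then-send, guarded only by recover
		defer wg.Done()
		defer func() { recover() }() // swallows "send on closed channel"
		for i := 0; ; i++ {
			select {
			case <-stopCh:
				return
			default:
			}
			// The queue can be closed between the check above and this send.
			queue <- i
		}
	}()

	for i := 0; i < 100; i++ { // consume a few items, then shut down
		<-queue
	}
	close(stopCh)
	close(queue) // races with the send in the producer goroutine
	wg.Wait()
}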

My suggestion here would be to not close the queue channel at all, and to combine the selects above so you simply stop adding to the queue after shutdown.
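A rough sketch of that suggestion, using hypothetical names modeled on the SDK snippets quoted later in this thread (queue, stopCh): the queue is never closed, and the shutdown check and the send live in a single select.

package main

import "context"

type processor struct {
	queue  chan int
	stopCh chan struct{}
}

// enqueue never closes p.queue; shutdown is signalled only by closing p.stopCh,
// and the stop check is folded into the same select as the send.
func (p *processor) enqueue(ctx context.Context, v int) bool {
	select {
	case p.queue <- v:
		return true
	case <-p.stopCh:
		return false
	case <-ctx.Done():
		return false
	}
}

func main() {
	p := &processor{queue: make(chan int, 8), stopCh: make(chan struct{})}
	p.enqueue(context.Background(), 1)
	close(p.stopCh)                    // shutdown: stop accepting new items
	p.enqueue(context.Background(), 2) // may still succeed; see the next reply
}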

@dmathieu
Member

Combining the selects wouldn't be a reliable solution either: Go doesn't evaluate select cases in the order they are written; when several cases are ready, it picks one pseudo-randomly.
So on any given call we could either append to the queue again and return true, or detect that stopCh has been closed and return false.

See https://medium.com/a-journey-with-go/go-ordering-in-select-statements-fd0ff80fd8d6
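For reference, a small standalone program (not SDK code) showing the behavior described above: when both cases are ready, select picks one pseudo-randomly rather than in source order.

package main

import "fmt"

func main() {
	queue := make(chan int, 1_000_000) // always has room in this example
	stopCh := make(chan struct{})
	close(stopCh) // shutdown has already happened

	sent, stopped := 0, 0
	for i := 0; i < 1_000_000; i++ {
		select {
		case queue <- i: // ready: the buffer still has room
			sent++
		case <-stopCh: // also ready: the channel is closed
			stopped++
		}
	}
	// Roughly half of the iterations take each branch.
	fmt.Println("sent:", sent, "stopped:", stopped)
}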

pellared added the area:trace (Part of OpenTelemetry tracing), pkg:SDK (Related to an SDK package), and help wanted (Extra attention is needed) labels on Jul 14, 2023
@pellared
Member

I quickly looked at the code and it uses channels improperly.

The "consumer" goroutine is closing the channel:

// drainQueue awaits the any caller that had added to bsp.stopWait
// to finish the enqueue, then exports the final batch.
func (bsp *batchSpanProcessor) drainQueue() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	for {
		select {
		case sd := <-bsp.queue:
			if sd == nil {
				if err := bsp.exportSpans(ctx); err != nil {
					otel.Handle(err)
				}
				return
			}

			bsp.batchMutex.Lock()
			bsp.batch = append(bsp.batch, sd)
			shouldExport := len(bsp.batch) == bsp.o.MaxExportBatchSize
			bsp.batchMutex.Unlock()

			if shouldExport {
				if err := bsp.exportSpans(ctx); err != nil {
					otel.Handle(err)
				}
			}
		default:
			close(bsp.queue)
		}
	}
}

Moreover, the "producer" goroutine has some hacks to prevent panics if it sends to a closed channel:

func recoverSendOnClosedChan() {
	x := recover()
	switch err := x.(type) {
	case nil:
		return
	case runtime.Error:
		if err.Error() == "send on closed channel" {
			return
		}
	}
	panic(x)
}

func (bsp *batchSpanProcessor) enqueueBlockOnQueueFull(ctx context.Context, sd ReadOnlySpan) bool {
	if !sd.SpanContext().IsSampled() {
		return false
	}

	// This ensures the bsp.queue<- below does not panic as the
	// processor shuts down.
	defer recoverSendOnClosedChan()

	select {
	case <-bsp.stopCh:
		return false
	default:
	}

	select {
	case bsp.queue <- sd:
		return true
	case <-ctx.Done():
		return false
	}
}

I guess there may be more problems, but it looks clear that the synchronization is not implemented properly.

Reference: https://go101.org/article/channel-closing.html
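For illustration, a minimal sketch of the direction suggested in that article and in the comments above: the queue channel is never closed, shutdown is signalled only through stopCh, and the consumer drains whatever is left before exiting. Names and structure are hypothetical; this is not necessarily the actual change made in #4353.

package main

import (
	"fmt"
	"sync"
)

type processor struct {
	queue  chan int
	stopCh chan struct{}
	done   sync.WaitGroup // tracks the consumer goroutine
}

func newProcessor() *processor {
	p := &processor{queue: make(chan int, 128), stopCh: make(chan struct{})}
	p.done.Add(1)
	go p.run()
	return p
}

// enqueue never closes the queue; it only refuses new items once stopCh is closed.
func (p *processor) enqueue(v int) bool {
	select {
	case <-p.stopCh:
		return false
	default:
	}
	select {
	case p.queue <- v:
		return true
	case <-p.stopCh:
		return false
	}
}

func (p *processor) run() {
	defer p.done.Done()
	for {
		select {
		case v := <-p.queue:
			fmt.Println("export", v)
		case <-p.stopCh:
			// Drain whatever is left without ever closing p.queue.
			for {
				select {
				case v := <-p.queue:
					fmt.Println("export", v)
				default:
					return
				}
			}
		}
	}
}

func (p *processor) shutdown() {
	close(p.stopCh) // the only channel that is ever closed
	p.done.Wait()
}

func main() {
	p := newProcessor()
	for i := 0; i < 10; i++ {
		p.enqueue(i)
	}
	p.shutdown()
	p.enqueue(99) // refused after shutdown; the queue stays open but unused
}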
