
Support persistent batch processor to prevent telemetry data loss #6940

Open
xhyzzZ opened this issue Dec 11, 2024 · 1 comment

Comments


xhyzzZ commented Dec 11, 2024

Is your feature request related to a problem? Please describe.
We are using BatchSpanProcessor and have a scenario where, when traffic bursts, we lose some spans because we can't always tune the processor configs perfectly ahead of time. Hence I am wondering if there is a way to persist the data so that nothing is lost when traffic spikes.

Reference: if the configs are not tuned well, spans are dropped silently here, with only limited metrics to surface it: https://github.com/open-telemetry/opentelemetry-java/blob/main/sdk/trace/src/main/java/io/opentelemetry/sdk/trace/export/BatchSpanProcessor.java#L238
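
For context, this is how we tune those knobs today. The builder methods below are the public API of the opentelemetry-java SDK, and the values shown are just the documented defaults:

    import io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter;
    import io.opentelemetry.sdk.trace.export.BatchSpanProcessor;
    import java.time.Duration;

    // Spans arriving while the queue already holds maxQueueSize items are
    // dropped silently at the line referenced above.
    BatchSpanProcessor processor =
        BatchSpanProcessor.builder(OtlpGrpcSpanExporter.getDefault())
            .setMaxQueueSize(2048)                    // in-memory buffer bound
            .setMaxExportBatchSize(512)               // spans per export call
            .setScheduleDelay(Duration.ofSeconds(5))  // how often the queue drains
            .build();

However we pick these numbers, a burst larger than maxQueueSize within one schedule interval still drops spans.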

Describe the solution you'd like

  1. Replacing the in-memory queue implementation with a local persistent queue solution such as https://github.com/bulldog2011/bigqueue?
  2. Using an unbounded MpscUnboundedArrayQueue instead of the bounded MpscArrayQueue created by newFixedSizeQueue(int capacity)? (A sketch follows below.)
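
To illustrate option 2, a minimal sketch. The newFixedSizeQueue method mirrors, roughly, the SDK's internal JcTools helper that backs the processor queue today; the Queues class name and the newUnboundedQueue method are hypothetical:

    import java.util.Queue;
    import java.util.concurrent.ArrayBlockingQueue;
    import org.jctools.queues.MpscArrayQueue;
    import org.jctools.queues.MpscUnboundedArrayQueue;

    public final class Queues {
      private Queues() {}

      // Roughly today's behavior: a bounded MPSC queue. offer() returns false
      // when the queue is full, and the processor drops the span.
      public static <T> Queue<T> newFixedSizeQueue(int capacity) {
        try {
          return new MpscArrayQueue<>(capacity);
        } catch (NoClassDefFoundError e) {
          // Fall back to a JDK queue if JcTools has been shaded out.
          return new ArrayBlockingQueue<>(capacity);
        }
      }

      // Hypothetical alternative: grows in linked chunks instead of rejecting
      // offers, so no span is dropped at enqueue time.
      public static <T> Queue<T> newUnboundedQueue(int chunkSize) {
        return new MpscUnboundedArrayQueue<>(chunkSize);
      }
    }

The obvious tradeoff is that an unbounded queue turns dropped spans into unbounded memory growth under sustained overload.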

Describe alternatives you've considered
N/A

Additional context
N/A

@xhyzzZ added the "Feature Request" label Dec 11, 2024
@jack-berg (Member) commented:

The behavior of the batch span processor is dictated by the specification, so changing it to use a persistent queue would require changing the spec, which I expect would be an involved process.

I have heard from a variety of people interested in increasing the reliability of telemetry delivery. Getting there requires several design changes:

  • We need a solution for handling bursts of spans that overwhelm the buffer contained in the batch span processor. This could be local persistence on disk, but that is not without problems. Something still has to serialize to disk and deserialize later; can that serialization keep up with a burst of spans? The disk itself needs to be thought through: what happens if an app is suddenly OOM-killed while un-exported spans remain on disk? Is the disk ephemeral or persistent across app restarts? If an app comes back up with a backlog of unexported spans on disk, what's the priority between exporting those spans and new ones? (A rough sketch of the serialization problem follows this list.)
  • The OTLP protocol itself needs to be enhanced for more reliability. Right now, OTLP is prone to duplicating data: if an export request is transmitted and the success response never makes it back to the client, the client will typically retry.
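
To make the first point concrete, here is a deliberately naive, hypothetical sketch (not an OpenTelemetry API) of a buffer that spills to disk once its in-memory queue is full. Even this toy version pays Java serialization costs on the producer's hot path, which is exactly the "can it keep up with a burst" question, and it answers none of the restart/replay questions above:

    import java.io.IOException;
    import java.io.ObjectOutputStream;
    import java.io.Serializable;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.concurrent.ArrayBlockingQueue;

    // Hypothetical spill-over buffer: in memory up to `capacity`, on disk beyond.
    final class SpillQueue<T extends Serializable> {
      private final ArrayBlockingQueue<T> memory;
      private final Path spillDir;
      private long spillCounter;

      SpillQueue(int capacity, Path spillDir) throws IOException {
        this.memory = new ArrayBlockingQueue<>(capacity);
        this.spillDir = Files.createDirectories(spillDir);
      }

      // Never drops an item: overflow is serialized to its own file. The write
      // happens on the caller's thread, so a producer burst now competes with
      // disk I/O; whether the files survive a crash or restart, and in what
      // order they are replayed, is left entirely open.
      synchronized void put(T item) throws IOException {
        if (memory.offer(item)) {
          return;
        }
        Path file = spillDir.resolve("spill-" + spillCounter++ + ".bin");
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(file))) {
          out.writeObject(item);
        }
      }
    }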

These are not trivial challenges. There's currently a proposal to start an Audit logging SIG. On the surface, that seems unrelated to your request, but dig deeper and one of the primary challenges that SIG would face is improving the reliability of OpenTelemetry data delivery. Whatever improvements it makes for reliable delivery of audit logs will likely also be available as an opt-in feature for traces, metrics, and logs. I think progress on this request is most likely to come from that area, so I encourage you to check it out and comment.

@jack-berg added the "blocked:spec" label Dec 12, 2024