Is your feature request related to a problem? Please describe.
We are using BatchSpanProcessor and have a scenario where, when traffic bursts, we lose some spans because we can't always tune the processor configuration perfectly in time. Is there a way to persist the data so that no spans are lost when traffic spikes?
The behavior of the batch span processor is dictated by the specification, so giving it a persistent queue would require changing the spec, which I expect would be an involved process.
I have heard a variety of people interested in increasing the reliability of telemetry delivery. This requires several design changes:
- We need a solution for handling bursts of spans that overwhelm the buffer in the batch span processor. This could be local persistence on disk, but that is not without problems. Something still has to serialize to disk and deserialize later. Can this serialization keep up with bursts of spans? The disk itself needs to be thought through. What happens if an app is suddenly OOM killed while unexported spans remain on disk? Is the disk ephemeral or persistent across app restarts? If an app comes back up and has a bunch of unexported spans on disk, what's the priority between exporting those spans and new spans?
- The OTLP protocol itself needs to be enhanced for more reliability. Right now, OTLP is prone to duplicate data: if an export request is transmitted and the success response never makes it back to the client, the client will typically retry.
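To make the first point concrete, here is a minimal sketch of a buffer that spills to local disk once its in-memory capacity is exceeded. It is not part of the SDK or the specification: `SpillingSpanBuffer`, its methods, and the newline-delimited file format are all hypothetical, and spans are assumed to be already serialized to single-line strings.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not an OpenTelemetry API.
final class SpillingSpanBuffer {
  private final ArrayDeque<String> memory = new ArrayDeque<>();
  private final int maxInMemory;
  private final Path spillFile;

  SpillingSpanBuffer(int maxInMemory, Path spillFile) {
    this.maxInMemory = maxInMemory;
    this.spillFile = spillFile;
  }

  // Keeps the span in memory while there is room; otherwise appends it to the spill file.
  // Assumes the serialized span contains no newline characters.
  synchronized void add(String serializedSpan) {
    if (memory.size() < maxInMemory) {
      memory.addLast(serializedSpan);
      return;
    }
    try {
      Files.write(
          spillFile,
          (serializedSpan + "\n").getBytes(StandardCharsets.UTF_8),
          StandardOpenOption.CREATE,
          StandardOpenOption.APPEND);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Returns in-memory spans first, then spilled spans, and deletes the spill file.
  synchronized List<String> drain() {
    List<String> out = new ArrayList<>(memory);
    memory.clear();
    try {
      if (Files.exists(spillFile)) {
        out.addAll(Files.readAllLines(spillFile, StandardCharsets.UTF_8));
        Files.delete(spillFile);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return out;
  }
}
```

Even this small sketch runs into the questions above: every spilled span costs a synchronous file write on the hot path, a crash between `readAllLines` and a successful export loses or duplicates data, and draining memory before disk is only one possible answer to the priority question.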
These are not trivial challenges. There's currently a proposal to start an audit logging SIG. On the surface, this seems unrelated to your request. But when you dig deeper, one of the primary challenges the audit SIG would face is improving the reliability of OpenTelemetry data delivery. Whatever improvements they make for reliable delivery of audit logs will likely also be available as an opt-in feature for traces, metrics, and logs. I think progress on this request is most likely to come from that area, so I encourage you to check it out and comment.
Reference: if the configuration is not tuned well, the processor silently drops spans here, with only limited metrics: https://github.com/open-telemetry/opentelemetry-java/blob/main/sdk/trace/src/main/java/io/opentelemetry/sdk/trace/export/BatchSpanProcessor.java#L238
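For readers unfamiliar with the failure mode, the sketch below shows how a bounded, non-blocking queue loses items during a burst. It is a simplified stand-in, assuming the drop path amounts to a failed `offer` on a full queue plus a counter; `BoundedSpanBuffer` is a hypothetical name, and the real processor uses its own queue implementation and metrics.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for a batching processor's buffer; not the SDK's code.
final class BoundedSpanBuffer {
  private final ArrayBlockingQueue<String> queue;
  private final AtomicLong dropped = new AtomicLong();

  BoundedSpanBuffer(int maxQueueSize) {
    this.queue = new ArrayBlockingQueue<>(maxQueueSize);
  }

  // Non-blocking add: when the queue is full, the span is counted and discarded.
  boolean add(String span) {
    if (queue.offer(span)) {
      return true;
    }
    dropped.incrementAndGet();
    return false;
  }

  int size() {
    return queue.size();
  }

  long droppedCount() {
    return dropped.get();
  }
}
```

With a capacity of 3 and a burst of 5 spans arriving before any export runs, the last 2 are discarded and the only trace of them is the counter, which is the data loss this request is about.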
Describe the solution you'd like
opentelemetry-java/sdk/trace-shaded-deps/src/main/java/io/opentelemetry/sdk/trace/internal/JcTools.java
Line 32 in efdacc1
Describe alternatives you've considered
N/A
Additional context
N/A