
Support persistent batch processor to prevent telemetry data loss #6940

Open
xhyzzZ opened this issue Dec 11, 2024 · 1 comment

Comments


xhyzzZ commented Dec 11, 2024

Is your feature request related to a problem? Please describe.
We are using BatchSpanProcessor and have a scenario where, when traffic bursts, we lose some spans because we can't always tune the processor configs perfectly ahead of time. Hence I am wondering if there is a way to persist the data so that nothing is lost when traffic spikes.

Reference: if the configs are not tuned well, spans are dropped silently here, with only limited metrics to surface it: https://github.com/open-telemetry/opentelemetry-java/blob/main/sdk/trace/src/main/java/io/opentelemetry/sdk/trace/export/BatchSpanProcessor.java#L238
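
For context, this is how we tune those knobs today. The builder methods below are the public API of the opentelemetry-java SDK, and the values shown are just the documented defaults:

    import io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter;
    import io.opentelemetry.sdk.trace.export.BatchSpanProcessor;
    import java.time.Duration;

    // Spans arriving while the queue already holds maxQueueSize items are
    // dropped silently at the line referenced above.
    BatchSpanProcessor processor =
        BatchSpanProcessor.builder(OtlpGrpcSpanExporter.getDefault())
            .setMaxQueueSize(2048)                    // in-memory buffer bound
            .setMaxExportBatchSize(512)               // spans per export call
            .setScheduleDelay(Duration.ofSeconds(5))  // how often the queue drains
            .build();

However we pick these numbers, a burst larger than maxQueueSize within one schedule interval still drops spans.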

Describe the solution you'd like

  1. Replacing the in-memory queue implementation with a local persistent queue solution such as https://github.com/bulldog2011/bigqueue?
  2. Using an unbounded MpscUnboundedArrayQueue instead of the bounded MpscArrayQueue created by newFixedSizeQueue(int capacity)? (A sketch follows below.)
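
To illustrate option 2, a minimal sketch. The newFixedSizeQueue method mirrors, roughly, the SDK's internal JcTools helper that backs the processor queue today; the Queues class name and the newUnboundedQueue method are hypothetical:

    import java.util.Queue;
    import java.util.concurrent.ArrayBlockingQueue;
    import org.jctools.queues.MpscArrayQueue;
    import org.jctools.queues.MpscUnboundedArrayQueue;

    public final class Queues {
      private Queues() {}

      // Roughly today's behavior: a bounded MPSC queue. offer() returns false
      // when the queue is full, and the processor drops the span.
      public static <T> Queue<T> newFixedSizeQueue(int capacity) {
        try {
          return new MpscArrayQueue<>(capacity);
        } catch (NoClassDefFoundError e) {
          // Fall back to a JDK queue if JcTools has been shaded out.
          return new ArrayBlockingQueue<>(capacity);
        }
      }

      // Hypothetical alternative: grows in linked chunks instead of rejecting
      // offers, so no span is dropped at enqueue time.
      public static <T> Queue<T> newUnboundedQueue(int chunkSize) {
        return new MpscUnboundedArrayQueue<>(chunkSize);
      }
    }

The obvious tradeoff is that an unbounded queue turns dropped spans into unbounded memory growth under sustained overload.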

Describe alternatives you've considered
N/A

Additional context
N/A

@xhyzzZ added the "Feature Request" label Dec 11, 2024
@jack-berg (Member) commented:

The behavior of the batch span processor is dictated by the specification, so changing it to use a persistent queue would require changing the spec, which I expect would be an involved process.

I have heard from a variety of people interested in increasing the reliability of telemetry delivery. Getting there requires several design changes:

  • We need a solution for handling bursts of spans that overwhelm the buffer contained in the batch span processor. This could be local persistence on disk, but that is not without problems. Something still has to serialize to disk and deserialize later; can that serialization keep up with a burst of spans? The disk itself needs to be thought through: what happens if an app is suddenly OOM-killed while un-exported spans remain on disk? Is the disk ephemeral or persistent across app restarts? If an app comes back up with a backlog of unexported spans on disk, what's the priority between exporting those spans and new ones? (A rough sketch of the serialization problem follows this list.)
  • The OTLP protocol itself needs to be enhanced for more reliability. Right now, OTLP is prone to duplicating data: if an export request is transmitted and the success response never makes it back to the client, the client will typically retry.
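
To make the first point concrete, here is a deliberately naive, hypothetical sketch (not an OpenTelemetry API) of a buffer that spills to disk once its in-memory queue is full. Even this toy version pays Java serialization costs on the producer's hot path, which is exactly the "can it keep up with a burst" question, and it answers none of the restart/replay questions above:

    import java.io.IOException;
    import java.io.ObjectOutputStream;
    import java.io.Serializable;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.concurrent.ArrayBlockingQueue;

    // Hypothetical spill-over buffer: in memory up to `capacity`, on disk beyond.
    final class SpillQueue<T extends Serializable> {
      private final ArrayBlockingQueue<T> memory;
      private final Path spillDir;
      private long spillCounter;

      SpillQueue(int capacity, Path spillDir) throws IOException {
        this.memory = new ArrayBlockingQueue<>(capacity);
        this.spillDir = Files.createDirectories(spillDir);
      }

      // Never drops an item: overflow is serialized to its own file. The write
      // happens on the caller's thread, so a producer burst now competes with
      // disk I/O; whether the files survive a crash or restart, and in what
      // order they are replayed, is left entirely open.
      synchronized void put(T item) throws IOException {
        if (memory.offer(item)) {
          return;
        }
        Path file = spillDir.resolve("spill-" + spillCounter++ + ".bin");
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(file))) {
          out.writeObject(item);
        }
      }
    }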

These are not trivial challenges. There's currently a proposal to start an Audit logging SIG. On the surface, that seems unrelated to your request, but dig deeper and one of the primary challenges that SIG would face is improving the reliability of OpenTelemetry data delivery. Whatever improvements it makes for reliable delivery of audit logs will likely also be available as an opt-in feature for traces, metrics, and logs. I think progress on this request is most likely to come from that area, so I encourage you to check it out and comment.

@jack-berg added the "blocked:spec" label Dec 12, 2024