
Loki exporter: lots of out-of-order errors #3090

Closed · gillg opened this issue Apr 13, 2021 · 14 comments · Fixed by #3705
Labels: bug (Something isn't working)

Comments

gillg (Contributor) commented Apr 13, 2021

Describe the bug
When I use the Loki exporter with a unique set of labels, I get a lot of out-of-order dropped messages (resulting in a full OTEL Collector queue).
My labels are correct across logs, making streams unique per main producer process (but I can't reasonably introduce the thread ID as a label).

Is there a workaround for now?

Steps to reproduce
Create an app which sends logs through OTLP to a collector.
Different threads of the app emit logs, so timestamps can be slightly shuffled when the OTEL Collector receives them.

What did you expect to see?
Especially for the Loki exporter (or as a new processor), we should be able to buffer logs for a short time and order them before sending them to Loki.

What did you see instead?
The OTEL Collector queue fills up with retries for Loki 400 Bad Request responses (out of order).

What version did you use?
otel-collector-contrib:523e417c06215ffdd5ce1d36ad4fa657f4dd525c

What config did you use?

receivers:
  otlp:
    protocols:
      http:
      grpc:

exporters:
  logging:
    loglevel: debug
  loki:
    endpoint: http://localhost:3100/loki/api/v1/push
    labels:
      attributes:
        [... mappings for otel attributes...]

processors:
  batch:

extensions:
  health_check:

service:
  extensions: [health_check]
  pipelines:
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [logging, loki]
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [logging]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [logging]

Additional context
Could be fixed by grafana/loki#1544

gillg added the bug label on Apr 13, 2021
gillg (Contributor, Author) commented Apr 13, 2021

I tried removing the batch processor, hoping to bypass the queue and maybe drop out-of-order logs faster to avoid losing too many subsequent messages. But it seems worse.

gramidt (Member) commented Apr 13, 2021

Thank you for opening this, @gillg! I too have run into this and have started having conversations with other users to come up with a solution. This sounds like an excellent use case for a buffer/ordering processor that can be reused among exporters. Would you like to take a stab at creating this processor? We should bring this constraint up in the Slack channels too.

gramidt (Member) commented Apr 13, 2021

The other main issue is when you have multiple exporters receiving logs. Even with local buffering and ordering, you would still run into this issue. You could also have one collector per label set, but that also has issues and failure conditions to think through.

gramidt (Member) commented Apr 13, 2021

I just messaged the Loki creator to get their thoughts on this issue.

gillg (Contributor, Author) commented Apr 13, 2021

> Thank you for opening this, @gillg! I too have run into this and have started having conversations with other users to come up with a solution. This sounds like an excellent use case for a buffer/ordering processor that can be reused among exporters. Would you like to take a stab at creating this processor? We should bring this constraint up in the Slack channels too.

This issue is definitely a blocker for adopting OpenTelemetry with Loki.
I would help, but I'm really not comfortable in Go. I can make some minor changes, but not write a full processor with a sort algorithm :/
I can try something, but this kind of processor is really complex to write without introducing performance issues.

I put my hopes in the Loki team and the issue I linked. It's one of their top-priority subjects...

gramidt (Member) commented Apr 13, 2021

I was talking with Ed (Loki lead) and he said it was near the top of their list, as you mentioned. Here's the document with solutions that Owen from Grafana started thinking about ( https://github.com/grafana/loki/blob/master/docs/sources/design-documents/2021-01-Ordering-Constraint-Removal.md ). The downside is that this work has not been started and does not have a definitive date for when it will be completed. (Worth noting, I do not work for Grafana Labs.)

One approach we could take is the processor approach, where it buffers and orders based on the timestamp, and then we add an additional label identifying the collector source. This could help solve some issues, but not all of them, and could drastically increase cardinality.

Another approach is adding the ability to rewrite the timestamps of the logs and add an additional label identifying the collector source. I'm not a fan of this approach, but it's sort of what's done with other Loki clients that fan in, and it is also a pretty straightforward solution. This approach solves the most issues, but the timestamps wouldn't be accurate and it could drastically increase cardinality.
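
For illustration, here is a minimal sketch of the first approach (order entries within each stream, then tag the stream with a collector-source label). It uses plain Go structs rather than the collector's pdata types or the Loki push types; the Entry type, the label-keyed streams map, and the collector label are assumptions for this example, not existing APIs.

// Minimal sketch of the "buffer and order" approach, using plain structs
// rather than the collector's pdata types. The Entry type, the label-keyed
// streams map, and the collector label are assumptions for this example.
package sketch

import (
	"fmt"
	"sort"
	"time"
)

type Entry struct {
	Timestamp time.Time
	Line      string
}

// orderStreams sorts each stream's entries by timestamp and tags the stream
// with an extra label identifying the collector instance that forwarded it.
func orderStreams(streams map[string][]Entry, collectorID string) map[string][]Entry {
	out := make(map[string][]Entry, len(streams))
	for labels, entries := range streams {
		sort.SliceStable(entries, func(i, j int) bool {
			return entries[i].Timestamp.Before(entries[j].Timestamp)
		})
		// Hypothetical extra label; it grows cardinality with the number of collectors.
		key := fmt.Sprintf("%s,collector=%q", labels, collectorID)
		out[key] = entries
	}
	return out
}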

gramidt (Member) commented Apr 13, 2021

@bogdandrutu @tigrannajaryan - Please assign to me.

gillg (Contributor, Author) commented Apr 13, 2021

Yeah, I saw this document, and I proposed a solution to Owen (another core dev team member) just before seeing it: grafana/loki#1544 (comment)

I have hope in their team, but I'm not a fan of waiting for... weeks? months? :/
Anyway, the processor needs some design to be efficient and buffer logs correctly (traces + metrics also?). So I don't know which would be faster.

The last workaround is intrusive... but it could be a temporary solution. I'm definitely not a fan of adding this kind of intrusion to the Loki exporter, but it would be very easy to override existing timestamps with a new one just before sending logs.
The additional attribute could be optional; I think the attributes processor could do the job?
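
As a rough illustration of that workaround, here is a minimal sketch (plain Go structs, not the exporter's actual types; the Entry type and the "original_ts" field name are assumptions) that stamps each entry with the send time while keeping the emitter's timestamp inside the body:

// Minimal sketch of the timestamp-override workaround: stamp each entry with
// the send time and keep the emitter's timestamp inside the body. The Entry
// type and the "original_ts" field name are assumptions for this example.
package sketch

import (
	"fmt"
	"time"
)

type Entry struct {
	Timestamp time.Time
	Line      string
}

// overrideTimestamps rewrites every timestamp to "now" so Loki accepts the
// entries in arrival order, while preserving the original timestamp in the line.
func overrideTimestamps(entries []Entry) []Entry {
	now := time.Now()
	for i := range entries {
		entries[i].Line = fmt.Sprintf(`{"original_ts":%q,"msg":%q}`,
			entries[i].Timestamp.Format(time.RFC3339Nano), entries[i].Line)
		entries[i].Timestamp = now
	}
	return entries
}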

gramidt (Member) commented Apr 13, 2021

Agreed. I know there's a small group that has been thinking through buffering (in-memory and persistent options), but I haven't heard any updates on this recently. This processor should reuse shared components created by that group. This approach could still be months away if significant work hasn't already been started.

I really DISLIKE the last workaround and agree it's too intrusive for my taste, but it may be the only reasonable short-term solution to this problem. Correct, the attributes processor could be used for adding the additional label. We could also use Resource attributes as labels and use the resourceprocessor ( https://github.com/open-telemetry/opentelemetry-collector/tree/main/processor/resourceprocessor ).

gillg (Contributor, Author) commented Apr 14, 2021

> We could also use Resource attributes as labels and use the resourceprocessor ( https://github.com/open-telemetry/opentelemetry-collector/tree/main/processor/resourceprocessor ).

I integrated it yesterday! After that, I will complete my previous PR about raw log + attributes merging and add resource merging (with the same ability to map them to labels).

I think in my case I will use a homemade version with the forced timestamp.
We could improve the public exporter to include attributes and resources in a final JSON message for Loki (or another format?)
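
As a rough sketch of that idea (map-based inputs and made-up field names, not the exporter's real data model), the merged JSON line could look like this:

// Minimal sketch: merge the raw body, record attributes, and resource
// attributes into a single JSON line for Loki. The field names and map-based
// inputs are assumptions, not the exporter's actual format.
package sketch

import "encoding/json"

// toLokiLine serializes the log body together with its attributes and resource
// attributes so no information is lost when only a few labels are kept.
func toLokiLine(body string, attrs, resource map[string]interface{}) (string, error) {
	payload := map[string]interface{}{
		"body":       body,
		"attributes": attrs,
		"resources":  resource,
	}
	b, err := json.Marshal(payload)
	return string(b), err
}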

LouisStAmour (Contributor) commented Jun 5, 2021

I was thinking of making a ring buffer that used binary insertion sort to sort these logs, and it would output the stable part of the ring at regular intervals or the bottom % of the ring if the ring is near full.

But then I realized that it might be more efficient to group the logs by stream first, and then use binary insertion sort for each stream, as there would be fewer log entries to sort once split out like that. The downside of course is that you'd need a buffer for each stream and that could get exponentially expensive as you add more stream labels. Another option would be to bucket by timestamp instead of by stream, but then we end up with a problem where we have to maintain as many buckets as we need to cover the duration of timestamps or we need to use a dynamic number of buckets or dynamic intervals, neither of which sounds nice.
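
As a rough sketch of that per-stream binary-insertion idea (plain Go, assuming a simple Entry type and ignoring flushing and memory bounds, so not an actual processor):

// Rough sketch of the per-stream binary-insertion idea: each stream keeps its
// own buffer that stays sorted as entries arrive. The Entry type and stream
// keying are assumptions; a real processor would also need flushing and a
// bound on memory.
package sketch

import (
	"sort"
	"time"
)

type Entry struct {
	Timestamp time.Time
	Line      string
}

type streamBuffer struct {
	entries []Entry // kept sorted by Timestamp
}

// insert places e at its sorted position using binary search, so only this
// stream's (smaller) buffer has to be searched and shifted.
func (b *streamBuffer) insert(e Entry) {
	i := sort.Search(len(b.entries), func(i int) bool {
		return b.entries[i].Timestamp.After(e.Timestamp)
	})
	b.entries = append(b.entries, Entry{})
	copy(b.entries[i+1:], b.entries[i:])
	b.entries[i] = e
}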

In the end, all of the above is probably overkill, especially as we don't actually have a reason right now to group logs by stream. It's ever so slightly annoying to implement this kind of feature here knowing it's something that could be handled more efficiently by Loki itself after splitting logs by stream... grafana/loki#1544

The only other thought I have is to make a groupandsortlogsprocessor that could be chained with a router or an exporter to try and more efficiently route data, perhaps to modules inside of Cortex/Loki more directly?

It's strange, for example, that Cortex ingesters claim to reduce TCO by batching and compressing in memory before writing, but apparently can't afford to sort? Hmm. Cortex gives a reason over here: cortexproject/cortex#3411 (comment). I wish they could store metrics uncompressed as they batch, to allow for insertion sorting, and then compress before writing more permanently.

Alternatively, if we assume sorting will never be added to Cortex, I wonder if we should have something generic to sort both metrics and logs before sending them to Cortex and/or Loki. It would help support option 3 in this blog post: https://aws.amazon.com/blogs/opensource/building-a-prometheus-remote-write-exporter-for-the-opentelemetry-go-sdk/

Or we'll need to take the Prometheus route of applying our own timestamps as we send the data, which doesn't feel as clean but obviously works, as long as it's relatively safe to ignore that the data is, in fact, out of order and therefore wrong. That said, sorting data assumes that timestamps are correct and consistent relative to one another, which is largely true when the streams to be sorted come from places with the same source of timestamps. The more I think about it, the more sorting after splitting by stream makes sense: the data is more likely to already be in the correct order. There's also an argument that if your data needs sorting, you're either missing streams or need to pay more attention to your source of timestamps.

Maybe an insertion-timestamp override is for the best for logs (not metrics), taking any out-of-order logs as they are. For logs, sorting by timestamp might not be relevant, especially when sending multi-line logs using fluentd or OTLP; that way you don't have to worry about a slightly out-of-order entry affecting your ability to read multi-line logs. It would be most serious when deriving metrics from logs, but even then you'd face the same problem of metrics potentially being received/processed out of order such that they would need to be sorted before being compressed, and that could be done after reading the metrics from the out-of-order logs but before sending the metrics to Cortex...

All this comes with the caveat that if your logs are used for analytics or auditing where order is critical, you couldn't trust the Loki timestamps and would have to sort the logs offline before using them for other purposes.

I end up wishing that Cortex/Loki had two modes: one which writes in real time, using the insertion timestamp or dropping out-of-order data, and produces a real-time but slightly incorrect dataset; and a second which reviews the previous data (including any dropped data) after some time (minutes?), sorts it correctly, and saves it with the correct timestamps for archival needs, such as historical (minutes ago or earlier) data.

LouisStAmour (Contributor) commented

https://questdb.io/blog/2021/05/10/questdb-release-6-0-tsbs-benchmark/ highlights that receiving data "out-of-order" is often a problem solved within databases. I'm tempted to suggest that any sorting logic be added to the Loki exporter or to a hypothetical Cortex/Prometheus exporter until we can point to another exporter with similar limitations...

gillg (Contributor, Author) commented Jun 7, 2021

Interesting analysis. For now, on my side, I prefer to "lose" the order rather than the logs themselves, so I made a timestamp override to now(), hard-coded in a custom branch...
If really needed, my structured log entry still contains the original timestamp (set by the emitter, not by my override).
It could be a Loki exporter option, but it's definitely not ideal.

Another thing that could be studied is the Thanos receiver approach. I don't know if it's different from Cortex/Loki, but probably.
https://github.com/thanos-io/thanos/tree/main/pkg/receive. Feel free to reverse engineer it if you're comfortable with that and see if there's something interesting 😅 I suppose they buffer and sort the WAL, but in theory a Prometheus server can send metrics with multiple retries some time later, so I don't know the limit.

gillg (Contributor, Author) commented Jun 28, 2021

Implementation of out-of-order tolerance is underway on the Loki side! 🎉
