Outputs block inputs when batch size is reached #2919
Comments
Thanks for the write-up; this is indeed the case. I would like to get this fixed, but of course we need to be very careful to do it correctly. There is one more issue that complicates things: we need to ensure that inputs reading from queues do not drop messages just because the output cannot keep up.
Hmm, yeah, I didn't think of that, and I can see how that complicates things, since currently the input will block and stop reading from queues if it can't write to the outputs. If I submit my proposal, it will definitely break that flow and could cause metrics to be dropped. It's almost like blocking is needed for inputs that are queues, and internal buffering is needed for inputs that do collection.
We need a method for inputs to be notified when their metrics have been emitted; this will allow durable transfer of queue items, since we can delay acking those messages, as well as better flow control. Of course, it needs to be performant too.
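As an illustration of the notification idea described above, here is a minimal Go sketch of what a delivery callback for queue inputs could look like. It is not Telegraf's actual API; the names (Metric, DeliveryCallback, AddWithCallback) are hypothetical, and the real design would also have to cover metrics being modified or duplicated by later stages.

```go
package main

import "fmt"

// Metric stands in for telegraf.Metric in this sketch.
type Metric struct {
	Name  string
	AckID string // broker delivery tag carried alongside the metric
}

// DeliveryCallback is invoked once the outputs have flushed the metric.
type DeliveryCallback func(m Metric, delivered bool)

type pendingMetric struct {
	m  Metric
	cb DeliveryCallback
}

// Accumulator sketch: queue inputs add metrics together with a callback.
type Accumulator struct {
	pending []pendingMetric
}

func (a *Accumulator) AddWithCallback(m Metric, cb DeliveryCallback) {
	a.pending = append(a.pending, pendingMetric{m, cb})
}

// flush simulates the outputs writing a batch and notifying the inputs;
// in a real agent the callback would fire only after the write succeeds.
func (a *Accumulator) flush() {
	for _, p := range a.pending {
		p.cb(p.m, true)
	}
	a.pending = a.pending[:0]
}

func main() {
	acc := &Accumulator{}
	acc.AddWithCallback(Metric{Name: "cpu", AckID: "delivery-tag-1"}, func(m Metric, ok bool) {
		if ok {
			fmt.Printf("acking broker message %s after emit\n", m.AckID)
		}
	})
	acc.flush()
}
```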
Sorry if I'm hijacking here, but I have a similar problem and was wondering if it wouldn't be beneficial to augment the logs to tell whether the output is the cause. I have this in my telegraf logs, over three days of logs saying nothing about outputs: https://gist.github.com/djsly/6cf0e61ae440772f4c103f0898c8e62a
When I go into the telegraf docker container, I can easily fetch the metrics without issues.
I'm running telegraf 1.2.1.
NB: if you prefer a new issue, let me know :) thanks!
It would be nice to improve this, but I don't think we can be confident that it is the output causing this. In general, any part of the input could cause this error, since the HTTP request is just one part of the collection. Even if the input is definitely blocked on the channel, it is not always the output's fault. One thing we could maybe do is log a warning if it takes longer than some amount of time to add a metric to the channel, or perhaps this is something we should store an internal metric for.
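A rough sketch of that "warn if adding to the channel takes too long" idea, assuming nothing about Telegraf's internals; metricC, the threshold, and the log prefixes are placeholders for illustration:

```go
package main

import (
	"log"
	"time"
)

// sendWithWarning pushes a metric onto the inter-plugin channel, logging a
// warning if the send has blocked for longer than warnAfter. Back pressure
// is preserved: the send still completes, it is just no longer silent.
func sendWithWarning(metricC chan<- string, m string, warnAfter time.Duration) {
	timer := time.NewTimer(warnAfter)
	defer timer.Stop()

	select {
	case metricC <- m:
		return
	case <-timer.C:
		log.Printf("W! adding metric to the output channel has blocked for %s; outputs may be slow", warnAfter)
	}

	start := time.Now()
	metricC <- m // keep blocking, but record how long the slow send took
	log.Printf("D! metric added after an extra %s wait", time.Since(start))
}

func main() {
	metricC := make(chan string) // unbuffered so the send blocks until the reader is ready

	go func() {
		time.Sleep(2 * time.Second) // simulate an output that is busy flushing
		<-metricC
	}()

	sendWithWarning(metricC, "cpu usage_idle=90", 500*time.Millisecond)
}
```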
I was giving more thought to this today and realized that I don't think that contract is kept with the way the agent is written today, specifically in the event that the output is down or slow (takes longer than a timeout) and the write to the output fails. In that case the agent will push the batch into the fail buffer. This will allow the inputs to add metrics until they fill a batch again. That batch would then try to write, but wouldn't be able to, and would get added to the fail buffer. This will keep happening until the fail buffer is full, at which point the agent starts dropping metrics.

As for the comment about different logging, I would agree it would be better if the logging were a little different, although I'm not sure of the best way. The way it is now, it makes it look like something is hanging on the system that is preventing it from collecting the metrics, which is what the belief was in #2183 as well. In reality there is just something blocked within telegraf (in this case). I was focused on that path for a while until I started looking at the internal metrics of how long the writes were taking. It feels like there should be some indication that the collection time of the metrics was OK.

I think it makes sense to add a logging statement of how long it took to write to a channel at the debug level; this way it's easier to see that it's not actually the collection taking forever. Although maybe if this gets fixed there won't be a need to log that, as writing to the channel won't block.
You are right about the queue handling when the output is down: metrics are dropped. When the outputs are slow but working, if I remember correctly, it will block the inputs and slow consumption down, but eventually, again, you are right and items can be dropped. Even though the queue handling is not completely correct, decoupling the output completely would cause a lot more day-to-day problems for queue users, so it needs to be solved before we decouple things.
Just to add a little context from my experimentation this week. I've been seeing:

2017-06-13T23:07:00Z E! Error in plugin [inputs.statsd]: took longer to collect than collection interval (1m0s)

I tracked down the same path as the original submitter (and that is how I came across this issue report), though at the beginning I suspected an issue with the statsd plugin due to a mutex deadlock that occurred between Statsd.Gather() and the statsd line parsing/aggregation. In this case the statsd plugin eventually filled its incoming metric queue and started dropping new metrics. After an extensive rewrite to remove the mutex locking, I was able to resolve the queue-full error. However, the deadlock described above in the issue remains (the statsd internal deadlock is resolved). Unfortunately, once telegraf gets here it never recovers until we restart it.

On a side note, if there is interest in the statsd changes I made, I can start getting approvals on my side for code donation.
Always interested in bug fixes. Can you start by opening a new issue with a description of the symptoms?
Some additional info, as I saw this recently: if the output, at least the InfluxDB output, cannot send a request, the exact same scenario occurs. This had a severe impact on our current system, as I am using Telegraf as a metrics proxy for our apps. In our case it was a configuration issue that was resolved quickly once I found this ticket, but it's not always going to be something this simple.

Now, this can be mitigated by using something like Kafka to buffer the requests and having queue inputs only ack messages that are successfully committed to a data store. However, not everyone is going to want to do this, so you may want to consider an embedded solution. The best I can come up with is having an option to buffer messages to a file when Telegraf detects that there are downstream issues. This assumes that there will be enough disk space, and with good monitoring it should be noisy enough to trigger an alert. I'm not sure if we would want this on a per-output basis or have a single buffer that feeds all outputs; there are pros and cons to both.

Granted, this idea needs to consider whether we want to drop metrics after a specified count or time period. If the period is short enough, then we wouldn't need the file. The file idea just allows for significantly more metrics to be pushed through before we start losing them, which may be important to some people.
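A very rough sketch of the spill-to-disk idea floated above (not an existing Telegraf feature): when the in-memory buffer is full, overflow metrics are appended to a spool file instead of being dropped. File format, rotation, replay, and most error handling are omitted; all names here are made up for illustration.

```go
package main

import (
	"fmt"
	"os"
)

// spool keeps a bounded in-memory buffer and overflows to a file.
type spool struct {
	mem    []string
	memCap int
	file   *os.File
}

func (s *spool) add(line string) error {
	if len(s.mem) < s.memCap {
		s.mem = append(s.mem, line)
		return nil
	}
	// Downstream is not keeping up; overflow to disk instead of dropping.
	_, err := fmt.Fprintln(s.file, line)
	return err
}

func main() {
	f, err := os.OpenFile("overflow.lp", os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		panic(err)
	}
	defer f.Close()

	s := &spool{memCap: 2, file: f}
	for i := 0; i < 5; i++ {
		if err := s.add(fmt.Sprintf("cpu usage_idle=%d", 90+i)); err != nil {
			panic(err)
		}
	}
	fmt.Printf("%d metrics in memory, rest spooled to %s\n", len(s.mem), f.Name())
}
```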
@danielnelson any update on this blocking issue, or an idea of how it could be fixed?
I have not yet had time to work on this directly, though it is still on my long-term roadmap as one of the major design issues that need to be addressed. I think the first step towards fixing this issue is creating a method for throttling the various queue consumer inputs. These inputs should limit the messages they have in flight, perhaps using a callback when a message is output. This is complicated by the fact that metrics can be modified or duplicated depending on the plugin layout.
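To make the in-flight limiting concrete, here is a hedged Go sketch of a queue consumer that caps outstanding messages with a semaphore and releases a slot when a delivery callback fires. The consumer type and its methods are invented for illustration only, not part of Telegraf.

```go
package main

import (
	"fmt"
	"sync"
)

type consumer struct {
	inflight chan struct{} // semaphore: one slot per un-acked message
}

// consume blocks when the maximum number of messages is still waiting on
// the outputs, which naturally throttles reads from the broker.
func (c *consumer) consume(msg string, emit func(m string, done func())) {
	c.inflight <- struct{}{} // acquire a slot; blocks if outputs are behind
	emit(msg, func() {
		<-c.inflight // release the slot once the metric was written/acked
	})
}

func main() {
	c := &consumer{inflight: make(chan struct{}, 3)}
	var wg sync.WaitGroup

	for i := 0; i < 10; i++ {
		msg := fmt.Sprintf("queue message %d", i)
		wg.Add(1)
		c.consume(msg, func(m string, done func()) {
			go func() { // pretend the output writes asynchronously
				defer wg.Done()
				fmt.Println("emitted", m)
				done()
			}()
		})
	}
	wg.Wait()
}
```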
@buckleypm I believe I might be seeing the same issue as you describe. I am seeing three telegraf servers randomly stop publishing metrics, and my stack traces look similar; searching for the cause led me to this thread (but there are at least four other bugs that also looked similar). Logs and stack traces I am seeing:
If this looks like a different issue, please let me know.
Bug report
If one of the output plugins becomes slow, it can cause back pressure all the way up through the input plugins and start to block them from running.
The reason for this is that the channels between the input and output plugins only have a buffer of 100. This should be OK since most of the operations are in memory. The problem ends up being here, where the buffer gets written if the batch size is full. This ends up happening on the same thread that reads from the outMetricC channel in the agent.
Relevant telegraf.conf:
System info:
Telegraf version 1.3.1
Steps to reproduce:
Expected behavior:
The expected behavior is for the outputs and the inputs to be independent of each other, meaning all the inputs should write to a buffer and the outputs read from that buffer. This way the outputs can't block the inputs.
Actual behavior:
Most of the time the expected behavior happens, except when the batch size is reached. When the batch size is reached, the output write happens on that same thread, which ends up blocking all the inputs. This causes the inputs to take a long time to collect, as they get blocked waiting on the metricC channel.
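A stripped-down model of this behavior (not Telegraf source, just an illustration of the reported flow): one goroutine both drains the channel and performs the batch write, so while a "batch write" is in progress the 100-slot buffer fills up and the producer (input) blocks.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	metricC := make(chan int, 100) // same buffer size mentioned in the report

	// Single reader that writes the batch inline, like the agent's flusher.
	go func() {
		batch := make([]int, 0, 100)
		for m := range metricC {
			batch = append(batch, m)
			if len(batch) == cap(batch) {
				time.Sleep(2 * time.Second) // slow output write happens inline
				batch = batch[:0]
			}
		}
	}()

	// "Input": once the reader is stuck in the inline write, this send blocks.
	for i := 0; i < 300; i++ {
		start := time.Now()
		metricC <- i
		if d := time.Since(start); d > 100*time.Millisecond {
			fmt.Printf("input blocked for %s on metric %d\n", d, i)
		}
	}
}
```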
Additional info:
I believe this is the code path that causes the blocking.
It seems that the flusher should just write to the buffer and trigger the goroutine here to write, instead of doing the write inline. This way the flow can still continue and metrics just get buffered.
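A hedged sketch of that proposal (illustrative, not the actual patch): the channel reader only appends to the buffer and signals a dedicated flush goroutine over a non-blocking channel, so the slow output write never happens on the reader's thread.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	metricC := make(chan int, 100)
	flushNow := make(chan struct{}, 1)

	var mu sync.Mutex
	buffer := make([]int, 0, 1000)

	// Dedicated flusher goroutine: the only place the slow output write happens.
	go func() {
		for range flushNow {
			mu.Lock()
			batch := buffer
			buffer = make([]int, 0, 1000)
			mu.Unlock()

			time.Sleep(2 * time.Second) // simulated slow output write
			fmt.Printf("flushed %d metrics\n", len(batch))
		}
	}()

	// Channel reader: only buffers and signals, never writes to the output.
	go func() {
		for m := range metricC {
			mu.Lock()
			buffer = append(buffer, m)
			full := len(buffer) >= 100
			mu.Unlock()

			if full {
				select {
				case flushNow <- struct{}{}: // wake the flusher
				default: // a flush is already pending; keep buffering
				}
			}
		}
	}()

	// "Inputs" keep making progress even while a flush is in progress.
	for i := 0; i < 300; i++ {
		metricC <- i
	}
	time.Sleep(5 * time.Second) // give the flusher time to drain before exiting
}
```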
Sample python server
This is the sample Python HTTP server used for testing; it does a sleep to simulate an output taking too long.