[Regression 2.4] Dispatcher / Aggregations broken. #242
Comments
Odd, I'd consider those changes harmless for what you see. The validate clause would be more problematic, but I agree it shouldn't be triggered at all. Question: do you use TCP or UDP traffic? |
I use TCP. I did look at netstat, and there's no closing/opening of connections, or any reported closed connections from either collectd or the relay. |
If you give me about 30 minutes: I have a test server up receiving ~10k metrics per second, just running a bisect on master. |
In case it's useful information, the relay is started with these command-line arguments:
|
that would be most helpful, I'll await your results |
Actually, this'll have to wait until tomorrow. Either the configuration is different, or I'm just not sending enough metrics to it (it's getting around a third of what production is handling), but it doesn't seem to be reproducible in testing. |
Hmmm, that's annoying. I'll see if I can find anything suspicious in the commits that went in. Thanks so far for trying to reproduce. |
6a17865 is the first bad commit
:100644 100644 424bc72a80593d8ff93b635142b92d88ce2bd0fa 5ba3eb42771482e81cf3bfa0ede9fe41ca778405 M dispatcher.c |
The relay is listening on UDP, but no traffic is being sent to it. |
Adding the following line:
The comparison between logging before and after is:
vs.
Looks like it is stuck in a loop, sleeping for long periods of time. |
Yep, it's calling
|
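The specific function being called is elided above, but the shape of the problem can be illustrated with a minimal sketch of a dispatcher loop that backs off when a pass over the connections reports no work. The names and the back-off interval below are assumptions for illustration, not the relay's actual code:

```c
#include <unistd.h>

/* Hypothetical stub: walk all ready connections and return how many
 * of them actually yielded work this pass. */
int
dispatch_pass(void)
{
	return 0;  /* placeholder so the sketch compiles */
}

/* If every pass is (wrongly) reported as "no work", this loop spends
 * nearly all of its time in usleep(), which matches the long sleeps
 * observed in the logging above. */
void
dispatcher_run(volatile int *keep_running)
{
	while (*keep_running)
		if (dispatch_pass() == 0)
			usleep(250 * 1000);  /* assumed back-off interval */
}
```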
Hmmm, food for thought. I'll look into it ASAP. |
The problem is that I incorrectly treated aggregator metrics as UDP metrics. This is due to some sneaky implicit use of a variable, which was a cheap shortcut on my part. |
Both aggregator and UDP connections have the noexpire bit set, but we don't want to treat aggregator connections as all-in-one packet connections. So, instead of messing around with len (losing the ability to say anything about whether work was done or not) and making implicit assumptions, introduce an isudp bit, which we can use instead to trigger the incomplete message discard (6a17865). This is likely the fix for issue #242, because it uses aggregations, and the aggregator gets heavily blocked since all the work the dispatchers did was flagged as "no-work", leading to sleeps.
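As a rough sketch of what the commit describes, assuming a simplified connection structure (the field and function names here are illustrative, not the actual dispatcher.c definitions):

```c
/* Illustrative, simplified per-connection state. */
typedef struct {
	char buf[8192];  /* partially read metric data */
	int  buflen;     /* number of bytes currently buffered */
	char noexpire;   /* set for both UDP and aggregator connections */
	char isudp;      /* set only for real UDP sockets */
} connection;

/* Called after a read appended 'len' new bytes to c->buf.  Returns
 * nonzero if any work was done, so the caller can decide whether this
 * pass counts as idle. */
int
connection_work(connection *c, int len)
{
	int worked = (len > 0);

	/* ... dispatch every complete, newline-terminated metric in buf ... */

	if (c->isudp) {
		/* A UDP datagram is an all-in-one packet: whatever is left
		 * over is an incomplete metric that can never be completed,
		 * so discard it rather than prepending it to the next
		 * datagram. */
		c->buflen = 0;
	}
	/* For TCP and aggregator connections (noexpire set, isudp not),
	 * the remainder stays buffered and is completed by the next
	 * read. */

	return worked;
}
```

Keying the discard on isudp rather than on noexpire (or on fiddling with len) keeps aggregator input buffered like TCP input, while still reporting that work was done, so the dispatchers don't go to sleep while the aggregator is feeding them.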
Just confirming that this looks like it's working fine:
|
that was indeed the first thing I did, but then I realised it isn't a complete fix, because it would still leave UDP connections vulnerable to this. |
actually, the work-being-done fix is 254b32b |
It looks OK from here. Initially the number of queued metrics was still quite erratic, and the aggregator still reported dropped metrics, though only a handful instead of thousands. After restarting again it cleared up, so maybe something else happened, or I managed to trigger another performance bottleneck that is unrelated to this issue. |
just to be clear: you updated to latest git, started that after stopping v2.3, and then you saw that behaviour, correct? |
I've released v2.5; I'll hold off on closing this until I hear back from you on whether the erratic behaviour still happens from time to time. |
Yeah, here's the graph showing what happened at the time, which I've also annotated with what I did. And here's what the queues were looking like. Now, it may just be that I didn't wait long enough, and that the relay would eventually smooth out. The explanation would be that while the relay is down, collectd keeps metrics in its own cache until the relay comes back up. That probably overwhelmed the relay to the point where it could never quite recover. I would expect this to be reproducible on v2.3 given the right scenario. The second and third restarts of the service may have just been chance observations. And when retrying a fifth time, it never recurred. |
I think this can be closed; what I saw above is likely unrelated, and didn't actually affect the relay in a way that degraded performance or affected stats. |
I'll let the graphs speak for themselves about what I saw after upgrading from v2.3 to v2.4.
After the first restart, the relay wasn't receiving any metrics. On the second restart, it was reading in a very bursty manner.
The only changes I can see that might be related are 6a17865 or 5b1e04d
Everything else seems to be related to the new validate clause. However, I probably shouldn't rule them out.