
[Regression 2.4] Dispatcher / Aggregations broken. #242

Closed
iain-buclaw-sociomantic opened this issue Jan 5, 2017 · 21 comments

@iain-buclaw-sociomantic (Contributor) commented Jan 5, 2017

I'll let the graphs speak for themselves about what I saw after upgrading from v2.3 to v2.4.

[graph: relay-dispatcher]

[graph: relay-sent]

After the first restart, the relay wasn't receiving any metrics. On the second restart, it was reading in a very bursty manner.

The only changes I can see that might be related are 6a17865 or 5b1e04d.

Everything else seems to be related to the new validate clause; however, I probably shouldn't rule those out either.

@grobian (Owner) commented Jan 5, 2017

Odd; I'd consider those changes harmless for what you're seeing. Validate would be more problematic, but I agree it shouldn't be triggered at all.

Question, do you use TCP or UDP traffic?

@iain-buclaw-sociomantic (Contributor, Author)

I use TCP. I did look at netstat, and there's no closing/opening of connections, or any reported closed connections from either collectd or the relay.

@iain-buclaw-sociomantic (Contributor, Author)

If you give me about 30 minutes: I have a test server up receiving ~10k metrics per second, and I'm just running a bisect on master.

@iain-buclaw-sociomantic (Contributor, Author)

In case it's useful information, the relay is started with these command-line arguments:

-i 127.0.0.1 -w 2 -b 50000 -q 1500000 -S 30 -m -p 2003

@grobian (Owner) commented Jan 5, 2017

That would be most helpful; I'll await your results.

@iain-buclaw-sociomantic (Contributor, Author) commented Jan 5, 2017

Actually, this will have to wait until tomorrow. Either the configuration is different, or I'm just not sending enough metrics to it (it's getting around a third of what production is handling), but it doesn't seem to be reproducible in testing.

@grobian (Owner) commented Jan 5, 2017

Hmmm, that's annoying. I'll see if I can find anything suspicious in the commits that went in. Thanks so far for trying to reproduce.

@iain-buclaw-sociomantic (Contributor, Author)

6a17865 is the first bad commit
commit 6a17865
Author: Fabian Groffen <grobian@gentoo.org>
Date: Sun Dec 25 23:34:54 2016 +0100

dispatch_connection: treat UDP packets as fully contained

As reported in issue #241, the relay was concatenating UDP packets until
it found a newline to terminate a line.  Since UDP packets don't
necessarily arrive (or in order), one can't rely on sending partial
fragments.  So instead, make sure we treat each received UDP packet as
a full message (and all we're going to get).

:100644 100644 424bc72a80593d8ff93b635142b92d88ce2bd0fa 5ba3eb42771482e81cf3bfa0ede9fe41ca778405 M dispatcher.c
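
For context, here is a generic sockets sketch (not carbon-c-relay's dispatcher; read_metrics and its parameters are made up for illustration) of why a read from a UDP socket has to be treated as a complete message, while a TCP read may legitimately end mid-line:

```c
#include <sys/socket.h>
#include <sys/types.h>

/* Generic illustration, not relay code: on a SOCK_DGRAM socket each
 * recv() returns exactly one datagram, and a trailing fragment without
 * a newline will never be completed by a later packet, so it has to be
 * dropped.  On a SOCK_STREAM socket the fragment must instead be kept
 * and prepended to the next read. */
static ssize_t read_metrics(int fd, int is_udp, char *buf, size_t len)
{
	ssize_t got = recv(fd, buf, len, 0);

	if (got > 0 && is_udp) {
		/* discard a partial last line; it cannot be continued */
		while (got > 0 && buf[got - 1] != '\n')
			got--;
	}
	return got;
}
```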

@iain-buclaw-sociomantic (Contributor, Author)

# netstat -anop | grep relay
tcp        0      0 127.0.0.1:2003          0.0.0.0:*               LISTEN      26570/relay      off (0.00/0/0)
tcp        0      0 127.0.0.1:40792         127.0.0.1:2103          ESTABLISHED 26570/relay      off (0.00/0/0)
tcp        0      0 127.0.0.1:41004         127.0.0.1:2403          ESTABLISHED 26570/relay      off (0.00/0/0)
tcp        0      0 127.0.0.1:40854         127.0.0.1:2203          ESTABLISHED 26570/relay      off (0.00/0/0)
tcp   5971128      0 127.0.0.1:2003          127.0.0.1:57796         ESTABLISHED 26570/relay      off (0.00/0/0)
tcp        0      0 127.0.0.1:47942         127.0.0.1:2303          ESTABLISHED 26570/relay      off (0.00/0/0)
udp        0      0 127.0.0.1:2003          0.0.0.0:*                           26570/relay      off (0.00/0/0)
unix  2      [ ACC ]     STREAM     LISTENING     1222862868 26570/relay         /tmp/.s.carbon-c-relay.2003

The relay is listening on UDP, but no traffic is being sent to it.

@iain-buclaw-sociomantic (Contributor, Author)

Adding the following logging line (existing context shown for placement):

        /* "release" this connection again */
        logout("dispatch[%d:%d]: releasing connection", self->fd, conn->takenby);
        __sync_bool_compare_and_swap(&(conn->takenby), self->id, 0);

The comparison of this logging before and after the offending commit is:

[2017-01-06 15:11:45] (MSG) dispatch[1:1]: releasing connection
[2017-01-06 15:11:45] (MSG) dispatch[2:2]: releasing connection
[2017-01-06 15:11:45] (MSG) dispatch[1:1]: releasing connection
[2017-01-06 15:11:45] (MSG) dispatch[2:2]: releasing connection
[2017-01-06 15:11:45] (MSG) dispatch[1:1]: releasing connection
[2017-01-06 15:11:45] (MSG) dispatch[2:2]: releasing connection
[2017-01-06 15:11:46] (MSG) dispatch[2:2]: releasing connection
[2017-01-06 15:11:46] (MSG) dispatch[1:1]: releasing connection
[2017-01-06 15:11:46] (MSG) dispatch[2:2]: releasing connection
[2017-01-06 15:11:47] (MSG) dispatch[1:1]: releasing connection
[2017-01-06 15:11:47] (MSG) dispatch[2:2]: releasing connection
[2017-01-06 15:11:47] (MSG) dispatch[1:1]: releasing connection
[2017-01-06 15:11:48] (MSG) dispatch[2:2]: releasing connection
[2017-01-06 15:11:48] (MSG) dispatch[1:1]: releasing connection
[2017-01-06 15:11:48] (MSG) dispatch[2:2]: releasing connection
[2017-01-06 15:11:48] (MSG) dispatch[1:1]: releasing connection
[2017-01-06 15:11:48] (MSG) dispatch[2:2]: releasing connection
[2017-01-06 15:11:48] (MSG) dispatch[1:1]: releasing connection
[2017-01-06 15:11:48] (MSG) dispatch[2:2]: releasing connection
[2017-01-06 15:11:48] (MSG) dispatch[1:1]: releasing connection
[2017-01-06 15:11:48] (MSG) dispatch[2:2]: releasing connection
[2017-01-06 15:11:48] (MSG) dispatch[1:1]: releasing connection
[2017-01-06 15:11:48] (MSG) dispatch[2:2]: releasing connection
[2017-01-06 15:11:48] (MSG) dispatch[1:1]: releasing connection
[2017-01-06 15:11:48] (MSG) dispatch[2:2]: releasing connection
[2017-01-06 15:11:48] (MSG) dispatch[1:1]: releasing connection
[2017-01-06 15:11:48] (MSG) dispatch[2:2]: releasing connection
[2017-01-06 15:11:48] (MSG) dispatch[1:1]: releasing connection
[2017-01-06 15:11:48] (MSG) dispatch[2:2]: releasing connection
[2017-01-06 15:11:48] (MSG) dispatch[1:1]: releasing connection
[2017-01-06 15:11:48] (MSG) dispatch[2:2]: releasing connection
[2017-01-06 15:11:49] (MSG) dispatch[1:1]: releasing connection
[2017-01-06 15:11:50] (MSG) dispatch[2:2]: releasing connection

vs.

[2017-01-06 15:13:10] (MSG) dispatch[2:2]: releasing connection
[2017-01-06 15:13:11] (MSG) dispatch[2:2]: releasing connection
[2017-01-06 15:13:12] (MSG) dispatch[2:2]: releasing connection
[2017-01-06 15:13:13] (MSG) dispatch[2:2]: releasing connection
[2017-01-06 15:13:14] (MSG) dispatch[1:1]: releasing connection
[2017-01-06 15:13:14] (MSG) dispatch[2:2]: releasing connection
[2017-01-06 15:13:14] (MSG) dispatch[1:1]: releasing connection
[2017-01-06 15:13:15] (MSG) dispatch[1:1]: releasing connection
[2017-01-06 15:13:15] (MSG) dispatch[2:2]: releasing connection
[2017-01-06 15:13:15] (MSG) dispatch[1:1]: releasing connection
[2017-01-06 15:13:16] (MSG) dispatch[1:1]: releasing connection
[2017-01-06 15:13:16] (MSG) dispatch[2:2]: releasing connection
[2017-01-06 15:13:17] (MSG) dispatch[2:2]: releasing connection
[2017-01-06 15:13:19] (MSG) dispatch[2:2]: releasing connection
[2017-01-06 15:13:24] (MSG) dispatch[2:2]: releasing connection
[2017-01-06 15:13:36] (MSG) dispatch[1:1]: releasing connection
[2017-01-06 15:13:40] (MSG) dispatch[2:2]: releasing connection
[2017-01-06 15:13:41] (MSG) dispatch[2:2]: releasing connection
[2017-01-06 15:14:05] (MSG) dispatch[1:1]: releasing connection
[2017-01-06 15:14:34] (MSG) dispatch[2:2]: releasing connection
[2017-01-06 15:14:35] (MSG) dispatch[1:1]: releasing connection
[2017-01-06 15:14:42] (MSG) dispatch[2:2]: releasing connection
[2017-01-06 15:15:05] (MSG) dispatch[1:1]: releasing connection
[2017-01-06 15:15:35] (MSG) dispatch[1:1]: releasing connection
[2017-01-06 15:16:05] (MSG) dispatch[1:1]: releasing connection

Looks like it is stuck in a loop, sleeping for long periods of time.

@iain-buclaw-sociomantic (Contributor, Author)

Yep, it's calling usleep a bunch of times.

[2017-01-06 15:31:11] (MSG) dispatcher[2:2]: releasing connection
[2017-01-06 15:31:11] (MSG) Sleeping for 123000 usecs
[2017-01-06 15:31:11] (MSG) Sleeping for 133000 usecs
[2017-01-06 15:31:11] (MSG) Sleeping for 234000 usecs
[2017-01-06 15:31:11] (MSG) Sleeping for 280000 usecs
[2017-01-06 15:31:12] (MSG) Sleeping for 181000 usecs
[2017-01-06 15:31:12] (MSG) Sleeping for 125000 usecs
[2017-01-06 15:31:12] (MSG) Sleeping for 299000 usecs
[2017-01-06 15:31:12] (MSG) Sleeping for 200000 usecs
[2017-01-06 15:31:12] (MSG) Sleeping for 280000 usecs
[2017-01-06 15:31:13] (MSG) Sleeping for 282000 usecs
[2017-01-06 15:31:13] (MSG) Sleeping for 262000 usecs
[2017-01-06 15:31:13] (MSG) Sleeping for 171000 usecs
[2017-01-06 15:31:13] (MSG) Sleeping for 213000 usecs
[2017-01-06 15:31:14] (MSG) Sleeping for 192000 usecs
[2017-01-06 15:31:14] (MSG) Sleeping for 143000 usecs
[2017-01-06 15:31:14] (MSG) Sleeping for 117000 usecs
[2017-01-06 15:31:14] (MSG) Sleeping for 103000 usecs
[2017-01-06 15:31:14] (MSG) Sleeping for 122000 usecs
[2017-01-06 15:31:14] (MSG) Sleeping for 218000 usecs
[2017-01-06 15:31:14] (MSG) Sleeping for 257000 usecs
[2017-01-06 15:31:15] (MSG) Sleeping for 167000 usecs
[2017-01-06 15:31:15] (MSG) Sleeping for 220000 usecs
[2017-01-06 15:31:15] (MSG) Sleeping for 294000 usecs
[2017-01-06 15:31:15] (MSG) Sleeping for 177000 usecs
[2017-01-06 15:31:16] (MSG) Sleeping for 133000 usecs
[2017-01-06 15:31:16] (MSG) Sleeping for 100000 usecs
[2017-01-06 15:31:16] (MSG) Sleeping for 225000 usecs
[2017-01-06 15:31:16] (MSG) Sleeping for 175000 usecs
[2017-01-06 15:31:16] (MSG) Sleeping for 108000 usecs
[2017-01-06 15:31:16] (MSG) Sleeping for 117000 usecs
[2017-01-06 15:31:16] (MSG) Sleeping for 216000 usecs
[2017-01-06 15:31:17] (MSG) Sleeping for 119000 usecs
[2017-01-06 15:31:17] (MSG) Sleeping for 254000 usecs
[2017-01-06 15:31:17] (MSG) Sleeping for 135000 usecs
[2017-01-06 15:31:17] (MSG) Sleeping for 196000 usecs
[2017-01-06 15:31:17] (MSG) Sleeping for 210000 usecs
[2017-01-06 15:31:18] (MSG) Sleeping for 289000 usecs
[2017-01-06 15:31:18] (MSG) Sleeping for 136000 usecs
[2017-01-06 15:31:18] (MSG) Sleeping for 185000 usecs
[2017-01-06 15:31:18] (MSG) Sleeping for 122000 usecs
[2017-01-06 15:31:18] (MSG) Sleeping for 297000 usecs
[2017-01-06 15:31:19] (MSG) Sleeping for 240000 usecs
[2017-01-06 15:31:19] (MSG) Sleeping for 294000 usecs
[2017-01-06 15:31:19] (MSG) Sleeping for 135000 usecs
[2017-01-06 15:31:19] (MSG) Sleeping for 190000 usecs
[2017-01-06 15:31:19] (MSG) Sleeping for 133000 usecs
[2017-01-06 15:31:20] (MSG) Sleeping for 244000 usecs
[2017-01-06 15:31:20] (MSG) Sleeping for 193000 usecs
[2017-01-06 15:31:20] (MSG) Sleeping for 110000 usecs
[2017-01-06 15:31:20] (MSG) Sleeping for 174000 usecs
[2017-01-06 15:31:20] (MSG) Sleeping for 261000 usecs
[2017-01-06 15:31:21] (MSG) Sleeping for 138000 usecs
[2017-01-06 15:31:21] (MSG) Sleeping for 268000 usecs
[2017-01-06 15:31:21] (MSG) Sleeping for 263000 usecs
[2017-01-06 15:31:21] (MSG) dispatch[2:2]: releasing connection
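
To illustrate why that matters, here is a minimal sketch of an idle-backoff loop of this shape (dispatch_once() is a hypothetical stand-in, not the relay's actual function): when completed reads are miscounted as "no work", the sleep branch fires on every pass and input is drained only in slow, bursty steps.

```c
#include <unistd.h>

/* Hypothetical stand-in for one pass over all connections; returns the
 * number of connections on which work was done. */
extern int dispatch_once(void);

static void dispatcher_loop(void)
{
	for (;;) {
		if (dispatch_once() == 0) {
			/* Supposedly nothing to do, so back off before the
			 * next pass.  If finished reads are wrongly reported
			 * as "no work", this branch is taken every time and
			 * the relay stalls, as in the log above. */
			usleep(250 * 1000);
		}
	}
}
```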

@grobian (Owner) commented Jan 6, 2017

Hmmm, food for thought. I'll look into it ASAP.

@grobian (Owner) commented Jan 6, 2017

The problem is that I falsely treated aggregator metrics as UDP metrics. This is due to some sneaky implicit usage of a variable, which was a cheap shortcut on my part that turned out to be wrong.

grobian added a commit that referenced this issue Jan 6, 2017
Both aggregator and UDP connections have the noexpire bit set, but we
don't want to treat aggregator connections as all-in-one packet
connections.
So, instead of messing around with len (losing the ability to say
something about whether work was done or not) and making some implicit
assumptions, introduce an isudp bit, which we can use instead to trigger
the incomplete message discard (6a17865).

This is likely the fix for issue #242, because that setup uses aggregations, and
the aggregator gets heavily blocked since all the work the dispatchers did
was flagged as "no work", leading to sleeps.
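
A minimal sketch of the distinction that commit describes (the isudp name is taken from the commit message; the surrounding code is assumed, not the exact upstream diff): the incomplete-message discard should key off a UDP-only flag rather than noexpire, which aggregator connections set too.

```c
/* Sketch under the above assumptions, not the exact upstream change:
 * both UDP and aggregator connections are noexpire, but only a UDP
 * read is complete in itself, so only UDP should simulate EOF and
 * discard a partial trailing line. */
if (conn->isudp)
	len = -1;  /* self-contained UDP packet: nothing more will follow */
```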

@iain-buclaw-sociomantic (Contributor, Author)

Just confirming that this looks like it's working fine:

                if (conn->noexpire && !conn->isaggr)
                        len = -1;  /* simulate EOF for UDP packet */

@grobian (Owner) commented Jan 6, 2017

That was indeed the first thing I did; then I realised it isn't a complete fix, because it would still leave UDP connections vulnerable to this.

@grobian (Owner) commented Jan 6, 2017

Actually, the fix for the "work being done" accounting is 254b32b.

@iain-buclaw-sociomantic (Contributor, Author)

It looks OK from here. Initially, however, the number of queued metrics was still quite erratic, and the aggregator still reported dropped metrics, though only a handful instead of thousands.

After restarting again it cleared up, however, so maybe something else happened, or I managed to trigger another performance bottleneck that is unrelated to this issue.

@grobian (Owner) commented Jan 7, 2017

Just to be clear: you updated to latest git, started that after stopping v2.3, and then you saw that behaviour, correct?

@grobian (Owner) commented Jan 9, 2017

I've released v2.5. I'll hold off on closing this until I hear back from you on whether the erratic behaviour still happens from time to time.

@iain-buclaw-sociomantic (Contributor, Author)

Yeah, here's the graph showing what happened at the time; I've also annotated it with what I did.

[graph: relay-sent]

And here's what the queues were looking like.

[graph: relay-queues]

Now, it may just be that I didn't wait long enough, and that the relay would eventually have smoothed out.

The explanation would be that while the relay was down, collectd kept metrics in its own cache until the relay came back up. That backlog probably overwhelmed the relay, to the point where it could never quite recover.

I would expect this to be reproducible on v2.3 given the right scenario. The second and third restarts of the service may have just been chance observations, and when retrying for the fifth time it never recurred.

@iain-buclaw-sociomantic (Contributor, Author)

I think this can be closed; what I saw above is likely unrelated, and it didn't actually affect the relay in a way that degraded performance or affected stats.
