
[Regression 2.4] Dispatcher / Aggregations broken. #242

Closed
iain-buclaw-sociomantic opened this issue Jan 5, 2017 · 21 comments

@iain-buclaw-sociomantic (Contributor) commented Jan 5, 2017

I'll let the graphs speak for themselves about what I saw after upgrading from v2.3 to v2.4.

[graph: relay-dispatcher]

[graph: relay-sent]

After the first restart, the relay wasn't receiving any metrics. On the second restart, it was reading in a very bursty manner.

The only changes I can see that might be related are 6a17865 or 5b1e04d.

Everything else seems to be related to the new validate clause; however, I probably shouldn't rule those out either.

@grobian (Owner) commented Jan 5, 2017

Odd; I'd consider those changes harmless for what you're seeing. Validate would be more problematic, but I agree it shouldn't be triggered at all.

Question, do you use TCP or UDP traffic?

@iain-buclaw-sociomantic (Contributor, Author)

I use TCP. I did look at netstat, and there's no closing/opening of connections, or any reported closed connections from either collectd or the relay.

@iain-buclaw-sociomantic (Contributor, Author)

If you give me about 30 minutes: I have a test server up receiving ~10k metrics per second, and I'm just running a bisect on master.

@iain-buclaw-sociomantic (Contributor, Author)

In case it's useful information, the relay is started with these command-line arguments:

-i 127.0.0.1 -w 2 -b 50000 -q 1500000 -S 30 -m -p 2003

@grobian (Owner) commented Jan 5, 2017

That would be most helpful; I'll await your results.

@iain-buclaw-sociomantic (Contributor, Author) commented Jan 5, 2017

Actually, this will have to wait until tomorrow. Either the configuration is different, or I'm just not sending enough metrics to it (it's getting around a third of what production is handling), but it doesn't seem to be reproducible in testing.

@grobian (Owner) commented Jan 5, 2017

Hmmm, that's annoying. I'll see if I can find anything suspicious in the commits that went in. Thanks so far for trying to reproduce.

@iain-buclaw-sociomantic (Contributor, Author)

6a17865 is the first bad commit
commit 6a17865
Author: Fabian Groffen <grobian@gentoo.org>
Date: Sun Dec 25 23:34:54 2016 +0100

dispatch_connection: treat UDP packets as fully contained

As reported in issue #241, the relay was concatenating UDP packets until
it found a newline to terminate a line.  Since UDP packets don't
necessarily arrive (or in order), one can't rely on sending partial
fragments.  So instead, make sure we treat each received UDP packet as
a full message (and all we're going to get).

:100644 100644 424bc72a80593d8ff93b635142b92d88ce2bd0fa 5ba3eb42771482e81cf3bfa0ede9fe41ca778405 M dispatcher.c
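
For context, here is a generic sockets sketch (not carbon-c-relay's dispatcher; read_metrics and its parameters are made up for illustration) of why a read from a UDP socket has to be treated as a complete message, while a TCP read may legitimately end mid-line:

```c
#include <sys/socket.h>
#include <sys/types.h>

/* Generic illustration, not relay code: on a SOCK_DGRAM socket each
 * recv() returns exactly one datagram, and a trailing fragment without
 * a newline will never be completed by a later packet, so it has to be
 * dropped.  On a SOCK_STREAM socket the fragment must instead be kept
 * and prepended to the next read. */
static ssize_t read_metrics(int fd, int is_udp, char *buf, size_t len)
{
	ssize_t got = recv(fd, buf, len, 0);

	if (got > 0 && is_udp) {
		/* discard a partial last line; it cannot be continued */
		while (got > 0 && buf[got - 1] != '\n')
			got--;
	}
	return got;
}
```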

@iain-buclaw-sociomantic (Contributor, Author)

# netstat -anop | grep relay
tcp        0      0 127.0.0.1:2003          0.0.0.0:*               LISTEN      26570/relay      off (0.00/0/0)
tcp        0      0 127.0.0.1:40792         127.0.0.1:2103          ESTABLISHED 26570/relay      off (0.00/0/0)
tcp        0      0 127.0.0.1:41004         127.0.0.1:2403          ESTABLISHED 26570/relay      off (0.00/0/0)
tcp        0      0 127.0.0.1:40854         127.0.0.1:2203          ESTABLISHED 26570/relay      off (0.00/0/0)
tcp   5971128      0 127.0.0.1:2003          127.0.0.1:57796         ESTABLISHED 26570/relay      off (0.00/0/0)
tcp        0      0 127.0.0.1:47942         127.0.0.1:2303          ESTABLISHED 26570/relay      off (0.00/0/0)
udp        0      0 127.0.0.1:2003          0.0.0.0:*                           26570/relay      off (0.00/0/0)
unix  2      [ ACC ]     STREAM     LISTENING     1222862868 26570/relay         /tmp/.s.carbon-c-relay.2003

The relay is listening on UDP, but no traffic is being sent to it.

@iain-buclaw-sociomantic (Contributor, Author)

Adding the following logging line (existing context shown for placement):

        /* "release" this connection again */
        logout("dispatch[%d:%d]: releasing connection", self->fd, conn->takenby);
        __sync_bool_compare_and_swap(&(conn->takenby), self->id, 0);

The comparison of this logging before and after the offending commit is:

[2017-01-06 15:11:45] (MSG) dispatch[1:1]: releasing connection
[2017-01-06 15:11:45] (MSG) dispatch[2:2]: releasing connection
[2017-01-06 15:11:45] (MSG) dispatch[1:1]: releasing connection
[2017-01-06 15:11:45] (MSG) dispatch[2:2]: releasing connection
[2017-01-06 15:11:45] (MSG) dispatch[1:1]: releasing connection
[2017-01-06 15:11:45] (MSG) dispatch[2:2]: releasing connection
[2017-01-06 15:11:46] (MSG) dispatch[2:2]: releasing connection
[2017-01-06 15:11:46] (MSG) dispatch[1:1]: releasing connection
[2017-01-06 15:11:46] (MSG) dispatch[2:2]: releasing connection
[2017-01-06 15:11:47] (MSG) dispatch[1:1]: releasing connection
[2017-01-06 15:11:47] (MSG) dispatch[2:2]: releasing connection
[2017-01-06 15:11:47] (MSG) dispatch[1:1]: releasing connection
[2017-01-06 15:11:48] (MSG) dispatch[2:2]: releasing connection
[2017-01-06 15:11:48] (MSG) dispatch[1:1]: releasing connection
[2017-01-06 15:11:48] (MSG) dispatch[2:2]: releasing connection
[2017-01-06 15:11:48] (MSG) dispatch[1:1]: releasing connection
[2017-01-06 15:11:48] (MSG) dispatch[2:2]: releasing connection
[2017-01-06 15:11:48] (MSG) dispatch[1:1]: releasing connection
[2017-01-06 15:11:48] (MSG) dispatch[2:2]: releasing connection
[2017-01-06 15:11:48] (MSG) dispatch[1:1]: releasing connection
[2017-01-06 15:11:48] (MSG) dispatch[2:2]: releasing connection
[2017-01-06 15:11:48] (MSG) dispatch[1:1]: releasing connection
[2017-01-06 15:11:48] (MSG) dispatch[2:2]: releasing connection
[2017-01-06 15:11:48] (MSG) dispatch[1:1]: releasing connection
[2017-01-06 15:11:48] (MSG) dispatch[2:2]: releasing connection
[2017-01-06 15:11:48] (MSG) dispatch[1:1]: releasing connection
[2017-01-06 15:11:48] (MSG) dispatch[2:2]: releasing connection
[2017-01-06 15:11:48] (MSG) dispatch[1:1]: releasing connection
[2017-01-06 15:11:48] (MSG) dispatch[2:2]: releasing connection
[2017-01-06 15:11:48] (MSG) dispatch[1:1]: releasing connection
[2017-01-06 15:11:48] (MSG) dispatch[2:2]: releasing connection
[2017-01-06 15:11:49] (MSG) dispatch[1:1]: releasing connection
[2017-01-06 15:11:50] (MSG) dispatch[2:2]: releasing connection

vs.

[2017-01-06 15:13:10] (MSG) dispatch[2:2]: releasing connection
[2017-01-06 15:13:11] (MSG) dispatch[2:2]: releasing connection
[2017-01-06 15:13:12] (MSG) dispatch[2:2]: releasing connection
[2017-01-06 15:13:13] (MSG) dispatch[2:2]: releasing connection
[2017-01-06 15:13:14] (MSG) dispatch[1:1]: releasing connection
[2017-01-06 15:13:14] (MSG) dispatch[2:2]: releasing connection
[2017-01-06 15:13:14] (MSG) dispatch[1:1]: releasing connection
[2017-01-06 15:13:15] (MSG) dispatch[1:1]: releasing connection
[2017-01-06 15:13:15] (MSG) dispatch[2:2]: releasing connection
[2017-01-06 15:13:15] (MSG) dispatch[1:1]: releasing connection
[2017-01-06 15:13:16] (MSG) dispatch[1:1]: releasing connection
[2017-01-06 15:13:16] (MSG) dispatch[2:2]: releasing connection
[2017-01-06 15:13:17] (MSG) dispatch[2:2]: releasing connection
[2017-01-06 15:13:19] (MSG) dispatch[2:2]: releasing connection
[2017-01-06 15:13:24] (MSG) dispatch[2:2]: releasing connection
[2017-01-06 15:13:36] (MSG) dispatch[1:1]: releasing connection
[2017-01-06 15:13:40] (MSG) dispatch[2:2]: releasing connection
[2017-01-06 15:13:41] (MSG) dispatch[2:2]: releasing connection
[2017-01-06 15:14:05] (MSG) dispatch[1:1]: releasing connection
[2017-01-06 15:14:34] (MSG) dispatch[2:2]: releasing connection
[2017-01-06 15:14:35] (MSG) dispatch[1:1]: releasing connection
[2017-01-06 15:14:42] (MSG) dispatch[2:2]: releasing connection
[2017-01-06 15:15:05] (MSG) dispatch[1:1]: releasing connection
[2017-01-06 15:15:35] (MSG) dispatch[1:1]: releasing connection
[2017-01-06 15:16:05] (MSG) dispatch[1:1]: releasing connection

Looks like it is stuck in a loop, sleeping for long periods of time.

@iain-buclaw-sociomantic (Contributor, Author)

Yep, it's calling usleep a bunch of times.

[2017-01-06 15:31:11] (MSG) dispatcher[2:2]: releasing connection
[2017-01-06 15:31:11] (MSG) Sleeping for 123000 usecs
[2017-01-06 15:31:11] (MSG) Sleeping for 133000 usecs
[2017-01-06 15:31:11] (MSG) Sleeping for 234000 usecs
[2017-01-06 15:31:11] (MSG) Sleeping for 280000 usecs
[2017-01-06 15:31:12] (MSG) Sleeping for 181000 usecs
[2017-01-06 15:31:12] (MSG) Sleeping for 125000 usecs
[2017-01-06 15:31:12] (MSG) Sleeping for 299000 usecs
[2017-01-06 15:31:12] (MSG) Sleeping for 200000 usecs
[2017-01-06 15:31:12] (MSG) Sleeping for 280000 usecs
[2017-01-06 15:31:13] (MSG) Sleeping for 282000 usecs
[2017-01-06 15:31:13] (MSG) Sleeping for 262000 usecs
[2017-01-06 15:31:13] (MSG) Sleeping for 171000 usecs
[2017-01-06 15:31:13] (MSG) Sleeping for 213000 usecs
[2017-01-06 15:31:14] (MSG) Sleeping for 192000 usecs
[2017-01-06 15:31:14] (MSG) Sleeping for 143000 usecs
[2017-01-06 15:31:14] (MSG) Sleeping for 117000 usecs
[2017-01-06 15:31:14] (MSG) Sleeping for 103000 usecs
[2017-01-06 15:31:14] (MSG) Sleeping for 122000 usecs
[2017-01-06 15:31:14] (MSG) Sleeping for 218000 usecs
[2017-01-06 15:31:14] (MSG) Sleeping for 257000 usecs
[2017-01-06 15:31:15] (MSG) Sleeping for 167000 usecs
[2017-01-06 15:31:15] (MSG) Sleeping for 220000 usecs
[2017-01-06 15:31:15] (MSG) Sleeping for 294000 usecs
[2017-01-06 15:31:15] (MSG) Sleeping for 177000 usecs
[2017-01-06 15:31:16] (MSG) Sleeping for 133000 usecs
[2017-01-06 15:31:16] (MSG) Sleeping for 100000 usecs
[2017-01-06 15:31:16] (MSG) Sleeping for 225000 usecs
[2017-01-06 15:31:16] (MSG) Sleeping for 175000 usecs
[2017-01-06 15:31:16] (MSG) Sleeping for 108000 usecs
[2017-01-06 15:31:16] (MSG) Sleeping for 117000 usecs
[2017-01-06 15:31:16] (MSG) Sleeping for 216000 usecs
[2017-01-06 15:31:17] (MSG) Sleeping for 119000 usecs
[2017-01-06 15:31:17] (MSG) Sleeping for 254000 usecs
[2017-01-06 15:31:17] (MSG) Sleeping for 135000 usecs
[2017-01-06 15:31:17] (MSG) Sleeping for 196000 usecs
[2017-01-06 15:31:17] (MSG) Sleeping for 210000 usecs
[2017-01-06 15:31:18] (MSG) Sleeping for 289000 usecs
[2017-01-06 15:31:18] (MSG) Sleeping for 136000 usecs
[2017-01-06 15:31:18] (MSG) Sleeping for 185000 usecs
[2017-01-06 15:31:18] (MSG) Sleeping for 122000 usecs
[2017-01-06 15:31:18] (MSG) Sleeping for 297000 usecs
[2017-01-06 15:31:19] (MSG) Sleeping for 240000 usecs
[2017-01-06 15:31:19] (MSG) Sleeping for 294000 usecs
[2017-01-06 15:31:19] (MSG) Sleeping for 135000 usecs
[2017-01-06 15:31:19] (MSG) Sleeping for 190000 usecs
[2017-01-06 15:31:19] (MSG) Sleeping for 133000 usecs
[2017-01-06 15:31:20] (MSG) Sleeping for 244000 usecs
[2017-01-06 15:31:20] (MSG) Sleeping for 193000 usecs
[2017-01-06 15:31:20] (MSG) Sleeping for 110000 usecs
[2017-01-06 15:31:20] (MSG) Sleeping for 174000 usecs
[2017-01-06 15:31:20] (MSG) Sleeping for 261000 usecs
[2017-01-06 15:31:21] (MSG) Sleeping for 138000 usecs
[2017-01-06 15:31:21] (MSG) Sleeping for 268000 usecs
[2017-01-06 15:31:21] (MSG) Sleeping for 263000 usecs
[2017-01-06 15:31:21] (MSG) dispatch[2:2]: releasing connection
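
To illustrate why that matters, here is a minimal sketch of an idle-backoff loop of this shape (dispatch_once() is a hypothetical stand-in, not the relay's actual function): when completed reads are miscounted as "no work", the sleep branch fires on every pass and input is drained only in slow, bursty steps.

```c
#include <unistd.h>

/* Hypothetical stand-in for one pass over all connections; returns the
 * number of connections on which work was done. */
extern int dispatch_once(void);

static void dispatcher_loop(void)
{
	for (;;) {
		if (dispatch_once() == 0) {
			/* Supposedly nothing to do, so back off before the
			 * next pass.  If finished reads are wrongly reported
			 * as "no work", this branch is taken every time and
			 * the relay stalls, as in the log above. */
			usleep(250 * 1000);
		}
	}
}
```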

@grobian (Owner) commented Jan 6, 2017

Hmmm, food for thought. I'll look into it ASAP.

@grobian (Owner) commented Jan 6, 2017

The problem is that I falsely treated aggregator metrics as UDP metrics. This is due to some sneaky implicit usage of a variable, which was a cheap shortcut on my part that turned out to be wrong.

grobian added a commit that referenced this issue Jan 6, 2017
Both aggregator and UDP connections have the noexpire bit set, but we
don't want to treat aggregator connections as all-in-one packet
connections.
So, instead of messing around with len (losing the ability to say
something about whether work was done or not) and making some implicit
assumptions, introduce an isudp bit, which we can use instead to trigger
the incomplete message discard (6a17865).

This is likely the fix for issue #242, because that setup uses aggregations, and
the aggregator gets heavily blocked since all the work the dispatchers did
was flagged as "no work", leading to sleeps.
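
A minimal sketch of the distinction that commit describes (the isudp name is taken from the commit message; the surrounding code is assumed, not the exact upstream diff): the incomplete-message discard should key off a UDP-only flag rather than noexpire, which aggregator connections set too.

```c
/* Sketch under the above assumptions, not the exact upstream change:
 * both UDP and aggregator connections are noexpire, but only a UDP
 * read is complete in itself, so only UDP should simulate EOF and
 * discard a partial trailing line. */
if (conn->isudp)
	len = -1;  /* self-contained UDP packet: nothing more will follow */
```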

@iain-buclaw-sociomantic (Contributor, Author)

Just confirming that this looks like it's working fine:

                if (conn->noexpire && !conn->isaggr)
                        len = -1;  /* simulate EOF for UDP packet */

@grobian (Owner) commented Jan 6, 2017

That was indeed the first thing I did; then I realised it isn't a complete fix, because it would still leave UDP connections vulnerable to this.

@grobian (Owner) commented Jan 6, 2017

Actually, the fix for the "work being done" accounting is 254b32b.

@iain-buclaw-sociomantic (Contributor, Author)

It looks OK from here. Initially, however, the number of queued metrics was still quite erratic, and the aggregator still reported dropped metrics, though only a handful instead of thousands.

After restarting again it cleared up, however, so maybe something else happened, or I managed to trigger another performance bottleneck that is unrelated to this issue.

@grobian (Owner) commented Jan 7, 2017

Just to be clear: you updated to latest git, started that after stopping v2.3, and then you saw that behaviour, correct?

@grobian (Owner) commented Jan 9, 2017

I've released v2.5. I'll hold off on closing this until I hear back from you on whether the erratic behaviour still happens from time to time.

@iain-buclaw-sociomantic (Contributor, Author)

Yeah, here's the graph showing what happened at the time; I've also annotated it with what I did.

[graph: relay-sent]

And here's what the queues were looking like.

[graph: relay-queues]

Now, it may just be that I didn't wait long enough, and that the relay would eventually have smoothed out.

The explanation would be that while the relay was down, collectd kept metrics in its own cache until the relay came back up. That backlog probably overwhelmed the relay, to the point where it could never quite recover.

I would expect this to be reproducible on v2.3 given the right scenario. The second and third restarts of the service may have just been chance observations, and when retrying for the fifth time it never recurred.

@iain-buclaw-sociomantic (Contributor, Author)

I think this can be closed; what I saw above is likely unrelated, and it didn't actually affect the relay in a way that degraded performance or affected stats.
