
A slow cluster should not affect other clusters #172

Closed
liyichao opened this issue Apr 29, 2016 · 9 comments

Comments

@liyichao

We have two clusters: one is graphite, the other influxdb 0.8. Most of our graphs use graphite, but once in a while we see one or two points missing from a graph. Looking at carbon-c-relay's stats, we found that the influxdb cluster's queue is full and its stall count is high, while the graphite side is working normally. If carbon-c-relay did not stall the client, the graphite side would keep working. The relay should not do traffic control by stalling the client; it should just buffer the data for influxdb and, if the buffer overflows, throw it away. That way the good consumer (graphite) keeps working even when influxdb cannot.
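The behaviour being requested can be sketched as follows (a hypothetical Python model, not carbon-c-relay's actual C source): each destination cluster gets its own bounded queue, and an overflowing queue drops metrics instead of blocking the shared receive path:

```python
from queue import Queue, Full

class ClusterQueue:
    """Bounded per-cluster buffer that drops on overflow instead of
    blocking (stalling) the client that produced the metric."""
    def __init__(self, maxsize):
        self.q = Queue(maxsize=maxsize)
        self.dropped = 0

    def offer(self, metric):
        try:
            self.q.put_nowait(metric)   # never blocks the sender
            return True
        except Full:
            self.dropped += 1           # the slow cluster loses data...
            return False                # ...but other clusters are unaffected

# one independent queue per destination cluster
graphite = ClusterQueue(maxsize=3)
influx = ClusterQueue(maxsize=3)

for i in range(5):                      # the influx consumer has stalled
    influx.offer(i)
    graphite.offer(i)
    graphite.q.get()                    # the graphite consumer keeps up

print(influx.dropped)    # → 2
print(graphite.dropped)  # → 0
```

Because `offer` never blocks, a backed-up influxdb queue cannot delay delivery to graphite; it only costs influxdb its own overflow.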

@grobian
Owner

grobian commented Apr 29, 2016

Yes, this sounds like #154. Do you know if you're running a version including that fix?

@liyichao
Author

It seems not, my version is:

commit a0152321a5d98d2eb2b1072a3e5f4d6a4845f929
Author: Fabian Groffen <grobian@gentoo.org>
Date:   Thu Jan 14 09:11:49 2016 +0100

@liyichao
Author

But my version does seem to include some fix, and it does not resolve this completely: carbon-c-relay reports an incomplete write to influxdb and then reports influxdb as OK, because influxdb is not down, it just fails to catch up. The relay should not stall the client at all.

@grobian
Owner

grobian commented Apr 29, 2016

I think you want to increase your queuesize in that case. It can be argued that stalling should be configurable, which is fine, but be aware that stalling is the only way to inform an upstream writer that it should slow down.
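If memory serves, the per-server queue size is a command-line option on the relay; the exact flag name below is an assumption, so verify it against `carbon-c-relay -h` for your build:

```shell
# hypothetical invocation: raise the per-server queue size so a slow
# destination can absorb longer backlogs before stalling/dropping
carbon-c-relay -f relay.conf -q 400000
```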

@grobian
Owner

grobian commented Apr 29, 2016

Your version is ANCIENT (v1.5). You don't have the fix included I mentioned.

@liyichao
Author

Yeah, we have now updated to match master and made influxdb faster, which temporarily solves the problem.

As for the stalling, we really do not want carbon-c-relay to stall the client (a statsd server). The stats are generated every flush interval (for example, 10s); if the server is stalled, the stats just sit in the statsd server, in yet another queue. And because the statsd server receives UDP packets from all servers, while it is stalled those UDP packets are simply discarded. Stalling makes the problem worse: the stats are lost even though the fast cluster would have worked.

@grobian
Owner

grobian commented Apr 30, 2016

so in that case the metric will just be dropped in the relay, instead of statsd

@grobian
Owner

grobian commented Apr 30, 2016

Of course it would be possible to create some flag to disable stalling, for situations that demand that behaviour. Again, I don't know your queuesize, but you may want to increase it. If influx cannot keep up with the inbound flow at all, then something could perhaps be devised from the relay's point of view, but of course the application itself is useless in that case.

grobian added a commit that referenced this issue May 8, 2016
For some scenarios, stalling may be undesirable, or just in a different
amount than the hardwired default.  Therefore, allow to control the
number of stalls before dropping metrics.  The setting 0 is allowed,
disabling stalls.
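The logic the commit describes can be sketched like this (hypothetical Python, not the relay's actual C implementation): on a full queue the sender stalls the client at most N times before dropping, and N = 0 means a full queue drops immediately:

```python
import time
from queue import Queue, Full

def enqueue(q, metric, max_stalls, stall_secs=0.0):
    """Try to enqueue `metric`, stalling (pausing the writer) at most
    `max_stalls` times while the queue is full.  With max_stalls=0,
    stalling is disabled and a full queue drops immediately."""
    for _ in range(max_stalls):
        try:
            q.put_nowait(metric)
            return "queued"
        except Full:
            time.sleep(stall_secs)      # back-pressure on the writer
    try:
        q.put_nowait(metric)            # final attempt after all stalls
        return "queued"
    except Full:
        return "dropped"

q = Queue(maxsize=1)
q.put_nowait("m1")                      # queue is now full
print(enqueue(q, "m2", max_stalls=0))   # → dropped
```

With a positive `max_stalls`, a briefly-slow consumer gets a chance to drain the queue between attempts; with 0, the client is never paused and the slow cluster alone bears the loss.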
@grobian
Owner

grobian commented May 8, 2016

I've added an option for you to disable the stalling, use -L 0
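In practice that would look like the following (assuming a config file named relay.conf; check `carbon-c-relay -h` in your build for the full flag list):

```shell
# disable stalling entirely: when a destination queue is full, the
# relay drops the metric instead of pausing the client (the statsd
# server), so fast clusters are never held back by a slow one
carbon-c-relay -f relay.conf -L 0
```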
