
A slow cluster should not affect other clusters #172

Closed
liyichao opened this issue Apr 29, 2016 · 9 comments

Comments

@liyichao

We have two clusters: one is graphite, the other influxdb 0.8. Most of our graphs use graphite, but once in a while we see one or two points missing from a graph. Looking at carbon-c-relay's stats, we found that the influxdb cluster's queue is full and its stall count is high, while the graphite side is working normally. If carbon-c-relay did not stall the client, the graphite side would keep working. The relay should not do traffic control by stalling the client; it should just buffer the data for influxdb and, if the buffer overflows, throw it away. That way the good consumer (graphite) keeps working even when influxdb cannot.
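The behaviour being requested can be sketched as follows (a hypothetical Python model, not carbon-c-relay's actual C source): each destination cluster gets its own bounded queue, and an overflowing queue drops metrics instead of blocking the shared receive path:

```python
from queue import Queue, Full

class ClusterQueue:
    """Bounded per-cluster buffer that drops on overflow instead of
    blocking (stalling) the client that produced the metric."""
    def __init__(self, maxsize):
        self.q = Queue(maxsize=maxsize)
        self.dropped = 0

    def offer(self, metric):
        try:
            self.q.put_nowait(metric)   # never blocks the sender
            return True
        except Full:
            self.dropped += 1           # the slow cluster loses data...
            return False                # ...but other clusters are unaffected

# one independent queue per destination cluster
graphite = ClusterQueue(maxsize=3)
influx = ClusterQueue(maxsize=3)

for i in range(5):                      # the influx consumer has stalled
    influx.offer(i)
    graphite.offer(i)
    graphite.q.get()                    # the graphite consumer keeps up

print(influx.dropped)    # → 2
print(graphite.dropped)  # → 0
```

Because `offer` never blocks, a backed-up influxdb queue cannot delay delivery to graphite; it only costs influxdb its own overflow.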

@grobian
Owner

grobian commented Apr 29, 2016

Yes, this sounds like #154. Do you know if you're running a version including that fix?

@liyichao
Author

It seems not, my version is:

commit a0152321a5d98d2eb2b1072a3e5f4d6a4845f929
Author: Fabian Groffen <grobian@gentoo.org>
Date:   Thu Jan 14 09:11:49 2016 +0100

@liyichao
Author

But my version does seem to include some fix, and it does not resolve this completely: carbon-c-relay reports an incomplete write to influxdb and then reports influxdb as OK, because influxdb is not down, it just fails to catch up. The relay should not stall the client at all.

@grobian
Owner

grobian commented Apr 29, 2016

I think you want to increase your queuesize in that case. It can be argued that stalling should be configurable, which is fine, but be aware that stalling is the only way to inform an upstream writer that it should slow down.
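If memory serves, the per-server queue size is a command-line option on the relay; the exact flag name below is an assumption, so verify it against `carbon-c-relay -h` for your build:

```shell
# hypothetical invocation: raise the per-server queue size so a slow
# destination can absorb longer backlogs before stalling/dropping
carbon-c-relay -f relay.conf -q 400000
```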

@grobian
Owner

grobian commented Apr 29, 2016

Your version is ANCIENT (v1.5). You don't have the fix included I mentioned.

@liyichao
Author

Yeah, we have now updated to match master and made influxdb faster, which temporarily solves the problem.

As for the stalling, we really do not want carbon-c-relay to stall the client (a statsd server). The stats are generated every flush interval (for example, 10s); if the server is stalled, the stats just sit in the statsd server, in yet another queue. And because the statsd server receives UDP packets from all servers, while it is stalled those UDP packets are simply discarded. Stalling makes the problem worse: the stats are lost even though the fast cluster would have worked.

@grobian
Owner

grobian commented Apr 30, 2016

so in that case the metric will just be dropped in the relay, instead of statsd

@grobian
Owner

grobian commented Apr 30, 2016

Of course it would be possible to create some flag to disable stalling, for situations that demand that behaviour. Again, I don't know your queuesize, but you may want to increase it. If influx cannot keep up with the inbound flow at all, then something could perhaps be devised from the relay's point of view, but of course the application itself is useless in that case.

grobian added a commit that referenced this issue May 8, 2016
For some scenarios, stalling may be undesirable, or just in a different
amount than the hardwired default.  Therefore, allow to control the
number of stalls before dropping metrics.  The setting 0 is allowed,
disabling stalls.
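The logic the commit describes can be sketched like this (hypothetical Python, not the relay's actual C implementation): on a full queue the sender stalls the client at most N times before dropping, and N = 0 means a full queue drops immediately:

```python
import time
from queue import Queue, Full

def enqueue(q, metric, max_stalls, stall_secs=0.0):
    """Try to enqueue `metric`, stalling (pausing the writer) at most
    `max_stalls` times while the queue is full.  With max_stalls=0,
    stalling is disabled and a full queue drops immediately."""
    for _ in range(max_stalls):
        try:
            q.put_nowait(metric)
            return "queued"
        except Full:
            time.sleep(stall_secs)      # back-pressure on the writer
    try:
        q.put_nowait(metric)            # final attempt after all stalls
        return "queued"
    except Full:
        return "dropped"

q = Queue(maxsize=1)
q.put_nowait("m1")                      # queue is now full
print(enqueue(q, "m2", max_stalls=0))   # → dropped
```

With a positive `max_stalls`, a briefly-slow consumer gets a chance to drain the queue between attempts; with 0, the client is never paused and the slow cluster alone bears the loss.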
@grobian
Owner

grobian commented May 8, 2016

I've added an option for you to disable the stalling, use -L 0
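In practice that would look like the following (assuming a config file named relay.conf; check `carbon-c-relay -h` in your build for the full flag list):

```shell
# disable stalling entirely: when a destination queue is full, the
# relay drops the metric instead of pausing the client (the statsd
# server), so fast clusters are never held back by a slow one
carbon-c-relay -f relay.conf -L 0
```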
