Failed process #337
Comments
I think we can blame the bug fixed by edf2c6e. I'll try to do a release soon.
Thank you for the quick answer, I'll give the master branch a try.
Hello, I have some thoughts about the code/patch:
We haven't been able to reproduce this bug with PyPy, which we use, so I don't really know.
OK, I'll try to trace those events. Can you give me some clues about the rationale behind it?
This was originally written by @unbrice. I think the idea was that we had async functions and that in some cases we wanted sync versions of these. We want synchronous versions of functions like the one being used here: https://github.com/criteo/biggraphite/blob/master/biggraphite/plugins/carbon.py#L61
Thank you for the explanation. I added 2 metrics to track the issue: https://github.com/hjdr4/biggraphite/blob/track-event-timeouts/biggraphite/accessor.py#L81-L106 The timeout is regularly triggered, so it seems that synchronous write throttling won't be as efficient as before.
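For readers following along, here is a minimal sketch of the sync-over-async pattern being discussed (the names and the timeout value are illustrative, not BigGraphite's actual API): the caller blocks on an event that the driver's completion callback sets, and the hard-coded timeout on that wait is exactly where throttling degrades when the callback never fires.

```python
import threading

SYNC_TIMEOUT_S = 10.0  # illustrative hard-coded timeout, not the project's real value


def insert_points_sync(async_insert, *args):
    """Block until an asynchronous insert completes, or until the timeout.

    `async_insert(*args, on_done=...)` is a stand-in for the accessor's async
    write; this sketches the pattern under discussion, not the actual code.
    """
    done = threading.Event()
    error = []

    def on_done(exc=None):
        if exc is not None:
            error.append(exc)
        done.set()

    async_insert(*args, on_done=on_done)

    # If the completion callback is never invoked, this wait expires: the
    # caller moves on, throttling is lost, and whatever the callback was
    # supposed to release stays allocated.
    if not done.wait(SYNC_TIMEOUT_S):
        raise RuntimeError("async insert did not complete within %ss" % SYNC_TIMEOUT_S)
    if error:
        raise error[0]
```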
Which Python version are you using? I'm starting to think that there must be something quite wrong in biggraphite/drivers/cassandra.py (line 1565 in f2f3558).
Could you try running with PyPy to validate that this affects only CPython (which is the current theory)?
Hello,

Over the weekend, the values stabilized after one Cassandra node stopped consuming 100% CPU. This is not related to the original issue; Cassandra was OK at that time. For the last graph, the timeouts occurred because I put too much load importing production data points, and one Cassandra node took hours to stabilize resource usage after the flow was shut down. But this shows that timeouts can occur when Cassandra is loaded, and having a hard-coded timeout can reduce throttling efficiency.

Just to recap:
Now, in real life, I will be OK with this workaround.
I think @vmiszczak-teads is right:
Indeed, edf2c6e only displaces the leak. It does so at the expense of removing error propagation, limiting back pressure, and introducing a race condition.

To understand the problem, note that there are two work queues: one in the Cassandra driver ("cQ") and one in BigGraphite ("bgQ"). edf2c6e is also bad because it prevents exceptions from bubbling up, and because it removes the protection on … I think the solution should instead be to bound the queue sizes.

This being said, we are really speaking of graceful degradation here. It is likely that another issue in your setup is causing cQ (and bgQ) not to empty fast enough. I would check system health for the specific nodes running BG (FDs, ...) and the health of the Cassandra cluster (do you have probes?).
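As a rough illustration of the "bound the queue sizes" idea with a plain bounded queue (the name bgQ, the size, and the helper functions are assumptions for the sketch, not the real internals):

```python
import queue  # named `Queue` on Python 2

bgQ = queue.Queue(maxsize=10000)  # bound picked arbitrarily for the sketch


def enqueue_point(point):
    # Once the queue is full, the producer blocks: back pressure reaches the
    # caller instead of the queue growing until the process is OOM-killed.
    bgQ.put(point, block=True)


def drain_loop(write_to_cassandra):
    # Consumer thread feeding the Cassandra-side queue; exceptions are allowed
    # to propagate so errors are not silently swallowed.
    while True:
        point = bgQ.get()
        try:
            write_to_cassandra(point)
        finally:
            bgQ.task_done()
```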
#296 and #337 both occurred without edf2c6e.

About #296: is there a metric that shows cQ/bgQ? I could graph it at the moment the problem occurred, comparing different processes/nodes.

IMHO the real issue is that an async call can never return (this is the problem I had). The async call should ensure/promise that it will return in a bounded time. I think a timeout is necessary, but it should not be on waiting for the async call to complete, but rather on the async call itself, to avoid leaking anything. Also, the synchronous writes do not seem to take pending async writes into account, so I'm not even sure the original throttle design did the job. Maybe a …

@unbrice I did not understand your point on queue sizes; queues can be full only if they are bounded.
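A sketch of that suggestion (`bounded_async_call` and `start_async` are hypothetical helpers, not existing BigGraphite functions): instead of the waiter giving up while the underlying callback stays registered forever, the operation's future is forced to complete with an error after a deadline, so nothing stays pending.

```python
import threading
from concurrent.futures import Future


def bounded_async_call(start_async, timeout_s):
    """Start an async operation whose future is guaranteed to complete.

    `start_async(on_done)` stands in for an async primitive taking a
    completion callback. If it never calls back, the timer fails the future,
    so waiters and per-call resources are released instead of leaking.
    """
    future = Future()
    lock = threading.Lock()

    def finish(result=None, exc=None):
        with lock:  # only the first completion (callback or timer) wins
            if future.done():
                return
            if exc is not None:
                future.set_exception(exc)
            else:
                future.set_result(result)

    timer = threading.Timer(
        timeout_s, finish, kwargs={"exc": RuntimeError("async call timed out")}
    )
    timer.daemon = True
    timer.start()

    def on_done(result=None, exc=None):
        timer.cancel()
        finish(result, exc)

    start_async(on_done)
    return future
```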
Hum, I don't have access to a prod instance to see which existing metrics could be used to visualize these, unfortunately (I changed employers after participating in this project). Maybe this helps for cQ? bgQ would be a carbon-level metric, maybe …
I see... So it seems to be working as configured.
My understanding is different, as per the above. If you had bounded the cache size but set USE_FLOW_CONTROL=True, the slow machines would stop accepting points rather than dropping them. Assuming you have a load balancer or redundant writers, it should work.
What you suggest amounts to having cQ discard points when it is full. Not only would it discard points, but it would also prevent back pressure from working for those who bound the cache size.
You are right, I meant that clients (relays with bounded queues, in my case) would start dropping points.
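For clarity, here are the two behaviours sketched with a plain bounded queue (the size is arbitrary; carbon's actual cache and USE_FLOW_CONTROL handling are more involved):

```python
import queue

cache = queue.Queue(maxsize=100000)  # stand-in for a bounded point cache


def enqueue_with_flow_control(point):
    # The producer blocks until space frees up; upstream relays see the
    # slowdown and can buffer or reroute the points (back pressure).
    cache.put(point, block=True)


def enqueue_with_drop(point):
    # The producer never blocks; once the cache is full, points are silently
    # lost on the overloaded node and upstream never notices.
    try:
        cache.put_nowait(point)
    except queue.Full:
        pass  # point dropped
```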
My understanding of the current code is that …
FYI: https://github.com/criteo/biggraphite/blob/master/biggraphite/drivers/cassandra.py#L528 should already limit the size of the Cassandra queue.
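For reference, a common way such a limit is enforced is a semaphore around execute_async; this is only a sketch of the idea (the bound and the helper are assumptions, not the code at that line):

```python
import threading

MAX_IN_FLIGHT = 512  # illustrative bound, not the driver's actual setting

_in_flight = threading.BoundedSemaphore(MAX_IN_FLIGHT)


def execute_async_bounded(session, statement, params):
    # Blocks when too many requests are already pending, which keeps the
    # Cassandra-side queue (cQ) bounded instead of letting it grow.
    _in_flight.acquire()
    future = session.execute_async(statement, params)
    future.add_callbacks(
        callback=lambda _rows: _in_flight.release(),
        errback=lambda _exc: _in_flight.release(),
    )
    return future
```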
Hello, I did some analysis of the memory consumption of BigGraphite and want to share the current state.

Writer

For roughly 4 hours I counted every object at a 60s interval on one of my writers. Writing into Cassandra is more or less stable and memory consumption is OK, until you reach a point where Cassandra cannot handle the high write throughput anymore. Here is a backref graph of one Stage object:

My retention is "60s:14d,5m:365d"; does this mean the downsampler will hold the data until we reach one year? So one metric consists of:
x is the number of retention stages; for my retention it is 2. For 42k metrics I see an increase of 400MB memory usage in 7 days. I think this will lead to an OOM at some point.

Reader

Let's have a look into the reader. Memory usage is increasing very fast, and this is most likely the Django cache, I would say. I'm running graphite-web with uWSGI, where I'm able to restart workers after some time and/or after a specified number of requests, to get some kind of garbage collection and to stabilize the memory usage.
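For reference, a minimal way to do the kind of periodic object census described above from inside the writer process (a sketch on CPython's gc module; the actual measurement setup may differ):

```python
import gc
import time
from collections import Counter


def count_objects_by_type(top_n=20):
    """Count live objects grouped by type name (CPython's gc module)."""
    counts = Counter(type(obj).__name__ for obj in gc.get_objects())
    return counts.most_common(top_n)


def census_loop(interval_s=60):
    # Log the most common live object types every minute to spot types that
    # only ever grow, e.g. Stage or downsampler-related objects.
    while True:
        timestamp = time.strftime("%H:%M:%S")
        for name, count in count_objects_by_type():
            print("%s %s %d" % (timestamp, name, count))
        time.sleep(interval_s)
```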
Hello, thanks for the detailed report, I'll take a look tomorrow morning :).
So:

Writer

You are right about most of your analysis. To answer some of your questions:
Some things we never really looked at, but should (cc @Thib17, @adericbourg):
Note that we run replicated carbon-caches in cgroups, so we do have leaks but we don't notice them that much.

Reader

We always ran wsgi with reload-rss (and max-requests or something like that) to make sure leaks would not kill our machines. If you use Django's cache, you should use it with memcached to make sure your workers are all sharing the same data. But all the cache items set by BigGraphite should have a TTL, so it shouldn't leak too much. @Thib17 can you paste the wsgi.conf we use in the wiki or here? Thanks
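For the shared Django cache, the usual memcached-backed setup looks roughly like this (host/port and TTL are assumptions to adjust to your deployment; graphite-web also exposes this via MEMCACHE_HOSTS in local_settings.py):

```python
# Django settings sketch: a memcached-backed cache shared by all workers.
CACHES = {
    "default": {
        "BACKEND": "django.core.cache.backends.memcached.MemcachedCache",
        "LOCATION": "127.0.0.1:11211",
        "TIMEOUT": 60,  # default TTL for cached items, in seconds
    }
}
```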
@Thib17: ping :)
Hello,
I'm running BigGraphite 0.11 and ran into weird behavior recently:
One of the bg-carbon-cache processes stopped reporting metrics. Then it leaked memory until the system OOM killer killed it after some hours. I cannot tell whether it continued to write data points during this time. The log file does not contain anything in particular.
I'm running with PyPy 6.0.0 for Python 2.7, with BG_CACHE = memory. The system is receiving ~150K points/s.
Other processes on the same and other machines didn't have any issues, and Cassandra was OK.
Any help fixing this kind of issue would be appreciated.
May 24 04:50:18 biggraphite-2 kernel: [131633.221088] Killed process 14650 (bg-carbon-cache) total-vm:13337580kB, anon-rss:12265424kB, file-rss:0kB, shmem-rss:0kB