bg-carbon-cache processes stop working #296
Between 0.8.9 and 0.8.8 there should not be many differences. On top of updates.log, carbon should log a few more things; could you redirect them somewhere you can inspect them (e.g. to syslog)? For the propagation errors, things should work fine as long as the tables are created in advance; it's quite annoying that Cassandra doesn't provide that out of the box. I'll see what can be done to improve this, but it's hard without centralized locking... maybe a tool that reads your carbon.conf and creates all the tables on startup. It's weird, because we never really had issues with that ourselves. If it can help, we can schedule some time to debug it together on slack/irc. |
We appreciate the offer and would like that. If you have a slack channel you could invite us to, we can provide more details of the setup. We are running everything within docker, but plan to test installing BG in a VM to see if that helps with our problem. |
We can discuss on https://gitter.im/biggraphite/, I'll be there tomorrow from 9am to 6pm CET. |
So we were able to stop bg-carbon-cache from stopping by removing all roll-up retentions from storage-schemas.conf, and now the processes are stable. We are not sure which particular retention could be problematic, as we have quite a few. I am pasting here our unique retention policies: retentions = 120s:30d,1200s:1y. From those we kept only the first retention of the list (if a rule had more than one), and the carbon process works as expected. We tested this by leaving only 3 rules out of the many, and carbon always stopped whenever a rule had multiple retentions. |
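For reference, the change described above amounts to something like this in storage-schemas.conf (the section name and pattern below are illustrative, not the reporter's actual rules):

[illustrative_rule]
pattern = .*
# multiple retentions per rule (roll-ups) triggered the hangs:
# retentions = 120s:30d,1200s:1y
# keeping only the first retention kept carbon stable:
retentions = 120s:30d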
I ended up restarting stuck carbon instances with this script (jinja2 template for ansible) running every minute:
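A minimal sketch of what such a watchdog can look like in plain Python (the log path, systemd unit name, and staleness threshold are all assumptions, not the original template):

#!/usr/bin/env python
# Restart a carbon-cache instance whose updates.log has gone quiet.
import os
import subprocess
import time

UPDATES_LOG = "/var/log/carbon/updates.log"  # assumed log location
SERVICE = "carbon-cache"                     # assumed systemd unit name
STALE_AFTER_SECONDS = 120                    # assumed "stuck" threshold

def main():
    try:
        age = time.time() - os.path.getmtime(UPDATES_LOG)
    except OSError:
        age = float("inf")  # no log file at all: treat as stuck
    if age > STALE_AFTER_SECONDS:
        subprocess.check_call(["systemctl", "restart", SERVICE])

if __name__ == "__main__":
    main()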
|
I still haven't been able to reproduce this. |
Hey, I am also struggling with this issue. Setup: 3 biggraphite servers on top of a 3-node cassandra cluster.
Here are some gdb python stacks, captured with py-bt, py-list, and thread apply all py-list.
Mostly python is hanging around in the threading module (threads 2, 3, 4, 6, 7, 8), one thread is in asyncore.py (thread 5), and one thread is in twisted (thread 1). If more information is needed, don't hesitate to ask for it. |
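For anyone who wants to capture the same kind of stacks, this is typically done by attaching gdb to the stuck process with the CPython debugging extensions loaded (the pid below is a placeholder):

$ gdb -p <pid-of-bg-carbon-cache>
(gdb) py-bt
(gdb) py-list
(gdb) thread apply all py-list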
I dived a little bit deeper into the thread frames and found:
|
Can you try to run that in a venv with a recent version of pypy?
# You can also get a portable PyPy build:
$ wget https://bitbucket.org/squeaky/portable-pypy/downloads/pypy-5.9-linux_x86_64-portable.tar.bz2
$ tar -xf pypy-5.9-linux_x86_64-portable.tar.bz2
$ ./pypy-5.9-linux_x86_64-portable/bin/virtualenv-pypy venv
New pypy executable in /tmp/venv/bin/pypy
Installing setuptools, pip, wheel...done.
$ export GRAPHITE_NO_PREFIX=true
$ source venv/bin/activate
(venv) $ pip install biggraphite
From the stacktrace it looks like it's waiting for an asynchronous cassandra operation to finish. This should not stay there indefinitely, because the cassandra driver eventually sends an exception back (except if something is wrong with the driver). I guess a timeout could be added to https://github.com/criteo/biggraphite/blob/master/biggraphite/accessor.py#L92, but if the cassandra driver doesn't pop exceptions up, this isn't going to fix much. I'd really recommend trying to reproduce that with a recent version of either pypy or python 2.7. |
So after a few days of debugging:
|
Ok, looks like we never got bitten by this because we use pypy (which is highly recommended, see https://datastax.github.io/python-driver/performance.html). I'll be happy to merge a patch that adds a default timeout to event.wait() (probably a few seconds). |
Sounds good; a timeout of 3 seconds was at least good for my setup. I didn't know that about pypy, and looking at my metrics the performance is much better. Will switch to pypy. But I think a timeout on a blocking function is always a good idea. ;) |
👍 for this bug, we see the same with cPython. Setting the timeout to 3 seconds alleviated the problem for us. Now testing with PyPy. I would be willing to create a patch for this, but I'm not sure what changes this would need beyond passing timeout=3 to event.wait(). |
Adding a non-zero timeout for event.wait() is probably the correct thing to do. |
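A minimal sketch of the pattern under discussion (illustrative names; this is not biggraphite's actual code, which wraps the cassandra driver's asynchronous futures):

import threading

SYNC_TIMEOUT_SECONDS = 3  # "a few seconds", as suggested above

def wait_for_async_result(event, errors):
    # event.wait() with no timeout can block forever if the driver's
    # callback never fires; a bounded wait surfaces the failure instead.
    if not event.wait(timeout=SYNC_TIMEOUT_SECONDS):
        raise RuntimeError("timed out waiting for cassandra operation")
    if errors:
        raise errors[0]

# Usage sketch: the driver's callback/errback are expected to set the event.
event = threading.Event()
errors = []
# future.add_callbacks(lambda _: event.set(),
#                      lambda exc: (errors.append(exc), event.set()))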
Hello guys,
I am running BigGraphite 0.8.9 with the following:
2 cache servers with 6 bg-carbon-cache processes, 1 relay process each
1 relay server with 1 relay process
4 cassandra nodes (docker image) with replication factor 1 (tried with RF=3, got propagation errors)
And another BigGraphite cluster on 0.8.8 with:
4 cache servers with 6 bg-carbon-cache processes, 1 relay process each
1 relay server with 1 relay process
12 cassandra nodes (rpm install) with replication factor 1 (tried with RF=2, got propagation errors)
Each cluster is receiving approximately 400K-500K metrics per minute, and after a while the bg-carbon-cache processes stop working: the relays keep sending data, but the daemons stop processing it (no updates.log activity), even though the processes themselves are up. This is more common with 0.8.8, where as soon as the first process stops the others follow.
In dev (0.8.9) I've had only one process per cache server stop, but it still stops for no apparent reason; as I said, there are no exceptions in the logs for either graphite or cassandra.
Any ideas on what could be going on?