bg-carbon-cache processes stop working #296
Between 0.8.9 and 0.8.8 there should not be many differences. On top of updates.log, carbon should log a few more things; could you redirect them somewhere you can inspect them (e.g. to syslog)? For the propagation errors, things should work fine as long as the tables are created in advance; it's quite annoying that Cassandra doesn't provide that out of the box. I'll see what can be done to improve this, but it's hard without centralized locking... maybe a tool that reads your carbon.conf and creates all the tables on startup. It's weird, because we never really had issues with that ourselves. If it can help, we can schedule some time to debug it together on slack/irc. |
We appreciate the offer and would like that. If you have a slack channel you could invite us to, we can provide more details of the setup. We are running everything within docker, but plan to test installing BG in a VM to see if that helps with our problem. |
We can discuss on https://gitter.im/biggraphite/, I'll be there tomorrow from 9am to 6pm CET. |
So we were able to stop bg-carbon-cache from stopping by removing all roll-up retentions from storage-schemas.conf, and now the processes are stable. We are not sure which particular retention could be problematic, as we have quite a few. I am pasting here our unique retention policies: retentions = 120s:30d,1200s:1y. From those we kept only the first retention of the list (if a rule had more than one), and the carbon process works as expected. We tested this by leaving only 3 rules out of the many, and carbon always stopped whenever a rule had multiple retentions. |
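For reference, the change described above amounts to something like this in storage-schemas.conf (the section name and pattern below are illustrative, not the reporter's actual rules):

[illustrative_rule]
pattern = .*
# multiple retentions per rule (roll-ups) triggered the hangs:
# retentions = 120s:30d,1200s:1y
# keeping only the first retention kept carbon stable:
retentions = 120s:30d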
I ended up restarting stuck carbon instances with this script (jinja2 template for ansible) running every minute:
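A minimal sketch of what such a watchdog can look like in plain Python (the log path, systemd unit name, and staleness threshold are all assumptions, not the original template):

#!/usr/bin/env python
# Restart a carbon-cache instance whose updates.log has gone quiet.
import os
import subprocess
import time

UPDATES_LOG = "/var/log/carbon/updates.log"  # assumed log location
SERVICE = "carbon-cache"                     # assumed systemd unit name
STALE_AFTER_SECONDS = 120                    # assumed "stuck" threshold

def main():
    try:
        age = time.time() - os.path.getmtime(UPDATES_LOG)
    except OSError:
        age = float("inf")  # no log file at all: treat as stuck
    if age > STALE_AFTER_SECONDS:
        subprocess.check_call(["systemctl", "restart", SERVICE])

if __name__ == "__main__":
    main()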
|
I still haven't been able to reproduce this. |
Hey, I am also struggling with this issue. Setup: 3 biggraphite servers on top of a 3-node cassandra cluster.
Here are some gdb python stacks, captured with py-bt, py-list, and thread apply all py-list.
Mostly python is hanging around in the threading module (threads 2, 3, 4, 6, 7, 8), one thread is in asyncore.py (thread 5), and one thread is in twisted (thread 1). If more information is needed, don't hesitate to ask for it. |
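For anyone who wants to capture the same kind of stacks, this is typically done by attaching gdb to the stuck process with the CPython debugging extensions loaded (the pid below is a placeholder):

$ gdb -p <pid-of-bg-carbon-cache>
(gdb) py-bt
(gdb) py-list
(gdb) thread apply all py-list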
I dived a little bit deeper into the thread frames and found:
|
Can you try to run that in a venv with a recent version of pypy?
# You can also get a portable PyPy build:
$ wget https://bitbucket.org/squeaky/portable-pypy/downloads/pypy-5.9-linux_x86_64-portable.tar.bz2
$ tar -xf pypy-5.9-linux_x86_64-portable.tar.bz2
$ ./pypy-5.9-linux_x86_64-portable/bin/virtualenv-pypy venv
New pypy executable in /tmp/venv/bin/pypy
Installing setuptools, pip, wheel...done.
$ export GRAPHITE_NO_PREFIX=true
$ source venv/bin/activate
(venv) $ pip install biggraphite
From the stacktrace it looks like it's waiting for an asynchronous cassandra operation to finish. This should not stay there indefinitely, because the cassandra driver eventually sends an exception back (except if something is wrong with the driver). I guess a timeout could be added to https://github.com/criteo/biggraphite/blob/master/biggraphite/accessor.py#L92, but if the cassandra driver doesn't pop exceptions up, this isn't going to fix much. I'd really recommend trying to reproduce that with a recent version of either pypy or python 2.7. |
So after a few days of debugging:
|
Ok, looks like we never got bitten by this because we use pypy (which is highly recommended, see https://datastax.github.io/python-driver/performance.html). I'll be happy to merge a patch that adds a default timeout to event.wait() (probably a few seconds). |
Sounds good; a timeout of 3 seconds was at least good for my setup. I didn't know that about pypy, and looking at my metrics the performance is much better. Will switch to pypy. But I think a timeout on a blocking function is always a good idea. ;) |
👍 for this bug, we see the same with cPython. Setting the timeout to 3 seconds alleviated the problem for us. Now testing with PyPy. I would be willing to create a patch for this, but I'm not sure what changes this would need beyond passing timeout=3 to event.wait(). |
Adding a non-zero timeout for event.wait() is probably the correct thing to do. |
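A minimal sketch of the pattern under discussion (illustrative names; this is not biggraphite's actual code, which wraps the cassandra driver's asynchronous futures):

import threading

SYNC_TIMEOUT_SECONDS = 3  # "a few seconds", as suggested above

def wait_for_async_result(event, errors):
    # event.wait() with no timeout can block forever if the driver's
    # callback never fires; a bounded wait surfaces the failure instead.
    if not event.wait(timeout=SYNC_TIMEOUT_SECONDS):
        raise RuntimeError("timed out waiting for cassandra operation")
    if errors:
        raise errors[0]

# Usage sketch: the driver's callback/errback are expected to set the event.
event = threading.Event()
errors = []
# future.add_callbacks(lambda _: event.set(),
#                      lambda exc: (errors.append(exc), event.set()))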
Hello guys,
I am running BigGraphite 0.8.9 with the following:
2 cache servers with 6 bg-carbon-cache processes, 1 relay process each
1 relay server with 1 relay process
4 cassandra nodes (docker image) with replication factor 1 (tried with RF=3, got propagation errors)
And another BigGraphite cluster on 0.8.8 with:
4 cache servers with 6 bg-carbon-cache processes, 1 relay process each
1 relay server with 1 relay process
12 cassandra nodes (rpm install) with replication factor 1 (tried with RF=2, got propagation errors)
Each cluster is receiving approximately 400K-500K metrics per minute, and after a while the bg-carbon-cache processes stop working: the relays keep sending data, but the daemons stop processing it (no updates.log activity), even though the processes themselves are up. This is more common with 0.8.8, where as soon as the first process stops the others follow.
In dev (0.8.9) I've had only one process per cache server stop, but it still stops for no apparent reason; as I said, there are no exceptions in the logs for either graphite or cassandra.
Any ideas on what could be going on?