
Failed process #337

Open

github-vincent-miszczak opened this issue May 24, 2018 · 18 comments

@github-vincent-miszczak

Hello,

I'm running BigGraphite 0.11 and recently hit some weird behavior:
One of the bg-carbon-cache processes stopped reporting metrics. It then leaked memory until the system OOM killer killed it a few hours later. I cannot tell whether it continued to write data points during this time. The log file does not show anything unusual.

I'm running with PyPy 6.0.0 for Python 2.7, BG_CACHE = memory.
The system is receiving ~150K points/s.

Other processes on the same machine and on other machines didn't have any issues, and Cassandra was OK.

Any help fixing this kind of issue would be appreciated.

May 24 04:50:18 biggraphite-2 kernel: [131633.221088] Killed process 14650 (bg-carbon-cache) total-vm:13337580kB, anon-rss:12265424kB, file-rss:0kB, shmem-rss:0kB

[image]

[image]

@iksaif (Contributor) commented May 24, 2018

I think we can blame the bug fixed by edf2c6e

I'll try to do a release soon.

@github-vincent-miszczak (Author)

Thank you for the quick answer, I'll give the master branch a try.

@github-vincent-miszczak (Author)

Hello,
I'm now running with a timeout (I patched the line from 0.11 directly rather than running master).

I have some thoughts about the code/patch: is this patch OK? Won't it leak memory if the timeout is triggered multiple times and the async functions never return?
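
To make the concern concrete, here is a minimal sketch of the sync-over-async pattern with a timeout; the names (wait_async, on_done) are illustrative, not BigGraphite's actual code:

```python
import threading

def wait_async(async_fn, timeout=10.0):
    """Run an async function and block until its callback fires or a timeout expires."""
    done = threading.Event()
    result_box = []            # captured by the callback below

    def on_done(result):
        result_box.append(result)
        done.set()

    async_fn(on_done)
    if not done.wait(timeout):
        # Timeout: we return, but `on_done`, `done` and `result_box` stay
        # referenced by the still-pending async call. If the callback never
        # fires, that memory is only reclaimed when the pending call itself
        # is garbage-collected, so repeated misses can accumulate.
        return None
    return result_box[0]
```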

@iksaif (Contributor) commented May 25, 2018

We haven't been able to reproduce this bug with PyPy, which is what we use, so I don't really know.
You might end up leaking some memory on missed timeouts; it would be interesting to add a metric for these events to know whether they really happen a lot.

@github-vincent-miszczak (Author)

OK, I'll try to trace those events.
I'm discovering the code and was also wondering about the purpose of the _wait_async_call function.
Generally speaking you don't want to wait for async functions: you pass a callback if some work needs to be done on completion, or you block on an event in another thread. Without an explanation, it looks like a complicated way to just call a function.

Can you give me some clues about the rationale behind it?

@iksaif (Contributor) commented May 25, 2018

This was originally written by @unbrice. I think the idea was that we had async functions and that in some cases we wanted sync versions of them.

We want synchronous versions of functions like insert_points() to have some (minimal) backpressure mechanism (when Cassandra becomes slow, we slow down too).

This is being used here: https://github.com/criteo/biggraphite/blob/master/biggraphite/plugins/carbon.py#L61
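
To sketch the idea (this is not the exact code in accessor.py; the function names and signatures here are assumptions):

```python
import threading

def insert_points_sync(accessor, metric, datapoints):
    """Block the (single-threaded) carbon writer until Cassandra has
    acknowledged the write, so a slow Cassandra slows carbon down too."""
    done = threading.Event()
    error_box = []

    def on_done(exc=None):
        if exc is not None:
            error_box.append(exc)
        done.set()

    # Assumed async API shape; the real signature lives in accessor.py.
    accessor.insert_points_async(metric, datapoints, on_done=on_done)
    done.wait()                 # this blocking wait *is* the backpressure
    if error_box:
        raise error_box[0]
```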

@github-vincent-miszczak (Author) commented May 25, 2018

Thank you for the explanation.

I added 2 metrics to track the issue:
[image]

https://github.com/hjdr4/biggraphite/blob/track-event-timeouts/biggraphite/accessor.py#L81-L106

The timeout is regularly triggered, so it seems that synchronous write throttling won't be as efficient as before.
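
The tracking itself is just a couple of counters around the wait, roughly like this (simplified; the real change is in the branch linked above and hooks into carbon's instrumentation rather than module globals):

```python
import threading

_lock = threading.Lock()
EVENT_TIMEOUTS = 0    # waits that gave up before the callback fired
PENDING_EVENTS = 0    # async calls started minus callbacks observed

def start_async(async_fn, on_done):
    """Wrap an async call so we can see how many callbacks never return."""
    global PENDING_EVENTS
    with _lock:
        PENDING_EVENTS += 1

    def counted_on_done(*args, **kwargs):
        global PENDING_EVENTS
        with _lock:
            PENDING_EVENTS -= 1
        return on_done(*args, **kwargs)

    async_fn(counted_on_done)

def wait_with_tracking(event, timeout):
    """event.wait() that also counts the waits that timed out."""
    global EVENT_TIMEOUTS
    completed = event.wait(timeout)
    if not completed:
        with _lock:
            EVENT_TIMEOUTS += 1
    return completed
```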

@iksaif (Contributor) commented May 26, 2018

Which Python version are you using?

I'm starting to think that there must be something quite wrong in

`def insert_downsampled_points_async(self, metric, datapoints, on_done=None):`

but it would be great to be able to reproduce it in a unit test.

Could you try running with PyPy to validate that this affects only CPython (which is the current theory)?

@github-vincent-miszczak (Author)

Hello,
As I posted in my first comment, I'm running PyPy 6.0.0 for Python 2.7 already.

Over the weekend, the values stabilized after one Cassandra node stopped consuming 100% CPU. This is not related to the original issue; Cassandra was OK at that time.

For the last graph, the timeouts occurred because I put too much load on the cluster while importing production datapoints, and one Cassandra node took hours to stabilize its resource usage after the flow was shut down. But it shows that timeouts can occur when Cassandra is loaded, and that a hard-coded timeout can reduce throttling efficiency.
Looking at PENDING_EVENTS, it looks like I got no race during the weekend: the ratio stayed between 0 and 1 (meaning every async call returned).

Just to recap:

  • Using PyPy 6.0.0, a race can occur if no timeout is provided to event.wait().
  • When a timeout is set, we can mitigate the race by not blocking the whole application, but this may reduce throttling efficiency when Cassandra is loaded.

In real life, I will be OK with this workaround.
For now, I will try to add some metrics to graph the synchronous call time.
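
Something along these lines should be enough to graph it (illustrative only; report_metric is a placeholder for whatever instrumentation hook is available):

```python
import time

def report_metric(name, value):
    # Placeholder sink; in practice this would go to carbon's own
    # instrumentation or to statsd.
    print("%s %f" % (name, value))

def timed_sync_call(fn, *args, **kwargs):
    """Time a blocking write so its latency can be graphed alongside the
    timeout counters above (a sketch, not the code I actually deployed)."""
    start = time.time()
    try:
        return fn(*args, **kwargs)
    finally:
        report_metric("sync_call_seconds", time.time() - start)
```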

[image]

[image]

@unbrice (Contributor) commented May 28, 2018

I think @vmiszczak-teads is right:

Is this patch OK? Won't it leak memory if the timeout is triggered multiple times and the async functions never return?

Indeed, edf2c6e only displaces the leak. It does so at the expense of removing error propagation, limiting back pressure, and introducing a race condition.

To understand the problem, note that there are two work queues: one for the Cassandra driver ("cQ") and one for BigGraphite ("bgQ").
edf2c6e causes bgQ not to wait for cQ to consume items.
#296 occurred when cQ was full. But with edf2c6e, bgQ gets full instead, so we get #337.

edf2c6e is also bad because it prevents exceptions from bubbling up, and because it removes the protection on exception_box.

I think the solution should instead be to bound the queue sizes.
This can be achieved by propagating back pressure further up (if we can) or by explicitly dropping points (if we cannot). Then BG will degrade more gracefully.
I see carbon has USE_FLOW_CONTROL, but I haven't checked if it does the right thing. Is it set to true in your setup?

This being said, we are really speaking of graceful degradation here. It is likely that another issue in your setup is causing cQ (and bgQ) not to empty fast enough. I would check the system health of the specific nodes running BG (FDs, ...) and the health of the Cassandra cluster (do you have probes?).
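
To make the two degradation modes concrete, a sketch with a plain bounded queue (nothing BigGraphite-specific; the size is illustrative):

```python
import Queue  # Python 2 module, matching the PyPy 2.7 setup above; `queue` on Python 3

bounded = Queue.Queue(maxsize=10000)

def enqueue_with_backpressure(point):
    # Blocks the producer when the queue is full: pressure propagates upstream.
    bounded.put(point, block=True)

def enqueue_or_drop(point):
    # Never blocks: when the queue is full the point is dropped explicitly,
    # which keeps memory bounded at the cost of data loss.
    try:
        bounded.put_nowait(point)
        return True
    except Queue.Full:
        return False
```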

@github-vincent-miszczak (Author)

#296 and #337 both occurred without edf2c6e.
I was running vanilla 0.11 with PyPy when the bug triggered. It also happened on only one process among 48 (12 per node over 4 nodes), and Cassandra was OK at the time. To monitor Cassandra, I'm using advanced dashboards on Datadog, and everything was OK (CPU OK, no hints, no dropped messages, low task queues, latency OK, nothing weird).

About #296, is there a metric that shows cQ/bgQ? I could graph it at the moment the problem occurred, comparing different processes/nodes.

My USE_FLOW_CONTROL is True, but I have MAX_CACHE_SIZE = inf. When the bug triggered, back pressure was put on Carbon itself, filling up the memory. If I had bounded the cache size, Carbon would have dropped points forever and the process would never have crashed and been restarted by systemd (even worse).

IMHO the real issue is that an async call can never return (this is the problem I had). The async call should ensure/promise that it will return within a bounded time. I think a timeout is necessary, but it should not be on waiting for the async call to complete; it should be on the async call itself, to avoid leaking anything.

Also, the synchronous writes do not seem to take pending async writes into account, so I'm not even sure the original throttle design did the job. Maybe a flush() method waiting for all pending async tasks to complete would make more sense; it would also be appropriate for the back-pressure mechanism.

@unbrice I did not understand your point on queue sizes; queues can be full only if they are bounded.

#296 occurred when cQ was full. But with edf2c6e, bgQ gets full instead

I think the solution should instead be to bound the queue sizes.
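
The flush() idea would look roughly like this (a hypothetical helper, not existing BigGraphite code):

```python
import threading

class AsyncWriteTracker(object):
    """Hypothetical helper: remembers how many async writes are in flight so a
    flush() can block until Cassandra has drained all of them."""

    def __init__(self):
        self._lock = threading.Lock()
        self._drained = threading.Condition(self._lock)
        self._in_flight = 0

    def submit(self, async_fn, *args, **kwargs):
        with self._lock:
            self._in_flight += 1

        def on_done(*_args, **_kwargs):
            with self._lock:
                self._in_flight -= 1
                if self._in_flight == 0:
                    self._drained.notify_all()

        kwargs["on_done"] = on_done
        async_fn(*args, **kwargs)

    def flush(self, timeout=None):
        """Block until every submitted write has completed (or the timeout expires)."""
        with self._lock:
            if self._in_flight:
                self._drained.wait(timeout)
            return self._in_flight == 0
```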

@unbrice (Contributor) commented May 28, 2018

About #296, is there a metric that shows cQ/bgQ? I could graph it at the moment the problem occurred, comparing different processes/nodes.

Hmm, I don't have access to a prod instance to see which existing metrics could be used to visualize these, unfortunately (I changed employers after participating in this project). Maybe this helps for cQ? bgQ would be a carbon-level metric, maybe carbon.agents.<instance>.cache.size.

My USE_FLOW_CONTROL is True, but I have MAX_CACHE_SIZE = inf. When the bug triggered, back pressure was put on Carbon itself, filling up the memory.

I see... So it seems to be working as configured.
Fundamentally, you must pick your poison between:

  • MAX_CACHE_SIZE=some_number, USE_FLOW_CONTROL=True: points may be rejected, storage usage is bounded
  • MAX_CACHE_SIZE=some_number, USE_FLOW_CONTROL=False: points may be lost, storage usage is bounded
  • MAX_CACHE_SIZE=inf: points are never rejected or lost, but storage usage is not bounded. This allows for "interesting" snowballing effects when, e.g., the GC slows down.

If I had bounded the cache size, Carbon would have dropped points forever and the process would never have crashed and been restarted by systemd (even worse).

My understanding is different, as per the above. If you had bounded the cache size but set USE_FLOW_CONTROL=True, the slow machines would stop accepting points rather than dropping them. Assuming you have a load balancer or redundant writers, it should work.
If you don't have an LB, yet you care about not dropping points, then I'd love to discuss your setup IRL next time I pass by Montpellier. But that's probably beyond the scope of this bug ;-)

The async call should ensure/promise that it will return within a bounded time. I think a timeout is necessary, but it should not be on waiting for the async call to complete; it should be on the async call itself, to avoid leaking anything.

What you suggest amounts to having cQ discard points when it is full. Not only would it discard points, it would also prevent backpressure from working for those who bound the cache size.

@github-vincent-miszczak (Author)

If you had bounded the cache size but set USE_FLOW_CONTROL=True, the slow machines would stop accepting points rather than dropping them.

You are right, I meant that clients (a relay with bounded queues, in my case) would start dropping points.

What you suggest amounts to having cQ discard points when it is full. Not only would it discard points, it would also prevent backpressure from working for those who bound the cache size.

My understanding of the current code is that insert_downsampled_points_async should already be time-bounded by DEFAULT_TIMEOUT. For some reason (PyPy bug, Cassandra bug, BG bug?) that is not 100% true.
Also, my understanding of sync writes is that they wait for one async write to complete but do not care about the other async writes, so there is no real pressure. I made minor changes to slow down until Cassandra is able to write all requests; they also bound the number of in-flight async write requests. Finally, I removed the strict versioning of the Cassandra driver, as the bug described there has been fixed. master...hjdr4:sync-writes
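
The core of that change is essentially a counting semaphore around the async submissions, roughly like this (simplified; the bound and names here are illustrative, the real values are in the branch):

```python
import threading

MAX_IN_FLIGHT = 512   # illustrative bound, not the value used in the branch

_slots = threading.BoundedSemaphore(MAX_IN_FLIGHT)

def bounded_insert_async(insert_async, metric, datapoints, on_done=None):
    """Block the caller once MAX_IN_FLIGHT writes are pending, releasing a
    slot whenever Cassandra acknowledges one. The blocking acquire is what
    provides backpressure, instead of waiting on each individual write."""
    _slots.acquire()            # blocks when Cassandra falls behind

    def release_then_done(*args, **kwargs):
        _slots.release()
        if on_done is not None:
            on_done(*args, **kwargs)

    insert_async(metric, datapoints, on_done=release_then_done)
```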

@iksaif (Contributor) commented May 29, 2018

FYI: https://github.com/criteo/biggraphite/blob/master/biggraphite/drivers/cassandra.py#L528 should already limit the size of the Cassandra queue.
Because Carbon is single-threaded, waiting for one write from time to time should "mostly" work. But you are right, this is not a strict guarantee.

@NasusOptimus

Hello,

I made some analyses of the memory consumption of BigGraphite and want to share the current state.
First of all, I split my production servers into readers and writers.

Writer

For roughly 4 hours I counted every object at a 60s interval on one of my writers.
This writer was receiving around 42k metrics per second and writing into a 6-node Cassandra cluster.

[image: biggraphite_graphs]

So writing into Cassandra is more or less stable and memory consumption is OK, until you reach a point where Cassandra cannot handle the high write throughput anymore.
At that point you see an increase of BytesIO objects, which should be the Cassandra write buffer if I'm right. So one finding: if Cassandra cannot handle the load, then the Cassandra driver (I think) creates more and more BytesIO objects until OOM. OK, this is not a surprise. I stabilized Cassandra and looked again at what was happening.
Over a couple of weeks I made some object counts, because at first I did not see an increase in memory usage. But what I found is that Stage objects are slowly increasing over time. Maybe this is worth looking into.
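
For reference, the counting itself was nothing fancy; something along these lines (gc-based, so it only sees objects the collector tracks, and the type names are just the ones I graphed):

```python
import gc
import time
from collections import Counter

def count_objects_by_type():
    """Count live objects per type name, as seen by the garbage collector."""
    return Counter(type(obj).__name__ for obj in gc.get_objects())

def sample_forever(interval=60):
    """Print selected type counts every `interval` seconds, Graphite-style."""
    while True:
        counts = count_objects_by_type()
        for name in ("BytesIO", "Stage", "Metric", "MetricMetadata"):
            print("%d %s %d" % (time.time(), name, counts.get(name, 0)))
        time.sleep(interval)
```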

Here is a backref graph of one Stage object:

[image: mem_profile_stage_li]

My retention is "60s:14d,5m:365d"; does that mean the downsampler will hold the data until we reach one year?
To be honest, I don't completely understand the stage/downsampling stuff; maybe someone can explain it to me.

So one metric consists of:

  • 1 Metric object
  • 1 MetricAggregates object
  • x MetricMetadata objects
  • x Retention objects
  • z Stage objects

x is the number of retentions; for my retention it is 2.
But z is unknown to me. How many Stages are created per metric?

For 42k metrics I see an increase of 400MB of memory usage in 7 days. I think this will lead to an OOM at some point.
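
Back-of-the-envelope, assuming the growth stays linear:

```python
# Rough extrapolation from the numbers above, assuming linear growth.
metrics = 42000
growth_bytes_per_week = 400 * 1024 * 1024        # ~400 MB over 7 days

per_metric_per_week = growth_bytes_per_week / float(metrics)
print("%.1f KB per metric per week" % (per_metric_per_week / 1024))   # ~10 KB

# With, say, 8 GB of headroom per process (an assumption, not my actual limit),
# the leak alone would need 8 * 1024 / 400 = ~20 weeks to exhaust it:
# slow, but still an OOM eventually.
```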

Reader

Let's have a look at the reader. Memory usage is increasing very fast, and I would say this is most likely the Django cache.
A good (but old) explanation can be found here: https://blog.gingerlime.com/2011/django-memory-leaks-part-i/

I'm running graphite-web with uWSGI, where I'm able to recycle workers after some time and/or after a specified number of requests to get some kind of garbage collection and stabilize memory usage.

@iksaif (Contributor) commented Sep 12, 2018

Hello, thanks for the detailed report, I'll take a look tomorrow morning :).

@iksaif (Contributor) commented Sep 13, 2018

So

Writer

You are right about most of your analysis. To answer some of your questions:

  • with your retention, a metric will have only two stages (60s:14d and 5m:365d)
  • by default it will keep a few points at the first stage (10 minutes' worth of points if I'm not wrong, plus a count and sum for the second stage)

Some things we never really looked at, but should (cc @Thib17, @adericbourg):

  • Most of the MetricMetadata objects are essentially all the same; even if these objects are small, they should be deduplicated (keep a global dict of known MetricMetadata and use it when constructing new ones to share the reference; see the sketch after this list)
  • After how much time do we drop references to inactive metrics? In theory this should happen after a few hours (once the downsampler, delayed writer and metadata cache have flushed them). To be tested and potentially fixed.
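
A minimal interning sketch for the deduplication idea above (the key fields and constructor are placeholders, not BigGraphite's actual MetricMetadata API):

```python
# Global registry of already-seen metadata; identical settings share one object.
_KNOWN_METADATA = {}

def intern_metadata(aggregator, retention, carbon_xfilesfactor):
    """Return a shared metadata object for identical settings instead of
    building a new one per metric."""
    key = (aggregator, retention, carbon_xfilesfactor)
    cached = _KNOWN_METADATA.get(key)
    if cached is None:
        # Stand-in for the real constructor; a plain dict is enough to show the idea.
        cached = {"aggregator": aggregator,
                  "retention": retention,
                  "carbon_xfilesfactor": carbon_xfilesfactor}
        _KNOWN_METADATA[key] = cached
    return cached
```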

Note that we run replicated carbon-caches in cgroups, so we do have leaks but we don't notice them that much.

Reader

We always ran wsgi with reload-rss (and max-requests or something like that) to make sure leaks would not kill our machines. If you use Django's cache, you should use it with memcached to make sure your workers all share the same data. But all the cache items set by BigGraphite should have a TTL, so it shouldn't leak too much.
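
For the Django cache part, this is the kind of settings we mean, in graphite-web's local_settings.py (the host, port and TTL here are illustrative, not our production values):

```python
# local_settings.py sketch: a shared memcached backend with a TTL on every
# entry, so workers share data instead of growing their own in-process caches.
CACHES = {
    "default": {
        "BACKEND": "django.core.cache.backends.memcached.MemcachedCache",
        "LOCATION": "127.0.0.1:11211",
        "TIMEOUT": 60 * 60,   # illustrative 1h TTL for every cache entry
    }
}
```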

@Thib17 can you paste the wsgi.conf we use in the wiki or here? Thanks

@iksaif (Contributor) commented Sep 6, 2019

@Thib17 : ping :)
