
Memory leak in ingester after it has failed to write indexes to memcached #1852

Closed
weeco opened this issue Nov 23, 2019 · 9 comments

@weeco (Contributor) commented Nov 23, 2019

The problem:

I have been running Cortex v0.3.0 for a couple of weeks, and recently I have been seeing OOM-killed ingester replicas more frequently (roughly 1-2 times per day, which is quite a lot given that I only run 5 large ingesters).

Pod RAM request/limit in k8s: 40GB
Average RAM usage: 14-18GB
First OOM kill at: 11:29:45am
Second OOM kill at: 11:32:15am

Incident details (RAM usage, log messages etc.)

  1. Go heap allocated bytes before the ingester gets OOM killed (the two gaps in this screenshot are the two OOM kills):

     [screenshot: ram_cortex]

  2. Number of goroutines:

     [screenshot: go_routines]

  3. Number of log messages (seems to correlate with the restarts):

     [screenshot: log_messages]

Log messages:

Since there was an unusually large number of log messages, I wanted to know what kind of messages they were. I had to filter out the noisy ones anyway to get to the actually interesting entries:

  • 4k of the 22k log messages were "context cancelled" (QueryStream error)
  • 3k messages were "fingerprint collision detected, mapping to new fingerprint"
  • 15k messages were "sample timestamp out of order" (likely caused by this faulty ingester)

After filtering out these messages I had 31 messages left:

[screenshot: kibana-logs]

In short:

Every time I see these ingester OOM kills, it seems to start with memcached timeouts. In this case the ingester couldn't write its indexes to the "index-cache-write" memcached because the requests timed out. Afterwards memory seems to leak very quickly, which makes it hard to capture a heap dump. If the information provided in this issue is not sufficient to track down the leak, I'd be happy to provide whatever is needed to figure it out - please let me know what else I can do.

[screenshot: kibana-failed-memcached]

@bboreham (Contributor)

I have seen similar symptoms, and I don't know the cause.

I am planning to set up https://github.com/conprof/conprof to fetch heap and goroutine profiles, which should give us some solid leads. If you could set that up too, even better.
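
In the meantime, if conprof is awkward to deploy, a rough alternative is to poll the ingester's standard Go pprof endpoints in a loop, so that a recent profile is always on disk when the OOM kill happens. This is only a sketch - the pod name and HTTP port are assumptions about your deployment:

```sh
# Periodically capture heap and goroutine profiles from the ingester's
# built-in /debug/pprof endpoints (standard net/http/pprof handlers).
# Assumes the ingester's HTTP port is 80 and the pod is named ingester-0.
kubectl port-forward pod/ingester-0 8080:80 &

while true; do
  ts=$(date +%s)
  curl -s "http://localhost:8080/debug/pprof/heap"      -o "heap-${ts}.pprof"
  curl -s "http://localhost:8080/debug/pprof/goroutine" -o "goroutine-${ts}.pprof"
  sleep 30
done
```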

@weeco (Contributor, Author) commented Nov 25, 2019

Ok I'll look into conprof and I hope I can provide the desired profiling results within this week.

@weeco (Contributor, Author) commented Dec 1, 2019

Here's the report of another OOM kill, this time with heap & goroutine profiles as requested.

[screenshot: 2019-12-01 at 16:48]

I took a profile (goroutines + heap allocations) at 14:41, when everything was fine, and another at 14:48, shortly before the pod got OOM killed by Kubernetes. Both are attached as a zip.

OOM-Profiles.zip

Log messages for this pod around this time (after filtering the noisy log messages):

[screenshot of the filtered log messages: 2019-12-01 at 17:11]

@bboreham (Contributor) commented Dec 3, 2019

From your zip:

  • go-profile-fine.htm has 268 goroutines; go-profile-burning.htm has 333.
    45 of the extra are coming from gRPC, of which 11 are in Push_Handler and 25 in io.ReadFull.
  • heap-burning.htm shows 12,304MB allocated in grpc.recvAndDecompress.

So I would conclude all the growth is down to incoming Push calls being decompressed. EDIT: This turned out to be wrong - see later comment.

There is an option -server.grpc-max-concurrent-streams which defaults to 100; in theory setting it lower (e.g. to 10?) would limit the number of buffers which could be allocated at the same time.
I'm not sure if this will also limit queries, and how much of an issue that might cause.
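
For illustration, a rough sketch of what that might look like on the ingester command line. Only -server.grpc-max-concurrent-streams is the flag in question; the -target and -config.file values are placeholders for however you launch the ingester:

```sh
# Hypothetical invocation - only the last flag is the one under discussion.
cortex -target=ingester \
  -config.file=/etc/cortex/cortex.yaml \
  -server.grpc-max-concurrent-streams=10
```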

Dividing the two numbers out (12,304MB across the ~45 extra gRPC goroutines) gives an average of roughly 273MB in buffers for each call, which seems high - do you configure Prometheus to send thousands of samples at once? If you could look at the inuse_objects view it would show the number of buffers to cross-check (this really needs to be done in source-code view so we can see what matches where).
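
If it helps, a sketch of how to get that view with go tool pprof (the binary and profile file names are placeholders for whatever you captured):

```sh
# Select the object-count sample index instead of in-use bytes.
go tool pprof -inuse_objects cortex heap-burning.pprof

# Then, at the interactive (pprof) prompt, annotate the function of interest
# line by line against the source:
#   (pprof) list recvAndDecompress
```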

You should also see some benefit from the gRPC improvements brought in with #1801

I would link this to #665 - ingester could stop accepting new data when it got above a certain size. But this seems tricky to balance in a live environment.

PS: those graphs are really difficult to read; the raw profile would be much better. Feature request to conprof I guess.

@bboreham (Contributor) commented Dec 10, 2019

I have new data: a heap dump from an ingester in the middle of rapid heap growth.
ingester.heap.gz

It's using 28GB in total, of which 7.68GB is in buffers allocated in recvAndDecompress.

However, this time I can see this is spread across 63,996 buffers, giving roughly 120KB per buffer: see #1897 for an explanation of what was keeping them alive.

This ingester was limited to 20 incoming gRPC streams so that idea can be ruled out.
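
If anyone wants to dig into the attached dump themselves, a sketch of how I'd look at it; pprof reads the gzipped profile directly, and passing the matching cortex binary is optional but improves symbol resolution:

```sh
# Non-interactive summary: rank functions by in-use bytes straight to stdout.
go tool pprof -inuse_space -top ingester.heap.gz
```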

stale bot commented Feb 9, 2020

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.

stale bot added the stale label Feb 9, 2020
stale bot closed this as completed Feb 24, 2020
@weeco (Contributor, Author) commented Feb 25, 2020

@bboreham or @pracucci Could you please reopen this issue? Today, using Cortex v0.6.1, this happened again. Looking at the logs, I do not necessarily believe the memcached timeout is really the cause, as these timeouts happen several times over the course of the day without causing issues for other ingesters.

@bboreham Have you seen these memory leaks on the ingester recently?
I am not sure what else I could provide. Do you have an idea?

[screenshots attached]

pracucci reopened this Feb 25, 2020
stale bot removed the stale label Feb 25, 2020
@sangfei commented Mar 24, 2020

(Quoting @weeco's comment above.)

Today, using Cortex v0.7.0 rc, this happened again.

stale bot commented May 23, 2020

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.

stale bot added the stale label May 23, 2020
stale bot closed this as completed Jun 7, 2020