
Memory leak in ingester after it has failed to write indexes to memcached #1852

Closed
weeco opened this issue Nov 23, 2019 · 9 comments

@weeco (Contributor) commented Nov 23, 2019

The problem:

I have been running Cortex v0.3.0 for a couple of weeks, and recently I have been seeing OOM-killed ingester replicas more frequently (roughly 1-2 times per day, which is quite a lot given that I only run 5 large ingesters).

Pod RAM request/limit in k8s: 40GB
Average RAM usage: 14-18GB
First OOM kill at: 11:29:45am
Second OOM kill at: 11:32:15am

Incident details (RAM usage, log messages etc.)

  1. Go heap allocated bytes before the ingester gets OOM killed (the two gaps in this screenshot are the two OOM kills):

     [screenshot: ram_cortex]

  2. Number of goroutines:

     [screenshot: go_routines]

  3. Number of log messages (seems to correlate with the restarts):

     [screenshot: log_messages]

Log messages:

Since there was an unusually large number of log messages, I wanted to know what kind of messages they were. I had to filter out the noisy ones anyway to get to the actually interesting entries:

  • 4k of the 22k log messages were "context cancelled" (QueryStream error)
  • 3k messages were "fingerprint collision detected, mapping to new fingerprint"
  • 15k messages were "sample timestamp out of order" (likely caused by this faulty ingester)

After filtering out these messages I had 31 messages left:

[screenshot: kibana-logs]

In short:

Every time I see these ingester OOM kills, it seems to start with memcached timeouts. In this case the ingester couldn't write its indexes to the "index-cache-write" memcached because the requests timed out. Afterwards memory seems to leak very quickly, which makes it hard to capture a heap dump. If the information provided in this issue is not sufficient to track down the leak, I'd be happy to provide whatever is needed to figure it out - please let me know what else I can do.

[screenshot: kibana-failed-memcached]

@bboreham (Contributor)

I have seen similar symptoms, and I don't know the cause.

I am planning to set up https://github.com/conprof/conprof to fetch heap and goroutine profiles, which should give us some solid leads. If you could set that up too, even better.
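
In the meantime, if conprof is awkward to deploy, a rough alternative is to poll the ingester's standard Go pprof endpoints in a loop, so that a recent profile is always on disk when the OOM kill happens. This is only a sketch - the pod name and HTTP port are assumptions about your deployment:

```sh
# Periodically capture heap and goroutine profiles from the ingester's
# built-in /debug/pprof endpoints (standard net/http/pprof handlers).
# Assumes the ingester's HTTP port is 80 and the pod is named ingester-0.
kubectl port-forward pod/ingester-0 8080:80 &

while true; do
  ts=$(date +%s)
  curl -s "http://localhost:8080/debug/pprof/heap"      -o "heap-${ts}.pprof"
  curl -s "http://localhost:8080/debug/pprof/goroutine" -o "goroutine-${ts}.pprof"
  sleep 30
done
```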

@weeco (Contributor, Author) commented Nov 25, 2019

Ok I'll look into conprof and I hope I can provide the desired profiling results within this week.

@weeco (Contributor, Author) commented Dec 1, 2019

Here's the report of another OOM kill, this time with heap & goroutine profiles as requested.

[screenshot: 2019-12-01 at 16:48]

I took a profile (goroutines + heap allocations) at 14:41, when everything was fine, and another at 14:48, shortly before the pod got OOM killed by Kubernetes. Both are attached as a zip.

OOM-Profiles.zip

Log messages for this pod around this time (after filtering the noisy log messages):

[screenshot of the filtered log messages: 2019-12-01 at 17:11]

@bboreham (Contributor) commented Dec 3, 2019

From your zip:

  • go-profile-fine.htm has 268 goroutines; go-profile-burning.htm has 333.
    45 of the extra are coming from gRPC, of which 11 are in Push_Handler and 25 in io.ReadFull.
  • heap-burning.htm shows 12,304MB allocated in grpc.recvAndDecompress.

So I would conclude all the growth is down to incoming Push calls being decompressed. EDIT: This turned out to be wrong - see later comment.

There is an option -server.grpc-max-concurrent-streams which defaults to 100; in theory setting it lower (e.g. to 10?) would limit the number of buffers which could be allocated at the same time.
I'm not sure if this will also limit queries, and how much of an issue that might cause.
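
For illustration, a rough sketch of what that might look like on the ingester command line. Only -server.grpc-max-concurrent-streams is the flag in question; the -target and -config.file values are placeholders for however you launch the ingester:

```sh
# Hypothetical invocation - only the last flag is the one under discussion.
cortex -target=ingester \
  -config.file=/etc/cortex/cortex.yaml \
  -server.grpc-max-concurrent-streams=10
```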

Dividing the two numbers out (12,304MB across the ~45 extra gRPC goroutines) gives an average of roughly 273MB in buffers for each call, which seems high - do you configure Prometheus to send thousands of samples at once? If you could look at the inuse_objects view it would show the number of buffers to cross-check (this really needs to be done in source-code view so we can see what matches where).
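
If it helps, a sketch of how to get that view with go tool pprof (the binary and profile file names are placeholders for whatever you captured):

```sh
# Select the object-count sample index instead of in-use bytes.
go tool pprof -inuse_objects cortex heap-burning.pprof

# Then, at the interactive (pprof) prompt, annotate the function of interest
# line by line against the source:
#   (pprof) list recvAndDecompress
```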

You should also see some benefit from the gRPC improvements brought in with #1801

I would link this to #665 - ingester could stop accepting new data when it got above a certain size. But this seems tricky to balance in a live environment.

PS: those graphs are really difficult to read; the raw profile would be much better. Feature request to conprof I guess.

@bboreham (Contributor) commented Dec 10, 2019

I have new data: a heap dump from an ingester in the middle of rapid heap growth.
ingester.heap.gz

It's using 28GB in total, of which 7.68GB is in buffers allocated in recvAndDecompress.

However, this time I can see this is spread across 63,996 buffers, giving roughly 120KB per buffer: see #1897 for an explanation of what was keeping them alive.

This ingester was limited to 20 incoming gRPC streams so that idea can be ruled out.
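
If anyone wants to dig into the attached dump themselves, a sketch of how I'd look at it; pprof reads the gzipped profile directly, and passing the matching cortex binary is optional but improves symbol resolution:

```sh
# Non-interactive summary: rank functions by in-use bytes straight to stdout.
go tool pprof -inuse_space -top ingester.heap.gz
```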

stale bot commented Feb 9, 2020

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.

stale bot added the stale label Feb 9, 2020
stale bot closed this as completed Feb 24, 2020
@weeco (Contributor, Author) commented Feb 25, 2020

@bboreham or @pracucci Could you please reopen this issue? Today, using Cortex v0.6.1, this happened again. Looking at the logs, I do not necessarily believe the memcached timeout is really the cause, as these timeouts happen several times over the course of the day without causing issues for other ingesters.

@bboreham Have you seen these memory leaks on the ingester recently?
I am not sure what else I could provide. Do you have an idea?

[screenshots attached]

pracucci reopened this Feb 25, 2020
stale bot removed the stale label Feb 25, 2020
@sangfei commented Mar 24, 2020

(Quoting @weeco's comment above.)

Today, using Cortex v0.7.0 rc, this happened again.

stale bot commented May 23, 2020

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.

stale bot added the stale label May 23, 2020
stale bot closed this as completed Jun 7, 2020