Memory leak #1037

Closed
Tracked by #1016
p0mvn opened this issue Mar 3, 2022 · 12 comments
Labels
T:task ⚙️ (A task belongs to a story), validator-support (Issues related to validators)

Comments

@p0mvn
Member

p0mvn commented Mar 3, 2022

Background

Some validators have reported getting large RAM usage and, sometimes, getting OOM killed.
We should investigate this issue and document all the findings. If the issue stems from a functional error, a follow-up task needs to be created to address this.

Acceptance Criteria

  • investigate and document findings related to RAM usage
p0mvn added the T:task ⚙️ and validator-support labels Mar 3, 2022
@p0mvn
Member Author

p0mvn commented Mar 3, 2022

I have already run a node for over 24 hours and did not observe anything out of the ordinary. The node had 16 GB RAM, 16 GB swap, 2 vCPUs, and a 300 GB disk. All of these are below the recommended specs and were chosen deliberately to test under constrained conditions.

The node was collecting Prometheus metrics, which I visualized with a Grafana dashboard. During epoch block processing, the virtual memory would spike while the resident memory rose much less, and the resident memory eventually went back to normal. However, the excess memory is not released back to the OS, so the UNIX top command may show large usage for that process from the OS perspective, even though the process is actually using roughly half of that; it simply doesn't return the unused memory to the OS.

pprof samples did not indicate the possibility of any leaked or blocked goroutines either.

Unfortunately, I didn't save those metrics. As a sanity check, I'm currently rerunning a v7.0.3 node for another 24 hours to collect the samples again.
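
For context on the technique: below is a minimal, hypothetical Go sketch (not taken from the osmosis codebase) of the kind of instrumentation this relies on, i.e. exposing the pprof endpoints and periodically logging heap stats so that memory the Go runtime merely holds on to can be told apart from memory actually in use:

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
	"runtime"
	"time"
)

func main() {
	// pprof endpoint: `go tool pprof http://localhost:6060/debug/pprof/heap`,
	// and /debug/pprof/goroutine can be used to look for leaked or blocked goroutines.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	for range time.Tick(30 * time.Second) {
		var m runtime.MemStats
		runtime.ReadMemStats(&m)
		// HeapInuse is what the program is actively using; HeapIdle - HeapReleased
		// is memory the runtime keeps reserved but has not yet returned to the OS,
		// which inflates what top reports without being a leak.
		log.Printf("inuse=%d MiB, idle-not-released=%d MiB, released=%d MiB",
			m.HeapInuse>>20, (m.HeapIdle-m.HeapReleased)>>20, m.HeapReleased>>20)
	}
}
```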

@faddat
Member

faddat commented Mar 4, 2022

@p0mvn It is almost certainly CosmWasm.

Why do I say this? Other CosmWasm chains have been seeing similar memory issues:

  • bostrom
  • terra
  • Juno (Juno is the least affected)

@p0mvn
Member Author

p0mvn commented Mar 4, 2022

@faddat Thanks for the info. Interestingly, I'm not seeing any RAM issues on my side.

I've been running a node for over 24 hours, and it seems to be relatively stable in terms of resident memory used

@p0mvn
Member Author

p0mvn commented Mar 4, 2022

[screenshot: memory usage dashboard]

Here we can see two spikes in memory usage, caused by epoch block processing. We can also observe that the resident memory goes back to normal after the epoch block. However, the virtual memory is never reclaimed by the OS. I think this is why many people get the impression that RAM keeps increasing.
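
To make the resident vs. virtual distinction concrete, here is a tiny Linux-only sketch (hypothetical, not part of the node) that prints the two numbers top is summarizing, straight from the kernel. For a Go process, VmSize typically stays high after a spike because the runtime keeps the reserved address space, while VmRSS is what actually occupies RAM:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	// /proc/self/status is Linux-specific.
	f, err := os.Open("/proc/self/status")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	defer f.Close()

	sc := bufio.NewScanner(f)
	for sc.Scan() {
		line := sc.Text()
		// VmSize = total virtual address space; VmRSS = pages resident in RAM.
		if strings.HasPrefix(line, "VmSize:") || strings.HasPrefix(line, "VmRSS:") {
			fmt.Println(line)
		}
	}
}
```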

@p0mvn
Member Author

p0mvn commented Mar 4, 2022

I see similar spikes in latencies during the epoch block:
[screenshot: latency dashboard]

I inspected the logs around these times and saw a line indicating that an epoch was being processed.

@p0mvn
Member Author

p0mvn commented Mar 4, 2022

I will keep the node running for 24 more hours to monitor one more epoch. If everything is stable, I think we can mark this issue as resolved

@ValarDragon
Member

ValarDragon commented Mar 4, 2022

I wonder if there's anything related to queries that causes memory increases. Should we post one of the nodes for people to query against?

@p0mvn
Member Author

p0mvn commented Mar 5, 2022

Why do you think so? Is it from the correlation between "Query Count Increase" and RAM usage in the dashboard above?

I think that correlation is actually due to the epoch processing. "Query Count Increase" really means the increase in "IAVL gets" over the last minute. My choice of name for that dashboard wasn't very accurate.

I'm guessing that we do more "IAVL gets" during the epoch processing than usual. In addition, I'm seeing in the logs that the execution flow slows during epochs (there are periods of > 1min where no logs can be observed). I suspect that might be affecting the graph as well.

If there is another reason, please let me know. I'm also happy to write a script sending a bunch of queries to my node to test.

@ValarDragon
Member

Oh, no reason to suspect it in particular. I was just wondering what other people's nodes could be doing that we're not seeing in the test environment.

@p0mvn
Member Author

p0mvn commented Mar 7, 2022

I wrote a simple perf-test tool to see how the node behaves under heavy query load:

https://github.com/p0mvn/perf-osmo

Right now, only 2 queries are supported. The tool continuously spams these 2 queries at random heights. I tested against a validator node on testnet with 16 GB RAM and 16 GB swap.

The configuration is:

numConnections: 1000 # Number of separate TCP connections made  
numCallsPerConnection: 1000 # Number of RPC calls per connection
heightsToCover: 1000 # minimum queryable height = latest height - heightsToCover, i.e. (latest height - heightsToCover) < possible height <= latest height

I also tried with 100 connections and 10000 RPC calls per connection. In both cases, the graphs look something like this:

  • The elevated "Query Count Increase" graph is the period when the script was enabled.

[screenshot: mem-perf2 — memory metrics]

[screenshot: iavl-perf2 — IAVL query metrics]

So, memory usage does spike but eventually comes back to normal once the query load ends. The number of goroutines stabilizes as well and returns to the pre-load count. The resident memory before the load started was 4.44 GiB; 40 minutes after it started, it was 4.23 GiB. According to pprof, there are no blocked goroutines after the load ends.

Based on these results, I don't think there are IAVL-specific memory leaks in queries. If there is one, it must be somewhere else in the system.
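
The real tool is in the repo linked above; purely as an illustration of the approach (many concurrent connections, each firing a fixed number of requests at random recent heights), here is a rough sketch. The RPC address and the /block query below are placeholders standing in for whatever perf-osmo actually calls:

```go
package main

import (
	"fmt"
	"io"
	"math/rand"
	"net/http"
	"sync"
)

const (
	numConnections        = 1000
	numCallsPerConnection = 1000
	heightsToCover        = 1000
	latestHeight          = 3_500_000                // placeholder: would be fetched from the node
	nodeURL               = "http://localhost:26657" // placeholder RPC address
)

func main() {
	var wg sync.WaitGroup
	for c := 0; c < numConnections; c++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Give each worker its own transport so it maintains its own TCP connection(s).
			client := &http.Client{Transport: &http.Transport{}}
			for i := 0; i < numCallsPerConnection; i++ {
				// Random height in (latestHeight - heightsToCover, latestHeight].
				h := latestHeight - rand.Intn(heightsToCover)
				resp, err := client.Get(fmt.Sprintf("%s/block?height=%d", nodeURL, h))
				if err != nil {
					continue
				}
				io.Copy(io.Discard, resp.Body) // drain the body so the connection can be reused
				resp.Body.Close()
			}
		}()
	}
	wg.Wait()
}
```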

@p0mvn
Member Author

p0mvn commented Mar 7, 2022

My mainnet node has been running with no observed RAM issues for over 4 days.

I'm going to close this issue for now.

p0mvn closed this as completed Mar 7, 2022
@p0mvn
Member Author

p0mvn commented Mar 9, 2022

About to shut down the mainnet node that has been running for almost a week. Here are the memory metrics over this entire period:
[screenshot: memory metrics over the full week]
