Memory leak #1037

Closed
Tracked by #1016
p0mvn opened this issue Mar 3, 2022 · 12 comments
Labels
T:task ⚙️ (A task belongs to a story), validator-support (Issues related to validators)

Comments

@p0mvn
Member

p0mvn commented Mar 3, 2022

Background

Some validators have reported getting large RAM usage and, sometimes, getting OOM killed.
We should investigate this issue and document all the findings. If the issue stems from a functional error, a follow-up task needs to be created to address this.

Acceptance Criteria

  • investigate and document findings related to RAM usage
p0mvn added the T:task ⚙️ and validator-support labels Mar 3, 2022
@p0mvn
Member Author

p0mvn commented Mar 3, 2022

I have already run a node for over 24 hours and did not observe anything out of the ordinary. The node had 16 GB RAM, 16 GB swap, 2 vCPUs, and a 300 GB disk. All of these are below the recommended specs and were chosen deliberately to test under constrained conditions.

The node was collecting Prometheus metrics, which I visualized with a Grafana dashboard. During epoch block processing, the virtual memory would spike while the resident memory rose much less, and the resident memory eventually went back to normal. However, the excess memory is not released back to the OS, so the UNIX top command may show large usage for that process from the OS perspective, even though the process is actually using roughly half of that; it simply doesn't return the unused memory to the OS.

pprof samples did not indicate the possibility of any leaked or blocked goroutines either.

Unfortunately, I didn't save those metrics. As a sanity check, I'm currently rerunning a v7.0.3 node for another 24 hours to collect the samples again.
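
For context on the technique: below is a minimal, hypothetical Go sketch (not taken from the osmosis codebase) of the kind of instrumentation this relies on, i.e. exposing the pprof endpoints and periodically logging heap stats so that memory the Go runtime merely holds on to can be told apart from memory actually in use:

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
	"runtime"
	"time"
)

func main() {
	// pprof endpoint: `go tool pprof http://localhost:6060/debug/pprof/heap`,
	// and /debug/pprof/goroutine can be used to look for leaked or blocked goroutines.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	for range time.Tick(30 * time.Second) {
		var m runtime.MemStats
		runtime.ReadMemStats(&m)
		// HeapInuse is what the program is actively using; HeapIdle - HeapReleased
		// is memory the runtime keeps reserved but has not yet returned to the OS,
		// which inflates what top reports without being a leak.
		log.Printf("inuse=%d MiB, idle-not-released=%d MiB, released=%d MiB",
			m.HeapInuse>>20, (m.HeapIdle-m.HeapReleased)>>20, m.HeapReleased>>20)
	}
}
```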

@faddat
Member

faddat commented Mar 4, 2022

@p0mvn It is almost certainly CosmWasm.

Why do I say this? Other CosmWasm chains have been seeing similar memory issues:

  • bostrom
  • terra
  • Juno (Juno is the least affected)

@p0mvn
Member Author

p0mvn commented Mar 4, 2022

@faddat Thanks for the info. Interestingly, I'm not seeing any RAM issues on my side.

I've been running a node for over 24 hours, and it seems to be relatively stable in terms of resident memory used

@p0mvn
Member Author

p0mvn commented Mar 4, 2022

[screenshot: memory usage dashboard]

Here we can see two spikes in memory usage, caused by epoch block processing. We can also observe that the resident memory goes back to normal after the epoch block. However, the virtual memory is never reclaimed by the OS. I think this is why many people get the impression that RAM keeps increasing.
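
To make the resident vs. virtual distinction concrete, here is a tiny Linux-only sketch (hypothetical, not part of the node) that prints the two numbers top is summarizing, straight from the kernel. For a Go process, VmSize typically stays high after a spike because the runtime keeps the reserved address space, while VmRSS is what actually occupies RAM:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	// /proc/self/status is Linux-specific.
	f, err := os.Open("/proc/self/status")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	defer f.Close()

	sc := bufio.NewScanner(f)
	for sc.Scan() {
		line := sc.Text()
		// VmSize = total virtual address space; VmRSS = pages resident in RAM.
		if strings.HasPrefix(line, "VmSize:") || strings.HasPrefix(line, "VmRSS:") {
			fmt.Println(line)
		}
	}
}
```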

@p0mvn
Member Author

p0mvn commented Mar 4, 2022

I see similar spikes in latencies during the epoch block:
[screenshot: latency dashboard]

I inspected the logs around these times and saw a line indicating that an epoch was being processed.

@p0mvn
Member Author

p0mvn commented Mar 4, 2022

I will keep the node running for 24 more hours to monitor one more epoch. If everything is stable, I think we can mark this issue as resolved

@ValarDragon
Member

ValarDragon commented Mar 4, 2022

I wonder if there's anything related to queries that causes memory increases. Should we post one of the nodes for people to query against?

@p0mvn
Member Author

p0mvn commented Mar 5, 2022

Why do you think so? Is it from the correlation between "Query Count Increase" and RAM usage in the dashboard above?

I think that correlation is actually due to the epoch processing. "Query Count Increase" really means the increase in "IAVL gets" over the last minute. My choice of name for that dashboard wasn't very accurate.

I'm guessing that we do more "IAVL gets" during the epoch processing than usual. In addition, I'm seeing in the logs that the execution flow slows during epochs (there are periods of > 1min where no logs can be observed). I suspect that might be affecting the graph as well.

If there is another reason, please let me know. I'm also happy to write a script sending a bunch of queries to my node to test.

@ValarDragon
Member

Oh, no reason to suspect it in particular. I was just wondering what other people's nodes could be doing that we're not seeing in the test environment.

@p0mvn
Member Author

p0mvn commented Mar 7, 2022

I wrote a simple perf-test tool to see how the node behaves under heavy query load:

https://github.com/p0mvn/perf-osmo

Right now, only 2 queries are supported. The tool continuously spams these 2 queries at random heights. I tested against a validator node on testnet with 16 GB RAM and 16 GB swap.

The configuration is:

numConnections: 1000 # Number of separate TCP connections made  
numCallsPerConnection: 1000 # Number of RPC calls per connection
heightsToCover: 1000 # minimum queryable height = latest height - heightsToCover, i.e. (latest height - heightsToCover) < possible height <= latest height

I also tried with 100 connections and 10000 RPC calls per connection. In both cases, the graphs look something like this:

  • The elevated "Query Count Increase" graph is the period when the script was enabled.

[screenshot: mem-perf2 — memory metrics]

[screenshot: iavl-perf2 — IAVL query metrics]

So, memory usage does spike but eventually comes back to normal once the query load ends. The number of goroutines stabilizes as well and returns to the pre-load count. The resident memory before the load started was 4.44 GiB; 40 minutes after it started, it was 4.23 GiB. According to pprof, there are no blocked goroutines after the load ends.

Based on these results, I don't think there are IAVL-specific memory leaks in queries. If there is one, it must be somewhere else in the system.
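
The real tool is in the repo linked above; purely as an illustration of the approach (many concurrent connections, each firing a fixed number of requests at random recent heights), here is a rough sketch. The RPC address and the /block query below are placeholders standing in for whatever perf-osmo actually calls:

```go
package main

import (
	"fmt"
	"io"
	"math/rand"
	"net/http"
	"sync"
)

const (
	numConnections        = 1000
	numCallsPerConnection = 1000
	heightsToCover        = 1000
	latestHeight          = 3_500_000                // placeholder: would be fetched from the node
	nodeURL               = "http://localhost:26657" // placeholder RPC address
)

func main() {
	var wg sync.WaitGroup
	for c := 0; c < numConnections; c++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Give each worker its own transport so it maintains its own TCP connection(s).
			client := &http.Client{Transport: &http.Transport{}}
			for i := 0; i < numCallsPerConnection; i++ {
				// Random height in (latestHeight - heightsToCover, latestHeight].
				h := latestHeight - rand.Intn(heightsToCover)
				resp, err := client.Get(fmt.Sprintf("%s/block?height=%d", nodeURL, h))
				if err != nil {
					continue
				}
				io.Copy(io.Discard, resp.Body) // drain the body so the connection can be reused
				resp.Body.Close()
			}
		}()
	}
	wg.Wait()
}
```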

@p0mvn
Member Author

p0mvn commented Mar 7, 2022

My mainnet node has been running with no observed RAM issues for over 4 days.

I'm going to close this issue for now.

p0mvn closed this as completed Mar 7, 2022
@p0mvn
Member Author

p0mvn commented Mar 9, 2022

About to shut down the mainnet node that has been running for almost a week. Here are the memory metrics over this entire period:
[screenshot: memory metrics over the full week]
