Possibly outdated hardware requirements for running validator nodes #711

Open
sealbox opened this issue Feb 17, 2023 · 8 comments

Comments

@sealbox

sealbox commented Feb 17, 2023

The Polkadot project's webpage states that running a validator node requires 16 GB of RAM. I have been running a validator node for more than a year (updating regularly, currently on v0.9.36) on a dedicated server with 32 GB of RAM. Nothing besides Polkadot runs on it. Lately my node has been crashing constantly: the node starts, RAM usage climbs steadily to ~32 GB, which is the server's hardware limit, and then the node is killed by the kernel:

Feb 17 13:19:49 systemd[1]: pol.service: Main process exited, code=killed, status=9/KILL
Feb 17 13:19:49 systemd[1]: pol.service: Failed with result 'signal'.
Feb 17 13:21:49 systemd[1]: pol.service: Service hold-off time over, scheduling restart.
Feb 17 13:21:49 systemd[1]: pol.service: Scheduled restart job, restart counter is at 413.

This repeats every few minutes. Used to work fine before.

My question is: is it still possible to run a node on a 32 GB server? Are there any specific settings I should look into? Or is the documentation outdated, so that it should state that the required value is currently 64 GB, not 16 GB, and I should upgrade my hardware?

@bkchr
Member

bkchr commented Feb 17, 2023

This is very likely some memory leak.

Please give more information on your setup. Some metrics would also be nice: CPU usage and memory usage around the time it crashes. Please also post the args you are using to run your node.

CC @koute
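
One way to collect the memory figures requested above is to sample the node's resident set size from `/proc` on Linux. The following is only a sketch and not something posted in the thread; `parse_vm_rss_kib` and `sample_rss` are hypothetical helper names, and the sampling interval is an arbitrary choice.

```python
import time
from pathlib import Path
from typing import List, Optional


def parse_vm_rss_kib(status_text: str) -> Optional[int]:
    """Return the VmRSS value in KiB from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            # The line looks like "VmRSS:\t  716800 kB"; the second field is the number.
            return int(line.split()[1])
    return None  # some processes (e.g. kernel threads) have no VmRSS line


def sample_rss(pid: int, interval_s: float = 5.0, samples: int = 12) -> List[int]:
    """Read the resident set size of `pid` `samples` times, `interval_s` apart."""
    readings = []
    for _ in range(samples):
        text = Path(f"/proc/{pid}/status").read_text()
        rss = parse_vm_rss_kib(text)
        if rss is not None:
            readings.append(rss)
        time.sleep(interval_s)
    return readings
```

Calling `sample_rss(pid)` with the node's process ID shortly after a restart would give a series of KiB readings that shows how quickly memory grows before the kill.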

@sealbox
Author

sealbox commented Feb 17, 2023

My current Polkadot version is 0.9.36. I am running the node with systemd:

[Unit]
Description=POL Node
After=network.target

[Service]
Type=simple
User=pol
Group=coin
ExecStart=/home/pol/node/target/release/polkadot --chain=polkadot --base-path=/home/pol/.chaindata --pruning=432000 --port=30333 --rpc-cors=all --rpc-external --ws-external --ws-max-connections=2048
Restart=always
RestartSec=120
LimitNOFILE=20480
WorkingDirectory=/home/pol

[Install]
WantedBy=multi-user.target

The full typical log looks like this:

Feb 17 14:40:19 polkadot[6883]: 2023-02-17 14:40:19 💻 Memory: 32071MB
Feb 17 14:40:19 polkadot[6883]: 2023-02-17 14:40:19 💻 Kernel: 4.15.0-161-generic
Feb 17 14:40:19 polkadot[6883]: 2023-02-17 14:40:19 💻 Linux distribution: Ubuntu 18.04.6 LTS
Feb 17 14:40:19 polkadot[6883]: 2023-02-17 14:40:19 💻 Virtual machine: no
Feb 17 14:40:19 polkadot[6883]: 2023-02-17 14:40:19 📦 Highest known block at #14272564
Feb 17 14:40:19 polkadot[6883]: 2023-02-17 14:40:19 〽 Prometheus exporter started at 127.0.0.1:9615
Feb 17 14:40:19 polkadot[6883]: 2023-02-17 14:40:19 Running JSON-RPC HTTP server: addr=0.0.0.0:12016, allowed origins=["*"]
Feb 17 14:40:19 polkadot[6883]: 2023-02-17 14:40:19 Running JSON-RPC WS server: addr=0.0.0.0:9944, allowed origins=["*"]
Feb 17 14:40:19 polkadot[6883]: 2023-02-17 14:40:19 🏁 CPU score: 1013.90 MiBs
Feb 17 14:40:19 polkadot[6883]: 2023-02-17 14:40:19 🏁 Memory score: 14.11 GiBs
Feb 17 14:40:19 polkadot[6883]: 2023-02-17 14:40:19 🏁 Disk score (seq. writes): 2.07 GiBs
Feb 17 14:40:19 polkadot[6883]: 2023-02-17 14:40:19 🏁 Disk score (rand. writes): 804.75 MiBs
Feb 17 14:40:20 polkadot[6883]: 2023-02-17 14:40:20 🔍 Discovered new external address for our node: /ip4/152.228.224.218/tcp/30333/ws/p2p/12D3KooWFaTo4zHoYgma5Z7VxMUya7ZXV59Pph6n6qTKdLrXihBB
Feb 17 14:40:22 ns3193383 polkadot[6883]: 2023-02-17 14:40:22 🔍 Discovered new external address for our node: /ip6/2001:41d0:203:a0da::/tcp/30333/ws/p2p/12D3KooWFaTo4zHoYgma5Z7VxMUya7ZXV59Pph6n6qTKdLrXihBB
Feb 17 14:40:24 polkadot[6883]: 2023-02-17 14:40:24 ⚙️  Syncing, target=#14288671 (21 peers), best: #14272564 (0xbdcb…95cc), finalized #14272414 (0xb8b8…ab9b), ⬇ 11.8MiB/s ⬆ 38.6kiB/s
Feb 17 14:40:26 polkadot[6883]: 2023-02-17 14:40:26 Accepting new connection 1/100
Feb 17 14:40:26 polkadot[6883]: 2023-02-17 14:40:26 Accepting new connection 2/100
Feb 17 14:40:26 polkadot[6883]: 2023-02-17 14:40:26 Accepting new connection 2/100
Feb 17 14:40:26 polkadot[6883]: 2023-02-17 14:40:26 Accepting new connection 3/100
Feb 17 14:40:26 polkadot[6883]: 2023-02-17 14:40:26 Accepting new connection 4/100
Feb 17 14:40:26 polkadot[6883]: 2023-02-17 14:40:26 Accepting new connection 5/100
Feb 17 14:40:26 polkadot[6883]: 2023-02-17 14:40:26 Accepting new connection 6/100
Feb 17 14:40:26 polkadot[6883]: 2023-02-17 14:40:26 Accepting new connection 7/100
Feb 17 14:40:26 polkadot[6883]: 2023-02-17 14:40:26 Accepting new connection 8/100
Feb 17 14:40:26 polkadot[6883]: 2023-02-17 14:40:26 Accepting new connection 9/100
Feb 17 14:40:26 polkadot[6883]: 2023-02-17 14:40:26 Accepting new connection 10/100
Feb 17 14:40:26 polkadot[6883]: 2023-02-17 14:40:26 Accepting new connection 2/100
Feb 17 14:40:26 polkadot[6883]: 2023-02-17 14:40:26 Accepting new connection 3/100
Feb 17 14:40:26 polkadot[6883]: 2023-02-17 14:40:26 Accepting new connection 4/100
Feb 17 14:40:26 polkadot[6883]: 2023-02-17 14:40:26 Accepting new connection 5/100
Feb 17 14:40:26 polkadot[6883]: 2023-02-17 14:40:26 Accepting new connection 6/100
Feb 17 14:40:26 polkadot[6883]: 2023-02-17 14:40:26 Accepting new connection 7/100
Feb 17 14:40:26 polkadot[6883]: 2023-02-17 14:40:26 Accepting new connection 8/100
Feb 17 14:40:26 polkadot[6883]: 2023-02-17 14:40:26 Accepting new connection 9/100
Feb 17 14:40:26 polkadot[6883]: 2023-02-17 14:40:26 Accepting new connection 10/100
Feb 17 14:40:26 polkadot[6883]: 2023-02-17 14:40:26 Accepting new connection 11/100
Feb 17 14:40:26 polkadot[6883]: 2023-02-17 14:40:26 Accepting new connection 2/100
Feb 17 14:40:26 polkadot[6883]: 2023-02-17 14:40:26 Accepting new connection 3/100
Feb 17 14:40:26 polkadot[6883]: 2023-02-17 14:40:26 Accepting new connection 2/100
Feb 17 14:40:26 polkadot[6883]: 2023-02-17 14:40:26 Accepting new connection 3/100
Feb 17 14:40:26 polkadot[6883]: 2023-02-17 14:40:26 Accepting new connection 4/100
Feb 17 14:40:26 polkadot[6883]: 2023-02-17 14:40:26 Accepting new connection 5/100
Feb 17 14:40:26 polkadot[6883]: 2023-02-17 14:40:26 Accepting new connection 6/100
Feb 17 14:40:26 polkadot[6883]: 2023-02-17 14:40:26 Accepting new connection 2/100
Feb 17 14:40:26 polkadot[6883]: 2023-02-17 14:40:26 Accepting new connection 3/100
Feb 17 14:40:26 polkadot[6883]: 2023-02-17 14:40:26 Accepting new connection 2/100
Feb 17 14:40:26 polkadot[6883]: 2023-02-17 14:40:26 Accepting new connection 3/100
Feb 17 14:40:26 polkadot[6883]: 2023-02-17 14:40:26 Accepting new connection 4/100
Feb 17 14:40:26 polkadot[6883]: 2023-02-17 14:40:26 Accepting new connection 5/100
Feb 17 14:40:26 polkadot[6883]: 2023-02-17 14:40:26 Accepting new connection 6/100
Feb 17 14:40:26 polkadot[6883]: 2023-02-17 14:40:26 Accepting new connection 7/100
Feb 17 14:40:26 polkadot[6883]: 2023-02-17 14:40:26 Accepting new connection 8/100
Feb 17 14:40:26 polkadot[6883]: 2023-02-17 14:40:26 Accepting new connection 9/100
Feb 17 14:40:26 polkadot[6883]: 2023-02-17 14:40:26 Accepting new connection 10/100
Feb 17 14:40:26 polkadot[6883]: 2023-02-17 14:40:26 Accepting new connection 11/100
Feb 17 14:40:26 polkadot[6883]: 2023-02-17 14:40:26 Accepting new connection 12/100
Feb 17 14:40:26 polkadot[6883]: 2023-02-17 14:40:26 Accepting new connection 13/100
Feb 17 14:40:26 polkadot[6883]: 2023-02-17 14:40:26 Accepting new connection 14/100
Feb 17 14:40:26 polkadot[6883]: 2023-02-17 14:40:26 Accepting new connection 15/100
Feb 17 14:40:26 polkadot[6883]: 2023-02-17 14:40:26 Accepting new connection 16/100
Feb 17 14:40:26 polkadot[6883]: 2023-02-17 14:40:26 Accepting new connection 17/100
Feb 17 14:40:26 polkadot[6883]: 2023-02-17 14:40:26 Accepting new connection 18/100
Feb 17 14:40:26 polkadot[6883]: 2023-02-17 14:40:26 Accepting new connection 19/100
Feb 17 14:40:26 polkadot[6883]: 2023-02-17 14:40:26 Accepting new connection 20/100
Feb 17 14:40:26 polkadot[6883]: 2023-02-17 14:40:26 Accepting new connection 21/100
Feb 17 14:40:26 polkadot[6883]: 2023-02-17 14:40:26 Accepting new connection 22/100
Feb 17 14:40:26 polkadot[6883]: 2023-02-17 14:40:26 Accepting new connection 23/100
Feb 17 14:40:26 polkadot[6883]: 2023-02-17 14:40:26 Accepting new connection 24/100
Feb 17 14:40:26 polkadot[6883]: 2023-02-17 14:40:26 Accepting new connection 25/100
Feb 17 14:40:26 polkadot[6883]: 2023-02-17 14:40:26 Accepting new connection 26/100
Feb 17 14:40:26 polkadot[6883]: 2023-02-17 14:40:26 Accepting new connection 27/100
Feb 17 14:40:26 polkadot[6883]: 2023-02-17 14:40:26 Accepting new connection 28/100
Feb 17 14:40:26 polkadot[6883]: 2023-02-17 14:40:26 Accepting new connection 29/100
Feb 17 14:40:26 polkadot[6883]: 2023-02-17 14:40:26 Accepting new connection 30/100
Feb 17 14:40:26 polkadot[6883]: 2023-02-17 14:40:26 Accepting new connection 31/100
Feb 17 14:40:26 polkadot[6883]: 2023-02-17 14:40:26 Accepting new connection 28/100
Feb 17 14:40:26 polkadot[6883]: 2023-02-17 14:40:26 Accepting new connection 4/100
Feb 17 14:40:26 polkadot[6883]: 2023-02-17 14:40:26 Accepting new connection 5/100
Feb 17 14:40:26 polkadot[6883]: 2023-02-17 14:40:26 Accepting new connection 4/100
Feb 17 14:40:26 polkadot[6883]: 2023-02-17 14:40:26 Accepting new connection 5/100
Feb 17 14:40:26 polkadot[6883]: 2023-02-17 14:40:26 Accepting new connection 6/100
Feb 17 14:40:26 polkadot[6883]: 2023-02-17 14:40:26 Accepting new connection 7/100
Feb 17 14:40:26 polkadot[6883]: 2023-02-17 14:40:26 Accepting new connection 8/100
Feb 17 14:40:26 polkadot[6883]: 2023-02-17 14:40:26 Accepting new connection 9/100
Feb 17 14:40:26 polkadot[6883]: 2023-02-17 14:40:26 Accepting new connection 10/100
Feb 17 14:40:26 polkadot[6883]: 2023-02-17 14:40:26 Accepting new connection 11/100
Feb 17 14:40:26 polkadot[6883]: 2023-02-17 14:40:26 Accepting new connection 12/100
Feb 17 14:40:26 polkadot[6883]: 2023-02-17 14:40:26 Accepting new connection 13/100
Feb 17 14:40:26 polkadot[6883]: 2023-02-17 14:40:26 Accepting new connection 14/100
Feb 17 14:40:26 polkadot[6883]: 2023-02-17 14:40:26 Accepting new connection 15/100
Feb 17 14:40:26 polkadot[6883]: 2023-02-17 14:40:26 Accepting new connection 16/100
Feb 17 14:40:26 polkadot[6883]: 2023-02-17 14:40:26 Accepting new connection 17/100
Feb 17 14:40:26 polkadot[6883]: 2023-02-17 14:40:26 Accepting new connection 18/100
Feb 17 14:40:26 polkadot[6883]: 2023-02-17 14:40:26 Accepting new connection 19/100
Feb 17 14:40:26 polkadot[6883]: 2023-02-17 14:40:26 Accepting new connection 20/100
Feb 17 14:40:26 polkadot[6883]: 2023-02-17 14:40:26 Accepting new connection 21/100
Feb 17 14:40:26 polkadot[6883]: 2023-02-17 14:40:26 Accepting new connection 22/100
Feb 17 14:40:27 polkadot[6883]: 2023-02-17 14:40:27 🔍 Discovered new external address for our node: /ip4/51.159.8.12/tcp/30333/ws/p2p/12D3KooWFaTo4zHoYgma5Z7VxMUya7ZXV59Pph6n6qTKdLrXihBB
Feb 17 14:40:31 systemd[1]: pol.service: Main process exited, code=killed, status=9/KILL
Feb 17 14:40:31 systemd[1]: pol.service: Failed with result 'signal'.

It usually crashes at this point, at 22/100. When I watch the server's resources in htop, I see RAM use rising from the ~700 MB the node has when "fresh" up to the server's ~32 GB limit. The process killing the node is the OOM killer.

Let me know what else I can provide to help with the issue.
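
The kill events in logs like the one above can be pulled out mechanically. The sketch below is an illustration only; `find_killed_exits` is a hypothetical helper, and its pattern is written against the systemd message format shown in this thread. A `status=9/KILL` exit is consistent with the OOM killer but does not prove it on its own; the kernel log is where that would be confirmed.

```python
import re

# Matches systemd lines such as:
# "Feb 17 14:40:31 systemd[1]: pol.service: Main process exited, code=killed, status=9/KILL"
KILL_RE = re.compile(
    r"^(?P<ts>\w{3} +\d+ \d{2}:\d{2}:\d{2}) .*Main process exited, "
    r"code=killed, status=(?P<sig>\d+)/(?P<name>\w+)"
)


def find_killed_exits(lines):
    """Return (timestamp, signal number, signal name) for each killed exit."""
    hits = []
    for line in lines:
        match = KILL_RE.search(line)
        if match:
            hits.append(
                (match.group("ts"), int(match.group("sig")), match.group("name"))
            )
    return hits
```

Feeding it the journal output for the service gives one tuple per kill, which makes it easy to see how regularly the node is being terminated.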

@koute
Contributor

koute commented Feb 17, 2023

@sealbox Looks like you're running an RPC node and are getting RPC connections. Are those connections yours? Is this a public RPC server?

@sealbox
Author

sealbox commented Feb 17, 2023

The RPC ports stay behind a firewall, which allows connections only from my own apps.

@sealbox
Author

sealbox commented Feb 17, 2023

Here's the log with the RPC port blocked on the firewall. Notice that there are no connections now (because the firewall is not allowing them), but the node still crashes the same way:

Feb 17 15:59:38 polkadot[12248]: 2023-02-17 15:59:38 Parity Polkadot
Feb 17 15:59:38 polkadot[12248]: 2023-02-17 15:59:38 ✌️  version 0.9.36-dc25abc712e
Feb 17 15:59:38 polkadot[12248]: 2023-02-17 15:59:38 ❤️  by Parity Technologies <admin@parity.io>, 2017-2023
Feb 17 15:59:38 polkadot[12248]: 2023-02-17 15:59:38 📋 Chain specification: Polkadot
Feb 17 15:59:38 polkadot[12248]: 2023-02-17 15:59:38 🏷  Node name: rotten-cherry-8117
Feb 17 15:59:38 polkadot[12248]: 2023-02-17 15:59:38 👤 Role: FULL
Feb 17 15:59:38 polkadot[12248]: 2023-02-17 15:59:38 💾 Database: RocksDb at /home/pol/.chaindata/chains/polkadot/db/full
Feb 17 15:59:38 polkadot[12248]: 2023-02-17 15:59:38 ⛓  Native runtime: polkadot-9360 (parity-polkadot-0.tx19.au0)
Feb 17 16:02:12 polkadot[12248]: 2023-02-17 16:02:12 🏷  Local node identity is: 12D3KooWFaTo4zHoYgma5Z7VxMUya7ZXV59Pph6n6qTKdLrXihBB
Feb 17 16:02:12 polkadot[12248]: 2023-02-17 16:02:12 💻 Operating system: linux
Feb 17 16:02:12 polkadot[12248]: 2023-02-17 16:02:12 💻 CPU architecture: x86_64
Feb 17 16:02:12 polkadot[12248]: 2023-02-17 16:02:12 💻 Target environment: gnu
Feb 17 16:02:12 polkadot[12248]: 2023-02-17 16:02:12 💻 CPU: Intel(R) Xeon(R) E-2236 CPU @ 3.40GHz
Feb 17 16:02:12 polkadot[12248]: 2023-02-17 16:02:12 💻 CPU cores: 6
Feb 17 16:02:12 polkadot[12248]: 2023-02-17 16:02:12 💻 Memory: 32071MB
Feb 17 16:02:12 polkadot[12248]: 2023-02-17 16:02:12 💻 Kernel: 4.15.0-161-generic
Feb 17 16:02:12 polkadot[12248]: 2023-02-17 16:02:12 💻 Linux distribution: Ubuntu 18.04.6 LTS
Feb 17 16:02:12 polkadot[12248]: 2023-02-17 16:02:12 💻 Virtual machine: no
Feb 17 16:02:12 polkadot[12248]: 2023-02-17 16:02:12 📦 Highest known block at #14272564
Feb 17 16:02:12 polkadot[12248]: 2023-02-17 16:02:12 〽️ Prometheus exporter started at 127.0.0.1:9615
Feb 17 16:02:12 polkadot[12248]: 2023-02-17 16:02:12 Running JSON-RPC HTTP server: addr=0.0.0.0:12016, allowed origins=["*"]
Feb 17 16:02:12 polkadot[12248]: 2023-02-17 16:02:12 Running JSON-RPC WS server: addr=0.0.0.0:9944, allowed origins=["*"]
Feb 17 16:02:12 polkadot[12248]: 2023-02-17 16:02:12 🏁 CPU score: 987.58 MiBs
Feb 17 16:02:12 polkadot[12248]: 2023-02-17 16:02:12 🏁 Memory score: 13.76 GiBs
Feb 17 16:02:12 polkadot[12248]: 2023-02-17 16:02:12 🏁 Disk score (seq. writes): 1.77 GiBs
Feb 17 16:02:12 polkadot[12248]: 2023-02-17 16:02:12 🏁 Disk score (rand. writes): 732.67 MiBs
Feb 17 16:02:13 polkadot[12248]: 2023-02-17 16:02:13 🔍 Discovered new external address for our node: /ip4/152.228.224.218/tcp/30333/ws/p2p/12D3KooWFaTo4zHoYgma5Z7VxMUya7ZXV59Pph6n6qTKdLrXihBB
Feb 17 16:02:15 polkadot[12248]: 2023-02-17 16:02:15 💔 The bootnode you want to connect to at `/ip4/51.159.8.12/tcp/30333/p2p/12D3KooWFfG1SQvcPoUK2N41cx7r52KYXKpRtZxfLZk8xtVzpp4d` provided a different peer ID `12D3KooWKFx8DC9QqeFwKkufScAAJNtHWSwGCqdUExiZLw1Tw8eq` than the one you expect `12D3KooWFfG1SQvcPoUK2N41cx7r52KYXKpRtZxfLZk8xtVzpp4d`.
Feb 17 16:02:16 polkadot[12248]: 2023-02-17 16:02:15 🔍 Discovered new external address for our node: /ip4/51.159.8.12/tcp/30333/ws/p2p/12D3KooWFaTo4zHoYgma5Z7VxMUya7ZXV59Pph6n6qTKdLrXihBB
Feb 17 16:02:17 polkadot[12248]: 2023-02-17 16:02:17 ⚙️  Syncing, target=#14289490 (20 peers), best: #14272564 (0xbdcb…95cc), finalized #14272414 (0xb8b8…ab9b), ⬇ 11.8MiB/s ⬆ 43.2kiB/s
Feb 17 16:02:24 systemd[1]: pol.service: Main process exited, code=killed, status=9/KILL

@bkchr
Member

bkchr commented Feb 17, 2023

@sealbox why are you running with --pruning=432000?

@bkchr
Member

bkchr commented Feb 17, 2023

This means that you are basically loading 432000 blocks into memory. This is why your node is getting killed directly after startup. If you really need to run with such a huge pruning window, you should use ParityDB, as it doesn't require keeping all this data in memory.
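
To get a feel for how large that window is, here is a rough back-of-the-envelope calculation. It is only a sketch: the 6-second block time and the 256-block default window are assumptions not stated in this thread, and `window_days` is a hypothetical helper.

```python
BLOCK_TIME_S = 6          # assumption: Polkadot's target block time in seconds
PRUNING_WINDOW = 432_000  # from --pruning=432000 in the unit file above
DEFAULT_WINDOW = 256      # assumption: the default pruning window in blocks


def window_days(blocks: int, block_time_s: int = BLOCK_TIME_S) -> float:
    """Convert a pruning window in blocks to the days of chain history it spans."""
    return blocks * block_time_s / 86_400


print(window_days(PRUNING_WINDOW))      # 30.0
print(PRUNING_WINDOW / DEFAULT_WINDOW)  # 1687.5
```

Under those assumptions the configured window covers about 30 days of blocks, roughly 1,700 times the assumed default.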

@bkchr
Member

bkchr commented Feb 17, 2023

paritytech/substrate#13414 should print a warning in the future to make the user aware.

@Sophia-Gold Sophia-Gold transferred this issue from paritytech/polkadot Aug 24, 2023