Cluster node startup takes longer than 2 hours #2679
-
Description
Cluster node restart takes longer than two hours without consuming system resources.

Reproduction

RabbitMQ setup
RabbitMQ version: 3.8.1 (Erlang 22.1.6)

Expected
Node starts within a few minutes.

Actual
Node start always takes longer than 120 minutes.

Reproduction setup

Log analysis

System load during startup
CPU: < 25 %
-
Unfortunately we don't know where to dig deeper from here. Any help would be highly appreciated. Thanks in advance.
-
Please provide a script we can run to load a 3-node cluster with your exact configuration and message load. I'll also add that I'm not surprised startup is taking this long. What are the exact specifications of your CPUs and storage devices?
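For illustration, here is a minimal sketch of what such a load script could look like, written with the Python `pika` client. The queue count, binding count, message size and host name are placeholders rather than the reporter's actual configuration, and any classic queue mirroring policy would still have to be applied separately (e.g. via `rabbitmqctl set_policy`).

```python
# Hypothetical load generator: declares many durable queues with several
# bindings each and publishes persistent messages, to approximate a
# binding-heavy broker state. All counts and the host name are placeholders.
import pika

QUEUES = 1_000            # number of durable queues (placeholder)
BINDINGS_PER_QUEUE = 5    # bindings per queue (placeholder)
MESSAGES_PER_QUEUE = 100  # persistent messages per queue (placeholder)

conn = pika.BlockingConnection(pika.ConnectionParameters(host="node-1"))
ch = conn.channel()
ch.exchange_declare(exchange="load-test", exchange_type="topic", durable=True)

for q in range(QUEUES):
    name = f"load-q-{q}"
    ch.queue_declare(queue=name, durable=True)
    for b in range(BINDINGS_PER_QUEUE):
        ch.queue_bind(queue=name, exchange="load-test",
                      routing_key=f"load.{q}.{b}")
    for _ in range(MESSAGES_PER_QUEUE):
        ch.basic_publish(exchange="load-test",
                         routing_key=f"load.{q}.0",
                         body=b"x" * 1024,
                         properties=pika.BasicProperties(delivery_mode=2))

conn.close()
```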
-
Enable debug logging and consider sharing actual logs. One possible reason with classic mirrored queues is the eager sync they use. Use a quorum queue to compare, since quorum queues only transfer the delta between leader and follower on node startup, if any. There can be any number of other reasons; a traffic capture and …
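For the comparison, a minimal sketch of declaring a quorum queue with the Python `pika` client (host and queue name are placeholders); quorum queues are selected via the `x-queue-type` argument and must be durable:

```python
# Minimal sketch: declare a quorum queue to compare startup behaviour with a
# classic mirrored queue. Host and queue name are placeholders.
import pika

conn = pika.BlockingConnection(pika.ConnectionParameters(host="node-1"))
ch = conn.channel()

# Quorum queues must be durable and are chosen via the x-queue-type argument.
ch.queue_declare(queue="orders-qq", durable=True,
                 arguments={"x-queue-type": "quorum"})

conn.close()
```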
-
Unfortunately that log is not helpful. We will need a way to reproduce this; please see my earlier comment: #2679 (comment). Also please see @michaelklishin's comment: #2679 (comment).
-
We isolated the issue by saving the state of every node and recovering the nodes in an isolated container environment. First node: up within ~5 minutes, no RAM peak. Very little data was read from storage, but there was almost 3.2 GB of inter-node network traffic on every node during startup.
-
@mkuratczyk you may want to add extra info here.
-
Hi, I've identified the following two steps that slow down the startup process (and correspond to the "gaps" in your logs):
In both cases the lack of log messages is justified: these are very fast operations that are only slow because they are repeated hundreds of thousands of times (once for each binding), and there is no point logging each execution even at debug level. To test the proposed changes you can deploy a cluster using …
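As a back-of-the-envelope illustration (the per-binding cost below is an assumed figure, not a measurement), a few milliseconds of work repeated for hundreds of thousands of bindings is enough to reach the two-hour range:

```python
# Hypothetical arithmetic only: shows how tiny per-binding operations add up.
bindings = 400_000           # "hundreds of thousands" of bindings
cost_per_binding_s = 0.009   # assumed 9 ms of work per binding per step
steps = 2                    # the two slow startup steps described above

total_s = bindings * cost_per_binding_s * steps
print(f"{total_s / 3600:.1f} hours")  # -> 2.0 hours under these assumptions
```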