Test cluster made more stable by connecting wallet to a relay node instead of a pool node. #4170

Merged 1 commit on Oct 19, 2023

@Unisay Unisay commented Oct 18, 2023

Problem

The following test cluster failure occurred sporadically but repeatedly:

  • pool 2 reports TraceNoLedgerView;
  • the test cluster stops producing blocks.

The investigation revealed the following recurring pattern in the logs:

[2023-10-17 11:07:54.70 UTC] [pool-2] IP LocalAddress "/tmp/test-b28eb3e2412a8eb5/pool-2/node.socket@3" ErrorPolicyUnhandledApplicationException (MuxError (MuxIOException writev: resource vanished (Broken pipe)) "(sendAll errored)")
[2023-10-17 11:08:27.23 UTC] [pool-2] IP 127.0.0.1:42639 ErrorPolicySuspendPeer (Just (ApplicationExceptionTrace (MuxError (MuxIOException Network.Socket.recvBuf: resource vanished (Connection reset by peer)) "(recv errored)"))) 1s 20s
[2023-10-17 11:08:57.22 UTC] [pool-1] IP 127.0.0.1:46529 ErrorPolicySuspendPeer (Just (ApplicationExceptionTrace (MuxError (MuxIOException Network.Socket.recvBuf: resource vanished (Connection reset by peer)) "(recv errored)"))) 1s 20s
[2023-10-17 11:09:05.76 UTC] [pool-4] IP 127.0.0.1:34473 ErrorPolicySuspendPeer (Just (ApplicationExceptionTrace (MuxError (MuxIOException Network.Socket.recvBuf: resource vanished (Connection reset by peer)) "(recv errored)"))) 1s 20s
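
To check whether the same pattern shows up in another run, the log lines can be tallied by node and socket error. The following is a minimal Python sketch and not part of this PR: the regular expression and the name `vanished_by_node` are assumptions inferred only from the lines quoted above, and the real log format may vary.

```python
import re
from collections import Counter

# Assumed line shape, inferred from the excerpt above:
#   [timestamp] [node] ... resource vanished (reason) ...
LINE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] \[(?P<node>[^\]]+)\] .*resource vanished \((?P<reason>[^)]+)\)"
)


def vanished_by_node(lines):
    """Count 'resource vanished' errors per (node, reason) pair."""
    counts = Counter()
    for line in lines:
        match = LINE.match(line)
        if match:
            counts[(match.group("node"), match.group("reason"))] += 1
    return counts


# Two lines copied verbatim from the excerpt above.
sample = [
    '[2023-10-17 11:07:54.70 UTC] [pool-2] IP LocalAddress "/tmp/test-b28eb3e2412a8eb5/pool-2/node.socket@3" ErrorPolicyUnhandledApplicationException (MuxError (MuxIOException writev: resource vanished (Broken pipe)) "(sendAll errored)")',
    '[2023-10-17 11:08:27.23 UTC] [pool-2] IP 127.0.0.1:42639 ErrorPolicySuspendPeer (Just (ApplicationExceptionTrace (MuxError (MuxIOException Network.Socket.recvBuf: resource vanished (Connection reset by peer)) "(recv errored)"))) 1s 20s',
]

for (node, reason), n in sorted(vanished_by_node(sample).items()):
    print(node, reason, n)
```

A "Broken pipe" entry on a local socket followed by "Connection reset by peer" entries on other pools is the sequence described in this PR.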

The theory explaining it

  • The wallet instance is connected to a pool node (pool 2).
  • The wallet sends resource-consuming queries to that node, which causes the original error: MuxError (MuxIOException writev: resource vanished (Broken pipe)) "(sendAll errored)".
  • The topology of each pool requires a minimum of 3 connections to other pools.
  • Because the connection to pool 2 failed and was not recovered, block production stops and the cluster becomes dysfunctional.

Solution

Instead of connecting the wallet instance directly to a block-producing node, we introduce another node into the cluster for the sole purpose of serving the wallet. This "relay" node broadcasts transactions to the pools but does not participate in block production, so overloading it won't cause a cluster-wide problem.
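
The effect of the change can be illustrated with a small model. This is a hedged Python sketch, not the cluster code from this PR: it assumes four pools in a full mesh (the logs above only mention pools 1, 2 and 4), assumes that only pool-to-pool connections count towards the minimum of 3, and uses hypothetical names throughout.

```python
MIN_POOL_PEERS = 3  # from the description: each pool needs at least 3 connections to other pools


def pool_mesh(pools):
    """Each pool peers with every other pool (assumed full mesh)."""
    return {p: {q for q in pools if q != p} for p in pools}


def pools_below_minimum(pools, overloaded_node, min_peers=MIN_POOL_PEERS):
    """Pools left with too few pool peers once overloaded_node drops its connections."""
    mesh = pool_mesh(pools)
    return sorted(
        p
        for p in pools
        if p != overloaded_node and len(mesh[p] - {overloaded_node}) < min_peers
    )


pools = ["pool-1", "pool-2", "pool-3", "pool-4"]  # assumption: four pools

# Before: the wallet is served by pool-2, so wallet load can take pool-2 down.
print(pools_below_minimum(pools, overloaded_node="pool-2"))  # ['pool-1', 'pool-3', 'pool-4']

# After: the wallet is served by a relay that is not part of the pool mesh.
print(pools_below_minimum(pools, overloaded_node="relay"))  # []
```

Under these assumptions, losing a pool leaves every remaining pool one connection short, whereas losing the relay leaves the pool mesh untouched.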

Bonus

All pools have an equal pledge of 100 million ada to minimize differences in block minting probability.
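
In Ouroboros Praos, a pool with relative stake σ is elected slot leader in a given slot with probability 1 - (1 - f)^σ, where f is the active slot coefficient. The Python sketch below shows why equal pledges even out that probability. It is an illustration under stated assumptions: four pools, stake consisting of the pledge alone, and a hypothetical value for f. None of these are taken from the PR's code.

```python
def relative_stake(pledges):
    """Share of the total stake held by each pool (stake assumed equal to pledge)."""
    total = sum(pledges.values())
    return {pool: amount / total for pool, amount in pledges.items()}


def leader_probability(sigma, f):
    """Praos per-slot leadership probability for relative stake sigma."""
    return 1 - (1 - f) ** sigma


ACTIVE_SLOT_COEFF = 0.5  # hypothetical value, not taken from this PR

# 100 million ada per pool, as in this PR; the pool count of four is an assumption.
equal = {f"pool-{i}": 100_000_000 for i in range(1, 5)}

# Hypothetical unequal pledges, for contrast only.
skewed = {**equal, "pool-1": 200_000_000}

for name, pledges in (("equal", equal), ("skewed", skewed)):
    for pool, sigma in relative_stake(pledges).items():
        print(name, pool, sigma, round(leader_probability(sigma, ACTIVE_SLOT_COEFF), 4))
```

With equal pledges every pool gets the same share and therefore the same chance of minting a block in each slot; in the skewed case one pool is favoured.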

Issue Number

ADP-3137

@Unisay Unisay self-assigned this Oct 18, 2023

@HeinrichApfelmus HeinrichApfelmus left a comment

Thanks, looks good to me! 😊 (Modulo compilation errors)

Let's hope that it helps.

lib/local-cluster/lib/Cardano/Wallet/Launch/Cluster.hs Outdated
@Unisay Unisay force-pushed the yura/ADP-3137/local-cluster-unstuck-on-CI branch 2 times, most recently from e08e7cd to fb184de Compare October 19, 2023 07:18
@Unisay Unisay force-pushed the yura/ADP-3137/local-cluster-unstuck-on-CI branch from fb184de to eda4b9d Compare October 19, 2023 07:56
@Unisay Unisay enabled auto-merge October 19, 2023 07:56
@Unisay Unisay added this pull request to the merge queue Oct 19, 2023
Merged via the queue into master with commit 75df93d Oct 19, 2023
2 checks passed
@Unisay Unisay deleted the yura/ADP-3137/local-cluster-unstuck-on-CI branch October 19, 2023 08:32