hermes start crashes on broadcast_tx_commit timeout #977

Closed
ebuchman opened this issue May 23, 2021 · 2 comments · Fixed by #1091

ebuchman commented May 23, 2021

hermes start will sometimes crash on the following:

Error: failed with underlying error: RPC error to endpoint https://rpc.cosmos.network/: RPC error to endpoint https://rpc.cosmos.network/: Internal error: timed out waiting for tx to be included in a block (code: -32603)

This error should be pretty easy to recover from, since the tx will most likely still be committed; we're just not waiting long enough for it. The timeout is configured on the full node as timeout_broadcast_tx_commit and defaults to 10s, which isn't even enough for 2 blocks on the Hub, and it's not a parameter the relayer can control. If we hit the timeout, we just need to keep waiting until the tx shows up in a block (maybe with a local timeout of our own for that).

A more robust approach would be to use broadcast_tx_sync (which only waits for the tx to pass into the mempool) and then monitor for the tx to be committed in a block.
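
For illustration, here is a rough sketch of that flow against the tendermint-rpc HttpClient. This is not the hermes implementation: the helper name send_and_confirm and the local_timeout knob are made up here, and the exact client method signatures vary between tendermint-rpc versions.

```rust
use std::time::{Duration, Instant};

use tendermint::abci::transaction::Transaction;
use tendermint_rpc::{query::Query, Client, HttpClient, Order};

// Sketch only: submit with broadcast_tx_sync, then poll until the tx is
// committed or a relayer-side deadline expires, instead of relying on the
// node's timeout_broadcast_tx_commit.
async fn send_and_confirm(
    client: &HttpClient,
    tx: Transaction,
    local_timeout: Duration, // our own deadline, independent of the node's
) -> Result<(), Box<dyn std::error::Error>> {
    // broadcast_tx_sync returns as soon as CheckTx accepts the tx into the mempool.
    let resp = client.broadcast_tx_sync(tx).await?;
    if !resp.code.is_ok() {
        return Err(format!("CheckTx rejected tx: {}", resp.log).into());
    }
    let hash = resp.hash;

    // Poll until the tx shows up in a block or the local deadline expires.
    let deadline = Instant::now() + local_timeout;
    while Instant::now() < deadline {
        let result = client
            .tx_search(
                Query::eq("tx.hash", hash.to_string()),
                false, // no proof needed here
                1,     // first page
                1,     // one result is enough
                Order::Ascending,
            )
            .await?;
        if !result.txs.is_empty() {
            return Ok(()); // committed in a block
        }
        tokio::time::sleep(Duration::from_secs(1)).await;
    }
    Err("local timeout: tx not yet included in a block".into())
}
```

Note that querying by tx.hash relies on the node's tx indexer being enabled; an alternative is to watch NewBlock/Tx events over the existing subscription and match the hash there.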

Here's the full log if it helps:

$ hermes start coindesk-testnet-1 cosmoshub-4 -p transfer -c channel-0
May 23 00:52:33.817  INFO ibc_relayer_cli::commands: Using default configuration from: '.hermes/config.toml'
May 23 00:52:33.910 TRACE ibc_relayer::event::monitor: subscribing to query: tm.event = 'Tx'
May 23 00:52:33.929 TRACE ibc_relayer::event::monitor: subscribing to query: tm.event = 'NewBlock'
May 23 00:52:33.951 TRACE ibc_relayer::event::monitor: subscribed to all queries
May 23 00:52:33.951  INFO ibc_relayer::event::monitor: starting event monitor chain.id=coindesk-testnet-1
May 23 00:52:35.136 TRACE ibc_relayer::event::monitor: subscribing to query: tm.event = 'Tx'
May 23 00:52:35.209 TRACE ibc_relayer::event::monitor: subscribing to query: tm.event = 'NewBlock'
May 23 00:52:35.281 TRACE ibc_relayer::event::monitor: subscribed to all queries
May 23 00:52:35.282  INFO ibc_relayer::event::monitor: starting event monitor chain.id=cosmoshub-4
May 23 00:52:35.321  INFO ibc_relayer::link: relaying packets on path coindesk-testnet-1 <-> cosmoshub-4 with delay of 0ns
May 23 00:52:38.616  WARN ibc_relayer::event::rpc: error while building event send_packet.packet_data
May 23 00:52:38.777  INFO ibc_relayer::link: [coindesk-testnet-1 -> cosmoshub-4] clearing pending packets from events before height Height { revision: 1, height: 1168 }
May 23 00:52:38.777  INFO ibc_relayer::link: [coindesk-testnet-1 -> cosmoshub-4] clearing old packets
May 23 00:52:38.826 DEBUG ibc_relayer::link: [coindesk-testnet-1 -> cosmoshub-4] packets that still have commitments on coindesk-testnet-1: [1]
May 23 00:52:38.987 DEBUG ibc_relayer::link: [coindesk-testnet-1 -> cosmoshub-4] recv packets to send out to cosmoshub-4 of the ones with commitments on source coindesk-testnet-1: [Sequence(1)]
May 23 00:52:39.013 DEBUG ibc_relayer::link: [coindesk-testnet-1 -> cosmoshub-4] received from query_txs [Sequence(1)]
May 23 00:52:39.013  INFO ibc_relayer::link: [coindesk-testnet-1 -> cosmoshub-4] generate messages from batch with 1 events
May 23 00:52:39.066 DEBUG ibc_relayer::link: [coindesk-testnet-1 -> cosmoshub-4] coindesk-testnet-1 => SendPacketEv(h:1-1167, seq:1, path:channel-0/transfer->channel-119/transfer, toh:4-6316327, tos:Timestamp(NoTimestamp)))
May 23 00:52:39.267 TRACE ibc_relayer::link: [coindesk-testnet-1 -> cosmoshub-4] built recv_packet msg seq:1, path:channel-0/transfer->channel-119/transfer, toh:4-6316327, tos:Timestamp(NoTimestamp)), proofs at height 1-1168
May 23 00:52:39.267 DEBUG ibc_relayer::link: [coindesk-testnet-1 -> cosmoshub-4] cosmoshub-4 <= /ibc.core.channel.v1.MsgRecvPacket from SendPacketEv(h:1-1167, seq:1, path:channel-0/transfer->channel-119/transfer, toh:4-6316327, tos:Timestamp(NoTimestamp)))
May 23 00:52:39.267  INFO ibc_relayer::link: [coindesk-testnet-1 -> cosmoshub-4] scheduling op. data with 1 msg(s) for Destination chain (height 1-1168)
May 23 00:52:39.324 DEBUG ibc_relayer::link: [coindesk-testnet-1 -> cosmoshub-4] packets that have acknowledgments on coindesk-testnet-1 [1, 2]
May 23 00:52:39.484 DEBUG ibc_relayer::link: [coindesk-testnet-1 -> cosmoshub-4] ack packets to send out to cosmoshub-4 of the ones with acknowledgments on coindesk-testnet-1: []
May 23 00:52:39.484  INFO ibc_relayer::link: [coindesk-testnet-1 -> cosmoshub-4] finished clearing pending packets
May 23 00:52:39.644  INFO ibc_relayer::link: [coindesk-testnet-1 -> cosmoshub-4] relay op. data to Destination, proofs height 1-1167, (delayed by: 376.742685ms) [try 1/10]
May 23 00:52:39.644  INFO ibc_relayer::link: [coindesk-testnet-1 -> cosmoshub-4] prepending Destination client update @ height 1-1168
May 23 00:52:40.256 DEBUG ibc_relayer::foreign_client: [coindesk-testnet-1 -> cosmoshub-4:07-tendermint-185] MsgUpdateAnyClient for target height 1-1168 and trusted height 1-1085
May 23 00:52:40.256  INFO ibc_relayer::link: [coindesk-testnet-1 -> cosmoshub-4] assembled batch of 2 message(s)
May 23 00:52:45.728  WARN ibc_relayer::event::rpc: error while building event Incorrect event type: 'delegate'
Error: failed with underlying error: RPC error to endpoint https://rpc.cosmos.network/: RPC error to endpoint https://rpc.cosmos.network/: Internal error: timed out waiting for tx to be included in a block (code: -32603)
$
ebuchman added the A: bug (Admin: something isn't working) and I: rpc (Internal: related to (g)RPC) labels on May 23, 2021
ebuchman (Member, Author) commented:

Just saw another version of this with a panic trace I haven't seen before while setting up a channel:

May 23 23:11:31.888  WARN ibc_relayer::event::rpc: error while building event send_packet.packet_data
May 23 23:11:34.242 DEBUG ibc_relayer::foreign_client: [informal-testnet-1 -> cosmoshub-4:07-tendermint-197] MsgUpdateAnyClient for target height 1-130 and trusted height 1-101
May 23 23:11:44.481 ERROR ibc_relayer::channel: Failed ChanTry ChannelSide { chain: ProdChainHandle { chain_id: ChainId { id: "cosmoshub-4", version: 4 }, runtime_sender: Sender { .. } }, client_id: ClientId("07-tendermint-197"), connection_id: ConnectionId("connection-236"), port_id: PortId("transfer"), channel_id: ChannelId("channel-0") }: SubmitError(ChainId { id: "cosmoshub-4", version: 4 }, Error(Context { kind: Rpc(Url { inner: Url { scheme: "https", cannot_be_a_base: false, username: "", password: None, host: Some(Domain("rpc.cosmos.network")), port: None, path: "/", query: None, fragment: None }, scheme: Https, host: "rpc.cosmos.network", port: 443 }), source: Some(Error(Context { kind: Rpc(Url { inner: Url { scheme: "https", cannot_be_a_base: false, username: "", password: None, host: Some(Domain("rpc.cosmos.network")), port: None, path: "/", query: None, fragment: None }, scheme: Https, host: "rpc.cosmos.network", port: 443 }), source: Some(Error { code: InternalError, message: "Internal error", data: Some("timed out waiting for tx to be included in a block") }), backtrace: Some(   0: backtrace::backtrace::trace
   1: backtrace::capture::Backtrace::new
   2: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
   3: tokio::runtime::thread_pool::ThreadPool::block_on
   4: tokio::runtime::Runtime::block_on
   5: ibc_relayer::chain::cosmos::CosmosSdkChain::send_tx
   6: <ibc_relayer::chain::cosmos::CosmosSdkChain as ibc_relayer::chain::Chain>::send_msgs
   7: ibc_relayer::chain::runtime::ChainRuntime<C>::run
   8: std::sys_common::backtrace::__rust_begin_short_backtrace
   9: core::ops::function::FnOnce::call_once{{vtable.shim}}
  10: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
             at /rustc/9bc8c42bb2f19e745a63f3445f1ac248fb015e53/library/alloc/src/boxed.rs:1546:9
      <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
             at /rustc/9bc8c42bb2f19e745a63f3445f1ac248fb015e53/library/alloc/src/boxed.rs:1546:9
      std::sys::unix::thread::Thread::new::thread_start
             at /rustc/9bc8c42bb2f19e745a63f3445f1ac248fb015e53/library/std/src/sys/unix/thread.rs:71:17
  11: __pthread_start
) })), backtrace: Some(   0: backtrace::backtrace::trace
   1: backtrace::capture::Backtrace::new
   2: ibc_relayer::chain::cosmos::CosmosSdkChain::send_tx
   3: <ibc_relayer::chain::cosmos::CosmosSdkChain as ibc_relayer::chain::Chain>::send_msgs
   4: ibc_relayer::chain::runtime::ChainRuntime<C>::run
   5: std::sys_common::backtrace::__rust_begin_short_backtrace
   6: core::ops::function::FnOnce::call_once{{vtable.shim}}
   7: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
             at /rustc/9bc8c42bb2f19e745a63f3445f1ac248fb015e53/library/alloc/src/boxed.rs:1546:9
      <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
             at /rustc/9bc8c42bb2f19e745a63f3445f1ac248fb015e53/library/alloc/src/boxed.rs:1546:9
      std::sys::unix::thread::Thread::new::thread_start
             at /rustc/9bc8c42bb2f19e745a63f3445f1ac248fb015e53/library/std/src/sys/unix/thread.rs:71:17
   8: __pthread_start
) }))

But it seems to recover from this one and complete the handshake ...

ancazamfir (Collaborator) commented:

A more robust approach would be to use broadcast_tx_sync (which only waits for the tx to pass into the mempool) and then monitor for the tx to be committed in a block.

Yes, we discussed it last week and also with the DEX folks. We are going to look at it this week.

There is another significant problem with broadcast_tx_commit(): all our packet relay workers perform send_msgs() by communicating with a single per-chain runtime, so only one worker can get a message included in a given block :(
I haven't run this test yet, but imagine N channels that all have one side on some chain X, and now ft-transfer one packet at about the same time on all channels, with X as the destination. There will be N workers, each handling its SendPacket event, building a MsgRecvPacket and calling send_msgs() at about the same time. The last one in line will be blocked for roughly N * block_time_of_X (e.g. with 10 such channels and ~7s blocks, over a minute).

As for fixing this, at least at first look it is not straightforward to me. All our workers are fully blocking and expect back a list of events for the messages included in the tx. Ideally this should happen behind the scenes, but I don't think the current setup with the runtime allows this easily.
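
Purely as an illustration of "behind the scenes", one possible shape is to split the blocking call into a submit phase and a confirmation phase; none of these types or names exist in hermes today, they only sketch the idea.

```rust
// Illustrative only: split send_msgs() into "submit" (returns once the tx is
// in the mempool) and a deferred confirmation (resolved when the tx is seen
// in a block), so several workers can get txs into the same block.
use tokio::sync::oneshot;

struct IbcEvent;          // stand-in for the relayer's event type
struct TxHash([u8; 32]);  // stand-in for a tx hash

struct PendingTx {
    hash: TxHash,
    // Completed by a background task that watches committed blocks
    // (or polls tx_search) and extracts the events for this tx.
    events: oneshot::Receiver<Vec<IbcEvent>>,
}

trait AsyncMsgSender {
    // Phase 1: sign, broadcast_tx_sync, return without waiting for a block.
    fn submit(&self, msgs: Vec<Vec<u8>>) -> PendingTx;
}
```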

We also need to keep track of the account sequence numbers ourselves (this doesn't look too bad), so the issues with the seq# mismatch should go away.
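
A minimal sketch of what that local sequence tracking could look like (the names here are illustrative, not hermes APIs):

```rust
// Hypothetical sketch: after every successful CheckTx the cached sequence is
// bumped instead of re-querying the account, so back-to-back
// broadcast_tx_sync calls don't collide on the same sequence number.
struct AccountSequence {
    next: u64,
}

impl AccountSequence {
    /// Seed from an on-chain account query (done once, or after an error).
    fn new(on_chain_sequence: u64) -> Self {
        Self { next: on_chain_sequence }
    }

    /// Sequence to sign the next tx with; call `commit` once CheckTx accepts it.
    fn peek(&self) -> u64 {
        self.next
    }

    /// CheckTx accepted the tx: the next tx must use sequence + 1.
    fn commit(&mut self) {
        self.next += 1;
    }

    /// CheckTx rejected the tx, or we hit a sequence-mismatch error:
    /// re-sync from the chain before retrying.
    fn resync(&mut self, on_chain_sequence: u64) {
        self.next = on_chain_sequence;
    }
}
```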
