
Hermes selective retry upon account sequence mismatch #2411

Closed
ljoss17 opened this issue Jul 15, 2022 · 3 comments · Fixed by #2450


ljoss17 commented Jul 15, 2022

Summary of Bug

Sending tokens to the same chain twice in a row causes the second transfer to trigger an "incorrect account sequence" error. Hermes should retry sending the second transfer, but it does not, and the packet ends up among the pending packets.

Version

Steps to Reproduce

Setup gm

  • gm start
  • gm hermes config
  • gm hermes keys

Setup Hermes

  • hermes create connection --a-chain ibc-0 --b-chain ibc-1
  • hermes create channel --a-chain ibc-0 --a-connection connection-0 --a-port transfer --b-port transfer
  • hermes start

Run transfers and verify

In a new terminal:

  • hermes tx ft-transfer --dst-chain ibc-0 --src-chain ibc-1 --src-port transfer --src-channel channel-0 --amount 20 --timeout-height-offset 5
  • Wait a bit
  • hermes tx ft-transfer --dst-chain ibc-0 --src-chain ibc-1 --src-port transfer --src-channel channel-0 --amount 20 --timeout-height-offset 5
  • Wait until terminal with hermes start shows "account sequence mismatch" error
  • hermes query packet pending --chain ibc-0 --channel channel-0 --port transfer

The last command should show a pending acknowledgment.

Acceptance Criteria

  • The second transfer is correctly retried.

For Admin Use

  • Not duplicate issue
  • Appropriate labels applied
  • Appropriate milestone (priority) applied
  • Appropriate contributors tagged
  • Contributor assigned/self-assigned

adizere commented Jul 15, 2022

The correct behavior can be reproduced with v0.15. Specifically, after submitting the second ft-transfer (following the "Steps to Reproduce" above), we can see the following retry in the logs:

2022-07-15T15:46:01.175997Z ERROR ThreadId(26) send_tx_check{id=769c03c6}:send_tx_with_account_sequence_retry{id=ibc-1}:estimate_gas: failed to simulate tx. propagating error to caller: gRPC call failed with status: status: Unknown, message: "account sequence mismatch, expected 142, got 141: incorrect account sequence", details: [], metadata: MetadataMap { headers: {"content-type": "application/grpc", "x-cosmos-block-height": "47799"} }
2022-07-15T15:46:01.176107Z  WARN ThreadId(26) send_tx_check{id=769c03c6}:send_tx_with_account_sequence_retry{id=ibc-1}: failed at estimate_gas step mismatching account sequence: dropping the tx & refreshing account sequence number
2022-07-15T15:46:01.178092Z  INFO ThreadId(26) send_tx_check{id=769c03c6}:send_tx_with_account_sequence_retry{id=ibc-1}: refresh: retrieved account sequence=142 number=1
2022-07-15T15:46:01.178342Z ERROR ThreadId(383) packet_cmd{src_chain=ibc-0 src_port=transfer src_channel=channel-3 dst_chain=ibc-1}: will retry: schedule execution encountered error: failed with underlying error: gRPC call failed with status: status: Unknown, message: "account sequence mismatch, expected 142, got 141: incorrect account sequence", details: [], metadata: MetadataMap { headers: {"content-type": "application/grpc", "x-cosmos-block-height": "47799"} } retry_index=1
2022-07-15T15:46:02.195955Z DEBUG ThreadId(383) packet_cmd{src_chain=ibc-0 src_port=transfer src_channel=channel-3 dst_chain=ibc-1}:generate{id=769c03c6}: /ibc.core.channel.v1.MsgAcknowledgement from WriteAcknowledgementEv(WriteAcknowledgement - h:0-47800, seq:14, path:channel-4/transfer->channel-3/transfer, toh:0-47802, tos:Timestamp(NoTimestamp)))

...
2022-07-15T15:46:02.239738Z  INFO ThreadId(383) packet_cmd{src_chain=ibc-0 src_port=transfer src_channel=channel-3 dst_chain=ibc-1}:relay{odata=769c03c6 ->Destination @0-47800; len=1}: assembled batch of 2 message(s)
...

2022-07-15T15:46:02.252603Z  INFO ThreadId(383) packet_cmd{src_chain=ibc-0 src_port=transfer src_channel=channel-3 dst_chain=ibc-1}:relay{odata=769c03c6 ->Destination @0-47800; len=1}: [Async~>ibc-1] response(s): 1; Ok:A23C50A7DC247D3B2D3F7A52762A033926B8ABA9BBD7F8514923D58B59ACA737

--

The buggy behavior is present in both v1.0.0-rc.0 and the current master. With these versions, the retry never happens:

2022-07-15T15:54:35.758683Z DEBUG ThreadId(26) send_tx_check{id=703e243e}:send_tx_with_account_sequence_retry{id=ibc-1}: max fee, for use in tx simulation: Fee { amount: "100001stake", gas_limit: 10000000 }
2022-07-15T15:54:35.765105Z ERROR ThreadId(26) send_tx_check{id=703e243e}:send_tx_with_account_sequence_retry{id=ibc-1}:estimate_gas: failed to simulate tx. propagating error to caller: gRPC call failed with status: status: Unknown, message: "account sequence mismatch, expected 146, got 145: incorrect account sequence", details: [], metadata: MetadataMap { headers: {"content-type": "application/grpc", "x-cosmos-block-height": "47899"} }
2022-07-15T15:54:35.765228Z  WARN ThreadId(26) send_tx_check{id=703e243e}:send_tx_with_account_sequence_retry{id=ibc-1}: failed at estimate_gas step mismatching account sequence: dropping the tx & refreshing account sequence number
2022-07-15T15:54:35.767259Z  INFO ThreadId(26) send_tx_check{id=703e243e}:send_tx_with_account_sequence_retry{id=ibc-1}: refresh: retrieved account sequence=146 number=1
2022-07-15T15:54:35.767568Z ERROR ThreadId(455) packet_cmd{src_chain=ibc-0 src_port=transfer src_channel=channel-3 dst_chain=ibc-1}: will retry: schedule execution encountered error: link failed with underlying error: gRPC call failed with status: status: Unknown, message: "account sequence mismatch, expected 146, got 145: incorrect account sequence", details: [], metadata: MetadataMap { headers: {"content-type": "application/grpc", "x-cosmos-block-height": "47899"} }
2022-07-15T15:54:35.767620Z  WARN ThreadId(455) packet_cmd{src_chain=ibc-0 src_port=transfer src_channel=channel-3 dst_chain=ibc-1}: task encountered ignorable error: link error: link failed with underlying error: gRPC call failed with status: status: Unknown, message: "account sequence mismatch, expected 146, got 145: incorrect account sequence", details: [], metadata: MetadataMap { headers: {"content-type": "application/grpc", "x-cosmos-block-height": "47899"} }


ancazamfir commented Jul 15, 2022

This bug/new behavior was introduced with #2348, which fixed the case where some "ignorable" errors are permanent. Before that PR, such errors resulted in an infinite retry loop that would prevent Hermes from relaying.
The new behavior is that failed transactions (the ones that fail simulation or checkTx) are left to be retried at the clearing interval for all ignorable errors.

  1. One problem is with the account sequence error handling specifically:
  • If the error appears during simulation, we re-cache the account sequence and return the error up to the caller (this is meant to hit the worker's second-level retry, but that fails; see below).
  • If it appears in checkTx, we retry once in the runtime (see the retry in do_send_tx_with_account_sequence_retry). If that retry fails, we return an Ok(response) that includes a mismatch error. This is later retrieved in the worker and treated differently than the simulation error.

AFAICT this was the same before #2348; that PR just makes it visible, since previously we kept retrying from the third retry loop.
We should also retry once for the simulation failure (mismatch_account_sequence_number_error_requires_refresh()). A minimal sketch of this retry-once behavior follows below.
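For illustration only, here is a minimal Rust sketch of the "refresh the cached account sequence and retry once" behavior described in this item. Every name in it (Account, TxError, simulate_tx, simulate_with_one_retry) is a hypothetical stand-in, not the actual Hermes code:

// Minimal sketch of the "refresh the cached account sequence and retry once"
// pattern. All names and types are hypothetical stand-ins, not Hermes code.

#[derive(Debug)]
struct Account {
    sequence: u64,
}

#[derive(Debug)]
enum TxError {
    AccountSequenceMismatch { expected: u64, got: u64 },
}

fn simulate_tx(account: &Account, chain_sequence: u64) -> Result<(), TxError> {
    // The chain rejects the simulation when the cached sequence lags behind.
    if account.sequence != chain_sequence {
        Err(TxError::AccountSequenceMismatch {
            expected: chain_sequence,
            got: account.sequence,
        })
    } else {
        Ok(())
    }
}

// On a sequence mismatch, refresh the cached sequence from the chain's view
// and retry the simulation exactly once instead of propagating the error.
fn simulate_with_one_retry(account: &mut Account, chain_sequence: u64) -> Result<(), TxError> {
    match simulate_tx(account, chain_sequence) {
        Err(TxError::AccountSequenceMismatch { expected, .. }) => {
            account.sequence = expected; // refresh the cached sequence
            simulate_tx(account, chain_sequence)
        }
        ok => ok,
    }
}

fn main() {
    // Mirrors the logs above: the chain expects 146, the cache holds 145.
    let mut account = Account { sequence: 145 };
    assert!(simulate_with_one_retry(&mut account, 146).is_ok());
    println!("simulation succeeded after refresh; sequence = {}", account.sequence);
}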

  2. Then there is a second (outer) retry mechanism that is supposed to kick in, but it doesn't. It turns out this one only covers checkTx and deliverTx errors (which come up as LinkError::Send) and does not work for simulation errors (which show up as LinkError::Relayer). I believe we also need to fix this.

With a diff like the following, for example, this outer retry kicks in and we don't see the problem anymore:

+++ b/relayer/src/link/relay_path.rs
@@ -648,6 +648,19 @@ impl<ChainA: ChainHandle, ChainB: ChainHandle> RelayPath<ChainA, ChainB> {
                         }
                     }
                 }
+                Err(LinkError(error::LinkErrorDetail::Relayer(e), _)) => {
+                    // This error means we could retry
+                    error!("error {}", e);
+                    if i + 1 == MAX_RETRIES {
+                        error!("{}/{} retries exhausted. giving up", i + 1, MAX_RETRIES)
+                    } else {
+                        // If we haven't exhausted all retries, regenerate the op. data & retry
+                        match self.regenerate_operational_data(odata.clone()) {
+                            None => return Ok(S::Reply::empty()), // Nothing to retry
+                            Some(new_od) => odata = new_od,
+                        }
+                    }
+                }

But I'm not sure the immediate fix is the diff above. The long-term fix should be better error handling: the CosmosSdk chain runtime should report better errors, and the peculiarities of how send_msg works should not show up in the workers. Maybe we can return an Ok(response) with a ChainError inside for everything that has to do with chain response errors (simulation, checkTx, or deliverTx), or an explicit typed error; a rough sketch of the typed-error idea follows below.
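As a rough illustration of the "explicit typed error" option: one variant per chain response stage, with a single place for workers to decide about retries. The names below (ChainResponseError, is_retryable) are hypothetical, not the actual Hermes types:

// Hypothetical sketch of an explicit typed error for chain response failures,
// so workers can decide about retries without parsing gRPC error strings.

#[derive(Debug)]
enum ChainResponseError {
    Simulation { code: u32, log: String },
    CheckTx { code: u32, log: String },
    DeliverTx { code: u32, log: String },
}

impl ChainResponseError {
    // A single place for workers to decide whether to regenerate and retry.
    fn is_retryable(&self) -> bool {
        match self {
            // Account sequence mismatches surface in simulation or checkTx
            // and are worth one regeneration + retry.
            ChainResponseError::Simulation { log, .. }
            | ChainResponseError::CheckTx { log, .. } => {
                log.contains("incorrect account sequence")
            }
            ChainResponseError::DeliverTx { .. } => false,
        }
    }
}

fn main() {
    let err = ChainResponseError::Simulation {
        code: 2,
        log: "account sequence mismatch, expected 146, got 145: incorrect account sequence".into(),
    };
    println!("retryable: {}", err.is_retryable());
}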

  3. Lastly, there is the third retry mechanism in spawn_background_task(), called by spawn_packet_cmd_worker(). Before "Do not retry clearing with same height forever" (#2348), this was retrying the failed command, causing the infinite loop in some cases and, in other cases, fixing the account mismatch issue (the one not fixed by the two deeper retry schemes above). A generic sketch of such a loop follows below.
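For context, the general shape of such a background-task retry loop is sketched below. The names (StepOutcome, spawn_retry_task) are hypothetical; this is not the actual spawn_background_task implementation, only an illustration of distinguishing ignorable from fatal errors:

// Rough sketch of a background-task loop that distinguishes ignorable from
// fatal errors, in the spirit of the mechanism described above.

use std::thread;
use std::time::Duration;

enum StepOutcome {
    Success,
    Ignorable(String), // log and leave the work for the next clearing interval
    Fatal(String),     // abort the task
}

fn spawn_retry_task(mut step: impl FnMut() -> StepOutcome + Send + 'static) -> thread::JoinHandle<()> {
    thread::spawn(move || loop {
        match step() {
            StepOutcome::Success => {}
            StepOutcome::Ignorable(e) => {
                // Before #2348 the failed command was retried here immediately,
                // which could loop forever; afterwards the error is only logged
                // and the work waits for the next clearing interval.
                eprintln!("ignorable error, deferring: {e}");
            }
            StepOutcome::Fatal(e) => {
                eprintln!("fatal error, stopping task: {e}");
                break;
            }
        }
        thread::sleep(Duration::from_millis(200));
    })
}

fn main() {
    let mut attempts = 0;
    let handle = spawn_retry_task(move || {
        attempts += 1;
        if attempts < 3 {
            StepOutcome::Ignorable("account sequence mismatch".into())
        } else {
            StepOutcome::Fatal("done for this example".into())
        }
    });
    handle.join().unwrap();
}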

adizere changed the title from "Bug when running two transfers with the same sender and receiver" to "Hermes selective retry upon account sequence mismatch" on Jul 18, 2022
adizere added this to the v1.0.0 milestone on Jul 19, 2022

romac commented Jul 22, 2022

Seems like we do indeed have a fix! #2450 (comment)
