
Fix Hermes retrying mechanism not regenerating messages #1951

Merged · 19 commits · Apr 20, 2022

Conversation

@seanchen1991 (Contributor) commented Mar 11, 2022

Closes: #1792 & #2074

Description


PR author checklist:

  • Added changelog entry, using unclog.
  • Added tests: integration (for Hermes) or unit/mock tests (for modules).
  • Linked to GitHub issue.
  • Updated code comments and documentation (e.g., docs/).

Reviewer checklist:

  • Reviewed Files changed in the GitHub PR explorer.
  • Manually tested (in case integration/unit/mock tests are absent).

@seanchen1991 (Contributor, Author) commented:

Based on manual testing, it looks like regenerating the operational data for the transaction behind a pending tx hash stops the indefinite retries/resubmissions:

2022-04-05T18:37:03.575437Z ERROR ThreadId(48) packet{src_chain=network1 src_port=transfer src_channel=channel-0 dst_chain=network2}: timed out while confirming TxHashes: count=1; F0C878F10E04F3CD8C5158EC47034D28BAB1CB8388C0EF0A4AC8EC5F1E517348                                                                           
2022-04-05T18:37:03.575580Z  WARN ThreadId(48) packet{src_chain=network1 src_port=transfer src_channel=channel-0 dst_chain=network2}: failed. Regenerate operational data from 1 events
2022-04-05T18:37:03.592324Z DEBUG ThreadId(48) packet{src_chain=network1 src_port=transfer src_channel=channel-0 dst_chain=network2}:generate{id=QolGWZg9XY}: /ibc.core.channel.v1.MsgAcknowledgement from WriteAcknowledgementEv(WriteAcknowledgement - h:0-142, seq:2, path:channel-0/transfer->channel-0/transfer, toh:0-0, tos:Timestamp(2022-04-05T18:38:26.75658Z)))
2022-04-05T18:37:03.592528Z  INFO ThreadId(48) packet{src_chain=network1 src_port=transfer src_channel=channel-0 dst_chain=network2}: will retry odata=QolGWZg9XY ->Destination @0-142; len=1                                                                                                                                                    
2022-04-05T18:37:03.592620Z DEBUG ThreadId(48) packet{src_chain=network1 src_port=transfer src_channel=channel-0 dst_chain=network2}:relay{odata=QolGWZg9XY ->Destination @0-142; len=1}: delayed by: 13.6325ms [try 1/5]                                                                                                                                                          
2022-04-05T18:37:03.660234Z DEBUG ThreadId(48) packet{src_chain=network1 src_port=transfer src_channel=channel-0 dst_chain=network2}:relay{odata=QolGWZg9XY ->Destination @0-142; len=1}: prepending Destination client update at height 0-143                                                                                                                                     
2022-04-05T18:37:03.696296Z DEBUG ThreadId(48) packet{src_chain=network1 src_port=transfer src_channel=channel-0 dst_chain=network2}:relay{odata=QolGWZg9XY ->Destination @0-142; len=1}: [network1 -> network2:07-tendermint-0] MsgUpdateAnyClient from trusted height 0-128 to target height 0-143                                                                               
2022-04-05T18:37:03.746451Z  INFO ThreadId(48) packet{src_chain=network1 src_port=transfer src_channel=channel-0 dst_chain=network2}:relay{odata=QolGWZg9XY ->Destination @0-142; len=1}: assembled batch of 2 message(s)                                                                                                                                                          
2022-04-05T18:37:03.746919Z DEBUG ThreadId(12) send_tx_check{id=QolGWZg9XY}:send_tx{id=network2}: sending 2 messages using account sequence 12
2022-04-05T18:37:03.750856Z DEBUG ThreadId(12) send_tx_check{id=QolGWZg9XY}:send_tx{id=network2}: max fee, for use in tx simulation: Fee { amount: "1001stake", gas_limit: 1000000 }
2022-04-05T18:37:03.765742Z DEBUG ThreadId(12) send_tx_check{id=QolGWZg9XY}:send_tx{id=network2}:estimate_gas: tx simulation successful, gas amount used: 99098
2022-04-05T18:37:03.830615Z DEBUG ThreadId(12) send_tx_check{id=QolGWZg9XY}:send_tx{id=network2}: send_tx: using 99098 gas, fee Fee { amount: "110stake", gas_limit: 109008 } id=network2
2022-04-05T18:37:03.843246Z DEBUG ThreadId(12) send_tx_check{id=QolGWZg9XY}:send_tx{id=network2}: broadcast_tx_sync: Response { code: Ok, data: Data([]), log: Log("[]"), hash: transaction::Hash(0861A30F195669CA81B5A780B6226DCA06903C57D14ED9D8506F02F54E34A721) } 

@seanchen1991 seanchen1991 marked this pull request as ready for review April 5, 2022 18:53
@adizere (Member) left a comment


Left some suggestions for improvements, nothing critical.

The approach is sound. Testing it automatically is difficult, so we'll do that in separate work. I will test this manually.

Review threads (all resolved) on: relayer/src/link/pending.rs, relayer/src/link/relay_path.rs, relayer/src/worker.rs
@romac (Member) commented Apr 12, 2022

@ancazamfir ready for you to take a look when you have time

@ancazamfir (Collaborator) commented:

I will try this more, but so far something seems wrong... I still have to look at the code.

I have a local setup where I managed to reproduce the original problem. The sequence of steps is:

  • hermes create channel...
  • hermes start
  • send two packets with a short timeout (3 block heights works in my case): hermes tx raw ft-transfer ibc-1 ibc-0 transfer channel-0 10 -n 2 -o 3 -k user2

With master I can see the indefinite retries.
Moving to the PR repo/branch:

  • the Tx is MsgUpdateClient + MsgRecvPacket * 2, sent to ibc-1
  • it passes simulation and CheckTx during broadcast_tx_sync
  • I get a hash, and for some reason the Tx is never included in a block. This is exactly what we want for this test, but I have no idea why it is quietly dropped from the mempool.
  • I see the timeout while waiting for confirmation:
2022-04-12T23:05:40.507607Z ERROR ThreadId(42) packet_cmd{src_chain=ibc-0 src_port=transfer src_channel=channel-0 dst_chain=ibc-1}: timed out while confirming TxHashes: count=1; D9EFA06F5614BDC760402AD0D22B10F0BE761BBE63F8BE9CA2C779F00C6D3BB6
  • but nothing happens after this: the original SendPacket events are never resubmitted, and the packet commitments stay on ibc-0.

@ancazamfir (Collaborator) commented:

Ah, actually I had a non-zero (very high) clearing interval in the config. Once I set clear_interval = 0, the test above passes.
But I am not sure the retry when tx_confirmation is true should only apply when the clearing interval is 0. (Note: this should be documented in config.toml; I figured it out by reading the code.)

What would happen if there are simulation or CheckTx failures? We don't have pending Tx-es in that case. I think there is a separate short retry loop, but if that fails then we cannot clear those packets anymore.

The other problem is the following test:

  • same as above but
  • repeatedly send two packets every 20-30 sec or so (note that I changed the timeout locally to 60 sec), then stop sending
  • the first "timed out while confirming TxHashes..." message occurs for the last submitted hash; the Tx-es are retried in the reverse order in which they occurred.

Here is a log summary:

2022-04-13T00:43:08.097757Z DEBUG ThreadId(41) send_tx_check{id=MHuteAoNnY}:send_tx{id=ibc-1}: broadcast_tx_sync: Response { code: Ok, data: Data([]), log: Log("[]"), hash: transaction::Hash(8EBA697D9E8DFB12D3CA00CE7D150E8856910EB87D51EE8B9020FBC6CDFD02CE) }
2022-04-13T00:43:08.097875Z  INFO ThreadId(48) packet_cmd{src_chain=ibc-0 src_port=transfer src_channel=channel-0 dst_chain=ibc-1}:relay{odata=MHuteAoNnY ->Destination @0-1901; len=2}: [Async~>ibc-1] response(s): 1; Ok:8EBA697D9E8DFB12D3CA00CE7D150E8856910EB87D51EE8B9020FBC6CDFD02CE

2022-04-13T00:43:08.097930Z  INFO ThreadId(48) packet_cmd{src_chain=ibc-0 src_port=transfer src_channel=channel-0 dst_chain=ibc-1}:relay{odata=MHuteAoNnY ->Destination @0-1901; len=2}: success
...


2022-04-13T00:43:30.316833Z DEBUG ThreadId(41) send_tx_check{id=TA1yB5XJAs}:send_tx{id=ibc-1}: broadcast_tx_sync: Response { code: Ok, data: Data([]), log: Log("[]"), hash: transaction::Hash(A4268A7A8B8A16C30E0FD45E27C904D18A397655E2D67BEDDF52759A2D6E8CD7) }
2022-04-13T00:43:30.316968Z  INFO ThreadId(48) packet_cmd{src_chain=ibc-0 src_port=transfer src_channel=channel-0 dst_chain=ibc-1}:relay{odata=TA1yB5XJAs ->Destination @0-1905; len=2}: [Async~>ibc-1] response(s): 1; Ok:A4268A7A8B8A16C30E0FD45E27C904D18A397655E2D67BEDDF52759A2D6E8CD7

2022-04-13T00:43:30.322178Z  INFO ThreadId(48) packet_cmd{src_chain=ibc-0 src_port=transfer src_channel=channel-0 dst_chain=ibc-1}:relay{odata=TA1yB5XJAs ->Destination @0-1905; len=2}: success
...


2022-04-13T00:43:56.609159Z DEBUG ThreadId(41) send_tx_check{id=JXo-F5HW-h}:send_tx{id=ibc-1}: broadcast_tx_sync: Response { code: Ok, data: Data([]), log: Log("[]"), hash: transaction::Hash(992A454AAC5285503956FC1044BEDF79CAF1DDA444D9B6A50DF1D695E030BDB8) }
2022-04-13T00:43:56.609295Z  INFO ThreadId(48) packet_cmd{src_chain=ibc-0 src_port=transfer src_channel=channel-0 dst_chain=ibc-1}:relay{odata=JXo-F5HW-h ->Destination @0-1910; len=2}: [Async~>ibc-1] response(s): 1; Ok:992A454AAC5285503956FC1044BEDF79CAF1DDA444D9B6A50DF1D695E030BDB8

2022-04-13T00:43:56.616934Z  INFO ThreadId(48) packet_cmd{src_chain=ibc-0 src_port=transfer src_channel=channel-0 dst_chain=ibc-1}:relay{odata=JXo-F5HW-h ->Destination @0-1910; len=2}: success
...


2022-04-13T00:44:28.139083Z DEBUG ThreadId(41) send_tx_check{id=Io7VsBcCcr}:send_tx{id=ibc-1}: broadcast_tx_sync: Response { code: Ok, data: Data([]), log: Log("[]"), hash: transaction::Hash(16F67AC4C4CED8046286633455584B35CFC0EC840C78E4FCDBDFF4F020BDF60E) }
2022-04-13T00:44:28.139222Z  INFO ThreadId(48) packet_cmd{src_chain=ibc-0 src_port=transfer src_channel=channel-0 dst_chain=ibc-1}:relay{odata=Io7VsBcCcr ->Destination @0-1916; len=2}: [Async~>ibc-1] response(s): 1; Ok:16F67AC4C4CED8046286633455584B35CFC0EC840C78E4FCDBDFF4F020BDF60E

2022-04-13T00:44:28.139280Z  INFO ThreadId(48) packet_cmd{src_chain=ibc-0 src_port=transfer src_channel=channel-0 dst_chain=ibc-1}:relay{odata=Io7VsBcCcr ->Destination @0-1916; len=2}: success
...


2022-04-13T00:45:20.313677Z DEBUG ThreadId(41) send_tx_check{id=3WUZWMcJbS}:send_tx{id=ibc-1}: broadcast_tx_sync: Response { code: Ok, data: Data([]), log: Log("[]"), hash: transaction::Hash(0DE1CECCEF09918CDE79611A263ACBC508545F573884E759C12DB18F5C06AC75) }
2022-04-13T00:45:20.313810Z  INFO ThreadId(48) packet_cmd{src_chain=ibc-0 src_port=transfer src_channel=channel-0 dst_chain=ibc-1}:relay{odata=3WUZWMcJbS ->Destination @0-1926; len=2}: [Async~>ibc-1] response(s): 1; Ok:0DE1CECCEF09918CDE79611A263ACBC508545F573884E759C12DB18F5C06AC75

2022-04-13T00:45:20.313859Z  INFO ThreadId(48) packet_cmd{src_chain=ibc-0 src_port=transfer src_channel=channel-0 dst_chain=ibc-1}:relay{odata=3WUZWMcJbS ->Destination @0-1926; len=2}: success

...

2022-04-13T00:46:20.381414Z ERROR ThreadId(49) packet{src_chain=ibc-0 src_port=transfer src_channel=channel-0 dst_chain=ibc-1}: timed out while confirming TxHashes: count=1; 0DE1CECCEF09918CDE79611A263ACBC508545F573884E759C12DB18F5C06AC75
2022-04-13T00:46:20.381631Z  WARN ThreadId(49) packet{src_chain=ibc-0 src_port=transfer src_channel=channel-0 dst_chain=ibc-1}: failed. Regenerate operational data from 2 events
...
2022-04-13T00:46:25.078841Z ERROR ThreadId(49) packet{src_chain=ibc-0 src_port=transfer src_channel=channel-0 dst_chain=ibc-1}: timed out while confirming TxHashes: count=1; 16F67AC4C4CED8046286633455584B35CFC0EC840C78E4FCDBDFF4F020BDF60E
2022-04-13T00:46:25.078895Z  WARN ThreadId(49) packet{src_chain=ibc-0 src_port=transfer src_channel=channel-0 dst_chain=ibc-1}: failed. Regenerate operational data from 2 events
...
2022-04-13T00:46:30.295622Z ERROR ThreadId(48) packet_cmd{src_chain=ibc-0 src_port=transfer src_channel=channel-0 dst_chain=ibc-1}: timed out while confirming TxHashes: count=1; 992A454AAC5285503956FC1044BEDF79CAF1DDA444D9B6A50DF1D695E030BDB8
2022-04-13T00:46:30.295734Z  WARN ThreadId(48) packet_cmd{src_chain=ibc-0 src_port=transfer src_channel=channel-0 dst_chain=ibc-1}: failed. Regenerate operational data from 2 events
...
2022-04-13T00:46:35.396904Z ERROR ThreadId(49) packet{src_chain=ibc-0 src_port=transfer src_channel=channel-0 dst_chain=ibc-1}: timed out while confirming TxHashes: count=1; A4268A7A8B8A16C30E0FD45E27C904D18A397655E2D67BEDDF52759A2D6E8CD7
2022-04-13T00:46:35.397103Z  WARN ThreadId(49) packet{src_chain=ibc-0 src_port=transfer src_channel=channel-0 dst_chain=ibc-1}: failed. Regenerate operational data from 2 events
...
2022-04-13T00:46:40.645187Z ERROR ThreadId(48) packet_cmd{src_chain=ibc-0 src_port=transfer src_channel=channel-0 dst_chain=ibc-1}: timed out while confirming TxHashes: count=1; 8EBA697D9E8DFB12D3CA00CE7D150E8856910EB87D51EE8B9020FBC6CDFD02CE
2022-04-13T00:46:40.645272Z  WARN ThreadId(48) packet_cmd{src_chain=ibc-0 src_port=transfer src_channel=channel-0 dst_chain=ibc-1}: failed. Regenerate operational data from 2 events

@seanchen1991 (Contributor, Author) commented:

@ancazamfir Thanks for digging into this! Is there any action to be taken based on your findings?

@romac (Member) commented Apr 13, 2022

After discussing with @ancazamfir:

  1. Let's fix the reversal of txs.
  2. Document that tx_confirmation only has an effect if clear_interval is 0.
  3. Let's make tx_confirmation = false by default in a separate PR.
  4. Let's ponder the pros and cons of having tx_confirmation = true vs. only using clear_interval.

@seanchen1991 Could you please look into (1) whenever you have time (not urgent)? Cf. "The other problem is the following test" in #1951 (comment).

@romac romac marked this pull request as draft April 13, 2022 13:46
@adizere (Member) commented Apr 13, 2022

But I am not sure the retry when tx_confirmation is true should only apply when the clearing interval is 0. (Note: this should be documented in config.toml; I figured it out by reading the code.)

Note that if clear_interval != 0 then it is not necessary for Hermes to retry as part of tx_confirmation (because retrying would happen periodically anyway with the clear interval).

Document that tx_confirmation only has an effect if clear_interval is 0.

Not sure what we would mean by this.
The documented effect of tx_confirmation = true is that Hermes searches for transaction confirmations and updates telemetry. The undocumented effect is that Hermes also retries submitting transactions (but does so only when clear_interval = 0, because otherwise it would overlap in functionality with the clear-interval mechanism). This seems like an implementation detail to me, but I would not be against documenting it.

https://github.com/informalsystems/ibc-rs/blob/37dbe8c42ae09403432ebfcfdb2a8a34d86909e9/config.toml#L56-L62
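
For context, here is a minimal Rust sketch of how this interaction can be encoded. Only the names Resubmit and process_pending come from this PR's commit list; the constructor, signatures, and bodies below are assumptions for illustration, not the actual Hermes code:

#[derive(Copy, Clone, Debug, PartialEq)]
enum Resubmit {
    Yes, // clear_interval == 0: the confirmation worker must retry on its own
    No,  // clear_interval != 0: periodic packet clearing retries anyway
}

impl Resubmit {
    // Hypothetical constructor deriving the policy from the config value.
    fn from_clear_interval(clear_interval: u64) -> Self {
        if clear_interval == 0 {
            Resubmit::Yes
        } else {
            Resubmit::No
        }
    }
}

// Hypothetical shape of the decision point in the pending-tx worker.
fn process_pending(resubmit: Resubmit) -> &'static str {
    match resubmit {
        Resubmit::Yes => "regenerate operational data and resubmit",
        Resubmit::No => "drop the pending tx; packet clearing will pick it up",
    }
}

fn main() {
    // tx_confirmation = true leads to resubmission only in the first case.
    assert_eq!(process_pending(Resubmit::from_clear_interval(0)),
               "regenerate operational data and resubmit");
    assert_eq!(process_pending(Resubmit::from_clear_interval(100)),
               "drop the pending tx; packet clearing will pick it up");
}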

@romac (Member) commented Apr 13, 2022

Document that tx_confirmation only has an effect if clear_interval is 0.
Not sure what we would mean by this.

Sorry, what I meant was that tx_confirmation = true will only lead to resubmissions if clear_interval = 0.

@seanchen1991 (Contributor, Author) commented Apr 15, 2022

Found what appears to be a pretty obvious bug in the Queue::push_back implementation:

pub fn push_back(&self, val: T) {
-   self.0.acquire_write().push_front(val)
+   self.0.acquire_write().push_back(val)
}
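
The one-character change matters because pending transactions are drained from the front of the queue: pushing new entries to the front made the queue behave as a LIFO stack, which would explain the reverse-order retries. Below is a standalone Rust sketch of the effect; it is not the actual Hermes Queue (the real one goes through an acquire_write helper, as the diff shows), and the lock type and method bodies here are assumptions:

use std::collections::VecDeque;
use std::sync::{Arc, RwLock};

struct Queue<T>(Arc<RwLock<VecDeque<T>>>);

impl<T> Queue<T> {
    fn new() -> Self {
        Queue(Arc::new(RwLock::new(VecDeque::new())))
    }

    fn push_back(&self, val: T) {
        // Before the fix this called push_front, so pop_front returned the
        // newest element first (LIFO) instead of the oldest (FIFO).
        self.0.write().expect("lock poisoned").push_back(val)
    }

    fn pop_front(&self) -> Option<T> {
        self.0.write().expect("lock poisoned").pop_front()
    }
}

fn main() {
    let q = Queue::new();
    q.push_back("tx-1");
    q.push_back("tx-2");
    q.push_back("tx-3");
    // With the fix, pending txs come back in submission order:
    assert_eq!(q.pop_front(), Some("tx-1"));
    assert_eq!(q.pop_front(), Some("tx-2"));
    assert_eq!(q.pop_front(), Some("tx-3"));
}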

@ancazamfir I wasn't quite able to reproduce the log outputs you were seeing, so I can't corroborate that this fix indeed addresses the reversal of txs you observed. Either you could pull these changes and retrace the steps you took, or you could add some idiot-proof instructions for me to retrace your steps on my end 🙂

@ancazamfir (Collaborator) commented:

Found what appears to be a pretty obvious bug in the Queue::push_back implementation: [...]

Just tested, and 74e7689 fixed it! Thanks! The setup is pretty rough: it requires a custom gaia image that includes a hack in tendermint block execution, plus some patches in hermes to deal with the app-height vs. blockchain-height issue. Then I poked around with the timeout in the ft-transfer to get into the no-confirmation case (I still need to debug why a Tx that passed CheckTx is not included in the block even if it failed DeliverTx).

@adizere (Member) commented Apr 16, 2022

After discussing with @ancazamfir:

  1. Let's fix the reversal of txs.
  2. Document that tx_confirmation only has an effect if clear_interval is 0.
  3. Let's make tx_confirmation = false by default in a separate PR.
  4. Let's ponder the pros and cons of having tx_confirmation = true vs. only using clear_interval.

Seems like 1 and 2 are done.

@romac @ancazamfir: Should we block this PR on point 4, or shall we move forward here?

@romac (Member) commented Apr 16, 2022

Point 4 can wait, given that there are operators using tx confirmations. Let's move forward!

@ancazamfir (Collaborator) left a comment


lgtm, thanks @seanchen1991!!

@adizere (Member) left a comment


Looks good. Unless Romain or Anca want to add anything, I think this is ready. Great work Sean!

@adizere adizere merged commit 3e7b276 into informalsystems:master Apr 20, 2022
hu55a1n1 pushed a commit to hu55a1n1/hermes that referenced this pull request Sep 13, 2022
Fix Hermes retrying mechanism not regenerating messages (informalsystems#1951)

* Sketching out refactor

* Only relay operational data if `clear_interval` == 0

* Back to making changes to `process_pending`

* Pass `clear_interval` parameter to `process_pending` fn

* Add RelayPath parameter to `process_pending` fn

* Make `regenerate_operational_data` private.

* Call `process_pending` such that operational data can now be regenerated

* Fix clippy warnings

* Remove unnecessary comment

* Add changelog entry

* Update doc comment for `regenerate_operational_data` method

* Replace `clear_interval` param with `do_resubmit` in fn signatures

* Improve doc comments for the `process_pending` fn

* Introduce `Resubmit` type instead of boolean

* Document the interaction between `clear_interval` and `tx_confirmation` parameters

* Fix incorrect comment

* Switch from if on `Resubmit` to match

* Fix Queue::push_back method

Co-authored-by: Romain Ruetschi <romain@informal.systems>
Successfully merging this pull request may close: Hermes retrying mechanism in tx confirmation logic should be regenerating messages