Hermes retrying mechanism in tx confirmation logic should be regenerating messages #1792
Comments
Hi,
|
One potential way to reproduce this is to set a very aggressive timeout, e.g. 2 seconds, as the confirmation mechanism parameter. This way, the tx confirmation mechanism in Hermes will time out before it finds confirmations on the chain, and it will consequently resubmit the same packets again. Steps:
I managed to do the above steps and got some interesting results: Hermes resubmitted the same packet 4 times in a row. Logs
I'm actually starting to think we should eliminate the resubmission step altogether -- or at least if periodic packet clearing is enabled. If |
Does this mean we want to do something like:
// only resubmit if `clear_interval` is set to 0
if clear_interval == 0 {
    let resubmit_res = resubmit(pending.original_od.clone());
    ...
} |
Yes, along those lines. But that will not be sufficient to fix the issue entirely. There will remain the case when we're resubmitting, and in that case we should (like in the issue description)
So something like:
// only resubmit if `clear_interval` is set to 0
if clear_interval == 0 {
    let new_od = regenerate_operational_data(pending.original_od);
    let resubmit_res = resubmit(new_od);
    ...
} |
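To make the idea above concrete, here is a minimal, self-contained sketch of what "regenerate, then resubmit" could look like. Everything in it is an assumption for illustration: `PacketMsg`, `OperationalData`, `maybe_resubmit`, and the `dst_height` parameter are simplified stand-ins, not the actual `ibc-relayer` types or signatures. It also only drops packets whose timeout height has been reached; a real fix would presumably need to generate timeout messages for those packets as well.

```rust
/// Hypothetical, simplified stand-in for a packet message (not the ibc-relayer type).
#[derive(Clone, Debug, PartialEq)]
struct PacketMsg {
    sequence: u64,
    /// Destination-chain height at which the packet times out (0 = no height timeout).
    timeout_height: u64,
}

/// Hypothetical stand-in for operational data: a batch of messages to relay.
#[derive(Clone, Debug, PartialEq)]
struct OperationalData {
    batch: Vec<PacketMsg>,
}

/// Rebuild the batch against the current destination height, keeping only
/// the packets that can still be received (timeout height not yet reached).
fn regenerate_operational_data(od: &OperationalData, dst_height: u64) -> OperationalData {
    let batch = od
        .batch
        .iter()
        .filter(|m| m.timeout_height == 0 || dst_height < m.timeout_height)
        .cloned()
        .collect();
    OperationalData { batch }
}

/// Return the data to resubmit, or `None` if nothing should be resubmitted.
fn maybe_resubmit(
    od: &OperationalData,
    clear_interval: u64,
    dst_height: u64,
) -> Option<OperationalData> {
    // Only resubmit if `clear_interval` is set to 0; otherwise periodic
    // packet clearing is expected to pick the packets up later.
    if clear_interval != 0 {
        return None;
    }
    let new_od = regenerate_operational_data(od, dst_height);
    if new_od.batch.is_empty() {
        None
    } else {
        Some(new_od)
    }
}
```

With this shape, a batch whose packets have all timed out yields `None`, so the caller never rebroadcasts messages that are known to be stale.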
I think this fix is going to require some larger architectural changes than we might have initially thought. |
Crate
ibc-relayer
Summary of Bug
When Hermes observes that a transaction hash remains unconfirmed, if `tx_confirmation = true` it proceeds to re-submit that transaction. The resubmit step happens here:
https://github.com/informalsystems/ibc-rs/blob/37d54d4d851d3d7af394845e05b17b4d0e66afd7/relayer/src/link/pending.rs#L158-L165
Which simply calls the `relay_from_operational_data` method:
https://github.com/informalsystems/ibc-rs/blob/2757031e029c5456f7cfe483bcca0e34ba2d5ef4/relayer/src/link/relay_path.rs#L1279
This is problematic, however, because the method `relay_from_operational_data` blindly takes the messages that Hermes originally generated and calls `broadcast_tx_sync` with them, but these messages may comprise PacketRecv that are no longer relevant (because they might have timed out in the meantime). If Hermes fails to resubmit successfully, it retries indefinitely.
Version
v0.10
Steps to Reproduce
Unclear how to reproduce this; the main source of this bug report is a log (h/t @faddat):
https://gist.github.com/adizere/e49c5083b3a3bae2d4a03735e5a8196a
As these logs reveal, the timeout in the confirmation logic triggers, and then Hermes tries to resubmit, but the resubmission fails because the packets have timed out. The relevant snippet is this one:
Acceptance Criteria
There are two separate problems here:
`push_back` in the pending queue.
I think solving (2) will imply solving (1), but not 100% sure.
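As a rough illustration of the indefinite-retry problem mentioned in the summary, the sketch below bounds how often a failed resubmission is put back into a pending queue. It is only a sketch under stated assumptions: `PendingTx`, `Outcome`, `retry_front`, and the `MAX_RESUBMIT_ATTEMPTS` limit are hypothetical names chosen for this example, and the queue is a plain `VecDeque`, not the actual Hermes pending-transactions structure.

```rust
use std::collections::VecDeque;

/// Assumed upper bound on failed resubmissions (illustrative value).
const MAX_RESUBMIT_ATTEMPTS: u32 = 3;

/// Hypothetical pending entry; field names are illustrative only.
#[derive(Debug)]
struct PendingTx {
    id: u32,
    attempts: u32,
}

/// What happened to the entry at the front of the queue.
#[derive(Debug, PartialEq)]
enum Outcome {
    /// The resubmission succeeded.
    Resubmitted,
    /// The resubmission failed and the entry was queued again.
    Requeued,
    /// The resubmission failed too often; the entry with this id was dropped.
    Dropped(u32),
}

/// Take the entry at the front of the queue and try to resubmit it.
/// `resubmit` returns `true` on success. A failed entry is pushed back
/// only while it has attempts left, so it cannot be retried forever.
fn retry_front<F>(queue: &mut VecDeque<PendingTx>, resubmit: F) -> Option<Outcome>
where
    F: FnOnce(&PendingTx) -> bool,
{
    let mut tx = queue.pop_front()?;
    if resubmit(&tx) {
        return Some(Outcome::Resubmitted);
    }
    tx.attempts += 1;
    if tx.attempts < MAX_RESUBMIT_ATTEMPTS {
        queue.push_back(tx);
        Some(Outcome::Requeued)
    } else {
        Some(Outcome::Dropped(tx.id))
    }
}
```

Whether a dropped entry should be left to periodic packet clearing or reported as an error is a design choice this sketch leaves open.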