-
Notifications
You must be signed in to change notification settings - Fork 329
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Hermes clears packet out of order, even on ordered channels #2670
Comments
So I think figured out the reason the packets were cleared out of order: In the logs linked above, we see the following on lines 2-5:
On line 2 we find 33 unreceived packets, listed in order from 4 to 36. Then we fetch the corresponding events for these sequence numbers with hermes/crates/relayer/src/link/packet_events.rs Lines 76 to 81 in 1f15e6b
so we fallback to hermes/crates/relayer/src/link/packet_events.rs Lines 94 to 100 in 1f15e6b
In StartBlock, we get events for packet sequences The vectors containing the StartBlock and EndBlock events are then appended together, with StartBlock events coming first and the EndBlock events coming last, meaning that the sequence numbers of the RecvPacket events are now out of order, ie. hermes/crates/relayer/src/link/packet_events.rs Lines 106 to 110 in 1f15e6b
It is not clear to me why we are doing this this way, and this comment does not give any rationale for this. |
I have not figured out how to reproduce this on my local setup yet. Nevertheless, even if we had a fix for this, eg. by (partially) ordering the events by packet sequence, there are more issues with packet clearing on ordered channels.
Using our fork of Gaia v6 with support for ordered channels on the For example, if I send 300 packets on a fresh ordered channel, and then trigger a packet clear, Hermes will correctly see that there are 300 unreceived packets with sequence numbers 1 to 300.
Hermes will then pull 300 SendPacket events for these sequences and submit the corresponding RecvPacket messages.
Issue 1If It might work the first time:
Here, Hermes submitted two txs, one for the first 29 events (1-29) and one for events 30-50 (51 events in total, due to the UpdateClient). But at some point:
Here, Hermes submitted two txs, one for events 50-79 and one for 80-100 (not counting the UpdateClient), but the chain chose to execute the tx containing events 80-100 before the tx containing events 50-79, which results in the following error:
Solution 1This can be fixed by setting I went on to set Issue 2But it failed again, albeit at a different stage this time and for another reason. Hermes managed to clear all SendPacket messages by submitting the corresponding RecvPacket messages, which yielded 300 WriteAck events. Hermes then checks for any unreceived acks, and find the 300 sequence numbers corresponding to 300 WriteAck events previously yielded. It then pulls the corresponding WriteAck events using
But then, when pulling the packet data for the last 50 sequence numbers (250-300):
Hermes only gets the WriteAck events for 21 sequence numbers instead of 50 ("pulled packet data for 21 events")... Specifically, it only gets the events for sequence numbers 280 to 300, but not for 250 to 279. So, when it builds a tx with the acks for sequence 280 to 300:
The chain rejects the tx with:
Tentative 2.1I initially thought that this was happening because some of the events were emitted at BeginEnd or EndBlock, and we don't handle this when pulling WriteAck events, unlike what we do for SendPacket events. But that was not it, I couldn't find the missing events using Tentative 2.2Hoping that this was instead a transient issue with the node, I modified the code to repeat the Click here to show codefn query_until<R, E>(
do_query: impl Fn() -> Result<R, E>,
pred: impl Fn(&R) -> bool,
) -> Result<R, E> {
loop {
let result = do_query()?;
if pred(&result) {
return Ok(result);
} else {
tracing::debug!("did not get enough events, retrying...");
std::thread::sleep(std::time::Duration::from_millis(500));
}
}
}
/// Returns an iterator on batches of packet events.
pub fn query_packet_events_with<'a, ChainA>(
sequences: &'a [Sequence],
query_height: Height,
src_chain: &'a ChainA,
path: &'a PathIdentifiers,
query_fn: impl Fn(&ChainA, &PathIdentifiers, &[Sequence], Height) -> Result<Vec<IbcEvent>, LinkError>
+ 'a,
) -> impl Iterator<Item = Vec<IbcEventWithHeight>> + 'a
where
ChainA: ChainHandle,
{
let events_total = sequences.len();
let mut events_left = events_total;
sequences
.chunks(QUERY_RESULT_LIMIT)
.map_while(move |chunk| {
let events = query_until(
|| query_fn(src_chain, path, chunk, query_height),
|events| events.len() == chunk.len(),
); Preliminary conclusionIt seems to me that this is a chain bug, for reasons totally unclear at the moment. But I am not ruling out the possibility that Hermes is doing something wrong when clearing SendPacket events. The search continues... |
Full logs for the above: https://gist.github.com/romac/edef065925a553b439b95fa96a6bd376 |
Not sure if this ibc-go bug I reported a while back has to do with any of this? |
Ah, there is a bug here in the code. Within a given block the ordering above is correct. However I believe in this case we end up with events from multiple blocks that contain packets with those sequences. So let's say those packets were included in blocks b1 and b2. We return the events in this order: Instead it should be: For unordered channels the issue is not visible but we do probably some unfair processing (we may relay packets in a different order than they were sent). Regarding Issue 1 and 2 you describe, I believe the root cause is the same as the one described in #2249. This fixed the simulation failure for account sequence mismatch. First thought is to identify this error and use default gas here as well. But it will be a hack as is the case in #2298 (the fix for #2249). The root cause here is that the |
For hermes/crates/relayer/src/chain/cosmos/query/tx.rs Lines 73 to 75 in 1f15e6b
It should be something like
and then sort the events in I am also not sure anymore what we are trying to achieve with querying for events in the block at query height. We optimize a bit at query height but not for events that happened at previous heights. I tried to remove these lines: hermes/crates/relayer/src/chain/cosmos/query/tx.rs Lines 45 to 75 in 1f15e6b
This code that gets the packet events (send and ack) has been reworked quite a bit and maybe we missed something or it was always wrong. I think Maybe we should clean this up and look at optimizations later. |
Looks like #2512 introduced "Issue 2". So when fixing it we need to also look at the new CLI option introduced there and make sure it still works. |
Summary of Bug
The logs below were provided by @sainoe after trying to clear packets stuck on an ordered channel in the Interchain Security testnet. He ran the following command:
There are 33 unreceived packets with packet sequences:
4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36
.In the logs, we can see the following:
SendPacket
event for sequence 4SendPacket
event for sequence 33SendPacket
events for sequences 34, 35, 36SendPacket
event for sequence 5SendPacket
events for sequence 6 to 32simulate tx
step of gas estimation:Full logs
It appears that Hermes is clearing the packets out of order. Namely, it relays
SendPacket
events for packets 4 and 5, but then jumps to packet 33 and the latter is thus rejected by the chain because it is out of order, which is invalid on an ordered channel.Version
Hermes v1.0.0 from my branch
romac/block-search-evidence
.Steps to Reproduce
TBD
In the meantime, see the logs linked above.
Acceptance Criteria
Hermes clears packet in order, at the very least on ordered channels but there is no reason not to do it as well on unordered channels.
For Admin Use
The text was updated successfully, but these errors were encountered: