Hermes takes a very long time to clear high number of pending packets #2087

Closed · 6 tasks
ancazamfir opened this issue Apr 12, 2022 · 5 comments · Fixed by #2167
Labels: A: breaking (Admin: breaking change that may impact operators), O: performance (Objective: cause to improve performance), O: usability (Objective: cause to improve the user experience (UX) and ease using the product)
Milestone: v0.14.0

Comments

ancazamfir commented Apr 12, 2022

Summary

With thousands of pending packets, Hermes makes very slow progress when clearing them. The steps in a typical flow (e.g. build_recv_packet_and_timeout_msgs) are (a rough sketch of this flow follows the list):

  • determine the sequences of pending packets
  • query_txs() for each sequence to find the packet
  • generate the messages
  • for each chunk of N msgs:
    • create the Tx
    • simulate the Tx
    • broadcast_tx_sync
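
For reference, a simplified sketch of this flow; `Chain`, `Packet`, `Msg`, `Error` and the methods on them are illustrative placeholders, not the actual Hermes API:

```rust
// Illustrative sketch of the clearing flow above; all names are placeholders.
fn clear_pending_packets(chain: &Chain, chunk_size: usize) -> Result<(), Error> {
    // 1. Determine the sequences of pending packets.
    let sequences = chain.pending_packet_sequences()?;

    // 2. query_txs() once per sequence to recover the packet data (the slow part).
    let packets = sequences
        .iter()
        .map(|seq| chain.query_txs_for_sequence(*seq))
        .collect::<Result<Vec<Packet>, Error>>()?;

    // 3. Generate the messages (MsgRecvPacket / MsgTimeout).
    let msgs: Vec<Msg> = packets.iter().map(build_msg).collect();

    // 4. For each chunk of N messages: create, simulate, and broadcast the Tx.
    for chunk in msgs.chunks(chunk_size) {
        let tx = chain.build_tx(chunk)?;
        let gas = chain.simulate_tx(&tx)?;
        chain.broadcast_tx_sync(&tx, gas)?;
    }
    Ok(())
}
```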

During testing with thousands of pending packets I noticed that a Tx simulation fails fairly early, and the clearing process above then restarts from the beginning. With 5k pending packets on a local (fast) setup, the query_txs() calls alone take 2-3 minutes, and in the end it takes 3-5 minutes to clear only a few hundred packets.
The typical simulation failure is the account sequence mismatch error, and re-caching the account sequence on that error causes the second attempt to fail again immediately.
Note: ignoring that sequence mismatch and continuing with the default gas improves things significantly (see the sketch below).
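
As a rough illustration of that note, the fallback could look like this (hypothetical names; not the actual error handling in Hermes):

```rust
// Illustrative sketch: on an account sequence mismatch during simulation,
// fall back to the configured default gas instead of aborting the run.
// `Chain`, `Tx`, `Error` and `is_account_sequence_mismatch` are placeholders.
fn estimate_gas(chain: &Chain, tx: &Tx, default_gas: u64) -> Result<u64, Error> {
    match chain.simulate_tx(tx) {
        Ok(estimated) => Ok(estimated),
        Err(e) if e.is_account_sequence_mismatch() => {
            // If the mismatch is real, CheckTx will reject the broadcast,
            // and the account sequence can be re-cached at that point.
            Ok(default_gas)
        }
        Err(e) => Err(e),
    }
}
```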

Problem Definition

Clearing a high number of pending packets takes a very long time

Proposal

  • limit the number of events that generate one operational data. For example, for 5k pending packets we could generate 50 operational data elements. This minimizes the work Hermes does before the Tx-es go through or fail (see the sketch after this list).
  • probably also redesign the retry logic in relay_pending_packets().
  • (maybe) ignore the sequence mismatch in the Tx simulation. If there is a real mismatch, CheckTx will catch it and we can re-cache the account sequence there.
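
A minimal sketch of the first bullet, assuming a hypothetical `OperationalData::from_events` constructor and an illustrative batch size:

```rust
// Illustrative sketch: split pending events into fixed-size batches so that
// each batch becomes its own operational data (e.g. 5k packets -> 50 batches).
// `Event` and `OperationalData` are placeholders, not the actual Hermes types.
const EVENTS_PER_BATCH: usize = 100;

fn split_into_operational_data(events: Vec<Event>) -> Vec<OperationalData> {
    events
        .chunks(EVENTS_PER_BATCH)
        .map(|batch| OperationalData::from_events(batch.to_vec()))
        .collect()
}
```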

Acceptance Criteria

Copied from #2077. Please see that issue and comments therein for an extended discussion on acceptance criteria and requirements.

Discussion

The Hermes CLI for packet clearing should support splitting the workload into multiple, smaller batches, to make this command more tractable. This could be done, for instance, with either of the following suggestions:

A. By adding a flag clear packets ... --limit X to specify that Hermes should limit itself to fetching only X packets/acks (instead of all pages). The limit option should probably be mandatory.

B. Alternatively, we could handle splitting into smaller batches of X messages automatically within the implementation of clear packets and have a simpler interface.

The advantage of A. is that the command is a lot more interactive and will finish faster. The problem is that operators will not know whether there are still packets left to relay, so they will need to repeatedly invoke clear packets --limit X until it returns an empty result (signalling there was nothing left to clear).

The advantage of B. is that the command is smarter and the interface simpler. The disadvantage is that it will take longer to finish. On a path with constant, high activity, depending on the implementation and how we split up the batches, I can even imagine this method running for as long as new packets keep being created, which is a problem (the CLI should return, instead of grabbing & relaying every new packet).

Other solutions might exist. We should evaluate the pros and cons through the lenses of operator user experience and simplicity.
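
For illustration only, option A's flag could be declared roughly like this with a clap-style derive (the actual Hermes CLI framework and existing flags may differ):

```rust
use clap::Parser;

/// Hypothetical shape of `hermes clear packets` with a mandatory `--limit`,
/// per option A above. Field names are illustrative, not the real CLI.
#[derive(Debug, Parser)]
struct ClearPacketsCmd {
    /// Identifier of the chain to clear packets on.
    #[clap(long)]
    chain: String,

    /// Maximum number of pending packets/acks to fetch and relay in this run.
    /// Operators re-run the command until nothing is left to clear.
    #[clap(long)]
    limit: usize,
}
```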

Tasks

  • add integration tests to assert the correctness of new behavior

For Admin Use

  • Not duplicate issue
  • Appropriate labels applied
  • Appropriate milestone (priority) applied
  • Appropriate contributors tagged
  • Contributor assigned/self-assigned
@adizere added this to the v0.14.0 milestone on Apr 12, 2022
@adizere added the O: usability, O: performance, A: breaking, and P-high labels on Apr 12, 2022

romac commented Apr 13, 2022

Alternative option, from my comment at #2077 (comment)

> On a high-activity path with constant activity, depending on the implementation and how we split up the batches, I can even imagine that this method will constantly run for as long as there are packets being created

That's indeed an issue, and it's not clear how to avoid it. Perhaps by not trying to clear packets at a height higher than the one the chain was at when the command started? I.e., the command would clear packets from oldest to newest and stop once it sees a packet at a height higher than the starting height.
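
Roughly what that height bound could look like (illustrative names only, not the actual Hermes code):

```rust
// Illustrative sketch: record the chain height at the start of the command
// and stop once a pending packet above that height is reached, so the
// command terminates even if new packets keep arriving. Placeholder types.
fn clear_up_to_start_height(chain: &Chain) -> Result<(), Error> {
    let start_height = chain.latest_height()?;

    for packet in chain.pending_packets_oldest_first()? {
        if packet.height > start_height {
            // Packets created after the command started are left for a
            // later run (or for a long-running relayer instance).
            break;
        }
        relay_packet(chain, &packet)?;
    }
    Ok(())
}
```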


Also, ideally the command would use pagination and display its progress as it clears the packets, perhaps with a nice TUI progress bar?

@lyqingye

I have tried a scenario where 100m packets are generated first and then Hermes is started, which triggers Hermes's packet clearing.
When there are too many events on the chain, each query_txs(PacketRequest) call takes a long time, about 200ms against a local node, which makes packet clearing extremely slow.
I tried building a caching layer, using a KV store to hold all SendPacket and Ack information; when query_txs is called, the cache is queried first.
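
A minimal sketch of that cache-first lookup, with an in-memory map standing in for the KV store (all names are illustrative):

```rust
use std::collections::HashMap;

// Illustrative sketch of a cache-first lookup for packet events; a HashMap
// stands in for the KV store holding SendPacket/Ack data. Placeholder types.
struct PacketCache {
    by_sequence: HashMap<u64, PacketEvent>,
}

impl PacketCache {
    fn get_or_query(&mut self, chain: &Chain, sequence: u64) -> Result<PacketEvent, Error> {
        if let Some(event) = self.by_sequence.get(&sequence) {
            // Cache hit: skip the ~200ms query_txs round trip to the node.
            return Ok(event.clone());
        }
        // Cache miss: fall back to the slow tx query and populate the cache.
        let event = chain.query_txs_for_sequence(sequence)?;
        self.by_sequence.insert(sequence, event.clone());
        Ok(event)
    }
}
```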

@lyqingye

When the on-chain packet commitments query is called with pagination set to all, it becomes very slow when there are too many unrelayed packets on the chain, so I set the limit to the chain config's num_msg - 1 (the remaining message slot is reserved for UpdateClient).
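
Illustratively, the bounded query could look like this (placeholder names and methods; num_msg refers to the per-Tx message limit from the chain config):

```rust
// Illustrative sketch: instead of paging through all packet commitments,
// request at most num_msg - 1 of them, keeping one message slot free for
// the UpdateClient message in the same Tx. Placeholder types and methods.
fn pending_commitments(chain: &Chain, num_msg: usize) -> Result<Vec<Commitment>, Error> {
    let limit = num_msg.saturating_sub(1) as u64;
    chain.query_packet_commitments_with_limit(limit)
}
```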

adizere commented May 2, 2022

Hi @lyqingye, I took a first stab at having a more incremental approach to clearing packets. Do you mind having a look over the PR and testing in your setup -- #2167? Many thanks for the suggestions above & for any additional feedback!

lyqingye commented May 2, 2022

> Hi @lyqingye, I took a first stab at having a more incremental approach to clearing packets. Do you mind having a look over the PR and testing in your setup -- #2167? Many thanks for the suggestions above & for any additional feedback!

Hi @adizere, I will review your PR and give some suggestions.

adizere added a commit that referenced this issue May 13, 2022
* Added docs for unreceived_acknowledgements and unreceived_packets

* Flipped order of sentences

* First iteration for #2087; packet-recv works, packet-ack left

* Incremental processing of packet-ack relaying

* Added consensus_state_height_bounded with optimistic path

* Nits

* Simplified progress tracking

* Comment cleanup

* fmt and better logs

* Document the alternative path in consensus_state_height_bounded

* renamed method to make it more explicit h/t SeanC

* better var name

* Added workaround to tonic::code inconsistencies.

Ref:
#1971 (comment)

* refactored the packet query methods

* Refactored CLIs for incremental relaying

* Experiment

* Fix for packet send for non-zero delay

* Using app status instead of network status

* fmt

* Undoing unnecessary changes on foreign_client.rs

* Removed outdated comments

* Apply suggestions from code review

Co-authored-by: Anca Zamfir <ancazamfir@users.noreply.github.com>

* Small refactor in cli.rs

* Address review comments

* Remove duplicate code

* Undo one-chain change

* Fix documentation

* Changelog

Co-authored-by: Anca Zamfir <zamfiranca@gmail.com>
Co-authored-by: Anca Zamfir <ancazamfir@users.noreply.github.com>
hu55a1n1 pushed a commit to hu55a1n1/hermes that referenced this issue Sep 13, 2022, with the same commit message as above.