-
Notifications
You must be signed in to change notification settings - Fork 332
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Hermes takes a very long time to clear high number of pending packets #2087
Comments
Alternative option, from my comment at #2077 (comment)
That's indeed an issue, not clear how to avoid that. Perhaps by not trying to clear packets at a higher height than the one the chain was at when the command was started? ie. the command would clear packets from oldest to newest and stop once it sees a packet at a height higher than the starting height. Also, ideally the command would use pagination and display its progress as it clears the packets, perhaps with a nice TUI progress bar? |
I have tried a scenario where 100m packets are generated first, and then hemes is started. At this time, the clearing packet of Hermes will be triggered. |
when the query commitments on chain is called, the query will become very slow when the request paging is set to all (when there are too many non relayed packets on the chain), so I set the limit to chain config: num_ msg - 1 (The remaining msg is reserved for UpdateClient) |
* Added docs for unreceived_acknowledgements and unreceived_packets * Flipped order of sentences * First iteration for #2087; packet-recv works, packet-ack left * Incremental processing of packet-ack relaying * Added consensus_state_height_bounded with optimistic path * Nits * Simplified progress tracking * Comment cleanup * fmt and better logs * Document the alternative path in consensus_state_height_bounded * renamed method to make it more explicit h/t SeanC * better var name * Added workaround to tonic::code inconsistencies. Ref: #1971 (comment) * refactored the packet query methods * Refactored CLIs for incremental relaying * Experiment * Fix for packet send for non-zero delay * Using app status instead of network status * fmt * Undoing unnecessary chenges on foreign_client.rs * Removed outdated comments * Apply suggestions from code review Co-authored-by: Anca Zamfir <ancazamfir@users.noreply.github.com> * Small refactor in cli.rs * Address review comments * Remove duplicate code * Undo one-chain change * Fix documentation * Changelog Co-authored-by: Anca Zamfir <zamfiranca@gmail.com> Co-authored-by: Anca Zamfir <ancazamfir@users.noreply.github.com>
* Added docs for unreceived_acknowledgements and unreceived_packets * Flipped order of sentences * First iteration for informalsystems#2087; packet-recv works, packet-ack left * Incremental processing of packet-ack relaying * Added consensus_state_height_bounded with optimistic path * Nits * Simplified progress tracking * Comment cleanup * fmt and better logs * Document the alternative path in consensus_state_height_bounded * renamed method to make it more explicit h/t SeanC * better var name * Added workaround to tonic::code inconsistencies. Ref: informalsystems#1971 (comment) * refactored the packet query methods * Refactored CLIs for incremental relaying * Experiment * Fix for packet send for non-zero delay * Using app status instead of network status * fmt * Undoing unnecessary chenges on foreign_client.rs * Removed outdated comments * Apply suggestions from code review Co-authored-by: Anca Zamfir <ancazamfir@users.noreply.github.com> * Small refactor in cli.rs * Address review comments * Remove duplicate code * Undo one-chain change * Fix documentation * Changelog Co-authored-by: Anca Zamfir <zamfiranca@gmail.com> Co-authored-by: Anca Zamfir <ancazamfir@users.noreply.github.com>
Summary
With thousands of pending packets hermes progress when clearing packets is very very slow. Some steps in a typical flow (e.g.
build_recv_packet_and_timeout_msgs
) are:query_txs()
for each sequence to find the packetDuring testing with thousands of pending packets I noticed that a Tx simulation will fail pretty early. And the clearing process above starts from the beginning. With 5k pending packets on a local (fast) setup just the
query_txs()
takes 2-3 minutes. And it takes 3-5 mins in the end to clear a few hundred packets.The typical failure for simulation is the account mismatch error. And the recaching of the account sequence on that error causes the second try to fail again immediately.
Note: ignoring that sequence mismatch and continuing with the default gas improves things significantly.
Problem Definition
Clearing a high number of pending packets takes a very long time
Proposal
relay_pending_packets()
.Acceptance Criteria
Discussion
Hermes CLI for packet clearing should provide support for splitting the workload in multiple, smaller batches, to make this command more tractable. This can be done, for instance, with any of the following suggestions:
A. By adding a flag
clear packets ... --limit X
to specify that Hermes should limit itself to fetching only X amount of packets/acks (instead of all pages). Thelimit
option should probably be mandatory.B. Alternatively, we could handle splitting into smaller batches of X messages automatically within the implementation of
clear packets
and have a simpler interface.The advantage of A. is that the command is a lot more interactive and will finish faster. The problem is that the operators will not know if there will still be packets left to relay, so they will need to repeatedly invoke
clear packets --limit X
until this method returns an empty result (signalling there was nothing to clear).The advantage of B. is that this command is smarter and simpler. The disadvantage is that will take longer to finish. On a high-activity path with constant activity, depending on the implementation and how we split-up the batches, I can even imagine that this method will constantly run for as long as there are packets being created, which is a problem (the CLI should return, instead of grabbing & relaying every new packet).
Other solutions might exist. We should evaluate the pros and cons through the lenses of operator user experience and simplicity.
Tasks
For Admin Use
The text was updated successfully, but these errors were encountered: