Hermes takes a very long time to clear high number of pending packets #2087

Closed · 6 tasks
ancazamfir opened this issue Apr 12, 2022 · 5 comments · Fixed by #2167
Labels: A: breaking (Admin: breaking change that may impact operators), O: performance (Objective: cause to improve performance), O: usability (Objective: cause to improve the user experience (UX) and ease using the product)
Milestone: v0.14.0

Comments

ancazamfir commented Apr 12, 2022

Summary

With thousands of pending packets, Hermes makes very slow progress when clearing them. The steps in a typical flow (e.g. build_recv_packet_and_timeout_msgs) are (a rough sketch of this flow follows the list):

  • determine the sequences of pending packets
  • query_txs() for each sequence to find the packet
  • generate the messages
  • for each chunk of N msgs:
    • create the Tx
    • simulate the Tx
    • broadcast_tx_sync
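
For reference, a simplified sketch of this flow; `Chain`, `Packet`, `Msg`, `Error` and the methods on them are illustrative placeholders, not the actual Hermes API:

```rust
// Illustrative sketch of the clearing flow above; all names are placeholders.
fn clear_pending_packets(chain: &Chain, chunk_size: usize) -> Result<(), Error> {
    // 1. Determine the sequences of pending packets.
    let sequences = chain.pending_packet_sequences()?;

    // 2. query_txs() once per sequence to recover the packet data (the slow part).
    let packets = sequences
        .iter()
        .map(|seq| chain.query_txs_for_sequence(*seq))
        .collect::<Result<Vec<Packet>, Error>>()?;

    // 3. Generate the messages (MsgRecvPacket / MsgTimeout).
    let msgs: Vec<Msg> = packets.iter().map(build_msg).collect();

    // 4. For each chunk of N messages: create, simulate, and broadcast the Tx.
    for chunk in msgs.chunks(chunk_size) {
        let tx = chain.build_tx(chunk)?;
        let gas = chain.simulate_tx(&tx)?;
        chain.broadcast_tx_sync(&tx, gas)?;
    }
    Ok(())
}
```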

During testing with thousands of pending packets I noticed that a Tx simulation fails fairly early, and the clearing process above then restarts from the beginning. With 5k pending packets on a local (fast) setup, the query_txs() calls alone take 2-3 minutes, and in the end it takes 3-5 minutes to clear only a few hundred packets.
The typical simulation failure is the account sequence mismatch error, and re-caching the account sequence on that error causes the second attempt to fail again immediately.
Note: ignoring that sequence mismatch and continuing with the default gas improves things significantly (see the sketch below).
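
As a rough illustration of that note, the fallback could look like this (hypothetical names; not the actual error handling in Hermes):

```rust
// Illustrative sketch: on an account sequence mismatch during simulation,
// fall back to the configured default gas instead of aborting the run.
// `Chain`, `Tx`, `Error` and `is_account_sequence_mismatch` are placeholders.
fn estimate_gas(chain: &Chain, tx: &Tx, default_gas: u64) -> Result<u64, Error> {
    match chain.simulate_tx(tx) {
        Ok(estimated) => Ok(estimated),
        Err(e) if e.is_account_sequence_mismatch() => {
            // If the mismatch is real, CheckTx will reject the broadcast,
            // and the account sequence can be re-cached at that point.
            Ok(default_gas)
        }
        Err(e) => Err(e),
    }
}
```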

Problem Definition

Clearing a high number of pending packets takes a very long time

Proposal

  • limit the number of events that generate one operational data. For example, for 5k pending packets we could generate 50 operational data elements. This minimizes the work Hermes does before the Tx-es go through or fail (see the sketch after this list).
  • probably also redesign the retry logic in relay_pending_packets().
  • (maybe) ignore the sequence mismatch in the Tx simulation. If there is a real mismatch, CheckTx will catch it and we can re-cache the account sequence there.
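
A minimal sketch of the first bullet, assuming a hypothetical `OperationalData::from_events` constructor and an illustrative batch size:

```rust
// Illustrative sketch: split pending events into fixed-size batches so that
// each batch becomes its own operational data (e.g. 5k packets -> 50 batches).
// `Event` and `OperationalData` are placeholders, not the actual Hermes types.
const EVENTS_PER_BATCH: usize = 100;

fn split_into_operational_data(events: Vec<Event>) -> Vec<OperationalData> {
    events
        .chunks(EVENTS_PER_BATCH)
        .map(|batch| OperationalData::from_events(batch.to_vec()))
        .collect()
}
```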

Acceptance Criteria

Copied from #2077. Please see that issue and comments therein for an extended discussion on acceptance criteria and requirements.

Discussion

The Hermes CLI for packet clearing should support splitting the workload into multiple, smaller batches, to make this command more tractable. This could be done, for instance, with either of the following suggestions:

A. By adding a flag clear packets ... --limit X to specify that Hermes should limit itself to fetching only X packets/acks (instead of all pages). The limit option should probably be mandatory.

B. Alternatively, we could handle splitting into smaller batches of X messages automatically within the implementation of clear packets and have a simpler interface.

The advantage of A. is that the command is a lot more interactive and will finish faster. The problem is that operators will not know whether there are still packets left to relay, so they will need to repeatedly invoke clear packets --limit X until it returns an empty result (signalling there was nothing left to clear).

The advantage of B. is that the command is smarter and the interface simpler. The disadvantage is that it will take longer to finish. On a path with constant, high activity, depending on the implementation and how we split up the batches, I can even imagine this method running for as long as new packets keep being created, which is a problem (the CLI should return, instead of grabbing & relaying every new packet).

Other solutions might exist. We should evaluate the pros and cons through the lenses of operator user experience and simplicity.
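
For illustration only, option A's flag could be declared roughly like this with a clap-style derive (the actual Hermes CLI framework and existing flags may differ):

```rust
use clap::Parser;

/// Hypothetical shape of `hermes clear packets` with a mandatory `--limit`,
/// per option A above. Field names are illustrative, not the real CLI.
#[derive(Debug, Parser)]
struct ClearPacketsCmd {
    /// Identifier of the chain to clear packets on.
    #[clap(long)]
    chain: String,

    /// Maximum number of pending packets/acks to fetch and relay in this run.
    /// Operators re-run the command until nothing is left to clear.
    #[clap(long)]
    limit: usize,
}
```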

Tasks

  • add integration tests to assert the correctness of new behavior

For Admin Use

  • Not duplicate issue
  • Appropriate labels applied
  • Appropriate milestone (priority) applied
  • Appropriate contributors tagged
  • Contributor assigned/self-assigned
@adizere added this to the v0.14.0 milestone on Apr 12, 2022
@adizere added the O: usability, O: performance, A: breaking, and P-high labels on Apr 12, 2022

romac commented Apr 13, 2022

Alternative option, from my comment at #2077 (comment)

> On a high-activity path with constant activity, depending on the implementation and how we split up the batches, I can even imagine that this method will constantly run for as long as there are packets being created

That's indeed an issue, and it's not clear how to avoid it. Perhaps by not trying to clear packets at a height higher than the one the chain was at when the command started? I.e., the command would clear packets from oldest to newest and stop once it sees a packet at a height higher than the starting height.
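
Roughly what that height bound could look like (illustrative names only, not the actual Hermes code):

```rust
// Illustrative sketch: record the chain height at the start of the command
// and stop once a pending packet above that height is reached, so the
// command terminates even if new packets keep arriving. Placeholder types.
fn clear_up_to_start_height(chain: &Chain) -> Result<(), Error> {
    let start_height = chain.latest_height()?;

    for packet in chain.pending_packets_oldest_first()? {
        if packet.height > start_height {
            // Packets created after the command started are left for a
            // later run (or for a long-running relayer instance).
            break;
        }
        relay_packet(chain, &packet)?;
    }
    Ok(())
}
```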


Also, ideally the command would use pagination and display its progress as it clears the packets, perhaps with a nice TUI progress bar?

@lyqingye

I have tried a scenario where 100m packets are generated first and then Hermes is started, which triggers Hermes's packet clearing.
When there are too many events on the chain, each query_txs(PacketRequest) call takes a long time, about 200ms against a local node, which makes packet clearing extremely slow.
I tried building a caching layer, using a KV store to hold all SendPacket and Ack information; when query_txs is called, the cache is queried first.
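
A minimal sketch of that cache-first lookup, with an in-memory map standing in for the KV store (all names are illustrative):

```rust
use std::collections::HashMap;

// Illustrative sketch of a cache-first lookup for packet events; a HashMap
// stands in for the KV store holding SendPacket/Ack data. Placeholder types.
struct PacketCache {
    by_sequence: HashMap<u64, PacketEvent>,
}

impl PacketCache {
    fn get_or_query(&mut self, chain: &Chain, sequence: u64) -> Result<PacketEvent, Error> {
        if let Some(event) = self.by_sequence.get(&sequence) {
            // Cache hit: skip the ~200ms query_txs round trip to the node.
            return Ok(event.clone());
        }
        // Cache miss: fall back to the slow tx query and populate the cache.
        let event = chain.query_txs_for_sequence(sequence)?;
        self.by_sequence.insert(sequence, event.clone());
        Ok(event)
    }
}
```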

@lyqingye

When the on-chain packet commitments query is called with pagination set to all, it becomes very slow when there are too many unrelayed packets on the chain, so I set the limit to the chain config's num_msg - 1 (the remaining message slot is reserved for UpdateClient).
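
Illustratively, the bounded query could look like this (placeholder names and methods; num_msg refers to the per-Tx message limit from the chain config):

```rust
// Illustrative sketch: instead of paging through all packet commitments,
// request at most num_msg - 1 of them, keeping one message slot free for
// the UpdateClient message in the same Tx. Placeholder types and methods.
fn pending_commitments(chain: &Chain, num_msg: usize) -> Result<Vec<Commitment>, Error> {
    let limit = num_msg.saturating_sub(1) as u64;
    chain.query_packet_commitments_with_limit(limit)
}
```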

adizere commented May 2, 2022

Hi @lyqingye, I took a first stab at having a more incremental approach to clearing packets. Do you mind having a look over the PR and testing in your setup -- #2167? Many thanks for the suggestions above & for any additional feedback!

lyqingye commented May 2, 2022

> Hi @lyqingye, I took a first stab at having a more incremental approach to clearing packets. Do you mind having a look over the PR and testing in your setup -- #2167? Many thanks for the suggestions above & for any additional feedback!

Hi @adizere, I will review your PR and give some suggestions.

adizere added a commit that referenced this issue May 13, 2022
* Added docs for unreceived_acknowledgements and unreceived_packets

* Flipped order of sentences

* First iteration for #2087; packet-recv works, packet-ack left

* Incremental processing of packet-ack relaying

* Added consensus_state_height_bounded with optimistic path

* Nits

* Simplified progress tracking

* Comment cleanup

* fmt and better logs

* Document the alternative path in consensus_state_height_bounded

* renamed method to make it more explicit h/t SeanC

* better var name

* Added workaround to tonic::code inconsistencies.

Ref:
#1971 (comment)

* refactored the packet query methods

* Refactored CLIs for incremental relaying

* Experiment

* Fix for packet send for non-zero delay

* Using app status instead of network status

* fmt

* Undoing unnecessary changes on foreign_client.rs

* Removed outdated comments

* Apply suggestions from code review

Co-authored-by: Anca Zamfir <ancazamfir@users.noreply.github.com>

* Small refactor in cli.rs

* Address review comments

* Remove duplicate code

* Undo one-chain change

* Fix documentation

* Changelog

Co-authored-by: Anca Zamfir <zamfiranca@gmail.com>
Co-authored-by: Anca Zamfir <ancazamfir@users.noreply.github.com>
hu55a1n1 pushed a commit to hu55a1n1/hermes that referenced this issue Sep 13, 2022, with the same commit message as above.