
Hermes issues found by Dex scalability tests on v0.6.0 #1196

Closed · ancazamfir opened this issue Jul 13, 2021 · 1 comment · Fixed by #1201 or #1205


ancazamfir commented Jul 13, 2021

Crate

relayer

Summary of Bug

Following up on the Dex scalability test issue reported earlier, @nodebreaker0-0 provided access to his environment and walked me through the setup, configuration, and the commands he was running.
I was able to reproduce this locally and, after debugging, found a few issues in v0.6.0:

  • hermes doesn't seem to clear packets on start
  • when more than ~150 transactions are included in a block, hermes workers only receive Tx RPC event notifications for ~15 of them and stop receiving anything afterwards (this is also present in v0.5.0)

Version

hermes v0.6.0 and v0.5.0.
Tested with gaia v4.2.1 and v5.0.0; reproducible with both.

Steps to Reproduce

  • use 10 sec timeouts in the one-chain script (see the anca/dex_rpc_event branch)
  • start two gaia chains using the dev-env script:
    • ./scripts/dev-env ~/.hermes/config.toml ibc-0 ibc-1
  • create one channel:
    • hermes create channel ibc-0 ibc-1 --port-a transfer --port-b transfer
  • with these config files, start hermes and submit 150 transfers:
    • hermes -c hermes_config.toml start
    • hermes -c ft_transfer_config.toml tx raw ft-transfer ibc-1 ibc-0 transfer channel-0 9999 -o 1000 -n 150 -k user2
  • check unreceived packets and acks: only 12-15 events are received and handled, the rest are lost somewhere (a standalone counting sketch follows this list)
  • try to send another transfer and notice in the logs that the event is not generated:
    • hermes -c config_1.toml tx raw ft-transfer ibc-1 ibc-0 transfer channel-0 9999 -o 1000 -n 1 -k user2
  • restart hermes and see that the packets are not cleared on the next block (wait > 10 sec)
    • I made some quick changes on the anca/dex_rpc_event branch for clearing the packets and that part seems ok now
  • the missing Tx events can also be seen with 120 transfers, but I haven't looked for the exact threshold; it works with 100 or 110, for example
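
To observe the event loss outside of hermes, one can subscribe to Tx events directly and count the notifications that arrive. Below is a minimal sketch, assuming the tendermint-rpc crate's WebSocket client API (circa v0.20, roughly what hermes used at the time) and the default dev-env RPC address; run it while submitting the 150 transfers above.

```rust
// Count Tx event notifications from a Tendermint node's WebSocket
// endpoint. Requires tokio, futures, and tendermint-rpc with the
// "websocket-client" feature enabled.
use futures::StreamExt;
use tendermint_rpc::{query::EventType, SubscriptionClient, WebSocketClient};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Address assumed from the dev-env defaults (ibc-0's RPC port).
    let (client, driver) = WebSocketClient::new("ws://127.0.0.1:26657/websocket").await?;
    let driver_handle = tokio::spawn(async move { driver.run().await });

    // Subscribe to all Tx events, i.e. the query `tm.event = 'Tx'`.
    let mut subscription = client.subscribe(EventType::Tx.into()).await?;

    let mut count = 0u32;
    while let Some(result) = subscription.next().await {
        match result {
            Ok(_event) => {
                count += 1;
                println!("received {} Tx events so far", count);
            }
            // On v0.6.0 the stream goes quiet rather than erroring, so
            // anything printed here would already be an improvement.
            Err(e) => eprintln!("subscription error: {}", e),
        }
    }
    println!("stream ended after {} events", count);

    let _ = client.close();
    let _ = driver_handle.await;
    Ok(())
}
```

With ~150 txs in a block, the counter stalls at roughly the 12-15 events reported above, which points at the transport rather than at hermes' packet-handling logic.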

For the missing RPC events, I traced the problem back to stream_batches():
https://github.com/informalsystems/ibc-rs/blob/3f0d4bc38a3d737cbd289c2087d09228bdb7b07f/relayer/src/event/monitor.rs#L393

collect_events() is only called 12-15 times, after which no other event gets through. I also tried v0.5.0 and it seems to have the same issue. I'm not sure whether the problem is in ibc-rs or tendermint-rs. The block itself includes all the events (you can check with curl localhost:26657/block_results?height=533)
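
For context, stream_batches() groups the raw Tx events coming off the subscription into one batch per block before handing them to the workers. The sketch below is not the ibc-rs implementation, just a self-contained illustration of that batching pattern with hypothetical stand-in types; the point is that when the subscription silently stops yielding events, the batches simply stop appearing and no error surfaces anywhere.

```rust
// Hypothetical stand-in for the IBC events; not the ibc-rs types.
#[derive(Debug)]
struct Event {
    height: u64,
    kind: &'static str,
}

/// Group a flat, height-ordered sequence of events into one batch per
/// block, mirroring what stream_batches()/collect_events() do over a
/// live subscription.
fn batch_by_height(events: Vec<Event>) -> Vec<(u64, Vec<Event>)> {
    let mut batches: Vec<(u64, Vec<Event>)> = Vec::new();
    for event in events {
        let start_new_batch = match batches.last() {
            Some((height, _)) => *height != event.height,
            None => true,
        };
        if start_new_batch {
            batches.push((event.height, Vec::new()));
        }
        batches.last_mut().expect("non-empty").1.push(event);
    }
    batches
}

fn main() {
    // If the source stream stalls after ~15 events, everything past
    // that point is simply never seen; no batch and no error appear.
    let events = vec![
        Event { height: 533, kind: "send_packet" },
        Event { height: 533, kind: "send_packet" },
        Event { height: 534, kind: "acknowledge_packet" },
    ];
    for (height, batch) in batch_by_height(events) {
        println!("height {}: {} events", height, batch.len());
    }
}
```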

Acceptance Criteria


For Admin Use

  • Not duplicate issue
  • Appropriate labels applied
  • Appropriate milestone (priority) applied
  • Appropriate contributors tagged
  • Contributor assigned/self-assigned
@ancazamfir

Re-opening, as the missing RPC events issue still needs to be solved.

@ancazamfir reopened this Jul 15, 2021
@romac changed the title from "hermes issues found by Dex scalability tests on v0.6.0" to "Hermes issues found by Dex scalability tests on v0.6.0" Jul 20, 2021
@romac added the A: bug label Jul 20, 2021
@romac added this to the 07.2021 milestone Jul 20, 2021
@romac self-assigned this Jul 20, 2021
romac added a commit that referenced this issue Jul 21, 2021
… by Tendermint (#1205)

After some investigation, the culprit for #1196 seems to be that Tendermint is closing the WebSocket connection over which we listen for IBC events whenever more than 100 txs are included in a single block [0], as we are not able to pull the events fast enough over the WebSocket connection to avoid completely filling the event buffer in Tendermint (which currently has a hard-coded capacity of 100 events, hence the issue).

We never noticed this previously since this problem only appears in practice with a high-enough commit/propose timeout (to allow enough txs to be included in a single block), and we were testing with a lower value for the timeouts.

Now that we landed some changes in tendermint-rs [1] which allow us to notice the connection being closed, this PR makes use of this to resubscribe to the events and trigger a packet clear whenever we notice the connection being closed under our feet.

[0] tendermint/tendermint#6729
[1] informalsystems/tendermint-rs#929

---

* Propagate JSON-RPC errors through the Rust subscription

* Use tendermint-rs branch with both fixes

* Fix compilation issue in tests

* Clear pending packets when event subscription is cancelled

* Temp: Update one-chain script to use 10s commit timeout

* Use tendermint-rs master

* Update Cargo.lock

* Update changelog

* Update lockfile

* Increase delay before checking for relaying result in e2e tests

* Add comment explaining who the RPC error is propagated to

* Improve event monitor logs

* Reset `timeout_commit` and `timeout_propose` to 1s
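
The failure mode described in the commit message above is easy to model with nothing but a bounded channel. The sketch below is not the Tendermint Go code or the hermes fix, just a self-contained illustration (hypothetical names, plain tokio) of why a hard-coded 100-event buffer plus a slow reader kills the subscription, and of the resubscribe-and-clear cue that #1205 acts on once the closure becomes observable.

```rust
// Model of Tendermint's bounded per-subscriber event buffer and of a
// resubscribe cue on the consumer side. Hypothetical; not the actual
// Tendermint or hermes code.
use tokio::sync::mpsc;
use tokio::time::{sleep, Duration};

const EVENT_BUFFER_CAPACITY: usize = 100; // Tendermint's hard-coded limit

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::channel::<u32>(EVENT_BUFFER_CAPACITY);

    // Producer: a block with 150 txs emits all its events at once.
    // Like Tendermint, it drops the subscriber rather than blocking
    // when the buffer is full.
    let producer = tokio::spawn(async move {
        for event in 0..150u32 {
            if tx.try_send(event).is_err() {
                eprintln!("buffer full at event {}: closing subscription", event);
                return; // dropping `tx` closes the channel
            }
        }
    });

    // Consumer (the relayer side): pulls events more slowly than they
    // are produced. Once the producer bails out, recv() returns None,
    // which is the cue to resubscribe and trigger a packet clear.
    let mut received = 0u32;
    while let Some(_event) = rx.recv().await {
        received += 1;
        sleep(Duration::from_millis(10)).await; // simulate slow processing
    }
    println!("subscription closed after {} events", received);
    println!("-> resubscribe and clear pending packets here");

    let _ = producer.await;
}
```

In this model the already-buffered events are still drained before the channel closes; over a real WebSocket connection the close drops them as well, which is consistent with hermes seeing only ~15 of the ~150 notifications.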
hu55a1n1 pushed a commit to hu55a1n1/hermes that referenced this issue Sep 13, 2022
… by Tendermint (informalsystems#1205)