Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Recover from missed RPC events after WebSocket subscription is closed by Tendermint #1205

Merged
merged 17 commits into from
Jul 21, 2021

Conversation

romac
Copy link
Member

@romac romac commented Jul 16, 2021

Closes: #1196

Depends on:

Description

After some investigation, the culprit for #1196 seems to be that Tendermint is closing the WebSocket connection over which we listen for IBC events whenever more than 100 txs are included in a single block [0], as we are not able to pull the events fast enough over the WebSocket connection to avoid completely filling the event buffer in Tendermint (which currently has a hard-coded capacity of 100 events, hence the issue).

We never noticed this previously since this problem only appears in practice with a high-enough commit/propose timeout (to allow enough txs to be included in a single block), and we were testing with a lower value for the timeouts.

Now that we landed some changes in tendermint-rs [1] which allow us to notice the connection being closed, this PR makes use of this to resubscribe to the events and trigger a packet clear whenever we notice the connection being closed under our feet.

[0] tendermint/tendermint#6729
[1] informalsystems/tendermint-rs#929


For contributor use:

  • Updated the Unreleased section of CHANGELOG.md with the issue.
  • If applicable: Unit tests written, added test to CI.
  • Linked to Github issue with discussion and accepted design OR link to spec that describes this work.
  • Updated relevant documentation (docs/) and code comments.
  • Re-reviewed Files changed in the Github PR explorer.

@romac romac changed the title Propagate JSON-RPC errors through the Rust subscription Recover from missed RPC events after WebSocket subscription was closed by Tendermint Jul 16, 2021
@romac romac added the A: blocked Admin: blocked by another (internal/external) issue or PR label Jul 16, 2021
@romac romac changed the title Recover from missed RPC events after WebSocket subscription was closed by Tendermint Recover from missed RPC events after WebSocket subscription is closed by Tendermint Jul 20, 2021
@romac romac added this to the 07.2021 milestone Jul 20, 2021
@romac romac added the E: gravity External: related to Gravity DEX label Jul 20, 2021
@romac romac marked this pull request as ready for review July 20, 2021 15:55
@romac romac requested review from adizere and ancazamfir as code owners July 20, 2021 15:55
Copy link
Member

@adizere adizere left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Finished reviewing, I'm still testing though.

Cargo.toml Outdated Show resolved Hide resolved
relayer-cli/src/commands/listen.rs Show resolved Hide resolved
relayer/src/event/monitor.rs Outdated Show resolved Hide resolved
relayer/src/event/monitor.rs Show resolved Hide resolved
scripts/one-chain Outdated Show resolved Hide resolved
relayer/src/worker/packet.rs Show resolved Hide resolved
Copy link
Member

@adizere adizere left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ran different tests and works well, this is a great improvement, great work Romain!

I only had one more minor comment regarding the logs. Beside that, please also remember:

  • changelog
  • revert one-chain temporary changes, revert Cargo.toml patch

relayer/src/event/monitor.rs Outdated Show resolved Hide resolved
@romac
Copy link
Member Author

romac commented Jul 21, 2021

Ran different tests and works well, this is a great improvement, great work Romain!

Thanks :)

I only had one more minor comment regarding the logs. Beside that, please also remember:

Just pushed a commit to improve the logs

  • changelog

It was already updated :) https://github.com/informalsystems/ibc-rs/pull/1205/files#diff-06572a96a58dc510037d5efa622f9bec8519bc1beab13c9f251e97e657a9d4edR29

  • revert one-chain temporary changes

Done

  • revert Cargo.toml patch

Will do once tendermint-rs 0.21 is out and after we do the update in a standalone PR.

@romac romac removed the A: blocked Admin: blocked by another (internal/external) issue or PR label Jul 21, 2021
@romac romac merged commit 693f7d4 into master Jul 21, 2021
@romac romac deleted the romac/propagate-rpc-errors branch July 21, 2021 20:30
@adizere adizere mentioned this pull request Aug 3, 2021
13 tasks
hu55a1n1 pushed a commit to hu55a1n1/hermes that referenced this pull request Sep 13, 2022
… by Tendermint (informalsystems#1205)

After some investigation, the culprit for informalsystems#1196 seems to be that Tendermint is closing the WebSocket connection over which we listen for IBC events whenever more than 100 txs are included in a single block [0], as we are not able to pull the events fast enough over the WebSocket connection to avoid completely filling the event buffer in Tendermint (which currently has a hard-coded capacity of 100 events, hence the issue).

We never noticed this previously since this problem only appears in practice with a high-enough commit/propose timeout (to allow enough txs to be included in a single block), and we were testing with a lower value for the timeouts.

Now that we landed some changes in tendermint-rs [1] which allow us to notice the connection being closed, this PR makes use of this to resubscribe to the events and trigger a packet clear whenever we notice the connection being closed under our feet.

[0] tendermint/tendermint#6729
[1] informalsystems/tendermint-rs#929

---

* Propagate JSON-RPC errors through the Rust subscription

* Use tendermint-rs branch with both fixes

* Fix compilation issue in tests

* Clear pending packets when event subscription is cancelled

* Temp: Update one-chain script to use 10s commit timeout

* Use tendermint-rs master

* Update Cargo.lock

* Update changelog

* Update lockfile

* Increase delay before checking for relaying result in e2e tests

* Add comment explaining who the RPC error is propagated to

* Improve event monitor logs

* Reset `timeout_commit` and `timeout_propose` to 1s
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
E: gravity External: related to Gravity DEX
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Hermes issues found by Dex scalability tests on v0.6.0
2 participants