Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not retry clearing with same height forever #2348

Merged
merged 7 commits into from
Jun 30, 2022
Merged

Conversation

ancazamfir
Copy link
Collaborator

@ancazamfir ancazamfir commented Jun 28, 2022

Closes: #2155

Description

spawn_packet_cmd_worker runs handle_packet_cmd passing a "command" that can be NewBlock.
The first thing it does is to try clearing packets. If this fails with ignorable error, the old code will retry immediately with same command and height. If the error persists (as in #2155) then the worker enters an infinite loop and new IBC events (coming via IbcEvent commands) are never processed.

I believe the infinite loop started with #2238. Before that the worker would abort (due to TaskError::Fatal(RunError::retry(e)))

The propose change is to move to the next command even if previous has failed. The reasons are:

  • it avoids the infinite loop
  • the failed events will still be processed at next clear interval with a fresh height
  • at different levels down in handle_packet_cmd there are retries mechanisms for MAX_RETRIES (current value hardcoded at 5).

PR author checklist:

  • Added changelog entry, using unclog.
  • Added tests: integration (for Hermes) or unit/mock tests (for modules).
  • Linked to GitHub issue.
  • Updated code comments and documentation (e.g., docs/).

Reviewer checklist:

  • Reviewed Files changed in the GitHub PR explorer.
  • Manually tested (in case integration/unit/mock tests are absent).

@ancazamfir ancazamfir marked this pull request as ready for review June 29, 2022 07:15
@romac romac self-assigned this Jun 29, 2022
Copy link
Member

@romac romac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Works just as expected, the relayer properly recovers and does not get stuck in a loop and will then attempt to clear the packets (which will also fail in the test scenario with pruning=everything but should otherwise work).

Thanks @ancazamfir!

@romac romac merged commit 7371aef into master Jun 30, 2022
@romac romac deleted the anca/2155_infinite_clear branch June 30, 2022 14:57
hu55a1n1 pushed a commit to hu55a1n1/hermes that referenced this pull request Sep 13, 2022
## Description
`spawn_packet_cmd_worker` runs `handle_packet_cmd` passing a "command" that can be `NewBlock`.
The first thing it does is to try clearing packets. If this fails with ignorable error, the old code will retry immediately with same command and height. If the error persists (as in informalsystems#2155) then the worker enters an infinite loop and new IBC events (coming via `IbcEvent` commands) are never processed.

I believe the infinite loop started with informalsystems#2238. Before that the worker would abort (due to `TaskError::Fatal(RunError::retry(e))`)

The propose change is to move to the next command even if previous has failed. The reasons are:
- it avoids the infinite loop
- the failed events will still be processed at next clear interval with a fresh height
- at different levels down in `handle_packet_cmd` there are retries mechanisms for MAX_RETRIES (current value hardcoded at 5). 

## Commits

* Do not retry clearing with same height forever

* Undo one-chain script changes

* Add changelog

* Reword changelog entry

* Remove dbg! statement

* Formatting

Co-authored-by: Romain Ruetschi <romain@informal.systems>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Hermes infinite retry clear packet
2 participants