Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Modified packet worker to use stubborn strategy #1340

Merged
merged 6 commits into from
Sep 22, 2021

Conversation

adizere
Copy link
Member

@adizere adizere commented Sep 10, 2021

Closes: #1290

Description

The following describes worker behavior after this PR:

  • packet worker:
    • old behavior: used a default retry strategy that retried for 6 iterations and then gave up;
      • the individual retry backoff varied between 200ms and 500ms
      • the cumulative backoff time was 2 seconds
    • new behavior: using a stubborn strategy, which retries for infinite iterations and never gives up
      • the individual backoff time (between subsequent retries) is 1sec in the beginning, and increases steadily by 10ms at every step

The rationale for this choice is as follows. I expect past a certain duration (I choose in this PR 6 hours), Hermes might as well give up, as there is little chance that the connection can become healthy again without operator intervention, but not sure if my intuition is appropriate.


Unchanged behavior:

  • client worker

    • this worker is in charge of regularly updating clients (unless there is already channel traffic, which will trigger client updates), at a frequency of 2/3 of trusting period
    • this worker is already stubborn: never quits unless the supervisor instructs it to do so, or if the client is expired or frozen
  • channel and connection workers

    • these are in charge of finishing up initialized (but incomplete) connection handshakes or initialized channel handshakes
    • these workers don't need to retry stubbornly, because their activity is time-sensitive

For contributor use:

  • Added a changelog entry, using unclog.
  • If applicable: Unit tests written, added test to CI.
  • Linked to Github issue with discussion and accepted design OR link to spec that describes this work.
  • Updated relevant documentation (docs/) and code comments.
  • Re-reviewed Files changed in the Github PR explorer.

use crate::worker::retry_strategy::{worker_stubborn_strategy, worker_default_strategy};

#[test]
fn default_strategy() {
Copy link
Member Author

@adizere adizere Sep 10, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is almost identical to this test
https://github.com/informalsystems/ibc-rs/blob/85446586daf6dba652271eef31dae29f35da86c2/relayer/src/util/retry.rs#L153

but I only realized after writing it.

I still found it useful to keep it around, for the purpose of understanding exactly the details of the strategy we use in practice.

@adizere adizere changed the title Modified packet worker to use more stubborn strategy Modified packet worker to use stubborn strategy Sep 10, 2021
@adizere adizere requested a review from mircea-c September 10, 2021 12:42
@adizere adizere marked this pull request as draft September 15, 2021 15:16
@adizere adizere marked this pull request as ready for review September 16, 2021 15:53
Copy link
Member

@hu55a1n1 hu55a1n1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👌💯

@adizere adizere merged commit 66049e2 into master Sep 22, 2021
@adizere adizere deleted the adi/1290_stubborn_workers branch September 22, 2021 08:38
hu55a1n1 pushed a commit to hu55a1n1/hermes that referenced this pull request Sep 13, 2022
* Modified packet worker to use stubborn strategy

* Longer retry strategy, bigger backoff

* Fmt & clippy

* Made retry indefinite

* Better comments

* changelog
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Workers should retry indefinitely
2 participants