Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve spawning of supervisor worker tasks #1656

Merged
merged 6 commits into from
Dec 16, 2021

Conversation

soareschen
Copy link
Contributor

@soareschen soareschen commented Dec 6, 2021

Description

This is a prerequisite for #1543 to make the spawning of supervisor worker tasks more manageable by parallelizing unrelated tasks.

A key challenge of the current design of the supervisor worker tasks is that we are relying on ad hoc multiplexing of multiple unrelated sub-tasks in a single thread. This result in various unintuitive hacks and bottlenecks:

  • If one sub-task raise an unrecoverable error, the other sub-tasks are aborted as well.
  • If one sub-task need to abort early, it have to disable itself through flags instead of returning.
  • A sub-task may block another sub-task for too long, causing the other sub-task to not get executed.
  • The shutdown handler is implemented as one of the sub-tasks, which may be blocked and cause the task to not get shutdown properly.
  • The retry-on-error logic containing multiple sub-tasks interfere with the main task loop, causing it to not shutdown properly due to the sub-tasks returning error in the retry loop. Furthermore, the error from some sub-tasks cause the retry even when it should have abort, as there is no way to cancel only one sub-task inside the retry loop.
  • There is no way to guarantee a worker task always get terminated at the end. Because it rely on sending an explicit shutdown signal, early returns caused by errors or forgetting to call .shutdown() will cause the task to dangle.
  • There is no way to wait for all tasks to terminate before continuing. This is needed in the integration test, as otherwise at the end of the test, the worker tasks may raise various errors due to the full node being shutdown.

For the case of #1543, the main challenge is in multiplexing the client refresh and the detect misbehavior sub-tasks. When the client trusting period is reduced to tens of seconds, the misbehavior sub-task may interfere with the refresh sub-task to be able to refresh on time. Furthermore, in the case when a client is in fact frozen, it is difficult to modify the current code to abort that particular sub-task.

This PR introduce the ibc_relayer::task module, which encapsulate the logic of spawning background tasks and return a TaskHandle. The TaskHandle provides methods to shutdown a background task, or wait for the task to terminate. The TaskHandle also implements a Drop handler that always shutdown a background task when it is dropped, preventing dangling tasks.

The spawn_background_task function accepts a stepper closure that is executed repeatedly, until the shutdown signal is received or until the stepper closure return abort signal or fatal error. The stepper closure returns a TaskError<E>, which wraps around an error type E and contain 3 variants to indicate whether to abort, continue with an ignorable error, or abort with a fatal error. This design significantly simplifies how a step function should be implemented, without having to worry about the low level details of managing a background task.

The supervisor code and its workers are refactored to use the new task module. The various sub-tasks are also split into dedicated tasks to allow them to run in parallel without interfering with each others. Due to the use of shared state in the existing code, some mutexes are introduced so that the same shared state can be shared across multiple parallel tasks.

This PR does not introduce any change in logic aside from refactoring the supervisor code to use the task module. The work for fixing #1543 would instead be done in a separate PR that is based on this PR.


PR author checklist:

  • Added changelog entry, using unclog.
  • Added tests: integration (for Hermes) or unit/mock tests (for modules).
  • Linked to GitHub issue.
  • Updated code comments and documentation (e.g., docs/).

Reviewer checklist:

  • Reviewed Files changed in the GitHub PR explorer.
  • Manually tested (in case integration/unit/mock tests are absent).

@soareschen soareschen force-pushed the soares/improve-spawn-supervisor-worker-tasks branch from 197135d to 0ed7526 Compare December 6, 2021 18:56
@adizere
Copy link
Member

adizere commented Dec 9, 2021

What is our strategy for merging this?

  • Should we have a sync session to discuss it, or
  • Is just you Soares + @romac comfortable to merge this?

I think we have to make some progress here. Would like to get to the main dish (#1543) hehe.

@soareschen
Copy link
Contributor Author

I have created #1664 to show the work in progress to handle the client expiry error in workers.

Unfortunately it seems like the refactoring in this PR would not be sufficient to handle the error, as the connection and channel workers are spawned with one per chain, rather than one per IBC client. Without further re-architecture we are face with the difficult choice of either keep looping to handle expired clients, or abort entirely and not able to handle any other new connections or channels on the chain. More details are discussed in #1664.

Copy link
Member

@romac romac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀

@soareschen soareschen force-pushed the soares/improve-spawn-supervisor-worker-tasks branch from 7c1803a to aad0e00 Compare December 15, 2021 20:27
@soareschen
Copy link
Contributor Author

@romac I have added some tests to make sure that the connection and channel workers are working as expected. The packet worker is already verified to be working, since it is needed for the IBC transfer tests to pass.

tools/integration-test/src/tests/supervisor.rs Outdated Show resolved Hide resolved
tools/integration-test/src/tests/supervisor.rs Outdated Show resolved Hide resolved
tools/integration-test/src/tests/supervisor.rs Outdated Show resolved Hide resolved
tools/integration-test/src/tests/supervisor.rs Outdated Show resolved Hide resolved
@soareschen soareschen merged commit ce695e7 into master Dec 16, 2021
@soareschen soareschen deleted the soares/improve-spawn-supervisor-worker-tasks branch December 16, 2021 20:11
soareschen added a commit that referenced this pull request Dec 23, 2021
romac added a commit that referenced this pull request Jan 13, 2022
# v0.10.0

January 13th, 2021

This release notably updates the underlying CLI framework (`abscissa`) to version 0.6.0-beta.1,
which now uses `clap` for parsing command line arguments. This substantially improves the UX of the CLI,
by adding support for `--help` flags in subcommands and improving help and usage printouts.

The `--version` option of the `create channel` subcommand has been renamed
to `--channel-version`, with the old name still supported as an alias.
Additionally, the `-h` short flag on many commands is now `-H` to avoid
clashes with the clap-provided short flag for help.

This release also improves the handling of account sequence mismatch errors,
with a recovery mechanism to automatically retry or drop tx upon such errors.

The relayer now also supports dynamic versions in channel open handshake (which is needed by Interchain Accounts), and enables full support for IBC v2.

---

* Update package versions from v0.9.0 to v0.10.0

* Add changelog for #1656

* Bump `ibc-proto` version to 0.14.0

* Update guide wrt --help and --channel-version

* Disambiguate between help and height flags by using `-H` for the latter

* Update ibc-proto doc(html_root_url)

* Remove outdated comment

* Fix broken link in changelog

* Rename application-handled -h CLI flags to -H (#1743)

* Disambiguate between help and height flags by using `-H` for the latter

* Enable clap-provided help flags on all subcommands

Since all application-assigned short -h options have been renamed to -H,
there is no need to suppress the -h flags provided by clap with the
DisableHelpFlag setting.

* Update changelog for #1743

* Remove a FIXME comment

Resolved by e59bb13

Co-authored-by: Romain Ruetschi <romain@informal.systems>

* guide: Removed wording about missing -h/--help

The -h flags have been freed for built-in clap use and are supported
on all subcommands.

* Fix link to packet filtering policy in config page

* Release changelog for 0.10.0

* Update changelog summary

Co-authored-by: Romain Ruetschi <romain@informal.systems>
Co-authored-by: Shoaib Ahmed <sufialhussaini@gmail.com>
Co-authored-by: Mikhail Zabaluev <mikhail@informal.systems>
@ancazamfir ancazamfir restored the soares/improve-spawn-supervisor-worker-tasks branch March 24, 2022 12:19
hu55a1n1 pushed a commit to hu55a1n1/hermes that referenced this pull request Sep 13, 2022
* Improve spawning of supervisor worker tasks

* Use better names for worker tasks

* Add integration tests for connection and channel workers

* Fix typo

* Reorder arguments in assert_eventually_succeed
hu55a1n1 added a commit to hu55a1n1/hermes that referenced this pull request Sep 13, 2022
# v0.10.0

January 13th, 2021

This release notably updates the underlying CLI framework (`abscissa`) to version 0.6.0-beta.1,
which now uses `clap` for parsing command line arguments. This substantially improves the UX of the CLI,
by adding support for `--help` flags in subcommands and improving help and usage printouts.

The `--version` option of the `create channel` subcommand has been renamed
to `--channel-version`, with the old name still supported as an alias.
Additionally, the `-h` short flag on many commands is now `-H` to avoid
clashes with the clap-provided short flag for help.

This release also improves the handling of account sequence mismatch errors,
with a recovery mechanism to automatically retry or drop tx upon such errors.

The relayer now also supports dynamic versions in channel open handshake (which is needed by Interchain Accounts), and enables full support for IBC v2.

---

* Update package versions from v0.9.0 to v0.10.0

* Add changelog for informalsystems#1656

* Bump `ibc-proto` version to 0.14.0

* Update guide wrt --help and --channel-version

* Disambiguate between help and height flags by using `-H` for the latter

* Update ibc-proto doc(html_root_url)

* Remove outdated comment

* Fix broken link in changelog

* Rename application-handled -h CLI flags to -H (informalsystems#1743)

* Disambiguate between help and height flags by using `-H` for the latter

* Enable clap-provided help flags on all subcommands

Since all application-assigned short -h options have been renamed to -H,
there is no need to suppress the -h flags provided by clap with the
DisableHelpFlag setting.

* Update changelog for informalsystems#1743

* Remove a FIXME comment

Resolved by e59bb13

Co-authored-by: Romain Ruetschi <romain@informal.systems>

* guide: Removed wording about missing -h/--help

The -h flags have been freed for built-in clap use and are supported
on all subcommands.

* Fix link to packet filtering policy in config page

* Release changelog for 0.10.0

* Update changelog summary

Co-authored-by: Romain Ruetschi <romain@informal.systems>
Co-authored-by: Shoaib Ahmed <sufialhussaini@gmail.com>
Co-authored-by: Mikhail Zabaluev <mikhail@informal.systems>
@romac romac deleted the soares/improve-spawn-supervisor-worker-tasks branch May 2, 2023 12:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants